InstantDrag performs drag-based edits in about a second, requiring only minimal inputs and memory.

InstantDrag Teaser

Demo Video

Abstract

Drag-based image editing has recently gained popularity for its interactivity and precision. However, despite the ability of text-to-image models to generate samples within a second, drag editing still lags behind due to the challenge of accurately reflecting user interaction while maintaining image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods, requiring additional inputs such as masks for movable regions and text prompts, thereby compromising the interactivity of the editing process. We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed, requiring only an image and a drag instruction as input. InstantDrag consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical-flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing from real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. We demonstrate InstantDrag's capability to perform fast, photo-realistic edits without masks or text prompts through experiments on facial video datasets and general scenes. These results highlight the efficiency of our approach in handling drag-based image editing, making it a promising solution for interactive, real-time applications.

InstantDrag Framework

Inference pipeline

Inference pipeline

Given a sparse user drag input, our FlowGen estimates a dense optical flow, and our FlowDiffusion edits the original image under this flow guidance. With several proposed training techniques, our inversion- and optimization-free pipeline achieves interactive drag editing in around a second, eliminating the need for auxiliary inputs.
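
For concreteness, a minimal sketch of this two-stage data flow is shown below. The stub networks and the drag-rasterization helper are illustrative stand-ins (not the released implementation); only the overall flow, sparse drag → dense flow → flow-guided edit, follows the description above.

    # Minimal sketch of the two-stage InstantDrag inference flow (illustrative only).
    import torch
    import torch.nn as nn

    def rasterize_drags(drags, h, w):
        # Turn a list of ((x0, y0), (x1, y1)) drags into a sparse 2-channel flow map:
        # zero everywhere except at the drag start points.
        sparse = torch.zeros(1, 2, h, w)
        for (x0, y0), (x1, y1) in drags:
            sparse[0, 0, y0, x0] = float(x1 - x0)  # horizontal displacement
            sparse[0, 1, y0, x0] = float(y1 - y0)  # vertical displacement
        return sparse

    class FlowGenStub(nn.Module):
        # Stand-in for FlowGen: RGB image + sparse flow (5 ch) -> dense flow (2 ch).
        def __init__(self):
            super().__init__()
            self.net = nn.Conv2d(5, 2, kernel_size=3, padding=1)

        def forward(self, image, sparse_flow):
            return self.net(torch.cat([image, sparse_flow], dim=1))

    class FlowDiffusionStub(nn.Module):
        # Stand-in for FlowDiffusion: edits the image under dense-flow guidance.
        def __init__(self):
            super().__init__()
            self.net = nn.Conv2d(5, 3, kernel_size=3, padding=1)

        def forward(self, image, dense_flow):
            return self.net(torch.cat([image, dense_flow], dim=1))

    image = torch.rand(1, 3, 256, 256)               # input photo
    drags = [((100, 120), (130, 120))]               # one drag: move a point 30 px right
    sparse_flow = rasterize_drags(drags, 256, 256)
    dense_flow = FlowGenStub()(image, sparse_flow)   # stage 1: motion generation
    edited = FlowDiffusionStub()(image, dense_flow)  # stage 2: motion-conditioned editing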

Model architectures (FlowGen & FlowDiffusion)

Model architectures

We decouple the drag-editing task into two easier subtasks: 1) motion generation and 2) motion-conditioned image generation, assigning each subtask to a generative model with the appropriate capacity and design (a GAN for FlowGen and a diffusion model for FlowDiffusion). FlowGen is designed by conceptualizing the task as a translation problem, where the goal is to map an RGB image with drag instructions (sparse flow) to a dense optical flow. Using a Pix2Pix-like framework, FlowGen is trained from scratch with a combination of adversarial and reconstruction losses. It receives a 5-channel input (RGB image + sparse flow) and outputs a 2-channel dense flow. FlowDiffusion is an optical-flow-conditioned diffusion model fine-tuned from Stable Diffusion v1.5. Inspired by InstructPix2Pix, FlowDiffusion is trained with classifier-free guidance in mind, relying solely on the input image and a downscaled optical flow for conditioning. FlowDiffusion's U-Net accepts a 10-channel input, which integrates the latent noise, the latent image, and the downscaled optical flow. Our models are trained exclusively on a large-scale facial video dataset, CelebV-Text, with several training techniques specifically optimized for drag-based image editing.
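
To make the channel counts concrete, the short sketch below assembles the 10-channel U-Net input from a 4-channel noisy latent, a 4-channel image latent (Stable Diffusion v1.5 uses 4-channel latents at 1/8 spatial resolution), and the 2-channel downscaled flow. The exact downscaling method and tensor layout are our assumptions; only the channel arithmetic follows the description above.

    import torch
    import torch.nn.functional as F

    B, H, W = 1, 512, 512
    latent_noise = torch.randn(B, 4, H // 8, W // 8)  # noisy latent being denoised
    latent_image = torch.randn(B, 4, H // 8, W // 8)  # VAE-encoded input image (placeholder)
    dense_flow   = torch.randn(B, 2, H, W)            # FlowGen output at image resolution

    # Downscale the flow to latent resolution and stack everything channel-wise.
    flow_latent = F.interpolate(dense_flow, size=(H // 8, W // 8), mode="bilinear",
                                align_corners=False)
    unet_input = torch.cat([latent_noise, latent_image, flow_latent], dim=1)
    print(unet_input.shape)  # torch.Size([1, 10, 64, 64]) -- the 10-channel U-Net input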

Experimental Results

Quantitative comparison

Quantitative Results

We randomly sampled nearby frame pairs from the TalkingHead-1KH dataset and used them as ground truth for quantitative evaluation. 'O' and 'E' refer to scores calculated against the original (initial, input) and edited (nearby, GT output) frame images, respectively. Scores calculated against original images indicate content preservation, while scores calculated against edited images reflect the resemblance to the actual ground-truth movement.
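
As a concrete reading of the 'O' / 'E' protocol, the sketch below scores a model output against both references using a stand-in metric (plain MSE here; the paper's actual metric set may differ).

    import torch

    def score(pred, ref):
        # Stand-in metric (plain MSE); the paper's actual metrics may differ.
        return torch.mean((pred - ref) ** 2).item()

    original_frame  = torch.rand(3, 256, 256)  # initial frame given to the model ('O' reference)
    gt_edited_frame = torch.rand(3, 256, 256)  # nearby frame used as ground truth ('E' reference)
    model_output    = torch.rand(3, 256, 256)  # drag-edited result

    score_O = score(model_output, original_frame)   # lower -> better content preservation
    score_E = score(model_output, gt_edited_frame)  # lower -> closer to the GT movement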

Human preference study & Other applications

User study results

We sampled 22 drag-edited images from various domains and collected 66 responses. Participants were asked to rate each model's output on a scale from 1 (very poor) to 5 (very good) across three criteria: instruction-following, identity preservation, and overall preference. Our model can also be applied to downstream tasks such as keypoint-based facial expression transfer.
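
As a rough illustration of the expression-transfer application, keypoint displacements between the image to edit and a face carrying the desired expression can be repurposed as drag instructions. The keypoint values below and the detector that would produce them are hypothetical; each (source, target) pair is simply treated as a user drag.

    source_kps = [(120, 140), (136, 140), (128, 180)]  # keypoints on the image to edit
    target_kps = [(121, 142), (135, 142), (128, 186)]  # keypoints with the desired expression

    drags = [(s, t) for s, t in zip(source_kps, target_kps) if s != t]
    # 'drags' can then be rasterized into a sparse flow and passed through the same
    # FlowGen -> FlowDiffusion pipeline as an ordinary drag edit.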

Additional Qualitative Results

Editing results for non-facial images (w/o fine-tuning)

General scenes

Interestingly, we found that our model, although trained exclusively on a real-world facial video dataset, generalizes to unseen, non-facial scenes and objects in many cases, albeit at a preliminary stage. We hypothesize that this generalizability stems from the diverse, noisy nature of the in-the-wild training data, combined with the efficient and straightforward design of our pipeline.

Comparison with other models

(Non-exhaustive) Related Works

BibTeX

@inproceedings{shin2024instantdrag,
      title     = {{InstantDrag: Improving Interactivity in Drag-based Image Editing}},
      author    = {Shin, Joonghyuk and Choi, Daehyeon and Park, Jaesik},
      booktitle = {ACM SIGGRAPH Asia 2024 Conference Proceedings},
      year      = {2024},
      pages     = {1--10},
}