TL;DR: We present MotionStream, a streaming (real-time, long-duration) video generation system with motion controls, unlocking new possibilities for interactive content generation. More examples below!
⚡ Note: Our model runs causally in real time on a single NVIDIA H100 GPU (29 FPS, 0.4s latency).
All video results shown here are raw screen captures without any post-processing.
Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, which enables sub-second latency and streaming generation at up to 29 FPS on a single GPU. Our approach begins by augmenting a text-to-video model with motion control; the resulting model generates high-quality videos that adhere to the global text prompt and local motion guidance, but cannot run inference on-the-fly. We therefore distill this bidirectional teacher into a causal student via the Self Forcing paradigm with a distribution matching loss, enabling real-time streaming inference. Several key challenges arise when generating videos over long, potentially infinite time horizons: (1) bridging the domain gap between training on finite-length videos and extrapolating to infinite horizons, (2) sustaining high quality and preventing error accumulation, and (3) maintaining fast inference without the growing computational cost of an ever-increasing context window. A key to our approach is a carefully designed sliding-window causal attention with a KV cache combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolation with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real time, delivering a truly interactive experience.
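To make the fixed-context mechanism concrete, below is a minimal PyTorch-style sketch of a rolling KV cache that always retains the first few entries as an attention sink and otherwise keeps only the most recent window. This is an illustration under our own assumptions, not the released implementation; the names (`RollingKVCache`, `sink_len`, `window_len`, `causal_step`) are placeholders.

```python
import torch


class RollingKVCache:
    """Illustrative fixed-size KV cache: keep the first `sink_len` tokens
    (attention sink) plus the most recent `window_len` tokens."""

    def __init__(self, sink_len: int, window_len: int):
        self.sink_len = sink_len
        self.window_len = window_len
        self.k = None  # [batch, heads, seq, dim]
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        # Evict the middle once the cache exceeds sink + window.
        max_len = self.sink_len + self.window_len
        if self.k.size(2) > max_len:
            self.k = torch.cat([self.k[:, :, :self.sink_len],
                                self.k[:, :, -self.window_len:]], dim=2)
            self.v = torch.cat([self.v[:, :, :self.sink_len],
                                self.v[:, :, -self.window_len:]], dim=2)
        return self.k, self.v


def causal_step(q_new, k_new, v_new, cache: RollingKVCache):
    """One autoregressive attention step over the sink + rolling window.

    Assumes block-wise causality: the new chunk attends to everything
    currently cached (sink, recent window, and itself).
    """
    k, v = cache.append(k_new, v_new)
    scale = q_new.size(-1) ** 0.5
    attn = torch.softmax(q_new @ k.transpose(-2, -1) / scale, dim=-1)
    return attn @ v
```

Because the cache never grows beyond `sink_len + window_len`, each step costs the same regardless of how many frames have already been generated, which is what allows constant-speed, arbitrarily long rollouts.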
To build a teacher motion-controlled video model, we extract 2D tracks from the input video and encode them with a lightweight track head. These track embeddings are combined with image, noisy video latents, and text embeddings as input to a bidirectional diffusion transformer trained with a flow-matching loss (top). We introduce joint motion-text guidance as the distillation target and train a few-step causal student model through Self Forcing-style DMD distillation with autoregressive rollout, a rolling KV cache, and an attention sink (bottom). By properly simulating inference-time extrapolation with the attention sink and rolling KV cache during training, our method generates long videos at constant throughput and latency.
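One common way to realize joint motion-text guidance is to combine unconditional, text-only, and text-plus-motion predictions in a classifier-free-guidance style. The sketch below is our illustrative reading of such a combination; the `model` signature, null conditions, and guidance weights are assumptions rather than the paper's actual interface.

```python
def joint_guidance(model, x_t, t, text, tracks, w_text=5.0, w_motion=3.0):
    """Illustrative joint motion-text guidance (placeholder interface).

    Blends an unconditional velocity prediction with text-only and
    text+motion predictions; the guided output serves as a single
    distillation target for the causal student.
    """
    v_uncond = model(x_t, t, text=None, tracks=None)   # no conditioning
    v_text = model(x_t, t, text=text, tracks=None)     # text only
    v_full = model(x_t, t, text=text, tracks=tracks)   # text + motion
    return (v_uncond
            + w_text * (v_text - v_uncond)
            + w_motion * (v_full - v_text))
```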
Failure Cases: Our model can produce artifacts when motion trajectories are extremely rapid or physically implausible, and it sometimes struggles to preserve source details in highly complex scenes. In the cat, Mona Lisa, and turtle examples below, the intention was to 1) bring the cat out of the box, 2) flip the book pages, and 3) make the turtle hatch out of the egg. These examples result in physically implausible movements due to both imperfect user-drawn drag motions (a limitation in the accuracy of hand-drawn trajectories) and the backbone model's limited generalization capacity. In the Diverse People case, we also observe that detailed human identities drift and artifacts appear as imperfect trajectories are continuously fed in, which we believe could be partially alleviated with an improved model backbone.
In the bottom two rows, we evaluate the World Exploration task using game-engine footage. Because rapid perspective and scene changes are common in world exploration, no tracks remain visible after a few frames, necessitating frequent re-initialization. For this reason, we repeatedly apply CoTracker 3 on video chunks (visible as periodically reappearing dense grids); a sketch of this chunked re-tracking follows this paragraph. Since our model assigns a distinct positional embedding to each unique track, this discontinuous signal confuses the model (re-initializing tracks means they suddenly appear or disappear) and results in transient artifacts. Furthermore, our fixed attention sink is designed for stability by anchoring the initial frame, so when the scene diverges drastically or becomes too ambiguous, the model sometimes reverts to the visual features of that anchor. This indicates that while MotionStream excels at animating dynamics within a fixed scene or performing moderate camera control, the combination of point trajectories (which are by nature brittle under long videos and frequent scene changes) and the attention sink is less optimal for open-ended world exploration. Nonetheless, we believe there could be better designs, such as a dynamic attention sink or a method that derives smooth transitions between track chunks, which we leave as promising future work. For additional discussion, please see the Limitation and Future Work section in our paper's supplementary materials.
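For clarity, here is a rough sketch of the chunked re-initialization described above. The `track_points` stub stands in for running a point tracker such as CoTracker 3 from a fresh dense grid on each chunk; it, the chunk length, and the grid size are hypothetical placeholders, not part of our released pipeline.

```python
import numpy as np


def track_points(chunk: np.ndarray, grid: int = 16) -> np.ndarray:
    """Placeholder tracker: returns a static dense grid per frame.
    In practice this would be CoTracker 3 initialized from a fresh grid."""
    t, h, w = chunk.shape[:3]
    ys, xs = np.mgrid[0:h:h // grid, 0:w:w // grid]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(np.float32)
    return np.repeat(pts[None], t, axis=0)  # [t, num_points, 2]


def retrack_in_chunks(frames: np.ndarray, chunk_len: int = 16):
    """Re-initialize dense tracks every `chunk_len` frames (illustrative).

    Each chunk starts from a fresh grid, so its tracks carry new
    identities (and hence new positional embeddings) -- this identity
    discontinuity is the source of the transient artifacts noted above.
    """
    all_tracks = []
    for start in range(0, len(frames), chunk_len):
        chunk = frames[start:start + chunk_len]
        tracks = track_points(chunk)  # fresh grid -> new track identities
        all_tracks.append((start, tracks))
    return all_tracks
```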
@article{shin2025motionstream,
title={MotionStream: Real-Time Video Generation with Interactive Motion Controls},
author={Shin, Joonghyuk and Li, Zhengqi and Zhang, Richard and Zhu, Jun-Yan and Park, Jaesik and Shechtman, Eli and Huang, Xun},
journal={arXiv preprint arXiv:2511.01266},
year={2025}
}