TL;DR: We present MotionStream, a streaming (real-time, long-duration) video generation system with motion controls, unlocking new possibilities for interactive content generation. More examples below!
⚡ Note: Our model runs causally in real time on a single NVIDIA H100 GPU (29 FPS, 0.4s latency).
All video results shown here are raw screen captures without any post-processing.
Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, which enables sub-second latency and streaming generation at up to 29 FPS on a single GPU. Our approach begins by augmenting a text-to-video model with motion control; the resulting model generates high-quality videos that adhere to the global text prompt and local motion guidance, but cannot run inference on-the-fly. We therefore distill this bidirectional teacher into a causal student via the Self Forcing paradigm with a distribution matching loss, enabling real-time streaming inference. Several key challenges arise when generating videos over long, potentially infinite time horizons: (1) bridging the domain gap between training on finite-length videos and extrapolating to infinite horizons, (2) sustaining high quality and preventing error accumulation, and (3) maintaining fast inference without the growing computational cost of an ever-increasing context window. A key to our approach is a carefully designed sliding-window causal attention with a KV cache combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolation with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real time, delivering a truly interactive experience.
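To make the fixed-context mechanism concrete, below is a minimal PyTorch-style sketch of a rolling KV cache that always retains the first few entries as an attention sink and otherwise keeps only the most recent window. This is an illustration under our own assumptions, not the released implementation; the names (`RollingKVCache`, `sink_len`, `window_len`, `causal_step`) are placeholders.

```python
import torch


class RollingKVCache:
    """Illustrative fixed-size KV cache: keep the first `sink_len` tokens
    (attention sink) plus the most recent `window_len` tokens."""

    def __init__(self, sink_len: int, window_len: int):
        self.sink_len = sink_len
        self.window_len = window_len
        self.k = None  # [batch, heads, seq, dim]
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        # Evict the middle once the cache exceeds sink + window.
        max_len = self.sink_len + self.window_len
        if self.k.size(2) > max_len:
            self.k = torch.cat([self.k[:, :, :self.sink_len],
                                self.k[:, :, -self.window_len:]], dim=2)
            self.v = torch.cat([self.v[:, :, :self.sink_len],
                                self.v[:, :, -self.window_len:]], dim=2)
        return self.k, self.v


def causal_step(q_new, k_new, v_new, cache: RollingKVCache):
    """One autoregressive attention step over the sink + rolling window.

    Assumes block-wise causality: the new chunk attends to everything
    currently cached (sink, recent window, and itself).
    """
    k, v = cache.append(k_new, v_new)
    scale = q_new.size(-1) ** 0.5
    attn = torch.softmax(q_new @ k.transpose(-2, -1) / scale, dim=-1)
    return attn @ v
```

Because the cache never grows beyond `sink_len + window_len`, each step costs the same regardless of how many frames have already been generated, which is what allows constant-speed, arbitrarily long rollouts.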
To build a teacher motion-controlled video model, we extract 2D tracks from the input video and encode them with a lightweight track head. These track embeddings are combined with image, noisy video latents, and text embeddings as input to a bidirectional diffusion transformer trained with a flow-matching loss (top). We introduce joint motion-text guidance as the distillation target and train a few-step causal student model through Self Forcing-style DMD distillation with autoregressive rollout, a rolling KV cache, and an attention sink (bottom). By properly simulating inference-time extrapolation with the attention sink and rolling KV cache during training, our method generates long videos at constant throughput and latency.
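One common way to realize joint motion-text guidance is to combine unconditional, text-only, and text-plus-motion predictions in a classifier-free-guidance style. The sketch below is our illustrative reading of such a combination; the `model` signature, null conditions, and guidance weights are assumptions rather than the paper's actual interface.

```python
def joint_guidance(model, x_t, t, text, tracks, w_text=5.0, w_motion=3.0):
    """Illustrative joint motion-text guidance (placeholder interface).

    Blends an unconditional velocity prediction with text-only and
    text+motion predictions; the guided output serves as a single
    distillation target for the causal student.
    """
    v_uncond = model(x_t, t, text=None, tracks=None)   # no conditioning
    v_text = model(x_t, t, text=text, tracks=None)     # text only
    v_full = model(x_t, t, text=text, tracks=tracks)   # text + motion
    return (v_uncond
            + w_text * (v_text - v_uncond)
            + w_motion * (v_full - v_text))
```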
Failure Cases: Our model can produce artifacts when motion trajectories are extremely rapid or physically implausible, and it sometimes struggles to preserve source details in highly complex scenes. In the cat, Mona Lisa, and turtle examples below, the intention was to 1) bring the cat out of the box, 2) flip the book pages, and 3) make the turtle hatch out of the egg. These examples result in physically implausible movements due to both imperfect user-drawn drag motions (a limitation in the accuracy of hand-drawn trajectories) and the backbone model's limited generalization capacity. In the Diverse People case, we also observe that detailed human identities drift and artifacts appear as imperfect trajectories are continuously fed in, which we believe could be partially alleviated with an improved model backbone.
In the bottom two rows, we evaluate the World Exploration task using game-engine footage. Because rapid perspective and scene changes are common in world exploration, no tracks remain visible after a few frames, necessitating frequent re-initialization. For this reason, we repeatedly apply CoTracker 3 on video chunks (visible as periodically reappearing dense grids); a sketch of this chunked re-tracking follows this paragraph. Since our model assigns a distinct positional embedding to each unique track, this discontinuous signal confuses the model (re-initializing tracks means they suddenly appear or disappear) and results in transient artifacts. Furthermore, our fixed attention sink is designed for stability by anchoring the initial frame, so when the scene diverges drastically or becomes too ambiguous, the model sometimes reverts to the visual features of that anchor. This indicates that while MotionStream excels at animating dynamics within a fixed scene or performing moderate camera control, the combination of point trajectories (which are by nature brittle under long videos and frequent scene changes) and the attention sink is less optimal for open-ended world exploration. Nonetheless, we believe there could be better designs, such as a dynamic attention sink or a method that derives smooth transitions between track chunks, which we leave as promising future work. For additional discussion, please see the Limitation and Future Work section in our paper's supplementary materials.
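For clarity, here is a rough sketch of the chunked re-initialization described above. The `track_points` stub stands in for running a point tracker such as CoTracker 3 from a fresh dense grid on each chunk; it, the chunk length, and the grid size are hypothetical placeholders, not part of our released pipeline.

```python
import numpy as np


def track_points(chunk: np.ndarray, grid: int = 16) -> np.ndarray:
    """Placeholder tracker: returns a static dense grid per frame.
    In practice this would be CoTracker 3 initialized from a fresh grid."""
    t, h, w = chunk.shape[:3]
    ys, xs = np.mgrid[0:h:h // grid, 0:w:w // grid]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(np.float32)
    return np.repeat(pts[None], t, axis=0)  # [t, num_points, 2]


def retrack_in_chunks(frames: np.ndarray, chunk_len: int = 16):
    """Re-initialize dense tracks every `chunk_len` frames (illustrative).

    Each chunk starts from a fresh grid, so its tracks carry new
    identities (and hence new positional embeddings) -- this identity
    discontinuity is the source of the transient artifacts noted above.
    """
    all_tracks = []
    for start in range(0, len(frames), chunk_len):
        chunk = frames[start:start + chunk_len]
        tracks = track_points(chunk)  # fresh grid -> new track identities
        all_tracks.append((start, tracks))
    return all_tracks
```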
@article{shin2025motionstream,
title={MotionStream: Real-Time Video Generation with Interactive Motion Controls},
author={Shin, Joonghyuk and Li, Zhengqi and Zhang, Richard and Zhu, Jun-Yan and Park, Jaesik and Shechtman, Eli and Huang, Xun},
journal={arXiv preprint arXiv:2511.01266},
year={2025}
}