Best viewed in Chrome. Please wait for the videos to load.
TL;DR: We present MotionStream, a streaming (real-time, long-duration) video generation system with motion controls, unlocking new possibilities for interactive content generation. More examples below!
⚡ Note: Our model runs causally in real time on a single NVIDIA H100 GPU (29 FPS, 0.4s Latency).
All video results shown here are raw screen captures without any post-processing.
Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, which enables sub-second latency and up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which produces high-quality videos that adhere to the global text prompt and local motion guidance but cannot run on the fly. We therefore distill this bidirectional teacher into a causal student through the Self Forcing paradigm with a distribution matching loss, enabling real-time streaming inference. Several key challenges arise when generating videos over long, potentially infinite time horizons: (1) bridging the domain gap between training on finite-length videos and extrapolating to infinite horizons, (2) sustaining high quality and preventing error accumulation, and (3) maintaining fast inference without computational costs growing as the context window increases. A key to our approach is a carefully designed sliding-window causal attention with a KV cache and attention sinks. By incorporating self-rollout with attention sinks and KV-cache rolling during training, we properly simulate inference-time extrapolation with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see the results unfold in real time, delivering a truly interactive experience.
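To make the constant-memory idea concrete, below is a minimal sketch (illustrative names only, not the released code) of a rolling KV cache with attention sinks: the first few tokens are pinned as sinks and the remaining slots hold a sliding window over the most recent tokens, so cache size and attention cost stay fixed no matter how long the stream runs.

```python
import torch

class RollingKVCache:
    """Hypothetical fixed-size KV cache with attention sinks.

    The first `num_sink` tokens are always kept (attention sinks); the
    remaining slots hold a sliding window of the most recent tokens, so
    memory and per-step attention cost stay constant for arbitrarily
    long streams.
    """

    def __init__(self, num_sink: int, window: int, dim: int):
        self.num_sink = num_sink
        self.window = window
        self.k = torch.empty(0, dim)  # cached keys, shape [T, dim]
        self.v = torch.empty(0, dim)  # cached values, shape [T, dim]

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Add the newest tokens to the cache.
        self.k = torch.cat([self.k, k_new], dim=0)
        self.v = torch.cat([self.v, v_new], dim=0)

        # Once the cache exceeds sink + window, evict the middle tokens,
        # keeping the sink tokens and the most recent `window` tokens.
        max_len = self.num_sink + self.window
        if self.k.shape[0] > max_len:
            cut = self.k.shape[0] - self.window
            self.k = torch.cat([self.k[: self.num_sink], self.k[cut:]], dim=0)
            self.v = torch.cat([self.v[: self.num_sink], self.v[cut:]], dim=0)


# Example: stream 100 "frames" of 16 tokens each; cache size stays bounded.
cache = RollingKVCache(num_sink=16, window=64, dim=8)
for _ in range(100):
    cache.append(torch.randn(16, 8), torch.randn(16, 8))
print(cache.k.shape)  # torch.Size([80, 8]) regardless of stream length
```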
To build a teacher motion-controlled video model, we extract 2D tracks from the input video and encode them with a lightweight track head. These track embeddings are combined with image, noisy video latents, and text embeddings as input to a bidirectional diffusion transformer trained with a flow-matching loss (top). We introduce joint motion-text guidance as the distillation target and train a few-step causal student model through Self Forcing-style DMD distillation with autoregressive rollout, a rolling KV cache, and an attention sink (bottom). By properly simulating inference-time extrapolation with the attention sink and rolling KV cache during training, our method generates long videos at constant throughput and latency.
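The streaming rollout can be pictured roughly as follows. This is a hypothetical sketch with stub modules (`StubStudent` and `stream_generate` are illustrative names, not the actual API): each latent chunk is denoised in a few causal student steps conditioned on the text prompt and the user's track embeddings for that chunk, then emitted immediately, which is what lets results unfold in real time.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the few-step causal student; the real model
# is a diffusion transformer, but a linear stub keeps this sketch runnable.
class StubStudent(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim * 3, dim)

    def forward(self, latent, text_emb, track_emb):
        return self.proj(torch.cat([latent, text_emb, track_emb], dim=-1))


def stream_generate(student, text_emb, track_chunks, dim, denoise_steps=4):
    """Chunk-by-chunk autoregressive rollout: each latent chunk is
    denoised in a few student steps conditioned on the text prompt and
    the user-supplied tracks for that chunk, then yielded immediately."""
    for track_emb in track_chunks:
        latent = torch.randn(1, dim)      # start each chunk from noise
        for _ in range(denoise_steps):    # few-step causal denoising
            latent = student(latent, text_emb, track_emb)
        yield latent                      # stream the chunk to the user


dim = 8
student = StubStudent(dim)
text_emb = torch.randn(1, dim)
track_chunks = [torch.randn(1, dim) for _ in range(5)]  # e.g., painted trajectories
for chunk in stream_generate(student, text_emb, track_chunks, dim):
    pass  # in the real system, each chunk is decoded and displayed as it arrives
```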
@article{shin2025motionstream,
  title={MotionStream: Real-Time Video Generation with Interactive Motion Controls},
  author={Shin, Joonghyuk and Li, Zhengqi and Zhang, Richard and Zhu, Jun-Yan and Park, Jaesik and Schechtman, Eli and Huang, Xun},
  journal={arXiv preprint arXiv:2511.01266},
  year={2025}
}