Real-time Streaming Demo on a Single GPU: Through simple click-and-drag sequences, MotionStream enables real-time control of diverse scenarios, including both object motion and camera movement, across various grid configurations. Given its autoregressive nature, users can also pause/resume (using the space key) and add static points or multiple moving tracks to specify control more precisely. Benefiting from a constant attention context with an attention sink and a rolling KV cache (sketched below), users can experience 29 FPS at 480p with our 1.3B model and 24 FPS at 720p with our 5B model, all with sub-second latency, on a single H100 GPU (we use no optimizations beyond mixed precision and Flash Attention 3).

By anchoring attention to the clean first chunk at all times and keeping only a small number of drifted chunks in the attention context, the model often recovers quality even after disruptions, showcasing its resilience in long video streaming. Check out our long video example in the Full Gallery, where it generates 5,000 frames. Because the demo is streamed, it is highly sensitive to network latency and instability: with multiple latency bottlenecks along the way, your grid may already be in the wrong place by the time it reaches the generation pipeline, so performance is best on local clusters.

Note: our initial demos were recorded on a system with Flash Attention 2 (FA2), resulting in ~25 FPS, while videos with the FA3 tag are newly recorded with Flash Attention 3 (~30 FPS, lower latency) and an updated front-end. All streaming demo videos use the 1.3B model variant. In the demo, tracks are color-coded: green for online user-dragged motion, red for static points, and blue for pre-drawn paths used to move multiple points simultaneously.
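To make the constant-context mechanism concrete, here is a minimal sketch of a sink-plus-rolling KV cache in PyTorch. The class name, chunk granularity, and tensor layout are illustrative assumptions, not our released implementation: the point is simply that the first clean chunk is pinned while only the most recent chunks are retained, so the attention context stays constant no matter how long the stream runs.

```python
from collections import deque

import torch


class SinkRollingKVCache:
    """Illustrative rolling KV cache with an attention sink (not the released code)."""

    def __init__(self, max_recent_chunks: int = 3):
        self.sink_kv = None                             # clean first chunk, never evicted
        self.recent = deque(maxlen=max_recent_chunks)   # latest chunks; oldest drops off

    def append_chunk(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # k, v: [batch, heads, chunk_len, head_dim] for one newly generated chunk
        if self.sink_kv is None:
            self.sink_kv = (k, v)        # pin the clean first chunk as the sink
        else:
            self.recent.append((k, v))   # deque evicts the oldest chunk automatically

    def context(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Sink + recent chunks, concatenated along the sequence dimension;
        # the total length is bounded, so per-chunk attention cost is constant.
        chunks = [self.sink_kv, *self.recent]
        keys = torch.cat([k for k, _ in chunks], dim=2)
        values = torch.cat([v for _, v in chunks], dim=2)
        return keys, values
```

In a streaming loop, each new chunk's keys and values would be appended after generation, and `context()` would supply the attention context for the next chunk; evicting middle chunks while keeping the sink is what keeps latency and memory flat over thousands of frames.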