The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs, including text, reference audio, and reference motion, facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and automated video dubbing within a single, coherent model. JAM-Flow advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis.
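To make the coupling concrete, the sketch below illustrates one possible form of a selective joint-attention step: both token streams are placed on a shared time axis (so positional embeddings can be temporally aligned) and cross-modal attention is restricted to a local temporal window. This is a minimal PyTorch illustration only; the function names, frame rates, and the 0.5 s window are assumptions and are not taken from the released implementation.

```python
# Minimal sketch of localized joint attention between Motion-DiT and Audio-DiT
# token streams (illustrative only; names, frame rates, and the window size
# are assumptions, not values from the paper or released code).
import torch
import torch.nn.functional as F

def shared_time_axis(n_motion, n_audio, motion_fps=25.0, mel_frames_per_sec=93.75):
    """Place both token streams on a common time axis (seconds) so that
    positional embeddings can be aligned across modalities."""
    t_motion = torch.arange(n_motion) / motion_fps
    t_audio = torch.arange(n_audio) / mel_frames_per_sec
    return t_motion, t_audio

def local_cross_mask(t_query, t_key, window_sec=0.5):
    """Boolean mask letting each query token attend only to tokens of the
    other modality that lie within a local temporal window."""
    return (t_query[:, None] - t_key[None, :]).abs() <= window_sec

def joint_attention(q_motion, k_audio, v_audio, t_motion, t_audio, window_sec=0.5):
    """One direction of a joint-attention layer: motion queries attend to
    audio keys/values, restricted by the localized temporal mask.
    Shapes: q_motion (B, H, Nm, D); k_audio, v_audio (B, H, Na, D)."""
    mask = local_cross_mask(t_motion, t_audio, window_sec)  # (Nm, Na), True = attend
    return F.scaled_dot_product_attention(q_motion, k_audio, v_audio, attn_mask=mask)

# Toy usage: ~2 s of motion tokens at 25 fps and ~2 s of mel-frame audio tokens.
B, H, D = 1, 4, 64
t_m, t_a = shared_time_axis(50, 188)
out = joint_attention(torch.randn(B, H, 50, D), torch.randn(B, H, 188, D),
                      torch.randn(B, H, 188, D), t_m, t_a)
```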
Comparison of our JAM-Flow variants (I2V and V2V) with state-of-the-art talking head generation methods including SadTalker, AniPortrait, Hallo, and Hallo3. Each row shows the same input processed by different methods alongside the ground truth.
Audio quality comparison between the F5-TTS baseline and our JAM-Flow methods. As noted in our paper, our primary model shows a slight decrease in TTS metrics compared to the original F5-TTS model. This is primarily attributed to the architectural modifications required for joint audio-motion synthesis: in pure TTS settings, our model lacks the motion cues it was designed to leverage. Nevertheless, it still delivers high-quality speech synthesis while offering significantly expanded capabilities for diverse downstream tasks. Our variant (marked with †) uses a frozen Audio-DiT and no motion attention mask, a configuration closer to the original F5-TTS, and therefore achieves better TTS metrics, while our primary model provides the best balance between audio quality and motion synthesis capability.
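For reference, the † configuration can be expressed conceptually as below. This is a hypothetical sketch that assumes a PyTorch model exposing an `audio_dit` submodule and a `use_motion_attn_mask` flag; these attribute names are illustrative and may not match the released code.

```python
# Hypothetical sketch of the dagger (†) variant: Audio-DiT frozen and the
# localized motion attention mask disabled. Attribute names are assumptions.
def configure_tts_oriented_variant(model):
    # Keep the F5-TTS-initialized Audio-DiT weights fixed during training.
    for p in model.audio_dit.parameters():
        p.requires_grad = False
    # Allow full (unmasked) joint attention, closer to vanilla F5-TTS behavior.
    model.use_motion_attn_mask = False
    return model
```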
Demonstration of our method's capability for automated video dubbing, showing temporal alignment between generated speech and existing visual content. We compare against state-of-the-art dubbing methods including HPMDubbing, StyleDubber, and VoiceCraft-Dub.
JAM-Flow enables unique cross-modal generation capabilities that showcase the flexibility of our joint audio-motion model. Due to size limitations, we present a selection of interesting cases below. Many more combinations are possible with our unified framework.
The most straightforward multimodal generation case. Given only text input, our model generates both synchronized audio and motion from scratch.
Similar to Case 1, but with added voice cloning capability. The model generates audio matching the reference speaker's voice characteristics while creating synchronized motion.
Samples coming soon...
This case demonstrates audio generation constrained by existing motion. To keep the generated speech matched to the video length, we reorder the words of the original prompt rather than using entirely different content. For example, "JAM-Flow matches audio to video" becomes "Audio to video JAM-Flow matches". This preserves the approximate text length while changing the semantic ordering, showing how our model adapts audio generation to fit frozen motion patterns with altered content.
An extreme case where the model must infer appropriate audio solely from motion patterns, without any text cues. Note: Without textual guidance, the model does not generate proper sentences or meaningful words. However, the generated audio still exhibits reasonable synchronization with lip movements, demonstrating the model's learned audio-visual correlations at a phonetic level rather than semantic level.
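The cases above all reduce to different choices of which modalities are supplied as fixed context and which are filled in under the inpainting-style objective. The snippet below summarizes this mapping with a hypothetical conditioning record; the field names and file names are illustrative only and do not reflect the actual inference API.

```python
# Hypothetical summary of the cross-modal cases as inpainting-style
# conditioning choices: "given" modalities are kept fixed as context, the
# rest are generated. Field names and file names are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Conditioning:
    text: Optional[str] = None        # transcript to be spoken (if any)
    ref_audio: Optional[str] = None   # reference speech for voice cloning
    given_motion: bool = False        # freeze existing motion, generate audio
    given_audio: bool = False         # freeze existing audio, generate motion

# Case 1: text only -> generate audio and motion jointly.
case1 = Conditioning(text="JAM-Flow matches audio to video")
# Case 2: text + reference audio -> voice-cloned audio with synchronized motion.
case2 = Conditioning(text="JAM-Flow matches audio to video", ref_audio="speaker_ref.wav")
# Case 3: reordered text + frozen motion -> audio adapted to the existing motion.
case3 = Conditioning(text="Audio to video JAM-Flow matches", given_motion=True)
# Case 4: frozen motion, no text -> babble-like audio synced to lip movement.
case4 = Conditioning(given_motion=True)
```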
We present two categories of failure cases to provide transparent insight into current limitations of our approach. Understanding these boundaries is crucial for future improvements.
When there is a significant length mismatch between the input modalities (text, audio, motion), the model struggles to maintain proper lip-sync. While our model typically handles minor mismatches by generating natural interjections (sighs, "aha", "ahh", "oh"), severe length discrepancies can cause synchronization failures in which lip movements no longer align with the generated audio or text content.
Our approach relies on the LivePortrait base model for keypoint detection and warping. When LivePortrait fails to detect facial keypoints, which is particularly common with non-realistic inputs such as flat cartoons or highly stylized artwork in image-to-video (I2V) setups, our model cannot generate proper motion. This fundamental dependency means that inputs outside LivePortrait's detection capabilities will result in suboptimal or failed generation.