The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs, including text, reference audio, and reference motion, facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and automated video dubbing within a single, coherent model. JAM-Flow advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis.
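To make the coupling concrete, the sketch below illustrates one possible form of a selective joint-attention step: both token streams are placed on a shared time axis (so positional embeddings can be temporally aligned) and cross-modal attention is restricted to a local temporal window. This is a minimal PyTorch illustration only; the function names, frame rates, and the 0.5 s window are assumptions and are not taken from the released implementation.

```python
# Minimal sketch of localized joint attention between Motion-DiT and Audio-DiT
# token streams (illustrative only; names, frame rates, and the window size
# are assumptions, not values from the paper or released code).
import torch
import torch.nn.functional as F

def shared_time_axis(n_motion, n_audio, motion_fps=25.0, mel_frames_per_sec=93.75):
    """Place both token streams on a common time axis (seconds) so that
    positional embeddings can be aligned across modalities."""
    t_motion = torch.arange(n_motion) / motion_fps
    t_audio = torch.arange(n_audio) / mel_frames_per_sec
    return t_motion, t_audio

def local_cross_mask(t_query, t_key, window_sec=0.5):
    """Boolean mask letting each query token attend only to tokens of the
    other modality that lie within a local temporal window."""
    return (t_query[:, None] - t_key[None, :]).abs() <= window_sec

def joint_attention(q_motion, k_audio, v_audio, t_motion, t_audio, window_sec=0.5):
    """One direction of a joint-attention layer: motion queries attend to
    audio keys/values, restricted by the localized temporal mask.
    Shapes: q_motion (B, H, Nm, D); k_audio, v_audio (B, H, Na, D)."""
    mask = local_cross_mask(t_motion, t_audio, window_sec)  # (Nm, Na), True = attend
    return F.scaled_dot_product_attention(q_motion, k_audio, v_audio, attn_mask=mask)

# Toy usage: ~2 s of motion tokens at 25 fps and ~2 s of mel-frame audio tokens.
B, H, D = 1, 4, 64
t_m, t_a = shared_time_axis(50, 188)
out = joint_attention(torch.randn(B, H, 50, D), torch.randn(B, H, 188, D),
                      torch.randn(B, H, 188, D), t_m, t_a)
```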
Comparison of our JAM-Flow variants (I2V and V2V) with state-of-the-art talking head generation methods including SadTalker, AniPortrait, Hallo, and Hallo3. Each row shows the same input processed by different methods alongside the ground truth.
Audio quality comparison between the F5-TTS baseline and our JAM-Flow methods. As noted in our paper, our primary model shows a slight decrease in TTS metrics compared to the original F5-TTS model. This is primarily attributed to the architectural modifications required for joint audio-motion synthesis: in pure TTS settings, our model lacks the motion cues it was designed to leverage. Nevertheless, it still delivers high-quality speech synthesis while offering significantly expanded capabilities for diverse downstream tasks. Our variant (marked with †) uses a frozen Audio-DiT and no motion attention mask, a configuration closer to the original F5-TTS, and therefore achieves better TTS metrics, while our primary model provides the best balance between audio quality and motion synthesis capability.
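For reference, the † configuration can be expressed conceptually as below. This is a hypothetical sketch that assumes a PyTorch model exposing an `audio_dit` submodule and a `use_motion_attn_mask` flag; these attribute names are illustrative and may not match the released code.

```python
# Hypothetical sketch of the dagger (†) variant: Audio-DiT frozen and the
# localized motion attention mask disabled. Attribute names are assumptions.
def configure_tts_oriented_variant(model):
    # Keep the F5-TTS-initialized Audio-DiT weights fixed during training.
    for p in model.audio_dit.parameters():
        p.requires_grad = False
    # Allow full (unmasked) joint attention, closer to vanilla F5-TTS behavior.
    model.use_motion_attn_mask = False
    return model
```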
Demonstration of our method's capability for automated video dubbing, showing temporal alignment between generated speech and existing visual content. We compare against state-of-the-art dubbing methods including HPMDubbing, StyleDubber, and VoiceCraft-Dub.
JAM-Flow enables unique cross-modal generation capabilities that showcase the flexibility of our joint audio-motion model. Due to size limitations, we present a selection of interesting cases below. Many more combinations are possible with our unified framework.
The most straightforward multimodal generation case. Given only text input, our model generates both synchronized audio and motion from scratch.
Similar to Case 1, but with added voice cloning capability. The model generates audio matching the reference speaker's voice characteristics while creating synchronized motion.
Samples coming soon...
This case demonstrates audio generation constrained by existing motion. To keep the generated speech matched to the video length, we reorder the words of the original prompt rather than using entirely different content. For example, "JAM-Flow matches audio to video" becomes "Audio to video JAM-Flow matches". This preserves the approximate text length while changing the semantic ordering, showing how our model adapts audio generation to fit frozen motion patterns with altered content.
An extreme case where the model must infer appropriate audio solely from motion patterns, without any text cues. Note: Without textual guidance, the model does not generate proper sentences or meaningful words. However, the generated audio still exhibits reasonable synchronization with lip movements, demonstrating the model's learned audio-visual correlations at a phonetic level rather than semantic level.
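The cases above all reduce to different choices of which modalities are supplied as fixed context and which are filled in under the inpainting-style objective. The snippet below summarizes this mapping with a hypothetical conditioning record; the field names and file names are illustrative only and do not reflect the actual inference API.

```python
# Hypothetical summary of the cross-modal cases as inpainting-style
# conditioning choices: "given" modalities are kept fixed as context, the
# rest are generated. Field names and file names are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Conditioning:
    text: Optional[str] = None        # transcript to be spoken (if any)
    ref_audio: Optional[str] = None   # reference speech for voice cloning
    given_motion: bool = False        # freeze existing motion, generate audio
    given_audio: bool = False         # freeze existing audio, generate motion

# Case 1: text only -> generate audio and motion jointly.
case1 = Conditioning(text="JAM-Flow matches audio to video")
# Case 2: text + reference audio -> voice-cloned audio with synchronized motion.
case2 = Conditioning(text="JAM-Flow matches audio to video", ref_audio="speaker_ref.wav")
# Case 3: reordered text + frozen motion -> audio adapted to the existing motion.
case3 = Conditioning(text="Audio to video JAM-Flow matches", given_motion=True)
# Case 4: frozen motion, no text -> babble-like audio synced to lip movement.
case4 = Conditioning(given_motion=True)
```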
We present two categories of failure cases to provide transparent insight into current limitations of our approach. Understanding these boundaries is crucial for future improvements.
When there is a significant length mismatch between the input modalities (text, audio, motion), the model struggles to maintain proper lip-sync. While our model typically handles minor mismatches by generating natural interjections (sighs, "aha", "ahh", "oh"), severe length discrepancies can cause synchronization failures in which lip movements no longer align with the generated audio or text content.
Our approach relies on the LivePortrait base model for keypoint detection and warping. When LivePortrait fails to detect facial keypoints, which is particularly common with non-realistic inputs such as flat cartoons or highly stylized artwork in image-to-video (I2V) setups, our model cannot generate proper motion. This fundamental dependency means that inputs outside LivePortrait's detection capabilities will result in suboptimal or failed generation.