JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

1Yonsei University    2CineLingo    3Seoul National University
*Equal Contribution    Equal Advising
Please play the videos with sound enabled; they may take a moment to load.

Abstract

The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs—including text, reference audio, and reference motion—facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis.
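
To make the coupling concrete, below is a minimal sketch of how a selective joint attention layer might tie the Motion-DiT and Audio-DiT token streams together. The class and argument names (JointAttention, d_model, n_heads) are illustrative assumptions, not the released implementation.

    # Minimal sketch of a selective joint attention layer (names are assumptions).
    import torch
    import torch.nn as nn

    class JointAttention(nn.Module):
        """Concatenate motion and audio tokens, attend over both jointly,
        then split the result back into per-modality streams."""

        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, motion, audio, attn_mask=None):
            # motion: (B, N_m, D), audio: (B, N_a, D)
            x = torch.cat([motion, audio], dim=1)             # one shared token sequence
            out, _ = self.attn(x, x, x, attn_mask=attn_mask)  # joint self-attention
            return out[:, : motion.size(1)], out[:, motion.size(1):]

In the full model, such layers would be inserted only at selected depths, so each DiT otherwise keeps its modality-specific processing.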

Talking Head Lip-Sync Comparison on HDTF Dataset

We compare both variants of JAM-Flow (I2V and V2V) against Wav2Lip, SadTalker, AniPortrait, EDTalk, Hallo, and Hallo3. Each row shows the same input processed by every method, presented alongside the ground truth for reference.

Text-to-Speech Quality Comparison on LibriSpeech-PC test-clean

We compare audio quality against the F5-TTS baseline. As discussed in the paper, the primary model trades a small amount of TTS quality for joint audio-motion synthesis, since in a pure TTS setting it has no motion to condition on. The variant marked with † keeps the Audio-DiT frozen and removes the motion attention mask, a configuration closer to the original F5-TTS; it recovers most of the TTS metrics, whereas the primary model is tuned toward motion synthesis instead.
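
A rough sketch of how this variant could be configured in PyTorch is shown below; the attribute names audio_dit and use_motion_mask are hypothetical stand-ins for whatever the implementation exposes.

    import torch.nn as nn

    def make_tts_variant(model: nn.Module) -> nn.Module:
        """Configure the dagger-marked variant: freeze the Audio-DiT branch and
        drop the motion attention mask, approximating the original F5-TTS setup.
        `audio_dit` and `use_motion_mask` are assumed attribute names."""
        for p in model.audio_dit.parameters():
            p.requires_grad = False    # Audio-DiT weights stay fixed
        model.use_motion_mask = False  # unrestricted attention, closer to F5-TTS
        return model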

Automated Video Dubbing Performance

Here we generate speech to match an existing video, and compare against HPMDubbing, StyleDubber, and VoiceCraft-Dub. The goal is for the synthesized audio to remain temporally aligned with the visible lip motion.

Cascaded Pipelines vs. Our Joint Generation

We compare joint audio-motion generation with cascaded pipelines that first generate audio with F5-TTS (given ground-truth reference audio and text), then run talking-head generation on it. Two cascaded baselines are shown: (1) EDTalk, a VAE/GAN method common in this area, and (2) Hallo3, a text-to-video diffusion approach. The joint model keeps lip motion better aligned with the audio, most visibly during fast speech and difficult articulations.

Attention Masking Ablation Study

We show the qualitative effect of the asymmetric attention masking used in our model. Without masking (left), the audio and lip motion tend to drift apart and lose temporal alignment over time. With masking (right), cross-modal attention is restricted to temporally aligned regions, and the two modalities remain in sync.
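
To illustrate the idea (a simplified symmetric band, not the paper's exact asymmetric design), the sketch below builds a boolean mask over the concatenated [motion | audio] sequence: within-modality attention is left unrestricted, while cross-modal attention is confined to a band around the temporally aligned index. The window size and the naive frame-rate alignment are assumptions.

    import torch

    def banded_cross_modal_mask(n_motion: int, n_audio: int, window: int = 2) -> torch.Tensor:
        """Boolean mask (True = blocked) for joint attention over concatenated
        [motion | audio] tokens. Within-modality attention is unrestricted;
        cross-modal attention is limited to a local band around the temporally
        aligned position."""
        n = n_motion + n_audio
        mask = torch.zeros(n, n, dtype=torch.bool)
        mask[:n_motion, n_motion:] = True   # block all motion -> audio by default
        mask[n_motion:, :n_motion] = True   # block all audio -> motion by default
        rate = n_audio / n_motion           # naive alignment of differing frame rates
        for i in range(n_motion):
            j = round(i * rate)             # audio index aligned to motion frame i
            lo, hi = max(0, j - window), min(n_audio, j + window + 1)
            mask[i, n_motion + lo : n_motion + hi] = False  # open the local band
            mask[n_motion + lo : n_motion + hi, i] = False
        return mask

Passed as attn_mask to a joint attention layer like the sketch above, this keeps each modality free to attend to itself while pinning cross-modal attention to its temporal neighborhood.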

Our Exclusive Capabilities

Because the model is jointly trained on audio and motion, either modality can be used to condition the generation of the other. We show a few representative cases below; the same model also supports further combinations of inputs beyond those illustrated here.

Case 1: Text → Audio + Motion

Given only a text prompt, the model generates both the speech audio and the corresponding facial motion jointly, with no reference audio or motion provided.

Case 2: Text + Reference Audio → Audio + Motion

In addition to the text prompt, a short reference clip is provided so that the generated speech follows the reference speaker's voice, while the model still produces motion synchronized to that speech.

Samples to be added.

Case 3: Reference Motion + Target Text → Audio

In this setting the audio is generated under a fixed reference motion track. To keep the text and the video roughly the same length, we reorder the words of the original prompt rather than substituting unrelated text; for example, "JAM-Flow matches audio to video" becomes "Audio to video JAM-Flow matches". The overall length stays approximately constant while the word order changes, so the model must fit the generated audio to the fixed motion despite the altered content.

Case 4: Reference Motion → Audio (without text)

Here the model is given only a reference motion track and no text at all. Unsurprisingly, the output is not intelligible speech, but it still follows the lip movements closely, which suggests that the model has learned an audio-visual correspondence at the phonetic level rather than the semantic level.

Failure Case Analysis

For completeness, we also report two failure modes that we observed repeatedly in our experiments, which help delineate where the current model breaks down.

Case 1: Input Length Mismatch Between Modalities

When the input modalities (text, audio, motion) differ substantially in length, lip-sync breaks down. Minor mismatches are usually absorbed as natural interjections (sighs, "aha", "ahh", "oh"), but large discrepancies push the lips out of sync with the generated audio or text.

Case 2: LivePortrait Keypoint Detection Failure

Our pipeline relies on the LivePortrait base model for keypoint detection and warping. When LivePortrait cannot detect facial keypoints, which is common for non-realistic inputs such as flat cartoons or heavily stylized artwork in the I2V setting, our model cannot produce correct motion; in practice, any input on which LivePortrait itself fails will also fail in our setting.