Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1. Previous approaches relied on unidirectional cross-attention mechanisms, with information flowing from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates input projections from both modalities and performs a single full attention operation, allowing bidirectional information flow between text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT’s attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. Through these analyses, we propose a robust prompt-based image editing method for MM-DiT that supports global-to-local edits across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net–based methods and emerging architectures, offering deeper insights into MM-DiT’s behavioral patterns.
Architecture overview. U-Net and MM-DiT variants: dual-branch (SD3, early Flux.1 blocks), single-branch (later Flux.1 blocks), and MM-DiT-X (SD3.5).
\[ \mathbf{q} = \begin{bmatrix} \mathbf{q}_{i} \\ \mathbf{q}_{t} \end{bmatrix},\quad \mathbf{k} = \begin{bmatrix} \mathbf{k}_{i} \\ \mathbf{k}_{t} \end{bmatrix},\quad \mathbf{v} = \begin{bmatrix} \mathbf{v}_{i} \\ \mathbf{v}_{t} \end{bmatrix}. \]
\[ \mathbf{qk}^{\top} = \begin{bmatrix} \mathbf{q}_{i}\mathbf{k}_{i}^{\top} & \mathbf{q}_{i}\mathbf{k}_{t}^{\top} \\ \mathbf{q}_{t}\mathbf{k}_{i}^{\top} & \mathbf{q}_{t}\mathbf{k}_{t}^{\top} \end{bmatrix} \sim \begin{bmatrix}\text{I2I} & \text{T2I} \\ \text{I2T} & \text{T2T}\end{bmatrix}. \]
\[ \mathbf{qk}^{\top}\mathbf{v} = \begin{bmatrix} \mathbf{q}_{i}\mathbf{k}_{i}^{\top}\mathbf{v}_{i} + \mathbf{q}_{i}\mathbf{k}_{t}^{\top}\mathbf{v}_{t} \\ \mathbf{q}_{t}\mathbf{k}_{i}^{\top}\mathbf{v}_{i} + \mathbf{q}_{t}\mathbf{k}_{t}^{\top}\mathbf{v}_{t} \end{bmatrix}. \]
Each block can be interpreted as: I2I (image→image) self-attention that preserves structure/geometry, T2I (text→image) token-to-region alignment useful for localization masks, I2T (image→text) feedback to text representations (typically weaker due to row-wise softmax competition on image rows), and T2T (text→text) near-identity patterns emphasizing special/boundary tokens.
I2I reveals spatial/geometric bases analogous to U-Net self-attention.
T2I provides stronger, multi-region localization than I2T; preferred for attention mask extraction.
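To make the block decomposition concrete, the following minimal PyTorch sketch splits one joint attention call into the four blocks. The image-first token ordering and the `n_img` split point follow the equations above, but the tensor shapes are otherwise illustrative assumptions rather than any particular model's exact layout; note that the softmax is applied row-wise over the joint sequence, which is why I2T rows compete with T2T entries.

```python
import torch

def mmdit_attention_blocks(q, k, v, n_img):
    """Split joint MM-DiT attention into its four logical blocks.

    q, k, v: (batch, heads, n_img + n_txt, head_dim) -- concatenated
             image-then-text projections, as in the equations above.
    n_img:   number of image tokens (assumed split point; real models
             define it by their latent resolution).
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)

    # Row-wise softmax is shared: I2I and T2I live on the image-query rows,
    # I2T and T2T on the text-query rows.
    i2i = attn[..., :n_img, :n_img]   # image -> image (structure / geometry)
    t2i = attn[..., :n_img, n_img:]   # text  -> image (localization masks)
    i2t = attn[..., n_img:, :n_img]   # image -> text  (feedback to text)
    t2t = attn[..., n_img:, n_img:]   # text  -> text  (near-identity)

    out = attn @ v                    # standard joint attention output
    return out, (i2i, t2i, i2t, t2t)
```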
MM-DiT localizes attention more precisely than U-Net; however, attention becomes increasingly noisy (fragmented) as model scale grows and blocks deepen, consistent with recent ViT literature that reports scale-induced noise and the need for register-token stabilization.
Our MM-DiT block analysis identifies a sparse, high-quality subset; Gaussian smoothing (GS) improves metrics and further reduces noisy information in some high-capacity blocks (see paper for experiment details).
Selecting the top-5 blocks suppresses spurious activations while preserving salient regions; Gaussian smoothing removes jagged edges and cleans object boundaries.
Local edits using curated T2I masks (top-5 + GS) preserve non-target areas and avoid haloing/bleeding artifacts, leading to cleaner, semantically focused modifications.
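As a rough illustration of the mask-curation step (not the paper's exact implementation), the sketch below averages T2I maps from a pre-selected set of blocks and applies Gaussian smoothing. The function name, the map-collection format, and the sigma value are placeholder assumptions; the actual block-selection criterion is described in the paper.

```python
import torch
import torchvision.transforms.functional as TF

def curate_t2i_mask(t2i_maps, top_blocks, token_idx, hw, sigma=2.0):
    """Build a soft edit mask from curated T2I attention maps.

    t2i_maps:   dict {block_idx: (heads, n_img, n_txt)} attention maps
                collected during a forward pass (collection code omitted).
    top_blocks: indices of the high-quality blocks to keep (top-5 in the paper).
    token_idx:  index of the prompt token whose region we want to localize.
    hw:         (H, W) of the latent grid, so n_img == H * W.
    """
    h, w = hw
    maps = []
    for b in top_blocks:
        m = t2i_maps[b][:, :, token_idx].mean(dim=0)          # average over heads
        maps.append(m.reshape(1, 1, h, w))
    mask = torch.cat(maps, dim=0).mean(dim=0, keepdim=True)   # average over blocks

    # Gaussian smoothing removes jagged, fragmented activations.
    kernel = int(4 * sigma) | 1                                # odd kernel size
    mask = TF.gaussian_blur(mask, kernel_size=kernel, sigma=sigma)

    # Normalize to [0, 1]; thresholding happens later, at local-blend time.
    return (mask - mask.min()) / (mask.max() - mask.min() + 1e-8)
```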
We propose to inject source-branch image projections (qi, ki) into the target branch during early timesteps while leaving the text projections (and hence the T2T block) untouched. Following Prompt-to-Prompt, we adopt local blending, but compute the blending mask only from a curated set of T2I attention maps (top-5 blocks + GS).
Algorithm overview of our editing pipeline. Projection-level replacement keeps the fused SDPA kernel usable, and curated T2I masks enable precise local blending.
Replacing the full attention output shifts the T2T block and causes text–value misalignment. Keeping the T2T region fixed and replacing only the image-side projections (qi, ki) avoids this drift.
(qi, ki) replacement and I2I block replacement yield comparable results; the former is more efficient (preserves SDPA) and simpler to implement.
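Below is a minimal sketch of projection-level replacement written as a generic attention hook: image-side q/k recorded from the source branch overwrite those of the target branch for the first few timesteps, while text projections and all values stay untouched. The hook interface, caching scheme, and names are assumptions for illustration; in practice this logic would live inside the model's attention processors.

```python
import torch
import torch.nn.functional as F

class QKInjectionState:
    """Caches source-branch image projections and replays them on the target."""
    def __init__(self, n_img, inject_until_step):
        self.n_img = n_img
        self.inject_until = inject_until_step
        self.cache = {}          # (block_idx, step) -> (q_img, k_img)
        self.mode = "source"     # "source": record, "target": replace

def joint_attention_with_injection(q, k, v, state, block_idx, step):
    """Joint MM-DiT attention with (q_i, k_i) replacement.

    q, k, v: (batch, heads, n_img + n_txt, head_dim), image tokens first.
    Text projections (and all values) are never modified, so the T2T block
    and the text-value alignment stay exactly as in the target branch.
    """
    n = state.n_img
    if state.mode == "source":
        state.cache[(block_idx, step)] = (q[..., :n, :].clone(), k[..., :n, :].clone())
    elif state.mode == "target" and step < state.inject_until:
        q_src, k_src = state.cache[(block_idx, step)]
        q = torch.cat([q_src, q[..., n:, :]], dim=-2)
        k = torch.cat([k_src, k[..., n:, :]], dim=-2)

    # Only the attention inputs change, so the fused SDPA kernel still applies.
    return F.scaled_dot_product_attention(q, k, v)
```

Because the replacement happens before attention rather than on its output, the efficiency and simplicity advantages noted above follow directly: no custom attention kernel is needed.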
Impact of local blending threshold θ: higher θ favors precise local edits (e.g., text), while lower θ enables broader region changes.
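The local blend itself is a simple thresholded latent composite; this sketch (reusing the hypothetical soft mask from `curate_t2i_mask` above) shows how θ controls locality.

```python
def local_blend(latents_src, latents_tgt, soft_mask, theta=0.3):
    """Composite edited latents into the source according to the edit mask.

    soft_mask: (1, 1, H, W) curated T2I attention map in [0, 1].
    theta:     blending threshold -- higher values keep the edit tighter
               (e.g., text edits), lower values allow broader region changes.
    """
    m = (soft_mask > theta).float()
    return m * latents_tgt + (1.0 - m) * latents_src
```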
Real-image editing pipeline. Our projection replacement (qi, ki) operates seamlessly with RF inversion (inversion-based editing) and also supports an inversion-free path starting from random noise; see paper for detailed formulation and scheduling.
Benchmark overview. Our (qi, ki)-replacement method balances target alignment and source preservation across SD3/SD3.5/Flux variants; for fair comparison we deliberately avoid per-sample tuning and disable local blending (see paper for settings).
User study summary. Our method uniquely balances strong target alignment (on par with direct generation, which sacrifices preservation) and content preservation (comparable to simple prompt-change, which often fails to apply the edit).
@inproceedings{shin2025exploringmmdit,
title = {{Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing}},
author = {Shin, Joonghyuk and Hwang, Alchan and Kim, Yujin and Kim, Daneul and Park, Jaesik},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025}
}