Flow Matching vs. Score Matching: Theoretical Foundations
Flow matching (FM) and score matching (SM) are two generative-modeling approaches that are closely related in theory but pursue different objectives. In score-based (diffusion) models, we learn a score function (the gradient of the log probability density), or equivalently a noise predictor, by minimizing a denoising mean-squared-error (MSE) loss. For example, in DDPM-style models the objective is to minimize:
\[\mathbb{E}[\|\epsilon - \epsilon_\theta(x_t,t)\|^2]\]
i.e., predict the Gaussian noise that was added.
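A minimal sketch of this objective in PyTorch (the names `eps_model` and the precomputed `alpha_bar` schedule are illustrative assumptions, not any specific codebase):

```python
import torch

def ddpm_loss(eps_model, x0, alpha_bar):
    """DDPM noise-prediction loss: E[ || eps - eps_theta(x_t, t) ||^2 ]."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # one random timestep per sample
    eps = torch.randn_like(x0)                                  # the Gaussian noise that gets added
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))         # reshape alpha_bar_t for broadcasting
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                # closed-form forward diffusion sample
    return ((eps - eps_model(x_t, t)) ** 2).mean()              # regress the added noise
```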
In contrast, flow matching directly regresses the velocity field $v(x,t)$ of a pre-specified transport path. Concretely, FM defines a family of conditional distributions $p_t(x\mid x_1)$ running from noise to data (e.g., a Gaussian interpolation), derives the true velocity field of that conditional path, and trains a network $f_\theta(x,t)$ to fit it. The FM loss is:
\[\mathcal{L}_{FM} = \mathbb{E}_{t,\,x\sim p_t}\big[\|f_\theta(x,t) - v(x,t)\|^2\big]\]
which lets $f_\theta$ fit the true flow exactly.
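A matching sketch of the conditional FM objective with the linear path $x_t = (1-t)x_0 + t x_1$, whose true conditional velocity is simply $x_1 - x_0$ (`v_model` stands in for $f_\theta$):

```python
import torch

def cfm_loss(v_model, x1):
    """Conditional flow matching with the linear path x_t = (1-t) x0 + t x1."""
    x0 = torch.randn_like(x1)                              # noise endpoint (t = 0)
    t = torch.rand(x1.shape[0], device=x1.device)          # uniform t in [0, 1]
    tb = t.view(-1, *([1] * (x1.dim() - 1)))               # reshape t for broadcasting
    x_t = (1 - tb) * x0 + tb * x1                          # point on the straight path
    target_v = x1 - x0                                     # true conditional velocity
    return ((v_model(x_t, t) - target_v) ** 2).mean()      # E || f_theta(x_t, t) - v_t ||^2
```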
In summary, score matching trains a model that predicts the noise or the score of a diffusion process, while flow matching trains a vector-field model describing the data-transport ODE. FM can be seen as a generalization of diffusion models: with a diffusion-style Gaussian probability path, an FM model can reproduce the effect of score matching, but training is usually more stable.
Also note the difference in time direction: score-based models conventionally add noise to clean data (t=0) until it becomes pure noise (t=1) and reverse this at generation time, whereas FM models mostly run from noise to data (0→1), generating data by integrating the ODE forward.
Implementation Differences
- **Model structure and dimensionality**: score matching models operate in data space and need no invertibility, whereas flow matching is usually implemented as a continuous normalizing flow (CNF), whose ODE flow map is invertible by construction and requires input and output dimensions to match. This makes FM most natural on continuous data, while SM extends more flexibly to latent spaces and even discrete data.
- **Training stability**: both use a simple MSE loss and require no simulation of the forward/reverse process. In practice FM is often considered more stable than score-based diffusion, especially when a diffusion path is chosen for flow-matching training.
- **Optimization and cost**: both SM and FM typically need only one forward and one backward pass per training step, so per-iteration costs are comparable. Classical maximum-likelihood CNF training requires expensive ODE solves, which FM avoids; SM needs no ODE solver either, so the two train at similar cost, with FM being the more efficient route to a CNF.
- **Generation speed**: the biggest difference is sampling speed. Score-based models (e.g., DDPM) need hundreds to thousands of reverse-inference steps, while FM usually needs only a few ODE integration steps. The rectified flow of Liu et al. generates high-quality images with a single Euler step, a clear efficiency advantage for FM.
- **Efficiency on high-dimensional data**: on high-dimensional data such as images, diffusion models need many noise-update steps, whereas flow matching (especially with a well-designed optimal-transport path) follows nearly straight, shorter paths and greatly reduces the number of neural-network function evaluations (NFEs).
Can Flow Matching and Score Matching Be Combined?
Yes, and there are already theoretical and practical directions for doing so. Many papers point out that when flow matching uses Gaussian paths, it is mathematically equivalent to diffusion models. FM and SM parameterize the same generative process in different frameworks, so each side's techniques carry over: for example, one can train flow matching on a diffusion path and then sample with a deterministic ODE or a stochastic SDE.
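One way to make the equivalence concrete: for a Gaussian path $x_t = \alpha_t x_1 + \sigma_t \epsilon$, the marginal velocity field and the score are tied by a standard identity (quoted here for reference, with dots denoting time derivatives):
\[v_t(x) = \frac{\dot{\alpha}_t}{\alpha_t}\,x \;-\; \sigma_t\left(\dot{\sigma}_t - \frac{\dot{\alpha}_t}{\alpha_t}\,\sigma_t\right)\nabla_x \log p_t(x)\]
so a trained score model induces a velocity field and vice versa.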
Moreover, rectified flow is one explicit way of recasting diffusion in the flow-matching framework. These works show that FM and SM are not competing methods but complementary, interconvertible frameworks.
Can Flow Matching Be Applied in a VAE's Latent Space?
Yes. Because flow matching on raw data must handle high-dimensional inputs, many works apply it instead in a low-dimensional latent space, especially in combination with a VAE.
For example, Dao et al. (2023) propose "Flow Matching in Latent Space": first train a VAE to encode the data into a latent space, then run FM generative modeling in that latent space. This substantially reduces compute cost while maintaining high-quality image generation.
Similarly, Latent-CFM by Samaddar et al. (2025) runs flow matching on latent features extracted by pretrained models. They report that training FM in latent space not only speeds up training but also improves sample quality. These latent flow models can also incorporate conditioning information (class labels, inpainting, semantic labels) for conditional generation.
Diffusion models have analogous latent variants, e.g., NVIDIA's Latent Score-based Generative Model (LSGM), which trains score matching in a VAE latent space to accelerate generation. All of this work shows that applying flow matching in a latent space is a feasible and effective strategy.
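A minimal sketch of that latent-space recipe, reusing the `cfm_loss` sketch above (the `vae.encode`/`vae.decode` calls are illustrative assumptions, not the exact API of any cited work):

```python
import torch

def latent_fm_step(v_model, vae, x, optimizer):
    """One flow-matching training step in the latent space of a frozen VAE."""
    with torch.no_grad():
        z1 = vae.encode(x)              # data endpoint lives in the lower-dim latent space
    loss = cfm_loss(v_model, z1)        # same FM objective as before, just on latents
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Sampling: integrate the latent ODE from z ~ N(0, I) up to t=1, then decode: x = vae.decode(z).
```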
Applications and Examples of Combined Models
Both flow matching and score matching have been applied to a wide range of generative and downstream tasks, including:
- **Unconditional and conditional generation**: both methods achieve high-quality generation of images, audio, and other data. For example, FM-based models by Lipman et al. reached SOTA results on ImageNet, and Dao's latent FM models support high-resolution conditional generation (e.g., inpainting). SM models such as latent diffusion power text-to-image generation (e.g., Stable Diffusion).
- **Representation learning**: the latent space learned by a VAE can be combined with FM; training a flow in that latent space yields useful data representations for downstream tasks such as classification, generation, and data manipulation.
- **Semi-supervised learning**: work using flow matching for semi-supervised learning is still scarce, but earlier flow-based models (e.g., FlowGMM) modeled the joint distribution of data and labels, and could plausibly be extended with FM.
- **Physics, language, and 3D data**: FM has been extended to video generation, molecular conformer prediction, discrete-data modeling, and more. For example, Polyak et al. apply FM in foundation video models, and Hassan et al. apply it to molecular generation.
- **Models combining VAEs and flow matching**: besides the latent FM above, examples include Variational Rectified Flow Matching (Guo & Schwing, 2025), which combines a latent mixture with rectified flow, and FlowLLM (Sriram et al., 2024), which fine-tunes LLM outputs with flow matching. All of these show that FM combines flexibly with latent-variable models to form efficient generators.
From 8-Gaussian to moons.
![[Pasted image 20250522235631.png]]
![[Pasted image 20250515212250.png]]
Flow process:
- **Physical meaning**: move the noise until it becomes an image.
- Training:
	- Input: scaled data mixed with noise (the interpolant $x_t$), and time $t$
- Output: vector field
- Sampling: start from noise and integrate the ODE with Euler steps: $x_n = x_{n-1} + \Delta t \cdot v_\theta(x_{n-1}, t_{n-1})$
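A minimal Euler-integration sampler for the learned vector field (reusing the illustrative `v_model` from the training sketch above; the step count is arbitrary):

```python
import torch

@torch.no_grad()
def sample_flow(v_model, shape, n_steps=50, device="cpu"):
    """Integrate dx/dt = v_theta(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = torch.randn(shape, device=device)                   # x_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt, device=device)  # current time, one value per batch item
        x = x + dt * v_model(x, t)                          # Euler update: x <- x + dt * v_theta(x, t)
    return x
```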
Diffusion process:
- **Physical meaning**: denoise the noise step by step until it becomes an image.
- Training:
	- Input: scaled data plus noise (the noised sample $x_t$), and time $t$
- Output: noise $\epsilon_{\theta}$
- Sampling:
	- Start from noise $\mathcal{N}(0, I)$
	- $x_n = k \, x_{n-1} - (1-k) \, \epsilon_{\theta}$ (schematic; the exact coefficients come from the noise schedule, see the update below)
For each timestep $t$, you use the reverse process:
\[x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \cdot \epsilon_\theta(x_t, t) \right) + \sigma_t z\]
Where:
- $\epsilon_\theta(x_t, t)$: predicted noise
- $\alpha_t$, $\bar{\alpha}_t$: noise schedule values
- $\sigma_t$: standard deviation of the noise added at step $t$
- $z \sim \mathcal{N}(0, I)$: noise (used for stochasticity)
🧾 DDPM Sampling Algorithm (Step-by-Step)
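A minimal sketch of the standard DDPM ancestral sampling loop implementing the update above (the schedule tensors `alpha`, `alpha_bar`, `sigma` are assumed precomputed, 1-D of length T):

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, alpha, alpha_bar, sigma, device="cpu"):
    """Ancestral DDPM sampling: start from pure noise and denoise for T steps."""
    T = alpha.shape[0]
    x = torch.randn(shape, device=device)                          # x_T ~ N(0, I)
    for t in reversed(range(T)):
        tb = torch.full((shape[0],), t, device=device)
        eps = eps_model(x, tb)                                     # predicted noise eps_theta(x_t, t)
        mean = (x - (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise on the final step
        x = mean + sigma[t] * z                                    # sample x_{t-1} from the reverse kernel
    return x
```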
Combine Flow Matching and Score Matching
Consider $g(t) = 0$: the SDE degenerates into an ODE. Is this the first-order flow?
![[Pasted image 20250530185251.png]]
Now add the second-order stochastic score term:
![[Pasted image 20250530185557.png]]
This is the objective function. Essentially it trains two neural networks (velocity and score)? ![[Pasted image 20250530190205.png]]
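For reference, the usual way to write this combination (a standard identity, not taken from the figures above): there is a whole family of SDEs sharing the same marginals $p_t$ as the flow ODE,
\[dX_t = \Big[u_t(X_t) + \tfrac{1}{2}g(t)^2\,\nabla_x \log p_t(X_t)\Big]\,dt + g(t)\,dW_t\]
Setting $g(t) = 0$ recovers the deterministic flow ODE driven by $u_t$ alone, while $g(t) > 0$ adds a Langevin-style correction driven by the score, which is why the objective trains both a velocity network and a score network.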
Takeaway
Diffusion process:
- The physical picture is an end-to-end stochastic SDE.
- Via the corresponding Fokker-Planck equation one obtains $p_t(x)$ at every time, together with the forward/reverse dynamics.
- The deterministic score function (a vector field) $\nabla_{x} \log p_t(x)$ plays the key role: the forward path is driven by adding noise, and the score function is the guide along the reverse path.
- Training a neural network to approximate the score function: score matching!
- Sampling: draw random samples from the initial distribution, then Langevin dynamics plus the denoiser (score function) produce samples.
- Likelihood: still has to be computed via the probability-flow ODE derived from the Fokker-Planck equation.
Flow process:
- The physical picture is a deterministic ODE with a random initial condition/distribution.
- The causal order is: vector field $u_t$ + initial condition → trajectory $X_t$ → flow $\phi_t(x_0)$ → distribution $p_t(x)$.
- The flow equation is a very simple ODE: $X_0 = x_0,\ \frac{d X_t}{dt} = u_t(X_t)$, where each trajectory $X_t$ corresponds to a different initial condition $x_0$.
- Collecting the trajectories over all initial conditions gives the flow $\phi_t(x_0)$, which of course also satisfies $\frac{d \phi_t(x_0)}{dt} = u_t(\phi_t(x_0))$.
- Training a neural network to approximate the vector field $u_t$ (the time derivative of the flow): flow matching!
- Sampling: draw random samples from the initial distribution, then the flow ODE produces samples.
- Likelihood: essentially computed through the reversed flow, as made precise below.
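Concretely, by the instantaneous change-of-variables formula, the log-likelihood accumulates the negative divergence of the vector field along the trajectory connecting $x_0$ and $x_1$:
\[\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \nabla \cdot u_t(X_t)\,dt\]
which is evaluated by solving the ODE backward from a data point together with the divergence integral.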
Flow matching is simulation-free (no ODE solver) in training, but needs an ODE solver in sampling! Consistency models are simulation-free in sampling!
Flow matching vs. Score Matching
I made a comparison with ChatGPT; the full table is reproduced in Appendix D below.
Sampling for both score matching and flow matching:
- Score matching: annealed Langevin dynamics driven by the score function (the denoiser)
- Flow matching: integrate the learned vector field (not the flow map itself) to obtain samples
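A minimal sketch of the annealed Langevin sampler mentioned above (the `score_model(x, sigma)` network and the decreasing noise-level schedule `sigmas` are illustrative assumptions):

```python
import torch

@torch.no_grad()
def annealed_langevin(score_model, shape, sigmas, n_steps=100, eps=2e-5, device="cpu"):
    """Annealed Langevin dynamics: Langevin updates at a decreasing sequence of noise levels."""
    x = torch.randn(shape, device=device) * sigmas[0]       # start at the largest noise level
    for sigma in sigmas:                                    # anneal sigma from large to small
        step = eps * (sigma / sigmas[-1]) ** 2              # per-level step size
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = x + 0.5 * step * score_model(x, sigma) + (step ** 0.5) * z
    return x
```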
Summary:
- Score Matching is more robust in complex distributions and provides high sample quality, but can be computationally heavy.
- Flow Matching is efficient and deterministic, but can suffer from poor interpolation unless guided or regularized carefully.
Appendix D: Flow Matching vs. Score Matching
What’s the issue?
In flow matching, you’re learning a vector field $u_t(x)$ such that the flow transforms samples from a base distribution $p_0$ to a target distribution $q$, by matching trajectories defined by conditional fields (e.g., between $x_0 \sim p_0$ and $x_1 \sim q$).
But during training:
- You sample independent pairs $(x_0, x_1)$ from $p_0(x_0) \times q(x_1)$.
- The vector field $u_t(x)$ is trained to point from $x_0$ to $x_1$ (linearly or via some transport plan).
This means:
You’re interpolating between randomly paired samples, not samples that lie on a meaningful trajectory between modes.
Why is this a problem?
- **Mismatch in Semantics**: if $x_0$ and $x_1$ lie in different modes (e.g., digits "1" and "8"), the interpolated vector field may not follow a realistic path; it may go through "garbage" regions.
- **Conflicting Training Signals**: in high dimensions, many training pairs induce vector directions that contradict each other in overlapping regions of space.
- **Generalization in Between**: the learned field might look good at data points but behave poorly between them; i.e., the model has no guarantee of generating valid intermediate samples during inference.
Contrast with Score-based Models
Score-based models (e.g., diffusion models) estimate the score $\nabla_x \log p_t(x)$, which is locally defined and smooth, and avoid the “random pairing” problem — they only rely on perturbing data and denoising, not interpolating arbitrary sample pairs.
Possible Fixes or Improvements
- **Conditional Flow Matching (CFM)**: instead of random pairing, use conditional transport or kernel matching to align similar samples.
- **Use trajectories from known dynamics**: if you have access to optimal transport maps (e.g., displacement interpolation), training becomes much more stable; a minibatch version of this idea is sketched after this list.
- **Guidance during interpolation**: add constraints or learned alignment that discourage implausible transitions.
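A minimal sketch of the minibatch optimal-transport pairing idea (in the spirit of OT-CFM-style methods; the function name and setup are illustrative): re-pair each noise sample with a nearby data sample before interpolating, by solving an exact assignment on the batch cost matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_pair(x0, x1):
    """Re-pair a noise batch x0 with a data batch x1 via minibatch optimal transport.

    x0, x1: arrays of shape (B, D). Returns x1 reordered so that the pairs
    (x0[i], x1_reordered[i]) minimize the total squared distance, giving
    straighter, less conflicting training paths for flow matching.
    """
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)  # (B, B) pairwise squared distances
    rows, cols = linear_sum_assignment(cost)                 # exact OT assignment on the minibatch
    return x1[cols[np.argsort(rows)]]                        # reorder x1 to align with x0
```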
Here’s a table comparing Score Matching (used in diffusion models) and Flow Matching (used in flow-based models):
| Aspect | Score Matching (e.g., Diffusion Models) | Flow Matching |
|---|---|---|
| Training Objective | Match score ∇ₓ log pₜ(x) using denoising or noise-perturbed samples. | Match vector field uₜ(x) using interpolated paths between base and target. |
| Training Data Pairs | Uses perturbations of single samples x ∼ q(x). | Uses sample pairs (x₀, x₁) ∼ p₀ × q. |
| Stochasticity | Involves stochastic reverse-time SDE or ODE. | Can be trained and used deterministically (via ODE). |
| Interpretability | Score function is less interpretable. | Vector field can be directly visualized as transport directions. |
| Sample Quality | High quality, especially for complex distributions. | May suffer in multi-modal or complex geometries due to poor interpolation. |
| Interpolation Issues | None — doesn’t interpolate between mismatched samples. | Yes — interpolates between randomly sampled pairs, which may be semantically unrelated. |
| Manifold Respect | ✅ Yes — local noise keeps dynamics near the data manifold. | ❌ No — interpolations may go off-manifold, especially in high dimensions. |
| Simulation-Free | Yes: noisy training samples $x_t$ are available in closed form, with no trajectory simulation. | Yes: training can be done without simulation of trajectories. |
| Training Stability | Stable and well-understood, but computationally intensive. | Often more efficient, but may suffer from spurious vector fields. |
| Mode Coverage | Good mode coverage due to noise-based training. | Can miss modes if interpolation crosses low-density areas. |
| Theoretical Guarantees | Strong links to Fokker-Planck, optimal denoising, and score estimation. | Tied to optimal transport and flow ODEs. |
| Applications | Denoising diffusion, image synthesis, audio generation. | Neural ODEs, flow-based generative modeling, fast training. |
🌐 Manifold Assumption Recap
The manifold hypothesis states that high-dimensional data (e.g., images) actually lies on or near a low-dimensional manifold embedded in the ambient space (e.g., ℝⁿ).
🌀 Score Matching (Diffusion Models)
- **Preserves the manifold assumption** (at least during inference):
- The forward diffusion process adds noise, pushing data off the manifold.
- But during reverse-time generation, the score function learns to bring samples back toward the data manifold.
- The model gradually denoises toward realistic-looking images on or near the original manifold.
- At each denoising step, small updates guided by the score field respect local structure.
✅ Bottom line: Score matching implicitly respects the data manifold by learning how to denoise back to it.
🔁 Flow Matching
- **Potentially breaks the manifold**:
- Flow matching defines vector fields between a base distribution (e.g., Gaussian) and the data distribution, often through interpolated paths.
- These interpolations (e.g., linear interpolations in pixel space between two image samples) often leave the manifold — they can pass through unrealistic regions that are far from the true data manifold.
- The resulting vector field pushes probability mass through low-density regions, which is unnatural for image data.
🚨 Problem: If you interpolate between two real images of, say, a cat and a car, the middle of the interpolation may not correspond to any real image-like structure — it’s off-manifold.
🧠 Summary
| Aspect | Score Matching | Flow Matching |
|---|---|---|
| Manifold Respect | ✅ Mostly stays on/near the data manifold | ❌ May generate off-manifold trajectories |
| Interpolation | Implicit via small local noise steps | Explicit interpolation between possibly unrelated samples |
| Risk | Lower — local denoising follows manifold | Higher — global flows may cross unrealistic regions |
🧩 Final Thought
Flow matching can respect the manifold, but only if interpolation is carefully chosen. Some recent work uses manifold-aware interpolations (e.g., via encoders or geodesic paths in latent space) to mitigate this issue.