![[Pasted image 20250612202606.png]] Can we solve this problem directly with a mixed flow + diffusion approach?
The key is the proportion: the magnitudes of u(x, t) and D(t) can be chosen freely, as long as conservation of probability (the Fokker-Planck equation) is satisfied!
Fokker-Planck Partial Differential Equation
In the diffusion process or flow method of generative AI, the conserved quantity is probability (the probabilities sum to 1 at every time), $\Phi = p(x, t)$, and there is no source, $S=0$. Ordinary physical diffusion goes from high concentration to low concentration, whereas here probability diffuses from low probability toward high probability; this can be viewed equivalently as a negative diffusion constant, $D = -\Gamma$. Assuming isotropic diffusion, $D(t)$ is independent of position but may depend on time (noise scheduling). The equation is usually written as: \(\frac{\partial p(x,t)}{\partial t}=-\nabla \cdot[\mathbf{u}(x,t)\, p(x,t)]+\frac{g^2(t)}{2}\nabla \cdot \nabla p(x,t)\)
Differential SDE/ODE forms
Forward SDE: for training. \(d \mathbf{x}_t={\mathbf{u}}(\mathbf{x}_t, t)\, d t+g(t)\, d \mathbf{w}_t,\quad \text{ with } d \mathbf{w}_t \sim N(0, d t)\)
Reverse SDE: for sampling. In the equation below, $dt$ is a negative infinitesimal time step. \(d \mathbf{x}_t=[{\mathbf{u}}(\mathbf{x}_t, t)-g^2(t)\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)]\, d t+g(t)\, d \mathbf{w}_t,\quad \text{ with } d \mathbf{w}_t \sim N(0, d t)\)
Equivalent probability flow ODE (derived from the Fokker-Planck equation, with the same marginals $p_t$): \(d \mathbf{x}_t=[{\mathbf{u}}(\mathbf{x}_t, t)-\frac{1}{2} g^2(t)\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)]\, d t\)
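To make the correspondence concrete, here is a minimal numerical sketch (a toy 1-D setup chosen for illustration, not taken from the text): with zero drift $u=0$, constant $g$, and a Gaussian initial distribution, the score of $p_t$ is known in closed form, so the forward SDE (Euler-Maruyama) and the probability flow ODE (plain Euler) can both be simulated and should end with the same marginal variance.

```python
import numpy as np

# Toy 1-D setup (an assumption for illustration): zero drift u = 0, constant g,
# data x_0 ~ N(0, s0^2).  Then p_t = N(0, s0^2 + g^2 t) and the score is known
# in closed form, so both dynamics can be simulated directly.
s0, g, T, n_steps, n_samples = 0.5, 1.0, 1.0, 1000, 10_000
dt = T / n_steps
rng = np.random.default_rng(0)

def score(x, t):                      # score of p_t = N(0, s0^2 + g^2 t)
    return -x / (s0**2 + g**2 * t)

x_sde = rng.normal(0.0, s0, n_samples)   # forward SDE: dx = g dW (u = 0)
x_ode = x_sde.copy()                     # probability flow ODE, same start
for k in range(n_steps):
    t = k * dt
    x_sde += g * np.sqrt(dt) * rng.normal(size=n_samples)   # Euler-Maruyama step
    x_ode += -0.5 * g**2 * score(x_ode, t) * dt              # Euler step of the ODE

# Both should end with (approximately) the same marginal variance s0^2 + g^2 T.
print(x_sde.var(), x_ode.var(), s0**2 + g**2 * T)
```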
Theoretical Comparison of Flow Matching and Score Matching
Flow matching (FM) and score matching (SM) are two generative-modeling approaches that are theoretically closely related but have different objectives. In score-based (diffusion) models, we learn a score function (the gradient of the log probability density), or equivalently a noise predictor, by minimizing a denoising mean-squared-error (MSE) loss. For example, in DDPM-type models the objective is to minimize:
\[\mathbb{E}[\|\epsilon - \epsilon_\theta(x_t,t)\|^2]\] to predict the Gaussian noise that was added.
In contrast, flow matching regresses directly onto the velocity field $v(x,t)$ of a predefined transport path. Concretely, FM defines a family of conditional distributions $p_t(x\mid x_1)$ from noise to data (e.g., Gaussian interpolation), derives the true velocity field implied by those conditional distributions, and trains a network $f_\theta(x,t)$ to fit it. The FM loss is:
\[\mathcal{L}_{FM} = \mathbb{E}_{t,x\sim p_t}[\,\|f_\theta(x,t) - v(x,t)\|^2\,]\] which allows $f_\theta$ to fit the true flow accurately.
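As a concrete illustration of the two objectives above, here is a minimal PyTorch sketch; the tiny MLPs, the exponential $\bar{\alpha}(t)$ schedule, and the linear FM path are illustrative assumptions rather than the exact recipe of any particular paper.

```python
import torch

# Hedged sketch of the two training objectives (toy 2-D data, assumed setup).
net_eps = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2))
net_v   = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2))

def ddpm_style_loss(x1, alpha_bar):
    """Score matching via noise prediction: E || eps - eps_theta(x_t, t) ||^2."""
    t = torch.rand(x1.shape[0], 1)
    a = alpha_bar(t)                                   # assumed schedule in (0, 1]
    eps = torch.randn_like(x1)
    x_t = a.sqrt() * x1 + (1 - a).sqrt() * eps
    return ((eps - net_eps(torch.cat([x_t, t], dim=1))) ** 2).mean()

def flow_matching_loss(x0, x1):
    """Conditional FM on a linear path x_t = (1-t) x0 + t x1, target v = x1 - x0."""
    t = torch.rand(x1.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1
    return ((net_v(torch.cat([x_t, t], dim=1)) - (x1 - x0)) ** 2).mean()

x1 = torch.randn(128, 2)                               # stand-in "data" batch
x0 = torch.randn(128, 2)                               # base (noise) samples
loss = ddpm_style_loss(x1, lambda t: torch.exp(-4 * t)) + flow_matching_loss(x0, x1)
loss.backward()
```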
In summary, score matching trains a model that predicts the noise or the score along a diffusion process, whereas flow matching trains a vector-field model that describes the data-transport ODE. FM can be viewed as a generalization of diffusion models: with a diffusion-style Gaussian probability path, an FM model can reproduce the effect of score matching, but training is often more stable.
Note also the difference in time direction: score-based models usually go from clean data (t=0) by gradually adding noise up to pure noise (t=1), and reverse that direction at generation time, whereas FM models mostly run from noise to data (0→1) and generate data by integrating the ODE forward.
Implementation Differences
- Model structure and dimensionality: score matching models operate in data space and do not require invertibility, whereas flow matching is usually implemented as a continuous normalizing flow (CNF), so the model must be invertible with matching input and output dimensions. This makes FM more naturally suited to continuous data, while SM extends more flexibly to latent spaces and even discrete data.
- Training stability: both use a simple MSE loss and require no simulation of the forward/reverse process during training. In practice, FM is often considered more stable than score-based diffusion models, especially when a diffusion path is used for flow matching training.
- Optimization and cost: each training step of SM and FM typically needs only one forward and one backward pass, so the per-iteration cost is comparable. Traditional maximum-likelihood CNF training requires expensive ODE solves, which FM avoids; SM also needs no ODE solver, so training costs are similar, with FM somewhat more efficient.
- Generation speed: the biggest difference is sampling speed. Score-based models (e.g., DDPM) need hundreds or thousands of reverse-inference steps, while FM usually needs only a few ODE integration steps. The rectified flow of Liu et al. can generate high-quality images with a single Euler step, so FM's efficiency advantage is clear.
- Efficiency on high-dimensional data: for high-dimensional data such as images, diffusion models need many noise-update steps, whereas flow matching (especially with a well-designed optimal-transport path) can use straighter, shorter paths and thus greatly reduce the number of neural-network function evaluations (NFEs).
Can Flow Matching and Score Matching Be Combined?
Yes, and there are already theoretical and practical directions for combining them. Many papers point out that when flow matching uses Gaussian paths, it is mathematically equivalent to a diffusion model. FM and SM are really parameterizations of the same generative process under different frameworks, so techniques can be borrowed across them; for example, one can train flow matching on a diffusion path and then sample with either a deterministic ODE or a stochastic SDE.
In addition, rectified flow is one explicit way of recasting diffusion in the flow matching framework. These works show that FM and SM are not competitors but complementary, inter-convertible frameworks.
Can Flow Matching Be Applied in a VAE Latent Space?
Yes. Because flow matching has to handle high-dimensional inputs, much work instead applies it in a low-dimensional latent space, especially in combination with a VAE.
For example, Dao et al. (2023) propose "Flow Matching in Latent Space": first train a VAE to encode the data into a latent space, then do generative modeling with FM in that latent space. This approach significantly reduces computational cost while maintaining high-quality image generation.
Similarly, Latent-CFM by Samaddar et al. (2025) performs flow matching on latent features extracted by a pretrained model. They report that training FM in latent space not only speeds up training but also improves sample quality. These latent flow models can also be combined with conditioning information (e.g., class labels, inpainting masks, semantic labels) for conditional generation.
A similar latent variant exists for diffusion models, for example NVIDIA's Latent Score-based Generative Model (LSGM), which trains score matching in a VAE latent space to speed up generation. All of these works show that applying flow matching in a latent space is a feasible and effective strategy.
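A rough sketch of the latent flow matching recipe described above, assuming a pretrained, frozen VAE with `encode`/`decode` methods (the `TinyVAE` interface here is hypothetical): FM is trained entirely in the latent space, and samples are decoded back to data space at the end.

```python
import torch

# Hypothetical pretrained VAE interface (an assumption, for illustration only).
class TinyVAE(torch.nn.Module):
    def __init__(self, d_x=784, d_z=16):
        super().__init__()
        self.enc = torch.nn.Linear(d_x, d_z)
        self.dec = torch.nn.Linear(d_z, d_x)
    def encode(self, x): return self.enc(x)
    def decode(self, z): return self.dec(z)

vae = TinyVAE()                       # in practice: load a pretrained VAE and freeze it
for p in vae.parameters():
    p.requires_grad_(False)

d_z = 16
v_net = torch.nn.Sequential(torch.nn.Linear(d_z + 1, 128), torch.nn.SiLU(),
                            torch.nn.Linear(128, d_z))
opt = torch.optim.Adam(v_net.parameters(), 1e-3)

x = torch.randn(256, 784)             # stand-in data batch
z1 = vae.encode(x)                    # data mapped into latent space
z0 = torch.randn_like(z1)             # latent noise
t = torch.rand(z1.shape[0], 1)
z_t = (1 - t) * z0 + t * z1           # linear FM path in latent space
opt.zero_grad()
loss = ((v_net(torch.cat([z_t, t], 1)) - (z1 - z0)) ** 2).mean()
loss.backward(); opt.step()

# Sampling: integrate the latent ODE from noise, then decode.
with torch.no_grad():
    z, n = torch.randn(16, d_z), 50
    for k in range(n):
        tk = torch.full((z.shape[0], 1), k / n)
        z = z + v_net(torch.cat([z, tk], 1)) / n
    samples = vae.decode(z)
```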
Applications and Examples of Combined Models
Flow matching and score matching have both been applied to a variety of generative and downstream tasks, including:
- Unconditional and conditional generation: both methods achieve high-quality generation of images, audio, and other data. For example, Lipman et al.'s FM-based models reach SOTA performance on ImageNet, and Dao's latent FM models also support high-resolution conditional generation such as inpainting. SM models such as latent diffusion are used for text-to-image generation (e.g., Stable Diffusion).
- Representation learning: the latent space learned by a VAE can be combined with FM; training a flow in that latent space yields useful data representations for downstream tasks such as classification, generation, and data manipulation.
- Semi-supervised learning: there is still little work using flow matching for semi-supervised learning, but earlier flow-based models (e.g., FlowGMM) have modeled the joint distribution of data and labels, which could be extended with FM.
- Physics, language, and 3D data: FM has been extended to video generation, molecular-conformation prediction, discrete-data modeling, and more. For example, Polyak et al. use FM in foundation video models, and Hassan et al. apply it to molecular generation.
- Models combining VAEs and flow matching: besides the latent FM work above, there are Variational Rectified Flow Matching (Guo & Schwing, 2025), which combines a latent mixture with rectified flow, and FlowLLM (Sriram et al., 2024), which uses flow matching to fine-tune LLM outputs. These show that FM can be flexibly combined with latent-variable models to build efficient generators.
Let us look at the reverse SDE: it contains flow and score terms, plus the random walk dw.
![[Pasted image 20250201233020.png]]
From 8-Gaussian to moons.
![[Pasted image 20250522235631.png]]
![[Pasted image 20250515212250.png]]
Fuse Method 1:
- $t \in [0, 1]$
- $\text{Noise}_{FM} \sim N(0, I)$
- $\text{Noise}_{Diff} \sim N(0, I) \times \text{scale}$, where scale = 0.1
- Dataset
Training flow matching: input $t$, $\text{Noise}_{FM}$, and $\text{Dataset} + \text{Noise}_{Diff}$; target $u_t(x\mid z) = (\text{Dataset} + \text{Noise}_{Diff}) - \text{Noise}_{FM}$.
Training diffusion (denoiser): only when $0.9 < t < 1$; input $\text{Dataset} + \text{Noise}_{Diff}$, output $\text{Noise}_{Diff}$; the neural network is used to predict the score function.
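A minimal sketch of how Fuse Method 1 might look in code; only the noise scales, the targets, and the $t > 0.9$ gating are taken from the description above, while the linear FM path, the tiny networks, and the exact way the gate is applied are assumptions.

```python
import torch

scale = 0.1                                   # Noise_Diff scale from the text
flow_net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2))
denoiser = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2))
opt = torch.optim.Adam(list(flow_net.parameters()) + list(denoiser.parameters()), 1e-3)

data = torch.randn(512, 2)                    # stand-in for the dataset
noise_fm   = torch.randn_like(data)           # Noise_FM ~ N(0, I)
noise_diff = scale * torch.randn_like(data)   # Noise_Diff ~ N(0, I) * scale
x1 = data + noise_diff                        # slightly-noised data endpoint

# Flow matching part: linear path from Noise_FM (t=0) to x1 (t=1),
# target velocity u_t(x|z) = (Dataset + Noise_Diff) - Noise_FM.
t = torch.rand(data.shape[0], 1)
x_t = (1 - t) * noise_fm + t * x1
fm_loss = ((flow_net(torch.cat([x_t, t], 1)) - (x1 - noise_fm)) ** 2).mean()

# Diffusion (denoiser) part: only near the data end, 0.9 < t < 1,
# predict Noise_Diff from Dataset + Noise_Diff.
mask = (t > 0.9).float()
diff_loss = (mask * (denoiser(x1) - noise_diff) ** 2).mean()

opt.zero_grad()
(fm_loss + diff_loss).backward()
opt.step()
```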
Flow matching part:
Flow process:
- Physical meaning: move the noise until it becomes an image.
- Training:
- Input: scale input + noise, t
- Output: vector field
- Sampling: $x_n = x_{n-1} + \Delta t \, v_\theta(x_{n-1}, t_{n-1})$ (Euler step starting from noise; a code sketch follows below)
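The code block that belonged here did not survive; below is a hedged sketch of the flow-matching sampling loop described above (Euler integration of a trained vector-field network `v_net`, whose `(x, t)` input convention is an assumption).

```python
import torch

@torch.no_grad()
def sample_flow(v_net, n_samples=64, dim=2, n_steps=100):
    """Euler integration of the learned velocity field from noise (t=0) to data (t=1)."""
    x = torch.randn(n_samples, dim)                        # start from N(0, I)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((n_samples, 1), k * dt)
        x = x + dt * v_net(torch.cat([x, t], dim=1))       # x_n = x_{n-1} + dt * v_theta
    return x                                               # approximate data samples
```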
Diffusion process: physical meaning: denoising turns noise into an image.
- Training:
- Input: scale input + noise, t
- Output: noise $\epsilon_{\theta}$
- Sampling:
- Start from noise N(0, I)
- $x_n = k x_{n-1} - (1-k) \epsilon_{\theta}$
For each timestep $t$, you use the reverse process:
\[x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \cdot \epsilon_\theta(x_t, t) \right) + \sigma_t z\]Where:
- $\epsilon_\theta(x_t, t)$: predicted noise
- $\alpha_t$, $\bar{\alpha}_t$: noise schedule values
- $\sigma_t$: standard deviation of the noise added at step $t$
- $z \sim \mathcal{N}(0, I)$: noise (used for stochasticity)
🧾 DDPM Sampling Algorithm (Step-by-Step)
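The original code block here was lost; the following is a minimal reconstruction of the standard DDPM ancestral sampling loop implied by the update rule above, with an assumed linear $\beta$ schedule, $\sigma_t = \sqrt{\beta_t}$, and a toy `eps_net` interface.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_net, betas, shape=(64, 2)):
    """Standard DDPM ancestral sampling using the reverse update shown above."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_vec = torch.full((shape[0], 1), float(t))
        eps = eps_net(torch.cat([x, t_vec], dim=1))     # predicted noise eps_theta(x_t, t)
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z                  # sigma_t = sqrt(beta_t) choice
    return x

# Example usage with a toy noise predictor and an assumed linear beta schedule.
eps_net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2))
betas = torch.linspace(1e-4, 0.02, 100)
samples = ddpm_sample(eps_net, betas)
```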
Combine Flow Matching and Score Matching
Consider g(t) = 0: the SDE degenerates into an ODE. Is this the first-order flow?
![[Pasted image 20250530185251.png]]
Then add the second-order stochastic score term.
![[Pasted image 20250530185557.png]]
This is the objective function. Essentially two neural networks? ![[Pasted image 20250530190205.png]]
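A hedged sketch of what such a combined sampler could look like: it integrates the learned velocity field toward the data and adds a stochastic score correction. The specific form $dx = [v + \tfrac{\varepsilon^2}{2}\, s]\,dt + \varepsilon\, dW$ (which keeps the same marginals as the flow ODE) is a standard choice assumed here, not necessarily the exact formulation in the slides above.

```python
import torch

@torch.no_grad()
def sample_flow_plus_score(v_net, s_net, eps=0.5, n_samples=64, dim=2, n_steps=200):
    """Stochastic sampler mixing a learned velocity field and a learned score.

    Integrates dx = [v(x,t) + (eps^2 / 2) * score(x,t)] dt + eps * dW from t=0 (noise)
    to t=1 (data); with eps = 0 this reduces to the deterministic flow ODE.
    """
    x = torch.randn(n_samples, dim)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((n_samples, 1), k * dt)
        drift = v_net(torch.cat([x, t], 1)) + 0.5 * eps**2 * s_net(torch.cat([x, t], 1))
        x = x + drift * dt + eps * (dt ** 0.5) * torch.randn_like(x)
    return x
```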
Takeaway
Diffusion process:
- Physical meaning/phenomenon: a stochastic SDE from beginning to end.
- Through the corresponding Fokker-Planck equation we can obtain $p_t(x)$ at every time, together with the forward/reverse dynamics.
- The deterministic score function (a vector field) $\nabla_{x} \log p_t(x)$ plays the key role: the forward path is driven by adding noise, while on the reverse path the score function is the guide.
- Train a neural network to approximate the score function: score matching!
- Sampling: draw random samples from the initial distribution, then run Langevin dynamics with the denoiser (score function) to produce samples.
- Likelihood: still has to be computed via the Fokker-Planck (probability flow) ODE.
Flow process:
- Physical meaning/phenomenon: a deterministic ODE with a random initial condition/distribution.
- The physical order is: vector field $u_t$ + initial condition → trajectory $X_t$ → flow $\phi_t(x_t)$ → distribution $p_t(x_t)$.
- The flow equation is a very simple ODE: $X_0 = x_0, \frac{d X_t}{dt} = u_t(X_t)$, where each trajectory $X_t$ corresponds to a different initial condition $x_0$.
- Collecting the trajectories of all initial conditions gives the flow $\phi_t(X_t)$, which of course also satisfies $\frac{d \phi_t(X_t)}{dt} = u_t(\phi_t(X_t))$.
- Train a neural network to approximate the vector field $u_t$ (the time derivative of the flow): flow matching!
- Sampling: draw a random sample from the initial distribution, then integrate the flow ODE to produce samples.
- Likelihood: can essentially be computed by running the flow in reverse (see the change-of-variables formula below).
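Concretely, the likelihood computation through the reversed flow is the continuous change-of-variables formula used for CNFs:
\[\frac{d}{dt}\log p_t(X_t) = -\nabla\cdot u_t(X_t)
\quad\Longrightarrow\quad
\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \nabla\cdot u_t(X_t)\, dt,\]
where $X_t$ is obtained by integrating the flow ODE backward from $X_1 = x_1$ down to $X_0 = x_0$.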
Flow matching is simulation-free (no ODE solver) in training, but it needs an ODE solver in sampling! Consistency models are simulation-free in sampling!
Flow matching vs. Score Matching
I used ChatGPT to make a comparison:
| Aspect | Score Matching (e.g., Diffusion Models) | Flow Matching |
|---|---|---|
| Training Objective | Match score ∇ₓ log pₜ(x) using denoising or noise-perturbed samples. | Match vector field uₜ(x) using interpolated paths between base and target. |
| Training Data Pairs | Uses perturbations of single samples x ∼ q(x). | Uses sample pairs (x₀, x₁) ∼ p₀ × q. |
| Stochasticity | Involves stochastic reverse-time SDE or ODE. | Can be trained and used deterministically (via ODE). |
| Interpretability | Score function is less interpretable. | Vector field can be directly visualized as transport directions. |
| Sample Quality | High quality, especially for complex distributions. | May suffer in multi-modal or complex geometries due to poor interpolation. |
| Interpolation Issues | None — doesn't interpolate between mismatched samples. | Yes — interpolates between randomly sampled pairs, which may be semantically unrelated. |
| Manifold Respect | ✅ Yes — local noise keeps dynamics near the data manifold. | ❌ No — interpolations may go off-manifold, especially in high dimensions. |
| Simulation-Free | No — often requires simulation of forward diffusion process. | Yes — training can be done without simulation of trajectories. |
| Training Stability | Stable and well-understood, but computationally intensive. | Often more efficient, but may suffer from spurious vector fields. |
| Mode Coverage | Good mode coverage due to noise-based training. | Can miss modes if interpolation crosses low-density areas. |
| Theoretical Guarantees | Strong links to Fokker-Planck, optimal denoising, and score estimation. | Tied to optimal transport and flow ODEs. |
| Applications | Denoising diffusion, image synthesis, audio generation. | Neural ODEs, flow-based generative modeling, fast training. |

Sampling for score matching and flow matching:
- Score matching: annealed Langevin dynamics, denoising with the score function (sketch after this list).
- Flow matching: integrate the learned vector field (not the flow map itself) to obtain samples.
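A minimal sketch of the annealed Langevin sampler mentioned above; the noise-level schedule, the $\alpha_i \propto \sigma_i^2/\sigma_L^2$ step-size rule, and the `score_net(x, sigma)` interface are assumptions following the usual NCSN-style recipe.

```python
import torch

@torch.no_grad()
def annealed_langevin_sample(score_net, sigmas, n_samples=64, dim=2,
                             steps_per_level=50, step_scale=2e-5):
    """Annealed Langevin dynamics: Langevin updates at decreasing noise levels.

    score_net(x, sigma) is assumed to approximate the score of the sigma-perturbed
    data distribution; step sizes follow alpha_i = step_scale * sigma_i^2 / sigma_L^2.
    """
    x = torch.randn(n_samples, dim)
    for sigma in sigmas:                                  # sigmas sorted high -> low
        alpha = step_scale * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            grad = score_net(x, torch.full((n_samples, 1), float(sigma)))
            x = x + 0.5 * alpha * grad + (alpha ** 0.5) * torch.randn_like(x)
    return x

# Usage (assumed): sigmas = torch.logspace(1, -2, 10), a geometric noise schedule.
```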
Summary:
- Score Matching is more robust in complex distributions and provides high sample quality, but can be computationally heavy.
- Flow Matching is efficient and deterministic, but can suffer from poor interpolation unless guided or regularized carefully.
Appendix D: Flow Matching vs. Score Matching
What’s the issue?
In flow matching, you’re learning a vector field $u_t(x)$ such that the flow transforms samples from a base distribution $p_0$ to a target distribution $q$, by matching trajectories defined by conditional fields (e.g., between $x_0 \sim p_0$ and $x_1 \sim q$).
But during training:
- You sample independent pairs $(x_0, x_1)$ from $p_0(x_0) \times q(x_1)$.
- The vector field $u_t(x)$ is trained to point from $x_0$ to $x_1$ (linearly or via some transport plan).
This means:
You’re interpolating between randomly paired samples, not samples that lie on a meaningful trajectory between modes.
Why is this a problem?
- Mismatch in Semantics: If $x_0$ and $x_1$ lie in different modes (e.g., digits "1" and "8"), the interpolated vector field may not follow a realistic path — it may go through "garbage" regions.
- Conflicting Training Signals: In high dimensions, many training pairs induce vector directions that contradict each other in overlapping regions of space.
- Generalization in Between: The learned field might look good at data points but behaves poorly between them — i.e., the model has no guarantee of generating valid intermediate samples during inference.
Contrast with Score-based Models
Score-based models (e.g., diffusion models) estimate the score $\nabla_x \log p_t(x)$, which is locally defined and smooth, and avoid the “random pairing” problem — they only rely on perturbing data and denoising, not interpolating arbitrary sample pairs.
Possible Fixes or Improvements
- Conditional Flow Matching (CFM): Instead of random pairing, use conditional transport or kernel matching to align similar samples.
- Use trajectories from known dynamics: If you have access to optimal transport maps (e.g., displacement interpolation), training becomes much more stable (see the sketch after this list).
- Guidance during interpolation: Add constraints or learned alignment that discourage implausible transitions.
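As one concrete (assumed) instance of the OT-based pairing idea above, here is a minibatch optimal-transport coupling sketch: noise and data samples are re-paired within each batch via an exact assignment before building the usual linear CFM path.

```python
import torch
from scipy.optimize import linear_sum_assignment

def ot_paired_batch(x0, x1):
    """Minibatch optimal-transport pairing (a simplified stand-in for the OT couplings
    discussed above): re-pair noise and data samples within a batch so that the total
    squared distance is minimized, which straightens the FM training paths."""
    cost = torch.cdist(x0, x1, p=2) ** 2            # pairwise squared distances
    row, col = linear_sum_assignment(cost.numpy())  # exact assignment on the minibatch
    return x0[row], x1[col]

# Usage: pair first, then build the usual linear CFM path on the re-paired samples.
x0, x1 = torch.randn(128, 2), torch.randn(128, 2) + 3.0
x0, x1 = ot_paired_batch(x0, x1)
t = torch.rand(128, 1)
x_t, v_target = (1 - t) * x0 + t * x1, x1 - x0
```

Exact assignment is only practical at minibatch scale; larger batches typically switch to entropic (Sinkhorn) couplings.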
Here’s a table comparing Score Matching (used in diffusion models) and Flow Matching (used in flow-based models):
| Aspect | Score Matching (e.g., Diffusion Models) | Flow Matching |
|---|---|---|
| Training Objective | Match score ∇ₓ log pₜ(x) using denoising or noise-perturbed samples. | Match vector field uₜ(x) using interpolated paths between base and target. |
| Training Data Pairs | Uses perturbations of single samples x ∼ q(x). | Uses sample pairs (x₀, x₁) ∼ p₀ × q. |
| Stochasticity | Involves stochastic reverse-time SDE or ODE. | Can be trained and used deterministically (via ODE). |
| Interpretability | Score function is less interpretable. | Vector field can be directly visualized as transport directions. |
| Sample Quality | High quality, especially for complex distributions. | May suffer in multi-modal or complex geometries due to poor interpolation. |
| Interpolation Issues | None — doesn’t interpolate between mismatched samples. | Yes — interpolates between randomly sampled pairs, which may be semantically unrelated. |
| Manifold Respect | ✅ Yes — local noise keeps dynamics near the data manifold. | ❌ No — interpolations may go off-manifold, especially in high dimensions. |
| Simulation-Free | No — often requires simulation of forward diffusion process. | Yes — training can be done without simulation of trajectories. |
| Training Stability | Stable and well-understood, but computationally intensive. | Often more efficient, but may suffer from spurious vector fields. |
| Mode Coverage | Good mode coverage due to noise-based training. | Can miss modes if interpolation crosses low-density areas. |
| Theoretical Guarantees | Strong links to Fokker-Planck, optimal denoising, and score estimation. | Tied to optimal transport and flow ODEs. |
| Applications | Denoising diffusion, image synthesis, audio generation. | Neural ODEs, flow-based generative modeling, fast training. |
🌐 Manifold Assumption Recap
The manifold hypothesis states that high-dimensional data (e.g., images) actually lies on or near a low-dimensional manifold embedded in the ambient space (e.g., ℝⁿ).
🌀 Score Matching (Diffusion Models)
- Preserves the manifold assumption — at least during inference:
- The forward diffusion process adds noise, pushing data off the manifold.
- But during reverse-time generation, the score function learns to bring samples back toward the data manifold.
- The model gradually denoises toward realistic-looking images on or near the original manifold.
- At each denoising step, small updates guided by the score field respect local structure.
✅ Bottom line: Score matching implicitly respects the data manifold by learning how to denoise back to it.
🔁 Flow Matching
- Potentially breaks the manifold:
- Flow matching defines vector fields between a base distribution (e.g., Gaussian) and the data distribution, often through interpolated paths.
- These interpolations (e.g., linear interpolations in pixel space between two image samples) often leave the manifold — they can pass through unrealistic regions that are far from the true data manifold.
- The resulting vector field pushes probability mass through low-density regions, which is unnatural for image data.
🚨 Problem: If you interpolate between two real images of, say, a cat and a car, the middle of the interpolation may not correspond to any real image-like structure — it’s off-manifold.
🧠 Summary
| Aspect | Score Matching | Flow Matching |
|---|---|---|
| Manifold Respect | ✅ Mostly stays on/near the data manifold | ❌ May generate off-manifold trajectories |
| Interpolation | Implicit via small local noise steps | Explicit interpolation between possibly unrelated samples |
| Risk | Lower — local denoising follows manifold | Higher — global flows may cross unrealistic regions |
🧩 Final Thought
Flow matching can respect the manifold, but only if interpolation is carefully chosen. Some recent work uses manifold-aware interpolations (e.g., via encoders or geodesic paths in latent space) to mitigate this issue.
Reference
Schrödinger bridge: https://arxiv.org/pdf/2307.03672
Appendix A
Based on the theory relating forward and reverse stochastic differential equations (SDEs), the verification is as follows:
1. Confirming the forward SDE (Eq. (12))
Given the forward SDE: \(dx_t = \left( f(x_t, t) - \frac{1}{2} \left( g^2(t) - \sigma^2(t) \right) \nabla_{x_t} \log p(x_t) \right) dt + \sigma(t) dw\) This equation describes a diffusion process, where:
- $f(x_t, t)$ is the deterministic drift term,
- $\nabla_{x_t} \log p(x_t)$ is the score function,
- $\sigma(t) > 0$ is the diffusion coefficient,
- $dw$ is a standard Brownian-motion increment.
This matches the standard structure of a forward SDE (a drift term with a score-function correction and a diffusion term $\sigma(t) dw$), describing the evolution of data from the initial distribution $p_0(x)$ to the terminal distribution $p_T(x)$.
2. Confirming the reverse SDE (Eq. (13))
Given the reverse SDE: \(dx_t = \left( f(x_t, t) - \frac{1}{2} \left( g^2(t) + \sigma^2(t) \right) \nabla_{x_t} \log p(x_t) \right) dt + \sigma(t) dw\) According to Anderson's (1982) reverse-time SDE theory, a forward SDE \(dx_t = \mu(x_t, t) dt + \sigma(t) dw\) has the corresponding reverse SDE \(dx_t = \left( \mu(x_t, t) - \sigma^2(t) \nabla_{x_t} \log p_t(x_t) \right) dt + \sigma(t) d\bar{w}\) where $\bar{w}$ is reverse-time Brownian motion.
Verification steps:
- Write the drift of the forward SDE as: \(\mu(x_t, t) = f(x_t, t) - \frac{1}{2} \left( g^2(t) - \sigma^2(t) \right) \nabla_{x_t} \log p(x_t)\)
- Substitute it into the reverse-SDE formula: \(\begin{aligned} dx_t &= \left[ \mu(x_t, t) - \sigma^2(t) \nabla_{x_t} \log p_t(x_t) \right] dt + \sigma(t) d\bar{w} \\ &= \left[ f(x_t, t) - \frac{1}{2} \left( g^2(t) - \sigma^2(t) \right) \nabla_{x_t} \log p(x_t) - \sigma^2(t) \nabla_{x_t} \log p_t(x_t) \right] dt + \sigma(t) d\bar{w} \end{aligned}\)
- Combine the score-function terms: \(-\frac{1}{2} \left( g^2(t) - \sigma^2(t) \right) \nabla_{x_t} \log p(x_t) - \sigma^2(t) \nabla_{x_t} \log p(x_t) = -\frac{1}{2} \left( g^2(t) + \sigma^2(t) \right) \nabla_{x_t} \log p(x_t)\)
- The result agrees with Eq. (13): \(dx_t = \left( f(x_t, t) - \frac{1}{2} \left( g^2(t) + \sigma^2(t) \right) \nabla_{x_t} \log p(x_t) \right) dt + \sigma(t) d\bar{w}\) where $d\bar{w}$ is equivalent to $dw$ under time reversal, so writing the diffusion term as $\sigma(t) dw$ is reasonable.
3. Conclusion
- Eq. (12) is the forward SDE, describing the diffusion of data from a simple distribution toward a complex one.
- Eq. (13) is the reverse SDE, used to reconstruct data from noise (the generative process); the sign and magnitude of the score-function coefficient in its drift agree with the theory (the forward coefficient is $-\frac{1}{2}(g^2 - \sigma^2)$, the reverse coefficient is $-\frac{1}{2}(g^2 + \sigma^2)$).
The two are linked through the score function $\nabla_{x_t} \log p(x_t)$, and together they form a complete generative-modeling framework (e.g., Score-Based Generative Models). Verification passed.