Excellent lectures! https://www.youtube.com/watch?v=8mxCNMJ7dHM&list=PL0H3pMD88m8XPBlWoWGyal45MtnwKLSkQ
Main Reference
https://arxiv.org/pdf/1503.03585.pdf : original Stanford Diffusion paper: very good!
https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ : good blog article including conditional diffusion
https://jalammar.github.io/illustrated-stable-diffusion/ [@alammarIllustratedStable2022] by Jay Alammar, excellent and no math!
Takeaways
Score matching is the key! It is equivalent to denoising. Why? See Tweedie's formula!
$E[x \vert \tilde{x}] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log p(\tilde{x})$: the noisy sample plus $\sigma^2$ times the (blurred) score!
![[Pasted image 20250318112923.png]]
![[Pasted image 20250318112839.png]] ![[Pasted image 20250323001649.png]] Tweedie's Formula [8]. In English, Tweedie's Formula states that the true mean of an exponential family distribution, given samples drawn from it, can be estimated by the maximum likelihood estimate of the samples (aka empirical mean) plus some correction term involving the score of the estimate. In the case of just one observed sample, the empirical mean is just the sample itself. It is commonly used to mitigate sample bias; if observed samples all lie on one end of the underlying distribution, then the negative score becomes large and corrects the naive maximum likelihood estimate of the samples towards the true mean.
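The formula is easy to check numerically. A minimal NumPy sketch (my own toy example, assuming a 1-D Gaussian prior so that the marginal score is analytic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior: x ~ N(mu0, s0^2); observation: x_tilde = x + N(0, sigma^2)
mu0, s0, sigma = 2.0, 1.0, 0.5
x = rng.normal(mu0, s0, size=100_000)
x_tilde = x + rng.normal(0.0, sigma, size=x.shape)

# The marginal of x_tilde is N(mu0, s0^2 + sigma^2), so its score is analytic
def score(y):
    return -(y - mu0) / (s0**2 + sigma**2)

# Tweedie: E[x | x_tilde] = x_tilde + sigma^2 * score(x_tilde)
x_hat = x_tilde + sigma**2 * score(x_tilde)

mse_raw = np.mean((x_tilde - x) ** 2)    # ~ sigma^2 = 0.25
mse_tweedie = np.mean((x_hat - x) ** 2)  # smaller: this is the MMSE denoiser
```

The Tweedie estimate matches the analytic posterior mean, which is exactly the "MMSE denoiser = score function" connection used below.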
DDPM vs. DDIM
Similar to the discrete case.
- DDPM: predict the noise using score matching, as in the formula above.
- DDIM: predict $x_0$ directly, $E[x_0 \vert x_t]$; this is essentially a flow model.
- CM (consistency model): use a NN to predict the ODE output directly.
![[Pasted image 20250322172344.png]]
Two interpretations of DDPM. Or are there three roads leading to Rome?
- Langevin dynamics: score matching, random walk and reverse walk
- DDPM: noise estimation and denoising!
- Hierarchical VAE: ELBO
- The result is the same!
- The chain rule used in diffusion is based on the Markov property (so it differs from auto-regressive models)
- KL divergence vs. Wasserstein distance, and their closed forms for Gaussian distributions
- Mutual information: KL of $P(x, y) \parallel P(x) P(y)$
Why can AI handle ill-conditioned problems? Because there is an underlying PDF!! If we know the PDF, or can estimate it (or approximate the underlying pdf via data training), we can solve or optimize many ill-conditioned problems!!
Key Assumptions/Observation of Image Space and Image Manifold
- In high dimensions the image space is basically a desert: almost everywhere is sand.
- The low-dimensional image manifold is the oasis in the desert. So once we know P(x), we essentially have all the information. This is completely different from communications!!!! Because the communication signal space is low-dimensional, once signal and noise mix they are hard to separate. But image space is very high-dimensional.
- What about audio space? LLM space? and other spaces?
- One example: a DC signal (or a very low-frequency signal) plus Gaussian noise; the noise can still be estimated! Because the DC signal is low-dimensional (dimension 0), while the Gaussian noise is very high-dimensional.
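A quick numerical check of this point (a toy sketch; the DC level and noise scale are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000    # number of noisy observations of the same DC level
dc = 3.7      # the unknown constant: a 0-dimensional "signal manifold"
y = dc + rng.normal(0.0, 1.0, size=n)  # per-sample SNR is poor

dc_hat = y.mean()        # projection onto the (0-dimensional) signal manifold
noise_hat = y - dc_hat   # everything orthogonal to it is noise
# The DC estimate error shrinks like 1/sqrt(n) even though each sample is noisy.
```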
Don't denoise in pixel space, but in latent space? Because in latent space the noise is relatively higher-dimensional?
The denoiser is the key! The MMSE denoiser is exactly the score function!
Knowing $P(\bar{x})$ lets us do all image processing and generation, completely unlike traditional DSP.
- Many problems that traditional DSP treats as ill-conditioned (e.g. super-resolution, deblurring) or impossible (e.g. image generation) can be described as linear inverse problems. Once P(x) (the joint or marginal pdf) is known, they become solvable. How do we get P(x)? Early on, by guessing; today, by data training!
- Image generation in particular is a problem traditional DSP cannot solve at all! With P(x) it can be solved! Several methods:
- VAE: train a generator with reconstruction loss + KL loss.
- GAN
- Normalizing Flow
- EBM (energy-based model)
- Diffusion!! -> randomly generate an image, then move it toward higher P(x); score-based methods are diffusion
- In practice we never truly know P(x); instead we train a generator or sampler, $x_s = G_{\theta}(z_s)$, such that the distribution of $x_s$ follows P(x).
- How do we use the generator G(z) for other tasks, e.g. linear inverse problems, SR, NR, compression…?
Exponential Family and Gibbs Distribution are the Key
![[Pasted image 20250122102634.png]]
Gibbs distribution: seemingly inspired by the exponential family. Transform P(x) into a partition function and an exponent!! A bit like divide-and-conquer, but exploiting the property of a probability function that 0 < P(x) < 1, so $\rho(x)$ is always positive
$P(x) = C \exp(-\rho(x))$
Parameterized by a neural network with parameters $\theta$:
$P(x) = \frac{1}{Z_{\theta}} \exp(-\rho_{\theta}(x))$
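A tiny sketch of the Gibbs form (assuming $\rho(x) = x^2/2$, for which $Z$ is known in closed form) and of why the score sidesteps $Z$:

```python
import numpy as np

# Gibbs form: P(x) = exp(-rho(x)) / Z. With rho(x) = x^2/2 this is N(0, 1),
# and the partition function has a closed form: Z = sqrt(2*pi).
def rho(x):
    return 0.5 * x**2

Z = np.sqrt(2.0 * np.pi)

def p(x):
    return np.exp(-rho(x)) / Z

# The score never sees Z: d/dx log P(x) = -rho'(x) = -x.
# This is exactly why score-based methods can avoid the partition function.
xs = np.linspace(-10.0, 10.0, 100_001)
dx = xs[1] - xs[0]
total = float(np.sum(p(xs)) * dx)   # numerically ~1: P is properly normalized
```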
Diffusion Method Key Concept
Apply maximum likelihood over all image samples:
$\max_{\theta} P(x_1) P(x_2) \cdots P(x_n) = \max_{\theta} \prod_k \frac{1}{Z_{\theta}} \exp(-\rho_{\theta}(x_k)) = \min_{\theta} \left\{ \log Z_{\theta} + \frac{1}{n}\sum_k \rho_{\theta}(x_k) \right\}$
- Note that we converted the problem of maximizing P(x) into minimizing $\rho_{\theta}(x_k)$
- If we ignore $Z_{\theta}$, just minimizing $\rho_{\theta}(x_k)$ is very simple
- But the point is that $Z_{\theta}$, the partition function, must be accounted for, which makes the problem hard. The various methods exist precisely to avoid computing $Z_{\theta}$
Inverse problem: given $\rho_{\theta}(x)$, how do we generate a good-quality image?
- start with any random x
- compute the gradient of $\rho_{\theta}(x)$ with respect to $x$, then run gradient descent on $x$?
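The two steps above can be sketched with a toy quadratic energy (an assumed example, not any specific model). Note that plain gradient descent collapses every starting point onto the same mode, which is why the Langevin noise term below is needed for actual sampling:

```python
import numpy as np

# Toy energy rho(x) = 0.5 * ||x - mu||^2, i.e. P(x) is a Gaussian centered at mu.
mu = np.array([1.0, -2.0])

def grad_rho(x):
    return x - mu

x = np.array([10.0, 10.0])    # start with any random x
lr = 0.1
for _ in range(200):
    x = x - lr * grad_rho(x)  # gradient descent on rho = ascent on log P(x)
# Every starting point converges to the single mode mu: this finds a
# "most likely" image but cannot by itself produce diverse samples.
```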
Why is image generation like diffusion iterative? Because it is first-order optimization. If we knew the image manifold, could it be second-order? Or a flow method?
![[Pasted image 20250118235632.png]]
![[Pasted image 20250118235116.png]]
![[Pasted image 20250119001855.png]]
![[Pasted image 20250119002252.png]]
![[Pasted image 20250119002622.png]]
![[Pasted image 20250119003906.png]]
![[Pasted image 20250119005034.png]]
![[Pasted image 20250119010327.png]]
![[Pasted image 20250119010436.png]]
Deep Learning for Computer Vision/Image
- ImageNet: DL supervised learning for classification: discriminative problem
- DL for regression: supervised learning discriminative problem
- DL Variational AE: new sample generation based on existing samples via self-supervised learning
- DL Diffusion: generation based on Markovian property
Having samples generated from P(x) is NOT the same as having the distribution P(x)!!!!
See [[2024-05-26-Math_Sampling]]
![[Pasted image 20250119121811.png]]
Still, the two are tightly connected.
- If we have P(x), we can generate samples with a random number generator plus the inverse CDF, a transformation, or MCMC (Markov Chain Monte Carlo); this process is called sampling. The direction P(x) -> G is the intuitive, easy one. But in general we are not lucky enough to have the prior P(x).
- How to generate samples using the score function: this is MCMC sampling.
- The next question is how to obtain this score function.
- Conversely, if we have G, we can generate many samples and compute statistics (mean, variance, …, histogram) that approximate a P(x). It is just inefficient.
- Likewise, if we have G rather than P(x) …
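The inverse-CDF direction (P(x) -> samples) in the first bullet can be sketched with an exponential distribution, a textbook case chosen here because its CDF inverts in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: Exponential(rate = 2). CDF F(x) = 1 - exp(-lam * x), so the
# inverse is F^{-1}(u) = -ln(1 - u) / lam. Push uniform noise through it.
lam = 2.0
u = rng.uniform(size=100_000)
x = -np.log(1.0 - u) / lam

# Empirical moments match the target: mean 1/lam, variance 1/lam^2.
```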
### Question1a: Given P(x), how do we produce samples (i.e. a sampler), a.k.a. a generative model!
There are many ways to produce samples. Here we start from MCMC, using Langevin dynamics. As long as we have the score function, we can produce empirical samples that follow P(x)! This is only one method, and it is too slow; other methods are needed to accelerate it.
This is the first diffusion equation! "Diffusion" refers to P(X(t)) over time.
![[Pasted image 20250125221658.png]] ![[Pasted image 20250125224405.png]] Langevin is not used in practice because it takes ~10,000 steps to converge; it is only for theoretical analysis.
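A minimal sketch of unadjusted Langevin dynamics, assuming a standard-normal target so the score is analytic (in the real method the score comes from a learned denoiser):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target N(0, 1): score(x) = -x. In diffusion models this analytic score
# is replaced by a learned network (via the denoiser / Tweedie connection).
def score(x):
    return -x

eps = 0.01                         # step size
x = rng.normal(0.0, 5.0, 10_000)   # start far from the target
for _ in range(2_000):
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.normal(size=x.shape)

# After many steps the empirical distribution matches the target N(0, 1):
# exactly the "too many steps to converge" behavior noted above.
```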
Question 1b: how do we obtain or approximate the score function?
The short answer: a denoiser! More precisely, denoiser output - original image = noise, i.e. a noise estimator!! How do we get the denoiser? With a neural network!! In other words, the only learnable part of the Langevin equation is the denoiser, D! But the denoiser here is very simple: it only has to handle one fixed and very small noise level (sigma = 0.01). The price is very slow convergence!
Question 1c: how do we speed up the Langevin equation?
The short answer: deliberately contaminate the data with noise ("pollute yourself")! Bad news travels fast. Technical term: annealing, or annealed diffusion ![[Pasted image 20250125223926.png]]
It takes about 1,000 steps to restore the image. ![[Pasted image 20250125224327.png]] In other words, the only learnable part of the annealed Langevin equation (AVD) is the denoiser, D! The AVD denoiser differs from the previous one: it must denoise noise of many different magnitudes, so it is harder to train!
Question 1d: use a mixture of image and noise instead of additive noise!
![[Pasted image 20250125224818.png]] ![[Pasted image 20250125225037.png]] Again, the only learnable part of the annealed Langevin equation (AVD) is the denoiser, D! The denoiser here differs from AVD again: it denoises when the image and the noise form a mixture! The AVD denoiser keeps the image fixed and denoises noise of different magnitudes; here the image itself is scaled as well. From an SNR point of view the two are about the same. Note that this denoiser is very similar to the one in DDPM below!
Question 1e: DDPM and 1f: DDIM
The key point of DDPM: the denoising step on the backward path is also Gaussian when the step $\tau$ is small and the iteration count K is large; it is essentially Gaussian noise. Once it is Gaussian, there is an analytic form.
DDPM is also a denoiser, but of yet another kind. It does not denoise all the noise back to the original image; it removes a little noise, stepping back to the previous noisy image. This makes it easier to train?
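The DDPM forward process has a closed form that can be sketched directly (assuming the standard linear beta schedule; no network is trained here, the comment only notes what the training target would be):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # standard linear schedule (assumed)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    a = alphas_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

x0 = rng.normal(size=(4, 8))           # a batch of toy "images"
eps = rng.normal(size=x0.shape)
x_T = q_sample(x0, T - 1, eps)
# At t = T-1 almost no signal is left (abar ~ 4e-5), so x_T is ~ N(0, I).
# Training: a network eps_theta(x_t, t) regresses eps with an MSE loss.
```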
Denoising Diffusion Probabilistic Models (DDPM) and Denoising Diffusion Implicit Models (DDIM) are both types of diffusion models used in generative AI, but they differ in their approach to the reverse process and sampling efficiency.
Similarities
- Both DDPM and DDIM use the same forward process, gradually adding Gaussian noise to data over a series of timesteps[2].
- They share the same training objective, allowing DDIM to use pre-trained DDPM models for inference[6].
- Both models aim to learn the data distribution through forward and reverse diffusion processes[6].
Comparison between DDPM and DDIM (generated with Perplexity)
Differences
Reverse Process
- DDPM uses a probabilistic reverse process, learning to denoise data step-by-step[2].
- DDIM modifies the reverse process to make it deterministic, defining a fixed mapping between timesteps[2].
Sampling Efficiency
- DDIM allows for much faster sampling compared to DDPM, making it competitive with GANs in terms of generation speed[4].
- DDIM can generate samples in fewer steps, offering a trade-off between sample quality and computational efficiency[2].
Mathematical Formulation
- DDPM’s reverse process is Markovian, while DDIM introduces a non-Markovian forward process[1][2].
- DDIM sets the forward posterior variance to zero, allowing it to skip diffusion timesteps inside subsequences[6].
Performance
- DDIM can be 10 to 50 times faster than previous conditional diffusion methods while maintaining comparable quality[3].
- DDIM provides a family of generative models that can be chosen by selecting different non-Markovian diffusion processes[4].
In essence, DDPM focuses on probabilistic denoising, while DDIM introduces deterministic sampling for improved efficiency without sacrificing the model’s generative capabilities[2].
Citations: [1] https://arxiv.org/html/2402.13369v1 [2] https://aman.ai/primers/ai/diffusion-models/ [3] https://openreview.net/forum?id=8xStV6KJEr [4] https://strikingloo.github.io/wiki/ddim [5] https://www.tonyduan.com/diffusion/ddpm_vs_ddim.html [6] https://redstarhong.tistory.com/312 [7] https://sachinruk.github.io/blog/2024-02-11-DDPM-to-DDIM.html [8] https://www.reddit.com/r/StableDiffusion/comments/zgu6wd/can_anyone_explain_differences_between_sampling/
| | DDPM | DDIM | Note |
| -------- | --- | --- | --- |
| Forward | Forced Markovian and additive Gaussian | Non-Markovian, but additive Gaussian | share the same denoiser!! |
| Backward | Approx. Gaussian and Markovian for small step | Force Markovian!! set to be deterministic! | DDIM is fast because the backward pass is deterministic |

Catch:
- Use DDPM to train a forward-path denoiser; the same denoiser can be used in DDIM.
- Use DDIM for inference, because it is faster and gives image consistency.
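The deterministic DDIM update can be sketched with an oracle noise predictor standing in for the trained network (a toy consistency check of the update algebra, not a real sampler):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)

x0 = rng.normal(size=8)
eps = rng.normal(size=8)
x_t = np.sqrt(abar[-1]) * x0 + np.sqrt(1.0 - abar[-1]) * eps  # fully noised

# Oracle noise predictor: stands in for the trained eps_theta(x_t, t) network.
def eps_theta(x, t):
    return eps

# Deterministic DDIM reverse pass (eta = 0): no noise is re-injected.
x = x_t
for t in range(T - 1, 0, -1):
    e = eps_theta(x, t)
    # Tweedie-style estimate of x0, then jump to the previous timestep
    x0_pred = (x - np.sqrt(1.0 - abar[t]) * e) / np.sqrt(abar[t])
    x = np.sqrt(abar[t - 1]) * x0_pred + np.sqrt(1.0 - abar[t - 1]) * e

x0_hat = (x - np.sqrt(1.0 - abar[0]) * eps_theta(x, 0)) / np.sqrt(abar[0])
# With an exact predictor the deterministic pass recovers x0 exactly, which
# is the "image consistency" property noted above.
```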
There is also CM (Consistency Model), again aimed at acceleration; the comparison is as follows:
Consistency Models (CM) and Denoising Diffusion Implicit Models (DDIM) are both advancements in generative AI that aim to improve the efficiency of diffusion models. Here’s a comparison of their key features:
Similarities
- Both CM and DDIM are designed to accelerate the sampling process of diffusion models.
- They both aim to produce high-quality samples more efficiently than traditional Denoising Diffusion Probabilistic Models (DDPMs).
- CM and DDIM can be trained using pre-trained diffusion models[1][2].
Differences
Sampling Process
- DDIM modifies the reverse process of DDPMs to make it deterministic, allowing for faster sampling[2].
- CM aims to directly map noise to data in a single step or very few steps[8].
Flexibility
- DDIM allows for a trade-off between computation and sample quality by adjusting the number of sampling steps[2].
- CM supports both one-step generation and multi-step sampling, offering more flexibility in balancing speed and quality[8].
Training Objective
- DDIM uses the same training objective as DDPMs, making it compatible with pre-trained DDPM models[2].
- CM introduces a new training approach called Consistency Training (CT), which is more challenging but potentially more powerful[9].
Performance
- DDIM can produce samples 10 to 50 times faster than DDPMs in terms of wall-clock time[2].
- CM claims to achieve state-of-the-art results in one-step generation, with competitive performance even in just two sampling steps[8].
Additional Capabilities
- DDIM allows for semantically meaningful image interpolation in the latent space[5].
- CM supports zero-shot editing tasks like image inpainting and super-resolution without specific training[8].
In essence, while both CM and DDIM aim to improve the efficiency of diffusion models, CM represents a more radical departure from traditional diffusion model architecture, potentially offering even faster generation at the cost of more complex training.
Citations: [1] https://openaccess.thecvf.com/content/CVPR2024/papers/Xu_Inversion-Free_Image_Editing_with_Language-Guided_Diffusion_Models_CVPR_2024_paper.pdf [2] https://openreview.net/forum?id=St1giarCHLP [3] https://github.com/alexander-soare/consistency_policy [4] https://arxiv.org/html/2411.08954v1 [5] https://strikingloo.github.io/wiki/ddim [6] https://slazebni.cs.illinois.edu/spring24/lec13_diffusion_viraj.pdf [7] https://www.roboticsproceedings.org/rss20/p071.pdf [8] https://www.semanticscholar.org/paper/Consistency-Models-Song-Dhariwal/ac974291d7e3a152067382675524f3e3c2ded11b [9] https://arxiv.org/html/2403.06807v2
Question 2: how do we get P(x) from G (diffusion)?
For diffusion (DDPM and DDIM) this is very easy: $P(x) \sim P(x_1 \vert x_0) P(x_2 \vert x_1) \cdots$
Question 3: how do we use G (diffusion) for super-resolution, compression, etc.?
Guided Diffusion
- class guide
- image guide
- text guide
Class guide
Using AVD as an example: ![[Pasted image 20250125232312.png]]
The key is to find the new conditional score function. How?
- Guess 1: train a denoiser D_c only on images of class c? Not practical: it needs too many denoisers and there are too few samples per class.
- Guess 2: add a force pulling toward class = c. How? Use the back-propagation gradient of a classifier! But this classifier must be able to classify noisy images. Moreover, …
- ![[Pasted image 20250125233311.png]]
![[Pasted image 20250125234659.png]]
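Classifier guidance in one line: the conditional score is the unconditional score plus the classifier gradient, $\nabla_x \log p(x \vert c) = \nabla_x \log p(x) + \nabla_x \log p(c \vert x)$. A toy sketch with a two-class 1-D Gaussian mixture, where both terms are analytic (my own illustrative setup, not a trained classifier):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy p(x): equal mixture of unit Gaussians at -3 (class 0) and +3 (class 1).
mus = np.array([-3.0, 3.0])

def weights(x):
    """Posterior class weights w_k(x) = p(k | x) for the mixture."""
    logp = -0.5 * (x[:, None] - mus[None, :]) ** 2
    w = np.exp(logp)
    return w / w.sum(axis=1, keepdims=True)

def score_uncond(x):
    # d/dx log p(x) = sum_k w_k(x) * (mu_k - x)
    return (weights(x) * (mus[None, :] - x[:, None])).sum(axis=1)

def grad_log_classifier(x, c):
    # d/dx log p(c | x) = (mu_c - x) - sum_k w_k(x) * (mu_k - x)
    return (mus[c] - x) - score_uncond(x)

# Guided Langevin sampling toward class c = 1: the two gradients add up.
c, eps = 1, 0.05
x = rng.normal(0.0, 1.0, 5_000)
for _ in range(1_000):
    g = score_uncond(x) + grad_log_classifier(x, c)   # conditional score
    x = x + eps * g + np.sqrt(2 * eps) * rng.normal(size=x.shape)
# The samples concentrate around class 1's mode at +3.
```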
Question 2: how do we go from P(x) samples to the distribution, and then do classification, denoising, super-resolution? Just as with an LLM decoder: how do we use an LLM decoder for sentiment detection or spam detection? We can simply ask it, or fine-tune the last layer!
Nonlinear history of AI, key events:
- (Physics related) Ising equation to Hopfield: use neurons and lowest energy to find the ground state.
- (Somewhat physics related) RBM: organized neurons, hidden units, and introduced randomness, then back-prop
- (No physics analog?) Variational autoencoder: stacked RBMs
- (Physics related) diffusion!
| | Ising model | Hopfield network | Hinton BM/RBM |
|---|---|---|---|
| Bit Representation | two spin states | binary | binary |
| Connectivity | nearest-neighbor coupling, visible nodes | neuron connections, visible nodes | neuron connections, visible and hidden nodes |
| Objective | minimize energy | minimize loss function | minimize loss function |
| Randomness | Yes | No | Yes |
| Distribution | Boltzmann | No | Data Distribution |
![[Pasted image 20250119115717.png]]
Image Quality
Text: perplexity = exp(self-entropy or cross-entropy)
IS (Inception Score = perplexity for images?) = exp(KL divergence between $y_k$ and $y_{avg}$)!! Does IS roughly measure how many kinds of images there are? Is 0 < IS < number of classes? NO
- Bigger IS is better: it means the class distributions are far apart and there is no mode collapse!
Is FID more like perplexity? NO, it is a distance between the true and fake (synthesized) image distributions. NLL is more like log(perplexity), or entropy(x).
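FID itself is just the Fréchet (Wasserstein-2) distance between two Gaussians fitted to feature statistics. A minimal sketch (assuming features are already extracted; the standard metric uses InceptionV3 activations, and the matrix square root needs SciPy):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real, feat_fake):
    """Frechet distance between Gaussians fitted to two (n, d) feature sets."""
    mu1, mu2 = feat_real.mean(0), feat_fake.mean(0)
    s1 = np.cov(feat_real, rowvar=False)
    s2 = np.cov(feat_fake, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):      # numerical noise can give tiny imaginary parts
        covmean = covmean.real
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 S2)^{1/2})
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(5_000, 4))       # stand-in "real" features
fake_good = rng.normal(0.0, 1.0, size=(5_000, 4))  # same distribution: FID ~ 0
fake_bad = rng.normal(3.0, 1.0, size=(5_000, 4))   # shifted distribution: large FID
```

A matched distribution gives FID near 0, a shifted one a large value: FID measures a distance between distributions, not a likelihood, which is exactly the contrast with perplexity/NLL above.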
Evolution of Diffusion Models
Diffusion models are not a new concept: DPM (Diffusion Probabilistic Models) was proposed back in 2015 in "Deep Unsupervised Learning using Nonequilibrium Thermodynamics". Then in 2020, "Denoising Diffusion Probabilistic Models" introduced the DDPM model for image generation; the lineage is obvious from the names. After DDPM was released, its excellent image-generation quality attracted attention and reignited the image-generation field that GANs had dominated for years. Many strong papers followed:
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015: DPM
- Denoising Diffusion Implicit Models (DDIM), 2020: sacrifices a little generation diversity to improve DDPM's sampling efficiency by 10-50x
- Diffusion Models Beat GANs on Image Synthesis, 2021: generated better images than GANs with diffusion models and, more importantly, proposed Classifier Guidance, a conditional image-generation method that greatly broadened the use cases of diffusion models
- More Control for Free! Image Synthesis with Semantic Diffusion Guidance, 2021: extended Classifier Guidance further; besides a classifier, text or images can also serve as semantic conditions to guide generation
- Classifier-Free Diffusion Guidance, 2021: as the title says, achieves conditional image generation without pre-training any classifier, only by adding a constraint to the diffusion model
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, 2021: with the foundations above in place, OpenAI used its "money power" (data, machines, everything) to train a very large diffusion model, surpassing its own previous text-to-image model DALL·E for a new SOTA
- Then in 2022, OpenAI's DALL·E 2, Google's Imagen, and other SOTA models took the stage one after another, leading to the scene described at the beginning. This note covers only DDPM and the subsequent evolution of diffusion models; the papers involved are roughly those above.
A friendly reminder: DDPM and VAE (Variational AutoEncoder) are quite similar in technique and pipeline, so it is strongly recommended to first read the Variational AutoEncoder part of "當我們在談論 Deep Learning:AutoEncoder 及其相關模型", which will help with the material below.
The text below also draws on each of the original papers above, plus "What are Diffusion Models?"; interested readers can dig in on their own.
DDPM(Denoising Diffusion Probabilistic Models)
The core idea of DDPM is very plain and similar to VAE: encode the vast space of images into a Gaussian distribution in some unified way (a process called diffusion); then draw a random sample from the Gaussian and run the decode process (the inverse of the encode above), which should produce a meaningful image (a process called reverse diffusion). A schematic of the whole pipeline is shown below, where … is the real image and … is the Gaussian image.
Since DDPM involves a large amount of derivation, this note does not repeat it; if in doubt, see the Bilibili video "Diffusion Model擴散模型理論與完整PyTorch代碼詳細解讀", in which the uploader walks through the formulas.
2015: Sohl-Dickstein, et al. (Stanford), "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" [@hoDenoisingDiffusion2020]: the first to propose using
- Forward: Markov diffusion kernel (Gaussian or Binomial diffusion)
- Backward: ? what deep learning model?
- Entropy
Deep Unsupervised Learning using Nonequilibrium: Entropy
2020 DDPM (Denoising Diffusion Probabilistic Model): probabilistic model Markov/VAE
2021
Classifier-free diffusion Guidance
DALL-E, GLIDE
DALL-E 2, Google Imagen, Midjourney