Generative AI's Two Major Schools: CV (Computer Vision) and NL (Natural Language)
AI's two major schools are CV and NL, corresponding to the Sword School (劍宗) and the Qi School (氣宗) of wuxia lore. The Sword School is flashy, generating all kinds of beautiful images; in particular, diffusion models generate images from noise, rather in the spirit of "no move beats all moves" (無招勝有招). The Qi School is steady and methodical, advancing step by step, just as NL's auto-regression only ever considers how to produce the next token.
How Does AI Generate Samples X (Articles, Images, …)?
Analytic AI is cheap; generative AI is hard. A quick comparison:

| | Analytic/Discriminative AI | Generative AI |
|---|---|---|
| Number of Parameters | < 10's M | > 1000 M |
| Computation per Inference | 1-10's TOPS | 100's-1000's TOPS |
| Core Neural Network | CNN | Transformer |
| Learning Methodology | Supervised training | Self-supervised pre-training + supervised fine-tuning |
| Math Insight | $P(c\vert \mathbf{x})$: given an image/article $\mathbf{x}$, what is the probability it is a dog/spam? | $P(\mathbf{x})$: generate an image/article sample $X \sim P(\mathbf{x})$ with text/image guidance |
- Generative AI is extremely difficult because of $\mathbf{x}$'s high dimensionality:
    - The probability of a monkey typing Hamlet (30K words, or 130K letters) is $26^{-130{,}000} \sim 10^{-183{,}946}$.
    - The probability of randomly generating a 512×512 = 262,144-pixel natural image is $(2^{24})^{-262{,}144} = 16{,}777{,}216^{-262{,}144}$.
    - The brute-force probability is $S^{-N}$, where $N$ is the token/pixel length and $S$ is the number of discrete values/states per token/pixel. We usually call $N$ the dimension of the problem: a monkey typing Hamlet is a 130K-dimension problem, and a 512×512 image a 262K-dimension one. As articles get longer or images get larger, the dimension grows, and the brute-force search space $S^N$ blows up exponentially.
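The brute-force numbers above can be checked in log space; a minimal sketch (working with $\log_{10}$ avoids floating-point underflow for such astronomically small probabilities):

```python
import math

def brute_force_log10_prob(states: int, length: int) -> float:
    """log10 of S^-N: the chance of randomly emitting one exact sequence."""
    return -length * math.log10(states)

# Monkey typing Hamlet: 26 letters, ~130,000 characters.
print(brute_force_log10_prob(26, 130_000))        # about -183,946
# Random 512x512 RGB image: 2^24 states per pixel, 262,144 pixels.
print(brute_force_log10_prob(2**24, 512 * 512))   # about -1,893,917
```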
![[Pasted image 20250324171721.png]]
- How to obtain $P(\mathbf{x})$ to generate sample X (article, image, …) effectively and efficiently?
- (Mainstream) Auto-regression for text: OpenAI GPT-2, 2019.
- (Mainstream) Diffusion for image: Stanford NCSN/DDIM (Ermon, Song), 2019.
- Other (past) generation techniques:
- VAE (Variational Autoencoder): U. of Amsterdam (Kingma, Welling), 2013.
- GAN (Generative Adversarial Network): U. of Montreal (Goodfellow), 2014.
- MAE (Masked Autoencoder): Meta (Kaiming He), 2021.
- 2013, the earliest generative AI begins with VAE from the CV camp, used for image generation.
- 2019, OpenAI GPT-2 applies AR + Transformer to natural-language generation.
- 2019-2020, Stanford (Ermon and Song) and Berkeley (Jonathan Ho) replace GAN with diffusion for image generation.
- 2020, ViT uses tokens for image patches for image understanding. Tokenization wins!
- NL camp: Attention, Transformer, Tokenization, AR
- CV camp: CNN for patchification, VAE (used to encode/decode images)
Auto-regression (generate next token) vs. Diffusion (generate from noise)
Auto-regression: $P(\mathbf{x})=P(x_1)\,P(x_2\vert x_1)\cdots P(x_n\vert x_1,\dots,x_{n-1})$
The factorization above is exact; no simplification has been made, and the complexity is still $S^N$. Tokens (text, image) are generated one by one, step by step, which is why we call it the Qi School.
Text generation:
- $P(\mathbf{x}) : P(動物園) = P(\text{動}) P(\text{物} \vert \text{動}) P(\text{園} \vert \text{動}, \text{物})$
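A toy numerical version of this chain-rule factorization; the conditional probabilities below are made-up illustrative numbers, not from any trained model:

```python
# P(動物園) = P(動) * P(物|動) * P(園|動,物), with invented probabilities.
p_first = {"動": 0.01}
p_second = {("動",): {"物": 0.30}}
p_third = {("動", "物"): {"園": 0.20}}

def sequence_prob(tokens):
    """Chain-rule probability of a 3-token sequence."""
    p = p_first[tokens[0]]
    p *= p_second[(tokens[0],)][tokens[1]]
    p *= p_third[(tokens[0], tokens[1])][tokens[2]]
    return p

print(sequence_prob(["動", "物", "園"]))  # 0.01 * 0.30 * 0.20 = 0.0006
```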
Pros:
- Discrete or continuous distribution (text, image, video)
- Variable length (text, video)
- Scalable, and follows the scaling law
Cons:
- Slow – Sequential generation
- Sampling “drifts” - accumulated errors
- Inductive bias: for non-language applications (e.g., DNA sequences), left-to-right order is not necessarily natural
- Constrained architecture: causal attention mask
Models:
- Most popular LLMs and LMMs, based on the Transformer (Vajra Palm, 金剛掌), Mamba (Snake Fist, 蛇形拳), ViT
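The "causal attention mask" constraint listed under Cons is just a lower-triangular boolean matrix: token $i$ may attend only to tokens $0..i$. A minimal NumPy sketch:

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend only to positions 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```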
Diffusion (Langevin dynamics): $\mathbf{x}_t = \mathbf{x}_{t-1} + \frac{\epsilon}{2} \nabla_{\mathbf{x}} \log p(\mathbf{x}_{t-1}) + \sqrt{\epsilon}\,\mathbf{z}_t$
Pros:
- Fast – Parallel generation
Cons:
- Continuous distributions only (image, speech)
- Fixed length only (image)
Models:
- DDPM, DDIM, Stable Diffusion (w/ Encoder/Decoder), DiT (patched tokens)
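The Langevin update above can be demonstrated on a 1-D toy target. This sketch assumes a standard Gaussian target, whose score $\nabla_x \log p(x) = -x$ is known in closed form, and a hand-picked step size $\epsilon$:

```python
import math
import random

def langevin_sample(steps=2000, eps=0.01, seed=0):
    """Iterate x_t = x_{t-1} + (eps/2) * score(x_{t-1}) + sqrt(eps) * z_t."""
    rng = random.Random(seed)
    x = 10.0                      # start far from the mode
    for _ in range(steps):
        score = -x                # score of N(0, 1)
        x += 0.5 * eps * score + math.sqrt(eps) * rng.gauss(0.0, 1.0)
    return x

samples = [langevin_sample(seed=s) for s in range(200)]
mean = sum(samples) / len(samples)
print(round(mean, 2))  # close to 0: the chain forgets x0 and samples N(0, 1)
```

In a real diffusion model the analytic score is replaced by a trained network; everything else about the sampler stays the same.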
Image generation via SDE: Stochastic Differential Equation
- Forward SDE ($0 \to T$):
- $d\mathbf{x}_t = \sigma (t)\, d\mathbf{w}_t$
- $\mathbf{x}_t$ is a random walk, also known as Brownian motion. $\sigma(t)$ is the noising/denoising scheduler, which controls the annealing speed.
- Reverse SDE ($T \to 0$):
- $d\mathbf{x}_t = -\sigma(t)^{2}\, \underbrace{\nabla_{\mathbf{x}} \log p(\mathbf{x}_t)}_{\text{Score Function}} \,dt + \sigma(t)\, d\mathbf{w}_t$
Score function: the log-likelihood gradient is a vector field, approximated with a neural network. This looks more complicated, because the output dimension goes from a scalar (dimension = 1) to the same high dimension as the input (e.g., 512×512 = 262K for an image, or 130K for text). In practice it is not: training already computes (high-dimensional) gradients anyway, and both training and inference actually become simpler.
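The forward SDE can be simulated with an Euler-Maruyama step $x \leftarrow x + \sigma(t)\sqrt{dt}\,z$. A sketch assuming the illustrative constant schedule $\sigma(t)=1$, for which the marginal is $x_T \sim \mathcal{N}(x_0, T)$:

```python
import math
import random

def forward_diffuse(x0, sigma, T=1.0, n_steps=200, seed=0):
    """Euler-Maruyama discretization of dx_t = sigma(t) dw_t."""
    rng = random.Random(seed)
    dt = T / n_steps
    x = x0
    for i in range(n_steps):
        x += sigma(i * dt) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

# With sigma(t) = 1, x_T ~ N(x0, T), so Var[x_T - x0] should be near T = 1.
diffs = [forward_diffuse(0.0, lambda t: 1.0, seed=s) for s in range(500)]
var = sum(d * d for d in diffs) / len(diffs)
print(round(var, 2))
```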
AR and Diffusion started out as two local kings, each dominant in its own territory, text and image respectively. But things began to change.
Episode 2: Auto-Regression for Image Generation
- AR (Auto-regression) has unified all modalities by tokenizing text, image, audio, video, etc., and predicting the next token.
- The last hurdle: AR image generation is (1) slow and (2) low quality.
- VAR (Visual AutoRegression) provides image generation via next-scale prediction.
Three Different Autoregressive Generative Models
- Autoregressive Transformer (GPT, Llama, PaLM, etc.)
    - AR: text generation by next-token prediction
- AR Transformer for images (GPT, VQGAN, Parti, etc.)
    - AR: image generation by next-image-token prediction
- Visual Autoregressive Transformer (VAR)
    - VAR: image generation by next-scale (not next-token) prediction
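The step-count saving of next-scale prediction is easy to quantify. Assuming a 10-scale pyramid schedule similar to the one reported in the VAR paper for 256×256 images (treat the exact numbers as illustrative):

```python
# Next-scale AR predicts all tokens of one scale in parallel, so the number
# of sequential steps equals the number of scales, not the token count.
scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]   # side length of each token map

total_tokens = sum(s * s for s in scales)    # tokens in the whole pyramid
raster_ar_steps = 16 * 16                    # next-token AR on final 16x16 map
var_steps = len(scales)                      # next-scale AR: one step per scale

print(total_tokens, raster_ar_steps, var_steps)  # 680 256 10
```

Even though the pyramid holds more tokens in total (680 vs. 256), next-scale prediction needs only 10 sequential passes instead of 256, which is where the speedup over next-token image AR comes from.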