Math Optimization - Convex Optimization

Posted on 2023-05-01 | In Math_AI

Reference

Youtube video: 4.2 Accelerated Gradient Descent - YouTube UT Austin: Constantine Caramanis. Very good series.

Background

用 convex function 卻使用 gradient descent, sub-gradient method, momentum, etc. 得到更多的 insight!!

而不是用 2nd order statistics like Hessian 或是牛頓法！

這樣更有物理和數學 insights!!

Lesson Learned

Convex optimization 基本是所有 optimization algorithm 的鼻祖。Convex optimization 有非常高效率的 solutions for large scale 問題。

雖然很多問題本身不是 convex, 但可以轉換成 convex optimization 再利用現成的 solution.

另一個想法是利用 convex function 驗證 algorithm!!

Convex, Smooth, Strict Convex 定義

完整的凸函數包含兩個部分：convex domain ($\R^n$) + convex function ($\R^n\to\R$)

(Constrained) domain $C \subset \R^n$ 是 convex set (凸集)：

$x_1, x_2 \in C \, \to \, t x_1 + (1-t) x_2 \in C$ where $0\le t \le 1$
- Unconstraint domain $C = \R^n$ 也是 convex set.
Convex function:
\[x_1, x_2 \in C \,\to \, f\left(t x_1 + (1-t) x_2\right) \le t f(x_1) + (1-t) f(x_2)\]

擴展 convex function 其他等價定義： Convex function 有下列等價定義

$f(x)$ 是一個 a convex function ($\R^n\to \R$) + domain ($C \subset \R^n$) 是 convex set
$\text{epi-}f$ 是一個 convex set in $\R^{n+1}$! i.e. $x’_1, x’_2 \in \text{epi-}f\, \to \, t x’_1 + (1-t) x’_2 \in \text{epi-}f$ where $0\le t \le 1$

什麽是 Epigraph (Epi - on ; graph - write)?

epi-$f$ 幾何顯示如下圖。就是 $f(x)$ (含) 以上所包含的藍色區域，非常直覺。

這個例子顯示：epi-$f$ 的維度是: 定義域維度 (1D) + 值域維度 (1D) = 2D (in these cases)

$f(x)$ 的定義域是 n 維凸集，epi-$f$ 是 n+1 維凸集。
幾何圖形更清楚。就是所有 $f(x)$ 以上畫藍色的區域 (2-D)。注意定義域 $x$ 是 1-D, 值域也是 $1$-D. 但是藍色的 epi-f 是 2D.

Epi-$f$ 有什麽用途? 可以把凸函數 (convex function) 和凸集 (convex set) 結合一起。

左圖：$f(x)$ 顯然是非凸函數，對應的 epi-$f$ 是非凸集。可以證明反之也爲真！
中/右圖： $f(x)$ 是凸函數，則 epi-$f$ 是凸集。

Epigraph 的正式定義： $\text{epi-}f = {(x,t)

x \in \text{dom} f, f(x) \le t}$

因爲 $x \in \text{dom}f \subset \R^n$, $t \in \R \to$ $(x, t) \in \R^{n+1}$. 也就是 $\text{epi-}f$ 是 n+1 維度的 set.
(Bad explanation：值域找定義域) 對於任意 $t \in \R$, 就是找出所有 $x \in \text{dom}f$ 滿足 $f(x) \le t$.
- 顯然如果 $t < \min f(x)$, 沒有任何 $(x, t)$ 滿足 $f(x)\le t$
- 隨著 $t$ 變大，開始存在 $(x,t)$ such that $f(x) \le t$.
(Right explanation：定義域找值域) 對於任意 $x \in \text{dom}f$ 找到所有 $t \ge f(x)$
如何把定義域和值域結合成新的 set? 我們可以定義 epigraph 的點是可以做 operation. $x’_1 + x’_2 = (x_1, t_1) + (x_2, t_2) = (x_1+x_2, t_1+t_2)$ and $a x’_1 = a (x_1, t_1) = (a x_1, a t_1)$.
Epigraph 不是只有定義域 $\R^n$！而是定義域 + 值域 $\R^{n+1}$！

Convex, Smooth, Strict Convex 幾何詮釋

Convex optimization: (unbounded) local minimum = global minimum

Assume $f(x)$ is a convex and differentiable function. 幾何就是 $f(x)$ 在每點 $x_0$ 的切綫上。切綫是 lower bound.

$f(x) \geqslant f(x_0)+ \nabla f \cdot (x-x_0) \quad \text{for } \forall x,x_0 \in C$

Assume $f(x)$ is convex but may not be differentiable, 可以使用 sub-gradient: 在 differentiable points gradient = sub-gradient ($g=\nabla f$), 在 non-differentiable points, 可以用 sub-gradient, $g$. Sub-gradient 形成的切綫是 lower bound.

\[f(x) \geqslant f(x_0)+g\cdot(x-x_0) \quad g \in \partial f \quad \text{for } \forall x,x_0 \in C\]

Assume $f(x)$ is Lipschitz $\beta$-smooth. 幾何上就是 $f(x)$ 在每點 $x_0$ 的切抛物綫下。切抛物綫是 upper bound.

$f(x) \leqslant f(x_0)+\nabla f \cdot (x-x_0)+\frac{\beta}{2}\|x-x_0\|^2 \quad \text{for } \forall x,x_0 \in C$

Assume $f(x)$ is strongly $\alpha$-convex. 幾何上就是 $f(x)$ 在每點 $x_0$ 的切抛物綫上。切抛物綫是 upper bound, 是比切綫更緊的 lower bound.
Assume $f(x)$ is strongly $\alpha$-convex and $\beta$-smooth. 幾何上就是 $f(x)$ 在每點 $x_0$ 都被兩個切抛物綫夾住，如下圖。可以證明 $\beta > \alpha$ $f(x) \geqslant f(x_0)+\nabla f \cdot (x-x_0)+\frac{\alpha}{2} \|x-x_0\|^2 \quad \text{for } \forall x,x_0 \in C$

Use Proximal operator $f(y) \geqslant f(x)+\operatorname{prox}(y-x) ?$

Convex Optimization

Error, Convergence Rate

Sub-gradient method

1st order method: use subgradient (gradient = subgradient if differentiable)
step size: variable, approach 0 after T iterations

Gradient descent

1st order method: use 1st derivative gradient
step size: fixed

Newton method

2nd order method: use 2nd derivative
step size: variable and based on Gradient and Hessian

	Convex	Convex + $\beta$-smooth	$\alpha$-strict Convex + $\beta$-smooth
error
rate
oracle rate

	Iteration	Error
sub-gradient method on convex	$O(1/\epsilon^2)$	$O(1/\sqrt{T})$
gradient descent on convex		$O(1/T)$
gradient descent on convex + smooth
gradient descent on $\alpha$-strict convex + $\beta$-smooth		$O\left(\left(\frac{K-1}{K+1}\right)^T\right)$ $K = \beta/\alpha > 1$
Oracle lower bound on convex (Lipschitz continuous)		$O(1/\sqrt{T})$
Oracle lower bound on convex + Lipschitz smooth		$O(1/T^2)$
Oracle lower bound on strongly convex + smooth		$O\left(\left(\frac{\sqrt{K}-1}{\sqrt{K}+1}\right)^T\right)$

Momentum gradient descent on convex + $\beta$-smooth		$O(1/T^2)$	Nestrov
Momentum gradient descent on $\alpha$-strict convex + $\beta$-smooth		$O\left(\left(\frac{\sqrt{K}-1}{\sqrt{K}+1}\right)^T\right)$
Proximal gradient on convex		$O(1/T)$
Proximal gradient on smooth and strongly convex

Algorithm	Convex	Convex+ $\beta$-smooth	$\alpha$-strongly Convex + $\beta$-smooth	Per Iteration Cost
Sub-gradient method	$O(1/\epsilon^2)$
Gradient Descent	$O(1/\epsilon)$
Oracle lower bound
Momentum GD	$O(1/\sqrt{\epsilon})$
Proximal gradient

Q&A : How about ADAM and other convergence algorithm on convex function, smooth, and strict convex + smooth????

For constraint problem, non-differentiable

projected subgradient

Stochastic subgradient optimization

Projection –> proximal operator

Proximal optimization

Legendre transform!!

Momentum Method: to close the gap between GD vs. Oracle lower bound

K = b/a, 代表 max / min eigenvalues for quadratic function

Projections and Proximal Operators

Math Optimization - Conjugate Convex

Posted on 2023-05-01 | In Math_AI

Reference

Youtube video: 4.2 Accelerated Gradient Descent - YouTube UT Austin: Constantine Caramanis. Very good series.

Background

Optimization 是非常有趣的問題。看似無常形無常法。

例如可以把 constraints 變成 Lagrange multiplier

$\min_x f(x)$ s.t. $g(x)=0$ $\to$ $\min_{x, \lambda} \mathcal{L}(x, \lambda) = f(x) + \lambda g(x)$

除了標準形，理論上也可以做各種變形

$\min_{x, \lambda} \mathcal{L}(x, \lambda) = f(x) + \lambda | g(x)|^2$

或是 $\min f(x) + \lambda_1 g(x) + \lambda_2 |g(x)|^2$ … 等等

或是把一個 domain optimization 換到另一個 domain optimization.

一個概念是把 time domain 的 optimization, 例如最小化振幅，轉換 (Fourier transform) 成 frequency domain 的 optimization, 例如最大化頻譜。注意這裏只是概念，系統的 domain 轉換就是 primal-dual optimization. 或是 Legendre transform 的 optimization.

對於 convex optimization, 還是可以整理出一個框架 (framework)。

Primal-Dual Framework

Convex optimization 有所謂 primal and dual problem.

有幾種不同的 primal-dual formulation.

第一類就是最常見的 Lagrange duality.

第二類是 conjugate duality.

所謂的 primal-dual 都是 $\min \max$ 和 $\max \min$ or $\inf \sup$ 和 $\sup \inf$

Lagrange Duality

Conjugate Duality

第二種方法是利用 conjugate function 解釋 primal and dual.

先看 convex conjugate 定義：

上面的定義太數學了。我們看幾何的意義。用最單純的二次函數爲例：$f(x) = c x^2$

conjugate function: $f^* (x^{})= \sup(x^ x - f(x))$

conjugate function 定義包含 maximum：因此 conjugate function 的 input $x^* = f’(x)$, 也就是原函數的斜率。$f^(x^)$ 的值就是通過 (x, f(x)) 切綫和 $y$ 軸的截距。可以證明 conjugate 是 concave (-convex) function. 所以也稱爲 convex conjugate.

既然 conjugate 是 -convex function, 也可以找到 maximum (or minimum 如果乘 -1). 不過 conjugate optimization 和原始 function 的 optimization 有任何關係嗎? YES!

如下圖：

(P) Primal 問題: $\min_u F_P(u)$ 和 (D) Dual 問題: $\max_p F_D(p)$

可以結合成 min-max 問題？還蠻奇怪的。

Math AI - Stochastic Differential Equation Forward

Posted on 2023-05-01 | In Math_AI

Reference

PII: 0304-4149(82)90051-5 (core.ac.uk) @andersonReversetimeDiffusion1982 : 專門針對 reverse time SDE equations.

http://pordlabs.ucsd.edu/pcessi/theory2019/gardiner_ito_calculus.pdf

Ito equations

Introduction

隨機過程簡單系統

一個問題是隨機過程簡單系統有什麽實際用途？

非常多，例如 noisy control system, robust control system, finance system

更進一步，隨機過程的時光倒流 (減熵) 有什麽用途？

信號處理的 smoothing 問題

最近大熱門的 diffusion process 就是利用時光倒流來 train 一個 neural network.

之後可以用 noise 產生 image. 更廣汎說，可以從一個機率分佈 transform 到另一個機率分佈。作爲生成式 AI 或是 style transfer AI.

有幾類隨機過程問題：

Signal + noise estimate the original signal (Kalman filter, Kalman-Bucy filter)
布朗運動 : 描述現象 only?
Distribution A + noise : diffusion denoise

此處我們討論隨機微分方程 (Ito equation) 的 forward-time 和 backward-time 的形式和對應 Fokker-Planck equations。

(Forward) Ito Differential Equation

我們從 general 的 stochastic differential equation (SDE) 開始： $\begin{align} d x_t=f\left(x_t, t\right) d t+g\left(x_t, t\right) d w_t \end{align}$ where $w_t$ has the usual properties. (就是 Wiener process, white noise)

以上是 random samples or process 的微分方程。可以用來作爲 Monte Carlo 模擬，但無法直接計算機率分佈如何隨時間變化。所以可以導入機率分佈的偏微分方程 Fokker-Planck equation 如下。

Forward unconditioned equation (Fokker-Planck equation)

也就是 Fokker-Planck Equations, yields $\begin{align} -\frac{\partial p\left(x_t, t\right)}{\partial t}= & \sum_i \frac{\partial}{\partial x_t^i}\left[p\left(x_t, t\right) f^i\left(x_t, t\right)\right] -\frac{1}{2} \sum_{i, j, k} \frac{\partial^2\left[g^{i k}\left(x_t, t\right) g^{j k}\left(x_t, t\right) p\left(x_t, t\right)\right]}{\partial x_t^i \partial x_t^j} . \end{align}$ 以上可以説是 general diffusion equation.

左邊是機率分佈對時間微分。
右邊第二項是機率分佈對空間的二次微分，就是機率分佈 Laplacian (空間 tension)。
右邊第一項是機率分佈的 transient 項？如果 eigenvalue 實部小於 0, 會隨時間 exponential 消失。

Ito Differential Equation Examples

Coefficients Without $x$ Dependence (Forward)

$\mathrm{d}x = a(t) \,\mathrm{d}t + b(t)\, \mathrm{d}W(t)$

此處 $a(t)$ 和 $b(t)$ 是 non-random function of time. 上式兩側積分：

$x(t) = x_0 + \int_{t=t_0}^{t} a(t) \,\mathrm{d}t + \int_{t=t_0}^{t} b(t)\, \mathrm{d}W(t)$

因爲 $W(t)$ 是 Wiener (zero-mean Gaussian) process. $x(t)$ 可以視爲 Gaussian random variable 的 linear combination, 還是 zero-mean Gaussian.

這個 Gaussian process 的 mean:

$\langle x(t)\rangle=\left\langle x_0\right\rangle+\int_{t_0}^t a(t) d t$ 因爲右邊第二項 Ito 積分的 mean 為 0. 再來計算 auto-correlation (including variance): $\begin{align} \langle[x(t) & -\langle x(t)\rangle][x(s)-\langle x(s)\rangle]\rangle \equiv\langle x(t), x(s)\rangle \\ & =\left\langle\int_{t_0}^t b\left(t^{\prime}\right) d W\left(t^{\prime}\right) \int_{t_0}^s b\left(s^{\prime}\right) d W\left(s^{\prime}\right)\right\rangle=\int_{t_0}^{\min (t, s)}\left[b\left(t^{\prime}\right)\right]^2 d t^{\prime} \end{align}$ 有了 Gaussin 的 mean and variance 就可以完全決定 $x(t)$.

Random process 的 mean depends on $a(t)$ 對時間的積分。
Random process 的 variance 隨著時間增加 ( $b(t)^2 > 0$ )。不過是否收斂 depending on $b(t)^2$. 如果是負的 exponential function 則收斂。
非 stationary process!
$x(t)$ 的 frequency domain spectrum 就是 FFT of auto-correlation function.

Multiplicative Linear White Noise Process (Forward)

$\mathrm{d}x = c x\, \mathrm{d}W(t)$

其 sample 的解形式如下 (make y = log x) : $x(t)=x\left(t_0\right) \exp \left\{c\left[W(t)-W\left(t_0\right)\right]-\frac{1}{2} c^2\left(t-t_0\right)\right\}$ We can calculate the mean by using the formula for any Gaussian variable $z$ with zero mean $\langle\exp z\rangle=\exp \left(\left\langle z^2\right\rangle / 2\right)$ so that $\begin{aligned} \langle x(t)\rangle=\left\langle x\left(t_0\right)\right\rangle & \exp \left[\frac{1}{2} c^2\left(t-t_0\right)-\frac{1}{2} c^2\left(t-t_0\right)\right] \\ = & \left\langle x\left(t_0\right)\right\rangle . \end{aligned}$ This result is also obvious from definition, since $\begin{align} & \langle \mathrm{d} x\rangle=\langle c x \, \mathrm{d} W(t)\rangle=0 \quad \text { so that } \\ & \frac{\mathrm{d}\langle x\rangle}{\mathrm{d} t}=0 . \end{align}$ We can also calculate the autocorrelation function $\begin{aligned} \langle x(t) x(s)\rangle & =\left\langle x\left(t_0\right)^2\right\rangle\left\langle\exp \left\{c\left[W(t)+W(s)-2 W\left(t_0\right)\right]-\frac{1}{2} c^2\left(t+s-2 t_0\right)\right\}\right\rangle \\ & =\left\langle x\left(t_0\right)^2\right\rangle \exp \left\{\frac{1}{2} c^2\left[\left\langle\left[W(t)+W(s)-2 W\left(t_0\right)\right]^2\right\rangle-\left(t+s-2 t_0\right)\right]\right\} \\ & =\left\langle x\left(t_0\right)^2\right\rangle \exp \left\{\frac{1}{2} c^2\left[t+s-2 t_0+2 \min (t, s)-\left(t+s-2 t_0\right)\right]\right\} \\ & =\left\langle x\left(t_0\right)^2\right\rangle \exp \left[c^2 \min \left(t-t_0, s-t_0\right)\right] \end{aligned}$

Random process 的 mean 不隨時間變化。Multiplicative white noise.
Random process 的 variance depends on $c > 1$ or $c <1$. 如果大於 1, variance exponentially 增加。
非 stationary process!
$x(t)$ 的 frequency domain spectrum 就是 FFT of auto-correlation function.

Ornstein-Uhlenbeck Process (Forward)

假設 $x$ 是 non-deterministic $n$-dimentional process, 並由下述綫性微分方程表示：

\[\begin{align} \mathrm{d} x=A x \mathrm{~d} t+B \mathrm{~d} w . \\ \end{align}\]

其中 $A, B$ 都是常數矩陣。$w$ 是 vector Wiener process.
$\begin{align} x(t)= x(0) \mathrm{e}^{At} + \int_{0}^t \mathrm{e}^{A(t-s)} B \mathrm{~d} w(s) \end{align}$

\[\langle x(t)\rangle=\left\langle x_0\right\rangle \mathrm{e}^{At}\]

The correlation function follows similarly $\begin{aligned} \left\langle\boldsymbol{x}(t), \boldsymbol{x}^{\mathrm{T}}(s)\right\rangle \equiv & \left\langle[\boldsymbol{x}(t)-\langle\boldsymbol{x}(t)\rangle][\boldsymbol{x}(s)-\langle\boldsymbol{x}(s)\rangle]^{\mathrm{T}}\right\rangle \\ = & \mathrm{e}^{A t}\left\langle\boldsymbol{x}(0), \boldsymbol{x}^{\mathrm{T}}(0)\right\rangle \mathrm{e}^{A^T s} +\int_0^{\min (t, s)} \mathrm{e}^{A\left(t-t^{\prime}\right)} B B^{\mathrm{T}} \mathrm{e}^{A^T\left(s-t^{\prime}\right)} d t' \end{aligned}$

(Variance) Stationary Version ($A$ 的 eigenvalues 實部小於 0)

如果 $A$ 的 eigvalues 實部小於 0，存在 wide-sense stationary solution (mean and variance): $\begin{align}x(t)=\int_{-\infty}^t \mathrm{e}^{A(t-s)} B \mathrm{~d} w(s) .\end{align}$

\[\langle x(t)\rangle= 0\] \[\begin{aligned}\left\langle\boldsymbol{x}(t), \boldsymbol{x}^{\mathrm{T}}(s)\right\rangle = \int_{-\infty}^{\min (t, s)} \mathrm{e}^{A\left(t-t^{\prime}\right)} B B^{\mathrm{T}} \mathrm{e}^{A^T\left(s-t^{\prime}\right)} d t'\end{aligned}\]

上式乍看之下很難是 stationary，但是如果我們定義 (time average, 不是 space average!) $\begin{aligned}\sigma = \left\langle\boldsymbol{x}(t), \boldsymbol{x}^{\mathrm{T}}(t)\right\rangle \end{aligned}$ 可以證明 $A \sigma + \sigma A^T = B B^T$ [reference Gardiner Ito Calculus]

因爲 $A, B$ 都是 constant matrix, 所以 $\sigma$ 本身也是 constant, 也就是 stationary! $\begin{align}P=E\left[x(t) x^{T}(t)\right] = \left<x(t)x^T(t)\right> = \sigma .\end{align}$

Appendix

Math AI - Diffusion vs. SDE

Posted on 2023-05-01 | In Math_AI

Reference

2011.13456.pdf (arxiv.org) : good reference for Diffusion to SDE from Stanford

PII: 0304-4149(82)90051-5 (core.ac.uk) @andersonReversetimeDiffusion1982 : 專門針對 reverse time SDE equations.

http://pordlabs.ucsd.edu/pcessi/theory2019/gardiner_ito_calculus.pdf

Ito equations

Introduction

全部的重點就是下圖。用一個神經網路逼近 score function! 因爲我們不知道 $p(x)$

Background

Method 1: denoising Score Matching with Langevin Dynamics (SMLD)

主要是利用 neural network 去近似 p(x) 的 score function, i.e. gradient of log likelihood.

Method 2: denoising diffusion probabilistic model (DDPM)

SDE and Reverse SDE

Forward SDE

Backward SDE

DDPM (Variance Preserving SDE)

SMLD (Variance Exploding SDE)

Sub-Variance Preserving SDE

Math AI - Stochastic Differential Equation Backward

Posted on 2023-04-29 | In Math_AI

Reference

PII: 0304-4149(82)90051-5 (core.ac.uk) @andersonReversetimeDiffusion1982 : 專門針對 reverse time SDE equations.

http://pordlabs.ucsd.edu/pcessi/theory2019/gardiner_ito_calculus.pdf

Ito equations

Introduction

時光倒流 (time reversal) 在日常生活幾乎不可能。所有的觀察 (非實驗結果) 歸納出熱力學第二定律：熵隨時間增加。因此複雜系統的演進過程，例如生米煮成熟飯，只能一個方向發展，時光倒流看起來不可能發生。但在微觀系統的方程式並沒有時光倒流的矛盾s問題，也就是 forward-time 和 backward-time 都是完全合理物理運動，而且對稱。只要把方程式的 $t \to -t$ 基本就可以描述 time reversal.

回到日常生活，對於 (非隨機過程) 簡單系統，例如撞球或是星球運行，時光倒流是合理可以發生。同樣也是把方程式的 $t \to -t$ 就可以描述。

但對於隨機過程簡單系統，這就非常有趣：到底是類似熱力學熵增加，無法時光倒流。還是比較類似微觀或簡單系統存在時光倒流，並可用簡單的 time symmetry, 也就是 $t \to -t$, 就可以描述?

答案是：Yes, 存在時光倒流。No, 不是簡單的 time symmetry ($t \to -t$).

其他問題：

簡單隨機系統有熵 (entropy) 增加的概念？或甚至 governed by entropy increase law?
- 因爲有 randomness, 所以還是有 entropy 概念？但不一定是熵增加。在 non-stationary transient process, 還是可以做時光倒流 (time reversal) 或是熵減小。

	Forward-time	Backward-time
複雜系統	熱力學熵增加	No (熵無法變小)
微觀系統	量子力學，場論 (no entropy)	Yes, $t \to -t$
簡單非隨機系統	古典力學 (no entropy)	Yes, $t \to -t$
簡單隨機系統	隨機微分方程 (SDE): Ito or Fokker-Planck equations	Yes, reverse SDE

隨機過程簡單系統

一個問題是隨機過程簡單系統有什麽實際用途？

非常多，例如 noisy control system, robust control system, finance system

更進一步，隨機過程的時光倒流 (減熵) 有什麽用途？

信號處理的 smoothing 問題

最近大熱門的 diffusion process 就是利用時光倒流來 train 一個 neural network.

之後可以用 noise 產生 image. 更廣汎說，可以從一個機率分佈 transform 到另一個機率分佈。作爲生成式 AI 或是 style transfer AI.

有幾類隨機過程問題：

Signal + noise estimate the original signal (Kalman filter, Kalman-Bucy filter)
布朗運動 : 描述現象 only?
Distribution A + noise : diffusion denoise

此處我們討論隨機微分方程 (Ito equation) 的 forward-time 和 backward-time 的形式和對應 Fokker-Planck equations。

(Forward) Ito Differential Equation

Forward unconditioned equation (Fokker-Planck equation)

左邊是機率分佈對時間微分。
右邊第二項是機率分佈對空間的二次微分，就是機率分佈 Laplacian (空間 tension)。
右邊第一項是機率分佈的 transient 項？如果 eigenvalue 實部小於 0, 會隨時間 exponential 消失。

Backward (not reverse-time, not $t \to -t$) Kolmogorov equation

for $s \geqslant t$ is

\[\begin{align} -\frac{\partial p\left(x_s, s \mid x_t, t\right)}{\partial t}= & \sum_i f^i\left(x_t, t\right) \frac{\partial p\left(x_s, s \mid x_t, t\right)}{\partial x_t^i} +\frac{1}{2} \sum_{i, j, k} g^{i k}\left(x_t, t\right) g^{j k}\left(x_t, t\right) \frac{\partial^2 p\left(x_s, s \mid x_t, t\right)}{\partial x_t^i \partial x_t^j} \end{align}\]

(Backward) Ito Differential Equation

推導參考 reference and appendix. Backward Ito Differential Equation 如下：

\[\begin{align} \mathrm{d} \bar{x}_t=\bar{f}\left(\bar{x}_t, t\right) \mathrm{d} t+g\left(\bar{x}_t, t\right) \mathrm{d} \bar{w}_t \end{align}\]

where $\bar{f}^i$ is: $\begin{align} \bar{f}^i\left(x_t, t\right)=f^i\left(x_t, t\right)-\frac{1}{p\left(x_t, t\right)} \sum_{i, k} \frac{\partial}{\partial x_t^i}\left[p\left(x_t, t\right) g^{i k}\left(x_t, t\right) g^{i k}\left(x_t, t\right)\right] \end{align}$

Ito Differential Equation Examples

綫性 Ornstein-Uhlenbeck Equation (Foward and Backward)

假設 $x$ 是 non-deterministic, wide-sense stationary $n$-dimentional process, 並由下述綫性微分方程表示：

\[\begin{align} \mathrm{d} x=A x \mathrm{~d} t+B \mathrm{~d} w . \\ \end{align}\]

其中 $A, B$ 都是常數矩陣。$A$ 的 eigvalues 實部小於 0. $w$ 是 vector Wiener process.

這個 forward-time 的解形式 (representation) 如下： $\begin{align} x(t)=\int_{-\infty}^t \mathrm{e}^{A(t-s)} B \mathrm{~d} w(s) . \end{align}$

此處假設 $A$ 的 eigenvalues 實部小於 0 (隨時間增加收斂)。

我們可能會推測 reverse-time 的解形式 (representation) 如下。但這是 time-reverse ($t\to -t$) 的想法。就是 $x(t)$ 時間減小而變大。 $\begin{align} x(t)=-\int_t^{\infty} \mathrm{e}^{\bar{A}(t-s)} \bar{B} \mathrm{~d} \bar{w}(s) . \\ \end{align}$

我們這裏想得到的是一個 forward-time representation of “$x(t)$ a reverse-time representation”. 就是把 reverse-time 轉換 (unfold) 成 forward-time 的 solution. 此時新的 $x(t)$ 隨著時間增加而變大。所以我們要魔改原始的 Ito SDE。A reverse-time SDE is： $\begin{align} \mathrm{d} x=\bar{A} x \mathrm{~d} t+\bar{B} \mathrm{~d} \bar{w} \\ \end{align}$

此處 $\bar{A}$ 的 eigenvalues 實部大於 0.
$\begin{align} P=E\left[x(t) x^{\prime}(t)\right] . \end{align}$

The matrix $P$ is the solution of the linear matrix equation $P A^{\prime}+A P=-B B^{\prime}$, and is nonsingular precisely when rank $\left[B ,A B, \cdots A^{n-1} B\right]=n$.

Suppose $P$ is nonsingular, and define a vector process $\bar{w}$ by $\begin{align} \mathrm{d} \bar{w}=\mathrm{d} w-B^{\prime} P^{-1} x \mathrm{~d} t, \quad \bar{w}(0)=0, \end{align}$ which in conjunction with 原始的 Ito equation implies $\begin{align} \mathrm{d} x=\left(A+B B^{\prime} P^{-1}\right) \mathrm{d} t+B \mathrm{~d} \bar{w} = \bar{A} \mathrm{d}t + \bar{B} \mathrm{d}\bar{w} \end{align}$

where

\[\begin{align} \bar{A} = A + B B' P^{-1}; \quad \bar{B} = B \end{align}\]

可以證明 $\bar{A} = A + B B’ P^{-1}$ 的 eigenvalues 實部大於 0。 $\bar{w}(t)$ 是 vector Wiener process with $x(t)$ independent of past increments of $\bar{w}$, but not of future ones.

Equations (12) and (13) 基本是 equations (4) and (5).

Backward Fokker-Planck Solution

\[\begin{align} -\frac{\partial p\left(x_t, t\right)}{\partial t}= & A \sum_i \frac{\partial}{\partial x_t^i} p\left(x_t, t\right) -\frac{1}{2} B B' \sum_{i, j, k} \frac{\partial^2 p\left(x_t, t\right)}{\partial x_t^i \partial x_t^j} . \end{align}\]

One has $p\left(x_t\right)=\frac{1}{(2 \pi)^{n / 2}|P|^{1 / 2}} \exp \left\{-\frac{1}{2} x_t^{\prime} P^{-1} x_t\right\}$ 重點應該是 $P$ 應該隨著時間改變。

Here, $P$ is the solution of $P A^{\prime}+A P=-B B^{\prime}$, and is assumed nonsingular. Then $\begin{aligned} \frac{1}{p\left(x_t\right)} \sum_j \frac{\partial}{\partial x_t^j}\left[p\left(x_t\right) B^{j k}\right] & =-\sum_j \left(P^{-1}\right)^{j i} x_t^i B^{j k} \\ & =-k \text { th entry of } B^{\prime} P^{-1} x . \end{aligned}$ Thus, following (3.10), $\mathrm{d} \bar{w}_{\mathrm{t}}=\mathrm{d} w-B^{\prime} P^{-1} x \mathrm{~d} t$

Appendix

Appendix A

Because $p\left(x_t, t, x_s, s\right)=p\left(x_s, s \mid x_t, t\right) p\left(x_t, t\right)$ we can attempt to obtain a partial differential equation for $p\left(x_t, t, x_s, s\right)$, regarding $x_t, t$ as the independent variables and $x_s, s$ as parameters. We obtain, combining (5.2) through $(5.4)$ $\begin{aligned} \frac{\partial p\left(x_t, t, x_s, s\right)}{\partial t}= & \text { terms involving } f, g, p\left(x_t, t\right) \text { and } p\left(x_s, s \mid x_t, t\right) \\ & \text { and their } x_t \text {-derivatives. } \end{aligned}$ tions is, for $s \geqslant t$ $\begin{aligned} -\frac{\partial}{\partial t} p\left(x_t, t, x_s, s\right)= & \sum_i \frac{\partial}{\partial x_t^i}\left[\bar{f}^i\left(x_t, t\right) p\left(x_t, t, x_s, s\right)\right] \\ & +\frac{1}{2} \sum_{i, i, k} \frac{\partial^2\left[p\left(x_t t, x_s, s\right) g^{i k}\left(x_t, t\right) g^{j k}\left(x_t, t\right)\right]}{\partial x_t^i \partial x_t^i} \end{aligned}$ where $\bar{f}^i$ is as before, viz, $\bar{f}^i\left(x_t, t\right)=f^i\left(x_t, t\right)-\frac{1}{p\left(x_t, t\right)} \sum_{i, k} \frac{\partial}{\partial x_t^i}\left[p\left(x_t, t\right) g^{i k}\left(x_t, t\right) g^{i k}\left(x_t, t\right)\right]$ The same partial differential equation (but with different boundary conditions of course) is satisfied by $p\left(x_t, t \mid x_s, s\right)$ [and in fact $p\left(x_t, t\right)$ - this is trivial to see. Just as (5.3) corresponds to the forward model $(5.1)$, so then (5.5) has to correspond to the reverse model

\[\mathrm{d} \bar{x}_t=\bar{f}\left(\bar{x}_t, t\right) \mathrm{d} t+g\left(\bar{x}_t, t\right) \mathrm{d} \bar{w}_t\]

Chatgpt_lang

Posted on 2023-04-22

如何用 ChatGPT 學習語言？

語言部分

Translation
Example
Synonym and antonym

對話部分

Speech recognition
對話
Text-to-language

Math - 積分

Posted on 2023-04-16 | In Math

Takeaway

黎曼 (Riemann) 可積分一定勒貝格 (Lesbegue) 可積分

我不認為豎著切和橫著切本質是一樣的。雖然對於連續函數兩者的值是一樣的，但本質上的區別在於單調性，和隨之而來的完備性。

黎曼積分我們在豎著切的時候，由於我們對函數的波動起伏沒有假設，所以，每一個切片的大小都是未知的。每次細化切分/改變每段的參照點的時候，帶來的面積變化也是不固定的。

但對於勒貝格積分，由於只是在考慮單值函數下面的陰影面積，所以當我們橫著切的時候（用step function逼近的時候），我們可以保證切面的測度是隨高度單調遞減的。就像漢諾塔（或者我失敗的堆成小山一樣的午餐三明治），下面的一層總是比上面的高。所以我們每次細化切分/改變參照點的時候，面積的變化是單調的。從而單調收斂定理保證，我們總可以用簡單函數逼近得到勒貝格積分的值。而這種來自於單調收斂定理的對極限的封閉的完備性，正是我們引入勒貝格積分的重要原因之一。

一般情況下涉及到具體計算都會使用黎曼積分，況且大多數情況下他們是相等的。
至於勒貝格積分的範圍比黎曼積分的範圍廣也是錯的，存在黎曼可積但是勒貝格不可積的函數，反之亦然。

從兩種積分本身的性質來看，Lebesgue積分是絕對收斂的積分，而Riemann積分不是。具體地說，我們知道，對於Riemann常義積分（定積分），可積則絕對可積，反之不對；而對於Riemann廣義積分，絕對可積則可積，反之不對。也就是說，Riemann積分意義下的絕對可積與可積是不等價的。而對於Lebesgue積分，我們知道，可積與絕對可積是一回事，也就是說，絕對可積與可積到了Lebesgue積分的意義下變成了等價關係，Lebesgue積分將絕對可積與可積統一起來。這是Lebesgue積分優於Riemann積分的一個重要區別。

從兩種積分意義下的可積函數類來看，Lebesgue可積函數構成的線性空間是完備的，而Riemann可積函數類不是完備的，也就是說，一個Riemann可積函數列的極限函數可能不再是Riemann可積的，而Lebesgue可積函數類對極限運算是封閉的。這是從函數類上看Lebesgue積分優於Riemann積分的另一個重要特點。

測度論

測度論是屬於實變領域，是在數學當中的分析領域，實變就是實分析 — 顧名思義就是在分析實數空間中的性質（實數與複數空間的差異頗大，所以還有另一個複變領域）。其實基礎的實分析大家都學過，就是大一學的微積分，再進階一點就是數學系的大家口中抱怨的高等微積分，到了研究所就會進入實變函數論。

會有測度論的發展，是因為為了要拓展黎曼積分而開始的。如下圖左所示，黎曼積分是從實軸去分割區間，把分成很多長方形區間來積分，然後在區間中任取一個tk當作長方形的高f(tk)，不過這樣會遇到一個問題：如果在一個區間內f(tk)差很多，這個積分值可能會受到tk取值的影響，故我們說這個函數黎曼不可積。

如果我們不要從實軸去分割，而是從函數值分割區間呢？就像圖片一、(右)所看到的，對a這個點去切，蒐集所有函數大於a的集合Ea={x

f(x)>a}，所以塗色面積就會是a

，

就是我們要去算的集合長度的概念，也可以稱之為測度，因此測度論就展開了。

直觀解釋黎曼 (Riemann) 積分 and 勒貝格 (Lesbegue) 積分

要直觀地解釋兩種積分的原理，可以假設我們要計算一座山在海平面以上的體積。

黎曼積分是相當於把山分為每塊都是一平方米大的方塊，測量每個方塊正中的山的高度。每個方塊的體積約為1x1x高度，因此山的母體積為所有高度的和。

勒貝格積分則是為山畫一張等高線圖，每根等高線之間的高度差為一米。每根等高線內含有的岩石土壤的體積約等於該等高線圈起來的面積乘以其厚度。因此母體積等於所有等高線內面積的和。

佛蘭德（Folland）描述黎曼積分跟勒貝格積分的不同，以非負函數 $ f:[a,b]\mapsto [0,\infty ],\;a.b\in \mathbb {R} $ 這例子來講，黎曼積分是分割 $x$-軸上的定義域區間為更小的子區間，並計算黎曼和，當子區間越來越小時黎曼和的極限就是黎曼積分；而勒貝格積分則是將 $f$ 在 $y$-軸上的對應域分割成不相交的區間 ${I_{j}}{j=1}^{n}$，並用定義域中的子集合 ${f^{-1}(I{j})=E_{j}}$ 來定義趨近 f 的簡單函數

白話就是：黎曼積分是分割定義域來計算積分；勒貝格積分則是用分割值域來計算積分。

分割值域有什麽好處? 因爲計算定義域的 “總長度” 或是 ”總面積” 可以用 ”測度理論“。比起分割定義域可能會無法計算個別無窮小定義域的 ”長度” 或是 “面積”。
最初測度理論是用來對歐幾里得空間中直線的長度，以及更廣義地，歐幾里得空間的子集的面積和體積進行仔細分析發展出來的。它尤其可以為 $\R$ 的哪些子集擁有長度這個問題提供一個系統性的回答。
對於 bouned function $f$ defined on [a,b], if $f$ is Riemann integrable, then $f$ is Lebesgue integrable.

我們看一個例子: 有理數的指示函數 $1_{\mathbb {Q} }(x)={\begin{cases}1,x\in {\mathbb {Q} }\0,x\notin {\mathbb {Q} }\end{cases}}\quad$ 是一個無處連續的函數。

在區間 $[0,1]$ 之間 $1_{\mathbb {Q} }$ 沒有黎曼積分，因為在實數中有理數和無理數都是稠密的，因此不管怎樣把 $[0,1]$ 分成子區間，每一個子區間裡面總是至少會有一個有理數和一個無理數，因此其達布積分的上限為1，而下限為0。
在區間 $[0,1]$ 內 $1_{\mathbb {Q} }$ 有勒貝格積分。事實上它等於有理數的[指示函數]，因為 $\mathbb {Q}$ 是可數集，因此 $\int _{[0,1]}1_{\mathbb {Q} }\,d\mu =\mu (\mathbb {Q} \cap [0,1])=0$

簡單說：勒貝格積分可以處理很多極限的積分。

二、集合介紹

在正式進入測度以前，要先簡單介紹在數學中重要的集合概念。集合顧名思義就是集合了一些元素，在數學中，集合還能分成可數集跟不可數集，意思就是這個集合中的元素能不能用數的，可數集又可以分成有限集與無限集（對你沒看錯，無限集也可以是可數的）。例如，有理數集合為可數集和無限集，無理數集合為不可數集。

當我們把集合的概念放入空間概念中（例如R的N次方空間），就可以定義開集(open)與閉集(closed)。在R的N次方中，開集的定義為：

「*給任意一個屬於集合A的元素a，可以找到一個足夠微小的值ε(>0)，從a延伸出去在距離ε處畫一個圈，如果圈內所有點都落在集合A裡面，則我們稱此集合A為開集合（如下圖二所示）*」。

相反的，若一個集合B的補集是開集合，則B為閉集合。所以我們也可以這樣想，當我們畫一個集合的邊界是虛線時，這個集合八成是開集合；如果有實線邊界，就會是閉集。雖然這樣想不夠嚴謹，但是比較好理解。

開集合有一些性質是必須知道的，比如說：一些（不限多少個）開集合的連集還是開集合，但有限個開集合的交集才會是開集合。相反的，在閉集合方面，根據狄摩根定律（De Morgan law）可以證得，一些（不限多少個）閉集合的交集是閉集合，有限個閉集合的連集還是閉集合。

開集合與閉集合還有很多性質，但那些都屬於高等微積分的內容，在這裡並不會細講。不過還是有兩個重要的集合在之後會提到 — Gδ和 Fσ，Gδ是一些可數個開集合的交集，而Fσ是可數個閉集合的連集。這兩個集合都不會保證是開集或閉集，因為他們不一定有是由有限個集合組成，但開集一定是由某種Gδ，且閉集一定是某種Fσ。看到這裡是不是快暈倒了呢？沒關係，後面看到這些再回來複習就好。

二、外側度

測度（measure）直觀理解上是用來量測一個集合的大小。例如，在一維的實數空間中，測度就是指集合的長度；在二維平面中，測度可以想成集合的面積，以此類推，三維空間中測度就有種體積的意義。當然，我們探討的空間可以到n維，所以必須要給測度一個更明確的定義。

不過，不是每個集合都是可以測量的（之後會提到不可測量的集合），所以在定義測度之前，要先定義了外測度(outer measure)，而這個外測度是對於所有集合都有的，就像有些函數無法做黎曼積分時，會使用上積分跟下積分來定義他的範圍的概念一樣。外測度的定義如下：

定義的意思為，對於集合E，可以被一些可數個區間Ik蓋住（covering），把這些區間蒐集成集合S（不同的covering就會有不同的S），而外測度就是指這些Ik體積總和的最大下界（就是最小可以到多小的意思），如圖三所示。也就是說，我們可以把外測度想成是在集合外面包了包裝紙，然後外測度是算包裝紙的大小，且盡量讓包裝越小越好

在定義了外測度之後，有一些定理跟性質是值得一提的。比如，如果集合E可以寫成可數個Ek的連集，那麼E的外測度會小於等於所有Ek外測度的總和，這應該是一個很直觀的定理（由廣義三角不等式）。

如果集合中只有單點(例：Ek ={ p } )，由外測度的定義可以很直觀地知道，包覆住單點的外包裝的大小可以一直縮小，所以單點的外測度是0。所以依照上面所提的定理，如果E為有理數集，則他的外測度也會是0，因為有理數集是由可數個單點所組成，把每個Ek當成有理數單點，所有Ek外測度的總和為0，所以E的外測度≤0且外測度不會是負數（僅限在此所介紹的），有理數集的外測度是0。

三、可測集、測度0

在了解外測度的定義之後，就可以真正的進入測度了。測度的定義一定要跟外測度有關，如果無關，那他們就會具有不同的意義。直接來說，如果一個集合是可測的，那麽就定義外測度就等於測度值。重點是，我們就是不知道怎麼樣的集合叫可測集呀？但如果我們目的是要讓可測集的外測度等於測度值，就要像我們剛剛所比喻的，集合的外包裝要越緊貼集合越好，也就是集合跟外包裝要幾乎沒有空隙才行。

帶著這樣的想法，我們就可以更透徹的理解可測集的定義（定義(1)）。如果我們說一個集合E是可測的，那麽對於任意小的正數ε，可以找到包含E的開集合G（外包裝），然後G扣掉E的外測度要小於ε，就如圖四、(左)所示，灰色集合要很小的意思。

其實可測集也有一個等價的定義（定義(2)）。想像一個集合也需要內包裝，那麼也是希望內包裝能夠跟集合越貼近好，所以另一個等價的定義如下：如果集合E是可測的，那麽對於任意小的正數ε，可以找到一個比E小的閉集合F（內包裝），然後E扣掉F的外測度要小於ε，就如圖四、(右)所示，藍色集合要很小的意思。

知道測度定義之後，就可以知道哪些集合是可測的。例如，開集合、閉集合、可數個可測集的連集或交集、可測集的補集⋯等（這些都可以證明），值得一提的是外測度0的集合也是可測集。這些外測度為0的可測集，因為定義外測度等於測度，所以我們也可以說他們是測度0或0測度(measure zero)。測度0在直觀上代表著那些不佔體積的集合。在一維實數空間中，測度0可以是單點，或是可數個單點；在二維空間中，直線是測度0；所以類推，在n維實數空間中，能用n-1維表示的集合都可以是測度0。

四、不可測集

上一段落說了，很多集合都是可測集，那值得疑問的是，真的有不可測集存在嗎？

義大利數學家維塔利(Vitali)就證明出存在不可測集合(non-measurable set)。在尋找的過程中，用了選擇公設跟一個定理。首先，他先定義了「*一個集合Ex為蒐集所有與x的距離是有理數的點」*，Ey為相同定義，則Ex、Ey兩集合不是全等就是不相交的集合。如果他們兩有交於至少一個點，令此點為x1，依照集合定義，x點距離x1、y點距離x1都是有理數，所以x-y也是有理數，所以y會在Ex中，故Ex、Ey兩集合會全等，如圖五所示。

六、Lebesgue 積分 vs Riemann 積分

介紹了這麼多，終於到我們這篇的重點：Lebesgue 積分。Lebesgue 積分的發展是為了要解決黎曼積分受到的限制。當一個函數不連續點夠多時，黎曼積分可能不存在，以下舉個例子來具體解釋黎曼積分的限制。

令Dirichlet 函數為：

此函數大家在微積分應該都略有所聞，是一個處處不連續的函數。當我們定義這個函數在[0,1]區間裡，在這個區間做黎曼積分的話是不可積的。因為根據黎曼積分的定義：

R(f,E)的意思是蒐集f函數以下的所有點當作這個集合，所以如果定義x在一維空間的話，那就是蒐集f函數與x軸中間的面積，其實跟一般積分的意思差不多，只是這邊用測度來表示而已(

．

n+1 是在R^(n+1)裡取測度的意思)。在這個定義之下，假設f是非負函數且定義在可測集合E，此時有另一個定理說，如果f是可測函數的話，則f在E上為Lebesgue可積。

現在知道了Lebesgue可積的定義，就可以知道黎曼積分跟Lebesgue 積分差在哪裡了。還是以Dirichlet 函數為例，如果要對Dirichlet 函數做 Lebesgue 積分，則會運用到以下這個定理：

令g=0在[0,1]區間上，又根據我們有提過有理數集合是0測度，所以f≠g 的點其實是0測度（f只在有理數點上不等於0），所以幾乎處處f=g。依據這個定理，此時Dirichlet 函數為Lebesgue可積，且f的Lebesgue 積分=g的Lebesgue 積分=0。由此可見Lebesgue成功的把黎曼積分的限制推廣出去了。

Reference

Lebesgue integral

https://en.wikipedia.org/wiki/Lebesgue_integration @wikiLebesgueIntegration2023: Lebesgue 積分英文

https://zh.wikipedia.org/zh-tw/%E5%8B%92%E8%B2%9D%E6%A0%BC%E7%A9%8D%E5%88%86 : 中文

@shionStochasticDynamic2021

https://zhuanlan.zhihu.com/p/343129740

[@ccOneArticle23]. https://zhuanlan.zhihu.com/p/589106222 very good article!!! SDE for diffusion score

https://yrgnthu.medium.com/%E5%AF%A6%E8%AE%8A%E7%9C%9F%E7%9A%84%E8%A6%81%E8%AE%80%E5%8D%81%E9%81%8D-%E5%8D%81%E5%88%86%E9%90%98%E5%B8%B6%E4%BD%A0%E4%BA%86%E8%A7%A3%E6%B8%AC%E5%BA%A6%E8%AB%96-f5a79c9cf5a7

[@yangRealAnalysis2022]

Math AI - Stochastic Differential Equation

Posted on 2023-04-16 | In Math_AI

Reference

Lebesgue integral

https://en.wikipedia.org/wiki/Lebesgue_integration @wikiLebesgueIntegration2023: Lebesgue 積分英文

https://zh.wikipedia.org/zh-tw/%E5%8B%92%E8%B2%9D%E6%A0%BC%E7%A9%8D%E5%88%86 : 中文

@shionStochasticDynamic2021

https://zhuanlan.zhihu.com/p/343129740

[@ccOneArticle23]. https://zhuanlan.zhihu.com/p/589106222 very good article!!! SDE for diffusion score

什麽是隨機微分方程 (SDE)

在幾乎所有領域，如物理，化學，生物，氣象，金融及其許多工程分支，包括機械，航天，土木，電氣工程中，動力學系統的建模與分析都是一個關鍵性任務。在建模過程中,鑒於各種原因，如系統參數的可能變化，激勵的變化，建模方案的誤差等，不確定性是不可避免的。當我們學過《高等數學》或《常微分方程》等課程之後，可知確定性激勵作用下的狀態方程是確定的，若對某一不確定性有足夠大量數據，就可用概率與統計描述該不確定性。

Diffusion process 也是 SDE 的一種。

例一：(Time Derivative) Input/Output

SDE 的關鍵: 外部作用力 + random noise

一个单自由度弹簧系统，在外激励作用下的振动方程如下

\[\begin{aligned} & m \ddot{x}+c \dot{x}+k x=F(t) \\ \end{aligned}\]

当外激励为随机过程时候，相对于确定性激励

\[\begin{aligned} & \tilde{F} (t)=F(t) + \text{"noise"} \end{aligned}\]

引入变量 $X=[x, \dot{x}]^T=\left[x_1, x_2\right]^T$ ，将原方程改写为 (一階) 状态方程，即

\[\begin{aligned} \dot{X}=A X+N \tilde{F} (t) \end{aligned}\]

将 $dt$ 除过去可得 $d X=A X d t+N \tilde{F} (t) d t$

若 $\tilde{F} (t)=0$ ，即系统无外部激励（自由振动)，原方程为 $d X=A X d t$
若 $F(t) \neq 0$ ，即系统受外激影响，原方程为随机微分方程，SDE.

例二：(Time Derivative) Hidden State/Output Estimation

Linear state space dynamic equation:

$\begin{aligned} & \dot{x}=A x+B u+w \\ & y=C x \\ & \widetilde{y}=y+v \end{aligned}$

稱為 $A, B, C, D$ matrix (此處 $D=0$)
此處 $u$ 是 input control, $w$ 是 input or state (stochastic) noise.
$y$ 是 true output, $\widetilde{y}$ 是 noisy output, $v$ 則是 output observation (stochastic) noise.

對應的 Kalman-Bucy Filter 用於 estimate $x(t), y(t)$, 就是連續版的 Kalman filter.

Kalman filter 的關鍵: SDE 的 hidden state/output 的 estimation

$\begin{aligned} & K=P C^T R^{-1} \\ & \dot{\hat{x}}=A \hat{x}+B u+K(\widetilde{y}-C \hat{x}) \\ & \dot{P}=A P+P A^T-K R K^T+Q \end{aligned}$

此處 $P$ 是 covariance of the measurement error ($v$), $Q$ 是 covariance of state error $w$.

前例基本都是假設信號本身是 deterministic, 只是被 “noise” 影響。

一般這類 “noise” 也假設是 Gaussian noise.

接下來我們要看一些不同的看法，就是信號本身就是 probablistics! 例如簡單的布朗運動，或是更複雜的影像分佈。我們要用更深入的數學, SDE, 描述這個現象！

另外是 “driving force”, 最簡單是 linear ODE. 看起來好像非常簡單。實務上 x 是 1D/2D vectors, 甚至 high dimension tensors, A 是 matrix/tensors, 可以有很多的 spatial 變化。如果 A 是 diagonal matrix 就很簡單。

但是 A 可以是 2D x vector 的差異，類似梯度 (gradient), 或是 divergence, curl 或是空間的微分 operators.

例三：(Time Derivative) Input/Output

布朗運動的關鍵: SDE 的 (外部) 作用力 = random noise

例四：(Time and Space Derivative!) Diffusion Process (Wrong!!)

(YES!) Diffusion process 的關鍵：SDE 的作用力來自 concentration gradient (濃度的梯度)，而非外部作用力。但濃度又是 SDE 的 variable. 所以是個互相影響的過程。 $\frac{d x_t}{dt} = f(x_t) = D \nabla_{x_t} \log p_t(x_t)$

Wrong!! 這是把 sample equation 和 distributin 搞混了。

Sample equation 的形式對應的解就包含 Fokker-Planck equation, 其中就有 diffusion term.

SDE 微分問題 - 標準型

隨機信號很多是連續但不可微分！一個例子就是布朗運動，數學上是處處不可微分。

所以一般不會寫成 $dX/dt = \mu(X, t)$, 而是 $dX = \mu(X, t) dt$.

從 Linear state space system 出發 : $\dot{x}=A x+B u+ w$

魔改成 stochastic dynamical system:

\[\begin{align} d X_t = a(X_t, t) dt + b(X_t, t) d W_t. \end{align}\]

此處 $a, b, W$ 對應 linear state space 的 $A, B, w$

$dt$ 都移到右邊。因為要考慮不可微分
$a, b$ 變成 function 以及包含 $t$, 而不是 linear time invariant matrix.
$b(X_t, t) d W_t$ 取代 $Bu +w$! 把 stimulus 和 noise 結合在一起。同時把 $w dt$ 改成 $dW_t$ .

(8) 是 general case. 我們考慮特例

$a(X_t, t)$ 稱為 drift coefficient; $b(X_t, t)$ 稱為 noise coefficient (drive diffusion).
SDE 稱為 stationary if $a, b$ 沒有 explicit $t$ dependent.
如果 $b$ is independent of $x$, 稱為 additive noise. 如果 $b$ depends on $x$, 稱為 multiplicative noise.
如果 $b(X_t,t) = 0$, SDE 就變成 ODE.

例二：(Time Derivative) 布朗運動

\[\begin{align} d X_t = \mu X_t \, dt + \sigma \, X_t \, d W_t. \end{align}\]

這是就簡單的 SDE.

SDE 微分和積分表示法

不妨设 $\tilde{F} (t)=W(t)$ ，引入关系式 $d B_t=W(t) d t$, 即上述随机状态方程可以改写为 $d X=A X d t+N d B_t$ 考虑受 Guass 白噪声扰动的一维动态系统，其运动微分方程形为 $d X(t)=\mu(X, t) d t+\sigma(X, t) W(t) d t$ 式中 $\mu(X, t)=\mu[X(t), t], \sigma(X, t)=\sigma[X(t), t] ， \mathrm{~W}(\mathrm{t})$ 为单位强度的Guass白噪声。

$\mu(X, t)$ 稱爲 drift coefficient
$\sigma(X, t)$ 稱爲 diffusion coefficient

此方程等价于下列积分方程: $X(t)=X_0+\int_{t_0}^t \mu[X(s), s] d s+\int_{t_0}^t \sigma[X(s), s] W(t) d s$

右邊第一個積分可解釋為均方或 sample 的 Riemann 積分。而第二個積分，利用在廣義隨機過程意義上 Gauss white noise 為 Wiener process 的導數性質，可以改寫成 $\int_{t_0}^t \sigma[X(s), s] W(t) d s = \int_{t_0}^t \sigma[X(s), s] d W(s)$

Ito 微分方程形为 $d X(t)=\mu(X, t) d t+\sigma(X, t) d W(s)$ Ito 積分方程形为 $X(t)=X_0+\int_{t_0}^t \mu[X(s), s] d s+\int_{t_0}^t \sigma[X(s), s] d W(s)$

以上都是 Sample 的微分或是積分方程式，Sample 是無法直接計算的，除了用 Monte Carlo 模擬。

但是和 probability distribution 的關係是什麽？

另一個常用的統計特性是 auto-correlation function, 可以得到不同時間 samples 的 correlation. 另外 FFT 之後可以得到 spectrum (如果是 stationary process).

Random variable/process 的統計特性 (i.e. probability distribution) 則是可計算的，至少理論上如此。

很自然我們會問 (9) 或 (10) 的 probability distribution $p_t(X_t)$ 是如何隨著時間變化?

這是 Fokker-Planck equation 討論的問題。

Ito SDE 是比較 general 的 equation. Fokker-Planck equation 則是其解。

Fokker-Planck Equation

有兩種 approaches, 一個是科學空間的方法。另一個是 Wiki Fokker-Planck equation。

Approach 1: Wiki Fokker-Planck Equation

https://en.wikipedia.org/wiki/Fokker%E2%80%93Planck_equation

Higher diemsnions

Approach 2: 科學空間方法

看起來像是 Drift Only Equation? 非常神奇，應該有問題，他好像把 drift and diffusion 混在一起！

下面的 equation (10) 似乎有問題。應該是保留第二項 (drift=0), 可以得到一個 diffusion only solution!!

以下的推導可以參考科學空間：https://spaces.ac.cn/archives/9280

先看簡化版，沒有 Wiener process 部分：

再來考慮 general case “Ito SDE”

对于SDE $d \boldsymbol{x}_t=\boldsymbol{f}_t\left(\boldsymbol{x}_t\right) d t+g_t d \boldsymbol{w}$ 根据测试函数法的相等原理得 $\frac{\partial p_t\left(\boldsymbol{x}_t\right)}{\partial t}=-\nabla_{\boldsymbol{x}_t} \cdot\left(p_t\left(\boldsymbol{x}_t\right) \boldsymbol{f}_t\left(\boldsymbol{x}_t\right)\right)+\frac{1}{2} g_t^2 \nabla_{\boldsymbol{x}} \cdot \nabla_{\boldsymbol{x}} p_t(\boldsymbol{x})$ 这就是“Fokker-Planck方程”。

$p(x_t) = p_0(x_0) * N(x_t)$ convolution
$p_0(x_0)$ 就是原始的 distribution, 例如點熱源或是 image distribution, 不是 Gaussian! 但是 mixed Gaussian distribution 如 (13)：每一個 $x_0$ 對應一個 Gaussian.
如果 $t$ 很小，$\sigma \to 0$, Gaussian kernel $N(x_t)$ 基本就是 delta function, $p(x_t)_{t=0} = p_0(t)$.
當 $t$ 越來越大，$\sigma$ 越來越大，Gaussian kernel $N(x_t)$ 變成非常寬，但是 frequency domain 非常窄。所以非常接近 Gaussian kernel!!

以上就是 diffusion process, 不論是 thermal diffusion 或是 image diffusion process!

(2) 和 (12) 其物理意義類似有一個 (probability) flow in space 如下面的例子。

Thermal flow = probability flow

Transport Equation of Flux

有趣的是，$V$ 和 $\phi$ 常常都有簡單的關係，稱爲 transport equation (spatial dependent)。以下是一些 flow 例子：

Diffusion Flow

Continuity equation, 也代表質量守恆 equation. $\nabla \cdot \mathbf{J} + \frac{\partial \varphi}{\partial t} = 0 \label{DiffCont} \quad \to \quad \frac{\partial \varphi}{\partial t} = - \nabla \cdot \mathbf{J}$ where

$\mathbf{J}$ is the diffusion flux, measures the amount of substance that will flow through a unit area during a unit time interval.
φ (for ideal mixtures) is the concentration, of which the dimension is amount of substance (mole) per unit volume.

再來因爲 diffusion flux 是 irrotational flow, $\nabla \times \mathbf{J} = 0$, 而且 potential $V$, $-\nabla V = \mathbf{J}$. 也就是驅動 diffusion flux 的勢能 $V$. 在 diffusion flux 的例子，$V$ 剛好正比于 $\varphi$, 也就是 $ V = D \varphi \rightarrow \mathbf{J} = -\nabla V = -D \nabla \varphi$. 這個比例常數, $D$, 稱爲 diffusion constant.

這就是 Fick’s first law, 也稱爲 transport law. $\mathbf{J} = - D \nabla \varphi \label{Fick1}$ 爲什麽不是用 $V$ 取代 $\varphi$, 把比例常數放在另一邊，such as $\varphi = D’ V \to D’ \mathbf{J} = -\nabla V$? 主要的原因是 continuity equation including $\mathbf{J}$ and $\varphi$ 更基本，而 $V$ and $D = V / \varphi$ 則 depends on material and environment (e.g. 溫度，溶劑)，如下表。

$D$ 除了反應 material 和 environment 的特性，另一個功能吸收收 $V$ and $\varphi$ 單位的差異, area per unit time $L^2/T$, or SI 單位 $m^2/s$. 對於類似的 transport phenomen都是同樣的單位，

結合 continuity equation 和 transport law，就得到有名的 Fick’s second law (or transport equation) PDE (partial differential equation) with Laplacian operator. 在 Steady state 就化簡成 Laplacian equation. $\frac{\partial \varphi}{\partial t} = D \nabla ^ 2 \varphi = D \Delta \varphi \label{Fick2}$

我們可以推論一下 $\varphi$ 的定性分析。我們簡化 $\nabla^2 \varphi = \Delta \varphi$ 為 1D spatial domain, 物理意義就是曲率。

對於凸函數曲率為負值，因爲 $D$ 是正值，所以代表 $\varphi$ 對時間的微分為負值。因此凸函數隨時間變平緩，如下圖右半。
對於凹函數曲率為正值，因爲 $D$ 是正值，所以代表 $\varphi$ 對時間的微分為正值。因此凹函數隨時間變平緩，如下圖左半。
因此隨時間增加，山峰會平緩，山谷也會變平坦。這和熱或擴散的物理直覺一致，會到達熱平衡或濃度平衡。

$\eqref{Fick2}$和 heat equation $\eqref{HeatEq}$ 其實一模一樣。只是把 diffusion constant $D$ 改成 thermal conductivity $k$。上式 fundmantal solution 稱爲 (heat) kernel. 基本是 variance 隨時間變大的 Gaussian function. 當然 PDE 具體問題的解 depends on the boundary condition.
$\varphi(x, t)=\frac{1}{\sqrt{4 \pi D t}} \exp \left(-\frac{x^{2}}{4 D t}\right) \label{HeatKern}$

最後一步，如何求解 flux vector field? 只要把 scalar field $\varphi$ 取 gradient in $\eqref{Fick1}$ 就可以得到 flux vector field $\mathbf{J}$.

神經網絡就是去擬合 score function of a normal distribution? 這應該只是 forward path? 不過如果是 deterministic function, 爲什麽需要神經網絡擬合？

應該是用神經網絡去擬合 $\nabla_{x_t} \log p(x_t) $

Linear SDE (Ito Lemma)

前文使用 DDPM 的角度解釋 Diffusion Model. 類似 VAE.

這裡使用 SDE 推導 Diffusion Model. 基本是用 neural network 近似 score function (gradient log P)

Appendix

Appendix A: Ito SDE to Fokker-Planck Equation

对于 SDE $d \boldsymbol{x}_t=\boldsymbol{f}_t\left(\boldsymbol{x}_t\right) d t+g_t d \boldsymbol{w}$ 我们离散化为 $\boldsymbol{x}_{t+\Delta t}=\boldsymbol{x}_t+\boldsymbol{f}_t\left(\boldsymbol{x}_t\right) \Delta t+g_t \sqrt{\Delta t} \varepsilon, \quad \varepsilon \sim \mathcal{N}(\mathbf{0}, \boldsymbol{I})$ 那么 $\begin{aligned} \phi\left(\boldsymbol{x}_{t+\Delta t}\right) & =\phi\left(\boldsymbol{x}_t+\boldsymbol{f}_t\left(\boldsymbol{x}_t\right) \Delta t+g_t \sqrt{\Delta t} \varepsilon\right) \\ & \approx \phi\left(\boldsymbol{x}_t\right)+\left(\boldsymbol{f}_t\left(\boldsymbol{x}_t\right) \Delta t+g_t \sqrt{\Delta t} \varepsilon\right) \cdot \nabla_{\boldsymbol{x} t} \phi\left(\boldsymbol{x}_t\right)+\frac{1}{2}\left(g_t \sqrt{\Delta t} \varepsilon \cdot \nabla_{\boldsymbol{x} t}\right)^2 \phi\left(\boldsymbol{x}_t\right) \end{aligned}$ 两边求期望, 注意右边要同时对 $\boldsymbol{x}_t$ 和 $\boldsymbol{\varepsilon}$ 求期望, 其中 $\boldsymbol{\varepsilon}$ 的期望可以事先求出, 结果是 $\phi\left(\boldsymbol{x}_t\right)+\Delta t \boldsymbol{f}_t\left(\boldsymbol{x}_t\right) \cdot \nabla_{\boldsymbol{x}_t} \phi\left(\boldsymbol{x}_t\right)+\frac{1}{2} \Delta t g_t^2 \nabla_{\boldsymbol{x} t} \cdot \nabla_{\boldsymbol{x}_t} \phi\left(\boldsymbol{x}_t\right)$ 于是 $\begin{aligned} & \int p_{t+\Delta t}\left(\boldsymbol{x}_{t+\Delta t}\right) \phi\left(\boldsymbol{x}_{t+\Delta t}\right) d \boldsymbol{x}_{t+\Delta t} \\ \approx & \int p_t\left(\boldsymbol{x}_t\right) \phi\left(\boldsymbol{x}_t\right) d \boldsymbol{x}_t+\Delta t \int p_t\left(\boldsymbol{x}_t\right) \boldsymbol{f}_t\left(\boldsymbol{x}_t\right) \cdot \nabla_{\boldsymbol{x}_t \phi} \phi\left(\boldsymbol{x}_t\right) d \boldsymbol{x}_t+\frac{1}{2} \Delta t g_t^2 p_t\left(\boldsymbol{x}_t\right) \nabla_{\boldsymbol{x}_t} \cdot \nabla_{\boldsymbol{x}_t} \phi\left(\boldsymbol{x}_t\right) d \boldsymbol{x}_t \end{aligned}$ 跟式 $(13)$ 、式 $(14)$ 类似，取 $\Delta \rightarrow 0$ 的极限, 得到 $\int \frac{\partial p_t\left(\boldsymbol{x}_t\right)}{\partial t} \phi\left(\boldsymbol{x}_t\right) d \boldsymbol{x}_t=\int p_t\left(\boldsymbol{x}_t\right) \boldsymbol{f}_t\left(\boldsymbol{x}_t\right) \cdot \nabla_{\boldsymbol{x}_t} \phi\left(\boldsymbol{x}_t\right) d \boldsymbol{x}_t+\frac{1}{2} g_t^2 p_t\left(\boldsymbol{x}_t\right) \nabla_{\boldsymbol{x}_t} \cdot \nabla_{\boldsymbol{x}_t} \phi\left(\boldsymbol{x}_t\right) d \boldsymbol{x}_t$ 对右边第一项应用式 $(6)$ 、对右边第二项先应用式(7)再应用式 $(6)$, 得到 $\int \frac{\partial p_t\left(\boldsymbol{x}_t\right)}{\partial t} \phi\left(\boldsymbol{x}_t\right) d \boldsymbol{x}_t=\int\left[-\nabla_{\boldsymbol{x}_t} \cdot\left(p_t\left(\boldsymbol{x}_t\right) \boldsymbol{f}_t\left(\boldsymbol{x}_t\right)\right)+\frac{1}{2} g_t^2 \nabla_{\boldsymbol{x}} \cdot \nabla_{\boldsymbol{x}} p_t(\boldsymbol{x})\right] \phi\left(\boldsymbol{x}_t\right) d \boldsymbol{x}_t$ 根据测试函数法的相等原理得 $\frac{\partial p_t\left(\boldsymbol{x}_t\right)}{\partial t}=-\nabla_{\boldsymbol{x}_t} \cdot\left(p_t\left(\boldsymbol{x}_t\right) \boldsymbol{f}_t\left(\boldsymbol{x}_t\right)\right)+\frac{1}{2} g_t^2 \nabla_{\boldsymbol{x}} \cdot \nabla_{\boldsymbol{x}} p_t(\boldsymbol{x})$ 这就是”Fokker-Planck方程”。

MLP Is All You Need

Posted on 2023-04-09 | In Language

我想用 matrix math 澄清一些觀念

Convolution Is All You Need

Posted on 2023-04-09 | In Language

我想用 matrix math 澄清一些觀念

Reference

Background

Lesson Learned

Convex, Smooth, Strict Convex 定義

什麽是 Epigraph (Epi - on ; graph - write)?

Convex, Smooth, Strict Convex 幾何詮釋

Error, Convergence Rate

Q&A : How about ADAM and other convergence algorithm on convex function, smooth, and strict convex + smooth????

Momentum Method: to close the gap between GD vs. Oracle lower bound

Projections and Proximal Operators

Reference

Background

Primal-Dual Framework

Lagrange Duality

Conjugate Duality

Reference

Introduction

隨機過程簡單系統

(Forward) Ito Differential Equation

Forward unconditioned equation (Fokker-Planck equation)

Ito Differential Equation Examples

Coefficients Without $x$ Dependence (Forward)

Multiplicative Linear White Noise Process (Forward)

Ornstein-Uhlenbeck Process (Forward)

(Variance) Stationary Version ($A$ 的 eigenvalues 實部小於 0)

Appendix

Reference

Introduction

Background

SDE and Reverse SDE

DDPM (Variance Preserving SDE)

SMLD (Variance Exploding SDE)

Sub-Variance Preserving SDE

Reference

Introduction

隨機過程簡單系統

(Forward) Ito Differential Equation

Forward unconditioned equation (Fokker-Planck equation)

Backward (not reverse-time, not $t \to -t$) Kolmogorov equation

(Backward) Ito Differential Equation

Ito Differential Equation Examples

綫性 Ornstein-Uhlenbeck Equation (Foward and Backward)

Backward Fokker-Planck Solution

Appendix

Appendix A

如何用 ChatGPT 學習語言？

語言部分

對話部分

Takeaway

測度論

直觀解釋 黎曼 (Riemann) 積分 and 勒貝格 (Lesbegue) 積分

白話就是：黎曼積分是分割定義域來計算積分；勒貝格積分則是用分割值域來計算積分。

二、集合介紹

二、外側度

三、可測集、測度0

四、不可測集

六、Lebesgue 積分 vs Riemann 積分

Reference

Reference

什麽是隨機微分方程 (SDE)

例一：(Time Derivative) Input/Output

例二：(Time Derivative) Hidden State/Output Estimation

例三：(Time Derivative) Input/Output

例四：(Time and Space Derivative!) Diffusion Process (Wrong!!)

SDE 微分問題 - 標準型

例二：(Time Derivative) 布朗運動

SDE 微分和積分表示法

Fokker-Planck Equation

Approach 1: Wiki Fokker-Planck Equation

Approach 2: 科學空間方法

Transport Equation of Flux

Diffusion Flow

Linear SDE (Ito Lemma)

Appendix

Appendix A: Ito SDE to Fokker-Planck Equation

直觀解釋黎曼 (Riemann) 積分 and 勒貝格 (Lesbegue) 積分