AI SfM - DUSt3R, MASt3R, MonST3R

Traditional SfM vs. AI SfM

![[Pasted image 20251006192355.png]]

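Traditional SfM pipelines (for example COLMAP and GLOMAP) split the SfM task into a chain of subtasks: feature extraction, matching, triangulation, pose estimation, bundle adjustment, and so on. This decomposition is clean and maintainable from an engineering standpoint, but it has two drawbacks:

- Errors made by one module tend to accumulate in the modules that follow (error propagation).
- Information from later modules is hard to feed back into earlier ones, so the modules are difficult to optimize jointly.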

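AI SfM (also called end-to-end neural SfM) tries to overcome these bottlenecks. Typical methods (DUSt3R, MASt3R, MonST3R, VGGT) go directly from input images to a 3D reconstruction and camera poses, with as few hand-designed intermediate modules as possible.

Concretely, the model takes two or more images, and the network itself outputs the 3D geometry (point clouds, depth maps, or pointmaps) together with camera poses or quantities from which the poses can be recovered. Some methods additionally integrate the matching step (feature mapping) or account for moving objects.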

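In practice, AI SfM means:

- The model can account for how the sub-steps influence each other, because they are optimized as a whole.
- Less error accumulates between modules.
- Speed and efficiency improve, particularly for feed-forward models such as VGGT.
- Model design and training become more expensive: large-scale training data and a sound generalization strategy are required.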

How AI SfM has evolved:

- DUSt3R: takes an image pair and directly regresses a pointmap (a 2D → 3D mapping) for each image, with both pointmaps expressed in the first camera's local coordinate frame. Depth and camera pose are derived from the pointmaps, and a global alignment step merges multiple images. (CVF Open Access)
- MASt3R: adds a descriptor / feature mapping head on top of DUSt3R, so that matching is integrated with reconstruction. (LearnOpenCV)
- MonST3R: accounts for moving objects in the scene, so that scenes that are not fully static can be handled (dynamic geometry estimation). (jytime.github.io)
- VGGT: a transformer architecture that processes many images at once and directly predicts the 3D points and camera poses. Its feed-forward prediction can serve as initialization for BA or, in some cases, as the final output without a traditional BA step. (arXiv)

AI SfM methods and their limitations

The table below collects additional background on DUSt3R, MASt3R, MonST3R, VGGT, and related methods:

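| Method | Core idea / characteristics | Strengths | Limitations or challenges |
| --- | --- | --- | --- |
| DUSt3R | Casts pairwise reconstruction as pointmap regression (each image gets its own pointmap) and does not rely on known camera intrinsics or poses. (CVF Open Access) | Unifies monocular and stereo reconstruction; reduces dependence on traditional modules; matches, poses, and other quantities can be recovered from the pointmaps. (arXiv) | Skips some explicit geometric steps but still needs global alignment to bring the pairwise reconstructions into a shared coordinate frame (CVF Open Access); can be fragile under large viewpoint changes, occlusion, or low overlap. |
| MASt3R | Adds a feature mapping / descriptor head on top of DUSt3R, giving stronger matching and more precise correspondences. (LearnOpenCV) | Better matching accuracy; matching and reconstruction are integrated; better local-to-global alignment. (arXiv) | Usually still operates on pairs and needs a global alignment step; extreme scenes, occlusion, or low overlap remain challenging. (arXiv) |
| MASt3R-SfM | A framework that goes further in handling many images: MASt3R reconstructions are computed across the image set and the local reconstructions are aligned into one global coordinate frame. (arXiv) | Handles multi-image consistency better than a purely pairwise approach; simplifies the pipeline; scalable. (arXiv) | Still depends on accurate and stable global alignment; may fail when image coverage is sparse or overlap is insufficient. |
| MonST3R | Starts from the static-scene formulation and adds modeling of moving objects, so that geometry can be estimated even when parts of the scene move. (jytime.github.io) | More robust; handles partially dynamic scenes; extends AI SfM to more real-world settings. (jytime.github.io) | Estimating, segmenting, and aligning dynamic objects is more complex; compute cost and model complexity may be higher. |
| VGGT (Visual Geometry Grounded Transformer) | A single transformer that works on many images at once and directly predicts 3D points and camera poses, reducing or even removing traditional geometric optimization steps. (arXiv) | Strong speed and accuracy, much faster than DUSt3R / MASt3R (arXiv); can initialize BA, or in some cases gives good results without BA. (arXiv) | Demanding in training data, model capacity, and generalization; may be unstable under extreme viewpoints, very low overlap, or heavy occlusion. (arXiv) |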

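There are also several follow-ups and variants:

- MV-DUSt3R+: a single-stage feed-forward model that processes many images directly, avoiding the large number of pairwise combinations and the resulting error accumulation. (arXiv)
- D²USt3R: targets dynamic scenes, using 4D pointmaps (space + time) to handle both static and dynamic geometry. (arXiv)
- InstantSplat: an example of work that combines DUSt3R with Gaussian splatting for more photorealistic reconstruction. (ResearchHub)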

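The VGGT paper also notes that its feed-forward prediction directly produces 3D points (a point map) that can initialize BA, and that in some cases traditional triangulation or iterative refinement is not needed at all. (arXiv)

The paper further reports that combining VGGT with BA substantially improves accuracy: on one benchmark, AUC@10 rises from 71.26 to 84.91. (arXiv)

It also stresses that, whereas DUSt3R and MASt3R are limited to processing image pairs, VGGT's architecture integrates geometric information across many images and is therefore more efficient. (arXiv)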
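To make the cost of a pairwise design concrete, the sketch below counts how many image pairs a pairwise model would have to process. It assumes every unordered pair is evaluated, which is a worst case (real pipelines usually prune the pair graph), and the function name is illustrative.

```python
from math import comb

def pairwise_passes(num_images):
    """Unordered image pairs in a complete pair graph over num_images images."""
    return comb(num_images, 2)

# The pair count grows quadratically, whereas a multi-view model
# consumes the whole image set in a single forward pass.
for n in (10, 50, 200):
    print(n, pairwise_passes(n))
```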

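Traditional SfM vs. AI SfM comparison

| Aspect | Traditional SfM | AI SfM / Neural SfM |
| --- | --- | --- |
| Modular vs. end-to-end | Staged modules: feature → matching → pose → triangulation → BA → densify | Goes from input images directly to the 3D result (or depth / point cloud plus pose), with little or no modular decomposition |
| Error propagation | Errors commonly accumulate across modules | The model can learn to remove or dampen error accumulation |
| Feedback between modules | Hard to feed information from later stages back into earlier ones | End-to-end training lets the model learn the dependencies between stages |
| Geometric rigor | Strong geometric constraints (epipolar geometry, strict triangulation, BA) | Geometric assumptions are partly or fully relaxed; the network relies on learned priors and statistical patterns |
| Dependence on initialization | Usually needs good initialization, sufficient overlap, and little occlusion | More stable in some cases, with higher tolerance to errors |
| Scalability / generalization | Modular design is easy to extend and to plug new methods into | Model size, training data, and generalization are challenges |
| Efficiency / speed | Steps such as BA can become a bottleneck on large scenes | Can be very fast when well designed, especially feed-forward models |
| Interpretability | Every step is explicit and interpretable | Internals are largely a black box; extra design is needed to make them interpretable |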

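Coordinate Systems

Structure-from-Motion (SfM), like any visual observation problem, typically involves three coordinate spaces:

1. Global 3D space: the position of an object in a global three-dimensional frame, for example a world coordinate system with its origin fixed in the world.
2. Local 3D camera space: the position of the object in the camera's local 3D frame. The camera pose (position and orientation) determines the transformation between this local frame and the global frame.
3. Local 2D pixel plane: pixel coordinates on the image plane, obtained by projecting 3D points in camera coordinates onto a 2D plane.

The transformations between these three spaces are described with linear algebra: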

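- From (1) global 3D space to (2) local 3D camera space: an invertible rigid transformation consisting of a rotation matrix $R$ and a translation vector $t$: $X_{camera} = R X_{world} + t$
- From (2) local 3D camera space to (3) local 2D pixel plane: a projection, usually modeled with the pinhole camera: $\tilde{x} \sim K X_{camera} = K [R | t] \tilde{X}_{world}$, where $K$ is the intrinsic matrix, $\tilde{x}$ and $\tilde{X}_{world}$ are homogeneous coordinates, and $\sim$ denotes equality up to scale.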
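The two transformations above can be chained in a few lines. This is a minimal sketch, assuming the convention $X_{camera} = R X_{world} + t$ that matches $[R | t]$; the intrinsics, pose, and point values are made up for illustration.

```python
import numpy as np

def world_to_camera(X_world, R, t):
    """Rigid transform from world to camera coordinates: R @ X_world + t."""
    return R @ X_world + t

def camera_to_pixel(X_camera, K):
    """Pinhole projection: apply K, then divide by the depth component."""
    uvw = K @ X_camera
    return uvw[:2] / uvw[2]

# Hypothetical camera: 500 px focal length, principal point at (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                    # camera axes aligned with the world axes
t = np.array([0.0, 0.0, 2.0])    # world origin lies 2 units in front of the camera

X_world = np.array([0.2, -0.1, 2.0])
X_camera = world_to_camera(X_world, R, t)
pixel = camera_to_pixel(X_camera, K)
print(X_camera, pixel)
```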

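Inverting this projection (going from 2D back to 3D) is an ill-posed problem: without the depth of a pixel and the camera pose, its 3D position cannot be determined uniquely. This is why Bundle Adjustment (BA) is needed: it jointly estimates the 3D coordinates of the scene points and the camera poses by minimizing the reprojection error.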
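The ambiguity is easy to see numerically: every point along the viewing ray of a pixel projects back to that same pixel. A small sketch with made-up intrinsics and illustrative helper names:

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

def backproject(pixel, depth, K):
    """Lift a pixel to a 3D point in camera coordinates at the given depth."""
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    return depth * ray

def project(X_camera, K):
    """Pinhole projection of a 3D point in camera coordinates."""
    uvw = K @ X_camera
    return uvw[:2] / uvw[2]

pixel = np.array([345.0, 227.5])
# Three different 3D candidates along the same viewing ray ...
candidates = [backproject(pixel, d, K) for d in (1.0, 4.0, 10.0)]
# ... all land on the same pixel, so one view cannot tell them apart.
reprojected = [project(X, K) for X in candidates]
print(candidates, reprojected)
```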

Core geometric question:

👉 In AI-based SfM models like DUSt3R, MASt3R, VGGT, which of the classical projection parameters (R, t, K) are explicitly or implicitly estimated — and how?

1️⃣ Recap: Classical SfM Parameterization

In traditional Structure-from-Motion (SfM), we explicitly estimate the following:

| Symbol | Name | Description | How it's obtained |
| --- | --- | --- | --- |
| $K$ | Intrinsic matrix | Describes the camera's internal geometry (focal length, principal point, skew). | Known from prior calibration, or estimated during reconstruction. Usually fixed across images. |
| $R, t$ | Extrinsic parameters (rotation, translation) | Describe the camera's pose: how it is oriented and where it is located in world coordinates. | Estimated from feature correspondences (via the essential matrix, PnP, or BA). |
| $X$ | 3D point positions | Scene structure, reconstructed by triangulation. | Computed from multiple 2D correspondences and camera poses. |
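The triangulation step in the last row can be sketched with the linear (DLT) method: each observation contributes two linear equations in the homogeneous 3D point, and the solution is the null vector of the stacked system. The cameras and the test point below are made up, and the observations are noise-free; a real pipeline would refine the result in BA.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear triangulation of one point from two 3x4 projection matrices."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    uvw = P @ np.append(X, 1.0)
    return uvw[:2] / uvw[2]

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
# Second camera: same orientation, center shifted by 0.5 along +x (t = -R @ C).
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 4.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
print(X_est)
```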

2️⃣ DUSt3R (CVPR 2024) — “Pointmap Regression” Model

DUSt3R (by Wang et al., 2024, CVPR) fundamentally changes the game: It doesn’t explicitly estimate $R, t, K$ at first. Instead, it learns to regress 3D pointmaps directly from pairs of images.

🔹 What DUSt3R actually predicts

For each image pair $(I_1, I_2)$:

| Output | Meaning |
| --- | --- |
| Pointmap₁ ($P_1$) | Each pixel in image 1 is assigned a 3D point in camera 1's local coordinate frame. |
| Pointmap₂ ($P_2$) | Each pixel in image 2 is assigned a 3D point expressed in camera 1's local coordinate frame (note: not its own). |
| Confidence maps | Per-pixel quality score for each pointmap's predictions. |

So DUSt3R learns: $f(I_1, I_2) \rightarrow (P_1, P_2)$ where $P_2$ is already aligned to the coordinate frame of camera 1.

🔹 Recovering $R, t$ from DUSt3R outputs

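$P_1$ and $P_2$ cover different pixels, so they cannot be rigidly aligned to each other directly. One option is to also run the network on the swapped pair $(I_2, I_1)$, which yields a second pointmap $P_2'$ for image 2 expressed in camera 2's own frame. The same pixels then have 3D coordinates in both frames, and the relative pose follows from:

$$\min_{s, R, t} \sum_i \| P_{2,i} - s (R P'_{2,i} + t) \|^2$$

Here $(R, t)$ maps camera 2 coordinates into camera 1 coordinates and $s$ absorbs the scale difference between the two predictions. This is a standard alignment problem, solved in closed form by the Procrustes / Umeyama algorithm (SVD-based least squares). A more outlier-robust alternative is PnP with RANSAC on the pixel coordinates of image 2 and their 3D points in $P_2$, given the intrinsics. Either way, DUSt3R provides dense 2D-3D correspondences from which $R$ and $t$ are recovered post hoc.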
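The SVD-based alignment can be sketched as follows. This is the Kabsch solution without a scale factor, shown for brevity; the point sets are synthetic stand-ins for two pointmaps with known per-pixel correspondence, and `rigid_align` is an illustrative name, not part of any DUSt3R API.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares R, t such that dst ≈ src @ R.T + t (Kabsch, no scale)."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Synthetic check: rotate and translate random points, then recover the motion.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
dst = src @ R_true.T + t_true

R_est, t_est = rigid_align(src, dst)
print(np.round(R_est, 3), np.round(t_est, 3))
```

With real network outputs, the residuals would typically be weighted by the predicted confidence, and a scale factor added as in the Umeyama formulation.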

🔹 What about $K$ (intrinsics)?

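DUSt3R does not take $K$ as input, and the network does not output it either. Because $P_1$ is expressed in camera 1's own frame, the intrinsics can be recovered from it after the fact: assuming a centered principal point and square pixels, only the focal length remains, and it is found by fitting the pinhole projection of $P_1$ to the pixel grid. Thus:

- $K$ is not predicted by the network.
- It does not need to be known in advance; the focal length is estimated post hoc from the pointmap.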
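Although $K$ is not a network output, a pointmap expressed in its own camera frame constrains the focal length, since each pixel must satisfy $u - c_x = f X / Z$ and $v - c_y = f Y / Z$. The sketch below solves for $f$ by plain least squares on a synthetic pointmap; it assumes a known principal point and square pixels, and it is a simplification (the DUSt3R paper describes a more robust iterative solver). The function name is illustrative.

```python
import numpy as np

def estimate_focal(pointmap, principal_point):
    """Least-squares focal length from an (H, W, 3) pointmap in its own camera frame."""
    H, W, _ = pointmap.shape
    v, u = np.mgrid[0:H, 0:W].astype(float)
    du, dv = u - principal_point[0], v - principal_point[1]
    a = pointmap[..., 0] / pointmap[..., 2]     # X / Z
    b = pointmap[..., 1] / pointmap[..., 2]     # Y / Z
    # Minimize sum((du - f*a)^2 + (dv - f*b)^2) over f (closed form).
    return float((du * a + dv * b).sum() / (a**2 + b**2).sum())

# Synthetic pointmap rendered with a known focal length of 60 px.
H, W, f_true = 48, 64, 60.0
cx, cy = W / 2, H / 2
v, u = np.mgrid[0:H, 0:W].astype(float)
Z = 2.0 + 0.01 * u                              # slanted plane
pointmap = np.stack([(u - cx) / f_true * Z, (v - cy) / f_true * Z, Z], axis=-1)

f_est = estimate_focal(pointmap, (cx, cy))
print(f_est)
```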

✅ Summary (DUSt3R):

| Parameter | Estimated? | How | Comment |
| --- | --- | --- | --- |
| K (intrinsic) | ✅ Indirectly estimated | Focal length fitted to the pointmap of image 1 | Assumes a centered principal point and square pixels |
| R, t (extrinsic) | ✅ Indirectly estimated | From aligning predicted pointmaps, or PnP | Equivalent to relative camera pose |
| 3D structure (X) | ✅ Directly predicted | Dense pointmap regression | In camera 1's coordinate frame |

3️⃣ MASt3R — Adds Feature Matching Awareness

MASt3R builds on DUSt3R and adds a matching head to the same transformer, so that 3D regression and feature matching are learned together.

Differences:

- Outputs dense local descriptors alongside the pointmaps, giving dense correspondences between the two views.
- Produces more robust and consistent pointmaps.
- Still does not directly output $R, t$, but they can be recovered as before from the pointmaps or the matches.

So, in MASt3R:

| Parameter | Estimated? | Notes |
| --- | --- | --- |
| K | ✅ recovered post hoc | Same as DUSt3R: not a network input or output |
| R, t | ✅ recovered post hoc | From pointmap alignment or matches |
| 3D structure | ✅ directly predicted | Higher-quality dense pointmaps |
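Dense descriptors are commonly turned into correspondences by keeping reciprocal (mutual) nearest neighbors. The sketch below is a brute-force version on synthetic, L2-normalized descriptors; it illustrates the idea only and is not MASt3R's accelerated matching scheme. All names are illustrative.

```python
import numpy as np

def mutual_nearest_neighbors(desc1, desc2):
    """Index pairs (i, j) where desc1[i] and desc2[j] are each other's best match."""
    sim = desc1 @ desc2.T                 # cosine similarity for unit-norm rows
    nn12 = sim.argmax(axis=1)             # best match in set 2 for each row of set 1
    nn21 = sim.argmax(axis=0)             # best match in set 1 for each row of set 2
    idx1 = np.arange(len(desc1))
    keep = nn21[nn12] == idx1             # keep only reciprocal matches
    return np.stack([idx1[keep], nn12[keep]], axis=1)

# Synthetic descriptors: set 2 is a shuffled copy of set 1.
rng = np.random.default_rng(0)
desc1 = rng.normal(size=(5, 16))
desc1 /= np.linalg.norm(desc1, axis=1, keepdims=True)
perm = np.array([2, 0, 4, 1, 3])
desc2 = desc1[perm]                       # desc2[j] equals desc1[perm[j]]

matches = mutual_nearest_neighbors(desc1, desc2)
print(matches)
```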

4️⃣ MonST3R — Extends to Moving Objects

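MonST3R adapts DUSt3R to dynamic scenes:

- It keeps the pointmap representation and predicts a pointmap for each timestep, trained on data that contains moving objects.
- Camera poses are still recovered from the pointmaps, and regions whose motion disagrees with the camera motion can be separated as dynamic.

Thus MonST3R does not regress a separate rigid transform $R_i, t_i$ for each object. As in DUSt3R, the extrinsics come from aligning the learned 3D pointmaps, not from explicit matrix regression.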

5️⃣ VGGT — Visual Geometry Grounded Transformer (2025)

VGGT takes it further:

- It processes multiple images jointly (not just pairs).
- It directly predicts both 3D structure and camera poses in one forward pass.

🔹 VGGT output

VGGT predicts: $\{ R_i, t_i \}_{i=1}^{N}, \quad \{ P_i \}_{i=1}^{N}$ jointly for all input images $I_1 \ldots I_N$.

Thus it explicitly estimates the extrinsics (unlike DUSt3R / MASt3R), and can:

- Provide $R, t$ for each camera.
- Produce consistent 3D structure across all views.
- Often bypass or reduce the need for BA.

$K$ is not required as input either: the camera head predicts a field of view for each image alongside $R_i, t_i$, from which $K$ follows under the assumption of a centered principal point.
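Intrinsics are often parameterized by field of view rather than by $K$ itself, and converting between the two is a one-line pinhole relation, $f = (W / 2) / \tan(\mathrm{fov} / 2)$. A minimal sketch, assuming a centered principal point and no skew; the function name and the image size are illustrative and not taken from any released code.

```python
import math
import numpy as np

def intrinsics_from_fov(fov_x, fov_y, width, height):
    """Pinhole K from horizontal/vertical field of view in radians."""
    fx = 0.5 * width / math.tan(0.5 * fov_x)
    fy = 0.5 * height / math.tan(0.5 * fov_y)
    return np.array([[fx, 0.0, 0.5 * width],
                     [0.0, fy, 0.5 * height],
                     [0.0, 0.0, 1.0]])

# Hypothetical 640x480 image with a 90 degree horizontal field of view.
fov_x = math.radians(90.0)
fov_y = 2.0 * math.atan(0.75)    # vertical FoV that gives square pixels at 4:3
K = intrinsics_from_fov(fov_x, fov_y, 640, 480)
print(K)
```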

✅ Summary Comparison

| Model | K (intrinsic) | R, t (extrinsic) | 3D structure | Notes |
| --- | --- | --- | --- | --- |
| DUSt3R | Not an input; focal length recovered from the pointmap | Derived post hoc from the predicted pointmaps | Predicted (dense pointmaps) | Pairwise model |
| MASt3R | Same as DUSt3R | Derived post hoc | Predicted (better consistency) | Adds feature mapping |
| MonST3R | Same as DUSt3R | Derived post hoc, per timestep | Predicted per timestep | Handles dynamic scenes |
| VGGT | Predicted by the network (field of view) | Predicted directly by the network | Predicted jointly across views | Multi-view, end-to-end model |

6️⃣ Why This Matters (Conceptually)

| Classical SfM | AI SfM |
| --- | --- |
| $R, t, K$ estimated explicitly via geometric optimization (essential matrix, BA) | Network learns implicit geometry priors, producing outputs from which these can be recovered |
| Geometry guaranteed by algebra | Geometry approximated / learned statistically |
| Needs correspondences | Learns correspondences internally |
| BA (nonlinear least squares) needed | Often bypassed or replaced by learned consistency |

🧩 Intuitive Analogy

- Traditional SfM: “I carefully measure geometry and solve the math.”
- DUSt3R / MASt3R: “I directly imagine what the 3D scene looks like from both cameras, and the geometry falls out naturally.”
- VGGT: “I see all views at once, and jointly infer the 3D world and where each camera was.”