YouTube reference for camera extrinsics and intrinsics: "Computer Vision: The Camera Matrix"
Introduction
I am into bouldering and would like to do a project using AI for coaching.
Pre-Stage
Training Video Collection
Scrape YouTube videos of bouldering. Search by climber name: Janja, Ai Mori.
Download YouTube video: https://www.youtube.com/shorts/m4b0uDBh4nE. Tried iMyFone TopClipper (not good).
Video editing to remove head and tail: CapCut or Microsoft Clipchamp -> used Claude Code to write a simple script to trim the video.
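A minimal sketch of such a trim script (assuming ffmpeg is on the PATH; the file names and cut points are placeholders, not the actual script generated with Claude Code):

```python
import subprocess

def trim_video(src: str, dst: str, start: str, end: str) -> None:
    """Cut the segment [start, end] out of src and re-encode it to dst with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ss", start, "-to", end,
         "-c:v", "libx264", "-c:a", "aac", dst],
        check=True,
    )

# Example: keep 00:00:03 to 00:00:45 (placeholder file names and times).
trim_video("janja1.mp4", "janja1_trimmed.mp4", "00:00:03", "00:00:45")
```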
AI Coach 1 (Failed)
First do 3D pose detection (using MMPose): initially on the whole image, later switching to person detection first (MMDetection) followed by pose detection. => Then separate and complete the background (using temporal homography + median filtering). => Finally convert the background into a 3D point cloud with VGGT.
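A rough sketch of the temporal homography + median filtering step (an illustration, not the actual code): each frame is warped onto a reference frame with an ORB-based homography, then a per-pixel temporal median removes the moving climber. A homography is only exact for a planar scene or a purely rotating camera, so this works best when the wall is roughly planar.

```python
import cv2
import numpy as np

def background_by_median(frames, ref_idx=0):
    """Warp every frame onto a reference frame via an ORB-matched homography,
    then take a per-pixel temporal median so the moving climber is filtered out."""
    orb = cv2.ORB_create(2000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    ref = frames[ref_idx]
    h, w = ref.shape[:2]
    kp_ref, des_ref = orb.detectAndCompute(cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY), None)
    warped = [ref.astype(np.float32)]
    for i, frame in enumerate(frames):
        if i == ref_idx:
            continue
        kp, des = orb.detectAndCompute(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), None)
        if des is None:
            continue
        matches = matcher.match(des, des_ref)
        src = np.float32([kp[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([kp_ref[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        if H is not None:
            warped.append(cv2.warpPerspective(frame, H, (w, h)).astype(np.float32))
    # Median across time: the climber covers a given pixel in only a few frames,
    # so the static wall wins the vote.
    return np.median(np.stack(warped), axis=0).astype(np.uint8)
```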
Problems
- Separating people and background and reconstructing each in 3D independently creates the problem of merging the two results back into a single coordinate system.
- A more serious problem is the loss of information cues: a person moving through the space actually provides 3D cues about the background, so reconstructing the two separately throws away spatial information.
AI Coach 2 (MonST3R)
Here we first use MonST3R to build a 3D point cloud (or depth maps) in the presence of moving objects.
The basic idea is similar to AI Coach 1:
- First use segmentation (SAM2) to separate the moving objects and replace them with masks. For the remaining background, use DUSt3R to do pairwise-frame feature extraction, matching, and depth estimation (feedforward), plus camera pose estimation (bundle adjustment). Note that DUSt3R's feedforward pass is fast but of limited accuracy, so BA is still needed to refine accuracy and estimate camera poses; this differs from VGGT. Question: could SAM2 + VGGT work?
- Assuming the moving objects do not occupy a large portion of the frame, SAM2 + DUSt3R is OK.
- The next question is how to fuse the moving objects back in; this is where MonST3R's approach comes in.
AI Coach 3: Direct feedforward + moving objects (VGGT+)
Why use SAM2 and optical flow?
⏺ Great question! Let me explain why MonST3R uses SAM2 and optical flow:
SAM2 (Segment Anything Model 2)
Purpose: Dynamic/static scene segmentation
Why needed:
- MonST3R assumes the scene is mostly static for global optimization
- Dynamic objects (people running, cars moving, etc.) violate the static scene assumption
- Without masking, moving objects create inconsistent geometry and trajectory errors
How it helps:
- SAM2 identifies dynamic foreground objects across frames
- Creates dynamic_masks/ that mark moving regions
- These regions are excluded from optimization (or weighted differently)
- Result: Cleaner background reconstruction, better camera poses
Example: In lady-running, the running person is dynamic while the background is static. SAM2 masks the person so optimization focuses on the static background.
Optical Flow (RAFT)
Purpose: Dense pixel correspondence between consecutive frames
Why needed:
- MonST3R’s initial depth predictions may have temporal inconsistencies
- Consecutive frames should have smooth, consistent motion
- Need additional constraints beyond just geometric reconstruction
How it helps:
- RAFT computes pixel motion between frames (2D flow vectors)
- Flow loss (flow_loss_weight=0.01) enforces that:
- 3D points should project consistently across frames when accounting for camera motion
- Temporal smoothness in depth and camera trajectory
- Acts as a regularization during optimization iterations
The Pipeline:
Frame 1, Frame 2 → RAFT → Flow vectors
Frame 1, Frame 2 → MonST3R → Initial 3D points + poses
→ Optimization with flow loss
→ Refined 3D points + smooth poses
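To make the flow loss concrete, here is a minimal NumPy sketch of the idea (not MonST3R's actual implementation): compute the flow that the estimated depth and relative camera pose would induce on a static scene, and penalize its deviation from RAFT's observed flow.

```python
import numpy as np

def induced_flow(depth1, K, R, t):
    """Flow from frame 1 to frame 2 caused purely by camera motion (static scene):
    back-project each pixel with its depth, apply the relative pose (R, t), re-project."""
    h, w = depth1.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T      # 3 x N homogeneous pixels
    pts1 = (np.linalg.inv(K) @ pix) * depth1.reshape(1, -1)                # 3D points in camera 1
    pts2 = R @ pts1 + t.reshape(3, 1)                                      # same points in camera 2
    proj = K @ pts2
    uv2 = (proj[:2] / proj[2:]).T.reshape(h, w, 2)
    return uv2 - np.stack([u, v], axis=-1)                                 # per-pixel flow (du, dv)

def flow_loss(depth1, K, R, t, raft_flow, weight=0.01):
    """L1 gap between camera-induced flow and observed RAFT flow; large residuals
    indicate dynamic pixels or inconsistent depth / camera poses."""
    return weight * np.abs(induced_flow(depth1, K, R, t) - raft_flow).mean()
```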
Why Real-time Mode (--real_time) is Faster
It skips both of the steps above:
- No SAM2 segmentation
- No optical flow computation
- No iterative optimization with flow loss
- Just feed-forward depth prediction
That’s why it’s ~10x faster but lower quality - no temporal consistency enforcement or dynamic object handling.
Trade-offs
- With SAM2 + Flow (Options A, B, C): Better quality, slower, handles dynamic scenes
- Without (Option D): Faster, but temporal inconsistencies and dynamic object artifacts
For your climbing video (janja1_10fps.mp4), SAM2 would mask the climber, and optical flow would ensure smooth camera motion estimation.
AI Coach 4 (SAM2 + COLMAP + 2D Pose lift to 3D)
Project Purpose
This is a 3D pose estimation pipeline for analyzing climbing videos from a handheld/moving camera. It reconstructs both the static climbing wall and the dynamic human pose in a unified 3D coordinate system.
Architecture
Two-Stage Approach
Stage 1: Remove climber → reconstruct static background + camera poses (COLMAP)
Stage 2: Estimate climber's 3D pose → transform to world coordinates
Why This Design?
- COLMAP requires a static scene - moving objects break Structure-from-Motion (SfM)
- SAM 2 provides temporally consistent segmentation masks across video frames
- Coordinate transformation ensures climber and wall share the same 3D reference frame
Directory Structure
bouldering/
├── roof_climb.mp4        # Input video (3.4MB)
├── frames/               # Extracted frames (167 frames)
├── masks_sam2/           # Person segmentation masks
├── frames_masked/        # Masked frames for COLMAP
├── colmap_workspace/     # COLMAP reconstruction outputs
│   └── sparse/0/
│       ├── cameras.txt   # Camera intrinsics
│       ├── images.txt    # Camera extrinsics per frame
│       └── points3D.txt  # 3D background point cloud
├── poses_output/         # Pose estimation results
│   ├── poses_results.pkl # 2D poses from MediaPipe
│   └── poses_3d.pkl      # 3D poses in world coordinates
├── poses_visualized/     # 2D pose visualizations
├── github/sam2/          # SAM 2 submodule
└── VIBE/                 # VIBE submodule (optional)
Key Scripts
- Frame Extraction
- extract_frames.sh: FFmpeg wrapper to extract frames from video
- Usage: bash extract_frames.sh roof_climb.mp4 frames/
- Person Segmentation (Stage 1a)
- mask_generation_sam2.py: SAM 2 video segmentation
- Uses facebook/sam2.1-hiera-tiny (or larger models)
- Auto-detects person with YOLOv8 or uses centered box prompt
- Generates temporally consistent masks across video frames
- Output: One binary mask per frame in masks_sam2/
- COLMAP Pipeline (Stage 1b)
- run_colmap.sh / run_colmap_original.sh: COLMAP execution wrappers
- filter_keypoints_for_colmap.py: Removes features inside person masks
- colmap_instructions.txt: Detailed COLMAP workflow
- Output: Camera poses (R, t) and background 3D points
- Pose Detection (Stage 2a)
- run_pose_detection.py: 2D pose detection
- Uses MediaPipe Pose (fallback) or MMPose (if available)
- Detects 33 keypoints per frame (MediaPipe format)
- Output: poses_output/poses_results.pkl
- 3D Lifting (Stage 2b)
- lift_poses_to_3d.py: Lifts 2D poses to 3D using COLMAP camera parameters
- Single-view depth estimation (assumes person ~5m from camera)
- Uses camera intrinsics (K) and extrinsics (R, t) from COLMAP
- Transforms from camera coordinates to world coordinates
- Output: poses_output/poses_3d.pkl
- Visualization
- visualize_poses.py: Visualizes 2D poses overlaid on frames
- visualize_3d_scene.py: Combines wall point cloud + 3D poses
- Creates two views: multiple poses and single pose detail
- Exports: 3d_scene_combined.png, 3d_scene_single_pose.png, 3d_scene_data.npz
- Transform & Scale (To Be Used)
- transform_and_scale.py: Coordinate transformation with scale alignment
- Not currently in use (lift_poses_to_3d.py does similar work)
- Includes an anatomical height heuristic for scale (~1.7 m person); a sketch follows after this list
- Utilities
- test_sam2_output.py: Verify SAM 2 mask quality
- run_sam2_batched.py: Batch processing for SAM 2
- view_npz.py: Inspect NPZ data files
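A possible form of the anatomical height heuristic mentioned under Transform & Scale above (a sketch, not the contents of transform_and_scale.py; the landmark indices are MediaPipe's nose and left ankle, used as rough head-to-foot endpoints):

```python
import numpy as np

def anatomical_scale(keypoints_3d, head_idx=0, ankle_idx=27, person_height_m=1.7):
    """Scale factor mapping reconstruction units to metres, assuming the median
    head-to-ankle distance of the skeleton corresponds to a ~1.7 m person.
    keypoints_3d: (num_frames, num_keypoints, 3) in reconstruction units."""
    heights = np.linalg.norm(
        keypoints_3d[:, head_idx, :] - keypoints_3d[:, ankle_idx, :], axis=-1)
    return person_height_m / np.median(heights)

# Usage sketch: poses_metric = poses * anatomical_scale(poses)
```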
Data Flow
roof_climb.mp4
  ↓ [extract_frames.sh]
frames/*.jpg
  ↓ [mask_generation_sam2.py]
masks_sam2/*_mask.png
  ↓ [apply masks to frames]
frames_masked/*.jpg
  ↓ [COLMAP pipeline]
colmap_workspace/sparse/0/{cameras,images,points3D}.txt ──┐
                                                          ↓
frames/*.jpg ──→ [run_pose_detection.py] ──→ poses_results.pkl
  ↓ [lift_poses_to_3d.py + COLMAP params]
poses_3d.pkl
  ↓ [visualize_3d_scene.py]
3d_scene_combined.png + 3d_scene_data.npz
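A minimal sketch of the [apply masks to frames] step, assuming one binary PNG per frame in masks_sam2/ with the person in white; the climber's pixels are simply blacked out so COLMAP finds no features on the moving body (the pipeline additionally filters keypoints with filter_keypoints_for_colmap.py):

```python
import glob
import os
import cv2

def apply_masks(frames_dir="frames", masks_dir="masks_sam2", out_dir="frames_masked"):
    """Black out the person region in every frame so COLMAP only sees the static wall."""
    os.makedirs(out_dir, exist_ok=True)
    for frame_path in sorted(glob.glob(os.path.join(frames_dir, "*.jpg"))):
        stem = os.path.splitext(os.path.basename(frame_path))[0]
        mask = cv2.imread(os.path.join(masks_dir, f"{stem}_mask.png"), cv2.IMREAD_GRAYSCALE)
        frame = cv2.imread(frame_path)
        frame[mask > 127] = 0  # zero out pixels belonging to the climber
        cv2.imwrite(os.path.join(out_dir, os.path.basename(frame_path)), frame)
```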
Technology Stack
Dependencies
- Python 3.10 (conda environment: openmmlab)
- SAM 2.1: Video segmentation (facebook/sam2)
- COLMAP: Structure-from-Motion for static scene reconstruction
- MediaPipe: 2D/3D pose detection (33 keypoints)
- MMPose (optional): More accurate 2D pose detection
- VIBE (optional): SMPL-based 3D pose estimation
- OpenCV: Image processing
- NumPy/Matplotlib: Data processing and visualization
Hardware Requirements
- GPU: CUDA-enabled (RTX 3090 or similar)
- VRAM: 6GB+ (hiera_t), 18GB+ (hiera_l)
Current Status
Based on the outputs in the directory:
Completed:
- ✅ Frame extraction (167 frames)
- ✅ SAM 2 mask generation
- ✅ COLMAP reconstruction (38MB database, sparse reconstruction)
- ✅ 2D pose detection (MediaPipe)
- ✅ 3D pose lifting with COLMAP parameters
- ✅ 3D scene visualization (wall + poses)
Generated Outputs:
- colmap_3d_points.png / colmap_3d_points_masked.png
- pose_summary.jpg
- 3d_scene_combined.png / 3d_scene_single_pose.png
- 3d_scene_data.npz (777KB)
Key Coordinate Systems
Camera Coords (MediaPipe/VIBE) → World Coords (COLMAP)
Transformation: X_world = R @ X_cam + t, where R, t come from colmap_workspace/sparse/0/images.txt
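One way to obtain R and t in the camera-to-world sense used by the formula above (a sketch, assuming images.txt follows COLMAP's documented text format; COLMAP stores the world-to-camera pose, which is inverted here):

```python
import numpy as np

def quat_to_rot(qw, qx, qy, qz):
    """Rotation matrix from a unit quaternion (COLMAP order: QW QX QY QZ)."""
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])

def load_cam_to_world(images_txt="colmap_workspace/sparse/0/images.txt"):
    """Per-image (R, t) such that X_world = R @ X_cam + t.
    Assumes COLMAP's standard two-lines-per-image layout:
    IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME, then one line of 2D points."""
    poses = {}
    with open(images_txt) as f:
        lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]
    for line in lines[::2]:  # every other line is the 2D-point list
        vals = line.split()
        qw, qx, qy, qz = map(float, vals[1:5])
        t_wc = np.array(list(map(float, vals[5:8])))
        R_wc = quat_to_rot(qw, qx, qy, qz)
        R, t = R_wc.T, -R_wc.T @ t_wc   # invert the stored world-to-camera pose
        poses[vals[9]] = (R, t)         # keyed by image name
    return poses
```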
Documentation Files
- CLAUDE.md: This file - project instructions for Claude Code
- README.md: Detailed pipeline explanation and usage guide
- INITIAL.md: Initial project setup notes
- requirements.md: Dependency installation guide
- colmap_instructions.txt: COLMAP workflow steps
Model Options
SAM 2 Models
- hiera_t (tiny): Fast, 6GB VRAM, medium quality
- hiera_s (small): Balanced
- hiera_b (base): Better edges, 10GB VRAM
- hiera_l (large): Best quality, 18GB VRAM
Pose Models
- MediaPipe Pose: Fast, 33 keypoints, good for single person
- MMPose RTMPose: More accurate, COCO format (17 keypoints)
- VIBE: SMPL mesh output, temporal consistency
Next Steps / Future Enhancements
Not currently implemented but documented:
- SMPL mesh fitting for full body model
- Neural rendering (NeRF/Gaussian Splatting)
- Multi-person tracking
- Real-time inference
- Dense COLMAP reconstruction
File Naming Conventions
- Frames: 00001.jpg, 00002.jpg, …
- Masks: 00001_mask.png, 00002_mask.png, …
- Outputs: poses_results.pkl, poses_3d.pkl, 3d_scene_data.npz
This is a well-structured research pipeline that successfully combines state-of-the-art computer vision techniques (SAM 2, COLMAP, MediaPipe) to reconstruct both static environment and dynamic human motion from a single handheld camera video.
Pose Detection
Use MediaPipe for pose detection.
Lift to 3D
Applications
- Video segmentation
- Video pose vocabulary: label moves such as flagging, cross-over, rock-over, …
Stage 1: MediaPipe - 2D Keypoints
Stage 2: MMPose - 2D Keypoints
Stage 4: VGGT - MoVieS
Stage 3: MMPose - 3D Keypoints
First MMDetection: do object detection first (YOLOX-S), then do 3D keypoint / pose detection directly (RTMPose model). The 3D keypoints in camera space are then converted to 2D keypoints, which are also used for a 2D pixel-coordinate overlay.
flowchart TD
%% Step 1: Detection
A[Full Image] --> B[Detection YOLOX-S / RTMDet-tiny]
B --> C[Crop Person Region]
%% Step 2: 3D Pose Estimation
C --> D[RTMPose3D RTMw3D-L]
D --> E[3D Keypoints Camera Space]
%% Step 3: 2D Derivation
E --> F[2D Keypoints Pixel]
%% Step 4: Demo Transform
E --> G[Demo Coordinates]
%% Step 5: Ground Rebasing
G --> H[Grounded 3D Keypoints]
%% Visualization modes
F --> I[Standard Overlay]
G --> J[Demo Visualizer]
F --> K[Performance Mode]
🗂️ Model and Coordinate System Summary
Model and Coordinate System Table
| Model Type | Model Name | Purpose | Input | Output Coordinates | Coordinate Space | Usage |
|---|---|---|---|---|---|---|
| 🔍 Detection | YOLOX-S | Person localization | Full image | Bounding boxes [x1, y1, x2, y2] | Pixel coordinates | Crop person regions for pose estimation |
| | RTMDet-tiny (fallback) | Person detection | Full image | Bounding boxes [x1, y1, x2, y2] | Pixel coordinates | Fallback person detection |
| 🎯 3D Pose | RTMPose3D (RTMw3D-L) | End-to-end 3D pose | Person bbox crop | keypoints_3d [K, 3] | Camera space coordinates | 3D analysis, visualization |
| 📐 2D Pose | Derived from RTMPose3D | 2D projection | Same as 3D | keypoints_2d [K, 2] | Pixel coordinates | 2D overlay, visualization |
🗺️ Coordinate Systems and Transformations
| Coordinate System | Description | Format | Range | Source | Usage |
|---|---|---|---|---|---|
| Pixel Coordinates | Image pixel space | [u, v] | [0, width] × [0, height] | Detection model, 2D projection | 2D overlay, bounding boxes |
| Camera Space | 3D coordinates relative to the camera | [X, Y, Z] | Real-world units (mm) | RTMPose3D direct output | 3D analysis, depth estimation |
| Demo Transform | 3D visualization space | [-X, Z, -Y] | Axis-swapped, rebased | Demo coordinate transform | Side-by-side visualization |
🔄 Data Flow and Transformations
| Step | Process | Input | Transformation | Output | Code Location |
|---|---|---|---|---|---|
| 1 | Detection | Full image | YOLOX inference | Person bboxes (pixel) | rtm_pose_analyzer.py:720 |
| 2 | 3D Pose Estimation | Cropped person | RTMPose3D inference | keypoints_3d (camera space) | rtm_pose_analyzer.py:768 |
| 3 | 2D Derivation | 3D keypoints | Pixel projection | keypoints_2d_overlay (pixel) | rtm_pose_analyzer.py:864 |
| 4 | Demo Transform | 3D keypoints | `keypoints = -keypoints[..., [0, 2, 1]]` | Demo coordinates | mmpose_visualizer.py:228 |
| 5 | Ground Rebasing | Demo coordinates | `keypoints[..., 2] -= min(Z)` | Grounded 3D | mmpose_visualizer.py:238 |
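The two transformation rows above correspond roughly to the following NumPy operations (a readable paraphrase, not the code at the cited lines):

```python
import numpy as np

def to_demo_coords(keypoints_3d):
    """Demo transform from the table: permute and flip the camera-space axes
    for the side-by-side 3D plot."""
    return -keypoints_3d[..., [0, 2, 1]]

def rebase_to_ground(keypoints_demo):
    """Ground rebasing from the table: shift the vertical axis so the lowest
    joint sits on the ground plane (min Z becomes 0)."""
    kp = keypoints_demo.copy()
    kp[..., 2] -= kp[..., 2].min()
    return kp
```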
📊 Camera Parameters
| Parameter | Default Value | Source | Usage | Formula |
|---|---|---|---|---|
| Focal Length | fx = 1145.04, fy = 1143.78 | Fixed in model | Camera space conversion | X_cam = (u - cx) / fx * Z |
| Principal Point | cx, cy = image_center | Dynamic (image size / 2) | Optical center | Y_cam = (v - cy) / fy * Z |
| Camera Params | Can override from dataset | Dataset metadata | Higher accuracy | Replaces defaults when available |
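The formulas in the table are the standard pinhole back-projection; a small sketch using the default focal lengths above and an image-centre principal point:

```python
import numpy as np

def backproject(u, v, depth_z, img_w, img_h, fx=1145.04, fy=1143.78):
    """Pinhole back-projection of pixel (u, v) with depth Z into camera space,
    using the default focal lengths and an image-centre principal point."""
    cx, cy = img_w / 2.0, img_h / 2.0
    x_cam = (u - cx) / fx * depth_z
    y_cam = (v - cy) / fy * depth_z
    return np.array([x_cam, y_cam, depth_z])

# Example: a keypoint at pixel (640, 360) assumed 3000 mm away in a 1280x720 image.
print(backproject(640, 360, 3000.0, 1280, 720))
```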
🎭 Visualization Modes
| Mode | 2D Source | 3D Source | Coordinate Transform | Purpose |
|---|---|---|---|---|
| Standard Overlay | keypoints_2d_overlay | keypoints_3d | Direct pixel coordinates | Basic 2D pose overlay |
| Demo Visualizer | keypoints_2d_overlay | Demo transformed | Axis swap + rebase | Side-by-side 2D+3D view |
| Performance Mode | Cached analysis | Cached analysis | Frame interpolation | Fast processing |
🔧 Key Implementation Details
| Aspect | Implementation | Location | Notes |
|---|---|---|---|
| Model Loading | Single RTMPose3D model | rtm_pose_analyzer.py:172 | End-to-end 3D estimation |
| Detection Fallback | Full-frame if no persons | rtm_pose_analyzer.py:757 | Graceful degradation |
| Coordinate Storage | Dual format storage | rtm_pose_analyzer.py:875-876 | Both pixel and camera space |
| Frame Sync | Demo-exact processing | rtm_pose_analyzer.py:1175-1177 | Eliminates frame interleaving |
🎯 Summary
This implementation uses a unified architecture where:
- 🔍 Detection Model: Localizes persons in pixel space
- 🎯 Single RTMPose3D Model: Produces both 2D and 3D coordinates from the same inference
- 📐 Coordinate Systems: Multiple representations for different purposes
- 🎭 Visualization: Dual-mode support for standard overlay and demo visualizer
👉 The key insight is that both 2D and 3D keypoints come from the same RTMPose3D model, ensuring perfect geometric consistency while supporting multiple coordinate representations for different use cases (overlay, analysis, visualization).
Why Detection
With a detector, multiple people can each be given their own pose estimation.
Detection Model Used:
- Primary: YOLOX-S (yolox_s_8x8_300e_coco)
- Fallback 1: YOLOX-S from model registry
- Fallback 2: RTMDet-tiny (rtmdet_tiny_8x32_300e_coco)
Purpose of Detection Model:
- Person Localization:

      def _detect_persons(self, frame):
          """Detect person bounding boxes for pose estimation input"""
          det_result = inference_detector(self.detector, frame)
          # Filter person class (class_id = 0) with confidence > 0.3
- Improved Pose Accuracy:
- Crop-based processing: Focus RTMPose3D on detected person regions
- Bbox-relative coordinates: Better accuracy within person bounds
- Multi-person handling: Separate each detected person for individual pose estimation
- Processing Pipeline:

      if self.detector is not None and self.enable_detection:
          bboxes = self._detect_persons(frame)  # Get person bounding boxes
          pose_results = inference_topdown(self.pose3d_model, frame, bboxes)  # Pose estimation within boxes
      else:
          # Fallback: full-frame analysis
          bboxes = np.array([[0, 0, width, height]], dtype=np.float32)
- Benefits:
- Higher precision: RTMPose3D works better on cropped person regions
- Better multi-person support: Individual detection → individual pose estimation
- Computational efficiency: Process only relevant image regions
- Robust handling: Graceful fallback to full-frame if detection fails
🔧 Summary
- Camera projection: Standard pinhole camera model with focal length and principal point
- 2D stability: Direct pixel coordinates, minimal transformation
- 3D flicker: Multiple coordinate transformations amplify prediction noise
- Detection purpose: Person localization for improved pose estimation accuracy and multi-person support
🧠 Key Insights
Shared Source:
- Both 2D and 3D keypoints originate from the same RTMPose3D model inference
- The model is trained end-to-end to predict 3D coordinates directly
Different Representations:
- 2D: Image pixel coordinates for overlay visualization
- 3D: Camera space coordinates with depth information for spatial analysis
Coordinate Consistency:
- Since they share the same source, the 2D and 3D keypoints are geometrically consistent
- The 2D coordinates represent the projection of the 3D points onto the image plane
- This ensures perfect alignment between 2D overlay and 3D visualization
💡 Practical Implications
- Accuracy: Both coordinate sets have the same detection accuracy since they’re from one model
- Consistency: No drift between 2D and 3D representations
- Efficiency: Single inference produces both coordinate systems
- Reliability: Strong geometric relationship between 2D and 3D poses
Appendix
Perfect 👍 — let’s build a ready-to-run Python notebook example for your case: MMPose 2D keypoints + bundle adjustment cameras → 3D pose triangulation + visualization.
This example assumes:
- You already have:
  - 2D pose JSONs from MMPose (pose2d_results/)
  - Camera parameters (cameras.json)
  - Optional 3D point cloud (scene.ply)
- You want a 3D skeleton in the SfM world coordinate system.
🧠 Step-by-Step Code (≈100 lines)
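A condensed sketch of the core triangulation step (an illustration, not the full ≈100-line notebook), assuming each joint is observed in at least two views with known projection matrices and that cameras.json follows the hypothetical format shown further below:

```python
import numpy as np
import cv2

def proj_matrix(cam):
    """P = K [R | t] from one (assumed) cameras.json entry."""
    K = np.asarray(cam["K"], dtype=float)
    R = np.asarray(cam["R"], dtype=float)
    t = np.asarray(cam["t"], dtype=float).reshape(3, 1)
    return K @ np.hstack([R, t])

def triangulate_pose(kpts_a, kpts_b, cam_a, cam_b):
    """Triangulate K joints observed in two views.
    kpts_a, kpts_b: (K, 2) pixel keypoints from MMPose for the same joints."""
    P1, P2 = proj_matrix(cam_a), proj_matrix(cam_b)
    pts4d = cv2.triangulatePoints(P1, P2, kpts_a.T.astype(float), kpts_b.T.astype(float))
    return (pts4d[:3] / pts4d[3]).T  # (K, 3) joints in the SfM world frame

# Usage sketch (file and image names are assumptions):
# cams = json.load(open("cameras.json"))
# pose3d = triangulate_pose(kpts_view1, kpts_view2, cams["00010.jpg"], cams["00020.jpg"])
```

Note that with a single moving camera and a moving climber, two frames only observe the same 3D joint if the pose is momentarily static; that limitation is why lift_poses_to_3d.py falls back to a single-view depth assumption instead.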
📁 Folder Structure Example
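A plausible layout matching the inputs listed above (directory names are illustrative):

project/
├── pose2d_results/   # MMPose 2D keypoint JSONs, one per frame
├── cameras.json      # per-frame intrinsics and extrinsics from SfM / bundle adjustment
├── scene.ply         # optional background point cloud
└── pose3d_output/    # triangulated 3D skeletons (output)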
🧩 cameras.json Format Example
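A hypothetical entry in the format assumed by the triangulation sketch above (an assumption, not the original notebook's schema): each image name maps to a 3x3 intrinsic matrix K and a world-to-camera rotation R and translation t.

```python
import json

# Hypothetical cameras.json content; all values are placeholders.
cameras = {
    "00010.jpg": {
        "K": [[1145.0, 0.0, 640.0], [0.0, 1145.0, 360.0], [0.0, 0.0, 1.0]],  # intrinsics
        "R": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],            # world-to-camera rotation
        "t": [0.0, 0.0, 0.0],                                                # world-to-camera translation
    },
}
with open("cameras.json", "w") as f:
    json.dump(cameras, f, indent=2)
```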
🧠 Optional Extensions
- Add temporal smoothing with a Kalman filter or Gaussian filter on pose3d sequences.
- Integrate with SMPLify-X to fit full 3D body meshes.
- For multiple people, repeat the pipeline per person (MMPose detects multiple instances).
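For the temporal-smoothing extension in the list above, a Gaussian filter along the time axis is often enough as a first pass (a sketch; sigma is a tuning choice):

```python
from scipy.ndimage import gaussian_filter1d

def smooth_pose_sequence(pose3d, sigma=2.0):
    """Gaussian smoothing along the time axis, applied independently to each
    joint coordinate. pose3d: (num_frames, num_joints, 3) array."""
    return gaussian_filter1d(pose3d, sigma=sigma, axis=0)
```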
Would you like me to extend this into a multi-frame version (loop over video frames, smooth temporally, and export .ply animation)?
That’s often the next step once single-frame triangulation works.
Reference
- Magnus YouTube example