Introduction
I am into bouldering and would like to do a project using AI for coaching.
Pre-Stage
Training Video Collection
Scrape YouTube for bouldering videos. Search by climber names: Janja, Ai Mori.
Download YouTube video: https://www.youtube.com/shorts/m4b0uDBh4nE. Tried iMyFone TopClipper (no good).
Video editing to trim the head and tail: CapCut or Microsoft Clipchamp.
Pose Detection
Use MediaPipe for pose detection (a minimal sketch follows the list below).
Convert the result to 3D.
Applications:
- Video segmentation
- Video pose language: label moves such as flagging, cross-over, rock-over, …
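A minimal sketch of that MediaPipe step, assuming frame-by-frame processing of a downloaded clip (the file path is a placeholder):

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

cap = cv2.VideoCapture("bouldering_clip.mp4")  # placeholder path
with mp_pose.Pose(static_image_mode=False, model_complexity=1) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            # 33 landmarks; x, y are normalized to [0, 1], z is relative depth.
            nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
            print(f"nose: ({nose.x:.3f}, {nose.y:.3f}, {nose.z:.3f})")
cap.release()
```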
Stage 1: MediaPipe - 2D Keypoints
Stage 2: MMPose - 2D Keypoints
Stage 3: MMPose - 3D Keypoints
Stage 4: VGGT - MoVieS
The Stage 3 pipeline: first run object detection with MMDetection (YOLOX-S), then run 3D keypoint / pose estimation directly (RTMPose model). The 3D keypoints in camera space are then converted back to 2D keypoints, which also drive a 2D pixel-coordinate overlay.
```mermaid
flowchart TD
    %% Step 1: Detection
    A[Full Image] --> B[Detection YOLOX-S / RTMDet-tiny]
    B --> C[Crop Person Region]
    %% Step 2: 3D Pose Estimation
    C --> D[RTMPose3D RTMw3D-L]
    D --> E[3D Keypoints Camera Space]
    %% Step 3: 2D Derivation
    E --> F[2D Keypoints Pixel]
    %% Step 4: Demo Transform
    E --> G[Demo Coordinates]
    %% Step 5: Ground Rebasing
    G --> H[Grounded 3D Keypoints]
    %% Visualization modes
    F --> I[Standard Overlay]
    G --> J[Demo Visualizer]
    F --> K[Performance Mode]
```
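A hedged sketch of steps 1–2 of this flow using the MMDetection / MMPose Python APIs; the config and checkpoint paths are placeholders, not the project's actual files:

```python
from mmdet.apis import init_detector, inference_detector
from mmpose.apis import inference_topdown, init_model

# Placeholder configs / checkpoints for YOLOX-S and RTMw3D-L.
detector = init_detector("yolox_s_8x8_300e_coco.py", "yolox_s.pth", device="cuda:0")
pose3d_model = init_model("rtmw3d_l_config.py", "rtmw3d_l.pth", device="cuda:0")

def analyze_frame(frame):
    """Steps 1-2 of the flowchart: detect persons, then estimate 3D pose."""
    det = inference_detector(detector, frame).pred_instances
    person = (det.labels == 0) & (det.scores > 0.3)  # COCO class 0 = person
    bboxes = det.bboxes[person].cpu().numpy()
    # Top-down pose estimation: RTMPose3D runs on each person crop.
    return inference_topdown(pose3d_model, frame, bboxes)
```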
🗂️ Model and Coordinate System Summary
| Model Type | Model Name | Purpose | Input | Output Coordinates | Coordinate Space | Usage |
|---|---|---|---|---|---|---|
| 🔍 Detection | YOLOX-S | Person localization | Full image | Bounding boxes [x1, y1, x2, y2] | Pixel coordinates | Crop person regions for pose estimation |
| 🔍 Detection (fallback) | RTMDet-tiny | Person detection | Full image | Bounding boxes [x1, y1, x2, y2] | Pixel coordinates | Fallback person detection |
| 🎯 3D Pose | RTMPose3D (RTMw3D-L) | End-to-end 3D pose | Person bbox crop | keypoints_3d [K, 3] | Camera space coordinates | 3D analysis, visualization |
| 📐 2D Pose | Derived from RTMPose3D | 2D projection | Same as 3D | keypoints_2d [K, 2] | Pixel coordinates | 2D overlay, visualization |
🗺️ Coordinate Systems and Transformations
| Coordinate System | Description | Format | Range | Source | Usage |
|---|---|---|---|---|---|
| Pixel Coordinates | Image pixel space | [u, v] | [0, width] × [0, height] | Detection model, 2D projection | 2D overlay, bounding boxes |
| Camera Space | 3D world coordinates | [X, Y, Z] | Real-world units (mm) | RTMPose3D direct output | 3D analysis, depth estimation |
| Demo Transform | 3D visualization space | [-X, Z, -Y] | Axis-swapped, rebased | Demo coordinate transform | Side-by-side visualization |
🔄 Data Flow and Transformations
| Step | Process | Input | Transformation | Output | Code Location |
|---|---|---|---|---|---|
| 1 | Detection | Full image | YOLOX inference | Person bboxes (pixel) | rtm_pose_analyzer.py:720 |
| 2 | 3D Pose Estimation | Cropped person | RTMPose3D inference | keypoints_3d (camera space) | rtm_pose_analyzer.py:768 |
| 3 | 2D Derivation | 3D keypoints | Pixel projection | keypoints_2d_overlay (pixel) | rtm_pose_analyzer.py:864 |
| 4 | Demo Transform | 3D keypoints | keypoints = -keypoints[..., [0, 2, 1]] | Demo coordinates | mmpose_visualizer.py:228 |
| 5 | Ground Rebasing | Demo coordinates | keypoints[..., 2] -= min(Z) | Grounded 3D | mmpose_visualizer.py:238 |
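Steps 4–5 are one-liners in NumPy; a standalone sketch with dummy keypoints:

```python
import numpy as np

# Dummy camera-space keypoints, shape [K, 3] = (X, Y, Z) in mm.
keypoints = np.random.randn(17, 3).astype(np.float32) * 100

# Step 4 (demo transform), exactly as in the table above:
demo = -keypoints[..., [0, 2, 1]]

# Step 5 (ground rebasing): shift so the lowest point sits at z = 0.
demo[..., 2] -= demo[..., 2].min()
```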
📊 Camera Parameters
| Parameter | Default Value | Source | Usage | Formula |
|---|---|---|---|---|
| Focal Length | fx = 1145.04, fy = 1143.78 | Fixed in model | Camera space conversion | X_cam = (u - cx) / fx * Z |
| Principal Point | cx, cy = image_center | Dynamic (image size / 2) | Optical center | Y_cam = (v - cy) / fy * Z |
| Camera Params | Can override from dataset | Dataset metadata | Higher accuracy | Replaces defaults when available |
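A small sketch of the back-projection formulas above, using the table's default focal lengths (the example frame size is an assumption):

```python
import numpy as np

def backproject(u, v, z, width, height, fx=1145.04, fy=1143.78):
    """Pinhole back-projection: pixel (u, v) at depth z -> camera space."""
    cx, cy = width / 2, height / 2  # principal point defaults to image center
    x_cam = (u - cx) / fx * z
    y_cam = (v - cy) / fy * z
    return np.array([x_cam, y_cam, z])

# Example: a pixel 100 px right of center at 2 m depth, 1920x1080 frame.
print(backproject(1060, 540, 2000.0, 1920, 1080))  # -> [~174.7, 0.0, 2000.0]
```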
🎭 Visualization Modes
| Mode | 2D Source | 3D Source | Coordinate Transform | Purpose |
|---|---|---|---|---|
| Standard Overlay | keypoints_2d_overlay | keypoints_3d | Direct pixel coordinates | Basic 2D pose overlay |
| Demo Visualizer | keypoints_2d_overlay | Demo transformed | Axis swap + rebase | Side-by-side 2D+3D view |
| Performance Mode | Cached analysis | Cached analysis | Frame interpolation | Fast processing |
🔧 Key Implementation Details
| Aspect | Implementation | Location | Notes |
|---|---|---|---|
| Model Loading | Single RTMPose3D model | rtm_pose_analyzer.py:172 | End-to-end 3D estimation |
| Detection Fallback | Full-frame if no persons | rtm_pose_analyzer.py:757 | Graceful degradation |
| Coordinate Storage | Dual format storage | rtm_pose_analyzer.py:875-876 | Both pixel and camera space |
| Frame Sync | Demo-exact processing | rtm_pose_analyzer.py:1175-1177 | Eliminates frame interleaving |
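For illustration, one way the dual-format storage could look; keypoints_3d and keypoints_2d_overlay follow the tables above, while the bbox and scores fields are hypothetical:

```python
import numpy as np

# Illustrative only: one record keeping both coordinate formats produced by
# a single RTMPose3D inference. Values here are dummies.
K = 17  # number of keypoints
pose_record = {
    "bbox": np.array([100, 50, 400, 700], dtype=np.float32),     # pixel xyxy
    "keypoints_3d": np.zeros((K, 3), dtype=np.float32),          # camera space (mm)
    "keypoints_2d_overlay": np.zeros((K, 2), dtype=np.float32),  # pixel coords
    "scores": np.ones(K, dtype=np.float32),                      # per-keypoint confidence
}
```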
🎯 Summary
This implementation uses a unified architecture where:
- 🔍 Detection Model: Localizes persons in pixel space
- 🎯 Single RTMPose3D Model: Produces both 2D and 3D coordinates from the same inference
- 📐 Coordinate Systems: Multiple representations for different purposes
- 🎭 Visualization: Dual-mode support for standard overlay and demo visualizer
👉 The key insight is that both 2D and 3D keypoints come from the same RTMPose3D model, ensuring perfect geometric consistency while supporting multiple coordinate representations for different use cases (overlay, analysis, visualization).
Why Detection
With detection, multiple people can each get their own pose estimation.
Detection Models Used:
- Primary: YOLOX-S (yolox_s_8x8_300e_coco)
- Fallback 1: YOLOX-S from model registry
- Fallback 2: RTMDet-tiny (rtmdet_tiny_8x32_300e_coco)
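A hedged sketch of that fallback chain (simplified to two candidates; checkpoint paths are placeholders):

```python
from mmdet.apis import init_detector

DETECTOR_CANDIDATES = [
    ("yolox_s_8x8_300e_coco.py", "yolox_s.pth"),           # primary
    ("rtmdet_tiny_8x32_300e_coco.py", "rtmdet_tiny.pth"),  # fallback
]

def load_detector(device="cuda:0"):
    for config, checkpoint in DETECTOR_CANDIDATES:
        try:
            return init_detector(config, checkpoint, device=device)
        except Exception:
            continue  # try the next candidate
    return None  # no detector available: caller falls back to full-frame bboxes
```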
Purpose of Detection Model:
- Person Localization:

```python
def _detect_persons(self, frame):
    """Detect person bounding boxes for pose estimation input."""
    det_result = inference_detector(self.detector, frame)
    # Filter person class (class_id = 0) with confidence > 0.3
    instances = det_result.pred_instances
    keep = (instances.labels == 0) & (instances.scores > 0.3)
    return instances.bboxes[keep].cpu().numpy()
```
- Improved Pose Accuracy:
  - Crop-based processing: Focus RTMPose3D on detected person regions
  - Bbox-relative coordinates: Better accuracy within person bounds
  - Multi-person handling: Separate each detected person for individual pose estimation
- Processing Pipeline:

```python
if self.detector is not None and self.enable_detection:
    bboxes = self._detect_persons(frame)  # Get person bounding boxes
    # Pose estimation within the detected boxes
    pose_results = inference_topdown(self.pose3d_model, frame, bboxes)
else:
    # Fallback: full-frame analysis
    bboxes = np.array([[0, 0, width, height]], dtype=np.float32)
```
- Benefits:
  - Higher precision: RTMPose3D works better on cropped person regions
  - Better multi-person support: Individual detection → individual pose estimation
  - Computational efficiency: Process only relevant image regions
  - Robust handling: Graceful fallback to full-frame if detection fails
🔧 Summary
- Camera projection: Standard pinhole camera model with focal length and principal point
- 2D stability: Direct pixel coordinates, minimal transformation
- 3D flicker: Multiple coordinate transformations amplify prediction noise
- Detection purpose: Person localization for improved pose estimation accuracy and multi-person support
🧠 Key Insights
Shared Source:
- Both 2D and 3D keypoints originate from the same RTMPose3D model inference
- The model is trained end-to-end to predict 3D coordinates directly
Different Representations:
- 2D: Image pixel coordinates for overlay visualization
- 3D: Camera space coordinates with depth information for spatial analysis
Coordinate Consistency:
- Since they share the same source, the 2D and 3D keypoints are geometrically consistent
- The 2D coordinates represent the projection of the 3D points onto the image plane
- This ensures perfect alignment between 2D overlay and 3D visualization
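A quick sketch of that projection relationship, using the default focal lengths from the camera-parameters table and a principal point assumed for a 1920×1080 frame:

```python
import numpy as np

def project(points_3d, fx=1145.04, fy=1143.78, cx=960.0, cy=540.0):
    """Pinhole projection of camera-space points [K, 3] to pixels [K, 2]."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=-1)

# Projecting keypoints_3d should reproduce keypoints_2d_overlay
# up to prediction noise.
keypoints_3d = np.array([[200.0, -100.0, 2500.0]])  # dummy point, mm
print(project(keypoints_3d))  # -> [[1051.6, 494.2]]
```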
💡 Practical Implications
- Accuracy: Both coordinate sets have the same detection accuracy since they’re from one model
- Consistency: No drift between 2D and 3D representations
- Efficiency: Single inference produces both coordinate systems
- Reliability: Strong geometric relationship between 2D and 3D poses
Reference
- Magnus YouTube example