Metacast: 3D Human Pose Estimation

PointPose: Efficient Multi-Person 3D Pose Estimation via Continuous Point Cloud Representation

3D Human Pose Estimation · Point Cloud · Multi-person

A research project I architected and executed at Unity. For technical details, please see our paper.

Abstract

Current point cloud-based 3D human pose estimation methods lag behind image-based approaches despite inherent advantages, including a continuous output space (avoiding quantization errors), reduced computational complexity (eliminating voxel-based heatmap computations), and natural extensibility to other 3D vision tasks. However, existing point cloud models struggle with multi-person scenarios and fail to generalize to highly complex poses, limiting their practical adoption. To address these gaps, we propose a novel point cloud-based framework for robust multi-person 3D pose estimation. Our method integrates three key innovations: (1) a hierarchical feature extraction module inspired by the PointNet architecture, optimized for sparse and unordered point cloud data; (2) an efficient multi-person matching strategy that disentangles pose estimation across individuals in crowded scenes; and (3) a multi-task loss function designed to learn the latent distribution of human poses, significantly improving generalization to challenging articulations. Extensive experiments show that our method delivers state-of-the-art accuracy on two highly challenging martial arts benchmark datasets, outperforming image-based approaches by 25% while cutting computational cost by a factor of 2.4.

Data

1. UMA Synthetic Dataset

UMA
Figure: Examples from UMA Synthetic Dataset

A synthetic point cloud dataset generated within the Unity engine

  • Train set input: point-cloud coordinates
  • Diversity: 6000 scenes
  • Input dimension: 3 × 500 × 6000 (human bodies as point clouds) + 3 × 18 × 6000 (ground-truth joints); see the sketch after this list
  • Inference input: 3 × 500 × 6000 (human bodies as point clouds)
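
The exact data layout isn't specified beyond the dimensions above; as a rough illustration, the sketch below wraps the UMA tensors in a minimal PyTorch `Dataset`, viewing the data as 6000 scenes of 500-point clouds with 18 ground-truth joints each (a transposed view of the 3 × 500 × 6000 layout). The class and file names are hypothetical.

```python
# A minimal sketch (not the project's actual loader): UMA scenes stored as
# (6000, 500, 3) point clouds and (6000, 18, 3) ground-truth joints.
# "uma_points.npy" and "uma_joints.npy" are hypothetical file names.
import numpy as np
import torch
from torch.utils.data import Dataset

class UMADataset(Dataset):
    def __init__(self, points_path="uma_points.npy", joints_path="uma_joints.npy"):
        self.points = np.load(points_path).astype(np.float32)  # (6000, 500, 3)
        self.joints = np.load(joints_path).astype(np.float32)  # (6000, 18, 3)

    def __len__(self):
        return len(self.points)

    def __getitem__(self, idx):
        # One scene: a 500-point human body cloud and its 18 3D joints
        return torch.from_numpy(self.points[idx]), torch.from_numpy(self.joints[idx])
```
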
2. UFC Benchmark 24: Motion Capture with Real UFC Athletes

Figure: Footage from the UFC motion capture session
Figure: Liqiang at the motion capture center

Real fight actions captured with motion capture suits

  • Train set input: point-cloud coordinates
  • Diversity: 100 real fighting scenes performed by 2 professional UFC athletes
  • Purpose: Provide corner-case examples to the model
Method

diagram
Figure: The overall pipeline

The pipeline consists of three main components: a PointNet-based feature extractor, a matching strategy for multi-person scenarios, and a multi-task learning objective for the underlying pose distribution. The feature extraction module captures the spatial relationships between points in the input point cloud. The matching strategy disentangles pose estimation across individuals in crowded scenes. Finally, the multi-task objective learns the latent distribution of human poses, significantly improving generalization to challenging articulations. Sketches of these components follow.
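
The paper remains the authoritative description of the architecture; as a rough sketch under assumed layer widths, a PointNet-style feature extractor applies a per-point shared MLP (1×1 convolutions) followed by a symmetric max-pool, making the extracted features invariant to the ordering of the input points. Module names and dimensions below are illustrative assumptions.

```python
# PointNet-style feature extractor (a sketch, not the paper's exact model):
# shared per-point MLP + order-invariant max-pooling. Dimensions are assumed.
import torch
import torch.nn as nn

class PointFeatureExtractor(nn.Module):
    def __init__(self, in_dim=3, feat_dim=1024):
        super().__init__()
        # 1x1 convolutions act as an MLP shared across all points
        self.mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1), nn.BatchNorm1d(feat_dim), nn.ReLU(),
        )

    def forward(self, points):
        # points: (B, 3, N) batch of clouds with N points each
        per_point = self.mlp(points)               # (B, feat_dim, N)
        global_feat = per_point.max(dim=2).values  # (B, feat_dim), order-invariant
        return per_point, global_feat

# Usage sketch: extract features for a batch of two 500-point clouds; a
# regression head (not shown) would then predict 18 x 3 joints per person.
# per_pt, glob = PointFeatureExtractor()(torch.randn(2, 3, 500))
```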

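For the multi-person matching and the multi-task objective, one standard realization (which may differ from the paper's actual strategy) is Hungarian assignment on pairwise joint distances, followed by a weighted sum of loss terms. The sketch below uses SciPy's `linear_sum_assignment`; the auxiliary term is only a placeholder for the pose-distribution loss described above.

```python
# Sketch of Hungarian matching between predicted and ground-truth persons,
# plus a combined multi-task loss. Illustrative only; the paper's matching
# strategy and auxiliary loss terms are not reproduced here.
import torch
from scipy.optimize import linear_sum_assignment

def match_and_loss(pred, gt, aux_weight=0.1):
    # pred: (P, 18, 3) predicted per-person joints; gt: (G, 18, 3) ground truth
    cost = torch.cdist(pred.flatten(1), gt.flatten(1))  # (P, G) pose distances
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    matched_pred, matched_gt = pred[rows], gt[cols]

    # Main task: joint regression error on matched person pairs
    pose_loss = (matched_pred - matched_gt).abs().mean()
    # Auxiliary task: placeholder standing in for the latent pose-distribution
    # term; here just a scale regularizer on predicted joint positions.
    aux_loss = matched_pred.norm(dim=-1).mean()
    return pose_loss + aux_weight * aux_loss
```
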
Experiments

UMA_test
Figure: Visualizations of predictions on UMA