Metacast: 3D Human Pose Estimation

PointPose: Efficient Multi-Person 3D Pose Estimation via Continuous Point Cloud Representation

3D Human Pose Estimation · Point Cloud · Multi-person

A research project I architected and executed at Unity. For technical details, please see our paper.

Abstract

Current point cloud-based 3D human pose estimation methods lag behind image-based approaches despite inherent advantages, including a continuous output space (avoiding quantization errors), reduced computational complexity (eliminating voxel-based heatmap computations), and natural extensibility to other 3D vision tasks. However, existing point cloud models struggle with multi-person scenarios and fail to generalize to highly complex poses, limiting their practical adoption. To address these gaps, we propose a novel point cloud-based framework for robust multi-person 3D pose estimation. Our method integrates three key innovations: (1) a hierarchical feature extraction module inspired by the PointNet architecture, optimized for sparse and unordered point cloud data; (2) an efficient multi-person matching strategy that disentangles pose estimation across individuals in crowded scenes; and (3) a multi-task loss function designed to learn the latent distribution of human poses, significantly improving generalization to challenging articulations. Extensive experiments show that our method delivers state-of-the-art accuracy on two highly challenging martial arts benchmark datasets, outperforming image-based approaches by 25% while cutting computational cost by a factor of 2.4.

Data

1. UMA Synthetic Dataset

UMA
Figure: Examples from UMA Synthetic Dataset

A synthetic point cloud dataset generated within the Unity engine

  • Train set input: point-cloud coordinates
  • Diversity: 6000 scenes
  • Input dimension: 3 × 500 × 6000 (human bodies as point clouds) + 3 × 18 × 6000 (ground-truth joints); see the sketch after this list
  • Inference input: 3 × 500 × 6000 (human bodies as point clouds)
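
The exact data layout isn't specified beyond the dimensions above; as a rough illustration, the sketch below wraps the UMA tensors in a minimal PyTorch `Dataset`, viewing the data as 6000 scenes of 500-point clouds with 18 ground-truth joints each (a transposed view of the 3 × 500 × 6000 layout). The class and file names are hypothetical.

```python
# A minimal sketch (not the project's actual loader): UMA scenes stored as
# (6000, 500, 3) point clouds and (6000, 18, 3) ground-truth joints.
# "uma_points.npy" and "uma_joints.npy" are hypothetical file names.
import numpy as np
import torch
from torch.utils.data import Dataset

class UMADataset(Dataset):
    def __init__(self, points_path="uma_points.npy", joints_path="uma_joints.npy"):
        self.points = np.load(points_path).astype(np.float32)  # (6000, 500, 3)
        self.joints = np.load(joints_path).astype(np.float32)  # (6000, 18, 3)

    def __len__(self):
        return len(self.points)

    def __getitem__(self, idx):
        # One scene: a 500-point human body cloud and its 18 3D joints
        return torch.from_numpy(self.points[idx]), torch.from_numpy(self.joints[idx])
```
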
2. UFC Benchmark 24: Motion Capture with Real UFC Athletes

Figure: Footage from the UFC motion capture session
Figure: Liqiang at the motion capture center

Real fight actions captured with motion capture suits

  • Train set input: point-cloud coordinates
  • Diversity: 100 real fighting scenes performed by 2 professional UFC athletes
  • Purpose: Provide corner-case examples to the model
Method

diagram
Figure: The overall pipeline

The pipeline consists of three main components: a PointNet-based feature extractor, a matching strategy for multi-person scenarios, and a multi-task learning objective for the underlying pose distribution. The feature extraction module captures the spatial relationships between points in the input point cloud. The matching strategy disentangles pose estimation across individuals in crowded scenes. Finally, the multi-task objective learns the latent distribution of human poses, significantly improving generalization to challenging articulations. Sketches of these components follow.
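
The paper remains the authoritative description of the architecture; as a rough sketch under assumed layer widths, a PointNet-style feature extractor applies a per-point shared MLP (1×1 convolutions) followed by a symmetric max-pool, making the extracted features invariant to the ordering of the input points. Module names and dimensions below are illustrative assumptions.

```python
# PointNet-style feature extractor (a sketch, not the paper's exact model):
# shared per-point MLP + order-invariant max-pooling. Dimensions are assumed.
import torch
import torch.nn as nn

class PointFeatureExtractor(nn.Module):
    def __init__(self, in_dim=3, feat_dim=1024):
        super().__init__()
        # 1x1 convolutions act as an MLP shared across all points
        self.mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1), nn.BatchNorm1d(feat_dim), nn.ReLU(),
        )

    def forward(self, points):
        # points: (B, 3, N) batch of clouds with N points each
        per_point = self.mlp(points)               # (B, feat_dim, N)
        global_feat = per_point.max(dim=2).values  # (B, feat_dim), order-invariant
        return per_point, global_feat

# Usage sketch: extract features for a batch of two 500-point clouds; a
# regression head (not shown) would then predict 18 x 3 joints per person.
# per_pt, glob = PointFeatureExtractor()(torch.randn(2, 3, 500))
```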

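For the multi-person matching and the multi-task objective, one standard realization (which may differ from the paper's actual strategy) is Hungarian assignment on pairwise joint distances, followed by a weighted sum of loss terms. The sketch below uses SciPy's `linear_sum_assignment`; the auxiliary term is only a placeholder for the pose-distribution loss described above.

```python
# Sketch of Hungarian matching between predicted and ground-truth persons,
# plus a combined multi-task loss. Illustrative only; the paper's matching
# strategy and auxiliary loss terms are not reproduced here.
import torch
from scipy.optimize import linear_sum_assignment

def match_and_loss(pred, gt, aux_weight=0.1):
    # pred: (P, 18, 3) predicted per-person joints; gt: (G, 18, 3) ground truth
    cost = torch.cdist(pred.flatten(1), gt.flatten(1))  # (P, G) pose distances
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    matched_pred, matched_gt = pred[rows], gt[cols]

    # Main task: joint regression error on matched person pairs
    pose_loss = (matched_pred - matched_gt).abs().mean()
    # Auxiliary task: placeholder standing in for the latent pose-distribution
    # term; here just a scale regularizer on predicted joint positions.
    aux_loss = matched_pred.norm(dim=-1).mean()
    return pose_loss + aux_weight * aux_loss
```
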
Experiments

UMA_test
Figure: Visualizations of predictions on UMA