Learning Articulated Shape with Keypoint Pseudo-labels from Web Images

Anastasis Stathopoulos1      Georgios Pavlakos2      Ligong Han1      Dimitris Metaxas1     
1Rutgers University      2UC Berkeley     
CVPR 2023

Github code


This paper shows that it is possible to learn models for monocular 3D reconstruction of articulated objects (e.g., horses, cows, sheep), using as few as 50-150 images labeled with 2D keypoints. Our proposed approach involves training category-specific keypoint estimators, generating 2D keypoint pseudo-labels on unlabeled web images, and using both the labeled and self-labeled sets to train 3D reconstruction models. It is based on two key insights: (1) 2D keypoint estimation networks trained on as few as 50-150 images of a given object category generalize well and generate reliable pseudo-labels; (2) a data selection mechanism can automatically create a "curated" subset of the unlabeled web images that can be used for training -- we evaluate four data selection methods. Coupling these two insights enables us to train models that effectively utilize web images, resulting in improved 3D reconstruction performance for several articulated object categories beyond the fully-supervised baseline. Our approach can quickly bootstrap a model and requires only a few images labeled with 2D keypoints. This requirement can be easily satisfied for any new object category. To showcase the practicality of our approach for predicting the 3D shape of arbitrary object categories, we annotate 2D keypoints on giraffe and bear images from COCO -- the annotation process takes less than 1 minute per image.


Overview of the proposed framework. It includes: (a) training a category-specific keypoint estimator with a limited labeled set S (e.g., 50-150 images with keypoint annotations), (b) generating keypoint pseudo-labels on web images, (c) automatically curating the web images with a data selection criterion to create a subset U′, and (d) training a model for 3D shape prediction with images from S and U′.
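The four stages (a)-(d) can be sketched as a small driver function. This is only an illustrative outline, not the released code: `build_training_set`, `train_estimator`, and `select` are hypothetical names, and the actual training and selection routines are passed in as callables because their details are not given here.

```python
from typing import Any, Callable, List, Sequence, Tuple

# An (image, keypoints) pair; both element types are left abstract in this sketch.
Sample = Tuple[Any, Any]


def build_training_set(
    labeled: Sequence[Sample],
    web_images: Sequence[Any],
    train_estimator: Callable[[Sequence[Sample]], Callable[[Any], Any]],
    select: Callable[[List[Sample]], List[Sample]],
) -> List[Sample]:
    """Return the samples from S and U' used to train the 3D shape model."""
    # (a) Train a category-specific keypoint estimator on the small labeled set S.
    estimator = train_estimator(labeled)
    # (b) Generate keypoint pseudo-labels on the unlabeled web images U.
    pseudo_labeled = [(image, estimator(image)) for image in web_images]
    # (c) Curate the pseudo-labeled images with a data selection criterion -> U'.
    curated = select(pseudo_labeled)
    # (d) The 3D shape predictor is then trained on S together with U'.
    return list(labeled) + curated
```

Passing the estimator training and the selection rule as arguments keeps the outline independent of any particular network or selection criterion.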

Given a small set S of images (e.g., 50-150) labeled with 2D keypoints, we train a 2D keypoint estimation network hφ and generate keypoint pseudo-labels on web images (set U). We then select a subset of U to train a 3D shape predictor fθ. Two of the data selection methods are illustrated: (a) CF-CM: an auxiliary 2D keypoint estimator gψ generates predictions on U, and the images with the smallest discrepancy between the keypoint estimates of hφ and gψ are selected (criterion (c)); (b) CF-CM2: fθ is first trained with samples from S and generates predictions on U; the images with the smallest discrepancy between the keypoint estimates of hφ and the keypoints reprojected from fθ are selected (criterion (d)) and used to retrain fθ.
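Both criteria reduce to ranking web images by how closely two sets of keypoint predictions agree and keeping the best-agreeing ones. The sketch below illustrates that idea in plain Python. The discrepancy measure used here (mean Euclidean distance over keypoints) and the names `mean_keypoint_distance` and `select_by_agreement` are assumptions made for illustration; the measure used in the paper may differ, for example by normalizing for object size or accounting for keypoint visibility.

```python
import math


def mean_keypoint_distance(kps_a, kps_b):
    """Mean Euclidean distance between two equally long lists of (x, y) keypoints."""
    total = sum(
        math.hypot(xa - xb, ya - yb)
        for (xa, ya), (xb, yb) in zip(kps_a, kps_b)
    )
    return total / len(kps_a)


def select_by_agreement(preds_a, preds_b, num_keep):
    """Keep the images whose two keypoint predictions disagree the least.

    preds_a: per-image keypoints from the main estimator (pseudo-labels).
    preds_b: per-image keypoints from the second source, e.g. an auxiliary
             2D estimator or keypoints reprojected from a 3D shape predictor.
    Returns the indices of the num_keep lowest-discrepancy images and all scores.
    """
    scores = [mean_keypoint_distance(a, b) for a, b in zip(preds_a, preds_b)]
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return order[:num_keep], scores
```

For example, with three images whose predictions differ by an average of 5, 0, and 1 pixels, `select_by_agreement(..., num_keep=2)` keeps the second and third images.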


Comparison between models trained with and without keypoint PLs. We compare the predictions of CMR trained with and without keypoint pseudo-labels (PLs). For each input image, we show the predicted shape and texture from the inferred camera view, and also render them from alternative viewpoints. Using keypoint pseudo-labels substantially improves 3D reconstruction performance.

Data selection vs. training with all downloaded data. We compare the predictions of ACSM trained with 150 images labeled with keypoints plus keypoint pseudo-labels from (i) all downloaded web images, or (ii) images selected with one of the proposed data selection criteria. For each input image, we show the predicted shape from the inferred camera view and from alternative viewpoints. A data selection method is necessary to effectively utilize web images in our setting.


