The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation

1 Keio University, 2 University of Bristol
Given signals during observation — camera poses, images, and visible 2D hand locations — our proposed method EgoH4 forecasts future 3D hand poses. EgoH4 can forecast hand joints even when the hands are out of view during observation. We show visible 2D hand positions overlaid on the observation frames t1 and t2, and the corresponding camera poses attached to the heads. At t2, the right hand is invisible. In the forecasting frame, the right hand is back in view, while the left hand is now out of view.

Abstract

Forecasting hand motion and pose from an egocentric perspective is essential for understanding human intention. However, existing methods focus solely on predicting positions without considering articulation, and only when the hands are visible in the field of view. This limitation overlooks the fact that approximate hand positions can still be inferred even when the hands are outside the camera's view. In this paper, we propose a method to forecast the 3D trajectories and poses of both hands from an egocentric video, both in and out of the field of view. We propose a diffusion-based transformer architecture for Egocentric Hand Forecasting, EgoH4, which takes the observation sequence and camera poses as input, then predicts future 3D motion and poses for both hands of the camera wearer. We leverage full-body pose information, allowing other joints to provide constraints on hand motion. We denoise the hand and body joints jointly, along with a visibility predictor for hand joints and a 3D-to-2D reprojection loss that minimizes the error when hands are in view. We evaluate EgoH4 on the Ego-Exo4D dataset, combining subsets with body and hand annotations. We train on 156K sequences and evaluate on 34K sequences. EgoH4 improves over the baseline by 3.4 cm in ADE for hand trajectory forecasting and by 5.1 cm in MPJPE for hand pose forecasting.
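The abstract mentions a 3D-to-2D reprojection loss that is applied only when hands are in view. As a rough illustration of that idea (not the paper's implementation), the sketch below projects predicted 3D joints through pinhole intrinsics and averages the 2D error over a visibility mask; the function names, the intrinsics matrix, and the masking scheme are our own illustrative assumptions.

```python
import numpy as np

def reproject(joints_3d, K):
    """Project 3D joints (N, 3) in camera coordinates to 2D pixels
    with pinhole intrinsics K (3x3)."""
    uvw = joints_3d @ K.T            # (N, 3) homogeneous image coords
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth

def masked_reprojection_loss(pred_3d, gt_2d, visibility, K):
    """Mean 2D reprojection error over joints marked visible;
    returns 0 when no joint is in view (hypothetical masking scheme)."""
    pred_2d = reproject(pred_3d, K)
    err = np.linalg.norm(pred_2d - gt_2d, axis=-1)  # per-joint pixel error
    mask = visibility.astype(float)
    denom = mask.sum()
    return float((err * mask).sum() / denom) if denom > 0 else 0.0
```

A loss of this shape supervises only the in-view joints, so out-of-view joints are constrained solely by the 3D objectives.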

Our Method, EgoH4

The framework of our proposed method, EgoH4, shown for a single denoising step. During training, the denoising network learns to estimate the clean data x0 from an arbitrary noise level n. During inference, we iteratively denoise the noisy joints from the maximum diffusion step N down to 0.
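The caption above describes x0-parameterised diffusion: the network predicts the clean joints, and sampling walks from step N down to 0. The minimal numpy sketch below assumes a standard DDPM formulation with a linear beta schedule, and stands in for the network by taking its output x0_hat as an argument; the step count and schedule values are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                  # maximum diffusion step (assumed)
betas = np.linspace(1e-4, 0.02, N)      # illustrative linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, n, noise):
    """Forward process: corrupt clean joints x0 to noise level n."""
    return np.sqrt(alpha_bar[n]) * x0 + np.sqrt(1.0 - alpha_bar[n]) * noise

def ddpm_step(xn, x0_hat, n):
    """One reverse step: DDPM posterior mean given the network's
    x0 estimate, plus sampling noise (none at the final step)."""
    if n == 0:
        return x0_hat
    ab, ab_prev = alpha_bar[n], alpha_bar[n - 1]
    coef_x0 = np.sqrt(ab_prev) * betas[n] / (1.0 - ab)
    coef_xn = np.sqrt(alphas[n]) * (1.0 - ab_prev) / (1.0 - ab)
    mean = coef_x0 * x0_hat + coef_xn * xn
    var = betas[n] * (1.0 - ab_prev) / (1.0 - ab)
    return mean + np.sqrt(var) * rng.standard_normal(xn.shape)
```

With a perfect denoiser (one that always returns the true x0), iterating `ddpm_step` from N-1 down to 0 recovers the clean joints exactly, since the final step returns x0_hat without added noise.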

Qualitative Videos

BibTeX


        @article{Hatano2025EgoH4,
          author = {Hatano, Masashi and Zhu, Zhifan and Saito, Hideo and Damen, Dima},
          title = {The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation},
          journal = {arXiv preprint arXiv:XXXX.XXXXXX},
          year = {2025},
        }