The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation

1 Keio University, 2 University of Bristol
Given signals during observation — camera poses, images, and visible 2D hand locations — our proposed method EgoH4 forecasts future 3D hand poses. EgoH4 can forecast hand joints even when the hands are out of view during observation. We show visible 2D hand positions overlaid on the observation frames t1 and t2, and the corresponding camera poses attached to the heads. At t2, the right hand is invisible. In the forecasting frame, the right hand is back in view, while the left hand is now out of view.

Abstract

Forecasting hand motion and pose from an egocentric perspective is essential for understanding human intention. However, existing methods focus solely on predicting positions without considering articulation, and only when the hands are visible in the field of view. This limitation overlooks the fact that approximate hand positions can still be inferred even when the hands are outside the camera's view. In this paper, we propose a method to forecast the 3D trajectories and poses of both hands from an egocentric video, both in and out of the field of view. We propose a diffusion-based transformer architecture for Egocentric Hand Forecasting, EgoH4, which takes the observation sequence and camera poses as input, then predicts future 3D motion and poses for both hands of the camera wearer. We leverage full-body pose information, allowing other joints to provide constraints on hand motion. We denoise the hand and body joints jointly, along with a visibility predictor for hand joints and a 3D-to-2D reprojection loss that minimizes the error when hands are in view. We evaluate EgoH4 on the Ego-Exo4D dataset, combining subsets with body and hand annotations. We train on 156K sequences and evaluate on 34K sequences. EgoH4 improves over the baseline by 3.4 cm in ADE for hand trajectory forecasting and by 5.1 cm in MPJPE for hand pose forecasting.
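The abstract mentions a 3D-to-2D reprojection loss that is applied only when hands are in view. As a rough illustration of that idea (not the paper's implementation), the sketch below projects predicted 3D joints through pinhole intrinsics and averages the 2D error over a visibility mask; the function names, the intrinsics matrix, and the masking scheme are our own illustrative assumptions.

```python
import numpy as np

def reproject(joints_3d, K):
    """Project 3D joints (N, 3) in camera coordinates to 2D pixels
    with pinhole intrinsics K (3x3)."""
    uvw = joints_3d @ K.T            # (N, 3) homogeneous image coords
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth

def masked_reprojection_loss(pred_3d, gt_2d, visibility, K):
    """Mean 2D reprojection error over joints marked visible;
    returns 0 when no joint is in view (hypothetical masking scheme)."""
    pred_2d = reproject(pred_3d, K)
    err = np.linalg.norm(pred_2d - gt_2d, axis=-1)  # per-joint pixel error
    mask = visibility.astype(float)
    denom = mask.sum()
    return float((err * mask).sum() / denom) if denom > 0 else 0.0
```

A loss of this shape supervises only the in-view joints, so out-of-view joints are constrained solely by the 3D objectives.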

Our Method, EgoH4

The framework of our proposed method, EgoH4, shown for a single denoising step. During training, the denoising network learns to estimate the clean data x0 from an arbitrary noise level n. During inference, we iteratively denoise the noisy joints from the maximum diffusion step N down to 0.
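The caption above describes x0-parameterised diffusion: the network predicts the clean joints, and sampling walks from step N down to 0. The minimal numpy sketch below assumes a standard DDPM formulation with a linear beta schedule, and stands in for the network by taking its output x0_hat as an argument; the step count and schedule values are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                  # maximum diffusion step (assumed)
betas = np.linspace(1e-4, 0.02, N)      # illustrative linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, n, noise):
    """Forward process: corrupt clean joints x0 to noise level n."""
    return np.sqrt(alpha_bar[n]) * x0 + np.sqrt(1.0 - alpha_bar[n]) * noise

def ddpm_step(xn, x0_hat, n):
    """One reverse step: DDPM posterior mean given the network's
    x0 estimate, plus sampling noise (none at the final step)."""
    if n == 0:
        return x0_hat
    ab, ab_prev = alpha_bar[n], alpha_bar[n - 1]
    coef_x0 = np.sqrt(ab_prev) * betas[n] / (1.0 - ab)
    coef_xn = np.sqrt(alphas[n]) * (1.0 - ab_prev) / (1.0 - ab)
    mean = coef_x0 * x0_hat + coef_xn * xn
    var = betas[n] * (1.0 - ab_prev) / (1.0 - ab)
    return mean + np.sqrt(var) * rng.standard_normal(xn.shape)
```

With a perfect denoiser (one that always returns the true x0), iterating `ddpm_step` from N-1 down to 0 recovers the clean joints exactly, since the final step returns x0_hat without added noise.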

Qualitative Videos

BibTeX


        @article{Hatano2025EgoH4,
          author = {Hatano, Masashi and Zhu, Zhifan and Saito, Hideo and Damen, Dima},
          title = {The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation},
          journal = {arXiv preprint arXiv:XXXX.XXXXXX},
          year = {2025},
        }