Prime and Reach:
Synthesising Body Motion for Gaze-Primed Object Reach

1Keio University   2University of Bristol   *Equal Contribution
Prime & Reach sequences from HD-EPIC, using full-body pose from EgoAllo. (Left) A sequence starting with the intention to reach the container (cyan sphere). Gaze priming is evident (gaze intersecting the object) during the approach, before the object is reached. (Right) Similar behaviour is observed when priming and picking up the scale (cyan sphere). [Darker colours indicate later times.]

Prime & Reach Motion Generation

Abstract

Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick up or put down – that is, spotting an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, namely HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned diffusion-based motion generation model, then fine-tune it on our curated sequences, conditioned on the goal pose or location. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including 'Reach Success' and a newly introduced 'Prime Success' metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.
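For concreteness, below is a minimal Python sketch of the two success metrics mentioned above. The hand-joint index, distance threshold, and angular test for gaze priming are our own illustrative assumptions, not the definitions used in the paper.

        import numpy as np

        def reach_success(pred_joints, target_xyz, hand_idx, thresh_m=0.15):
            """Reach Success (sketch): the hand joint of the generated motion ends
            within `thresh_m` metres of the target location. The joint index and
            threshold are illustrative assumptions, not the paper's values."""
            final_hand = pred_joints[-1, hand_idx]              # (3,) hand position at the last frame
            return bool(np.linalg.norm(final_hand - target_xyz) < thresh_m)

        def prime_success(head_pos, gaze_dir, target_xyz, max_angle_deg=10.0):
            """Prime Success (sketch): at some frame during the motion, the gaze
            direction points at the target to within `max_angle_deg` degrees."""
            to_obj = target_xyz[None] - head_pos                # (T, 3) head-to-object vectors
            to_obj /= np.linalg.norm(to_obj, axis=-1, keepdims=True)
            d = gaze_dir / np.linalg.norm(gaze_dir, axis=-1, keepdims=True)
            cos_sim = np.sum(to_obj * d, axis=-1)               # per-frame gaze-object alignment
            return bool((cos_sim > np.cos(np.radians(max_angle_deg))).any())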

Prime and Reach Data Curation

A critical aspect largely unexplored is the role of gaze in priming or ‘spotting’ objects prior to the reaching motion. We address this gap and curate, for the first time, prime and reach motion sequences. In total, we curate 23,728 prime and reach sequences from five datasets, namely HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We showcase sample prime and reach sequences and data statistics below.
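As an illustration of what a prime-and-reach pair looks like computationally, here is a minimal Python sketch that segments one such pair from a recording, given per-frame wrist and head positions, gaze directions, and an object location. The function name, thresholds, and ray-distance test are assumptions for illustration; they are not the exact criteria used to curate the 23,728 sequences.

        import numpy as np

        def find_prime_and_reach(wrist, head, gaze_dir, obj_xyz,
                                 reach_thresh_m=0.10, prime_radius_m=0.10):
            """Curation sketch (assumed thresholds, not the exact rules used for
            the released sequences): the reach frame is the first frame where the
            wrist comes within `reach_thresh_m` of the object; a prime frame is an
            earlier frame whose gaze ray passes within `prime_radius_m` of it.
            Returns (prime_frame, reach_frame), or None if no such pair exists."""
            wrist_dist = np.linalg.norm(wrist - obj_xyz, axis=-1)
            reach_frames = np.nonzero(wrist_dist < reach_thresh_m)[0]
            if len(reach_frames) == 0:
                return None
            reach_f = int(reach_frames[0])

            d = gaze_dir / np.linalg.norm(gaze_dir, axis=-1, keepdims=True)
            to_obj = obj_xyz[None] - head
            t = np.sum(to_obj * d, axis=-1)                     # distance along the gaze ray
            ray_dist = np.linalg.norm(head + t[:, None] * d - obj_xyz, axis=-1)
            primed = (t > 0) & (ray_dist < prime_radius_m)      # object in front of head, near the ray
            prime_frames = np.nonzero(primed[:reach_f])[0]
            if len(prime_frames) == 0:
                return None
            return int(prime_frames[0]), reach_f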

Prime & Reach Motion Diffusion Model

We generate human motion sequences \(\{x^i\}_{i=1}^N\), where \(x^i \in \mathbb{R}^{J\times 3}\) represents \(J\) body joints. Starting from pure noise, the transformer decoder generates motion through iterative denoising over diffusion timesteps \(t = T, \dots, 0\), producing clean motion at \(t=0\). The generation is guided by a set of conditions injected into the decoder (a sampling sketch is given after the list below). We condition our prime and reach motion generation on:

  • Text prompt: We describe the action, e.g., ‘The person moves across and picks/puts an object.’
  • Initial state of the body describing where and how the motion initiates.
  • Goal: (1) goal pose at the end of the motion or (2) goal location to reach.
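The sketch below illustrates the conditioned denoising loop described above. It assumes an MDM-style denoiser that predicts the clean motion at every step and a standard linear DDPM noise schedule; the `denoiser` and `cond` interfaces are placeholders rather than the released implementation.

        import torch

        @torch.no_grad()
        def sample_motion(denoiser, cond, n_frames, n_joints, T=1000, device="cpu"):
            """Conditional sampling sketch. Assumes `denoiser(x_t, t, cond)` returns
            an estimate of the clean motion x0, and a linear DDPM schedule; neither
            is necessarily the paper's exact setup. `cond` bundles the text,
            initial-state, and goal conditions."""
            betas = torch.linspace(1e-4, 0.02, T, device=device)
            alphas = 1.0 - betas
            alpha_bar = torch.cumprod(alphas, dim=0)

            x_t = torch.randn(1, n_frames, n_joints, 3, device=device)  # start from pure noise
            for t in reversed(range(T)):
                x0_hat = denoiser(x_t, torch.tensor([t], device=device), cond)
                if t == 0:
                    return x0_hat                                # clean motion at t = 0
                ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
                # Posterior q(x_{t-1} | x_t, x0_hat) with standard DDPM coefficients.
                coef_x0 = betas[t] * torch.sqrt(ab_prev) / (1.0 - ab_t)
                coef_xt = (1.0 - ab_prev) * torch.sqrt(alphas[t]) / (1.0 - ab_t)
                mean = coef_x0 * x0_hat + coef_xt * x_t
                var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
                x_t = mean + torch.sqrt(var) * torch.randn_like(x_t)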


Qualitative Results


We showcase qualitative results on three datasets: the ground-truth sequence in light green, the goal-pose-conditioned prediction in translucent yellow, and the target-location-conditioned generation in brown. We show the pose at the initial, prime, and reach timesteps. Prime directions for both the ground truth and the predictions are shown with arrows, and the target object location is shown as a sphere.

BibTeX


        @article{hatano2025primeandreach,
          title={Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach},
          author={Hatano, Masashi and Sinha, Saptarshi and Chalk, Jacob and Li, Wei-Hong and Saito, Hideo and Damen, Dima},
          journal={arXiv preprint arXiv:2512.16456},
          year={2025}
        }
    

Since we use HD-EPIC, MoGaze, HOT3D, ADT, and GIMO to curate our sequences, please also cite each of these datasets as below.



        @inproceedings{Perrett_2025_CVPR,
            author={Perrett, Toby and Darkhalil, Ahmad and Sinha, Saptarshi and Emara, Omar and Pollard, Sam and Parida, Kranti Kumar and Liu, Kaiting and Gatti, Prajwal and Bansal, Siddhant and Flanagan, Kevin and Chalk, Jacob and Zhu, Zhifan and Guerrier, Rhodri and Abdelazim, Fahd and Zhu, Bin and Moltisanti, Davide and Wray, Michael and Doughty, Hazel and Damen, Dima},
            title={HD-EPIC: A Highly-Detailed Egocentric Video Dataset},
            booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
            year={2025},
            pages={23901-23913}
        }
    


        @article{Kratzer_2021,
          author={Kratzer, Philipp and Bihlmaier, Simon and Midlagajni, Niteesh Balachandra and Prakash, Rohit and Toussaint, Marc and Mainprice, Jim},
          journal={IEEE Robotics and Automation Letters}, 
          title={MoGaze: A Dataset of Full-Body Motions that Includes Workspace Geometry and Eye-Gaze}, 
          year={2021},
          pages={367-373}
        }
    


        @inproceedings{Banerjee_2025_CVPR,
            author    = {Banerjee, Prithviraj and Shkodrani, Sindi and Moulon, Pierre and Hampali, Shreyas and Han, Shangchen and Zhang, Fan and Zhang, Linguang and Fountain, Jade and Miller, Edward and Basol, Selen and Newcombe, Richard and Wang, Robert and Engel, Jakob Julian and Hodan, Tomas},
            title     = {HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos},
            booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
            year      = {2025},
            pages     = {7061-7071}
        }
    


        @inproceedings{Pan_2023_ICCV,
            author    = {Pan, Xiaqing and Charron, Nicholas and Yang, Yongqian and Peters, Scott and Whelan, Thomas and Kong, Chen and Parkhi, Omkar and Newcombe, Richard and Ren, Yuheng (Carl)},
            title     = {Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception},
            booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
            year      = {2023},
            pages     = {20133-20143}
        }
    


        @inproceedings{zheng2022gimo,
          title={Gimo: Gaze-informed human motion prediction in context},
          author={Zheng, Yang and Yang, Yanchao and Mo, Kaichun and Li, Jiaman and Yu, Tao and Liu, Yebin and Liu, C Karen and Guibas, Leonidas J},
          booktitle={European Conference on Computer Vision},
          pages={676--694},
          year={2022}
        }
    

Acknowledgements: This work uses publicly available datasets and annotations to curate P&R sequences. Research at the University of Bristol is supported by EPSRC UMPIRE (EP/T004991/1). M Hatano is supported by JST BOOST, Japan Grant Number JPMJBS2409, and Amano Institute of Technology. S Sinha and J Chalk are supported by EPSRC DTP studentships. At Keio University, we used ABCI 3.0 provided by AIST and AIST Solutions. At the University of Bristol, we acknowledge the use of resources provided by the Isambard-AI National AI Research Resource (AIRR), funded by the UK Government's Department for Science, Innovation and Technology (DSIT) via UK Research and Innovation, and the Science and Technology Facilities Council [ST/AIRR/I-A-I/1023]. In particular, we acknowledge the use of GPU node hours granted as part of the Sovereign AI Unit call project “Gen Model in Ego-sensed World” (Aug-Nov 2025), as well as GPU node hours granted by AIRR Early Access Project ANON-BYYG-VXU6-M (March-May 2025).