Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick-up or put-down, that is, spotting an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, i.e., HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned, diffusion-based motion generation model, then fine-tune it on our curated sequences, conditioned on the goal pose or location. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including ‘Reach Success’ and a newly introduced ‘Prime Success’ metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.
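To give a concrete, if simplified, reading of the two success metrics, the sketch below checks whether the gaze direction at the prime timestep points towards the target and whether the hand ends up close to the goal location. The joint index, thresholds, and exact formulation are illustrative assumptions, not the definitions used in the paper.

# Hypothetical sketch of the two metrics; joint index and thresholds are assumptions.
import numpy as np

def reach_success(joints, goal_loc, hand_idx=21, dist_thresh=0.15):
    # joints: (N, J, 3) generated joint positions; goal_loc: (3,) target object location.
    # Reach counts as successful if the hand joint ends within dist_thresh metres of the goal.
    final_hand = joints[-1, hand_idx]
    return float(np.linalg.norm(final_hand - goal_loc)) < dist_thresh

def prime_success(head_pos, gaze_dir, goal_loc, angle_thresh_deg=20.0):
    # head_pos, gaze_dir: (3,) head position and gaze/facing direction at the prime timestep.
    # Priming counts as successful if the gaze direction points towards the goal
    # within angle_thresh_deg degrees.
    to_goal = goal_loc - head_pos
    to_goal = to_goal / np.linalg.norm(to_goal)
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    cos_sim = np.clip(np.dot(gaze_dir, to_goal), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_sim))) < angle_thresh_deg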
A critical aspect that remains largely unexplored is the role of gaze in priming, or ‘spotting’, objects prior to the reaching motion. We take this missed opportunity and curate, for the first time, prime and reach motion sequences. In total, we curate 23,728 prime and reach sequences from five datasets, namely HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We showcase sample prime and reach sequences and data statistics below.
We generate human motion sequences \(\{x^i\}_{i=1}^N\), where \(x^i \in \mathbb{R}^{J\times 3}\) represents \(J\) body joints. Starting from pure noise, the transformer decoder generates motion through iterative denoising over multiple diffusion timesteps \(t = T, \dots, 0\), producing clean motion at \(t=0\). This generation is guided through a set of conditions injected into the decoder. We condition our prime and reach motion generation on:
We showcase qualitative results on three datasets: the ground-truth sequence is shown in light green, the goal-pose-conditioned prediction in translucent yellow, and the target-location-conditioned generation in brown. We show the pose at the initial, prime, and reach timesteps. The prime direction for both the ground truth and the predictions is shown with arrows, and the target object location is shown as a sphere.
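For concreteness, the iterative denoising described in the generation paragraph above could look like the minimal sketch below, assuming an MDM-style PyTorch model that predicts the clean motion at every step. The denoiser interface, noise schedule, and condition dictionary are illustrative placeholders, not the released implementation.

import torch

@torch.no_grad()
def sample_motion(denoiser, cond, T=1000, N=120, J=22, device="cpu"):
    # Linear noise schedule (an assumption; the actual schedule may differ).
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, N, J, 3, device=device)  # start from pure noise
    for t in reversed(range(T)):                # t = T-1, ..., 0
        a_bar = alpha_bar[t]
        a_bar_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0, device=device)

        # The transformer decoder predicts the clean motion from the noisy input,
        # the diffusion timestep, and the injected conditions (e.g. text, goal
        # pose, or target object location).
        x0_hat = denoiser(x, t, cond)

        # DDPM posterior q(x_{t-1} | x_t, x0_hat): mean plus noise for t > 0.
        coef_x0 = a_bar_prev.sqrt() * betas[t] / (1.0 - a_bar)
        coef_xt = alphas[t].sqrt() * (1.0 - a_bar_prev) / (1.0 - a_bar)
        mean = coef_x0 * x0_hat + coef_xt * x
        if t > 0:
            var = betas[t] * (1.0 - a_bar_prev) / (1.0 - a_bar)
            x = mean + var.sqrt() * torch.randn_like(x)
        else:
            x = mean  # clean motion at t = 0
    return x  # (1, N, J, 3) joint positions over N frames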
@article{hatano2025primeandreach,
title={Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach},
author={Hatano, Masashi and Sinha, Saptarshi and Chalk, Jacob and Li, Wei-Hong and Saito, Hideo and Damen, Dima},
journal={arXiv preprint arXiv:2512.16456},
year={2025},
}
Since we use HD-EPIC, MoGaze, HOT3D, ADT, and GIMO to curate our sequences, please also cite each of these datasets, as below.
@inproceedings{Perrett_2025_CVPR,
author={Perrett, Toby and Darkhalil, Ahmad and Sinha, Saptarshi and Emara, Omar and Pollard, Sam and Parida, Kranti Kumar and Liu, Kaiting and Gatti, Prajwal and Bansal, Siddhant and Flanagan, Kevin and Chalk, Jacob and Zhu, Zhifan and Guerrier, Rhodri and Abdelazim, Fahd and Zhu, Bin and Moltisanti, Davide and Wray, Michael and Doughty, Hazel and Damen, Dima},
title={HD-EPIC: A Highly-Detailed Egocentric Video Dataset},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025},
pages={23901-23913}
}
@article{Kratzer_2021,
author={Kratzer, Philipp and Bihlmaier, Simon and Midlagajni, Niteesh Balachandra and Prakash, Rohit and Toussaint, Marc and Mainprice, Jim},
journal={IEEE Robotics and Automation Letters},
title={MoGaze: A Dataset of Full-Body Motions that Includes Workspace Geometry and Eye-Gaze},
year={2021},
pages={367-373}
}
@inproceedings{Banerjee_2025_CVPR,
author = {Banerjee, Prithviraj and Shkodrani, Sindi and Moulon, Pierre and Hampali, Shreyas and Han, Shangchen and Zhang, Fan and Zhang, Linguang and Fountain, Jade and Miller, Edward and Basol, Selen and Newcombe, Richard and Wang, Robert and Engel, Jakob Julian and Hodan, Tomas},
title = {HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2025},
pages = {7061-7071}
}
@inproceedings{Pan_2023_ICCV,
author = {Pan, Xiaqing and Charron, Nicholas and Yang, Yongqian and Peters, Scott and Whelan, Thomas and Kong, Chen and Parkhi, Omkar and Newcombe, Richard and Ren, Yuheng (Carl)},
title = {Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2023},
pages = {20133-20143}
}
@inproceedings{zheng2022gimo,
title={GIMO: Gaze-Informed Human Motion Prediction in Context},
author={Zheng, Yang and Yang, Yanchao and Mo, Kaichun and Li, Jiaman and Yu, Tao and Liu, Yebin and Liu, C Karen and Guibas, Leonidas J},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
pages={676--694},
year={2022}
}
Acknowledgements: This work uses publicly available datasets and annotations to curate P&R sequences. Research at the University of Bristol is supported by EPSRC UMPIRE (EP/T004991/1). M Hatano is supported by JST BOOST, Japan Grant Number JPMJBS2409, and Amano Institute of Technology. S Sinha and J Chalk are supported by EPSRC DTP studentships. At Keio University, we used ABCI 3.0 provided by AIST and AIST Solutions. At the University of Bristol, we acknowledge the use of resources provided by the Isambard-AI National AI Research Resource (AIRR), funded by the UK Government's Department for Science, Innovation and Technology (DSIT) via UK Research and Innovation, and by the Science and Technology Facilities Council [ST/AIRR/I-A-I/1023]. In particular, we acknowledge the GPU node hours granted as part of the Sovereign AI Unit call project “Gen Model in Ego-sensed World” (Aug-Nov 2025), as well as the GPU node hours granted by AIRR Early Access Project ANON-BYYG-VXU6-M (March-May 2025).