TL;DR: We propose a novel learning-from-human framework that explicitly models intention to capture the causal structure of manipulation behavior.
Embodied foundation models have achieved significant breakthroughs in robotic manipulation, but they rely heavily on large numbers of robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains challenging due to the inherent human-robot embodiment gap. To address this, we argue that the intention underlying human actions can serve as a powerful intermediate representation to bridge this gap. In this paper, we introduce VLIA, a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical action and serves as an observable proxy for human intent. VLIA is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, and is then finetuned on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations, spanning simulation and real-world experiments, long-horizon and fine-grained tasks, few-shot learning, and robustness assessments, demonstrate that our method outperforms existing baselines, exhibits strong generalization, and achieves state-of-the-art performance.
We curate large-scale egocentric datasets for pretraining, which contain hand and gaze annotations with validity masks, unified coordinates, and diverse backgrounds, actions, and objects. Long videos are segmented into shorter clips, yielding more than 150M frames.
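To make the curated format concrete, the sketch below shows one plausible way to represent annotated frames and segment long videos into clips with validity masks; the field names, clip length, and stride are illustrative assumptions, not the dataset's actual schema.

```python
# A minimal sketch, assuming each frame carries optional gaze and hand
# annotations in unified, normalized image coordinates; all names and the
# clip length/stride below are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Frame:
    image_path: str
    gaze_xy: Optional[Tuple[float, float]]                # normalized (x, y) in [0, 1], None if missing
    hand_keypoints: Optional[List[Tuple[float, float]]]   # normalized 2D keypoints, None if missing

@dataclass
class Clip:
    frames: List[Frame]
    gaze_valid: List[bool]    # validity mask: True where a gaze annotation exists
    hands_valid: List[bool]   # validity mask: True where hand annotations exist

def segment_video(frames: List[Frame], clip_len: int = 64, stride: int = 64) -> List[Clip]:
    """Split a long video into shorter clips and record per-frame validity masks.

    Trailing frames that do not fill a whole clip are dropped in this sketch.
    """
    clips = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        window = frames[start:start + clip_len]
        clips.append(Clip(
            frames=window,
            gaze_valid=[f.gaze_xy is not None for f in window],
            hands_valid=[f.hand_keypoints is not None for f in window],
        ))
    return clips
```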
Our model receives a task description, an egocentric observation, and the human or robot state as inputs. It first predicts discrete intention tokens and then generates continuous actions, forming an intention–action reasoning chain. By explicitly modeling intention as an intermediate representation, the framework bridges high-level task understanding and low-level control. We instantiate intention as gaze, parameterized as 2D image coordinates.
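As a concrete illustration of this parameterization, the snippet below sketches one way to discretize a normalized gaze point into intention tokens and decode it back before the action head consumes it; the per-axis bin count and helper names are assumptions for illustration, not the model's actual tokenization.

```python
# A minimal sketch, assuming the gaze intention is a point in [0, 1]^2 that is
# uniformly binned per axis; NUM_BINS and the function names are illustrative.
import numpy as np

NUM_BINS = 256  # assumed per-axis resolution of the intention token vocabulary

def gaze_to_tokens(gaze_xy, num_bins=NUM_BINS):
    """Map normalized gaze coordinates (x, y) in [0, 1]^2 to two discrete tokens."""
    gaze_xy = np.clip(np.asarray(gaze_xy, dtype=np.float64), 0.0, 1.0)
    return np.minimum((gaze_xy * num_bins).astype(np.int64), num_bins - 1)

def tokens_to_gaze(tokens, num_bins=NUM_BINS):
    """Decode discrete intention tokens back to gaze coordinates at bin centers."""
    return (np.asarray(tokens, dtype=np.float64) + 0.5) / num_bins

# Example: a gaze point slightly left of center and below mid-height.
tokens = gaze_to_tokens([0.37, 0.62])   # -> array([ 94, 158])
print(tokens_to_gaze(tokens))           # -> [0.36914062 0.61914062]
```

Discretizing gaze this way gives the intention a simple discrete token interface, so it can be predicted first and the continuous action head can then be conditioned on the decoded 2D coordinate, in the spirit of the intention–action reasoning chain described above.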
To evaluate cross-embodiment transfer and robustness in real-world deployment, we conduct extensive real-robot experiments covering gripper-based manipulation, dexterous manipulation, long-horizon tasks, and fine-grained operations. The green cross indicates the predicted intention. Videos are played at 4x speed.
Tighten the screw.
Stack the blocks.
Put the remote in the drawer.
Type "VLIA" on the keyboard.
We conduct generalization tests for both objects and backgrounds. For the "Put the bottle on the plate" task, the training data contained only one type of water bottle. For the "Put the fruit on the plate" task, the training data involved only grasping lemons against a clean background.
Put the bottle on the plate.
Put the fruit on the plate.
Put the fruit on the plate.
Quantitative comparison between our method and baseline methods on the real-robot experiments.
We evaluate the model's instruction-following ability on human data. The top row shows predictions using the original language instructions from the human dataset, while the bottom row shows predictions using counterfactual language instructions for the same visual input. We find that the model can first infer the correct intent from the task description and then generate the appropriate wrist trajectory.
| Method | ID | OOD-Position | OOD-Object | OOD-Scene |
|---|---|---|---|---|
| ACT | 16/20 | 1/10 | 4/10 | 0/10 |
| DP | 13/20 | 2/10 | 3/10 | 0/10 |
| π0.5 | 17/20 | 3/10 | 4/10 | 3/10 |
| ours w/o CoT | 16/20 | 5/10 | 6/10 | 5/10 |
| ours (robot only) | 14/20 | 2/10 | 4/10 | 0/10 |
| ours (robot+human finetune) | 13/20 | 1/10 | 2/10 | 0/10 |
| ours (robot+human pretrain) | 17/20 | 3/10 | 5/10 | 2/10 |
| ours | 19/20 | 6/10 | 8/10 | 6/10 |
Quantitative evaluation of generalization performance.
| Task | Setting | LFA | DP | HRDT | π0.5 | ours |
|---|---|---|---|---|---|---|
| Cube transfer | ID | 87 | 75 | 89 | 94 | 100 |
| Cube transfer | OOD-Distractors | 12 | 9 | 28 | 25 | 35 |
| Cube transfer | OOD-Lighting | 0 | 0 | 12 | 32 | 36 |
| Hook package | ID | 23 | 7 | 13 | 20 | 32 |
| Hook package | OOD-Distractors | 8 | 3 | 0 | 11 | 14 |
| Hook package | OOD-Lighting | 0 | 0 | 0 | 10 | 19 |
| Peg insertion | ID | 10 | 2 | 17 | 15 | 18 |
| Peg insertion | OOD-Distractors | 7 | 0 | 2 | 6 | 9 |
| Peg insertion | OOD-Lighting | 0 | 0 | 0 | 7 | 16 |
| Pour test tube | ID | 41 | 23 | 24 | 34 | 39 |
| Pour test tube | OOD-Distractors | 11 | 7 | 14 | 24 | 28 |
| Pour test tube | OOD-Lighting | 0 | 0 | 3 | 25 | 22 |
| Slot insertion | ID | 43 | 32 | 42 | 54 | 60 |
| Slot insertion | OOD-Distractors | 23 | 12 | 19 | 47 | 56 |
| Slot insertion | OOD-Lighting | 0 | 0 | 11 | 44 | 50 |
| Thread needle | ID | 56 | 30 | 46 | 33 | 43 |
| Thread needle | OOD-Distractors | 21 | 9 | 23 | 19 | 23 |
| Thread needle | OOD-Lighting | 0 | 0 | 12 | 20 | 21 |
| Average | ID | 43 | 28 | 39 | 41 | 49 |
| Average | OOD-Distractors | 14 | 7 | 14 | 22 | 28 |
| Average | OOD-Lighting | 0 | 0 | 6 | 23 | 27 |
Quantitative comparison between our method and baseline methods on the AV-ALOHA benchmark.
Cube transfer
Hook package
Peg insertion
Pour test tube
Slot insertion
Thread needle
@article{li2026gazevla,
title={GazeVLA: Learning Human Intention for Robotic Manipulation},
author={Li, Chengyang and Xiong, Kaiyi and Xu, Yuan and Qian, Lei and Wang, Yizhou and Zhu, Wentao},
journal={arXiv preprint arXiv:2604.22615},
year={2026}
}