TL;DR: We propose a novel learning-from-human framework that explicitly models intention to capture the causal structure of manipulation behavior.
Embodied foundation models have achieved significant breakthroughs in robotic manipulation, but they rely heavily on large numbers of robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains challenging due to the inherent human-robot embodiment gap. To address this, we argue that the intention underlying human actions can serve as a powerful intermediate representation to bridge this gap. In this paper, we introduce VLIA, a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical action and serves as an observable proxy for human intent. VLIA is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, and is then finetuned on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations, spanning simulation and real-world experiments, long-horizon and fine-grained tasks, few-shot learning, and robustness assessments, demonstrate that our method outperforms existing baselines, exhibits strong generalization, and achieves state-of-the-art performance.
We curate large-scale egocentric datasets for pretraining, which contain hand and gaze annotations with validity masks, unified coordinates, and diverse backgrounds, actions, and objects. Long videos are segmented into shorter clips, yielding more than 150M frames.
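To make the curated format concrete, the sketch below shows one plausible way to represent annotated frames and segment long videos into clips with validity masks; the field names, clip length, and stride are illustrative assumptions, not the dataset's actual schema.

```python
# A minimal sketch, assuming each frame carries optional gaze and hand
# annotations in unified, normalized image coordinates; all names and the
# clip length/stride below are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Frame:
    image_path: str
    gaze_xy: Optional[Tuple[float, float]]                # normalized (x, y) in [0, 1], None if missing
    hand_keypoints: Optional[List[Tuple[float, float]]]   # normalized 2D keypoints, None if missing

@dataclass
class Clip:
    frames: List[Frame]
    gaze_valid: List[bool]    # validity mask: True where a gaze annotation exists
    hands_valid: List[bool]   # validity mask: True where hand annotations exist

def segment_video(frames: List[Frame], clip_len: int = 64, stride: int = 64) -> List[Clip]:
    """Split a long video into shorter clips and record per-frame validity masks.

    Trailing frames that do not fill a whole clip are dropped in this sketch.
    """
    clips = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        window = frames[start:start + clip_len]
        clips.append(Clip(
            frames=window,
            gaze_valid=[f.gaze_xy is not None for f in window],
            hands_valid=[f.hand_keypoints is not None for f in window],
        ))
    return clips
```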
Our model receives a task description, an egocentric observation, and the human or robot state as inputs. It first predicts discrete intention tokens and then generates continuous actions, forming an intention–action reasoning chain. By explicitly modeling intention as an intermediate representation, the framework bridges high-level task understanding and low-level control. We instantiate intention as gaze, parameterized as 2D image coordinates.
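As a concrete illustration of this parameterization, the snippet below sketches one way to discretize a normalized gaze point into intention tokens and decode it back before the action head consumes it; the per-axis bin count and helper names are assumptions for illustration, not the model's actual tokenization.

```python
# A minimal sketch, assuming the gaze intention is a point in [0, 1]^2 that is
# uniformly binned per axis; NUM_BINS and the function names are illustrative.
import numpy as np

NUM_BINS = 256  # assumed per-axis resolution of the intention token vocabulary

def gaze_to_tokens(gaze_xy, num_bins=NUM_BINS):
    """Map normalized gaze coordinates (x, y) in [0, 1]^2 to two discrete tokens."""
    gaze_xy = np.clip(np.asarray(gaze_xy, dtype=np.float64), 0.0, 1.0)
    return np.minimum((gaze_xy * num_bins).astype(np.int64), num_bins - 1)

def tokens_to_gaze(tokens, num_bins=NUM_BINS):
    """Decode discrete intention tokens back to gaze coordinates at bin centers."""
    return (np.asarray(tokens, dtype=np.float64) + 0.5) / num_bins

# Example: a gaze point slightly left of center and below mid-height.
tokens = gaze_to_tokens([0.37, 0.62])   # -> array([ 94, 158])
print(tokens_to_gaze(tokens))           # -> [0.36914062 0.61914062]
```

Discretizing gaze this way gives the intention a simple discrete token interface, so it can be predicted first and the continuous action head can then be conditioned on the decoded 2D coordinate, in the spirit of the intention–action reasoning chain described above.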
To evaluate cross-embodiment transfer and robustness in real-world deployment, we conduct extensive real-robot experiments covering gripper-based manipulation, dexterous manipulation, long-horizon tasks, and fine-grained operations. The green cross indicates the predicted intention. Videos are played at 4x speed.
Tighten the screw.
Stack the blocks.
Put the remote in the drawer.
Type "VLIA" on the keyboard.
We conduct generalization tests for both objects and backgrounds. For the "Put the bottle on the plate" task, the training data contained only one type of water bottle. For the "Put the fruit on the plate" task, the training data involved only grasping lemons against a clean background.
Put the bottle on the plate.
Put the fruit on the plate.
Put the fruit on the plate.
Quantitative comparison between our method and baseline methods on the real-robot experiments.
We evaluate the model's instruction-following ability on human data. The top row shows predictions using the original language instructions from the human dataset, while the bottom row shows predictions using counterfactual language instructions for the same visual input. We find that the model can first infer the correct intent from the task description and then generate the appropriate wrist trajectory.
| Method | ID | OOD-Position | OOD-Object | OOD-Scene |
|---|---|---|---|---|
| ACT | 16/20 | 1/10 | 4/10 | 0/10 |
| DP | 13/20 | 2/10 | 3/10 | 0/10 |
| π0.5 | 17/20 | 3/10 | 4/10 | 3/10 |
| ours w/o CoT | 16/20 | 5/10 | 6/10 | 5/10 |
| ours (robot only) | 14/20 | 2/10 | 4/10 | 0/10 |
| ours (robot+human finetune) | 13/20 | 1/10 | 2/10 | 0/10 |
| ours (robot+human pretrain) | 17/20 | 3/10 | 5/10 | 2/10 |
| ours | 19/20 | 6/10 | 8/10 | 6/10 |
Quantitative evaluation of generalization performance.
| Task | Setting | LFA | DP | HRDT | π0.5 | ours |
|---|---|---|---|---|---|---|
| Cube transfer | ID | 87 | 75 | 89 | 94 | 100 |
| Cube transfer | OOD-Distractors | 12 | 9 | 28 | 25 | 35 |
| Cube transfer | OOD-Lighting | 0 | 0 | 12 | 32 | 36 |
| Hook package | ID | 23 | 7 | 13 | 20 | 32 |
| Hook package | OOD-Distractors | 8 | 3 | 0 | 11 | 14 |
| Hook package | OOD-Lighting | 0 | 0 | 0 | 10 | 19 |
| Peg insertion | ID | 10 | 2 | 17 | 15 | 18 |
| Peg insertion | OOD-Distractors | 7 | 0 | 2 | 6 | 9 |
| Peg insertion | OOD-Lighting | 0 | 0 | 0 | 7 | 16 |
| Pour test tube | ID | 41 | 23 | 24 | 34 | 39 |
| Pour test tube | OOD-Distractors | 11 | 7 | 14 | 24 | 28 |
| Pour test tube | OOD-Lighting | 0 | 0 | 3 | 25 | 22 |
| Slot insertion | ID | 43 | 32 | 42 | 54 | 60 |
| Slot insertion | OOD-Distractors | 23 | 12 | 19 | 47 | 56 |
| Slot insertion | OOD-Lighting | 0 | 0 | 11 | 44 | 50 |
| Thread needle | ID | 56 | 30 | 46 | 33 | 43 |
| Thread needle | OOD-Distractors | 21 | 9 | 23 | 19 | 23 |
| Thread needle | OOD-Lighting | 0 | 0 | 12 | 20 | 21 |
| Average | ID | 43 | 28 | 39 | 41 | 49 |
| Average | OOD-Distractors | 14 | 7 | 14 | 22 | 28 |
| Average | OOD-Lighting | 0 | 0 | 6 | 23 | 27 |
Quantitative comparison between our method and baseline methods on the AV-ALOHA benchmark.
Cube transfer
Hook package
Peg insertion
Pour test tube
Slot insertion
Thread needle
@article{li2026gazevla,
title={GazeVLA: Learning Human Intention for Robotic Manipulation},
author={Li, Chengyang and Xiong, Kaiyi and Xu, Yuan and Qian, Lei and Wang, Yizhou and Zhu, Wentao},
journal={arXiv preprint arXiv:2604.22615},
year={2026}
}