A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

Rongtao Xu1,*, Jian Zhang1,*, Minghao Guo1,*, Youpeng Wen2,*, Haoting Yang3, Min Lin2, Jianzheng Huang3, Zhe Li3, Kaidong Zhang2, Liqiong Wang3, Yuxuan Kuang1, Meng Cao1, Feng Zheng3,†, Xiaodan Liang1,2,†
1 MBZUAI 2 Sun Yat-sen University 3 Southern University of Science and Technology
* Indicates Equal Contribution † Indicates Corresponding Author

The A0 model decomposes robotic manipulation tasks into two levels: (1) high-level spatial affordance understanding and (2) low-level action execution. A0 leverages an Embodiment-Agnostic Affordance Representation to predict object-centric contact points and post-contact trajectories. The architecture includes components designed specifically for affordance learning. A0 is pre-trained on a large-scale dataset of contact points and fine-tuned on annotated trajectories, enabling generalization across diverse robotic platforms. Zoom in for the best view.

Real World Demos



Flower @Realman
Place object @Realman
Wipe the blackboard @Realman
Place objects @Realman
Place on the plate @Franka
Wipe the blackboard @Franka
Press button @Kinova Gen3
Open drawer meanwhile @Kinova Gen3
Pick place @Kinova Gen3
Wipe the blackboard @Franka
Put the blue block on top of the red block @Kinova Gen3
Wipe the white board @Franka
Press button @Franka
Open drawer @Franka
Wipe board @Kinova Gen3

Abstract

Robotic manipulation faces critical challenges in understanding spatial affordances (the "where" and "how" of object interactions), which are essential for complex manipulation tasks such as wiping a board or stacking objects. Existing methods, including modular and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A0 leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A0 is pre-trained on 1 million contact points and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The model's output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman, and Dobot) demonstrate A0's superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
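The two-level split described in the abstract can be illustrated with a small sketch. The abstract does not detail how the action execution module consumes the predicted contact point and post-contact trajectory, so the following is only one plausible reading: it assumes the affordance is expressed in image coordinates and lifted into the camera frame with a depth value per waypoint using the standard pinhole model. The names `Affordance`, `Intrinsics`, `back_project`, and `lift_to_3d` are hypothetical and do not come from the A0 codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[float, float]            # (u, v) image coordinates
Point3D = Tuple[float, float, float]   # (x, y, z) in the camera frame


@dataclass
class Affordance:
    """Hypothetical container for the high-level output: one contact
    point followed by post-contact waypoints, all in image coordinates."""
    contact_point: Pixel
    post_contact: List[Pixel]

    def waypoints(self) -> List[Pixel]:
        # Contact point first, then the post-contact trajectory.
        return [self.contact_point, *self.post_contact]


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics (focal lengths and principal point)."""
    fx: float
    fy: float
    cx: float
    cy: float


def back_project(pixel: Pixel, depth: float, k: Intrinsics) -> Point3D:
    """Standard pinhole back-projection of one pixel at a known depth."""
    u, v = pixel
    return ((u - k.cx) * depth / k.fx, (v - k.cy) * depth / k.fy, depth)


def lift_to_3d(aff: Affordance, depths: List[float], k: Intrinsics) -> List[Point3D]:
    """Assumed low-level step: turn 2D waypoints into 3D targets that a
    robot-specific controller could then follow."""
    pixels = aff.waypoints()
    if len(pixels) != len(depths):
        raise ValueError("need exactly one depth value per waypoint")
    return [back_project(p, d, k) for p, d in zip(pixels, depths)]
```

Because nothing in the `Affordance` container refers to a particular arm or gripper, the same prediction could in principle be handed to different robots, which is the sense in which the representation is embodiment-agnostic.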


Comparison of different manipulation methods.


Overview of A0 model.


Qualitative Results.


Pretraining significantly lowers T-waypoint MAE and improves generalization, underscoring its value for robust manipulation.

| \( \mathrm{MAE}\downarrow \) | HOI4D-22k | Maniskill-5k | DROID-3k |
|---|---|---|---|
| \( A_0\text{-1B} \) | 47.5 | 5.5 | 17.5 |
| \( A_0\text{-1B w/o POA} \) | 47.9 | 6.3 | 18.5 |
| \( A_0\text{-1B w/o SIAL} \) | 61.1 | 10.2 | 19.6 |

Table 1: Ablation studies of network architecture. \( A_0\text{-1B} \) is pretrained on Pixmo-One-Point. 'POA' denotes Position Offset Attention and 'SIAL' denotes Spatial Information Aggregation Layer. We use MAE (lower is better) as the evaluation metric.
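As a rough illustration of the metric, the sketch below computes a mean absolute error over a sequence of 2D waypoints. The page does not specify the units or the exact averaging used for Table 1, so this assumes pixel coordinates and an unweighted mean over both coordinates of every waypoint; `waypoint_mae` is a hypothetical helper, not the authors' evaluation code.

```python
from typing import Sequence, Tuple

Pixel = Tuple[float, float]  # (u, v) image coordinates


def waypoint_mae(pred: Sequence[Pixel], target: Sequence[Pixel]) -> float:
    """Mean absolute error over T waypoints.

    Assumption: the error is averaged over both coordinates of every
    waypoint, i.e. divided by 2 * T.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    total = 0.0
    for (pu, pv), (tu, tv) in zip(pred, target):
        total += abs(pu - tu) + abs(pv - tv)
    return total / (2 * len(pred))


# Two waypoints: the first is off by 2 in u, the second by 6 in v.
example = waypoint_mae([(10, 10), (20, 20)], [(12, 10), (20, 26)])
```

Under this definition, removing a component that raises the error on every dataset, as the SIAL ablation does in Table 1, means the predicted waypoints land farther from the annotated ones on average.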
| Robot | Method | Place Object | Open Drawer | Press Button | Wipe Board | Avg. Success |
|---|---|---|---|---|---|---|
| Kinova | MOKA | 70 | 50 | 30 | 30 | 45.00 |
| Kinova | ReKep | 75 | 55 | 5 | 0 | 33.75 |
| Kinova | \( A_0\text{-1B} \) | 60 | 65 | 40 | 50 | 53.75 |
| Franka | Magma | 25 | 10 | 30 | 0 | 16.25 |
| Franka | Molmo | 60 | 40 | 55 | 20 | 43.75 |
| Franka | \( A_0\text{-1B} \) | 60 | 75 | 70 | 45 | 62.50 |

Table 2: Performance evaluation of different large language model-based policies across four manipulation tasks on two distinct robotic platforms. Our method demonstrates strong platform-agnostic capabilities by achieving consistently high success rates across both Kinova Gen3 and Franka Emika robots.
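The "Avg. Success" column is consistent with an unweighted mean of the four per-task success rates. The short check below recomputes it from the per-task numbers in Table 2; the dictionary layout and the name `average_success` are illustrative only.

```python
from typing import Dict, List, Tuple

# Per-task success rates from Table 2, in the order:
# Place Object, Open Drawer, Press Button, Wipe Board.
results: Dict[Tuple[str, str], List[float]] = {
    ("Kinova", "MOKA"): [70, 50, 30, 30],
    ("Kinova", "ReKep"): [75, 55, 5, 0],
    ("Kinova", "A0-1B"): [60, 65, 40, 50],
    ("Franka", "Magma"): [25, 10, 30, 0],
    ("Franka", "Molmo"): [60, 40, 55, 20],
    ("Franka", "A0-1B"): [60, 75, 70, 45],
}


def average_success(rates: List[float]) -> float:
    """Unweighted mean over tasks."""
    return sum(rates) / len(rates)


averages = {key: average_success(rates) for key, rates in results.items()}

for (robot, method), avg in averages.items():
    print(f"{robot:7s} {method:6s} {avg:6.2f}")
```

On both platforms A0-1B has the highest average, although MOKA and ReKep score higher on Place Object on the Kinova arm, so the average hides some per-task variation.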
| Method | Wipe Board | Steps |
|---|---|---|
| RDT-1B [1] | 10 | 25–50 |
| \( \pi_0 \) [2] | 35 | 25–50 |
| \( \pi_0 \)+FAST [2] | 30 | 25–50 |
| \( A_0\text{-1B} \) | 50 | 4–5 |

Table 3: Comparison with RDT-1B and \( \pi_0 \) on the Wipe Board task using the Kinova platform, highlighting our method's superiority in trajectory-following and task execution efficiency.

BibTeX

@misc{xu2025a0affordanceawarehierarchicalmodel,
        title={A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation}, 
        author={Rongtao Xu and Jian Zhang and Minghao Guo and Youpeng Wen and Haoting Yang and Min Lin and Jianzheng Huang and Zhe Li and Kaidong Zhang and Liqiong Wang and Yuxuan Kuang and Meng Cao and Feng Zheng and Xiaodan Liang},
        year={2025},
        eprint={2504.12636},
        archivePrefix={arXiv},
        primaryClass={cs.RO},
        url={https://arxiv.org/abs/2504.12636}, 
}