Bootstrapping Imitation Learning for Long-horizon Manipulation via Hierarchical Data Collection Space

1CVTE   2Sun Yat-sen University    3Southwest Jiaotong University   4The University of Hong Kong  

Abstract

     Imitation learning (IL) from human demonstrations is a promising approach to robotic manipulation. While a minimal set of demonstrations suffices for a robot to execute actions, achieving high success rates and generalization is costly, e.g., it requires continuously adding data or running incremental human-in-the-loop (HIL) processes with complex hardware/software systems. In this paper, we rethink the state/action space of the data collection pipeline and the underlying factors responsible for non-robust action predictions. To this end, we introduce the Hierarchical Data Collection Space (HD-Space) for robotic imitation learning, a simple data collection scheme that lets the model train on proactive, high-quality data. Specifically, we segment a fine manipulation task into multiple key atomic subtasks from a high-level perspective and design atomic state/action spaces for human demonstrations, aiming to generate robust IL data. We conduct empirical evaluations across two simulated and five real-world long-horizon manipulation tasks and demonstrate that training IL policies on HD-Space data significantly enhances policy performance. HD-Space allows a small amount of demonstration data to train a stronger policy, particularly for long-horizon manipulation tasks. We aim for HD-Space to offer insights into optimizing data quality and guiding data scaling.

HD-Space


     Conceptual comparison of human demonstration data spaces for imitation learning. For simplicity, we use a 2D manipulation state/action space for illustration. (a) The naive data collection method records the entire manipulation trajectory, capturing only a single distribution of trajectories in the second to fourth stages; the model is therefore prone to predicting incorrect trajectories in the subtasks of stages 2 to 4. (b) The human-in-the-loop (HIL) data collection method first deploys a baseline model for inference and then relies on the hardware system to correct incorrect trajectories. Its theoretical data collection space is the dynamic space in which the model tends to produce incorrect trajectories. Although this collection space is smaller, HIL still requires a complex hardware system and extra time waiting for incorrect trajectories to be triggered. (c) Our proposed HD-Space segments the overall data collection space into multiple overlapping atomic spaces and collects manipulation data uniformly within each atomic space. It not only reduces the data collection space but also efficiently covers the states, after the first-stage subtasks, where trajectory predictions are prone to error. Moreover, HD-Space can provide a better baseline model for HIL.
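     As a concrete reading of panel (c), each atomic subtask can be modeled as a bounded start-state region, with adjacent regions overlapping so that a slightly off-nominal end state of one stage still lies inside the region collected for the next. The sketch below illustrates this under simple assumptions (low-dimensional states, box-shaped regions); AtomicSpace and split_with_overlap are hypothetical names, not the paper's implementation.

```python
# Minimal sketch of HD-Space-style collection over box-shaped regions.
# AtomicSpace / split_with_overlap are illustrative, not the paper's code.
from dataclasses import dataclass

import numpy as np


@dataclass
class AtomicSpace:
    """One atomic subtask with its own bounded start-state region."""
    name: str
    low: np.ndarray    # lower bounds of the start-state region
    high: np.ndarray   # upper bounds of the start-state region

    def sample_start_state(self, rng: np.random.Generator) -> np.ndarray:
        # Uniform sampling spreads demonstrations over the whole atomic
        # region instead of one narrow trajectory distribution.
        return rng.uniform(self.low, self.high)


def split_with_overlap(low, high, n_stages, overlap=0.1):
    """Partition axis 0 of the task space into n_stages regions whose
    interior boundaries overlap by a fraction of each stage's extent."""
    spaces, span = [], (high[0] - low[0]) / n_stages
    for k in range(n_stages):
        lo, hi = low.copy(), high.copy()
        # Extend each interior boundary so adjacent atomic spaces share
        # a band of states around the stage transition.
        lo[0] = low[0] + k * span - (overlap * span if k > 0 else 0.0)
        hi[0] = low[0] + (k + 1) * span + (overlap * span if k + 1 < n_stages else 0.0)
        spaces.append(AtomicSpace(f"stage_{k + 1}", lo, hi))
    return spaces


# Example: four overlapping atomic spaces over a unit 2D space, each
# yielding uniformly sampled start states for teleoperated episodes.
rng = np.random.default_rng(0)
for space in split_with_overlap(np.zeros(2), np.ones(2), n_stages=4):
    starts = [space.sample_start_state(rng) for _ in range(10)]
```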


     Practice and comparison of the naive and HD-Space data collection methods on the ``put the teacup into the box'' task. HD-Space first segments the task into multiple atomic subtasks and designs corresponding atomic state/action spaces, yielding a more robust data collection process.
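     Written down as data, such a segmentation could look like the configuration below. The four subtask names follow the demo video (open the box, pick up the teacup, place it inside, close the box); the start-region descriptions are illustrative placeholders, not the actual randomization ranges used in the paper.

```python
# Illustrative decomposition of "put the teacup into the box" into
# atomic subtasks; start-region descriptions are placeholders only.
TEACUP_TASK = [
    {"subtask": "open_box",     "start_region": "box pose randomized on the table"},
    {"subtask": "pick_teacup",  "start_region": "cup pose randomized; box lid open"},
    {"subtask": "place_in_box", "start_region": "cup in gripper; arm pose perturbed"},
    {"subtask": "close_box",    "start_region": "cup inside box; lid still open"},
]
```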


     Robust space considerations. Blue and purple dotted lines represent the trajectories covered by the naive and HD-Space data collection processes, respectively; orange lines represent predicted actions.
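     One way to make the figure's notion of robustness operational: a rollout stays robust as long as every predicted state lands inside some region that demonstrations actually covered, so the policy keeps acting in-distribution. A toy check under that reading (in_covered_space is a hypothetical helper reusing the AtomicSpace sketch above):

```python
# Toy robustness check: does a predicted state lie inside at least one
# demonstrated atomic space? (Reuses the AtomicSpace sketch above.)
import numpy as np


def in_covered_space(state: np.ndarray, spaces) -> bool:
    return any(np.all(s.low <= state) and np.all(state <= s.high)
               for s in spaces)
```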

Results


Video Demo of Putting the Teacup into the Box

     In this video, the robotic arm first opens the box, then picks up the teacup and places it into the box, and finally closes the box. To save time, we present only a few different placements of the teacup and the box.

Video Demo of Grabbing All Randomly Placed Electronic Pens

     In this video, the robotic arm picks up all the electronic pens randomly scattered on the table and places them in the storage box. To save time, we present only a few different pen placements.

Video Demo of Grabbing the Bowls on a Moving Conveyor Belt

     In this video, the robotic arm grabs bowls that appear at any time on a conveyor belt moving at 8 cm/s. To save time, we present only a few different bowl placements.

Video Demo of Grabbing the Bowls on a Fast-Moving Conveyor Belt

     In this video, the robotic arm grabs bowls that appear at any time on a conveyor belt moving at 16 cm/s. To save time, we present only a few different bowl placements.

Video Demo of Grabbing the Spoons inside Bowls on a Moving Conveyor Belt

     In this video, the robotic arm grabs spoons from inside bowls that appear at any time on a conveyor belt moving at 8 cm/s. To save time, we present only a few different spoon and bowl placements.

Video Demo of Grabbing the Spoons inside Bowls on a Fast-Moving Conveyor Belt

     In this video, the robotic arm grabs spoons from inside bowls that appear at any time on a conveyor belt moving at 16 cm/s. To save time, we present only a few different spoon and bowl placements.

BibTeX

@article{hdspace,
      title={Bootstrapping Imitation Learning for Long-horizon Manipulation via Hierarchical Data Collection Space},
      author={Yang, Jinrong and Chen, Kexun and Li, Zhuolin and Wu, Shengkai and Zhao, Yong and Ren, Liangliang and Luo, Wenqiu and Shang, Chaohui and Zhi, Meiyu and Gao, Linfeng and Sun, Mingshan and Cheng, Hui},
      journal={arXiv preprint arXiv:2505.17389},
      year={2025}
    }