Imitation learning (IL) with human demonstrations is a promising approach for robotic manipulation tasks. While a minimal set of demonstrations enables a robot to execute actions, achieving high success rates and generalization comes at a high cost, e.g., continuously adding data or incrementally running human-in-the-loop (HIL) processes that rely on complex hardware/software systems. In this paper, we rethink the state/action space of the data collection pipeline as well as the underlying factors responsible for the prediction of non-robust actions. To this end, we introduce the Hierarchical Data Collection Space (HD-Space) for robotic imitation learning, a simple data collection scheme that provides the model with proactive, high-quality training data. Specifically, we segment a fine-grained manipulation task into multiple key atomic subtasks from a high-level perspective and design atomic state/action spaces for human demonstrations, aiming to generate robust IL data. We conduct empirical evaluations across two simulated and five real-world long-horizon manipulation tasks and demonstrate that training IL policies on HD-Space data yields significantly stronger policy performance. HD-Space allows a small amount of demonstration data to train a more capable policy, particularly for long-horizon manipulation tasks. We hope HD-Space offers insights into optimizing data quality and guiding data scaling.
Conceptual comparison of human demonstration data spaces for imitation learning. For simplicity, we illustrate with a 2D manipulation state/action space. (a) The naive data collection method records the entire manipulation trajectory, capturing only a single distribution of trajectories in the second through fourth stages. As a result, the model is prone to predicting incorrect trajectories in the subtasks of stages 2 to 4. (b) The HIL data collection method first deploys a baseline model for inference and then relies on the hardware system to correct erroneous trajectories. Its theoretical data collection space is the dynamic region where the model tends to produce incorrect trajectories. Although the data collection space becomes smaller, HIL still requires a complex hardware system and extra time waiting for erroneous trajectories to be triggered. (c) Our proposed HD-Space segments the overall data collection space into multiple overlapping atomic spaces and collects manipulation data uniformly within each atomic space. This not only reduces the data collection space but also efficiently covers the states that are prone to trajectory-prediction errors after the first-stage subtask. Moreover, HD-Space can also provide a better baseline model for HIL.
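A minimal sketch of this idea, assuming a toy 2D state space as in the figure; the names (AtomicSpace, sample_initial_state) and the region boundaries are hypothetical illustrations of overlapping atomic collection spaces, not the authors' implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class AtomicSpace:
    """One atomic subtask's data collection region (hypothetical structure)."""
    name: str
    low: np.ndarray   # lower corner of the 2D state region
    high: np.ndarray  # upper corner; consecutive regions overlap

def sample_initial_state(space: AtomicSpace, rng: np.random.Generator) -> np.ndarray:
    """Uniformly sample a demonstration start state inside one atomic space."""
    return rng.uniform(space.low, space.high)

# A long-horizon task is split into overlapping atomic spaces, so demonstrations
# cover regions where a policy trained on a single trajectory distribution
# (the naive scheme) would otherwise drift off-distribution.
spaces = [
    AtomicSpace("stage_1", low=np.array([0.0, 0.0]), high=np.array([0.5, 0.5])),
    AtomicSpace("stage_2", low=np.array([0.4, 0.4]), high=np.array([0.8, 0.8])),  # overlaps stage_1
    AtomicSpace("stage_3", low=np.array([0.7, 0.7]), high=np.array([1.0, 1.0])),  # overlaps stage_2
]

rng = np.random.default_rng(0)
demos_per_space = 10
dataset = {s.name: [sample_initial_state(s, rng) for _ in range(demos_per_space)]
           for s in spaces}
```

Uniform sampling within each atomic space is what distinguishes this scheme from waiting for a baseline policy to fail, as HIL does.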
Practice and comparison of the naive and HD-Space data collection methods on the ``put the teacup into the box'' task. HD-Space first segments the task into multiple atomic subtasks and designs corresponding atomic state/action spaces, enabling a more robust data collection process.
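As a concrete illustration of such a decomposition, one possible segmentation of the teacup task is sketched below; the subtask names and boundary conditions are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical decomposition of "put the teacup into the box" into atomic subtasks.
# Each entry gives (name, start-region condition, end condition); consecutive
# subtasks share an overlapping region so demonstrations cover hand-off states.
ATOMIC_SUBTASKS = [
    ("approach_cup", "gripper anywhere above the table",  "gripper near the cup"),
    ("grasp_cup",    "gripper near the cup",              "cup lifted off the table"),
    ("transport",    "cup lifted off the table",          "cup above the box opening"),
    ("place_in_box", "cup above the box opening",         "cup released inside the box"),
]

for name, start, end in ATOMIC_SUBTASKS:
    print(f"{name}: collect demos with start states spread across '{start}', ending at '{end}'")
```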
Robust space considerations. Blue and purple dotted lines denote the trajectories covered by the naive and HD-Space data collection processes, respectively; orange lines denote predicted actions.
@article{hdspace,
  title={Bootstrapping Imitation Learning for Long-horizon Manipulation via Hierarchical Data Collection Space},
  author={Yang, Jinrong and Chen, Kexun and Li, Zhuolin and Wu, Shengkai and Zhao, Yong and Ren, Liangliang and Luo, Wenqiu and Shang, Chaohui and Zhi, Meiyu and Gao, Linfeng and Sun, Mingshan and Cheng, Hui},
  journal={arXiv preprint arXiv:2505.17389},
  year={2025}
}