LAPA: Latent Action Pretraining from Videos

Seonghyeon Ye*¹, Joel Jang*²,
Byeongguk Jeon¹, Sejune Joo¹, Jianwei Yang³, Baolin Peng³, Ajay Mandlekar⁴,
Reuben Tan³, Yu-Wei Chao⁴, Yuchen Lin⁵, Lars Liden³,
Kimin Lee¹†, Jianfeng Gao³†, Luke Zettlemoyer²†, Dieter Fox²,⁴†, Minjoon Seo¹†

¹KAIST ²University of Washington
³Microsoft Research ⁴NVIDIA ⁵Allen Institute for AI

* Equal contribution, † Equal advising

Abstract

We introduce Latent Action Pretraining for general Action models (LAPA), the first unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels, typically collected by human teleoperators, during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model with a VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation models.

Overview of LAPA

LAPA is divided into two stages: Latent Action Quantization and Latent Pretraining. First, we use a VQ-VAE-based objective to capture the discretized latent delta information between consecutive frames in a video. Next, a pretrained VLM is trained to predict the latent action assigned by the encoder of the Latent Action Quantization model, given the current image and the language instruction. After Latent Pretraining, we finetune the VLA model on a small number of ground-truth action-labeled trajectories to map the latent space to the actual action space.
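
The quantization step of the first stage can be illustrated with a minimal sketch. It is not the authors' implementation: the learned encoder is replaced by a plain difference of two hypothetical frame feature vectors, and the codebook values and the names `quantize` and `latent_action` are assumptions introduced here. The sketch shows only the nearest-codebook-entry lookup that a VQ-VAE-style objective relies on.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(delta: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to `delta`."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(delta, codebook[i]),
    )


def latent_action(frame_t: Vector, frame_next: Vector, codebook: List[Vector]) -> int:
    """Toy stand-in for the encoder: quantize the feature difference of two frames."""
    delta = [b - a for a, b in zip(frame_t, frame_next)]
    return quantize(delta, codebook)


# Hypothetical 2-D codebook (assumption, for illustration only):
# entry 0 ~ "no motion", entry 1 ~ "move right", entry 2 ~ "move up".
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    # The feature delta is roughly (0.9, -0.1), which is closest to entry 1.
    print(latent_action([0.2, 0.5], [1.1, 0.4], CODEBOOK))
```

In a real model the codebook and the encoder are learned jointly; here both are fixed so that the lookup itself is easy to follow.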

Experiments

Real-Robot Experiments

Cross-Embodiment

For the cross-embodiment setting, we pretrain the VLAs on the WidowX embodiment (Bridgev2) and fine-tune them on data collected with the Franka robot. Comparing LAPA (Bridge), which does not use action-labeled trajectories during pretraining, with models that do, we observe an interesting finding: LAPA, pretrained without ground-truth action labels, outperforms the VLAs that use action-labeled pretraining data (ActionVLA (Bridge) and OpenVLA (Bridge)) on average success rate across the 3 tasks. We hypothesize that VLA models pretrained on ground-truth action labels have overfitted to the WidowX action space of the Bridgev2 dataset, hampering cross-embodiment adaptability to action distribution shifts during fine-tuning.

Multi-Embodiment

For the multi-embodiment setting, we pretrain the VLAs on Open-X Embodiment, which consists of robot trajectories from multiple embodiments. When comparing LAPA (Open-X) with OpenVLA (Open-X), we see that LAPA significantly outperforms OpenVLA on 2 out of 3 tasks. This highlights LAPA's effectiveness in a multi-embodiment setting by showcasing its ability to leverage a shared latent action space during pretraining, akin to how language and image representations are learned in an unsupervised manner. In contrast, contemporary action pretraining methods may suffer from reduced positive transfer between datasets due to the variability in action representation spaces across different embodiments and datasets.

Learning from Human Manipulation Videos

To extend LAPA to human manipulation videos, where action labels are not present, we pretrain LAPA on the Something-Something V2 dataset (220K videos) and fine-tune it on the robot embodiment. The embodiment gap in this case is extreme (human to robot). Surprisingly, LAPA trained on human videos outperforms OpenVLA (Bridge) on average. Despite the larger embodiment gap for LAPA (human to robot vs. robot to robot), it learns a better prior for robot manipulation. This result highlights the potential of raw human manipulation videos from the web compared to expensive robot manipulation data, which requires time-intensive teleoperation to collect. We expect that applying our approach to large-scale internet videos (e.g., YouTube videos) could unlock large-scale pretraining of a generalist action foundation model, similar to foundation models in Natural Language Processing or Computer Vision.

Analyzing Latent Actions

For interpretation, we condition the decoder of the latent action quantization model on the current image observation and each latent action, and present the reconstructed images.

We observe that each latent action can be mapped into a semantic action of the robot arm. For example, latent action 0 corresponds to moving a bit left and forward.

For human videos where the camera view changes in a single video, we observe that each latent action can be mapped into a semantic action including camera movements. For example, latent action [3,5,2,7] corresponds to moving the camera a bit down while [4,2,0,0] corresponds to moving the camera slightly up.
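
The bracketed examples suggest that a latent action can be a short sequence of discrete codes. As a sketch only, assuming a sequence length of 4 and a codebook size of 8 (values that are consistent with the examples above but are not stated in this section), such a sequence can be packed into a single integer id and unpacked again. `CODEBOOK_SIZE`, `SEQ_LEN`, `to_flat_id`, and `from_flat_id` are hypothetical names introduced for illustration.

```python
from typing import List

CODEBOOK_SIZE = 8  # assumption: codes range over 0..7
SEQ_LEN = 4        # assumption: four codes per latent action


def to_flat_id(codes: List[int]) -> int:
    """Pack a code sequence into one integer (base-CODEBOOK_SIZE number)."""
    if len(codes) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} codes, got {len(codes)}")
    flat = 0
    for code in codes:
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"code {code} outside 0..{CODEBOOK_SIZE - 1}")
        flat = flat * CODEBOOK_SIZE + code
    return flat


def from_flat_id(flat: int) -> List[int]:
    """Inverse of to_flat_id: recover the code sequence from the integer id."""
    if not 0 <= flat < CODEBOOK_SIZE ** SEQ_LEN:
        raise ValueError(f"id {flat} outside the latent action space")
    codes = []
    for _ in range(SEQ_LEN):
        codes.append(flat % CODEBOOK_SIZE)
        flat //= CODEBOOK_SIZE
    return codes[::-1]


if __name__ == "__main__":
    print(to_flat_id([3, 5, 2, 7]))  # 1879
    print(from_flat_id(2176))        # [4, 2, 0, 0]
```

Under these assumed sizes the latent action space would contain 8^4 = 4096 distinct actions, each of which could be decoded and inspected as described above.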

For the multi-embodiment setting, we observe that each latent action can be mapped to a similar semantic action even though the embodiments are different. This supports our previous claim that latent actions are learned in a shared representation space, regardless of the embodiment or dataset, facilitating stronger positive transfer across diverse datasets.

Generated Rollout from LAPA

Ground Truth Trajectory

We analyze the coarse-grained planning capability of LAPA through a closed-loop rollout, using a LAPA model that has only undergone pretraining. When conditioned on the current observation and the instruction to "take the broccoli out of the pot", LAPA generates a robot trajectory that successfully reaches for the broccoli and moves down to grab it; as the arm moves away from the pot, the broccoli disappears. This shows the potential of LAPA as a general-purpose robotic world model that predicts not only actions but also the outcomes of those actions.
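
One plausible reading of such a rollout is a loop in which the pretrained model predicts a latent action, the quantization model's decoder renders the next frame, and that frame becomes the next observation. The sketch below follows this reading under stated assumptions: `policy` and `decoder` are hypothetical callables, and the toy versions at the bottom stand in for the real models, with a "frame" reduced to a gripper position.

```python
from typing import Callable, List, Tuple

Frame = Tuple[int, int]  # toy "image": (x, y) position of the gripper
Policy = Callable[[Frame, str], int]   # (observation, instruction) -> latent action
Decoder = Callable[[Frame, int], Frame]  # (observation, latent action) -> next frame


def rollout(
    frame: Frame,
    instruction: str,
    policy: Policy,
    decoder: Decoder,
    steps: int,
) -> List[Frame]:
    """Generate frames by feeding each predicted frame back as the next observation."""
    frames = [frame]
    for _ in range(steps):
        action = policy(frames[-1], instruction)
        frames.append(decoder(frames[-1], action))
    return frames


def toy_policy(frame: Frame, instruction: str) -> int:
    """Hypothetical policy: action 1 (move right) until x reaches 3, else 0 (stay)."""
    return 1 if frame[0] < 3 else 0


def toy_decoder(frame: Frame, action: int) -> Frame:
    """Hypothetical decoder: apply the latent action as a shift along x."""
    return (frame[0] + action, frame[1])


if __name__ == "__main__":
    for f in rollout((0, 0), "take the broccoli out of the pot", toy_policy, toy_decoder, 5):
        print(f)
```

Because every step consumes the model's own prediction, errors can accumulate over long horizons, which is one reason to treat such rollouts as coarse-grained plans.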

Rollout Videos

Seen Objects, Unseen Combinations

Knock mustard down: Scratch ❌ | OpenVLA ❌ | LAPA ✅

Pick orange block, put in sink: Scratch ❌ | OpenVLA ✅ | LAPA ✅

Unseen Objects

Knock pringles down: Scratch ❌ | OpenVLA ⚠️ | LAPA ⚠️

Cover donut with towel: Scratch ⚠️ | OpenVLA ❌ | LAPA ✅

Pick paprika, put in sink: Scratch ❌ | OpenVLA ✅ | LAPA ✅

Unseen Instructions

Knock an object for cleaning: Scratch ❌ | OpenVLA ❌ | LAPA ✅

Cover a yellow object with towel: Scratch ❌ | OpenVLA ⚠️ | LAPA ⚠️

Bi-Manual

Unseen Object Combinations

Put gray plate on container and peach on plate: OpenVLA ❌ | LAPA ⚠️

Unseen Objects

Put white plate on container and soup on plate: OpenVLA ❌ | LAPA ⚠️

Unseen Instructions

Put darker plate on container and round object: OpenVLA ❌ | LAPA ⚠️

Both OpenVLA and LAPA struggle on the bi-manual robot setup, indicating much room for improvement.

BibTeX

@misc{ye2024latentactionpretrainingvideos,
        title={Latent Action Pretraining from Videos}, 
        author={Seonghyeon Ye and Joel Jang and Byeongguk Jeon and Sejune Joo and Jianwei Yang and Baolin Peng and Ajay Mandlekar and Reuben Tan and Yu-Wei Chao and Bill Yuchen Lin and Lars Liden and Kimin Lee and Jianfeng Gao and Luke Zettlemoyer and Dieter Fox and Minjoon Seo},
        year={2024},
        eprint={2410.11758},
        archivePrefix={arXiv},
        primaryClass={cs.RO},
        url={https://arxiv.org/abs/2410.11758}, 
  }