LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs for robotics applications have been extensively trained on language and action data, but their ability to generalize in different settings has often been less than desired. To address this, we introduce LLARVA, a model trained with a novel instruction tuning method that leverages structured prompts to unify a range of robotic learning tasks, scenarios, and environments. Additionally, we show that predicting intermediate 2-D representations, which we refer to as “visual traces”, can help further align vision and action spaces for robot learning. We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model, and we evaluate on 12 different tasks in the RLBench simulator as well as a physical Franka Emika Panda 7-DoF robot. Our experiments yield strong performance, demonstrating that LLARVA, using 2-D and language representations, performs well compared to several contemporary baselines, and can generalize across various robot environments and configurations.
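To make the abstract's two central ideas concrete, the sketch below illustrates what a vision-action instruction-tuning example with a 2-D "visual trace" could look like. This is a minimal illustration under assumptions, not the paper's actual data schema or prompt template: the field names, prompt wording, and numeric values are hypothetical.

```python
# Minimal sketch (assumed format, not LLARVA's actual schema) of an
# instruction-tuning example that pairs an image with a structured prompt,
# a 2-D visual trace, and a target action.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VisionActionExample:
    image_path: str                      # current camera observation
    prompt: str                          # structured language instruction
    visual_trace: List[Tuple[int, int]]  # 2-D (x, y) pixel waypoints, e.g. of the end-effector
    action: List[float]                  # target low-level action vector


def build_prompt(robot: str, control_mode: str, task: str) -> str:
    """Compose a structured prompt; the exact template used by LLARVA is not
    given in the abstract, so this wording is purely illustrative."""
    return (
        f"Robot: {robot}. Control mode: {control_mode}. Task: {task}. "
        f"Predict the next action and the 2-D trace of the end-effector."
    )


example = VisionActionExample(
    image_path="frame_000123.png",
    prompt=build_prompt(
        "Franka Emika Panda 7-DoF", "delta end-effector pose", "pick up the red block"
    ),
    visual_trace=[(212, 148), (205, 160), (198, 173)],  # hypothetical pixel coordinates
    action=[0.02, -0.01, 0.00, 0.0, 0.0, 0.0, 1.0],     # hypothetical action values
)
print(example.prompt)
```

The intuition this sketch captures is that the visual trace lives in the same 2-D image space as the observation, which is how predicting it can help align the vision and action spaces described in the abstract.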
