DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences

Recent advancements in foundation models (FMs) have unlocked new prospects in autonomous driving, yet the experimental settings of these studies are preliminary, over-simplified, and fail to capture the complexity of real-world driving scenarios in human environments. It remains under-explored whether FM agents can handle long-horizon navigation tasks with free-form dialogue and deal with unexpected situations caused by environmental dynamics or task changes. To explore the capabilities and boundaries of FMs faced with the challenges above, we introduce DriVLMe, a video-language-model-based agent to facilitate natural and effective communication between humans and autonomous vehicles that perceive the environment and navigate. We develop DriVLMe from both embodied experiences in a simulated environment and social experiences from real human dialogue. While DriVLMe demonstrates competitive performance in both open-loop benchmarks and closed-loop human studies, we reveal several limitations and challenges, including unacceptable inference time, imbalanced training data, limited visual understanding, challenges with multi-turn interactions, simplified language generation from robotic experiences, and difficulties in handling on-the-fly unexpected situations like environmental dynamics and task changes.