DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences
Recent advancements in foundation models (FMs) have unlocked new prospects in autonomous driving, yet the experimental settings of these studies are preliminary, over-simplified, and fail to capture the complexity of real-world driving scenarios in human environments. It remains under-explored whether FM agents can handle long-horizon navigation tasks with free-form dialogue and deal with unexpected situations caused by environmental dynamics or task changes. To explore the capabilities and boundaries of FMs faced with the challenges above, we introduce DriVLMe, a video-language-model-based agent to facilitate natural and effective communication between humans and autonomous vehicles that perceive the environment and navigate. We develop DriVLMe from both embodied experiences in a simulated environment and social experiences from real human dialogue. While DriVLMe demonstrates competitive performance in both open-loop benchmarks and closed-loop human studies, we reveal several limitations and challenges, including unacceptable inference time, imbalanced training data, limited visual understanding, challenges with multi-turn interactions, simplified language generation from robotic experiences, and difficulties in handling on-the-fly unexpected situations like environmental dynamics and task changes.
Further reading
- Access the paper on arXiv.org