PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning

Vehicle motion planning is an essential component of autonomous driving technology. Current rule-based vehicle motion planning methods perform satisfactorily in common scenarios but struggle to generalize to long-tailed situations. Meanwhile, learning-based methods have yet to outperform rule-based approaches in large-scale closed-loop scenarios. To address these issues, we propose PlanAgent, the first mid-to-mid planning system based on a Multi-modal Large Language Model (MLLM). The MLLM serves as a cognitive agent that introduces human-like knowledge, interpretability, and common-sense reasoning into closed-loop planning. Specifically, PlanAgent leverages the MLLM through three core modules. First, an Environment Transformation module constructs a Bird’s Eye View (BEV) map and a lane-graph-based textual description from the environment as inputs. Second, a Reasoning Engine module introduces a hierarchical chain-of-thought that proceeds from scene understanding to lateral and longitudinal motion instructions and culminates in planner code generation. Last, a Reflection module simulates and evaluates the generated planner to reduce the MLLM’s uncertainty. PlanAgent inherits the common-sense reasoning and generalization capability of the MLLM, which enables it to handle both common and complex long-tailed scenarios. We evaluate PlanAgent on the large-scale and challenging nuPlan benchmarks. Comprehensive experiments show that PlanAgent outperforms existing state-of-the-art methods on the closed-loop motion planning task. Code will be released soon.
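The interaction among the three modules can be pictured with a small toy loop. The following is only an illustrative sketch under strong assumptions, not the PlanAgent implementation: every name in it (`SceneInput`, `PlannerParams`, `transform_environment`, `stub_mllm`, `simulate`, `plan_with_reflection`) is hypothetical, the MLLM is replaced by a hard-coded stub, planner "code generation" is reduced to choosing two parameters, and the simulator is a one-dimensional constant-speed rollout against a stopped lead vehicle.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class SceneInput:
    bev_map: list        # toy stand-in for a rendered BEV image
    lane_text: str       # lane-graph-based textual description


@dataclass
class PlannerParams:
    target_speed: float  # longitudinal instruction, in m/s
    lane_change: int     # lateral instruction: -1 left, 0 keep, +1 right


def transform_environment(ego_lane: int, lead_gap_m: float, n_lanes: int = 3) -> SceneInput:
    """Environment Transformation (toy): build a one-row BEV grid and a text description."""
    bev = [[1 if lane == ego_lane else 0 for lane in range(n_lanes)]]
    text = f"Ego in lane {ego_lane} of {n_lanes}; stopped vehicle {lead_gap_m:.0f} m ahead."
    return SceneInput(bev_map=bev, lane_text=text)


def stub_mllm(scene: SceneInput, feedback: str) -> PlannerParams:
    """Reasoning Engine placeholder (assumption): a real system would prompt an MLLM here."""
    if "collision" in feedback:
        # Reflection reported a collision, so propose a more cautious plan.
        return PlannerParams(target_speed=5.0, lane_change=0)
    return PlannerParams(target_speed=15.0, lane_change=0)


def simulate(params: PlannerParams, lead_gap_m: float, horizon_s: float = 4.0) -> Tuple[float, str]:
    """Reflection (toy): constant-speed rollout; returns (score, feedback)."""
    travelled = params.target_speed * horizon_s
    if params.lane_change == 0 and travelled >= lead_gap_m:
        return 0.0, "collision with lead vehicle"
    return min(1.0, travelled / lead_gap_m), "ok"


def plan_with_reflection(
    scene: SceneInput,
    lead_gap_m: float,
    mllm: Callable[[SceneInput, str], PlannerParams],
    max_rounds: int = 3,
) -> Tuple[PlannerParams, int]:
    """Propose a plan, simulate it, and re-query with feedback until it passes."""
    feedback = ""
    params = mllm(scene, feedback)
    for round_idx in range(1, max_rounds + 1):
        score, feedback = simulate(params, lead_gap_m)
        if score > 0.0:
            return params, round_idx
        if round_idx < max_rounds:
            params = mllm(scene, feedback)
    # No proposal passed; a real system would need a safe fallback planner here.
    return params, max_rounds


scene = transform_environment(ego_lane=1, lead_gap_m=40.0)
params, rounds = plan_with_reflection(scene, 40.0, stub_mllm)
```

With this stub, the first proposal (15 m/s, keep lane) is rejected by the simulated rollout and the slower second proposal is accepted, which mirrors the role the Reflection module plays in filtering uncertain MLLM outputs before execution.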
