AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
Evaluating large language models (LLMs) in clinical scenarios is crucial to assessing their potential clinical utility. Existing benchmarks rely heavily on static question-answering, which does not accurately depict the complex, sequential nature of clinical decision-making. Here, we introduce AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments that include patient interactions, multimodal data collection under incomplete information, and the usage of various tools, resulting in an in-depth evaluation across nine medical specialties and seven languages. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Overall, we observe that agents sourced from Claude-3.5 outperform other LLM backbones in most settings. Nevertheless, we see stark differences in the LLMs' ability to make use of tools, such as experiential learning, adaptive retrieval, and reflection cycles. Strikingly, Llama-3 shows up to a 92% improvement when making use of these tools.
Further reading
- Access the paper on arXiv.org