MileBench: Benchmarking MLLMs in Long Context

Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g., time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 22 models, revealed that while the closed-source GPT-4o outperforms others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen as the number of images increases. We strongly encourage intensified research efforts toward enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.
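As a rough illustration of the kind of analysis behind the observation that the performance gap widens with more images, the sketch below groups per-sample correctness by the number of images in each sample's context. It is a minimal, self-contained example; the record keys (`num_images`, `correct`) and the toy inputs are hypothetical and not taken from the MileBench codebase or its results.

```python
from collections import defaultdict

def accuracy_by_image_count(records):
    """Compute accuracy bucketed by how many images a sample's context contains.

    `records` is a list of dicts with hypothetical keys:
      - "num_images": number of images in the multimodal context
      - "correct":    whether the model answered that sample correctly
    Returns {num_images: accuracy}, which makes it easy to check whether
    performance degrades as the multimodal context grows longer.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["num_images"]] += 1
        hits[rec["num_images"]] += int(rec["correct"])
    return {n: hits[n] / totals[n] for n in sorted(totals)}

# Toy usage with fabricated records, for illustration only:
demo = [
    {"num_images": 2, "correct": True},
    {"num_images": 2, "correct": True},
    {"num_images": 8, "correct": True},
    {"num_images": 8, "correct": False},
]
print(accuracy_by_image_count(demo))  # {2: 1.0, 8: 0.5}
```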