Benchmarking Vision Language Models for Cultural Understanding
Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has typically been assessed on general scene understanding - recognizing objects, attributes, and actions - rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs' geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question, representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparities in their level of cultural understanding across regions, with strong cultural understanding of North America but significantly lower performance for Africa. We also observe disparities in performance across cultural facets, with clothing, rituals, and traditions scoring higher than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.
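To make the benchmark setup concrete, the sketch below shows one way an evaluation over image-question pairs with 1-5 reference answers could be organized, broken down by country and cultural facet. The dataclass fields, the string-match scoring, and the `predict` interface are illustrative assumptions, not the paper's actual data schema or evaluation protocol.

```python
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


@dataclass
class CulturalVQAExample:
    """One benchmark item: an image, a culture-probing question, and 1-5
    human reference answers, tagged by country and cultural facet
    (e.g. clothing, food, drinks, rituals, traditions). Field names are
    hypothetical, not taken from the released dataset."""
    image_path: str
    question: str
    reference_answers: List[str]  # 1-5 answers per question
    country: str
    facet: str


def string_match_score(prediction: str, references: List[str]) -> float:
    """Credit a prediction if it matches or contains any reference answer.
    A simple stand-in for the paper's actual scoring method."""
    pred = prediction.strip().lower()
    return float(any(pred in ref.strip().lower() or ref.strip().lower() in pred
                     for ref in references))


def evaluate(
    examples: List[CulturalVQAExample],
    predict: Callable[[str, str], str],
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Run a VLM's predict(image_path, question) -> answer over the benchmark
    and aggregate mean accuracy per country and per cultural facet, the two
    axes along which the paper reports disparities."""
    by_country: Dict[str, List[float]] = defaultdict(list)
    by_facet: Dict[str, List[float]] = defaultdict(list)
    for ex in examples:
        score = string_match_score(predict(ex.image_path, ex.question),
                                   ex.reference_answers)
        by_country[ex.country].append(score)
        by_facet[ex.facet].append(score)

    def mean(xs: List[float]) -> float:
        return sum(xs) / len(xs)

    return ({c: mean(s) for c, s in by_country.items()},
            {f: mean(s) for f, s in by_facet.items()})


if __name__ == "__main__":
    # Toy run with a dummy model; a real evaluation would replace `predict`
    # with calls to a VLM such as GPT-4V or Gemini.
    data = [CulturalVQAExample("img_001.jpg",
                               "What dish is traditionally served at this festival?",
                               ["jollof rice"], "Nigeria", "food")]
    per_country, per_facet = evaluate(data, lambda img, q: "jollof rice")
    print(per_country, per_facet)
```

Grouping scores by country and facet, as above, is what surfaces the kind of regional and facet-level gaps the paper reports (e.g. stronger results for North America than Africa, and for clothing or rituals than food and drink).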
Further reading
- Access the paper on arXiv.org