Multi-Object Hallucination in Vision-Language Models
Large vision-language models (LVLMs) often suffer from object hallucination, producing objects not present in the given images. While current benchmarks for object hallucination primarily concentrate on the presence of a single object class rather than individual entities, this work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. Through comprehensive empirical studies and analysis of potential factors leading to multi-object hallucination, we find that (1) LVLMs suffer more hallucinations when focusing on multiple objects than on a single object; (2) the tested object class distribution affects hallucination behaviors, suggesting that LVLMs may follow shortcuts and spurious correlations; and (3) hallucinatory behaviors are influenced by data-specific factors (salience and frequency) and by models' intrinsic behaviors. We hope this work enables LVLMs to recognize and reason about the multiple objects that often occur in realistic visual scenes, provides insights, and quantifies progress toward mitigating these issues.
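To make the single-object vs. multi-object comparison concrete, here is a minimal, hypothetical sketch of a recognition-based probing loop. This is not the authors' ROPE implementation: `identify_objects` is an invented stand-in for an LVLM queried with visual referring prompts, and the stub simply simulates a model that answers the first referred object correctly but hallucinates on the rest, mimicking the multi-object distraction described above.

```python
def identify_objects(image, object_refs):
    """Hypothetical LVLM call: returns one class label per referred object.

    This stub answers correctly for the first reference and hallucinates
    'dog' for the rest, simulating multi-object distraction.
    """
    return [image[ref] if i == 0 else "dog"
            for i, ref in enumerate(object_refs)]

def probe_accuracy(image, object_refs):
    """Fraction of probed objects whose predicted class matches ground truth."""
    preds = identify_objects(image, object_refs)
    gold = [image[ref] for ref in object_refs]
    return sum(p == g for p, g in zip(preds, gold)) / len(object_refs)

# Toy "image": a mapping from region (bounding-box) id to true object class.
image = {"box0": "cat", "box1": "chair", "box2": "lamp"}

# Probing one object at a time vs. all objects at once.
single = probe_accuracy(image, ["box0"])
multi = probe_accuracy(image, ["box0", "box1", "box2"])
print(single, round(multi, 2))  # 1.0 0.33
```

The gap between `single` and `multi` accuracy is the kind of effect the benchmark is designed to expose; a real evaluation would additionally vary the distribution of object classes within each image, as the protocol description above emphasizes.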