Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLMs for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting down on training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs 10× larger on the rigorous multimodal benchmark MMStar. The project is released at: https://github.com/SparksJoe/Prism.
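
To make the two-stage design concrete, the following is a minimal, hypothetical sketch of how a perception model and a reasoning model can be chained while remaining fully decoupled. The function names, prompts, and stub models here are illustrative assumptions, not the repository's actual API; see https://github.com/SparksJoe/Prism for the real implementation.

```python
# Hypothetical sketch of Prism's decoupled perception/reasoning pipeline.
from typing import Callable

# Stage 1: a VLM turns the image into a textual description (perception).
# Stage 2: an LLM answers the question from that description alone (reasoning).
PerceptionModel = Callable[[str, str], str]   # (image_path, instruction) -> description
ReasoningModel = Callable[[str], str]         # (prompt) -> answer


def prism_answer(
    image_path: str,
    question: str,
    perceive: PerceptionModel,
    reason: ReasoningModel,
) -> str:
    """Answer a visual question by chaining the perception and reasoning stages."""
    # Perception stage: ask the VLM to articulate the visual content as text.
    instruction = (
        "Describe the image in detail, focusing on information "
        f"relevant to the question: {question}"
    )
    description = perceive(image_path, instruction)

    # Reasoning stage: the LLM never sees the image, only the description,
    # which is what allows the two capabilities to be assessed separately.
    prompt = (
        f"Image description:\n{description}\n\n"
        f"Question: {question}\n"
        "Answer based only on the description above."
    )
    return reason(prompt)


if __name__ == "__main__":
    # Stub models so the sketch runs end to end without any model weights.
    def dummy_vlm(image_path: str, instruction: str) -> str:
        return f"A photo at {image_path}: a red bicycle leaning against a wall."

    def dummy_llm(prompt: str) -> str:
        return "red" if "color" in prompt.lower() else "a bicycle"

    print(prism_answer("bike.jpg", "What color is the bicycle?", dummy_vlm, dummy_llm))
```

Because either callable can be swapped independently (e.g. a lightweight 2B captioner for `perceive`, a stronger LLM for `reason`), the same harness supports both the capability comparisons and the cost-efficient configurations described in the abstract.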
