mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Direct preference optimization (DPO) has been shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood, an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.
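To make the objective described above concrete, the following is a minimal PyTorch sketch of an mDPO-style loss: a standard DPO term, an image-conditional preference term that compares the chosen response under the original image versus a corrupted one, and a reward anchor that keeps the chosen response's implicit reward positive. The exact decomposition, the function name `mdpo_loss`, the way the corrupted image is obtained, and the hyperparameter `beta` are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F


def mdpo_loss(
    logp_chosen,             # log p_theta(chosen response | image, prompt)
    logp_rejected,           # log p_theta(rejected response | image, prompt)
    ref_logp_chosen,         # same quantities under the frozen reference model
    ref_logp_rejected,
    logp_chosen_bad_img,     # log p_theta(chosen response | corrupted image, prompt)
    ref_logp_chosen_bad_img,
    beta: float = 0.1,
):
    """Sketch of a DPO loss with an image-preference term and a reward anchor."""
    # Implicit rewards, as in DPO: beta * (policy log-prob - reference log-prob).
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    r_chosen_bad_img = beta * (logp_chosen_bad_img - ref_logp_chosen_bad_img)

    # 1) Standard (language-side) DPO term.
    dpo_term = -F.logsigmoid(r_chosen - r_rejected)

    # 2) Image-preference term: the chosen response should be preferred under
    #    the original image over a corrupted one, so the model cannot simply
    #    ignore the visual condition (the "unconditional preference" problem).
    image_term = -F.logsigmoid(r_chosen - r_chosen_bad_img)

    # 3) Reward anchor: encourage the chosen response's reward to stay positive,
    #    counteracting the likelihood decrease that purely relative objectives allow.
    anchor_term = -F.logsigmoid(r_chosen)

    return (dpo_term + image_term + anchor_term).mean()


if __name__ == "__main__":
    # Toy usage with random log-probabilities standing in for model outputs.
    b = 4
    loss = mdpo_loss(
        torch.randn(b), torch.randn(b), torch.randn(b),
        torch.randn(b), torch.randn(b), torch.randn(b),
    )
    print(loss)
```

In practice the three terms could be weighted separately; the unweighted sum here is only meant to show how the image-conditional comparison and the anchor attach to the usual DPO formulation.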
Further reading
- Access Paper in arXiv.org