WPO: Enhancing RLHF with Weighted Preference Optimization

Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction-following benchmarks including Alpaca Eval 2 and MT-bench. WPO outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2.
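
To make the reweighting idea concrete, here is a minimal sketch of a weighted DPO-style loss. The abstract only states that preference pairs are reweighted by their probability under the current policy; the specific choice below (length-normalized sequence probabilities of both responses, detached from the gradient) is an illustrative assumption, not the paper's exact weighting scheme.

    # Sketch of a WPO-style weighted DPO loss (illustrative assumptions, see above).
    import torch
    import torch.nn.functional as F

    def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                          ref_chosen_logps, ref_rejected_logps,
                          chosen_lens, rejected_lens, beta=0.1):
        """All *_logps are summed token log-probabilities per sequence, shape [batch]."""
        # Standard DPO objective: log-sigmoid of the scaled difference of
        # policy-vs-reference log-ratios for chosen and rejected responses.
        logits = (policy_chosen_logps - ref_chosen_logps) \
               - (policy_rejected_logps - ref_rejected_logps)
        dpo_loss = -F.logsigmoid(beta * logits)

        # Assumed weight: how likely the off-policy pair is under the current
        # policy, via length-normalized sequence probabilities. Detached so the
        # weight acts as an importance factor, not an extra gradient path.
        with torch.no_grad():
            w_chosen = torch.exp(policy_chosen_logps / chosen_lens)
            w_rejected = torch.exp(policy_rejected_logps / rejected_lens)
            weights = w_chosen * w_rejected

        return (weights * dpo_loss).mean()

Pairs that the current policy would be unlikely to generate receive small weights, so the effective training distribution is pulled toward on-policy data without collecting any new samples.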
