Dense Reward for Free in Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) has been credited as the key advance that has allowed Large Language Models (LLMs) to effectively follow instructions and produce useful assistance. Classically, this involves generating completions from the LLM in response to a query before using a separate reward model to assign a score to the full completion. As an auto-regressive process, the LLM has to take many “actions” (selecting individual tokens) and only receives a single, sparse reward at the end of an episode, a setup that is known to be difficult to optimise in traditional reinforcement learning. In this work we leverage the fact that the reward model contains more information than just its scalar output; in particular, it calculates an attention map over tokens as part of the transformer architecture. We use these attention weights to redistribute the reward along the whole completion, effectively densifying the signal and highlighting the most important tokens, all without incurring extra computational cost or requiring any additional modelling. We demonstrate that, theoretically, this approach is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
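
The sketch below illustrates the core idea of attention-based reward redistribution under some simplifying assumptions; it is not the authors' exact implementation. In particular, the function `redistribute_reward`, the choice of the reward model's last attention layer, averaging over heads, and reading the attention weights from the final sequence position are illustrative choices, not details taken from the paper.

```python
import torch


def redistribute_reward(
    scalar_reward: torch.Tensor,    # shape (), reward model's score for the full completion
    attentions: torch.Tensor,       # shape (num_heads, seq_len, seq_len), last attention layer
    completion_mask: torch.Tensor,  # shape (seq_len,), 1.0 for completion tokens, 0.0 for prompt
) -> torch.Tensor:
    """Spread a single scalar reward over completion tokens using attention weights."""
    # Average over heads, then take how strongly the final position (where the
    # reward head pools the sequence) attends to each earlier token.
    avg_attn = attentions.mean(dim=0)           # (seq_len, seq_len)
    weights = avg_attn[-1] * completion_mask    # keep only completion tokens
    weights = weights / weights.sum().clamp_min(1e-8)

    # Per-token rewards that sum back to the original scalar reward, so the
    # episodic return is preserved while the signal becomes dense.
    return scalar_reward * weights


if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, num_heads = 12, 4
    attn = torch.rand(num_heads, seq_len, seq_len).softmax(dim=-1)  # stand-in attention map
    mask = torch.tensor([0.0] * 5 + [1.0] * 7)                      # 5 prompt tokens, 7 completion tokens
    dense = redistribute_reward(torch.tensor(0.8), attn, mask)
    print(dense)        # per-token rewards over the completion
    print(dense.sum())  # ≈ 0.8, the original scalar reward
```

Because the weights are normalised to sum to one, the dense per-token rewards in this sketch sum to the original scalar score, which is one simple way to densify the signal without changing the total return assigned to a completion.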
