Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF), as a recipe for aligning language models, was popularized by the InstructGPT paper. Suppose that we start from a pre-trained language model that has already been supervised fine-tuned (SFT) on instruction-following demonstrations.
Reinforcement Learning Recap
Reinforcement learning works as follows. In an environment with state $s_t$ at time $t$, an agent picks an action $a_t$ according to a policy $\pi(a_t \mid s_t)$; the environment then returns a reward $r_t$ and transitions to the next state $s_{t+1}$.

The aim is to learn a policy $\pi$ such that the expected cumulative reward $\mathbb{E}_\pi\!\left[\sum_t r_t\right]$ is maximized. In the language-modeling setting, the pieces map as follows:
- Agent: the pre-trained and fine-tuned language model
- State: the prompt together with the tokens generated so far
- Policy: the next-token distribution given the current state
- Next state: sample a token from the categorical distribution output by the LM and append it to the context
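The mapping above can be sketched in a few lines of plain Python. The toy LM below is hypothetical (a fixed logit vector standing in for a real model); the point is only the shape of one RL step: state → next-token distribution → sampled token → next state.

```python
import math
import random

def softmax(logits):
    """Convert raw logits to a categorical next-token distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def policy_step(state, lm, rng):
    """One RL step: the LM (policy) maps the state (token prefix) to a
    next-token distribution; sampling a token yields the next state."""
    probs = softmax(lm(state))
    token = rng.choices(range(len(probs)), weights=probs)[0]
    return state + [token]  # next state: prefix extended by the sampled token

# Toy "LM": strongly prefers token 2 regardless of the prefix.
toy_lm = lambda state: [0.0, 1.0, 5.0]
rng = random.Random(0)
state = [0]                           # state = prompt (token ids)
state = policy_step(state, toy_lm, rng)
```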
Baseline, Policy and Reward Models
We take our SFT language model and make three copies of it:
- Baseline $\pi^{\text{base}}$: has frozen parameters (unchanged through the entire RLHF process).
- Policy $\pi_\theta$: will end up being our aligned model; we update its parameters $\theta$ during RLHF.
- Reward model $r_\phi$: a copy with one modification: we replace the last un-embedding layer. Originally, the model ends with a linear layer that maps from $\mathbb{R}^{d_{\text{model}}}$ to $\mathbb{R}^{|V|}$, so that we have one logit per token in the vocabulary. This final linear layer is removed and replaced with a linear layer that maps to $\mathbb{R}$: we want a scalar reward. We call all of its parameters $\phi$.
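A minimal sketch of this head swap, using random NumPy matrices as stand-ins for real model weights (the dimensions are hypothetical): the un-embedding layer produces one logit per vocabulary token, while the replacement reward head produces a single scalar.

```python
import numpy as np

d_model, vocab_size = 8, 100
rng = np.random.default_rng(0)

# Un-embedding layer of the pre-trained LM: d_model -> vocab_size logits.
W_unembed = rng.normal(size=(d_model, vocab_size))

# Reward-model head: the un-embedding layer is dropped and replaced by a
# linear map d_model -> 1, producing a single scalar reward.
W_reward = rng.normal(size=(d_model, 1))

h = rng.normal(size=(d_model,))   # final hidden state of the transformer
logits = h @ W_unembed            # shape (vocab_size,): one logit per token
reward = (h @ W_reward).item()    # a single scalar reward
```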
However, in the original InstructGPT paper the authors report using a much smaller (6B-parameter) model for the reward model, noting that a 6B reward model was sufficient and that training a 175B reward model could be unstable.
Dataset of Human Preferences
For the next step, we focus exclusively on collecting the data needed to train the reward model $r_\phi$.
- For each prompt of interest $x$, sample $K$ outputs $y_1, \dots, y_K$ from the policy.
- Ask human labelers to rank all the resulting pairs of outputs (notice that there are $\binom{K}{2}$ pairs for each $x$). For each pair $(y_i, y_j)$, $i \neq j$, the preferred output is now labelled $y_w$ and the not-preferred output is labelled $y_l$, where “w” and “l” stand for “winner” and “loser”. Mathematically, we write $y_w \succ y_l$.
This gives us the following human preferences dataset: $\mathcal{D} = \left\{ \left(x^{(n)}, y_w^{(n)}, y_l^{(n)}\right) \right\}_{n=1}^{N}$.
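The steps above can be sketched as follows. The function and the example ranking are hypothetical; the sketch just turns one labeler ranking of $K$ sampled outputs into the $\binom{K}{2}$ (prompt, winner, loser) triples that make up the dataset.

```python
from itertools import combinations

def preference_pairs(prompt, outputs, rank):
    """Turn a labeler's ranking of K sampled outputs into K*(K-1)/2
    (prompt, winner, loser) triples. rank[y] = position (0 = best)."""
    pairs = []
    for y_i, y_j in combinations(outputs, 2):
        y_w, y_l = (y_i, y_j) if rank[y_i] < rank[y_j] else (y_j, y_i)
        pairs.append((prompt, y_w, y_l))
    return pairs

# Hypothetical example: K = 3 sampled completions for one prompt "x".
outputs = ["a", "b", "c"]
rank = {"b": 0, "a": 1, "c": 2}   # labeler prefers b > a > c
dataset = preference_pairs("x", outputs, rank)
```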
Training the Reward Model
The reward model will take pairs $(x, y)$ and output a scalar reward $r_\phi(x, y)$. We model the probability that $y_w$ is preferred over $y_l$ with the Bradley–Terry model, $P(y_w \succ y_l \mid x) = \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$, and we train $\phi$ by minimizing the negative log-likelihood over the preference dataset: $\mathcal{L}(\phi) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$.
Notice that the preference distribution (and so the loss) depends only on the difference between the winning reward and the losing reward, and it is therefore shift-invariant: adding the same constant to every reward changes nothing. After the reward model has been trained, the authors compute the average reward on the dataset and subtract it as a bias, so that rewards are zero-mean before reinforcement learning begins.
Reinforcement Learning
Now that we have a reward model available, we can use reinforcement learning (PPO, in InstructGPT) to change the parameters $\theta$ of our policy $\pi_\theta$. The objective is to maximize $\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi^{\text{base}}(y \mid x)} \right]$, where the second term is a KL penalty that keeps the policy close to the frozen baseline, preventing it from drifting into regions where the reward model is unreliable.
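A minimal numeric sketch of the per-sample quantity being maximized, assuming the KL-regularized reward used in InstructGPT-style RLHF (the function name and numbers are hypothetical, and PPO clipping and the pretraining-mix term are omitted):

```python
def rlhf_objective(reward, logp_policy, logp_base, beta=0.1):
    """Per-sample KL-regularized reward:
    r_phi(x, y) - beta * log( pi_theta(y|x) / pi_base(y|x) ).
    The penalty grows when the policy assigns its own sample much
    higher probability than the frozen baseline does."""
    kl_term = logp_policy - logp_base
    return reward - beta * kl_term

# Hypothetical numbers: the policy likes its sample more than the
# baseline (logp -3.0 vs -4.0), so it pays a small KL penalty.
obj = rlhf_objective(reward=1.5, logp_policy=-3.0, logp_base=-4.0, beta=0.1)
# 1.5 - 0.1 * 1.0 = 1.4
```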