Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) was popularized by the InstructGPT paper. Suppose that $\pi_\theta$ is the distribution learned by a language model after Unsupervised Pre-Training (UPT) and Supervised Fine-Tuning (SFT), that is $\pi_\theta(x) = \mathrm{Categorical}(\mathrm{softmax}(\mathrm{LLM}_\theta(x)))$. We would like to align our model to human preferences (or rather, to those of contracted human labelers). RLHF frames this task as a reinforcement learning problem.

Reinforcement Learning Recap

Reinforcement learning works as follows. In an environment with state $s_t$, there is an agent that can take actions $a_t$ in this environment according to a distribution $\pi_\theta(a_t \mid s_t)$, known as the policy. After the agent performs an action, the environment returns to the agent the next state $s_{t+1}$ and a reward $r_t$.


The aim is to learn a policy such that the expected reward is maximized: $\max_\theta \mathbb{E}_{\pi_\theta(a_t \mid s_t)}[R(s_t)]$. As we will see later, in the case of Language Modelling:

  1. Agent: the pre-trained and fine-tuned Language Model
  2. State: the prompt (plus the tokens generated so far)
  3. Policy: the next-token distribution given the prompt
  4. Next state: the prompt extended with a token sampled from the categorical distribution represented by the LM
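The correspondence above can be sketched as a toy decoding loop. The vocabulary and the uniform `lm_next_token_probs` policy below are hypothetical stand-ins for a real tokenizer and language model:

```python
import random

# Hypothetical toy vocabulary standing in for a real tokenizer.
vocab = ["the", "cat", "sat", "<eos>"]

def lm_next_token_probs(state):
    # Hypothetical policy: a uniform next-token distribution given the prompt.
    # A real LM would return softmax(LLM_theta(state)).
    return [1.0 / len(vocab)] * len(vocab)

def rollout(prompt, max_steps=5):
    state = list(prompt)                              # state: the prompt (plus tokens so far)
    for _ in range(max_steps):
        probs = lm_next_token_probs(state)            # policy: next-token distribution
        action = random.choices(vocab, weights=probs)[0]  # action: sample one token
        state = state + [action]                      # next state: prompt + sampled token
        if action == "<eos>":
            break
    return state

completion = rollout(["the"])
```

In RLHF the reward is assigned to the whole completion at the end of the rollout, rather than per step.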

Baseline, Policy and Reward Models

We take our SFT language model $\pi_\theta$ and create three identical copies:

  1. Baseline $\pi_0$: has frozen parameters $\theta$ (unchanged through the entire RLHF process).
  2. Policy $\pi_\theta$: will end up being our aligned model; we change its parameters during RLHF.
  3. Reward model $r_\phi$: a copy of $\pi_\theta$ with one modification: we replace the last un-embedding layer. At the end of $\mathrm{LLM}_\theta$ there is a final linear layer that maps from $\mathbb{R}^{n_{\text{emb}}}$ to $\mathbb{R}^{n_{\text{vocab}}}$, so that we have one logit per token in the dictionary. This final linear layer is removed and replaced with a linear layer that maps $\mathbb{R}^{n_{\text{emb}}}$ to $\mathbb{R}$: we want a scalar reward. We call all of its parameters $\phi$.

However, in the original InstructGPT paper the authors used the same architecture for $r_\phi$ as for $\mathrm{LLM}_\theta$, except much smaller (6B parameters compared to 175B). This allows for faster reward-model training, and they found that using a larger reward model proved unstable.
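The head swap described above can be sketched with toy dimensions and random weights standing in for a real transformer body:

```python
import numpy as np

rng = np.random.default_rng(0)
n_emb, n_vocab = 8, 100  # toy sizes; real models use thousands

# Hypothetical transformer-body output: one hidden vector per position
# of a length-5 sequence.
hidden = rng.normal(size=(5, n_emb))

# Language-model head: n_emb -> n_vocab (one logit per vocabulary token).
W_lm = rng.normal(size=(n_emb, n_vocab))
logits = hidden @ W_lm                    # shape (5, n_vocab)

# Reward-model head: the un-embedding layer is replaced by n_emb -> 1,
# and the scalar at the final position is read out as the reward.
w_r = rng.normal(size=(n_emb, 1))
reward = (hidden @ w_r)[-1, 0]            # a single scalar r_phi(x, y)
```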

Dataset of Human Preferences

For the next step, we focus exclusively on $r_\phi$. To generate a dataset of human preferences $\mathcal{D}_{\text{preferences}}$ we follow these steps:

  1. For each prompt of interest $x_n$, sample $K$ outputs from the policy: $y_{n,1}, \ldots, y_{n,K} \sim \pi_\theta(\cdot \mid x_n)$.
  2. Ask human labelers to rank all the resulting pairs of outputs (notice that there are $\binom{K}{2}$ pairs for each $x_n$). For each pair $(y_{n,k}, y_{n,j})$, $k \neq j$, the preferred output is labelled $y_{n,w}$ and the non-preferred output is labelled $y_{n,l}$, where "w" and "l" stand for "winner" and "loser". Mathematically, we write $y_{n,w} \succ y_{n,l}$.

This gives us the human preferences dataset $\mathcal{D}_{\text{preferences}} = \{(x_n, y_{n,w}, y_{n,l}) : n = 1, \ldots, N_{\text{preferences}}\}$.
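A minimal sketch of how a single ranking expands into $\binom{K}{2}$ binary comparisons (the outputs and the ranking below are made up for illustration):

```python
from itertools import combinations

# Hypothetical labeler ranking of K = 3 sampled outputs for one prompt,
# best output first.
ranking = ["output_a", "output_c", "output_b"]

# Expand the ranking into all C(K, 2) winner/loser pairs: an output
# earlier in the ranking beats every output after it.
pairs = [(y_w, y_l) for y_w, y_l in combinations(ranking, 2)]
```

Each resulting `(y_w, y_l)` pair becomes one entry of the preference dataset (together with its prompt).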

Training the Reward Model

The reward model takes pairs $(x, y)$ of prompt and completion and returns a scalar reward $r_\phi(x, y)$. We would like to train the parameters $\phi$. How do we train $r_\phi$? One option is to use the Bradley-Terry (BT) model, which assumes that the preferences in the dataset were generated by a true latent reward model $r^*(x, y)$ and that the preference distribution is $p(y_w \succ y_l \mid x) = \mathrm{sigmoid}(r^*(x, y_w) - r^*(x, y_l))$. The higher the difference in reward, the higher the preference probability. This fundamentally transforms the problem into binary classification between the classes "preferable" and "non-preferable". An obvious choice is therefore to train $r_\phi$ by binary cross-entropy (or equivalently, maximum likelihood). The loss (negative log-likelihood) is then $\mathcal{L}_{RM}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{preferences}}}[\log(\mathrm{sigmoid}(r_\phi(x, y_w) - r_\phi(x, y_l)))]$. As usual, we compute batched gradients; however, the authors of InstructGPT suggest that, to avoid overfitting during training, one should gather all the comparisons from the same prompt into the same batch and perform a single forward pass of learning for the reward model.
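The loss can be sketched in NumPy on toy reward values (`r_w` and `r_l` below are made-up numbers, not the output of a real reward model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rm_loss(r_w, r_l):
    """Bradley-Terry negative log-likelihood over a batch of comparisons.

    r_w, r_l: rewards r_phi(x, y_w) and r_phi(x, y_l) for the preferred
    and non-preferred outputs of each comparison.
    """
    return -np.mean(np.log(sigmoid(r_w - r_l)))

# Toy rewards for three comparisons.
r_w = np.array([2.0, 1.5, 0.3])
r_l = np.array([0.5, 1.0, 0.1])
loss = rm_loss(r_w, r_l)

# The larger the margin r_w - r_l, the smaller the loss.
assert rm_loss(np.array([3.0]), np.array([0.0])) < rm_loss(np.array([1.0]), np.array([0.0]))
```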

Notice that the preference distribution (and hence the loss) depends only on the difference between the winning reward and the losing reward, and is therefore shift-invariant. After the reward model has been trained, the authors compute the average reward $\hat{r}$ across all human rankings and add a bias to the reward model (basically, they subtract this mean reward) so that the expected value of the reward is zero: $\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{preferences}}}[r_\phi(x, y)] = 0$.
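Both the shift invariance and the mean-centering trick can be checked numerically on toy rewards (assuming NumPy; the numbers are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy rewards for three (x, y_w, y_l) comparisons.
r_w = np.array([2.0, 1.5, 0.3])
r_l = np.array([0.5, 1.0, 0.1])

# Shift invariance: adding any constant c to every reward leaves the
# preference probabilities (and hence the loss) unchanged.
c = 10.0
p = sigmoid(r_w - r_l)
p_shifted = sigmoid((r_w + c) - (r_l + c))
assert np.allclose(p, p_shifted)

# Mean-centering: subtracting the average reward makes the empirical
# mean reward zero without changing any preference probability.
rewards = np.concatenate([r_w, r_l])
centered = rewards - rewards.mean()
assert np.isclose(centered.mean(), 0.0)
```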

Reinforcement Learning

Now that we have a reward model available, we can use reinforcement learning to update the parameters of our policy $\pi_\theta$. We will see two approaches: PPO and DPO.
