Direct Preference Optimization

From the Bayesian perspective, we know that the optimal policy is given by

$$
\pi_\phi(y \mid x) = \frac{1}{Z(x)} \, \pi_0(y \mid x) \exp\!\big(r_\phi(y \mid x)/\beta\big), \qquad Z(x) = \int \pi_0(y \mid x) \exp\!\big(r_\phi(y \mid x)/\beta\big) \, dy.
$$

The key insight behind Direct Preference Optimization is that one can rewrite this to express the reward function in terms of the policy:

$$
r_\phi(y \mid x) = \beta \log\!\left(\frac{\pi_\phi(y \mid x)}{\pi_0(y \mid x)}\right) + \beta \log Z(x).
$$

This is essentially just another parametrization of the reward model. Now, recall that the Bradley-Terry model depends only on the difference in rewards, so when we plug in this parametrization (using our policy $\pi_\theta$ in place of the actual optimum $\pi_\phi$), the normalizing constant $Z(x)$, which does not depend on the completion, cancels out:

$$
p_\theta(y_w \succ y_l \mid x) = \mathrm{sigmoid}\!\left(\beta \log\!\left(\frac{\pi_\theta(y_w \mid x)}{\pi_0(y_w \mid x)}\right) - \beta \log\!\left(\frac{\pi_\theta(y_l \mid x)}{\pi_0(y_l \mid x)}\right)\right).
$$

We can therefore perform maximum likelihood directly on the policy, minimizing the negative log-likelihood of the preference data:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{preferences}}}\!\left[\log \mathrm{sigmoid}\!\left(\beta \log\!\left(\frac{\pi_\theta(y_w \mid x)}{\pi_0(y_w \mid x)}\right) - \beta \log\!\left(\frac{\pi_\theta(y_l \mid x)}{\pi_0(y_l \mid x)}\right)\right)\right].
$$
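As a concrete illustration, here is a minimal sketch of this loss in PyTorch. The framework choice, the name `dpo_loss`, and the value of `beta` are assumptions for the example, not something fixed by the derivation above; the inputs are the summed per-token log-probabilities $\log \pi(y \mid x)$ of the chosen and rejected completions under the trainable policy $\pi_\theta$ and the frozen reference $\pi_0$.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed completion log-probabilities.

    Each argument has shape (batch,) and holds log pi(y | x) for the chosen (w)
    or rejected (l) completion under the policy pi_theta or the reference pi_0.
    """
    # Implicit rewards beta * log(pi_theta / pi_0); Z(x) has already cancelled.
    chosen_rewards = beta * (policy_logp_w - ref_logp_w)
    rejected_rewards = beta * (policy_logp_l - ref_logp_l)

    # Negative log-likelihood under the Bradley-Terry model.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()

# Toy usage with placeholder log-probabilities (in practice these come from
# summing per-token log-probs of each completion under each model).
policy_chosen = torch.tensor([-12.3, -8.1])
policy_rejected = torch.tensor([-14.0, -9.5])
ref_chosen = torch.tensor([-12.8, -8.4])
ref_rejected = torch.tensor([-13.1, -9.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Note that only the policy log-probabilities carry gradients during training; the reference model $\pi_0$ stays frozen and its log-probabilities can be precomputed.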
