Direct Preference Optimization

From the Bayesian perspective, we know that the optimal policy is given by

$$
\pi_\phi(y \mid x) = \frac{1}{Z(x)} \, \pi_0(y \mid x) \exp\!\big(r_\phi(y \mid x)/\beta\big), \qquad Z(x) = \int \pi_0(y \mid x) \exp\!\big(r_\phi(y \mid x)/\beta\big) \, dy.
$$

The key insight behind Direct Preference Optimization is that one can rewrite this to express the reward function in terms of the policy:

$$
r_\phi(y \mid x) = \beta \log\!\left(\frac{\pi_\phi(y \mid x)}{\pi_0(y \mid x)}\right) + \beta \log Z(x).
$$

This is essentially just another parametrization of the reward model. Now, recall that the Bradley-Terry model depends only on the difference in rewards, so when we plug in this parametrization (using our policy $\pi_\theta$ in place of the actual optimum $\pi_\phi$), the normalizing constant $Z(x)$, which does not depend on the completion, cancels out:

$$
p_\theta(y_w \succ y_l \mid x) = \mathrm{sigmoid}\!\left(\beta \log\!\left(\frac{\pi_\theta(y_w \mid x)}{\pi_0(y_w \mid x)}\right) - \beta \log\!\left(\frac{\pi_\theta(y_l \mid x)}{\pi_0(y_l \mid x)}\right)\right).
$$

We can therefore perform maximum likelihood directly on the policy, minimizing the negative log-likelihood of the preference data:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{preferences}}}\!\left[\log \mathrm{sigmoid}\!\left(\beta \log\!\left(\frac{\pi_\theta(y_w \mid x)}{\pi_0(y_w \mid x)}\right) - \beta \log\!\left(\frac{\pi_\theta(y_l \mid x)}{\pi_0(y_l \mid x)}\right)\right)\right].
$$
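As a concrete illustration, here is a minimal sketch of this loss in PyTorch. The framework choice, the name `dpo_loss`, and the value of `beta` are assumptions for the example, not something fixed by the derivation above; the inputs are the summed per-token log-probabilities $\log \pi(y \mid x)$ of the chosen and rejected completions under the trainable policy $\pi_\theta$ and the frozen reference $\pi_0$.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed completion log-probabilities.

    Each argument has shape (batch,) and holds log pi(y | x) for the chosen (w)
    or rejected (l) completion under the policy pi_theta or the reference pi_0.
    """
    # Implicit rewards beta * log(pi_theta / pi_0); Z(x) has already cancelled.
    chosen_rewards = beta * (policy_logp_w - ref_logp_w)
    rejected_rewards = beta * (policy_logp_l - ref_logp_l)

    # Negative log-likelihood under the Bradley-Terry model.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()

# Toy usage with placeholder log-probabilities (in practice these come from
# summing per-token log-probs of each completion under each model).
policy_chosen = torch.tensor([-12.3, -8.1])
policy_rejected = torch.tensor([-14.0, -9.5])
ref_chosen = torch.tensor([-12.8, -8.4])
ref_rejected = torch.tensor([-13.1, -9.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Note that only the policy log-probabilities carry gradients during training; the reference model $\pi_0$ stays frozen and its log-probabilities can be precomputed.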
