From the Bayesian perspective, we know that the optimal policy is given by
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right),$$
where $\pi_{\mathrm{ref}}$ is the reference policy, $\beta$ the strength of the KL penalty, and $Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)$ the normalizing constant.
The key insight behind Direct Preference Optimization is that one can invert this relationship to express the reward function in terms of the policy:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).$$
This is essentially just another parametrization of the reward model. Now, recall that the Bradley-Terry model depends only on the difference in rewards. Plugging in this parametrization (and using our trainable policy $\pi_\theta$ in place of the true optimum $\pi^*$), the normalizing constant cancels out:
$$p(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right),$$
where $y_w$ and $y_l$ are the preferred and dispreferred responses and $\sigma$ is the logistic sigmoid.
We can therefore perform maximum likelihood directly on the policy, and the loss function becomes
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].$$
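To make the loss concrete, here is a minimal PyTorch sketch. It assumes the summed per-response log-probabilities under the trained policy and the reference policy have already been computed; the function name `dpo_loss` and all variable names are illustrative, not taken from the text above.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a tensor of shape (batch,) holding the summed token
    log-probability of the full response under the indicated model.
    """
    # log(pi_theta / pi_ref) for the preferred and dispreferred responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Bradley-Terry logit: the normalizing constant Z(x) has cancelled out
    logits = beta * (chosen_logratio - rejected_logratio)

    # Negative log-likelihood of the observed preference: -log sigma(logits)
    return -F.logsigmoid(logits).mean()

# Example usage with random stand-in log-probabilities
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```

Note that the reference log-probabilities enter only through the log-ratios, so they can be precomputed once and treated as constants during training.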