The KL-regularized optimization problem of PPO,
$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[r(x, y)\right] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\right),
$$
can be rewritten from a Bayesian perspective as
$$
\min_{\theta}\; D_{\mathrm{KL}}\!\left(\pi_\theta(y \mid x)\,\|\,\pi^*(y \mid x)\right),
$$
where we have defined the distribution
$$
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{r(x, y)}{\beta}\right), \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{r(x, y)}{\beta}\right),
$$
which can be interpreted as a posterior distribution with prior $\pi_{\mathrm{ref}}(y \mid x)$, a likelihood $\exp\!\left(r(x, y)/\beta\right)$ supplied by the reward model $r$, and evidence $Z(x)$. This Bayesian perspective was fully fleshed out in this paper. Notice that the task of performing RLHF can then be cast as variational inference with proposal $\pi_\theta(y \mid x)$ and target $\pi^*(y \mid x)$.
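The equivalence can be checked numerically on a toy problem. The sketch below (a hypothetical setup, not from the source: four candidate responses to a single prompt, with made-up rewards and a made-up reference policy) computes the Gibbs posterior $\pi^* \propto \pi_{\mathrm{ref}} \exp(r/\beta)$ and verifies that no other distribution achieves a higher value of the KL-regularized objective, whose optimum equals $\beta \log Z$:

```python
import numpy as np

# Hypothetical toy setup: 4 candidate responses for one prompt,
# reward-model scores r, reference (prior) policy pi_ref, KL weight beta.
np.random.seed(0)
r = np.array([1.0, 0.2, -0.5, 0.8])      # reward model scores r(x, y)
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])  # reference policy pi_ref(y | x)
beta = 0.5

# Gibbs posterior pi*(y | x) = pi_ref(y | x) * exp(r(x, y) / beta) / Z(x)
unnorm = pi_ref * np.exp(r / beta)
Z = unnorm.sum()
pi_star = unnorm / Z

def objective(pi):
    """KL-regularized RLHF objective: E_pi[r] - beta * KL(pi || pi_ref)."""
    return np.sum(pi * r) - beta * np.sum(pi * np.log(pi / pi_ref))

# pi* attains the optimum; random perturbed policies never score higher.
best = objective(pi_star)
for _ in range(1000):
    pi = np.random.dirichlet(np.ones(4))
    assert objective(pi) <= best + 1e-9

# At the optimum the objective collapses to beta * log Z (the evidence term).
assert np.isclose(best, beta * np.log(Z))
```

Substituting $\pi^*$ into the objective makes the reward terms cancel, leaving exactly $\beta \log Z(x)$, which is why minimizing the reverse KL to $\pi^*$ and maximizing the original objective are equivalent up to a constant.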