Bayesian Perspective
The optimization problem of PPO can be rewritten from a Bayesian perspective. Writing $X$ for the full prompt-response sample to lighten notation,
$$
\begin{align}
\max_{\theta} \left\{\mathbb{E}_{\pi_\theta}\left[r_\phi(X)\right] - \beta\,\text{KL}(\pi_\theta \parallel \pi_0)\right\} &= \max_\theta \mathbb{E}_{\pi_\theta}\left[r_\phi(X) - \beta\log \pi_\theta(X) + \beta\log\pi_0(X)\right] \\
&= \max_{\theta}\,\beta\,\mathbb{E}_{\pi_\theta}\left[\left(\frac{r_{\phi}(X)}{\beta} + \log\pi_0(X)\right) - \log\pi_{\theta}(X)\right] \\
&= \max_{\theta}\,\beta\left\{\log Z - \text{KL}\left(\pi_\theta \parallel \pi^{*}_{\phi}\right)\right\}
\end{align}
$$
where we have defined the distribution
$$
\pi^{*}_{\phi}(x) = \frac{1}{Z} \pi_0(x)\exp\left(r_{\phi}(x) / \beta\right), \qquad Z = \int \pi_0(x')\exp\left(r_{\phi}(x') / \beta\right) dx',
$$
which can be interpreted as a posterior distribution with prior $\pi_0$ and evidence represented by the reward model $r_{\phi}$. Since $\beta\log Z$ does not depend on $\theta$, maximizing the PPO objective is equivalent to solving $\min_{\theta} \text{KL}\left(\pi_\theta \parallel \pi^{*}_{\phi}\right)$. This Bayesian perspective was fully fleshed out in this paper. Notice that the task of performing RLHF can then be cast as variational inference with proposal $\pi_\theta$ and target $\pi^{*}_\phi$.
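As a quick sanity check of the identity above, here is a minimal numerical sketch on a toy discrete distribution; the distributions, reward values, and the choice of $\beta$ are all arbitrary and used purely for illustration.

```python
# Verify numerically that
#   E_{pi_theta}[r_phi(X)] - beta * KL(pi_theta || pi_0)
#     = beta * (log Z - KL(pi_theta || pi_star))
# on a small, made-up discrete example.
import numpy as np

rng = np.random.default_rng(0)
n, beta = 5, 0.5                         # toy support size and KL coefficient (arbitrary)

pi_0 = rng.dirichlet(np.ones(n))         # reference policy (prior)
pi_theta = rng.dirichlet(np.ones(n))     # current policy (proposal)
r = rng.normal(size=n)                   # reward model outputs r_phi(x)

# Posterior pi_star(x) proportional to pi_0(x) * exp(r_phi(x) / beta)
unnorm = pi_0 * np.exp(r / beta)
Z = unnorm.sum()
pi_star = unnorm / Z

kl = lambda p, q: np.sum(p * np.log(p / q))

ppo_objective = np.sum(pi_theta * r) - beta * kl(pi_theta, pi_0)
bayes_form = beta * (np.log(Z) - kl(pi_theta, pi_star))

print(np.isclose(ppo_objective, bayes_form))  # True
```

The two sides agree exactly because the $\beta\log Z$ term absorbs the normalization constant of $\pi^{*}_{\phi}$, which is why dropping it leaves the optimizer unchanged.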