The KL-regularized optimization problem of PPO,
$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[r(x, y)\right] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\right),
$$
can be rewritten from a Bayesian perspective as
$$
\min_{\theta}\; D_{\mathrm{KL}}\!\left(\pi_\theta(y \mid x)\,\|\,\pi^*(y \mid x)\right),
$$
where we have defined the distribution
$$
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{r(x, y)}{\beta}\right), \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{r(x, y)}{\beta}\right),
$$
which can be interpreted as a posterior distribution with prior $\pi_{\mathrm{ref}}(y \mid x)$, a likelihood $\exp\!\left(r(x, y)/\beta\right)$ supplied by the reward model $r$, and evidence $Z(x)$. This Bayesian perspective was fully fleshed out in this paper. Notice that the task of performing RLHF can then be cast as variational inference with proposal $\pi_\theta(y \mid x)$ and target $\pi^*(y \mid x)$.
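The equivalence can be checked numerically on a toy problem. The sketch below (a hypothetical setup, not from the source: four candidate responses to a single prompt, with made-up rewards and a made-up reference policy) computes the Gibbs posterior $\pi^* \propto \pi_{\mathrm{ref}} \exp(r/\beta)$ and verifies that no other distribution achieves a higher value of the KL-regularized objective, whose optimum equals $\beta \log Z$:

```python
import numpy as np

# Hypothetical toy setup: 4 candidate responses for one prompt,
# reward-model scores r, reference (prior) policy pi_ref, KL weight beta.
np.random.seed(0)
r = np.array([1.0, 0.2, -0.5, 0.8])      # reward model scores r(x, y)
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])  # reference policy pi_ref(y | x)
beta = 0.5

# Gibbs posterior pi*(y | x) = pi_ref(y | x) * exp(r(x, y) / beta) / Z(x)
unnorm = pi_ref * np.exp(r / beta)
Z = unnorm.sum()
pi_star = unnorm / Z

def objective(pi):
    """KL-regularized RLHF objective: E_pi[r] - beta * KL(pi || pi_ref)."""
    return np.sum(pi * r) - beta * np.sum(pi * np.log(pi / pi_ref))

# pi* attains the optimum; random perturbed policies never score higher.
best = objective(pi_star)
for _ in range(1000):
    pi = np.random.dirichlet(np.ones(4))
    assert objective(pi) <= best + 1e-9

# At the optimum the objective collapses to beta * log Z (the evidence term).
assert np.isclose(best, beta * np.log(Z))
```

Substituting $\pi^*$ into the objective makes the reward terms cancel, leaving exactly $\beta \log Z(x)$, which is why minimizing the reverse KL to $\pi^*$ and maximizing the original objective are equivalent up to a constant.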