Bayesian Perspective

The optimization problem of PPO can be rewritten from a Bayesian perspective:

$$
\begin{aligned}
\max_\theta \Big\{ \mathbb{E}_{\pi_\theta}\big[r_\phi(X,Y)\big] - \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_0\big) \Big\}
&= \max_\theta \mathbb{E}_{\pi_\theta}\Big[ r_\phi(X,Y) - \beta \log \pi_\theta(X,Y) + \beta \log \pi_0(X,Y) \Big] \\
&= \max_\theta \mathbb{E}_{\pi_\theta}\Big[ \beta\Big( \tfrac{r_\phi(X,Y)}{\beta} + \log \pi_0(X,Y) \Big) - \beta \log \pi_\theta(X,Y) \Big] \\
&= \min_\theta \mathrm{KL}\big(\pi_\theta \,\|\, \pi_\phi\big),
\end{aligned}
$$

where we have defined the distribution

$$
\pi_\phi(x,y) = \frac{1}{Z}\, \pi_0(x,y) \exp\big(r_\phi(x,y)/\beta\big), \qquad Z = \int \pi_0(x,y) \exp\big(r_\phi(x,y)/\beta\big)\, dx\, dy,
$$

which can be interpreted as a posterior distribution with prior $\pi_0$ and evidence represented by the reward model $r_\phi$. The last equality follows because $\log \pi_\phi = \log \pi_0 + r_\phi/\beta - \log Z$, so the bracketed term equals $\beta \log \pi_\phi(X,Y) - \beta \log \pi_\theta(X,Y) + \beta \log Z$, and the constant $\beta \log Z$ (with $\beta > 0$) does not affect the optimizer. This Bayesian perspective was fully fleshed out in this paper. Notice that the task of performing RLHF can then be cast as variational inference with proposal $\pi_\theta$ and target $\pi_\phi$.
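The identity above is easy to verify numerically. Below is a minimal sketch on a discrete toy problem (the outcome space, prior, reward values, and $\beta$ are all invented for illustration): for an arbitrary policy $\pi_\theta$, the PPO-style objective equals $-\beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_\phi) + \beta \log Z$ exactly, so maximizing the former is the same as minimizing the KL to the tilted posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: K possible (prompt, response) pairs, with a made-up
# prior pi_0 and made-up reward values r_phi -- all hypothetical.
K = 5
prior = rng.dirichlet(np.ones(K))        # pi_0
reward = rng.normal(size=K)              # r_phi evaluated on each outcome
beta = 0.5

# Tilted posterior: pi_phi proportional to pi_0 * exp(r_phi / beta).
unnormalized = prior * np.exp(reward / beta)
Z = unnormalized.sum()
posterior = unnormalized / Z             # pi_phi

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

# An arbitrary policy pi_theta (softmax of random logits).
logits = rng.normal(size=K)
policy = np.exp(logits) / np.exp(logits).sum()

# PPO-style objective: E_{pi_theta}[r_phi] - beta * KL(pi_theta || pi_0).
obj_rl = policy @ reward - beta * kl(policy, prior)

# Variational form: -beta * KL(pi_theta || pi_phi) + beta * log Z.
obj_vi = -beta * kl(policy, posterior) + beta * np.log(Z)

print(obj_rl, obj_vi)                    # equal for any pi_theta
assert np.isclose(obj_rl, obj_vi)
```

Because the two objectives differ only by the constant $\beta \log Z$, any $\theta$ that maximizes one maximizes the other; this is the precise sense in which RLHF amounts to variational inference against the target $\pi_\phi$.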
