Summary of the SNEP algorithm
A minimalist and practical introduction.
At a Glance
SNEP performs coordinate-wise maximization of \[ \begin{alignat}{2} &\!\max_{\mu_0, [\mu_i, \nu_i, \theta_i']_{i=1}^n} &\qquad& \left\{\langle\mu_0, \theta_0\rangle + \sum_{i=1}^{n} \langle \nu_i, 1\rangle - A^*(\mu_0) -\sum_{i=1}^{n} \beta_i[A^*(\mu_i, \nu_i)+ A(\theta_i') - \mu_i^\top \theta_i']\right\}\\ &\text{s.t.} & & (\mu_i, \nu_i)\in \mathcal{M}(s(x), \ell_i(x)) &\forall i=1, \ldots, n\\ & & & \mu_i = \mu_0 & \forall i=1, \ldots, n\\ & & & \mu_0\in \mathcal{M}(s(x)) \\ & & & \theta_i'\in\Theta \end{alignat} \]
Context
In Hasenclever et al. (2017) the prior belongs to a base exponential family \[ p_0(x) = \exp(\theta_0^\top s(x) - A(\theta_0)) \] Each node \(i=1, \ldots, n\) holds a shard \(S_i\) of the data, from which the local log-likelihood \(\ell_i(x)\) is formed \[ \ell_i(x) =\sum_{c\in S_i} \log p(y_c \mid x) \] The posterior can then be written as a member of an extended exponential family \[ \widetilde{p}(x) \propto p_0(x) \exp\left(\sum_{i=1}^n \ell_i(x)\right) \propto \exp\left(\widetilde{\theta}^\top \widetilde{s}(x) - \widetilde{A}(\widetilde{\theta})\right) \] where \[ \begin{align}\widetilde{s}(x) &= [s(x), \ell_1(x), \ldots, \ell_n(x)] &&\Longleftrightarrow \qquad (\phi(x), \Phi^1(x), \ldots, \Phi^{n}(x)) \\\widetilde{\theta} &= [\theta_0, \boldsymbol{1}_n] &&\Longleftrightarrow \qquad (\theta, \widetilde{\theta}^1, \ldots, \widetilde{\theta}^n)\end{align} \]
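As a concrete toy illustration of this factorization, here is a minimal sketch, assuming a scalar Gaussian model with known unit observation variance, base sufficient statistics \(s(x) = (x, x^2)\), and data split across \(n=4\) nodes. The model, the data, and all function names are hypothetical and only meant to show how \(\log p_0(x) + \sum_i \ell_i(x)\) is assembled; they are not code from the paper.

```python
import numpy as np

# Toy setting (hypothetical): x is a scalar mean with prior N(m0, v0),
# observations y_c ~ N(x, 1), split across n = 4 shards (one per node).
rng = np.random.default_rng(0)
m0, v0 = 0.0, 10.0
theta0 = np.array([m0 / v0, -0.5 / v0])  # natural parameters for s(x) = (x, x^2)

shards = [rng.normal(1.5, 1.0, size=20) for _ in range(4)]

def s(x):
    """Base sufficient statistics s(x) = (x, x^2)."""
    return np.array([x, x**2])

def ell(x, y_shard):
    """Local log-likelihood ell_i(x) = sum_{c in S_i} log N(y_c | x, 1)."""
    return np.sum(-0.5 * (y_shard - x) ** 2 - 0.5 * np.log(2 * np.pi))

def log_posterior_unnorm(x):
    """log p0(x) + sum_i ell_i(x), i.e. <theta_tilde, s_tilde(x)> up to a constant."""
    return theta0 @ s(x) + sum(ell(x, y) for y in shards)
```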
Aim
We want to learn the posterior distribution \(\widetilde{p}(x)\). We do this by finding its log-partition function and its mean parameters, which together completely specify the posterior. Both can be obtained at once by solving the variational problem given by the convex (Fenchel) dual of the log-partition function. This problem is intractable in general, so we solve an approximate problem instead.
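For reference, the exact but intractable problem behind this statement is the standard conjugate-duality identity for exponential families (background material, not a formula quoted from the paper): \[ \widetilde{A}(\widetilde{\theta}) = \sup_{\widetilde{\mu}\in\mathcal{M}(\widetilde{s}(x))}\left\{\langle \widetilde{\theta}, \widetilde{\mu}\rangle - \widetilde{A}^*(\widetilde{\mu})\right\} \] with the supremum attained at the true mean parameters \(\widetilde{\mu} = \mathbb{E}_{\widetilde{p}}[\widetilde{s}(X)]\). The approximate problem below replaces the intractable conjugate \(\widetilde{A}^*\) with a combination of tractable local terms.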
Variational Problem
The approximate problem we solve to obtain the (approximate) mean parameters and the (approximate) log-partition function is \[ \begin{alignat}{2} &\!\max_{\mu_0, [\mu_i, \nu_i]_{i=1}^n} &\qquad& \left\{\langle\mu_0, \theta_0\rangle + \sum_{i=1}^{n} \langle \nu_i, 1\rangle - A^*(\mu_0) -\sum_{i=1}^{n} \beta_i[A^*(\mu_i, \nu_i) - A^*(\mu_i)]\right\}\\ &\text{s.t.} & & (\mu_i, \nu_i)\in \mathcal{M}(s(x), \ell_i(x)) &\forall i=1, \ldots, n\\ & & & \mu_i = \mu_0 & \forall i=1, \ldots, n\\ & & & \mu_0\in \mathcal{M}(s(x)) \end{alignat} \] Solving this yields an exponential-family approximation to the posterior, \[ \widetilde{p}(x)\propto \exp\left(\left\langle \theta_0 + \sum_{i=1}^n \lambda_i, s(x)\right\rangle\right) \] Since the exponential family is closed under multiplication (natural parameters add), we can interpret this approximation as follows: each \(\lambda_i\) is the natural parameter of an exponential-family factor approximating the intractable likelihood term \(\exp(\ell_i(x))\) at site \(i\).
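To make the "natural parameters add" interpretation concrete, here is a minimal sketch continuing the toy scalar-Gaussian setting above. The site values in `lambdas` are made up, and `natural_to_moments` is a hypothetical helper for reading off the resulting Gaussian; none of this is code from the paper.

```python
import numpy as np

def natural_to_moments(theta):
    """Convert Gaussian natural parameters (for s(x) = (x, x^2)) to (mean, variance)."""
    eta1, eta2 = theta
    var = -0.5 / eta2
    return eta1 * var, var

# Global approximation q(x) ∝ exp(<theta0 + sum_i lambda_i, s(x)>):
# the prior's natural parameters plus one site term lambda_i per node.
theta0 = np.array([0.0, -0.05])                       # e.g. an N(0, 10) prior
lambdas = [np.array([1.0, -0.5]) for _ in range(4)]   # made-up site parameters

theta_q = theta0 + np.sum(lambdas, axis=0)
print(natural_to_moments(theta_q))  # mean and variance of the approximate posterior
```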
Updates
The tilted distribution at each node is given by \[ p_i(x) \propto \exp\left(\left\langle \theta_0 + \sum_{j=1}^n \lambda_j - \frac{\lambda_i}{\beta_i}, s(x) \right\rangle + \left\langle \frac{1}{\beta_i}, \ell_i(x)\right\rangle\right) \] At each iteration we compute the mean parameters of the tilted distribution \[ \left(\mathbb{E}_{p_i}[s(X)], \mathbb{E}_{p_i}[\ell_i(X)]\right) = (\mu_i, \nu_i) \] and then perform moment matching by adjusting \(\lambda_i\): \[ \lambda_i^{\text{new}} \longleftarrow \nabla_{\mu_i} A^*(\mu_i) - \theta_0 - \sum_{j\neq i} \lambda_j \]
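The form of this update can be read off directly: moment matching requires the natural parameter of the updated global approximation, \(\theta_0 + \sum_j \lambda_j\), to map to the tilted mean parameter \(\mu_i\), i.e. \[ \theta_0 + \lambda_i^{\text{new}} + \sum_{j\neq i} \lambda_j = \nabla_{\mu_i} A^*(\mu_i) \qquad\Longleftrightarrow\qquad \lambda_i^{\text{new}} = \nabla_{\mu_i} A^*(\mu_i) - \theta_0 - \sum_{j\neq i} \lambda_j \]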
In practice, to avoid oscillations, we add some damping \[ \lambda_i^{\text{new}} \longleftarrow \alpha \lambda_i + (1-\alpha)\left(\nabla_{\mu_i} A^*(\mu_i) - \theta_0 - \sum_{j\neq i} \lambda_j\right) \qquad\qquad \text{where } \alpha\in[0, 1] \] Notice that here \(\nabla_{\mu_i} A^*(\mu_i)\) is the natural parameter corresponding to the mean parameter \(\mu_i\).
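A minimal sketch of one damped site update, again in the toy Gaussian setting: `mean_to_natural` plays the role of \(\nabla_{\mu_i} A^*\) for base statistics \(s(x) = (x, x^2)\), `mu_i` stands for whatever procedure produced the tilted mean \(\mathbb{E}_{p_i}[s(X)]\), and the damping value \(\alpha = 0.8\) is an arbitrary choice. All names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def mean_to_natural(mu):
    """Gradient map grad A*(mu): Gaussian natural parameters from moments mu = (E[x], E[x^2])."""
    m, ex2 = mu
    var = ex2 - m**2
    return np.array([m / var, -0.5 / var])

def damped_site_update(i, lambdas, theta0, mu_i, alpha=0.8):
    """lambda_i <- alpha*lambda_i + (1-alpha)*(grad A*(mu_i) - theta0 - sum_{j!=i} lambda_j)."""
    cavity = theta0 + sum(lam for j, lam in enumerate(lambdas) if j != i)
    target = mean_to_natural(mu_i) - cavity
    lambdas[i] = alpha * lambdas[i] + (1 - alpha) * target
    return lambdas
```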
Stochastic Natural-Gradient Expectation Propagation
The only difficult step is computing the mean parameters of the tilted distribution \[ \mathbb{E}_{p_i}\left[(s(X), \ell_i(X))\right] \] because the \(\ell_i(x)\) are intractable. For this reason, the authors derive an alternative to EP and Power EP: they introduce an additional auxiliary natural-parameter vector \(\theta_i'\) for each site, obtaining a new variational objective. \[ \begin{alignat}{2} &\!\max_{\mu_0, [\mu_i, \nu_i, \theta'_i]_{i=1}^n} &\qquad& \left\{\langle\mu_0, \theta_0\rangle + \sum_{i=1}^{n} \langle \nu_i, 1\rangle - A^*(\mu_0) -\sum_{i=1}^{n} \beta_i[A^*(\mu_i, \nu_i)+ A(\theta_i') - \mu_i^\top \theta_i']\right\}\\ &\text{s.t.} & & (\mu_i, \nu_i)\in \mathcal{M}(s(x), \ell_i(x)) &\forall i=1, \ldots, n\\ & & & \mu_i = \mu_0 & \forall i=1, \ldots, n\\ & & & \mu_0\in \mathcal{M}(s(x)) \\ & & & \theta_i'\in\Theta \end{alignat} \] The key is that maximizing this new objective with respect to the auxiliary variables \(\theta_i'\) recovers the original problem, since \(\max_{\theta_i'\in\Theta}\{\mu_i^\top\theta_i' - A(\theta_i')\} = A^*(\mu_i)\). To solve this new variational problem the authors introduce Lagrange multipliers and switch to the dual problem. \[ \begin{alignat}{2} &\!\max_{[\theta_i']_{i=1}^n}\min_{[\lambda_i]_{i=1}^n} &\qquad& A\left(\theta_0 + \sum_{i=1}^n \lambda_i\right) + \sum_{i=1}^{n} \beta_i\left[A_i\left(\theta_i'-\frac{\lambda_i}{\beta_i}, \frac{1}{\beta_i}\right)- A(\theta_i')\right]\\ &\text{s.t.} & & \theta_i'\in\Theta \qquad \qquad \forall \, i=1, \ldots, n \end{alignat} \]
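To see why this dual is convenient, differentiate the inner objective with respect to \(\lambda_i\) while holding \(\theta_i'\) fixed (a short calculation from the formula above, not quoted from the paper; the \(A(\theta_i')\) terms do not depend on \(\lambda\)): \[ \nabla_{\lambda_i}\left\{A\left(\theta_0 + \sum_{j=1}^n \lambda_j\right) + \beta_i A_i\left(\theta_i'-\frac{\lambda_i}{\beta_i}, \frac{1}{\beta_i}\right)\right\} = \mathbb{E}_{q}[s(X)] - \mathbb{E}_{p_i'}[s(X)] \] where \(q(x)\propto\exp(\langle\theta_0+\sum_j\lambda_j, s(x)\rangle)\) and \(p_i'\) is the tilted distribution with natural parameters \((\theta_i'-\lambda_i/\beta_i, 1/\beta_i)\). The only intractable quantity is the tilted expectation \(\mathbb{E}_{p_i'}[s(X)]\), which is estimated from samples drawn locally at each worker; this is what makes the gradient stochastic, and it is used inside natural-gradient updates, hence the algorithm's name.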
Bibliography
Hasenclever, Leonard, Stefan Webb, Thibaut Lienart, Sebastian Vollmer, Balaji Lakshminarayanan, Charles Blundell, and Yee Whye Teh. 2017. “Distributed Bayesian Learning with Stochastic Natural Gradient Expectation Propagation and the Posterior Server.” Journal of Machine Learning Research 18 (1): 3744–3780.