Gaussian Expectation Propagation

Multivariate Normal Distribution in the Exponential Family

Remember from a previous blog post that a pdf \(p(\boldsymbol{\mathbf{x}}) = \mathcal{N}(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{\mu}}, \boldsymbol{\mathbf{\Sigma}})\) can be written as \[ p(\boldsymbol{\mathbf{x}}) = \exp\left\{\langle \boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{\mu}}, \boldsymbol{\mathbf{x}}\rangle + \left\langle \text{vec}\left(-\frac{1}{2}\boldsymbol{\mathbf{\Sigma}}^{-1}\right), \text{vec}\left(\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top\right)\right\rangle -\frac{1}{2}\left[d\log2\pi + \log|\boldsymbol{\mathbf{\Sigma}}| +\boldsymbol{\mathbf{\mu}}^\top\boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{\mu}}\right]\right\} \]

Following the work of Barthelmé, Chopin, and Cottet (2015) we can relabel the expression above as follows \[ p(\boldsymbol{\mathbf{x}}) \propto \exp\left\{\langle \boldsymbol{\mathbf{r}}, \boldsymbol{\mathbf{x}}\rangle + \left\langle \text{vec}\left(-\frac{1}{2}\boldsymbol{\mathbf{Q}}\right), \text{vec}\left(\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top\right)\right\rangle \right\} \] where we have defined \[ \boldsymbol{\mathbf{r}}:= \boldsymbol{\mathbf{\Sigma}}^{-1} \boldsymbol{\mathbf{\mu}}\qquad \text{and} \qquad \boldsymbol{\mathbf{Q}}:= \boldsymbol{\mathbf{\Sigma}}^{-1} \]
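Throughout the post we will need to move back and forth between the moment parameterisation \((\boldsymbol{\mathbf{\mu}}, \boldsymbol{\mathbf{\Sigma}})\) and the natural parameterisation \((\boldsymbol{\mathbf{r}}, \boldsymbol{\mathbf{Q}})\). Below is a minimal NumPy sketch of this conversion; the helper names `to_natural` and `to_moment` are my own and not part of any library.

```python
import numpy as np

def to_natural(mu, Sigma):
    """Moment parameters (mu, Sigma) -> natural parameters (r, Q)."""
    Q = np.linalg.inv(Sigma)   # Q := Sigma^{-1}
    r = Q @ mu                 # r := Sigma^{-1} mu
    return r, Q

def to_moment(r, Q):
    """Natural parameters (r, Q) -> moment parameters (mu, Sigma)."""
    Sigma = np.linalg.inv(Q)
    mu = Sigma @ r
    return mu, Sigma
```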

Target Distribution

We assume the target distribution is intractable, but can be factorized into a product of \(K+1\) terms \[ f(\boldsymbol{\mathbf{\gamma}}) \propto \prod_{k=0}^K f_k(\boldsymbol{\mathbf{\gamma}}_k) \] where usually

  • \(f(\boldsymbol{\mathbf{\gamma}})=p(\boldsymbol{\mathbf{\gamma}}\mid \boldsymbol{\mathbf{x}})\) is a posterior distribution.
  • \(f_0(\boldsymbol{\mathbf{\gamma}}_0) = p(\boldsymbol{\mathbf{\gamma}}; \boldsymbol{\mathbf{\theta}}_0)\) is a prior distribution.
  • \(f_1(\boldsymbol{\mathbf{\gamma}}_1), \ldots, f_K(\boldsymbol{\mathbf{\gamma}}_K)\) are likelihood terms, which are intractable.

From here onwards we assume that the prior distribution \(f_0(\boldsymbol{\mathbf{\gamma}})\) is a multivariate Gaussian distribution \(\mathcal{N}(\boldsymbol{\mathbf{\gamma}}; \boldsymbol{\mathbf{r}}_0, \boldsymbol{\mathbf{Q}}_0)\) (with natural parameters \(\boldsymbol{\mathbf{\theta}}_0 = \left(\boldsymbol{\mathbf{r}}_0, -\frac{1}{2}\boldsymbol{\mathbf{Q}}_0\right)^\top\)). \[ f_0(\boldsymbol{\mathbf{\gamma}}) = p(\boldsymbol{\mathbf{\gamma}}; \boldsymbol{\mathbf{r}}_0, \boldsymbol{\mathbf{Q}}_0) \propto \exp\left\{\langle \boldsymbol{\mathbf{r}}_0, \boldsymbol{\mathbf{\gamma}}\rangle + \left\langle \text{vec}\left(-\frac{1}{2}\boldsymbol{\mathbf{Q}}_0\right), \text{vec}\left(\boldsymbol{\mathbf{\gamma}}\boldsymbol{\mathbf{\gamma}}^\top\right)\right\rangle \right\} \]

Note: The parameters \(\boldsymbol{\mathbf{\gamma}}_0, \ldots, \boldsymbol{\mathbf{\gamma}}_K\) are not the components of \(\boldsymbol{\mathbf{\gamma}}\); they are different parameters.

Global Approximation

The global approximation is defined as \[ g(\boldsymbol{\mathbf{\theta}}) \propto \prod_{k=0}^K g_k(\boldsymbol{\mathbf{\theta}}_k) \] where we set the first term to be equal to the (tractable) prior, which is a multivariate Gaussian distribution \[ g_0(\boldsymbol{\mathbf{\theta}}) \propto \exp\left\{\langle \boldsymbol{\mathbf{r}}_0, \boldsymbol{\mathbf{\gamma}}\rangle + \left\langle \text{vec}\left(-\frac{1}{2}\boldsymbol{\mathbf{Q}}_0\right), \text{vec}\left(\boldsymbol{\mathbf{\gamma}}\boldsymbol{\mathbf{\gamma}}^\top\right)\right\rangle \right\} \] Furthermore, we assume that each factor \(g_k(\boldsymbol{\mathbf{\theta}}_k)\) for \(k=1, \ldots, K\) also follows a multivariate normal distribution with natural parameters \(\boldsymbol{\mathbf{\theta}}_k = \left(\boldsymbol{\mathbf{r}}_k, -\frac{1}{2}\boldsymbol{\mathbf{Q}}_k\right)^\top\). Luckily, the product of Gaussians is again a Gaussian distribution. In particular we have \[ \begin{align} g(\boldsymbol{\mathbf{\theta}}) &\propto \prod_{k=0}^K \exp\left\{ \boldsymbol{\mathbf{r}}_k^\top\boldsymbol{\mathbf{\gamma}}+ \text{vec}\left(-\frac{1}{2}\boldsymbol{\mathbf{Q}}_k\right)^\top \text{vec}\left(\boldsymbol{\mathbf{\gamma}}\boldsymbol{\mathbf{\gamma}}^\top\right) \right\} \\ &= \exp\left\{\left(\sum_{k=0}^K \boldsymbol{\mathbf{r}}_k\right)^\top\boldsymbol{\mathbf{\gamma}}+ \text{vec}\left(-\frac{1}{2}\sum_{k=0}^K \boldsymbol{\mathbf{Q}}_k\right)^\top \text{vec}\left(\boldsymbol{\mathbf{\gamma}}\boldsymbol{\mathbf{\gamma}}^\top\right)\right\} \\ &:= \exp\left\{\boldsymbol{\mathbf{r}}^\top\boldsymbol{\mathbf{\gamma}}+ \text{vec}\left(-\frac{1}{2}\boldsymbol{\mathbf{Q}}\right)^\top \text{vec}\left(\boldsymbol{\mathbf{\gamma}}\boldsymbol{\mathbf{\gamma}}^\top\right)\right\} \end{align} \] where we have defined the global natural parameters to be \[ \boldsymbol{\mathbf{r}}:= \sum_{k=0}^K \boldsymbol{\mathbf{r}}_k \qquad \text{and} \qquad \boldsymbol{\mathbf{Q}}:= \sum_{k=0}^K \boldsymbol{\mathbf{Q}}_k \] In other words, the natural parameters of the global approximation are found by summing the natural parameters of all the sites.

Note: The parameters \(\boldsymbol{\mathbf{\theta}}_0, \ldots, \boldsymbol{\mathbf{\theta}}_K\) are not the components of \(\boldsymbol{\mathbf{\theta}}\); they are different parameters.
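To make the "sum the natural parameters of all the sites" recipe concrete, here is a small sketch, assuming the site parameters are stored as NumPy arrays in lists indexed by \(k=0,\ldots,K\); the function name is hypothetical.

```python
import numpy as np

def global_natural_params(r_sites, Q_sites):
    """Global approximation: sum the natural parameters of all the sites,
    including the prior factor stored at index 0."""
    r = sum(r_sites)   # r = r_0 + r_1 + ... + r_K
    Q = sum(Q_sites)   # Q = Q_0 + Q_1 + ... + Q_K
    return r, Q
```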

Cavity Distribution

The cavity distribution at the \(k^{\text{th}}\) site is given by the product of all the Gaussian factors except the \(k^{\text{th}}\) one. This means that we can write the cavity distribution as \[ g_{-k}(\boldsymbol{\mathbf{\theta}}- \boldsymbol{\mathbf{\theta}}_k) \propto \exp\left\{\left(\boldsymbol{\mathbf{r}}- \boldsymbol{\mathbf{r}}_k\right)^\top\boldsymbol{\mathbf{\gamma}}+ \text{vec}\left(-\frac{1}{2}\left(\boldsymbol{\mathbf{Q}}- \boldsymbol{\mathbf{Q}}_k\right)\right)^\top \text{vec}\left(\boldsymbol{\mathbf{\gamma}}\boldsymbol{\mathbf{\gamma}}^\top\right)\right\} \] In other words, the natural parameters of the cavity distribution are found by taking the difference between the global natural parameters and the natural parameters of the \(k^{\text{th}}\) site.
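In code the cavity step is just a subtraction of natural parameters; a one-line hedged sketch (function name my own):

```python
def cavity_natural_params(r, Q, r_k, Q_k):
    """Cavity distribution g_{-k}: remove site k from the global approximation."""
    return r - r_k, Q - Q_k
```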

Tilted Distribution

The tilted distribution, also called the pseudo-posterior, is found by multiplying the cavity distribution by the \(k^{\text{th}}\) local likelihood term. In other words, the tilted distribution is a (pseudo-)posterior where we use the cavity distribution as a (pseudo-)prior and the single local likelihood term as the likelihood.

In general, computing the moments of this distribution will be intractable; however, we hope that computing the moments of this single tilted distribution is easier than computing the moments of the entire target distribution. \[ \begin{align} g_{\backslash k}(\widetilde{\boldsymbol{\mathbf{\gamma}}}_k) &\propto f_k(\widetilde{\boldsymbol{\mathbf{\gamma}}}_k) \exp\left\{\left(\boldsymbol{\mathbf{r}}- \boldsymbol{\mathbf{r}}_k\right)^\top\widetilde{\boldsymbol{\mathbf{\gamma}}}_k+ \text{vec}\left(-\frac{1}{2}\left(\boldsymbol{\mathbf{Q}}- \boldsymbol{\mathbf{Q}}_k\right)\right)^\top \text{vec}\left(\widetilde{\boldsymbol{\mathbf{\gamma}}}_k\widetilde{\boldsymbol{\mathbf{\gamma}}}_k^\top\right)\right\} \end{align} \]
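For the MCMC step described later we only need the unnormalised log-density of the tilted distribution. A minimal sketch, assuming `log_f_k` is a user-supplied callable returning \(\log f_k(\widetilde{\boldsymbol{\mathbf{\gamma}}}_k)\); the names are mine, not from the paper.

```python
import numpy as np

def tilted_logpdf(gamma, log_f_k, r_cav, Q_cav):
    """Unnormalised log-density of the tilted distribution at site k.

    (r_cav, Q_cav) are the natural parameters of the cavity distribution,
    i.e. r_cav = r - r_k and Q_cav = Q - Q_k.
    """
    # Gaussian cavity term: <r_cav, gamma> - (1/2) gamma^T Q_cav gamma
    cavity_term = r_cav @ gamma - 0.5 * gamma @ Q_cav @ gamma
    return log_f_k(gamma) + cavity_term
```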

Gaussian Expectation Propagation Updates

Initialization

  • Choose natural parameters for the prior distribution \(\boldsymbol{\mathbf{r}}_0\) and \(\boldsymbol{\mathbf{Q}}_0\).
  • For every site \(k=1, \ldots, K\) choose the parameters \(\boldsymbol{\mathbf{r}}_k\) and \(\boldsymbol{\mathbf{Q}}_k\) (a common default choice is shown in the sketch after this list).
  • Compute the natural parameters of the global approximation \[ \boldsymbol{\mathbf{r}}\leftarrow \sum_{k=0}^K \boldsymbol{\mathbf{r}}_k \qquad \text{and} \qquad \boldsymbol{\mathbf{Q}}\leftarrow \sum_{k=0}^K \boldsymbol{\mathbf{Q}}_k \]
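A common, though not mandatory, choice is to initialise every site \(k=1, \ldots, K\) at zero natural parameters, so that the initial global approximation coincides with the prior. A minimal sketch under that assumption (function name my own):

```python
import numpy as np

def initialise_sites(r_0, Q_0, K):
    """Prior site at index 0; all other sites start at zero natural parameters,
    so the initial global approximation equals the prior."""
    d = r_0.shape[0]
    r_sites = [r_0] + [np.zeros(d) for _ in range(K)]
    Q_sites = [Q_0] + [np.zeros((d, d)) for _ in range(K)]
    r = sum(r_sites)   # equals r_0
    Q = sum(Q_sites)   # equals Q_0
    return r_sites, Q_sites, r, Q
```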

Updates

  • At the \(k^{\text{th}}\) site subtract the \(k^{\text{th}}\) natural parameters from the global ones, to obtain the natural parameters of the cavity distribution \[ \boldsymbol{\mathbf{r}}_{-k}\leftarrow \boldsymbol{\mathbf{r}}- \boldsymbol{\mathbf{r}}_k \qquad \text{and} \qquad \boldsymbol{\mathbf{Q}}_{-k} \leftarrow \boldsymbol{\mathbf{Q}}- \boldsymbol{\mathbf{Q}}_k \]
  • Sample from the tilted distribution using an MCMC sampler to obtain \(N\) samples \(\widetilde{\boldsymbol{\mathbf{\gamma}}}_k^{(1)}, \ldots, \widetilde{\boldsymbol{\mathbf{\gamma}}}_k^{(N)}\) from \[ g_{\backslash k}(\widetilde{\boldsymbol{\mathbf{\gamma}}}_k) \propto f_k(\widetilde{\boldsymbol{\mathbf{\gamma}}}_k) \exp\left\{\boldsymbol{\mathbf{r}}_{-k}^\top\widetilde{\boldsymbol{\mathbf{\gamma}}}_k+ \text{vec}\left(-\frac{1}{2}\boldsymbol{\mathbf{Q}}_{-k}\right)^\top \text{vec}\left(\widetilde{\boldsymbol{\mathbf{\gamma}}}_k\widetilde{\boldsymbol{\mathbf{\gamma}}}_k^\top\right)\right\} \]
  • Perform moment-matching by computing the mean and the variance-covariance matrix from the samples, and assign those values to the mean and the variance-covariance matrix of the newly found global Gaussian approximation. \[ \begin{align} \boldsymbol{\mathbf{\mu}}^{\text{new}, k}&\leftarrow \frac{1}{N}\sum_{j=1}^N \widetilde{\boldsymbol{\mathbf{\gamma}}}_k^{(j)} \\ \boldsymbol{\mathbf{\Sigma}}^{\text{new}, k} &\leftarrow \frac{1}{N - 1}\sum_{j=1}^N \left(\widetilde{\boldsymbol{\mathbf{\gamma}}}_k^{(j)} - \boldsymbol{\mathbf{\mu}}^{\text{new}, k}\right) \left(\widetilde{\boldsymbol{\mathbf{\gamma}}}_k^{(j)} - \boldsymbol{\mathbf{\mu}}^{\text{new}, k}\right)^\top \end{align} \] Notice that every site \(k=1, \ldots, K\) may have found a different new mean and variance-covariance matrix. We denote by \(\boldsymbol{\mathbf{\mu}}^{\text{new}, k}\) the new mean parameter for the multivariate normal global approximation, found at site \(k\).
  • Next, at every site \(k=1, \ldots, K\), convert the global mean and variance-covariance parameters into natural parameters \[ \begin{align} \boldsymbol{\mathbf{Q}}^{\text{new}, k} &\leftarrow \left(\boldsymbol{\mathbf{\Sigma}}^{\text{new}, k}\right)^{-1} \\ \boldsymbol{\mathbf{r}}^{\text{new}, k} &\leftarrow \boldsymbol{\mathbf{Q}}^{\text{new}, k} \boldsymbol{\mathbf{\mu}}^{\text{new}, k} \end{align} \]
  • At site \(k\) find the new natural parameters for the \(k^{\text{th}}\) approximating factor \(g_k(\boldsymbol{\mathbf{\theta}}_k)\) by taking the difference between the new global natural parameters and the natural parameters of the old cavity distribution. \[ \begin{align} \boldsymbol{\mathbf{Q}}_k^{\text{new}} &\leftarrow \boldsymbol{\mathbf{Q}}^{\text{new}, k} - \boldsymbol{\mathbf{Q}}_{-k} \\ \boldsymbol{\mathbf{r}}_k^{\text{new}} &\leftarrow \boldsymbol{\mathbf{r}}^{\text{new}, k} - \boldsymbol{\mathbf{r}}_{-k} \end{align} \]
  • From each site, send \(\boldsymbol{\mathbf{r}}_k^{\text{new}}\) and \(\boldsymbol{\mathbf{Q}}_k^{\text{new}}\) to the posterior server and update the global approximation. Since the prior factor \(g_0\) is never updated, its natural parameters are simply added back in, consistently with the initialization step. \[ \begin{align} \boldsymbol{\mathbf{r}}^{\text{new}}_{\text{parallel}} &\leftarrow \boldsymbol{\mathbf{r}}_0 + \sum_{k=1}^K \boldsymbol{\mathbf{r}}_k^{\text{new}} \\ \boldsymbol{\mathbf{Q}}^{\text{new}}_{\text{parallel}} &\leftarrow \boldsymbol{\mathbf{Q}}_0 + \sum_{k=1}^K \boldsymbol{\mathbf{Q}}_k^{\text{new}} \end{align} \] A minimal end-to-end sketch of one such parallel sweep is given after this list.
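Putting the whole update loop together, here is a hedged end-to-end sketch of one parallel EP sweep. The random-walk Metropolis sampler is only a stand-in for whatever MCMC sampler you prefer, and all names (`ep_sweep`, `random_walk_metropolis`, the list `log_f` of callables returning \(\log f_k\)) are my own; damping, burn-in and step-size tuning, which matter in practice, are omitted for brevity.

```python
import numpy as np

def random_walk_metropolis(logpdf, x0, n_samples, step=0.1, rng=None):
    """Basic random-walk Metropolis sampler (stand-in for any MCMC sampler)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    lp = logpdf(x)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal(x.size)
        lp_proposal = logpdf(proposal)
        if np.log(rng.uniform()) < lp_proposal - lp:   # Metropolis acceptance
            x, lp = proposal, lp_proposal
        samples[i] = x
    return samples

def ep_sweep(r_sites, Q_sites, log_f, n_samples=5000, rng=None):
    """One parallel Gaussian-EP sweep.

    r_sites, Q_sites : lists of length K+1 with the site natural parameters;
                       index 0 holds the prior factor, which is never updated.
    log_f            : list of length K+1 of callables; log_f[k](gamma) returns
                       log f_k(gamma) for k = 1, ..., K (entry 0 is unused).
    """
    K = len(r_sites) - 1
    # Global natural parameters: sum over all sites, including the prior.
    r, Q = sum(r_sites), sum(Q_sites)

    new_r_sites, new_Q_sites = [r_sites[0]], [Q_sites[0]]
    for k in range(1, K + 1):
        # Cavity: subtract site k from the global approximation.
        r_cav, Q_cav = r - r_sites[k], Q - Q_sites[k]

        # Unnormalised log-density of the tilted distribution at site k.
        def tilted_logpdf(g, k=k, r_cav=r_cav, Q_cav=Q_cav):
            return log_f[k](g) + r_cav @ g - 0.5 * g @ Q_cav @ g

        # Sample the tilted distribution, starting the chain at the global mean.
        mu_start = np.linalg.solve(Q, r)
        samples = random_walk_metropolis(tilted_logpdf, mu_start, n_samples, rng=rng)

        # Moment matching: empirical mean and covariance of the tilted samples.
        mu_new = samples.mean(axis=0)
        Sigma_new = np.atleast_2d(np.cov(samples, rowvar=False))

        # Convert to natural parameters and subtract the cavity to get site k.
        Q_new = np.linalg.inv(Sigma_new)
        r_new = Q_new @ mu_new
        new_r_sites.append(r_new - r_cav)
        new_Q_sites.append(Q_new - Q_cav)

    # Posterior server: new global approximation (prior added back in).
    r_parallel, Q_parallel = sum(new_r_sites), sum(new_Q_sites)
    return new_r_sites, new_Q_sites, r_parallel, Q_parallel
```

In a full implementation one would repeat such sweeps until the global natural parameters stop changing, typically with damping on the site updates to keep the precision matrices positive definite.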

Bibliography

Barthelmé, Simon, Nicolas Chopin, and Vincent Cottet. 2015. “Divide and Conquer in ABC: Expectation-Propagation Algorithms for Likelihood-Free Inference.” https://arxiv.org/abs/1512.00205.