Multivariate Normal as an Exponential Family Distribution

Exponential Family of Distributions

A density \(f(\boldsymbol{\mathbf{x}})\) belongs to the exponential family of distributions if we can write it as \[ f(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{\theta}}) = \exp\left\{\langle\boldsymbol{\mathbf{\theta}}, \phi(\boldsymbol{\mathbf{x}})\rangle - A(\boldsymbol{\mathbf{\theta}})\right\} \] We call \(\boldsymbol{\mathbf{\theta}}\) the natural parameters, \(\phi(\boldsymbol{\mathbf{x}})\) the sufficient statistics, and \(A(\boldsymbol{\mathbf{\theta}})\) the log-partition function that normalizes the density; the expectation \(\mathbb{E}_f[\phi(\boldsymbol{\mathbf{X}})]\) gives the mean parameters.
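As a quick sanity check, here is a minimal Python sketch of this template (the helper name `exp_family_logpdf` is purely illustrative): the standard univariate normal fits it with \(\boldsymbol{\mathbf{\theta}} = (0, -\tfrac{1}{2})\), \(\phi(x) = (x, x^2)\) and \(A(\boldsymbol{\mathbf{\theta}}) = \tfrac{1}{2}\log 2\pi\).

```python
import numpy as np
from scipy.stats import norm

def exp_family_logpdf(theta, phi_x, log_partition):
    """log f(x; theta) = <theta, phi(x)> - A(theta)."""
    return theta @ phi_x - log_partition

# Standard normal N(0, 1): theta = (0, -1/2), phi(x) = (x, x^2),
# and log-partition A(theta) = (1/2) log(2 pi).
x = 1.3
theta = np.array([0.0, -0.5])
phi_x = np.array([x, x ** 2])
A = 0.5 * np.log(2 * np.pi)

print(np.isclose(exp_family_logpdf(theta, phi_x, A), norm.logpdf(x)))  # True
```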

Multivariate Normal Distribution

A random vector \(\boldsymbol{\mathbf{x}}\in\mathbb{R}^d\) has a multivariate normal distribution with mean \(\boldsymbol{\mathbf{\mu}}\) and covariance matrix \(\boldsymbol{\mathbf{\Sigma}}\) if its pdf is \[ f(\boldsymbol{\mathbf{x}}) = (2\pi)^{-\frac{d}{2}}\text{det}(\boldsymbol{\mathbf{\Sigma}})^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})^\top \boldsymbol{\mathbf{\Sigma}}^{-1}(\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})\right\} \]
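To make this concrete, here is a small numpy sketch (assuming scipy is available as a reference implementation) that evaluates this formula directly on a random mean and covariance:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
W = rng.normal(size=(d, d))
Sigma = W @ W.T + d * np.eye(d)   # a random symmetric positive-definite covariance
x = rng.normal(size=d)

# Evaluate the density exactly as written above
diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)
pdf = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

print(np.isclose(pdf, multivariate_normal(mu, Sigma).pdf(x)))  # True
```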

Expanding the quadratic form and moving the normalizing constant inside the exponential, the density can be rearranged as \[ f(\boldsymbol{\mathbf{x}}) = \exp\left\{\boldsymbol{\mathbf{x}}^\top\boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{\mu}}-\frac{1}{2}\boldsymbol{\mathbf{x}}^\top\boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{x}}-\frac{1}{2}\left[d\log2\pi + \log|\boldsymbol{\mathbf{\Sigma}}| +\boldsymbol{\mathbf{\mu}}^\top\boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{\mu}}\right]\right\} \]
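The rearrangement is easy to verify numerically; a short numpy check of the two exponents, under the same kind of random \(\boldsymbol{\mathbf{\mu}}\), \(\boldsymbol{\mathbf{\Sigma}}\) and \(\boldsymbol{\mathbf{x}}\) setup as before:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
W = rng.normal(size=(d, d))
Sigma = W @ W.T + d * np.eye(d)   # random SPD covariance
x = rng.normal(size=d)
P = np.linalg.inv(Sigma)          # precision matrix

const = d * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
original = -0.5 * (x - mu) @ P @ (x - mu) - 0.5 * const
rearranged = x @ P @ mu - 0.5 * x @ P @ x - 0.5 * (const + mu @ P @ mu)

print(np.isclose(original, rearranged))  # True
```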

Frobenius Inner Product

Notice that we can write the second term as \[ -\frac{1}{2}\boldsymbol{\mathbf{x}}^\top \boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{x}}= -\frac{1}{2}\sum_{k=1}^d \sum_{j=1}^d x_k\Sigma_{kj}^{-1}x_j \] Similarly, the following trace expression expands to the same double sum \[ \begin{align} \text{tr}\left[-\frac{1}{2}\boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top\right] &= -\frac{1}{2}\text{tr}\left[ \begin{pmatrix} \Sigma_{11}^{-1} & \cdots & \Sigma_{1d}^{-1} \\ \vdots & \ddots & \vdots \\ \Sigma_{d1}^{-1} & \cdots & \Sigma_{dd}^{-1} \end{pmatrix} \begin{pmatrix} x_1^2 & \cdots & x_1x_d\\ \vdots & \ddots & \vdots\\ x_dx_1 & \cdots & x_d^2 \end{pmatrix} \right]\\ &= -\frac{1}{2}\text{tr}\left[ \begin{pmatrix} \sum_{j=1}^d\Sigma_{1j}^{-1}x_jx_1 & \cdots & \sum_{j=1}^d\Sigma_{1j}^{-1}x_jx_d \\ \vdots & \ddots & \vdots \\ \sum_{j=1}^d \Sigma_{dj}^{-1}x_jx_1 & \cdots & \sum_{j=1}^d \Sigma_{dj}^{-1}x_jx_d \end{pmatrix} \right]\\ &= -\frac{1}{2}\sum_{k=1}^d\sum_{j=1}^d x_{k}\Sigma_{kj}^{-1}x_j \end{align} \] This is nothing but the Frobenius inner product between two real, symmetric matrices, which can be written both in terms of a trace and in terms of \(\text{vec}\)-torized quantities \[ \begin{align} \left\langle -\frac{1}{2}\boldsymbol{\mathbf{\Sigma}}^{-1}, \boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top\right\rangle_F &= \text{tr}\left(-\frac{1}{2}\boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top\right) \\ &= \text{vec}\left(-\frac{1}{2}\boldsymbol{\mathbf{\Sigma}}^{-1}\right)^\top\text{vec}\left(\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top\right) \end{align} \] where the vectorization of an \(n\times m\) matrix \(A\) stacks its rows one at a time into an \((nm)\times 1\) vector \[ \text{vec}[A] = \text{vec}\left[\begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm}\end{pmatrix}\right] = \begin{pmatrix} a_{11} \\ \vdots \\ a_{1m} \\ \vdots \\ a_{nm}\end{pmatrix} \]
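All three expressions (the quadratic form, the trace, and the vec inner product) can be checked in a few lines of numpy; note that `ravel()` flattens row by row, matching the row-stacking vec convention used here (for symmetric matrices the row/column choice makes no difference anyway):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
W = rng.normal(size=(d, d))
Sigma = W @ W.T + d * np.eye(d)   # random SPD covariance
x = rng.normal(size=d)
P = np.linalg.inv(Sigma)          # precision matrix

quad = -0.5 * x @ P @ x                                # quadratic form
trace = np.trace(-0.5 * P @ np.outer(x, x))            # trace form
frob = (-0.5 * P).ravel() @ np.outer(x, x).ravel()     # vec inner product

print(np.allclose([quad, trace], frob))  # True
```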

These identities allow us to write the pdf as \[ f(\boldsymbol{\mathbf{x}}) = \exp\left\{\langle \boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{\mu}}, \boldsymbol{\mathbf{x}}\rangle + \left\langle \text{vec}\left(-\frac{1}{2}\boldsymbol{\mathbf{\Sigma}}^{-1}\right), \text{vec}\left(\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top\right)\right\rangle -\frac{1}{2}\left[d\log2\pi + \log|\boldsymbol{\mathbf{\Sigma}}| +\boldsymbol{\mathbf{\mu}}^\top\boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{\mu}}\right]\right\} \]

Natural Parameters of a Multivariate Normal Distribution

Since \(\boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{\mu}}\) can be written as \[ \boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{\mu}}= \begin{pmatrix} \sum_{j=1}^d \Sigma_{1j}^{-1}\mu_j \\ \vdots \\ \sum_{j=1}^d \Sigma_{dj}^{-1}\mu_j \end{pmatrix} \] the natural parameters are given by \[ \boldsymbol{\mathbf{\theta}}= \begin{pmatrix} \boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{\mu}}\\ \text{vec}\left[-\frac{1}{2}\boldsymbol{\mathbf{\Sigma}}^{-1}\right] \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^d \Sigma_{1j}^{-1}\mu_j \\ \vdots \\ \sum_{j=1}^d \Sigma_{dj}^{-1}\mu_j \\ -\frac{1}{2}\Sigma_{11}^{-1} \\ \vdots \\ -\frac{1}{2}\Sigma_{1d}^{-1} \\ \vdots \\ -\frac{1}{2}\Sigma_{dd}^{-1} \end{pmatrix}_{(d + d^2)\times 1} \]
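Assembling \(\boldsymbol{\mathbf{\theta}}\) in numpy is a one-liner sketch; as before, `ravel()` stacks rows, matching the vec convention above:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
W = rng.normal(size=(d, d))
Sigma = W @ W.T + d * np.eye(d)   # random SPD covariance
P = np.linalg.inv(Sigma)          # precision matrix

# theta = (Sigma^{-1} mu, vec(-1/2 Sigma^{-1}))
theta = np.concatenate([P @ mu, (-0.5 * P).ravel()])
print(theta.shape)  # (12,) = (d + d^2,)
```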

In a similar fashion, we can find the sufficient statistics as \[ \phi(\boldsymbol{\mathbf{x}}) = \begin{pmatrix} \boldsymbol{\mathbf{x}}\\ \text{vec}(\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top) \end{pmatrix} = \begin{pmatrix} x_1 \\ \vdots \\ x_d \\ x_1^2 \\ \vdots \\ x_1x_d \\ \vdots \\ x_d^2 \end{pmatrix}_{(d + d^2)\times 1} \]
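The matching sufficient-statistics vector is stacked the same way:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
x = rng.normal(size=d)

# phi(x) = (x, vec(x x^T))
phi = np.concatenate([x, np.outer(x, x).ravel()])
print(phi.shape)  # (12,) = (d + d^2,)
```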

This gives the complete expression for the multivariate normal distribution as a member of the exponential family of distributions:

\[\begin{align} f(\boldsymbol{\mathbf{x}}) &= \exp\left\{\langle \boldsymbol{\mathbf{\theta}}, \phi(\boldsymbol{\mathbf{x}})\rangle - A(\boldsymbol{\mathbf{\theta}})\right\} \\ &= \exp\left\{ \left(\sum_{j=1}^d \Sigma_{1j}^{-1}\mu_j, \cdots, \sum_{j=1}^d \Sigma_{dj}^{-1}\mu_j, -\frac{1}{2}\Sigma_{11}^{-1}, \cdots, -\frac{1}{2}\Sigma_{1d}^{-1}, \cdots, -\frac{1}{2}\Sigma_{dd}^{-1} \right) \begin{pmatrix} x_1 \\ \vdots \\ x_d \\ x_1^2 \\ \vdots \\ x_1x_d \\ \vdots \\ x_d^2 \end{pmatrix} -\frac{1}{2}\left[d\log2\pi + \log|\boldsymbol{\mathbf{\Sigma}}| +\boldsymbol{\mathbf{\mu}}^\top\boldsymbol{\mathbf{\Sigma}}^{-1}\boldsymbol{\mathbf{\mu}}\right] \right\} \end{align}\]
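Putting the pieces together, a final numerical check (again leaning on scipy as the reference) confirms that \(\langle\boldsymbol{\mathbf{\theta}}, \phi(\boldsymbol{\mathbf{x}})\rangle - A(\boldsymbol{\mathbf{\theta}})\) reproduces the usual multivariate normal log-density:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
W = rng.normal(size=(d, d))
Sigma = W @ W.T + d * np.eye(d)   # random SPD covariance
x = rng.normal(size=d)
P = np.linalg.inv(Sigma)          # precision matrix

theta = np.concatenate([P @ mu, (-0.5 * P).ravel()])   # natural parameters
phi = np.concatenate([x, np.outer(x, x).ravel()])      # sufficient statistics
A = 0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1] + mu @ P @ mu)

print(np.isclose(theta @ phi - A, multivariate_normal(mu, Sigma).logpdf(x)))  # True
```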

Mean Parameters of a Multivariate Normal Distribution

Remember that the expected value of a matrix or a vector is taken element-wise, so that \[ \begin{align} (\mathbb{E}[\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top])_{ij} &= \mathbb{E}[(\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top)_{ij}] \\ &= \mathbb{E}[x_ix_j] \\ &= \text{cov}(x_i, x_j) + \mathbb{E}[x_i]\mathbb{E}[x_j] \\ &= \Sigma_{ij} + \mu_i \mu_j \end{align} \] This means that taking the expected value of the vectorized matrix \(\text{vec}(\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top)\) is equivalent to vectorizing the expected value of \(\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top\) \[ \begin{align} \mathbb{E}\left[\text{vec}(\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top)\right] &= \mathbb{E}\left[ \begin{pmatrix} x_1^2 \\ \vdots \\ x_1x_d \\ \vdots \\ x_d^2 \end{pmatrix} \right] \\ &= \begin{pmatrix} \mathbb{E}\left[x_1^2\right] \\ \vdots \\ \mathbb{E}\left[x_1x_d\right] \\ \vdots \\ \mathbb{E}\left[x_d^2\right] \end{pmatrix} \\ &= \begin{pmatrix} \Sigma_{11} + \mu_1^2 \\ \vdots \\ \Sigma_{1d} + \mu_1\mu_d \\ \vdots \\ \Sigma_{dd} + \mu_d^2 \end{pmatrix} \\ &= \text{vec}\left(\mathbb{E}\left[\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top\right]\right) \end{align} \] Finally, the mean parameters are given by \[ \begin{align} \mathbb{E}\left[\phi(\boldsymbol{\mathbf{x}})\right] &= \mathbb{E}\left[\begin{pmatrix} \boldsymbol{\mathbf{x}}\\ \text{vec}(\boldsymbol{\mathbf{x}}\boldsymbol{\mathbf{x}}^\top) \end{pmatrix}\right] \\ &= \begin{pmatrix} \boldsymbol{\mathbf{\mu}}\\ \text{vec}\left(\boldsymbol{\mathbf{\Sigma}}+ \boldsymbol{\mathbf{\mu}}\boldsymbol{\mathbf{\mu}}^\top\right) \end{pmatrix} \end{align} \]
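A Monte Carlo sketch makes the mean parameters tangible: the empirical average of \(\phi(\boldsymbol{\mathbf{X}})\) over samples from \(N(\boldsymbol{\mathbf{\mu}}, \boldsymbol{\mathbf{\Sigma}})\) should approach \((\boldsymbol{\mathbf{\mu}}, \text{vec}(\boldsymbol{\mathbf{\Sigma}} + \boldsymbol{\mathbf{\mu}}\boldsymbol{\mathbf{\mu}}^\top))\), up to sampling error:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
W = rng.normal(size=(d, d))
Sigma = W @ W.T + d * np.eye(d)   # random SPD covariance

samples = rng.multivariate_normal(mu, Sigma, size=200_000)     # shape (n, d)
outer = np.einsum('ni,nj->nij', samples, samples)              # per-sample x x^T
phi = np.hstack([samples, outer.reshape(len(samples), -1)])    # per-sample phi(x)

mean_params = np.concatenate([mu, (Sigma + np.outer(mu, mu)).ravel()])
print(np.allclose(phi.mean(axis=0), mean_params, atol=0.1))  # True, up to Monte Carlo error
```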
