Towards SMC: Importance Sampling Explained

Simple Explanation of Importance Sampling for Sequential Monte Carlo (SMC)

Classic Importance Sampling

Suppose that our aim is to compute the posterior expectation \[ \mathbb{E}_{p(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}[h(\boldsymbol{\mathbf{\theta}})] = \int h(\boldsymbol{\mathbf{\theta}})p(\boldsymbol{\mathbf{\theta}}\mid\boldsymbol{\mathbf{y}}) d\boldsymbol{\mathbf{\theta}} \] of some function \(h(\boldsymbol{\mathbf{\theta}})\) with respect to a posterior distribution

\[\begin{align} p(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}}) = \frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}{Z_p} \qquad \text{and} \qquad Z_p = \int \widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}}) d\boldsymbol{\mathbf{\theta}}. \end{align}\]
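For instance, in the usual Bayesian setting the unnormalized posterior is simply the likelihood times the prior, both of which we can evaluate pointwise, while the normalizing constant is the marginal likelihood, whose integral is typically intractable

\[ \widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}}) = p(\boldsymbol{\mathbf{y}}\mid \boldsymbol{\mathbf{\theta}})\,p(\boldsymbol{\mathbf{\theta}}) \qquad \text{and} \qquad Z_p = p(\boldsymbol{\mathbf{y}}) = \int p(\boldsymbol{\mathbf{y}}\mid \boldsymbol{\mathbf{\theta}})\,p(\boldsymbol{\mathbf{\theta}})\, d\boldsymbol{\mathbf{\theta}}. \]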

Suppose also that it is impractical to sample directly from \(p(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})\). In Importance Sampling we choose an importance distribution \(q(\boldsymbol{\mathbf{\theta}})\) from which it is easier to sample, and rewrite the expectation above as a ratio of expectations with respect to \(q(\boldsymbol{\mathbf{\theta}})\), so that we can use samples from \(q(\boldsymbol{\mathbf{\theta}})\) instead \[ \begin{align} \mathbb{E}_{p(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}[h(\boldsymbol{\mathbf{\theta}})] &= \int h(\boldsymbol{\mathbf{\theta}})p(\boldsymbol{\mathbf{\theta}}\mid\boldsymbol{\mathbf{y}}) d\boldsymbol{\mathbf{\theta}}\\ &= \int h(\boldsymbol{\mathbf{\theta}})\frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}{Z_p}d\boldsymbol{\mathbf{\theta}}\\ &= \frac{1}{Z_p}\int h(\boldsymbol{\mathbf{\theta}}) \frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}})} q(\boldsymbol{\mathbf{\theta}}) d\boldsymbol{\mathbf{\theta}}\\ &= \frac{\int h(\boldsymbol{\mathbf{\theta}}) \frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}})} q(\boldsymbol{\mathbf{\theta}}) d\boldsymbol{\mathbf{\theta}}}{\int \frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}})} q(\boldsymbol{\mathbf{\theta}})d\boldsymbol{\mathbf{\theta}}} \\ &= \frac{\mathbb{E}_{q(\boldsymbol{\mathbf{\theta}})}\left[\frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}})}h(\boldsymbol{\mathbf{\theta}})\right]}{\mathbb{E}_{q(\boldsymbol{\mathbf{\theta}})}\left[\frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}})}\right]}, \end{align} \] where in the fourth line we used the fact that \(Z_p = \int \widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}}) d\boldsymbol{\mathbf{\theta}} = \int \frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}})} q(\boldsymbol{\mathbf{\theta}}) d\boldsymbol{\mathbf{\theta}}\) to rewrite the denominator. Now we approximate both the numerator and the denominator by Monte Carlo, using samples \(\boldsymbol{\mathbf{\theta}}^{(1)}, \ldots, \boldsymbol{\mathbf{\theta}}^{(N)}\sim q(\boldsymbol{\mathbf{\theta}})\) \[\begin{align} \mathbb{E}_{p(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}[h(\boldsymbol{\mathbf{\theta}})] &= \frac{\mathbb{E}_{q(\boldsymbol{\mathbf{\theta}})}\left[\frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}})}h(\boldsymbol{\mathbf{\theta}})\right]}{\mathbb{E}_{q(\boldsymbol{\mathbf{\theta}})}\left[\frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}})}\right]} \\ &\approx \frac{\frac{1}{N}\sum_{i=1}^N \frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}^{(i)}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}}^{(i)})}h(\boldsymbol{\mathbf{\theta}}^{(i)})}{\frac{1}{N}\sum_{i=1}^N \frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}^{(i)}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}}^{(i)})}} \\ &= \sum_{i=1}^N \left[\frac{\frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}^{(i)}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}}^{(i)})}}{\sum_{j=1}^N \frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}^{(j)}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}}^{(j)})}}\right] h(\boldsymbol{\mathbf{\theta}}^{(i)}) \\ &= \sum_{i=1}^N \left[\frac{\widetilde{w}(\boldsymbol{\mathbf{\theta}}^{(i)})}{\sum_{j=1}^N \widetilde{w}(\boldsymbol{\mathbf{\theta}}^{(j)})}\right]h(\boldsymbol{\mathbf{\theta}}^{(i)}) \\ &= \sum_{i=1}^N w(\boldsymbol{\mathbf{\theta}}^{(i)}) h(\boldsymbol{\mathbf{\theta}}^{(i)}) \end{align}\]

where \(\widetilde{w}(\boldsymbol{\mathbf{\theta}}^{(i)})\) and \(w(\boldsymbol{\mathbf{\theta}}^{(i)})\) are called the unnormalized and normalized importance weights, respectively, and are defined as \[ \widetilde{w}(\boldsymbol{\mathbf{\theta}}^{(i)}) = \frac{\widetilde{p}(\boldsymbol{\mathbf{\theta}}^{(i)}\mid \boldsymbol{\mathbf{y}})}{q(\boldsymbol{\mathbf{\theta}}^{(i)})} \qquad \text{and} \qquad w(\boldsymbol{\mathbf{\theta}}^{(i)}) = \frac{\widetilde{w}(\boldsymbol{\mathbf{\theta}}^{(i)})}{\sum_{j=1}^N \widetilde{w}(\boldsymbol{\mathbf{\theta}}^{(j)}) } \]
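As a concrete illustration, here is a minimal NumPy sketch of this self-normalized estimator on a toy conjugate Gaussian model. The model, the proposal, and all variable names are illustrative choices rather than anything prescribed by the post; working with log-weights simply avoids numerical underflow.

```python
# Self-normalized importance sampling on a toy model (illustrative setup).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data: y_i ~ N(theta, 1) with a N(0, 5^2) prior on theta.
y = rng.normal(loc=2.0, scale=1.0, size=20)

def log_p_tilde(theta):
    """Log of the unnormalized posterior: log p(y | theta) + log p(theta)."""
    log_lik = stats.norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)
    log_prior = stats.norm.logpdf(theta, loc=0.0, scale=5.0)
    return log_lik + log_prior

# Importance distribution q: a deliberately broad Gaussian.
q_loc, q_scale = 0.0, 3.0
N = 10_000
theta = rng.normal(q_loc, q_scale, size=N)        # theta^(i) ~ q
log_q = stats.norm.logpdf(theta, q_loc, q_scale)

# Unnormalized log-weights: log w~(theta^(i)) = log p~(theta^(i) | y) - log q(theta^(i)).
log_w_tilde = log_p_tilde(theta) - log_q
# Normalize in log-space for numerical stability, then exponentiate.
log_w = log_w_tilde - np.logaddexp.reduce(log_w_tilde)
w = np.exp(log_w)                                  # normalized weights, sum to 1

# Self-normalized IS estimate of E[h(theta) | y] with h(theta) = theta.
is_estimate = np.sum(w * theta)

# Closed-form posterior mean for this conjugate model, as a sanity check.
post_var = 1.0 / (len(y) / 1.0**2 + 1.0 / 5.0**2)
post_mean = post_var * (y.sum() / 1.0**2)
print(f"IS estimate: {is_estimate:.4f}   exact posterior mean: {post_mean:.4f}")
```

With a proposal this broad the weighted average lands close to the exact posterior mean; shrinking or shifting `q_loc`, `q_scale` away from the posterior quickly degrades the estimate, which motivates the properties discussed next.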

Properties of Classic Importance Sampling

  • For the importance sampling estimator of \(\mathbb{E}_{p(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})}[h(\boldsymbol{\mathbf{\theta}})]\) to be well defined we require \(\text{supp}(q) \supseteq \text{supp}(p)\). For its variance to be finite we additionally need the weights \(\widetilde{p}(\boldsymbol{\mathbf{\theta}}\mid \boldsymbol{\mathbf{y}})/q(\boldsymbol{\mathbf{\theta}})\) to have a finite second moment under \(q\), which in practice means choosing a \(q\) with tails at least as heavy as those of \(p\).
  • Notice that the vector of weights \(\boldsymbol{\mathbf{w}} = (w(\boldsymbol{\mathbf{\theta}}^{(1)}), \ldots, w(\boldsymbol{\mathbf{\theta}}^{(N)}))^\top\) tells us how well \(q(\boldsymbol{\mathbf{\theta}})\) fits \(p(\boldsymbol{\mathbf{\theta}}\mid\boldsymbol{\mathbf{y}})\). In particular, when \(q\equiv p\) we have \(\boldsymbol{\mathbf{w}} = \left(\frac{1}{N}, \ldots, \frac{1}{N}\right)^\top\), i.e. a vector of uniform weights, so the more uniform the weights, the better the fit. In practice only a few weights tend to dominate: if \(q\) fits \(p\) poorly, the vast majority of weights will be negligible and one or two weights will carry almost all of the mass, as illustrated in the sketch below.
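The following sketch (again a hypothetical setup, not from the original derivation) contrasts the normalized weights produced by a well-matched and a badly-matched importance distribution for a standard normal target. The quantity \(1/\sum_i w(\boldsymbol{\mathbf{\theta}}^{(i)})^2\), often called the effective sample size, is a common summary of how uniform the weights are: it equals \(N\) for perfectly uniform weights and approaches 1 when a single weight dominates.

```python
# Weight degeneracy under a poorly chosen importance distribution (illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 5_000

# Stand-in target: take p(theta | y) to be N(0, 1), evaluated unnormalized.
def log_p_tilde(theta):
    return -0.5 * theta**2

def normalized_weights(q_loc, q_scale):
    theta = rng.normal(q_loc, q_scale, size=N)                 # theta^(i) ~ q
    log_w = log_p_tilde(theta) - stats.norm.logpdf(theta, q_loc, q_scale)
    log_w -= np.logaddexp.reduce(log_w)                        # normalize to sum to 1
    return np.exp(log_w)

for label, loc, scale in [("good fit: q = N(0, 1.5^2)", 0.0, 1.5),
                          ("bad  fit: q = N(4, 0.5^2)", 4.0, 0.5)]:
    w = normalized_weights(loc, scale)
    ess = 1.0 / np.sum(w**2)                                   # effective sample size
    print(f"{label}:  max weight = {w.max():.3f},  ESS = {ess:.0f} of {N}")
```

With the well-matched proposal the largest weight stays tiny and the effective sample size is close to \(N\); with the badly-matched proposal a handful of samples in the far tail of \(q\) soak up nearly all of the normalized weight and the effective sample size collapses.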

