MCMC Literature

1953 - Metropolis

The aim is to approximate the following integral (a canonical ensemble average)

$$\bar{F} = \frac{\int F \exp(-E/kT)\,dp\,dq}{\int \exp(-E/kT)\,dp\,dq}$$

This is not feasible using standard numerical methods (a grid of integration points). One could choose integration points uniformly at random and then give each point the weight $\exp(-E/kT)$; however, this is not practical, as most of the mass is concentrated in a small set. Instead we choose points with probability proportional to $\exp(-E/kT)$ and then weight them evenly. We move the point $(X, Y)$ to

$$X' = X + \alpha\xi_1, \qquad Y' = Y + \alpha\xi_2, \qquad \xi_1, \xi_2 \sim U(-1, 1)$$

We then calculate the resulting change $\Delta E$ in the energy of the system. If $\Delta E < 0$ (we have moved to a region of lower energy, i.e. higher probability) we accept the move; otherwise, if $\Delta E > 0$, we accept the move with probability $\exp(-\Delta E / kT)$. We then approximate the expectation as

$$\bar{F} = \frac{1}{M}\sum_{j=1}^{M} F_j$$

The proof of correctness essentially shows that this scheme visits points with probability proportional to $\exp(-E/kT)$. A code sketch follows the list below.

  • Proof that the method is ergodic, i.e. that any point of the domain can be reached.
  • Proof of detailed balance.
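
A minimal sketch of the 1953 scheme, assuming an arbitrary energy function; the harmonic-potential example at the end is my own illustration, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_1953(energy, x0, y0, alpha, kT, n_steps):
    """Random-walk Metropolis as in the 1953 paper (symmetric proposal)."""
    x, y = x0, y0
    samples = []
    for _ in range(n_steps):
        # Propose a uniform displacement in [-alpha, alpha] per coordinate.
        xp = x + alpha * rng.uniform(-1, 1)
        yp = y + alpha * rng.uniform(-1, 1)
        dE = energy(xp, yp) - energy(x, y)
        # Accept downhill moves; accept uphill moves with prob exp(-dE/kT).
        if dE < 0 or rng.uniform() < np.exp(-dE / kT):
            x, y = xp, yp
        samples.append((x, y))
    return np.array(samples)

# Example: harmonic energy E = (x^2 + y^2)/2, so the target is a 2-D Gaussian.
chain = metropolis_1953(lambda x, y: 0.5 * (x**2 + y**2), 0.0, 0.0,
                        alpha=1.0, kT=1.0, n_steps=10_000)
F_bar = chain[:, 0].mean()  # estimate of E[X]; should be near 0
```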

1970 - Hastings

Features of MCMC:

  • Computations depend on the target $p(x)$ only through ratios of the form $p(x')/p(x)$; therefore the normalizing constant need not be known, and no factorization of $p(x)$ is necessary.
  • Samples are obtained via a Markov chain and hence are correlated. Estimating the standard deviation of our estimates of expectations therefore requires more care.

We have a probability distribution $\pi = (\pi_0, \ldots, \pi_S)$ and a function $f$ defined on the state space. We wish to estimate

$$I = \mathbb{E}_\pi[f] = \sum_{i=0}^{S} f(i)\,\pi_i$$

We construct a transition matrix $P$ so that $\pi$ is its unique stationary distribution, $\pi = \pi P$, and we approximate $I$ with $\hat{I} = \frac{1}{N}\sum_{t=1}^{N} f(X^{(t)})$. If the Markov chain defined by $P$ is finite and irreducible, then $\hat{I}$ is asymptotically normally distributed and $\hat{I} \to I$ in mean square as $N \to \infty$. Notice that since $X^{(t)}$ is a Markov chain it depends only on the previous state, so it is asymptotically stationary (meaning that the distribution of $X^{(t)}$ does not change when shifted in time). Hence we can estimate the variance of this estimator using techniques for a stationary process:

$$\operatorname{var}(\bar{Y}) = \frac{\sigma^2}{N} \sum_{j=-N+1}^{N-1} \left(1 - \frac{|j|}{N}\right) \operatorname{corr}\!\left(Y^{(t)}, Y^{(t+j)}\right) \approx \frac{2\pi g(0)}{N} \quad \text{as } N \to \infty$$

where $g$ is the spectral density of the process.
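
The formula above can be turned into a practical estimator by replacing the correlations with their empirical counterparts and truncating the sum at a modest lag (empirical correlations at large lags are pure noise; the `max_lag` cutoff is a tuning choice, not part of the formula). A sketch:

```python
import numpy as np

def mcmc_var_of_mean(y, max_lag=100):
    """Estimate var(Y_bar) for a stationary sequence y using the
    lag-weighted autocovariance sum from the formula above."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    yc = y - y.mean()
    # Empirical autocovariances gamma_j for j = 0..max_lag.
    gamma = np.array([yc[:n - j] @ yc[j:] / n for j in range(max_lag + 1)])
    # var(Y_bar) ~ (1/n) * sum_{|j| < n} (1 - |j|/n) * gamma_j
    w = 1.0 - np.arange(1, max_lag + 1) / n
    return (gamma[0] + 2.0 * (w @ gamma[1:])) / n
```

For independent samples this reduces to the familiar $\gamma_0 / n$; positive autocorrelation inflates the variance of the ergodic average accordingly.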

We construct $P$ as follows. Let $Q = (q_{ij})$ be a proposal transition matrix and set

$$p_{ij} = q_{ij}\,\alpha_{ij} \quad (j \neq i), \qquad p_{ii} = 1 - \sum_{j \neq i} p_{ij}$$

i.e. $p_{ij}$ is the probability of moving from $i$ to $j$ and $p_{ii}$ the probability of staying at $i$, where $\alpha_{ij}$ is the probability of accepting the proposed state $j$ from $i$, defined as

$$\alpha_{ij} = \frac{s_{ij}}{1 + \frac{\pi_i q_{ij}}{\pi_j q_{ji}}}$$

and $s_{ij}$ is a symmetric function of $i$ and $j$ chosen so that $\alpha_{ij} \in [0, 1]$ for all $i, j$. In particular the Metropolis method uses

$$s_{ij} = \begin{cases} 1 + \frac{\pi_i q_{ij}}{\pi_j q_{ji}} & \text{if } \frac{\pi_j q_{ji}}{\pi_i q_{ij}} \geq 1 \\[4pt] 1 + \frac{\pi_j q_{ji}}{\pi_i q_{ij}} & \text{if } \frac{\pi_j q_{ji}}{\pi_i q_{ij}} < 1 \end{cases}$$

which gives $\alpha_{ij} = \min\!\left(1, \frac{\pi_j q_{ji}}{\pi_i q_{ij}}\right)$, and when the proposal is symmetric this reduces to

$$\alpha_{ij} = \begin{cases} 1 & \text{if } \frac{\pi_j}{\pi_i} \geq 1 \\[4pt] \frac{\pi_j}{\pi_i} & \text{if } \frac{\pi_j}{\pi_i} < 1 \end{cases}$$

Hastings also gives a proof for the fixed-scan Gibbs sampler. If the dimension is $d$, then we consider the process observed at times $0, d, 2d, \ldots$. This is a Markov process with transition matrix $P = P_1 \cdots P_d$. As long as $\pi P_k = \pi$ for every $k$, then $\pi$ is a stationary distribution of $P$, because

$$\pi P = \pi P_1 \cdots P_d = \pi P_2 \cdots P_d = \cdots = \pi$$

To make sure that the stationary distribution is unique we must check the irreducibility of $P$. Hastings advises:

  • Choose a $Q$ that proposes far-away points yet keeps the rejection rate low.
  • Choose the initial state in a region of high probability, if possible.
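
To make the construction concrete, here is a sketch on a toy 4-state space (target and proposal numbers are illustrative). It uses the Metropolis choice of $s_{ij}$, so the acceptance probability is $\alpha_{ij} = \min(1, \pi_j q_{ji} / (\pi_i q_{ij}))$, and checks $\pi = \pi P$ numerically.

```python
import numpy as np

def hastings_kernel(pi, Q):
    """Build P from proposal Q and target pi with the Metropolis choice
    of s_ij, i.e. alpha_ij = min(1, pi_j q_ji / (pi_i q_ij))."""
    S = len(pi)
    P = np.zeros((S, S))
    for i in range(S):
        for j in range(S):
            if i != j and Q[i, j] > 0:
                alpha = min(1.0, pi[j] * Q[j, i] / (pi[i] * Q[i, j]))
                P[i, j] = Q[i, j] * alpha       # p_ij = q_ij * alpha_ij
        P[i, i] = 1.0 - P[i].sum()              # probability of staying at i
    return P

pi = np.array([0.1, 0.2, 0.3, 0.4])
Q = np.full((4, 4), 0.25)                       # symmetric uniform proposal
P = hastings_kernel(pi, Q)
assert np.allclose(pi @ P, pi)                  # pi is stationary: pi P = pi
```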

1994 - Tierney

Markov Transition Kernel

A Markov transition kernel on $(E, \mathcal{E})$ is a map $P : E \times \mathcal{E} \to [0, 1]$ such that

  • $P(\cdot, A)$ is a measurable function, for any fixed $A \in \mathcal{E}$;
  • $P(x, \cdot)$ is a probability measure on $(E, \mathcal{E})$, for any fixed $x \in E$.

Time-homogeneous MC Kernel

We call a sequence of $E$-valued random variables $\{X_n : n \geq 0\}$ a time-homogeneous Markov chain if the kernel satisfies $P(X_n, A) = P(X_{n+1} \in A \mid X_0, \ldots, X_n)$. It is a Markov chain because the probability that $X_{n+1} \in A$ depends only on $X_n$ and not on the whole history. It is time-homogeneous because the same kernel $P$ is used at every time (we do not use a different kernel $P_n$ at each step $n$). We denote by $P_x$ probabilities for the Markov chain with kernel $P$ started at $x$.

Kernel Operations

Let $\nu$ be a probability measure, $P$ a kernel on $(E, \mathcal{E})$, and $h$ a real-valued $\mathcal{E}$-measurable function.

  • A kernel operates on the left on probability measures: $(\nu P)(A) = \int P(x, A)\,\nu(dx)$
  • A kernel operates on the right on measurable functions: $(Ph)(x) = \int h(y)\,P(x, dy)$

Notice that both of the operations above are just expectations of the form $\nu h = \nu[h] = \mathbb{E}_\nu[h] = \int h(y)\,\nu(dy)$. In particular, $(\nu P)(A)$ is the expectation with respect to $\nu$ of the measurable function $P(\cdot, A)$, i.e. $\nu P = \mathbb{E}_\nu[P(\cdot, A)]$, and $(Ph)(x)$ is the expectation of $h$ with respect to the measure $P(x, \cdot)$, i.e. $Ph = \mathbb{E}_{P(x, \cdot)}[h]$.
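
On a finite state space these two operations are just matrix products, with measures as row vectors and functions as column vectors. A small sketch with toy numbers:

```python
import numpy as np

# On a finite state space a kernel is a row-stochastic matrix P.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
nu = np.array([0.5, 0.5])   # probability measure (row vector)
h = np.array([1.0, -1.0])   # measurable function (column vector)

nuP = nu @ P       # (nu P)(A): the distribution after one step from nu
Ph = P @ h         # (P h)(x): E[h(X_1) | X_0 = x], one entry per state
nuPh = nu @ P @ h  # the scalar (nu P)h = nu(P h): both groupings agree
```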

Invariant Measure

We say that a kernel $P$ leaves the measure $\pi$ invariant if for every measurable $A$

$$\pi(A) = \int \pi(dx)\,P(x, A)$$

Essentially this means that if we operate the kernel $P$ on the left against $\pi$, the resulting measure is still $\pi$; this is written as $\pi = \pi P$.

Irreducibility

Let $\pi$ be a $\sigma$-finite measure. A kernel $P$ on $(E, \mathcal{E})$ is $\pi$-irreducible if $\pi(E) > 0$ and, for each point $x \in E$ and each set $A \in \mathcal{E}$ with $\pi(A) > 0$, there is an integer $n = n(x, A) \geq 1$ such that $P^n(x, A) > 0$.

This means that, as long as the measure $\pi$ gives positive mass to the whole of $E$, I can always find a number of steps that will lead me from any $x \in E$ to any $A \in \mathcal{E}$ that has positive mass under $\pi$. Basically, I can visit the whole space if you give me long enough!

Periodicity

A $\pi$-irreducible kernel $P$ is periodic if there exist an integer $d \geq 2$ and a sequence $\{E_0, E_1, \ldots, E_{d-1}\}$ of $d$ non-empty disjoint sets in $\mathcal{E}$ such that for all $i = 0, \ldots, d-1$ and all $x \in E_i$

$$P(x, E_j) = 1 \quad \text{for } j = i + 1 \bmod d$$

This means that if I am at $x \in E_i$ then in one step I will move with probability 1 to $E_j$ (where $j = i + 1 \bmod d$). As a result, we visit each of the $E_i$ periodically. If the kernel is not periodic, it is aperiodic.
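
For a concrete example, take $E = \{0, 1\}$ and the deterministic flip kernel $P(0, \{1\}) = P(1, \{0\}) = 1$. With $d = 2$, $E_0 = \{0\}$ and $E_1 = \{1\}$ the definition is satisfied: the chain alternates between the two states forever, so $P^n(x, \cdot)$ never converges even though the uniform distribution is invariant.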

Recurrence

Suppose $X_n$ is a Markov chain generated by a kernel $P$ which is $\pi$-irreducible (can explore the whole space) and has $\pi$ as its invariant distribution. We say that $X_n$ is recurrent if for every $B$ with positive mass $\pi(B) > 0$ we have

  • $P_x\{X_n \in B \text{ i.o.}\} > 0$ for all $x$
  • $P_x\{X_n \in B \text{ i.o.}\} = 1$ for $\pi$-almost all $x$

Here i.o. means infinitely often. Basically, the first condition says that, no matter your starting point, you have positive probability of visiting each set $B$ of positive mass infinitely often. The second condition says that, outside a $\pi$-null set of starting points $x$, you will in fact visit each such $B$ infinitely often with probability 1.

If $P_x\{X_n \in B \text{ i.o.}\} = 1$ for every $x$, then we say the chain is Harris recurrent. Basically this means that we visit each $B$ infinitely often with probability 1, even if we start from an $x$ in a set of measure zero.

Suppose $P$ is recurrent, so it has an invariant measure that is unique up to a constant multiple. We say $P$ is positive recurrent if the total mass of this measure is finite (so it can be normalized to a probability distribution).

Unique Invariant

$$\left.\begin{array}{l} P \text{ is } \pi\text{-irreducible} \\ P \text{ leaves } \pi \text{ invariant} \end{array}\right\} \implies P \text{ is recurrent and } \pi \text{ is the unique invariant distribution}$$

Total Variation Norm

If $\lambda$ is a bounded signed measure on $(E, \mathcal{E})$ then we define the total variation norm as

$$\|\lambda\|_{TV} = \sup_{A \in \mathcal{E}} \lambda(A) - \inf_{A \in \mathcal{E}} \lambda(A)$$

Equilibrium Distribution

Suppose that $P$ is $\pi$-irreducible and $\pi$ is its invariant distribution, i.e. $\pi = \pi P$. Then $P$ is positive recurrent and $\pi$ is its unique invariant measure. If $P$ is also aperiodic, then for $\pi$-almost all $x$

$$\|P^n(x, \cdot) - \pi\|_{TV} \to 0$$

If $P$ is Harris recurrent then the convergence occurs for all $x$. Basically this means that if $P$ can explore the whole space, has $\pi$ as its invariant distribution and is aperiodic, then $\pi$ is also its equilibrium distribution.

NB: Above we used $P^n(x, A) = P(X_n \in A \mid X_0 = x)$.
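
A quick finite-state illustration of the equilibrium statement (toy numbers of my own): for an irreducible aperiodic transition matrix, the rows of $P^n$ all converge to the invariant distribution $\pi$.

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])   # irreducible and aperiodic

Pn = np.linalg.matrix_power(P, 50)
pi = Pn[0]                      # every row approaches pi = [0.25, 0.5, 0.25]
assert np.allclose(Pn, pi)      # rows of P^n have all converged to pi
assert np.allclose(pi @ P, pi)  # and pi is invariant: pi P = pi
```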

Importantly, the assumptions above are both necessary and sufficient. This means that if $\|P^n(x, \cdot) - \pi\|_{TV} \to 0$ for all $x$, then the chain is $\pi$-irreducible, aperiodic, positive Harris recurrent, and has invariant distribution $\pi$.

Harris Recurrent Checks

Let $h$ be a non-negative real-valued function. We say $h$ is harmonic for $P$ if $h = Ph$. Now let $P$ be recurrent. Then it is also Harris recurrent if and only if every bounded harmonic function is constant.

Suppose $P$ is $\pi$-irreducible and $\pi P = \pi$. If the measure $P(x, \cdot)$ is absolutely continuous with respect to $\pi$ for all $x$, i.e. $P(x, \cdot) \ll \pi$, then $P$ is Harris recurrent. This condition is often used for Gibbs samplers.

Irreducible Metropolis is Harris

Let P be a π-irreducible Metropolis kernel. Then P is Harris recurrent.

Ergodicity

A Markov Chain is called ergodic if it is positive Harris recurrent and aperiodic.

There are three additional stronger forms of ergodicity:

  • Ergodicity of degree 2: let $S_B$ denote the hitting time of a set $B$. An ergodic chain with invariant distribution $\pi$ is ergodic of degree 2 if $\int_B \mathbb{E}_x[S_B^2]\,\pi(dx) < \infty$. For these types of chains we have $n\,\|P^n(x, \cdot) - \pi\|_{TV} \to 0$.
  • Geometric ergodicity: an ergodic Markov chain with invariant distribution $\pi$ is geometrically ergodic if there exist a non-negative extended real-valued function $M$ with $\pi(|M|) < \infty$ and a positive constant $r < 1$ such that $\|P^n(x, \cdot) - \pi\|_{TV} \leq M(x)\,r^n$ for all $x$. Basically, the TV norm is upper-bounded by a geometrically decaying function.
  • Uniform ergodicity: in addition, the chain is uniformly ergodic if there exist a positive constant $M$ and a positive constant $r < 1$ such that $\sup_{x \in E} \|P^n(x, \cdot) - \pi\|_{TV} \leq M\,r^n$. Basically, the sup of the TV norm is upper-bounded by a geometrically decaying constant.

The three forms of ergodicity are related as follows:

$$\text{Uniform ergodicity} \implies \text{Geometric ergodicity} \implies \text{Degree 2 ergodicity}$$

Minorization and Small Sets

Suppose $P$ is a $\pi$-irreducible kernel. Let $m \geq 1$ be an integer, $\beta > 0$ a constant, $C \in \mathcal{E}$ a set, and $\nu$ a probability measure on $\mathcal{E}$. We say that the kernel $P$ satisfies a minorization condition $M(m, \beta, C, \nu)$ if

$$\pi(C) > 0 \qquad \text{and} \qquad \beta\,\nu(\cdot) \leq P^m(x, \cdot) \quad \forall x \in C$$

We say $C$ is a small set for $P$.

Basically it means this: suppose we have a set $C$ with positive $\pi$-mass. If we start from any point $x \in C$, the $m$-step transition kernel $P^m(x, \cdot)$ is a measure that dominates $\beta\,\nu(\cdot)$.
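
To see why this is useful (a standard regeneration argument, filling in a step the note leaves implicit): under $M(1, \beta, E, \nu)$ we can decompose $P(x, \cdot) = \beta\,\nu(\cdot) + (1 - \beta)\,R(x, \cdot)$, where $R(x, \cdot) = (P(x, \cdot) - \beta\,\nu(\cdot))/(1 - \beta)$ is again a probability measure. At every step, with probability $\beta$ the chain regenerates from $\nu$ regardless of its current position, so the starting point is forgotten at a geometric rate; this is what drives the bound $r^m \leq (1 - \beta)$ in the next section.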

Convergence Results

  • $P$ is uniformly ergodic $\iff$ the whole state space $E$ is small (i.e. a minorization condition $M(m, \beta, E, \nu)$ is satisfied).
  • If $P$ satisfies a minorization condition $M(m, \beta, E, \nu)$ then the convergence rate $r$ satisfies $r^m \leq (1 - \beta)$.
  • Suppose $\pi$ is a target density and $q$ a proposal density, both of them bounded and bounded away from 0 on $E^+$. Suppose Lebesgue measure $\mu$ assigns finite mass to $E^+$, i.e. $\mu(E^+) < \infty$. Then the Metropolis kernel targeting $\pi$ with proposal $q$ satisfies a minorization condition $M(1, \beta, E, \nu)$ where $\nu \propto \mu|_{E^+}$. The Metropolis kernel is therefore uniformly ergodic, because $E$ is a small set.
  • An independence Metropolis kernel with proposal density $f$ and bounded weight function $w = \pi/f$ satisfies a minorization condition $M(1, \beta, E, \pi)$ with $\beta = (\sup w)^{-1}$, and is thus uniformly ergodic. The convergence rate $r$ satisfies $r \leq (1 - \beta) = 1 - (\sup w)^{-1}$. (A code sketch follows this list.)
  • Uniform ergodicity for mixtures: suppose $P_1$ and $P_2$ are $\pi$-invariant and $P_1$ is uniformly ergodic. Then the mixture kernel $\alpha P_1 + (1 - \alpha) P_2$, $\alpha \in (0, 1]$, is uniformly ergodic.
  • Uniform ergodicity for cycles: suppose $P_1$ and $P_2$ are $\pi$-invariant and that $P_1$ satisfies a minorization condition $M(1, \beta, E, \nu)$ for some $\beta$ and $\nu$. Then the cycles $P_1 P_2$ and $P_2 P_1$ are uniformly ergodic.
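
A minimal sketch of an independence Metropolis kernel, as referenced in the fourth bullet above (target and proposal are my own illustrative choices): the proposal ignores the current state, and the acceptance probability is $\min(1, w(y)/w(x))$ with $w = \pi/f$. A heavier-tailed proposal keeps $w$ bounded, so the kernel is uniformly ergodic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def independence_metropolis(log_pi, proposal, n_steps, x0):
    """Independence Metropolis: accept y ~ proposal with probability
    min(1, w(y)/w(x)), where w = pi/f is the weight function."""
    log_w = lambda z: log_pi(z) - proposal.logpdf(z)
    x, out = x0, np.empty(n_steps)
    for t in range(n_steps):
        y = proposal.rvs(random_state=rng)
        if np.log(rng.uniform()) < log_w(y) - log_w(x):
            x = y
        out[t] = x
    return out

# Target: standard normal; proposal: heavier-tailed t_3, so w = pi/f is
# bounded and beta = 1/sup(w) gives a minorization with C = E.
chain = independence_metropolis(stats.norm.logpdf, stats.t(df=3), 5_000, 0.0)
```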

Making Mixtures/Cycles Ergodic

Since the independence kernel with bounded weight function satisfies a minorization condition, we can add it to any cycle or mixture of kernels to make the overall transition kernel uniformly ergodic. This basically means that we insert a "restart step". We need to make sure that $f$ has sufficiently thick tails, so that $w = \pi/f$ stays bounded. A sketch follows.
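
A sketch of such a hybrid kernel (all names and defaults are illustrative): with probability $\alpha$ it attempts an independence "restart" move, otherwise a symmetric random-walk move. Each component leaves $\pi$ invariant, so the mixture does too, and the restart component makes the mixture uniformly ergodic when $w = \pi/f$ is bounded.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def mixture_step(x, log_pi, restart, alpha=0.1, step=1.0):
    """One step of alpha*P1 + (1-alpha)*P2: with probability alpha make an
    independence ("restart") proposal from `restart`, otherwise a symmetric
    random-walk proposal; accept with the usual Metropolis-Hastings ratio."""
    if rng.uniform() < alpha:
        y = restart.rvs(random_state=rng)
        log_a = (log_pi(y) - restart.logpdf(y)) - (log_pi(x) - restart.logpdf(x))
    else:
        y = x + step * rng.normal()
        log_a = log_pi(y) - log_pi(x)
    return y if np.log(rng.uniform()) < log_a else x

# e.g. a standard normal target with t_3 restarts:
x = 0.0
for _ in range(1000):
    x = mixture_step(x, stats.norm.logpdf, stats.t(df=3))
```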

Convergence of Averages

Let $X_n$ be an ergodic Markov chain with equilibrium distribution $\pi$. Suppose the function $f$ is real-valued and $\pi(|f|) < \infty$. Then for any initial distribution we have

$$\bar{f}_n := \frac{1}{n}\sum_{i=1}^{n} f(X_i) \to \pi(f) \quad \text{almost surely}$$

Central Limit Theorem (version 1)

Suppose $X_n$ is ergodic of degree 2 with equilibrium distribution $\pi$ and suppose $f$ is real-valued and bounded. Then there exists a real number $\sigma(f)$ such that the distribution of $\sqrt{n}\,(\bar{f}_n - \pi(f))$ converges weakly to a normal distribution with mean 0 and variance $\sigma(f)^2$, for any initial distribution.

Central Limit Theorem (version 2)

Suppose $X_n$ is uniformly ergodic with equilibrium distribution $\pi$ and suppose $f$ is real-valued with $\pi(f^2) < \infty$. Then there exists a real number $\sigma(f)$ such that the distribution of $\sqrt{n}\,(\bar{f}_n - \pi(f))$ converges weakly to a normal distribution with mean 0 and variance $\sigma(f)^2$, for any initial distribution.
