1953 - Metropolis
Aim is to approximate the following integral (canonical ensemble)
$$\langle F \rangle = \frac{\int F(x)\, e^{-E(x)/kT}\, dx}{\int e^{-E(x)/kT}\, dx}$$
This is not feasible using standard numerical methods (using a grid of points). One could choose integration points uniformly at random and then give these points a weight $e^{-E(x)/kT}$; however, this would not be practical, as most of the mass is located in a small set. Instead we choose points with probability $\propto e^{-E(x)/kT}$ and then weight them evenly. We move the point to $x' = x + \delta$, where $\delta$ is a uniform random displacement in a small box around the origin.
We then calculate the change in energy $\Delta E$ of the system and if $\Delta E < 0$ (we have moved to a region of lower energy, i.e. higher probability) we accept the move. Otherwise, if $\Delta E > 0$, we allow the move with probability $e^{-\Delta E / kT}$. We then approximate the expectation as
$$\langle F \rangle \approx \frac{1}{N} \sum_{i=1}^{N} F(x_i)$$
The proof of correctness essentially proves that we choose points with probability $\propto e^{-E(x)/kT}$.
- Proof that the method is ergodic, i.e. we can reach any point on the domain.
- Detailed balance proof
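The scheme above can be sketched in a few lines. This is my own minimal illustration, not the paper's multi-particle implementation: it assumes a one-dimensional state with energy $E(x) = x^2/2$ and $kT = 1$, so the target is a standard normal and $\langle x^2 \rangle$ should come out close to $1$.

```python
import math
import random

def metropolis(E, x0, n_steps, step=1.0, kT=1.0):
    """Random-walk Metropolis targeting the Boltzmann density
    proportional to exp(-E(x)/kT)."""
    x = x0
    samples = []
    for _ in range(n_steps):
        # move the point by a uniform random displacement
        y = x + random.uniform(-step, step)
        dE = E(y) - E(x)
        # accept downhill moves; accept uphill moves with prob exp(-dE/kT)
        if dE < 0 or random.random() < math.exp(-dE / kT):
            x = y
        samples.append(x)
    return samples

# E(x) = x^2/2 with kT = 1 makes the target a standard normal,
# so the average of F(x) = x^2 should be close to 1
random.seed(0)
xs = metropolis(lambda x: 0.5 * x * x, 0.0, 50_000)
est = sum(x * x for x in xs) / len(xs)
```

Note that rejected proposals still contribute the current point to the sample, exactly as in the "weight them evenly" description above.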
1970 - Hastings
Features of MCMC:
- Computations depend on the target $\pi$ only via the ratio $\pi(y)/\pi(x)$, therefore normalizing constants need not be known and no factorization of $\pi$ is necessary.
- Samples are obtained via a Markov Chain, hence are correlated. Therefore estimating the standard deviation of our estimates of expectations requires more care.
We have a probability distribution $\pi$ and a function $f$ defined on the state space. We wish to estimate
$$I = \mathbb{E}_\pi[f]$$
We construct a transition matrix $P$ so that $\pi$ is its unique stationary distribution, $\pi P = \pi$,
and we can approximate $I$ with
$$\hat{I} = \frac{1}{N} \sum_{t=1}^{N} f(X_t)$$
If the Markov Chain defined by $P$ is finite and irreducible we know $\hat{I}$ is asymptotically normally distributed and $\hat{I} \to I$ in mean square as $N \to \infty$. Notice that since $(X_t)$ is a Markov Chain, it only depends on the previous state, so it is asymptotically stationary (meaning that the distribution of $X_t$ does not change when shifted in time). Hence we can estimate the variance of this estimator using techniques for a stationary process:
$$\operatorname{Var}(\hat{I}) \approx \frac{\sigma_f^2}{N} \left( 1 + 2 \sum_{k \geq 1} \rho_k \right)$$
where $\rho_k$ is the lag-$k$ autocorrelation of $f(X_t)$.
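One standard technique for a stationary correlated sequence is the method of batch means (an illustration of the idea, not necessarily the estimator used in the paper): split the chain into batches long enough to be roughly independent, and use the spread of the batch averages. The AR(1) chain below is a made-up stand-in for correlated MCMC output.

```python
import math
import random

def batch_means_se(samples, n_batches=20):
    """Standard error of the mean of a correlated stationary sequence,
    estimated by the method of batch means."""
    m = len(samples) // n_batches
    batch_avgs = [sum(samples[b * m:(b + 1) * m]) / m for b in range(n_batches)]
    grand = sum(batch_avgs) / n_batches
    var_b = sum((a - grand) ** 2 for a in batch_avgs) / (n_batches - 1)
    # batch averages are roughly independent, so SE = sd(batches)/sqrt(B)
    return math.sqrt(var_b / n_batches)

# AR(1) chain with autocorrelation 0.9: correlated like typical MCMC output
rng = random.Random(1)
x = [0.0]
for _ in range(20_000):
    x.append(0.9 * x[-1] + rng.gauss(0, 1))

se_batch = batch_means_se(x)
mean = sum(x) / len(x)
# naive iid formula, which ignores the correlation
naive_se = math.sqrt(sum((v - mean) ** 2 for v in x) / (len(x) - 1) / len(x))
```

For this chain the naive iid formula underestimates the true standard error by roughly a factor of $\sqrt{(1+\rho)/(1-\rho)} \approx 4$, which the batch-means estimate recovers.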
We construct $P = (p_{ij})$ as follows. Let $Q = (q_{ij})$ be a transition matrix and
$$p_{ij} = q_{ij} \alpha_{ij} \quad (i \neq j), \qquad p_{ii} = 1 - \sum_{j \neq i} p_{ij}$$
where $\alpha_{ij}$ is the probability of accepting the proposed state $j$ from $i$ and is defined as
$$\alpha_{ij} = \frac{s_{ij}}{1 + \dfrac{\pi_i q_{ij}}{\pi_j q_{ji}}}$$
and $s_{ij}$ is a symmetric function of $i$ and $j$ chosen so that $0 \leq \alpha_{ij} \leq 1$ for all $i, j$. In particular the Metropolis Method uses
$$s_{ij} = \begin{cases} 1 + \dfrac{\pi_i q_{ij}}{\pi_j q_{ji}} & \text{if } \dfrac{\pi_j q_{ji}}{\pi_i q_{ij}} \geq 1 \\[2ex] 1 + \dfrac{\pi_j q_{ji}}{\pi_i q_{ij}} & \text{otherwise} \end{cases}$$
which gives $\alpha_{ij} = \min\left(1, \dfrac{\pi_j q_{ji}}{\pi_i q_{ij}}\right)$, and when the proposal is symmetric ($q_{ij} = q_{ji}$) we get
$$\alpha_{ij} = \min\left(1, \frac{\pi_j}{\pi_i}\right)$$
Hastings also produces a proof for the fixed-scan Gibbs sampler. If the dimension is $d$ then we consider the process observed at times $d, 2d, 3d, \ldots$ This is a Markov Process with transition $P = P_1 P_2 \cdots P_d$, where $P_k$ updates the $k$-th coordinate from its full conditional. As long as $\pi P_k = \pi$ for every $k$ then $\pi$ will be a stationary distribution of $P$ because
$$\pi P = \pi P_1 P_2 \cdots P_d = \pi P_2 \cdots P_d = \cdots = \pi P_d = \pi$$
To make sure that the stationary distribution is unique, we must check the irreducibility of $P$. Hastings advises:
- Choose a proposal $Q$ that proposes far-away points but has a low rejection rate.
- Choose initial state in region of high probability, if possible.
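As an illustration of the fixed-scan Gibbs sampler (a toy example of mine, not from the paper), consider a bivariate normal with unit variances and correlation $\rho$, whose full conditionals are available in closed form: $x_1 \mid x_2 \sim N(\rho x_2,\, 1 - \rho^2)$ and symmetrically for $x_2$.

```python
import random

def gibbs_bivariate_normal(rho, n_sweeps, seed=0):
    """Fixed-scan Gibbs sampler for a bivariate normal with unit variances
    and correlation rho: each sweep updates x1 | x2 and then x2 | x1."""
    rng = random.Random(seed)
    x1 = x2 = 0.0
    out = []
    s = (1 - rho * rho) ** 0.5  # conditional standard deviation
    for _ in range(n_sweeps):
        x1 = rho * x2 + s * rng.gauss(0, 1)  # x1 | x2 ~ N(rho*x2, 1-rho^2)
        x2 = rho * x1 + s * rng.gauss(0, 1)  # x2 | x1 ~ N(rho*x1, 1-rho^2)
        out.append((x1, x2))
    return out

draws = gibbs_bivariate_normal(0.5, 40_000)
corr_est = sum(a * b for a, b in draws) / len(draws)  # estimates E[x1*x2] = rho
```

Each coordinate update leaves $\pi$ invariant, so the composed sweep $P = P_1 P_2$ does too, exactly as in the argument above.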
1994 - Tierney
Markov Transition Kernel
A Markov Transition Kernel on $(E, \mathcal{E})$ is a map $K : E \times \mathcal{E} \to [0, 1]$ such that
- $x \mapsto K(x, A)$ is a measurable function, for any fixed $A \in \mathcal{E}$
- $A \mapsto K(x, A)$ is a probability measure on $(E, \mathcal{E})$ for any fixed $x \in E$.
Time-homogeneous MC Kernel
We call a sequence $(X_n)_{n \geq 0}$ of $E$-valued random variables a time-homogeneous Markov Chain if the kernel has the following form
$$\Pr(X_{n+1} \in A \mid X_0, \ldots, X_n) = K(X_n, A)$$
It is a Markov Chain because the probability that $X_{n+1} \in A$ depends only on $X_n$ and not on the whole history. It is time-homogeneous because the same kernel $K$ is used independently of time (we don’t use a different kernel $K_n$ at every step $n$).
We denote by $\Pr_x$ probabilities for the Markov Chain with kernel $K$, started at $X_0 = x$.
Kernel Operations
Let $\mu$ be a probability measure on $(E, \mathcal{E})$, $K$ a kernel on $(E, \mathcal{E})$ and $f$ a real-valued measurable function.
- Kernel operates on the left with probability measures
$$(\mu K)(A) = \int_E K(x, A)\, \mu(dx)$$
- Kernel operates on the right with measurable functions
$$(K f)(x) = \int_E f(y)\, K(x, dy)$$
Notice that both of the expressions above are just expectations. In particular, $(\mu K)(A)$ is the expectation with respect to $\mu$ of the measurable function $x \mapsto K(x, A)$, i.e. $(\mu K)(A) = \mathbb{E}_\mu[K(\cdot, A)]$. The second expression is the expectation of $f$ with respect to the measure $K(x, \cdot)$, i.e. $(K f)(x) = \mathbb{E}_{K(x, \cdot)}[f]$.
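On a finite state space these two operations are just vector-matrix products; a small sketch (the two-state matrix and the vectors are made up for illustration):

```python
# On a finite state space a kernel is a row-stochastic matrix K[i][j],
# a measure mu is a row vector, and a function f is a column vector.

def mu_K(mu, K):
    """(mu K)(j) = sum_i mu(i) K(i, j): the distribution after one step."""
    n = len(K)
    return [sum(mu[i] * K[i][j] for i in range(n)) for j in range(n)]

def K_f(K, f):
    """(K f)(i) = sum_j K(i, j) f(j): expected value of f one step ahead."""
    return [sum(Kij * fj for Kij, fj in zip(row, f)) for row in K]

K = [[0.9, 0.1],
     [0.2, 0.8]]
mu = [1.0, 0.0]   # start surely in state 0
f = [0.0, 1.0]    # indicator of state 1

print(mu_K(mu, K))  # [0.9, 0.1]: where the chain is after one step
print(K_f(K, f))    # [0.1, 0.8]: probability of hitting state 1 next step
```

Here $(K f)(i)$ with $f$ an indicator of $A$ recovers $K(i, A)$ itself, tying the two operations together.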
Invariant Measure
We say that kernel $K$ leaves the measure $\pi$ invariant if for every measurable $A$
$$\pi(A) = \int_E K(x, A)\, \pi(dx)$$
Essentially this means that if we operate the kernel on the left against $\pi$ the resulting measure is still $\pi$; this is written as $\pi K = \pi$.
Irreducibility
Let $\varphi$ be a $\sigma$-finite measure. A kernel $K$ on $(E, \mathcal{E})$ is $\varphi$-irreducible if $\varphi(E) > 0$ and for each point $x \in E$ and set $A \in \mathcal{E}$ with $\varphi(A) > 0$ I can find an integer $n = n(x, A) \geq 1$ such that $K^n(x, A) > 0$.
This means that as long as the measure $\varphi$ gives positive mass to the whole of $E$, I can always find a number of steps that will lead me from any $x$ to any $A$, where $A$ must have positive mass under $\varphi$. Basically I can visit the whole space if you give me long enough time!
Periodicity
A $\varphi$-irreducible kernel $K$ is periodic if there exists an integer $d \geq 2$ and a sequence of non-empty disjoint sets $E_0, \ldots, E_{d-1}$ in $\mathcal{E}$ such that for all $i$ and all $x \in E_i$
$$K(x, E_j) = 1 \qquad \text{for } j = (i + 1) \bmod d$$
This means that if I am at $x \in E_i$ then I will move with probability $1$ to $E_j$ in one step (where $j = (i+1) \bmod d$). The result of this is that we are going to visit each of the $E_i$ periodically. If the kernel is not periodic, it is aperiodic.
Recurrence
Suppose $(X_n)$ is a Markov Chain generated by kernel $K$ which is $\pi$-irreducible (can explore the whole space) and has $\pi$ as its invariant distribution. We say that $(X_n)$ is recurrent if for every $A$ with positive mass $\pi(A) > 0$ we have
$$\Pr_x(X_n \in A \text{ i.o.}) > 0 \quad \text{for all } x, \qquad \Pr_x(X_n \in A \text{ i.o.}) = 1 \quad \text{for } \pi\text{-almost all } x$$
here i.o. means infinitely often. Basically the first condition says that, no matter your starting point, you have a positive probability of visiting each set with positive mass infinitely often. The second condition tells us that if you start from an $x$ in a set of positive measure then actually you will visit each such $A$ infinitely often with probability $1$.
If $\Pr_x(X_n \in A \text{ i.o.}) = 1$ for every $x$ then we say the chain is Harris recurrent. Basically this means that we have probability $1$ of visiting each $A$ infinitely often, even if we start from an $x$ in a set of $\pi$-measure zero.
Suppose $K$ has a unique invariant measure. We say $K$ is positive recurrent if the total mass of this measure is finite.
Unique Invariant
Total Variation Norm
If $\lambda$ is a bounded signed measure on $(E, \mathcal{E})$ then we define the total variation norm as
$$\|\lambda\| = \sup_{A \in \mathcal{E}} \lambda(A) - \inf_{A \in \mathcal{E}} \lambda(A)$$
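On a finite space the sup is attained by the set collecting all the positive mass and the inf by the set collecting all the negative mass, so for $\lambda = \mu - \nu$ the norm reduces to the $\ell_1$ distance $\sum_i |\mu_i - \nu_i|$. A quick check on made-up distributions:

```python
def tv_norm(lam):
    """Total variation norm of a signed measure on a finite space:
    sup_A lam(A) - inf_A lam(A) = (sum of positives) - (sum of negatives)
    = sum of |lam_i|."""
    sup_A = sum(v for v in lam if v > 0)  # best set: all positive mass
    inf_A = sum(v for v in lam if v < 0)  # worst set: all negative mass
    return sup_A - inf_A

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
lam = [a - b for a, b in zip(mu, nu)]
print(tv_norm(lam))  # 0.6 = |0.3| + |0.0| + |-0.3|
```

For two probability measures the norm is at most $2$, attained when they have disjoint support.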
Equilibrium Distribution
Suppose that $K$ is $\pi$-irreducible and $\pi$ is its invariant distribution, i.e. $\pi K = \pi$. Then $K$ is positive recurrent and $\pi$ is the unique invariant measure of $K$. If $K$ is also aperiodic then for $\pi$-almost all $x$
$$\|K^n(x, \cdot) - \pi\| \to 0$$
If $K$ is Harris recurrent then the convergence occurs for all $x$. Basically this means that if $K$ can explore the whole space, has $\pi$ as its invariant distribution and is aperiodic, then $\pi$ will also be its equilibrium distribution.
NB: Above we used $K^n(x, A) = \Pr_x(X_n \in A)$, the $n$-step transition kernel.
Importantly, the assumptions above are both necessary and sufficient. This means that if
$$\|K^n(x, \cdot) - \pi\| \to 0 \qquad \text{for all } x$$
then the chain is $\pi$-irreducible, aperiodic, positive Harris recurrent and has invariant distribution $\pi$.
Harris Recurrent Checks
Let $h$ be a non-negative real-valued function. We say $h$ is harmonic for $K$ if $Kh = h$.
Now let $K$ be recurrent. Then it is also Harris recurrent if and only if every bounded harmonic function is constant.
Suppose $K$ is $\pi$-irreducible and $\pi K = \pi$. If the measure $K(x, \cdot)$ is absolutely continuous with respect to $\pi$ for all $x$, i.e. $K(x, \cdot) \ll \pi$, then $K$ is Harris recurrent. This condition is often used for Gibbs samplers.
Irreducible Metropolis is Harris
Let $K$ be a $\pi$-irreducible Metropolis kernel. Then $K$ is Harris recurrent.
Ergodicity
A Markov Chain is called ergodic if it is positive Harris recurrent and aperiodic.
There are three additional stronger forms of ergodicity:
- Ergodicity of degree 2: Let $\tau_A$ denote the hitting time for set $A$. An ergodic chain with invariant distribution $\pi$ is ergodic of degree 2 if
$$\int_A \mathbb{E}_x[\tau_A^2]\, \pi(dx) < \infty \qquad \text{for every } A \text{ with } \pi(A) > 0$$
For these types of chains we have
$$n\, \|K^n(x, \cdot) - \pi\| \to 0$$
- Geometric Ergodicity: An ergodic Markov Chain with invariant distribution $\pi$ is geometrically ergodic if there exists a non-negative extended real-valued function $M$ with $\mathbb{E}_\pi[M] < \infty$ and a positive constant $r < 1$ such that
$$\|K^n(x, \cdot) - \pi\| \leq M(x)\, r^n$$
Basically this means that the TV norm is upper-bounded by a dampened function.
- Uniform Ergodicity: In addition, the chain is uniformly ergodic if there exists a positive constant $M$ and a positive constant $r < 1$ such that
$$\sup_x \|K^n(x, \cdot) - \pi\| \leq M\, r^n$$
Basically this means the sup of the TV norm is upper-bounded by a dampened constant.
The three forms of ergodicity are related by the following chain of implications:
$$\text{uniform ergodicity} \implies \text{geometric ergodicity} \implies \text{ergodicity of degree } 2 \implies \text{ergodicity}$$
Minorization and Small Sets
Suppose $K$ is a $\varphi$-irreducible kernel. Let $m \geq 1$ be an integer, $\beta > 0$ be a constant, $C \in \mathcal{E}$ be a set and $\nu$ be a probability measure on $(E, \mathcal{E})$. We say that the kernel satisfies a minorization condition $M(m, \beta, C, \nu)$ if
$$K^m(x, A) \geq \beta\, \nu(A) \qquad \text{for all } x \in C \text{ and all } A \in \mathcal{E}$$
We say $C$ is a small set for $K$.
Basically it means this: suppose we have a set $C$ with positive mass. If we start from any point $x \in C$ the $m$-step transition kernel is a measure that dominates $\beta \nu$.
Convergence results
- $K$ is uniformly ergodic $\iff$ the state space $E$ is small (i.e. a minorization condition with $C = E$ is satisfied)
- If $K$ satisfies a minorization condition $M(m, \beta, E, \nu)$ then the convergence rate satisfies
$$\|K^n(x, \cdot) - \pi\| \leq (1 - \beta)^{\lfloor n/m \rfloor}$$
- Suppose $\pi$ is a target density and $q$ a proposal density, both of them bounded and bounded away from $0$ on $E$. Suppose the Lebesgue measure assigns finite measure to $E$, i.e. $\lambda(E) < \infty$. The Metropolis kernel targeting $\pi$ with proposal $q$ satisfies a minorization condition with $C = E$. The Metropolis kernel is therefore uniformly ergodic because $E$ is a small set.
- An Independence Metropolis kernel with proposal density $q$ and bounded weight function $w(x) = \pi(x)/q(x) \leq M$ satisfies a minorization condition with $C = E$ and $\beta = 1/M$, and is thus uniformly ergodic. The convergence rate satisfies
$$\|K^n(x, \cdot) - \pi\| \leq \left(1 - \frac{1}{M}\right)^n$$
- Uniform ergodicity for mixtures: Suppose $K_1$ and $K_2$ are $\pi$-invariant and $K_1$ is uniformly ergodic. The mixture kernel $\alpha K_1 + (1 - \alpha) K_2$ with $0 < \alpha < 1$ is uniformly ergodic.
- Uniform ergodicity for cycles: Suppose $K_1$ and $K_2$ are $\pi$-invariant and that $K_1$ satisfies a minorization condition $M(1, \beta, E, \nu)$ for some $\beta$ and $\nu$. Then $K_1 K_2$ and $K_2 K_1$ are uniformly ergodic.
Making mixtures/cycles ergodic
Since the independence kernel with bounded weight function satisfies a minorization condition, we can add it to any cycle or mixture of kernels to make the overall transition kernel uniformly ergodic. This basically means that we insert a “restart step”. We need to make sure that the proposal $q$ has sufficiently thick tails.
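A sketch of such a mixture (the concrete choices here are illustrative: standard normal target, a wide $N(0, 5^2)$ independence proposal as the restart step, and a random-walk Metropolis component otherwise):

```python
import math
import random

def mixture_mh(log_pi, n, eps=0.1, seed=0):
    """Mixture kernel: with probability eps take an independence Metropolis
    step from a wide Gaussian proposal q = N(0, 25) (the 'restart' step),
    otherwise a random-walk Metropolis step. Both components leave pi
    invariant; the independence component has bounded weights w = pi/q
    because q has thicker tails than pi."""
    rng = random.Random(seed)

    def log_q(y):  # log density of N(0, 25), up to an additive constant
        return -y * y / 50.0

    x = 0.0
    out = []
    for _ in range(n):
        if rng.random() < eps:
            # independence proposal: accept with min(1, w(y)/w(x))
            y = rng.gauss(0, 5)
            log_a = (log_pi(y) - log_q(y)) - (log_pi(x) - log_q(x))
        else:
            # symmetric random-walk proposal: accept with min(1, pi(y)/pi(x))
            y = x + rng.gauss(0, 1)
            log_a = log_pi(y) - log_pi(x)
        if math.log(rng.random()) < log_a:
            x = y
        out.append(x)
    return out

# standard normal target (unnormalized log density)
xs = mixture_mh(lambda x: -0.5 * x * x, 40_000)
mean_est = sum(xs) / len(xs)
```

Because the independence component alone satisfies a minorization condition on the whole space, the mixture inherits uniform ergodicity regardless of how the random-walk component behaves.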
Convergence of Averages
Let $(X_n)$ be an ergodic Markov Chain with equilibrium distribution $\pi$. Suppose the function $f$ is real-valued and $\mathbb{E}_\pi|f| < \infty$. Then for any initial distribution we have
$$\frac{1}{n} \sum_{i=1}^{n} f(X_i) \to \mathbb{E}_\pi[f] \qquad \text{almost surely}$$
Central Limit Theorem (version 1)
Suppose $(X_n)$ is ergodic of degree 2 with equilibrium distribution $\pi$. Suppose $f$ is real-valued and bounded. Then there exists a real number $\sigma(f)$ such that the distribution of
$$\sqrt{n} \left( \frac{1}{n} \sum_{i=1}^{n} f(X_i) - \mathbb{E}_\pi[f] \right)$$
converges weakly to a normal distribution with mean $0$ and variance $\sigma(f)^2$ for any initial distribution.
Central Limit Theorem (version 2)
Suppose $(X_n)$ is uniformly ergodic with equilibrium distribution $\pi$ and suppose $f$ is real-valued and $\mathbb{E}_\pi[f^2] < \infty$. Then there exists a real number $\sigma(f)$ such that the distribution of
$$\sqrt{n} \left( \frac{1}{n} \sum_{i=1}^{n} f(X_i) - \mathbb{E}_\pi[f] \right)$$
converges weakly to a normal distribution with mean $0$ and variance $\sigma(f)^2$ for any initial distribution.