MCMC Literature

1953 - Metropolis

The aim is to approximate the following integral (a canonical ensemble average)

$$\bar{F} = \frac{\int F \exp(-E/kT)\,dp\,dq}{\int \exp(-E/kT)\,dp\,dq}$$

This is not feasible using standard numerical methods (a grid of integration points). One could choose integration points uniformly at random and then give each point the weight $\exp(-E/kT)$; however, this is not practical, as most of the mass is concentrated in a small set. Instead we choose points with probability proportional to $\exp(-E/kT)$ and then weight them evenly. We move the point $(X, Y)$ to

$$X' = X + \alpha\xi_1, \qquad Y' = Y + \alpha\xi_2, \qquad \xi_1, \xi_2 \sim U(-1, 1)$$

We then calculate the resulting change $\Delta E$ in the energy of the system. If $\Delta E < 0$ (we have moved to a region of lower energy, i.e. higher probability) we accept the move; otherwise, if $\Delta E > 0$, we accept the move with probability $\exp(-\Delta E / kT)$. We then approximate the expectation as

$$\bar{F} = \frac{1}{M}\sum_{j=1}^{M} F_j$$

The proof of correctness essentially shows that this scheme visits points with probability proportional to $\exp(-E/kT)$. A code sketch follows the list below.

  • Proof that the method is ergodic, i.e. that any point of the domain can be reached.
  • Proof of detailed balance.
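
A minimal sketch of the 1953 scheme, assuming an arbitrary energy function; the harmonic-potential example at the end is my own illustration, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_1953(energy, x0, y0, alpha, kT, n_steps):
    """Random-walk Metropolis as in the 1953 paper (symmetric proposal)."""
    x, y = x0, y0
    samples = []
    for _ in range(n_steps):
        # Propose a uniform displacement in [-alpha, alpha] per coordinate.
        xp = x + alpha * rng.uniform(-1, 1)
        yp = y + alpha * rng.uniform(-1, 1)
        dE = energy(xp, yp) - energy(x, y)
        # Accept downhill moves; accept uphill moves with prob exp(-dE/kT).
        if dE < 0 or rng.uniform() < np.exp(-dE / kT):
            x, y = xp, yp
        samples.append((x, y))
    return np.array(samples)

# Example: harmonic energy E = (x^2 + y^2)/2, so the target is a 2-D Gaussian.
chain = metropolis_1953(lambda x, y: 0.5 * (x**2 + y**2), 0.0, 0.0,
                        alpha=1.0, kT=1.0, n_steps=10_000)
F_bar = chain[:, 0].mean()  # estimate of E[X]; should be near 0
```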

1970 - Hastings

Features of MCMC:

  • Computations depend on the target $p(x)$ only through ratios of the form $p(x')/p(x)$; therefore the normalizing constant need not be known, and no factorization of $p(x)$ is necessary.
  • Samples are obtained via a Markov chain and hence are correlated. Estimating the standard deviation of our estimates of expectations therefore requires more care.

We have a probability distribution $\pi = (\pi_0, \ldots, \pi_S)$ and a function $f$ defined on the state space. We wish to estimate

$$I = \mathbb{E}_\pi[f] = \sum_{i=0}^{S} f(i)\,\pi_i$$

We construct a transition matrix $P$ so that $\pi$ is its unique stationary distribution, $\pi = \pi P$, and we approximate $I$ with $\hat{I} = \frac{1}{N}\sum_{t=1}^{N} f(X^{(t)})$. If the Markov chain defined by $P$ is finite and irreducible, then $\hat{I}$ is asymptotically normally distributed and $\hat{I} \to I$ in mean square as $N \to \infty$. Notice that since $X^{(t)}$ is a Markov chain it depends only on the previous state, so it is asymptotically stationary (meaning that the distribution of $X^{(t)}$ does not change when shifted in time). Hence we can estimate the variance of this estimator using techniques for a stationary process:

$$\operatorname{var}(\bar{Y}) = \frac{\sigma^2}{N} \sum_{j=-N+1}^{N-1} \left(1 - \frac{|j|}{N}\right) \operatorname{corr}\!\left(Y^{(t)}, Y^{(t+j)}\right) \approx \frac{2\pi g(0)}{N} \quad \text{as } N \to \infty$$

where $g$ is the spectral density of the process.
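
The formula above can be turned into a practical estimator by replacing the correlations with their empirical counterparts and truncating the sum at a modest lag (empirical correlations at large lags are pure noise; the `max_lag` cutoff is a tuning choice, not part of the formula). A sketch:

```python
import numpy as np

def mcmc_var_of_mean(y, max_lag=100):
    """Estimate var(Y_bar) for a stationary sequence y using the
    lag-weighted autocovariance sum from the formula above."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    yc = y - y.mean()
    # Empirical autocovariances gamma_j for j = 0..max_lag.
    gamma = np.array([yc[:n - j] @ yc[j:] / n for j in range(max_lag + 1)])
    # var(Y_bar) ~ (1/n) * sum_{|j| < n} (1 - |j|/n) * gamma_j
    w = 1.0 - np.arange(1, max_lag + 1) / n
    return (gamma[0] + 2.0 * (w @ gamma[1:])) / n
```

For independent samples this reduces to the familiar $\gamma_0 / n$; positive autocorrelation inflates the variance of the ergodic average accordingly.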

We construct $P$ as follows. Let $Q = (q_{ij})$ be a proposal transition matrix and set

$$p_{ij} = q_{ij}\,\alpha_{ij} \quad (j \neq i), \qquad p_{ii} = 1 - \sum_{j \neq i} p_{ij}$$

i.e. $p_{ij}$ is the probability of moving from $i$ to $j$ and $p_{ii}$ the probability of staying at $i$, where $\alpha_{ij}$ is the probability of accepting the proposed state $j$ from $i$, defined as

$$\alpha_{ij} = \frac{s_{ij}}{1 + \frac{\pi_i q_{ij}}{\pi_j q_{ji}}}$$

and $s_{ij}$ is a symmetric function of $i$ and $j$ chosen so that $\alpha_{ij} \in [0, 1]$ for all $i, j$. In particular the Metropolis method uses

$$s_{ij} = \begin{cases} 1 + \frac{\pi_i q_{ij}}{\pi_j q_{ji}} & \text{if } \frac{\pi_j q_{ji}}{\pi_i q_{ij}} \geq 1 \\[4pt] 1 + \frac{\pi_j q_{ji}}{\pi_i q_{ij}} & \text{if } \frac{\pi_j q_{ji}}{\pi_i q_{ij}} < 1 \end{cases}$$

which gives $\alpha_{ij} = \min\!\left(1, \frac{\pi_j q_{ji}}{\pi_i q_{ij}}\right)$, and when the proposal is symmetric this reduces to

$$\alpha_{ij} = \begin{cases} 1 & \text{if } \frac{\pi_j}{\pi_i} \geq 1 \\[4pt] \frac{\pi_j}{\pi_i} & \text{if } \frac{\pi_j}{\pi_i} < 1 \end{cases}$$

Hastings also gives a proof for the fixed-scan Gibbs sampler. If the dimension is $d$, then we consider the process observed at times $0, d, 2d, \ldots$. This is a Markov process with transition matrix $P = P_1 \cdots P_d$. As long as $\pi P_k = \pi$ for every $k$, then $\pi$ is a stationary distribution of $P$, because

$$\pi P = \pi P_1 \cdots P_d = \pi P_2 \cdots P_d = \cdots = \pi$$

To make sure that the stationary distribution is unique we must check the irreducibility of $P$. Hastings advises:

  • Choose a $Q$ that proposes far-away points yet keeps the rejection rate low.
  • Choose the initial state in a region of high probability, if possible.
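
To make the construction concrete, here is a sketch on a toy 4-state space (target and proposal numbers are illustrative). It uses the Metropolis choice of $s_{ij}$, so the acceptance probability is $\alpha_{ij} = \min(1, \pi_j q_{ji} / (\pi_i q_{ij}))$, and checks $\pi = \pi P$ numerically.

```python
import numpy as np

def hastings_kernel(pi, Q):
    """Build P from proposal Q and target pi with the Metropolis choice
    of s_ij, i.e. alpha_ij = min(1, pi_j q_ji / (pi_i q_ij))."""
    S = len(pi)
    P = np.zeros((S, S))
    for i in range(S):
        for j in range(S):
            if i != j and Q[i, j] > 0:
                alpha = min(1.0, pi[j] * Q[j, i] / (pi[i] * Q[i, j]))
                P[i, j] = Q[i, j] * alpha       # p_ij = q_ij * alpha_ij
        P[i, i] = 1.0 - P[i].sum()              # probability of staying at i
    return P

pi = np.array([0.1, 0.2, 0.3, 0.4])
Q = np.full((4, 4), 0.25)                       # symmetric uniform proposal
P = hastings_kernel(pi, Q)
assert np.allclose(pi @ P, pi)                  # pi is stationary: pi P = pi
```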

1994 - Tierney

Markov Transition Kernel

A Markov transition kernel on $(E, \mathcal{E})$ is a map $P : E \times \mathcal{E} \to [0, 1]$ such that

  • $P(\cdot, A)$ is a measurable function, for any fixed $A \in \mathcal{E}$;
  • $P(x, \cdot)$ is a probability measure on $(E, \mathcal{E})$, for any fixed $x \in E$.

Time-homogeneous MC Kernel

We call a sequence of $E$-valued random variables $\{X_n : n \geq 0\}$ a time-homogeneous Markov chain if the kernel satisfies $P(X_n, A) = P(X_{n+1} \in A \mid X_0, \ldots, X_n)$. It is a Markov chain because the probability that $X_{n+1} \in A$ depends only on $X_n$ and not on the whole history. It is time-homogeneous because the same kernel $P$ is used at every time (we do not use a different kernel $P_n$ at each step $n$). We denote by $P_x$ probabilities for the Markov chain with kernel $P$ started at $x$.

Kernel Operations

Let $\nu$ be a probability measure, $P$ a kernel on $(E, \mathcal{E})$, and $h$ a real-valued $\mathcal{E}$-measurable function.

  • A kernel operates on the left on probability measures: $(\nu P)(A) = \int P(x, A)\,\nu(dx)$
  • A kernel operates on the right on measurable functions: $(Ph)(x) = \int h(y)\,P(x, dy)$

Notice that both of the operations above are just expectations of the form $\nu h = \nu[h] = \mathbb{E}_\nu[h] = \int h(y)\,\nu(dy)$. In particular, $(\nu P)(A)$ is the expectation with respect to $\nu$ of the measurable function $P(\cdot, A)$, i.e. $\nu P = \mathbb{E}_\nu[P(\cdot, A)]$, and $(Ph)(x)$ is the expectation of $h$ with respect to the measure $P(x, \cdot)$, i.e. $Ph = \mathbb{E}_{P(x, \cdot)}[h]$.
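
On a finite state space these two operations are just matrix products, with measures as row vectors and functions as column vectors. A small sketch with toy numbers:

```python
import numpy as np

# On a finite state space a kernel is a row-stochastic matrix P.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
nu = np.array([0.5, 0.5])   # probability measure (row vector)
h = np.array([1.0, -1.0])   # measurable function (column vector)

nuP = nu @ P       # (nu P)(A): the distribution after one step from nu
Ph = P @ h         # (P h)(x): E[h(X_1) | X_0 = x], one entry per state
nuPh = nu @ P @ h  # the scalar (nu P)h = nu(P h): both groupings agree
```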

Invariant Measure

We say that a kernel $P$ leaves the measure $\pi$ invariant if for every measurable $A$

$$\pi(A) = \int \pi(dx)\,P(x, A)$$

Essentially this means that if we operate the kernel $P$ on the left against $\pi$, the resulting measure is still $\pi$; this is written as $\pi = \pi P$.

Irreducibility

Let $\pi$ be a $\sigma$-finite measure. A kernel $P$ on $(E, \mathcal{E})$ is $\pi$-irreducible if $\pi(E) > 0$ and, for each point $x \in E$ and each set $A \in \mathcal{E}$ with $\pi(A) > 0$, there is an integer $n = n(x, A) \geq 1$ such that $P^n(x, A) > 0$.

This means that, as long as the measure $\pi$ gives positive mass to the whole of $E$, I can always find a number of steps that will lead me from any $x \in E$ to any $A \in \mathcal{E}$ that has positive mass under $\pi$. Basically, I can visit the whole space if you give me long enough!

Periodicity

A $\pi$-irreducible kernel $P$ is periodic if there exist an integer $d \geq 2$ and a sequence $\{E_0, E_1, \ldots, E_{d-1}\}$ of $d$ non-empty disjoint sets in $\mathcal{E}$ such that for all $i = 0, \ldots, d-1$ and all $x \in E_i$

$$P(x, E_j) = 1 \quad \text{for } j = i + 1 \bmod d$$

This means that if I am at $x \in E_i$ then in one step I will move with probability 1 to $E_j$ (where $j = i + 1 \bmod d$). As a result, we visit each of the $E_i$ periodically. If the kernel is not periodic, it is aperiodic.
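
For a concrete example, take $E = \{0, 1\}$ and the deterministic flip kernel $P(0, \{1\}) = P(1, \{0\}) = 1$. With $d = 2$, $E_0 = \{0\}$ and $E_1 = \{1\}$ the definition is satisfied: the chain alternates between the two states forever, so $P^n(x, \cdot)$ never converges even though the uniform distribution is invariant.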

Recurrence

Suppose $X_n$ is a Markov chain generated by a kernel $P$ which is $\pi$-irreducible (can explore the whole space) and has $\pi$ as its invariant distribution. We say that $X_n$ is recurrent if for every $B$ with positive mass $\pi(B) > 0$ we have

  • $P_x\{X_n \in B \text{ i.o.}\} > 0$ for all $x$
  • $P_x\{X_n \in B \text{ i.o.}\} = 1$ for $\pi$-almost all $x$

Here i.o. means infinitely often. Basically, the first condition says that, no matter your starting point, you have positive probability of visiting each set $B$ of positive mass infinitely often. The second condition says that, outside a $\pi$-null set of starting points $x$, you will in fact visit each such $B$ infinitely often with probability 1.

If $P_x\{X_n \in B \text{ i.o.}\} = 1$ for every $x$, then we say the chain is Harris recurrent. Basically this means that we visit each $B$ infinitely often with probability 1, even if we start from an $x$ in a set of measure zero.

Suppose $P$ is recurrent, so it has an invariant measure that is unique up to a constant multiple. We say $P$ is positive recurrent if the total mass of this measure is finite (so it can be normalized to a probability distribution).

Unique Invariant

$$\left.\begin{array}{l} P \text{ is } \pi\text{-irreducible} \\ P \text{ leaves } \pi \text{ invariant} \end{array}\right\} \implies P \text{ is recurrent and } \pi \text{ is the unique invariant distribution}$$

Total Variation Norm

If $\lambda$ is a bounded signed measure on $(E, \mathcal{E})$ then we define the total variation norm as

$$\|\lambda\|_{TV} = \sup_{A \in \mathcal{E}} \lambda(A) - \inf_{A \in \mathcal{E}} \lambda(A)$$

Equilibrium Distribution

Suppose that $P$ is $\pi$-irreducible and $\pi$ is its invariant distribution, i.e. $\pi = \pi P$. Then $P$ is positive recurrent and $\pi$ is its unique invariant measure. If $P$ is also aperiodic, then for $\pi$-almost all $x$

$$\|P^n(x, \cdot) - \pi\|_{TV} \to 0$$

If $P$ is Harris recurrent then the convergence occurs for all $x$. Basically this means that if $P$ can explore the whole space, has $\pi$ as its invariant distribution and is aperiodic, then $\pi$ is also its equilibrium distribution.

NB: Above we used $P^n(x, A) = P(X_n \in A \mid X_0 = x)$.
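
A quick finite-state illustration of the equilibrium statement (toy numbers of my own): for an irreducible aperiodic transition matrix, the rows of $P^n$ all converge to the invariant distribution $\pi$.

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])   # irreducible and aperiodic

Pn = np.linalg.matrix_power(P, 50)
pi = Pn[0]                      # every row approaches pi = [0.25, 0.5, 0.25]
assert np.allclose(Pn, pi)      # rows of P^n have all converged to pi
assert np.allclose(pi @ P, pi)  # and pi is invariant: pi P = pi
```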

Importantly, the assumptions above are both necessary and sufficient. This means that if $\|P^n(x, \cdot) - \pi\|_{TV} \to 0$ for all $x$, then the chain is $\pi$-irreducible, aperiodic, positive Harris recurrent, and has invariant distribution $\pi$.

Harris Recurrent Checks

Let $h$ be a non-negative real-valued function. We say $h$ is harmonic for $P$ if $h = Ph$. Now let $P$ be recurrent. Then it is also Harris recurrent if and only if every bounded harmonic function is constant.

Suppose $P$ is $\pi$-irreducible and $\pi P = \pi$. If the measure $P(x, \cdot)$ is absolutely continuous with respect to $\pi$ for all $x$, i.e. $P(x, \cdot) \ll \pi$, then $P$ is Harris recurrent. This condition is often used for Gibbs samplers.

Irreducible Metropolis is Harris

Let P be a π-irreducible Metropolis kernel. Then P is Harris recurrent.

Ergodicity

A Markov Chain is called ergodic if it is positive Harris recurrent and aperiodic.

There are three additional stronger forms of ergodicity:

  • Ergodicity of degree 2: let $S_B$ denote the hitting time of a set $B$. An ergodic chain with invariant distribution $\pi$ is ergodic of degree 2 if $\int_B \mathbb{E}_x[S_B^2]\,\pi(dx) < \infty$. For these types of chains we have $n\,\|P^n(x, \cdot) - \pi\|_{TV} \to 0$.
  • Geometric ergodicity: an ergodic Markov chain with invariant distribution $\pi$ is geometrically ergodic if there exist a non-negative extended real-valued function $M$ with $\pi(|M|) < \infty$ and a positive constant $r < 1$ such that $\|P^n(x, \cdot) - \pi\|_{TV} \leq M(x)\,r^n$ for all $x$. Basically, the TV norm is upper-bounded by a geometrically decaying function.
  • Uniform ergodicity: in addition, the chain is uniformly ergodic if there exist a positive constant $M$ and a positive constant $r < 1$ such that $\sup_{x \in E} \|P^n(x, \cdot) - \pi\|_{TV} \leq M\,r^n$. Basically, the sup of the TV norm is upper-bounded by a geometrically decaying constant.

The three forms of ergodicity are related as follows:

$$\text{Uniform ergodicity} \implies \text{Geometric ergodicity} \implies \text{Degree 2 ergodicity}$$

Minorization and Small Sets

Suppose $P$ is a $\pi$-irreducible kernel. Let $m \geq 1$ be an integer, $\beta > 0$ a constant, $C \in \mathcal{E}$ a set, and $\nu$ a probability measure on $\mathcal{E}$. We say that the kernel $P$ satisfies a minorization condition $M(m, \beta, C, \nu)$ if

$$\pi(C) > 0 \qquad \text{and} \qquad \beta\,\nu(\cdot) \leq P^m(x, \cdot) \quad \forall x \in C$$

We say $C$ is a small set for $P$.

Basically it means this: suppose we have a set $C$ with positive $\pi$-mass. If we start from any point $x \in C$, the $m$-step transition kernel $P^m(x, \cdot)$ is a measure that dominates $\beta\,\nu(\cdot)$.
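
To see why this is useful (a standard regeneration argument, filling in a step the note leaves implicit): under $M(1, \beta, E, \nu)$ we can decompose $P(x, \cdot) = \beta\,\nu(\cdot) + (1 - \beta)\,R(x, \cdot)$, where $R(x, \cdot) = (P(x, \cdot) - \beta\,\nu(\cdot))/(1 - \beta)$ is again a probability measure. At every step, with probability $\beta$ the chain regenerates from $\nu$ regardless of its current position, so the starting point is forgotten at a geometric rate; this is what drives the bound $r^m \leq (1 - \beta)$ in the next section.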

Convergence Results

  • $P$ is uniformly ergodic $\iff$ the whole state space $E$ is small (i.e. a minorization condition $M(m, \beta, E, \nu)$ is satisfied).
  • If $P$ satisfies a minorization condition $M(m, \beta, E, \nu)$ then the convergence rate $r$ satisfies $r^m \leq (1 - \beta)$.
  • Suppose $\pi$ is a target density and $q$ a proposal density, both of them bounded and bounded away from 0 on $E^+$. Suppose Lebesgue measure $\mu$ assigns finite mass to $E^+$, i.e. $\mu(E^+) < \infty$. Then the Metropolis kernel targeting $\pi$ with proposal $q$ satisfies a minorization condition $M(1, \beta, E, \nu)$ where $\nu \propto \mu|_{E^+}$. The Metropolis kernel is therefore uniformly ergodic, because $E$ is a small set.
  • An independence Metropolis kernel with proposal density $f$ and bounded weight function $w = \pi/f$ satisfies a minorization condition $M(1, \beta, E, \pi)$ with $\beta = (\sup w)^{-1}$, and is thus uniformly ergodic. The convergence rate $r$ satisfies $r \leq (1 - \beta) = 1 - (\sup w)^{-1}$. (A code sketch follows this list.)
  • Uniform ergodicity for mixtures: suppose $P_1$ and $P_2$ are $\pi$-invariant and $P_1$ is uniformly ergodic. Then the mixture kernel $\alpha P_1 + (1 - \alpha) P_2$, $\alpha \in (0, 1]$, is uniformly ergodic.
  • Uniform ergodicity for cycles: suppose $P_1$ and $P_2$ are $\pi$-invariant and that $P_1$ satisfies a minorization condition $M(1, \beta, E, \nu)$ for some $\beta$ and $\nu$. Then the cycles $P_1 P_2$ and $P_2 P_1$ are uniformly ergodic.
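
A minimal sketch of an independence Metropolis kernel, as referenced in the fourth bullet above (target and proposal are my own illustrative choices): the proposal ignores the current state, and the acceptance probability is $\min(1, w(y)/w(x))$ with $w = \pi/f$. A heavier-tailed proposal keeps $w$ bounded, so the kernel is uniformly ergodic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def independence_metropolis(log_pi, proposal, n_steps, x0):
    """Independence Metropolis: accept y ~ proposal with probability
    min(1, w(y)/w(x)), where w = pi/f is the weight function."""
    log_w = lambda z: log_pi(z) - proposal.logpdf(z)
    x, out = x0, np.empty(n_steps)
    for t in range(n_steps):
        y = proposal.rvs(random_state=rng)
        if np.log(rng.uniform()) < log_w(y) - log_w(x):
            x = y
        out[t] = x
    return out

# Target: standard normal; proposal: heavier-tailed t_3, so w = pi/f is
# bounded and beta = 1/sup(w) gives a minorization with C = E.
chain = independence_metropolis(stats.norm.logpdf, stats.t(df=3), 5_000, 0.0)
```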

Making Mixtures/Cycles Ergodic

Since the independence kernel with bounded weight function satisfies a minorization condition, we can add it to any cycle or mixture of kernels to make the overall transition kernel uniformly ergodic. This basically means that we insert a "restart step". We need to make sure that $f$ has sufficiently thick tails, so that $w = \pi/f$ stays bounded. A sketch follows.
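
A sketch of such a hybrid kernel (all names and defaults are illustrative): with probability $\alpha$ it attempts an independence "restart" move, otherwise a symmetric random-walk move. Each component leaves $\pi$ invariant, so the mixture does too, and the restart component makes the mixture uniformly ergodic when $w = \pi/f$ is bounded.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def mixture_step(x, log_pi, restart, alpha=0.1, step=1.0):
    """One step of alpha*P1 + (1-alpha)*P2: with probability alpha make an
    independence ("restart") proposal from `restart`, otherwise a symmetric
    random-walk proposal; accept with the usual Metropolis-Hastings ratio."""
    if rng.uniform() < alpha:
        y = restart.rvs(random_state=rng)
        log_a = (log_pi(y) - restart.logpdf(y)) - (log_pi(x) - restart.logpdf(x))
    else:
        y = x + step * rng.normal()
        log_a = log_pi(y) - log_pi(x)
    return y if np.log(rng.uniform()) < log_a else x

# e.g. a standard normal target with t_3 restarts:
x = 0.0
for _ in range(1000):
    x = mixture_step(x, stats.norm.logpdf, stats.t(df=3))
```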

Convergence of Averages

Let $X_n$ be an ergodic Markov chain with equilibrium distribution $\pi$. Suppose the function $f$ is real-valued and $\pi(|f|) < \infty$. Then for any initial distribution we have

$$\bar{f}_n := \frac{1}{n}\sum_{i=1}^{n} f(X_i) \to \pi(f) \quad \text{almost surely}$$

Central Limit Theorem (version 1)

Suppose $X_n$ is ergodic of degree 2 with equilibrium distribution $\pi$ and suppose $f$ is real-valued and bounded. Then there exists a real number $\sigma(f)$ such that the distribution of $\sqrt{n}\,(\bar{f}_n - \pi(f))$ converges weakly to a normal distribution with mean 0 and variance $\sigma(f)^2$, for any initial distribution.

Central Limit Theorem (version 2)

Suppose $X_n$ is uniformly ergodic with equilibrium distribution $\pi$ and suppose $f$ is real-valued with $\pi(f^2) < \infty$. Then there exists a real number $\sigma(f)$ such that the distribution of $\sqrt{n}\,(\bar{f}_n - \pi(f))$ converges weakly to a normal distribution with mean 0 and variance $\sigma(f)^2$, for any initial distribution.
