Variational Auto-Encoders and the Expectation-Maximization Algorithm

Relationship between Variational Autoencoders (VAE) and the Expectation Maximization Algorithm

Latent Variable Models (LVM)

Suppose we observe some data $\mathcal{D} = \{x_1, \ldots, x_N\}$. Often we know that what we have observed is only part of the whole picture, and to understand the system at hand we have to introduce some latent variables. Consider, for now, a single data point $x$ and its corresponding latent variables $z$. Then, we might be interested in the following three tasks.

  • Generative Modelling: Generating samples from the marginal distribution efficiently (see the sketch after this list): $p_\theta(x) = \int p_\theta(x, z) \, dz$
  • Posterior Inference: Modelling the posterior distribution over the latent variables: $p_\theta(z \mid x) = \frac{p_\theta(x, z)}{p_\theta(x)}$
  • Parameter Estimation: Performing Maximum Likelihood Estimation (MLE) or Maximum-A-Posteriori (MAP) estimation for the parameter $\theta$: $$\theta^* = \arg\max_\theta \prod_{x \in \mathcal{D}} p_\theta(x) \qquad \text{or} \qquad \theta^* = \arg\max_\theta \left[\prod_{x \in \mathcal{D}} p_\theta(x)\right] p(\theta)$$
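To make the generative-modelling task concrete, here is a minimal sketch of ancestral sampling from a latent variable model, using a toy two-component Gaussian mixture (all parameter values below are illustrative, not from the text): we first draw the latent $z$ from its prior and then draw $x$ from the conditional.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent variable model: z ~ Categorical(pi), x | z ~ N(mu_z, sigma_z^2).
# All parameter values here are illustrative.
pi = np.array([0.3, 0.7])       # prior over the discrete latent z
mu = np.array([-2.0, 3.0])      # per-component means
sigma = np.array([0.5, 1.0])    # per-component standard deviations

def sample_lvm(n):
    """Ancestral sampling: draw z from the prior, then x given z."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = rng.normal(mu[z], sigma[z])
    return x, z

# Draws from the marginal p(x) = sum_z p(z) p(x | z), without ever evaluating it.
x_samples, z_samples = sample_lvm(1000)
```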

The settings and problems described above are quite standard and of widespread interest. One method to perform Maximum Likelihood Estimation in Latent Variable Models is the Expectation-Maximization algorithm, while a method to perform posterior inference is Mean-Field Variational Inference.

Expectation-Maximization Algorithm for Maximum Likelihood Estimation

Suppose that, for some reason, we have a fairly good estimate of the parameter; let's call this guess $\hat{\theta}$. How can we improve it? One way to go about this is to use $\hat{\theta}$ to construct the posterior distribution at each data point, $\{p_{\hat{\theta}}(z \mid x) : x \in \mathcal{D}\}$, and then find an updated, improved parameter value by maximizing the expected complete-data log-likelihood $$\arg\max_\theta \mathbb{E}_{p_{\hat{\theta}}(z \mid x)}\left[\log p_\theta(x, z)\right].$$ That is, rather than maximizing the incomplete-data likelihood $p_\theta(x)$, we maximize the complete-data likelihood $p_\theta(x, z)$; but since we don't actually observe the latent variables $z$, we average this complete-data log-likelihood with respect to the posterior of the latent variables given the data and the current parameter value, $p_{\hat{\theta}}(z \mid x)$. Iterating this procedure gives the following algorithm.

  • Initialize $\theta^{(0)}$ and set $t = 0$.
  • Until convergence:
    • Compute the posterior distribution of the latent variables given the observations: $\{p_{\theta^{(t)}}(z \mid x) : x \in \mathcal{D}\}$
    • Choose the new parameter value $\theta^{(t+1)}$ so that it maximizes $\sum_{x \in \mathcal{D}} \mathbb{E}_{p_{\theta^{(t)}}(z \mid x)}\left[\log p_\theta(x, z)\right]$

Problem: the EM algorithm breaks if the posteriors $p_{\theta^{(t)}}(z \mid x)$ are intractable.
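When the posteriors are tractable, however, EM works well. As a concrete example, here is a minimal sketch (NumPy/SciPy; the initialization and fixed iteration count are illustrative) of EM for a univariate Gaussian mixture, where the E-step responsibilities are exactly the posteriors $p_{\theta^{(t)}}(z \mid x)$.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def em_gmm(x, K=2, iters=50):
    """EM for a univariate K-component Gaussian mixture (illustrative sketch)."""
    n = len(x)
    pi = np.full(K, 1.0 / K)                           # mixture weights
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))      # crude initialization
    sigma = np.full(K, x.std())
    for _ in range(iters):
        # E-step: responsibilities r[i, k] = p(z_i = k | x_i, theta^(t)).
        log_r = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)
        r = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
        # M-step: maximize the expected complete-data log-likelihood.
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return pi, mu, sigma
```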

Mean-Field Variational Inference for Posterior Inference

A well-known alternative to MCMC for posterior inference is Mean-Field Variational Inference. In Variational Inference we define a family of distributions that we believe is somehow representative of the true posterior distribution $p_\theta(z \mid x)$, and then we choose the member of that family that is closest, in some sense, to the true posterior. In Mean-Field Variational Inference we assume that this family factorizes into a product of terms, one for each data point.

$$\prod_{i=1}^{N} q_{\phi_i}(z_i) \approx \prod_{i=1}^{N} p_\theta(z_i \mid x_i)$$

We can see that each factor in the product on the LHS is described by its own vector of parameters $\phi_i$. Usually, to judge how close each member of the chosen family is to the true posterior, we use the KL divergence. This means that we need to solve an optimization problem for each factor in the approximation, i.e. for each data point: $$\phi_i^* = \arg\min_{\phi_i} \text{KL}\left(q_{\phi_i}(z_i) \,\|\, p_\theta(z_i \mid x_i)\right).$$

Problem: this clearly doesn't scale well to large datasets, since we solve one optimization problem per data point, and it breaks if the expectations $\mathbb{E}_{q_{\phi_i}}[\cdot]$ are intractable.

Ideally, we would like to use a single vector of parameters $\phi$, shared across all data points. This is called amortized inference.

Variational Autoencoders at a Glance

So what are Variational Autoencoders, and what is Auto-Encoding Variational Bayes (AEVB)? Below is a summary of what they are and what they are used for.

  • What is AEVB used for? Inference and Generative Modelling in LVMs.
  • How do AEVBs work? Optimization of an unbiased estimator of an objective function using SGD.
  • What are VAEs? They are AEVB where the probability distributions in the LVM are parametrized by Neural Networks.

Auto-Encoding Variational Bayes Objective Function

In this section, we derive the objective function of AEVB. First, let us introduce a so-called recognition model $q_\phi(z \mid x)$: a chosen distribution that we use to approximate the true posterior distribution $p_\theta(z \mid x)$, much as in Mean-Field Variational Inference. The key difference is that, instead of having a different vector of parameters for each data point, we share a single vector of parameters across all data points. Then, we consider the KL divergence between this approximating distribution and the true posterior:

$$\begin{aligned}\text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) &= \mathbb{E}_{q_\phi}\left[\log q_\phi(z \mid x) - \log p_\theta(z \mid x)\right] \\ &= -\mathbb{E}_{q_\phi}\left[\log\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right)\right] + \log p_\theta(x) \\ &:= -\mathcal{L}_{\theta,\phi}(x) + \log p_\theta(x)\end{aligned}$$

Notice how we managed to write the KL divergence in terms of the log marginal likelihood $\log p_\theta(x)$ and a term that we call $\mathcal{L}_{\theta,\phi}(x)$. We can now rearrange this expression and notice that, since the KL divergence is always non-negative, $\mathcal{L}_{\theta,\phi}(x)$ provides a lower bound on the marginal log-likelihood.

$$\log p_\theta(x) = \text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) + \mathcal{L}_{\theta,\phi}(x) \geq \mathcal{L}_{\theta,\phi}(x)$$

For this reason, we call $\mathcal{L}_{\theta,\phi}(x)$ the Evidence Lower BOund (ELBO). With an i.i.d. dataset, this relationship holds for the whole dataset:

$$\sum_{x \in \mathcal{D}} \log p_\theta(x) \geq \sum_{x \in \mathcal{D}} \mathcal{L}_{\theta,\phi}(x)$$

Why is it useful to find a lower bound on the log marginal likelihood? Because by maximizing the ELBO we kill two birds with one stone. First, by maximizing the ELBO with respect to $\theta$ we can expect to approximately maximize the log marginal likelihood as well. Second, since the ELBO can be written as the log marginal likelihood minus the KL divergence, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing that KL divergence. In summary, we can write:

$$\max_{\theta,\phi} \mathcal{L}_{\theta,\phi}(x) \implies \begin{cases} \max_\theta \sum_{x \in \mathcal{D}} \log p_\theta(x) & \text{as } \log p_\theta(x) \geq \mathcal{L}_{\theta,\phi}(x) \\ \min_\phi \sum_{x \in \mathcal{D}} \text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) & \text{as } \log p_\theta(x) - \text{KL} = \mathcal{L}_{\theta,\phi}(x) \end{cases}$$

Respectively, these maximizations have very practical results:

  • The generative model $p_\theta(x)$ improves.
  • The approximation $q_\phi(z \mid x)$ improves.

Lastly, one can also write the ELBO in a different way. This second formulation is often convenient because it tends to yield lower-variance estimates.

$$\begin{aligned}\mathcal{L}_{\theta,\phi}(x) &= \log p_\theta(x) - \text{KL}\left(q_\phi \,\|\, p_\theta(z \mid x)\right) \\ &= \mathbb{E}_{q_\phi}\left[\log\left(p_\theta(x \mid z)\, p_\theta(z)\right) - \log q_\phi(z \mid x)\right] \\ &= \underbrace{\mathbb{E}_{q_\phi}\left[\log p_\theta(x \mid z)\right]}_{\text{Expected Reconstruction Error}} - \underbrace{\text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right)}_{\text{Regularization Term}}\end{aligned}$$

As can be seen above, the ELBO can be written as the sum of two terms: the expected reconstruction error and the KL divergence between the approximation and the latent prior. This KL divergence can be interpreted as a regularization term that keeps the approximation close to the prior. This regularization can sometimes backfire and pin the approximation exactly onto the prior. To deal with this scenario, called posterior collapse, new methods have been developed, such as delta-VAEs.

Optimization of the ELBO Objective Function

One way to optimize the ELBO with respect to $\phi$ and $\theta$ is gradient-based optimization. Since our aim is to find an algorithm that scales well to large datasets, we want to use Stochastic Gradient Ascent. In order to do so, we need to be able to compute $\nabla_\phi \mathcal{L}_{\theta,\phi}(x)$ and $\nabla_\theta \mathcal{L}_{\theta,\phi}(x)$. However, notice how in both formulations of the ELBO we find expectations with respect to $q_\phi(z \mid x)$, which depends on $\phi$:

$$\mathcal{L}_{\theta,\phi}(x) = \begin{cases} \mathbb{E}_{q_\phi}\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right] \\ \mathbb{E}_{q_\phi}\left[\log p_\theta(x \mid z)\right] - \text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right) \end{cases}$$

This means that when taking the gradient of the ELBO with respect to ϕ we cannot exchange the gradient and the expectation sign

$$\nabla_\phi \mathbb{E}_{q_\phi(z \mid x)}[\,\cdot\,] \neq \mathbb{E}_{q_\phi(z \mid x)}[\nabla_\phi \,\cdot\,]$$ We would like to perform this exchange so that we can approximate the gradient with a simple Monte Carlo estimate, as is usually done in Stochastic Gradient Ascent. To get around this problem, our question becomes:

Can we write the expectation with respect to a distribution that is independent of $\phi$? $$\nabla_\phi \mathbb{E}_{q_\phi(z \mid x)}[\,\cdot\,] = \mathbb{E}_{p(\epsilon)}[\nabla_\phi \,\cdot\,]$$

If we think about it, we already know a special case in which this can be done. Suppose that the distribution $q_\phi(z \mid x)$ is a multivariate normal distribution $N(\mu, \Sigma)$. Then, we can rewrite a sample $z \sim q_\phi(z \mid x)$ in terms of a standard multivariate normal distribution: $$z = \mu + L\epsilon \quad \text{where} \quad \epsilon \sim N(0, I),$$ where $LL^\top = \Sigma$ is the Cholesky decomposition of $\Sigma$. Notice how we have written the random variable $z$, which depends on the parameters $\phi = (\mu, \Sigma)$, in terms of another random variable $\epsilon$ that is independent of $\phi$ and a deterministic (linear) transformation, which instead does depend on $\phi$. We can then write an expectation with respect to $N(\mu, \Sigma)$ in terms of $N(0, I)$: $$\mathbb{E}_{N(\mu, \Sigma)}[f(z)] = \mathbb{E}_{N(0, I)}[f(\mu + L\epsilon)]$$
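As a quick numerical sanity check of this identity (NumPy; the covariance and test function $f$ below are arbitrary choices), sampling $z$ directly from $N(\mu, \Sigma)$ and sampling $z = \mu + L\epsilon$ with $\epsilon \sim N(0, I)$ yield matching Monte Carlo estimates of $\mathbb{E}[f(z)]$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters: a 2-d Gaussian with covariance Sigma = L @ L.T.
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
L = np.linalg.cholesky(Sigma)

f = lambda z: np.sin(z).sum(axis=-1)   # arbitrary test function

n = 100_000
z_direct = rng.multivariate_normal(mu, Sigma, size=n)   # z ~ N(mu, Sigma)
eps = rng.standard_normal((n, 2))                       # eps ~ N(0, I)
z_reparam = mu + eps @ L.T                              # z = mu + L eps, row-wise

print(f(z_direct).mean(), f(z_reparam).mean())          # approximately equal
```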

More generally, we can write a sample $z \sim q_\phi(z \mid x)$ as a deterministic function of $x$ and $\epsilon$, parametrized by $\phi$: $$z = g_\phi(x, \epsilon)$$ where $\epsilon$ is drawn from a distribution $p(\epsilon)$ independent of $\phi$. Then we can exchange the expectation and gradient operators as follows:

$$\nabla_\phi \mathbb{E}_{q_\phi(z \mid x)}[f(z)] = \mathbb{E}_{p(\epsilon)}\left[\nabla_\phi f(g_\phi(x, \epsilon))\right]$$

This is called the reparametrization trick. At this point we can finally obtain unbiased estimators of the ELBO (or, equivalently, of its gradients) based on $\epsilon^{(i)} \overset{\text{i.i.d.}}{\sim} p(\epsilon)$:

$$\widetilde{\mathcal{L}}_{\theta,\phi}(x) = \begin{cases} \dfrac{1}{L}\displaystyle\sum_{i=1}^{L}\left[\log p_\theta\left(x, g_\phi(x, \epsilon^{(i)})\right) - \log q_\phi\left(g_\phi(x, \epsilon^{(i)}) \mid x\right)\right] \\ \dfrac{1}{L}\displaystyle\sum_{i=1}^{L}\left[\log p_\theta\left(x \mid g_\phi(x, \epsilon^{(i)})\right)\right] - \underbrace{\text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right)}_{\text{often available in closed form}} \end{cases}$$
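For instance, when $q_\phi(z \mid x) = N(\mu, \operatorname{diag}(\sigma^2))$ and the prior is $p_\theta(z) = N(0, I)$, as in the standard Gaussian VAE setup, this KL term has the well-known closed form

$$\text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right) = -\frac{1}{2}\sum_{j=1}^{J}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right),$$

where $J$ is the dimensionality of $z$.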

Then, to perform Stochastic Gradient Ascent, we simply sample a mini-batch of data $\mathcal{M} \subseteq \mathcal{D}$ and use mini-batch gradients: $$\frac{1}{|\mathcal{M}|}\sum_{x \in \mathcal{M}} \nabla_{\theta,\phi} \widetilde{\mathcal{L}}_{\theta,\phi}(x)$$

We call this Auto-Encoding Variational Bayes.

Estimating the Marginal Likelihood

After training, one can estimate the log marginal likelihood by using importance sampling

$$\log p_\theta(x) = \log \int p_\theta(x, z) \, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \approx \log \frac{1}{L} \sum_{i=1}^{L} \frac{p_\theta(x, z^{(i)})}{q_\phi(z^{(i)} \mid x)}, \qquad z^{(i)} \overset{\text{i.i.d.}}{\sim} q_\phi(z \mid x)$$
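A sketch of this estimator, computed stably in log space (the three callables are hypothetical hooks that a trained model would provide):

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal_estimate(log_joint, q_sample, q_logpdf, L=1000):
    """Importance-sampling estimate of log p_theta(x) for a fixed x.

    Hypothetical hooks supplied by the trained model:
      log_joint(z): log p_theta(x, z) for each row of z
      q_sample(L):  L draws z^(i) ~ q_phi(z | x)
      q_logpdf(z):  log q_phi(z | x) for each row of z
    """
    z = q_sample(L)
    log_w = log_joint(z) - q_logpdf(z)     # log importance weights
    return logsumexp(log_w) - np.log(L)    # log( (1/L) * sum_i w_i )
```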

Variational Autoencoders

Before introducing what a Variational Autoencoder is, we need to understand what we mean when we say that we parametrize a distribution using a neural network. Suppose that $x$ is a binary vector of Bernoulli trials. Then $p_\theta(x \mid z)$ is parametrized by a vector of probabilities $p$, which can be produced by a Multi-Layer Perceptron with an appropriate output layer (e.g. a sigmoid, so that each $p_j \in (0, 1)$).

Parametrizing a distribution with an MLP

and the log-likelihood is, of course, $$\log p_\theta(x \mid z) = \sum_j x_j \log p_j + (1 - x_j) \log(1 - p_j)$$
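As a sketch of this parametrization (PyTorch; the layer sizes are illustrative), a small MLP maps $z$ to the Bernoulli probabilities $p$ and implements exactly this log-likelihood:

```python
import torch
import torch.nn as nn

class BernoulliDecoder(nn.Module):
    """Maps a latent z to the probabilities p of a factorized Bernoulli over x."""
    def __init__(self, latent_dim=20, data_dim=784, hidden=400):  # sizes illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, data_dim), nn.Sigmoid(),  # each p_j in (0, 1)
        )

    def log_prob(self, x, z):
        """log p_theta(x | z) = sum_j x_j log p_j + (1 - x_j) log(1 - p_j)."""
        p = self.net(z).clamp(1e-7, 1 - 1e-7)  # clamp for numerical stability
        return (x * torch.log(p) + (1 - x) * torch.log(1 - p)).sum(dim=-1)
```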

A Variational Autoencoder is simply Auto-Encoding Variational Bayes where both the approximating distribution $q_\phi(z \mid x)$ and the likelihood $p_\theta(x \mid z)$ are parametrized by two different Neural Networks, as shown below.

Variational Autoencoder Diagram
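Continuing the PyTorch sketch above (architecture and sizes remain illustrative), a minimal VAE combines a Gaussian recognition network with the BernoulliDecoder from the previous snippet; its loss is the negative single-sample reparametrized ELBO estimate, using the closed-form KL given earlier.

```python
class VAE(nn.Module):
    """Gaussian encoder q_phi(z|x) + Bernoulli decoder p_theta(x|z) (sketch)."""
    def __init__(self, latent_dim=20, data_dim=784, hidden=400):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(data_dim, hidden), nn.Tanh())
        self.enc_mu = nn.Linear(hidden, latent_dim)
        self.enc_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = BernoulliDecoder(latent_dim, data_dim, hidden)

    def elbo(self, x):
        h = self.body(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)               # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # reparametrization trick
        # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
        return self.decoder.log_prob(x, z) - kl  # single-sample ELBO estimate

# One step of Stochastic Gradient Ascent on a mini-batch x_batch (illustrative):
# model = VAE()
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = -model.elbo(x_batch).mean()   # ascend the ELBO = descend its negative
# opt.zero_grad(); loss.backward(); opt.step()
```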

Relationship between EM algorithm and VAE

So what is the relationship between the Expectation-Maximization algorithm and Variational Autoencoders? To get there, we need to understand how the EM algorithm can be cast into the framework of variational inference. Recall that the EM algorithm proceeds in the following two steps:

  • Compute the “current” posterior (which we may call approximate, since before convergence $\theta^{(t)}$ will likely differ from the true $\theta$): $\{p_{\theta^{(t)}}(z \mid x) : x \in \mathcal{D}\}$
  • Find the optimal parameter $$\theta^{(t+1)} = \arg\max_\theta \sum_{x \in \mathcal{D}} \mathbb{E}_{p_{\theta^{(t)}}(z \mid x)}\left[\log p_\theta(x, z)\right]$$

Now consider the ELBO in its two different forms, but rather than considering it as a function of $x$ parametrized by $\theta$ and $\phi$ (i.e. $\mathcal{L}_{\theta,\phi}(x)$), consider it as a functional of $q_\phi$ and a function of $\theta$, i.e. $\mathcal{L}_x(\theta, q_\phi)$.

$$\mathcal{L}_x(\theta, q_\phi) = \begin{cases} \log p_\theta(x) - \text{KL}\left(q_\phi \,\|\, p_\theta(z \mid x)\right) & (1) \\ \mathbb{E}_{q_\phi}\left[\log p_\theta(x, z)\right] - \mathbb{E}_{q_\phi}\left[\log q_\phi\right] & (2) \end{cases}$$

We recover exactly the two steps of the EM algorithm by maximizing the ELBO first with respect to $q_\phi$ and then with respect to $\theta$:

  • E-step: Maximize (1) with respect to $q_\phi$ (this makes the KL divergence zero, so the bound is tight): $\{p_{\theta^{(t)}}(z \mid x) = \arg\max_{q_\phi} \mathcal{L}_x(\theta^{(t)}, q_\phi) : x \in \mathcal{D}\}$
  • M-step: Maximize (2) with respect to $\theta$ (see the identity below): $$\theta^{(t+1)} = \arg\max_\theta \sum_{x \in \mathcal{D}} \mathcal{L}_x\left(\theta, p_{\theta^{(t)}}(z \mid x)\right)$$
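To see that this M-step coincides with the EM update, plug $q_\phi = p_{\theta^{(t)}}(z \mid x)$ into form (2): the entropy term $-\mathbb{E}_{q_\phi}[\log q_\phi]$ does not depend on $\theta$, so

$$\arg\max_\theta \sum_{x \in \mathcal{D}} \mathcal{L}_x\left(\theta, p_{\theta^{(t)}}(z \mid x)\right) = \arg\max_\theta \sum_{x \in \mathcal{D}} \mathbb{E}_{p_{\theta^{(t)}}(z \mid x)}\left[\log p_\theta(x, z)\right],$$

which is exactly the EM M-step.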

The relationship between the Expectation Maximization algorithm and Variational Auto-Encoders can therefore be summarized as follows:

  • EM algorithm and VAE optimize the same objective function.
  • When the expectations are available in closed form, one should use the EM algorithm, which performs coordinate ascent.
  • When expectations are intractable, VAE uses stochastic gradient ascent on an unbiased estimator of the objective function.

