Improving a Variational Autoencoder with Normalizing Flows
In order to fully grasp the concepts explained here, I strongly recommend reading my three posts on Variational Autoencoders, in the following order:
- Variational Autoencoders and the Expectation Maximization Algorithm
- Minimalist Variational Autoencoder in Pytorch with CUDA GPU
- Assessing a Variational Autoencoder on MNIST using Pytorch.
$$ \newcommand{\vect}[1]{\boldsymbol{\mathbf{#1}}} \newcommand{\vx}{\vect{x}} \newcommand{\vz}{\vect{z}} \newcommand{\vphi}{\vect{\phi}} \newcommand{\vtheta}{\vect{\theta}} \newcommand{\vmu}{\vect{\mu}} \newcommand{\vsigma}{\vect{\sigma}} \newcommand{\N}{\mathcal{N}} \newcommand{\encoder}{q_{\vphi}(\vz \mid \vx)} \newcommand{\vepsilon}{\vect{\epsilon}} \newcommand{\snd}{\N(\vect{0}, \vect{I})} \newcommand{\muz}{\vmu_{\vphi}(\vx)} \newcommand{\sigmaz}{\vsigma^2_{\vphi}(\vx)} \newcommand{\elbo}{\mathcal{L}_{\vphi, \vtheta}(\vx)} \newcommand{\Ebb}{\mathbb{E}} \newcommand{\eencoder}[1]{\Ebb_{\encoder}\left[#1\right]} \newcommand{\decoder}{p_{\vtheta}(\vx \mid \vz)} \newcommand{\kl}[2]{\text{KL}\left(#1 \parallel #2\right)} \newcommand{\prior}{p(\vz)} \newcommand{\vlambda}{\vect{\lambda}} \newcommand{\vw}{\vect{w}} \newcommand{\vu}{\vect{u}} \newcommand{\Eqk}[1]{\Ebb_{q_K(\vz_K)}\left[#1\right]} $$
Theory of Vanilla VAEs
Recall that in a Vanilla VAE we feed $\vx$ into an encoder neural network and obtain $(\vmu, \log\vsigma)$. These are the amortized parameters of our approximate posterior distribution
$$ q_{\vphi}(\vz \mid \vx) = \N(\vz \mid \vmu_{\vphi}(\vx), \text{diag}(\vsigma^2_{\vphi}(\vx))) $$
To get a latent sample $\vz \sim \encoder$ we need to use the reparametrization trick. This requires sampling $\vepsilon \sim \snd$ and then scaling and shifting it according to the output of the neural network
$$ \vz = \muz + \vsigma_{\vphi}(\vx) \odot \vepsilon. $$
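As a minimal sketch in Pytorch (assuming, as in the loss function further below, that the encoder returns `mu` and `logvar` $= \log\vsigma^2$), the reparametrization trick amounts to:

import torch

def reparametrize(mu, logvar):
    """Return z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)  # sigma, recovered from log(sigma^2)
    eps = torch.randn_like(std)    # eps ~ N(0, I)
    return mu + std * eps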
To learn the parameters of our neural network our aim is to maximize the ELBO
$$ \elbo = \eencoder{\log \decoder} - \kl{\encoder}{\prior} $$
The reconstruction error (the first term) is easy to compute in the Normal and Bernoulli cases. In what follows, we will assume that the likelihood is a product of Bernoullis. This is the usual set-up when working with MNIST. The likelihood is then
$$ \decoder = \prod^{\text{dim}(\vx)}_{i=1} p_i^{x_i}(1 - p_i)^{1 - x_i} $$
where $\vect{p} = (p_1, \ldots, p_{\text{dim}(\vx)})^\top$ is the output of the decoder network, and $\vect{p}\in [0, 1]^{\text{dim}(\vx)}$. The reconstruction error can then be written as
$$ \begin{align} \eencoder{\log \decoder} &= \eencoder{\log \prod^{\text{dim}(\vx)}_{i=1} p_i(\vz)^{x_i}(1 - p_i(\vz))^{1- x_i}} \newline &= \eencoder{\sum^{\text{dim}(\vx)}_{i=1} x_i \log p_i(\vz) + (1 - x_i) \log(1 - p_i(\vz))}\newline &\approx \frac{1}{n_{\vz}}\sum^{n_{\vz}}_{j=1} \sum^{\text{dim}(\vx)}_{i=1} x_i \log p_i(\vz^{(j)}) + (1 - x_i) \log(1 - p_i(\vz^{(j)})) \qquad \vz^{(j)} \sim \encoder \end{align} $$
where $n_{\vz}$ is the number of samples we draw from $\encoder$. Usually we simply set $n_{\vz} = 1$, meaning we sample only one latent variable per datapoint. The objective function to minimize (I have flipped the sign) is therefore
$$ \begin{align} -\elbo &= -\sum^{\text{dim}(\vx)}_{i=1} \left[x_i \log p_i(\vz) + (1 - x_i) \log(1 - p_i(\vz))\right] - \frac{1}{2}\sum^{\text{dim}(\vz)}_{j=1} \left(1 + \log\sigma^2_j - \mu^2_j - \sigma^2_j\right) \newline &= \text{BCE}(\vect{p}, \vx) - \frac{1}{2}\sum^{\text{dim}(\vz)}_{j=1} \left(1 + \log\sigma^2_j - \mu^2_j - \sigma^2_j\right) \end{align} $$
Using Pytorch we can code it as
import torch
import torch.nn.functional as F

def vae_loss(image, reconstruction, mu, logvar):
    """Negative ELBO for the Variational AutoEncoder."""
    # Reconstruction term: binary cross-entropy between p(z) and x.
    recon_loss = F.binary_cross_entropy(
        input=reconstruction.view(-1, 28*28),  # p(z), the mean reconstruction
        target=image.view(-1, 28*28),          # x, the true image
        reduction='sum'
    )
    # KL divergence between q(z|x) and the standard Normal prior (closed form).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
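A typical training step then looks something like the following sketch (the `model`, assumed to return `(reconstruction, mu, logvar)`, and the `optimizer` are hypothetical names for illustration):

# Hypothetical usage inside a training loop.
reconstruction, mu, logvar = model(image)   # encoder + reparametrization + decoder
loss = vae_loss(image, reconstruction, mu, logvar)
optimizer.zero_grad()
loss.backward()
optimizer.step()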
VAE with Normalizing Flows
This time, we not only want our encoder to output $(\vmu, \log \vsigma)$ to shift and scale $\vepsilon\sim \snd$. We also want to push a sample from
$$\N(\vz \mid \muz, \text{diag}(\sigmaz))$$
through a series of $K$ transformations, each depending on a set of parameters $\vlambda_k$. Denoting $\vlambda = (\vlambda_1, \ldots, \vlambda_K)$, we essentially want our encoder to work as follows:
$$ \vx \longrightarrow \text{Encoder} \longrightarrow (\vmu, \log \vsigma, \vlambda_1, \ldots, \vlambda_K) = (\vmu, \log \vsigma, \vlambda) $$
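For concreteness, here is a minimal sketch of what such an encoder could look like in Pytorch. The architecture, the dimensions, and the assumption that each $\vlambda_k$ consists of $2\,\text{dim}(\vz) + 1$ numbers (as for a planar flow with parameters $(\vu_k, \vw_k, b_k)$) are illustrative choices, not the only possible ones:

import torch
import torch.nn as nn

class FlowEncoder(nn.Module):
    """Illustrative encoder returning (mu, logvar, flow parameters)."""
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20, n_flows=4):
        super().__init__()
        self.latent_dim, self.n_flows = latent_dim, n_flows
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        # One lambda_k = (u_k, w_k, b_k) per flow: 2 * latent_dim + 1 numbers each.
        self.flow_params = nn.Linear(hidden_dim, n_flows * (2 * latent_dim + 1))

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        lam = self.flow_params(h).view(-1, self.n_flows, 2 * self.latent_dim + 1)
        return self.mu(h), self.logvar(h), lam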
Then we would first use $(\vmu, \log \vsigma)$ to compute $\vz_0$ using the reparametrization trick $$ \vz_0 = \vmu + \vsigma \odot \vepsilon \qquad \vepsilon \sim \snd $$
and then feed $\vz_0$ through the series of transformations to reach the final $\vz_K$
$$ \vz_K = f_K \circ f_{K-1} \circ \ldots \circ f_2 \circ f_1 (\vz_0). $$
This means that our approximating distribution is not
$$ \encoder = \N(\vz \mid \muz, \text{diag}(\sigmaz)) $$
anymore; rather, it can be found using the usual change-of-variables formula
$$ \log \encoder = \log q_K(\vz_K) = \log q_0(\vz_0) - \sum^{K}_{k=1} \log \left|\text{det}\frac{\partial f_k}{\partial \vz_{k-1}}\right| $$
where the base distribution $q_0(\vz_0)$ is the old approximate posterior distribution $$ q_0(\vz_0) = \N(\vz_0 \mid \muz, \text{diag}(\sigmaz)). $$
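For example, with planar flows (one common choice for the $f_k$, introduced by Rezende and Mohamed), each step is $f_k(\vz) = \vz + \vu_k \tanh(\vw_k^\top \vz + b_k)$ and its log-absolute-determinant has the closed form $\log\left|1 + \vu_k^\top \psi_k(\vz)\right|$ with $\psi_k(\vz) = (1 - \tanh^2(\vw_k^\top \vz + b_k))\,\vw_k$. Below is a minimal sketch of one such step and of chaining $K$ of them; the `lam` layout matches the hypothetical encoder sketch above, and the constraint that keeps each step invertible is omitted for brevity:

import torch

def planar_flow_step(z, u, w, b):
    """One planar flow f(z) = z + u * tanh(w^T z + b) and its log |det Jacobian|.

    Shapes: z, u, w are (batch, dim_z); b is (batch, 1).
    """
    lin = (w * z).sum(dim=1, keepdim=True) + b     # w^T z + b
    f_z = z + u * torch.tanh(lin)                  # z_k = f_k(z_{k-1})
    psi = (1.0 - torch.tanh(lin) ** 2) * w         # tanh'(w^T z + b) * w
    log_det = torch.log(torch.abs(1.0 + (u * psi).sum(dim=1, keepdim=True)) + 1e-8)
    return f_z, log_det

def apply_flows(z0, lam):
    """Chain K planar flows; return z_K and the accumulated sum of log-determinants."""
    z, log_det_sum, d = z0, 0.0, z0.shape[1]
    for k in range(lam.shape[1]):
        u, w, b = lam[:, k, :d], lam[:, k, d:2*d], lam[:, k, 2*d:]
        z, log_det = planar_flow_step(z, u, w, b)
        log_det_sum = log_det_sum + log_det
    return z, log_det_sum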
Thanks to the law of the unconscious statistician, we have
$$ \begin{align} \elbo &= \eencoder{\log \decoder}-\kl{\encoder}{\prior} \newline &= \Eqk{\log p_{\vtheta}(\vx \mid \vz_K)} - \Eqk{\log q_K(\vz_K) - \log p(\vz_K)} \newline &= \Ebb_{q_0(\vz_0)}[\log p_{\vtheta}(\vx \mid \vz_K)]- \Ebb_{q_0(\vz_0)}[\log q_K(\vz_K) - \log p(\vz_K)] \end{align} $$
As usual, we can approximate this using Monte Carlo, and generally we only need one sample. By drawing $\vz_0\sim q_0(\vz_0) = \N(\vz_0 \mid \vmu, \text{diag}(\vsigma^2))$ we can approximate the ELBO as follows
$$ \begin{align} \elbo &\approx \left[\sum^{\text{dim}(\vx)}_{i=1} x_i\log p_i(\vz_K) + (1 - x_i)\log(1 - p_i(\vz_K))\right] - \log q_K(\vz_K) + \log p(\vz_K). \end{align} $$
This means that our objective function is given by $$ \begin{align} -\elbo &= \text{BCE}(\vect{p}, \vx) + \log q_0(\vz_0) - \text{LADJ} - \log p(\vz_K) \end{align} $$
where the Log-Absolute-Determinant-Jacobian is the usual $$ \text{LADJ} = \sum^{K}_{k=1} \log \left|\text{det}\frac{\partial f_k}{\partial \vz_{k-1}}\right| $$
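Putting everything together, a minimal sketch of this objective in Pytorch could look as follows (the argument names are illustrative: `z0` and `zK` are the latent samples before and after the flow, and `log_det_sum` is the accumulated LADJ returned by the flow steps):

import math
import torch
import torch.nn.functional as F

def flow_vae_loss(image, reconstruction, mu, logvar, z0, zK, log_det_sum):
    """Negative ELBO for a VAE with normalizing flows (one Monte Carlo sample)."""
    # Reconstruction term: BCE(p, x), with p computed from z_K by the decoder.
    recon_loss = F.binary_cross_entropy(
        input=reconstruction.view(-1, 28*28),
        target=image.view(-1, 28*28),
        reduction='sum'
    )
    log2pi = math.log(2 * math.pi)
    # log q_0(z_0) under N(mu, diag(sigma^2)).
    log_q0 = -0.5 * torch.sum(log2pi + logvar + (z0 - mu) ** 2 / logvar.exp())
    # log p(z_K) under the standard Normal prior N(0, I).
    log_pzK = -0.5 * torch.sum(log2pi + zK ** 2)
    # -ELBO = BCE(p, x) + log q_0(z_0) - LADJ - log p(z_K).
    return recon_loss + log_q0 - torch.sum(log_det_sum) - log_pzK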