Putting it all together

Normalizing Flows as Maximum Likelihood Estimation

At its most basic level, Normalizing Flows can be seen as a parametric density estimation method based on Maximum Likelihood. In our case, we took a simple density $p(z) = \mathcal{N}(0, 1)$ and we fed it through a simple transformation consisting only of a constant shift $g(z) = z + \mu$. We then found the optimal parameter value $\mu$ by maximizing the likelihood of the data with gradient ascent.
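As a quick sanity check, here is a minimal sketch of that pipeline, assuming NumPy; the value of $\mu$ and the sample size are arbitrary choices for illustration.

```python
import numpy as np

mu = 2.0
rng = np.random.default_rng(0)

# Sample from the simple base density p(z) = N(0, 1) ...
z = rng.standard_normal(100_000)

# ... and feed the samples through the shift transformation g(z) = z + mu.
x = z + mu

# The transformed samples follow N(mu, 1): mean ~ mu, standard deviation ~ 1.
print(x.mean(), x.std())  # ~2.0, ~1.0
```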

Shifting a standard normal distribution $p(z) = \mathcal{N}(0, 1)$ by a constant $\mu$ gives a new normal distribution with shifted mean, $p(x; \mu) = \mathcal{N}(\mu, 1)$. What we did, in effect, was minimize (with respect to $\mu$) the KL divergence between the true data-generating distribution $p_{\text{data}}(x) = \mathcal{N}(2, 1)$ and the parametrized model $p(x; \mu) = \mathcal{N}(\mu, 1)$:

$$ \widehat{\mu} = \arg\min_{\mu} \text{KL}(p_{\text{data}}(x) \parallel p(x; \mu)) $$

Since minimizing the KL divergence between the true data distribution and a model is equivalent to maximum likelihood estimation (the $\log p_{\text{data}}(x)$ term does not depend on $\mu$, so it can be dropped from the objective)

$$ \begin{align} \widehat{\mu} &= \arg\min_{\mu} \text{KL}(p_{\text{data}}(x) \parallel p(x; \mu)) \newline &= \arg\min_{\mu} \mathbb{E}_{p_{\text{data}}(x)}\left[\log p_{\text{data}}(x) - \log p(x; \mu)\right] \newline &= \arg \max_{\mu} \mathbb{E}_{p_{\text{data}}(x)}\left[\log p(x; \mu)\right] \newline &\approx \arg \max_{\mu} \frac{1}{n} \sum_{i=1}^n \log p(x^{(i)}; \mu) \qquad\qquad\qquad x^{(i)} \overset{\text{i.i.d.}}{\sim} p_{\text{data}}(x) \end{align} $$

and since the likelihood is tractable thanks to the change of variable formula, we can essentially fit $\mu$ by maximum likelihood.
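Concretely, the whole fit takes only a few lines. The sketch below assumes PyTorch for automatic differentiation; the seed, learning rate, and iteration count are arbitrary illustrative choices.

```python
import torch

torch.manual_seed(0)
x = 2.0 + torch.randn(1000)  # i.i.d. samples from p_data(x) = N(2, 1)

mu = torch.zeros(1, requires_grad=True)
base = torch.distributions.Normal(0.0, 1.0)
optimizer = torch.optim.SGD([mu], lr=0.1)

for _ in range(200):
    optimizer.zero_grad()
    # Change of variable for g(z) = z + mu: g^{-1}(x) = x - mu and the
    # log-Jacobian term is log|1| = 0, so log p(x; mu) = log p(x - mu).
    loss = -base.log_prob(x - mu).mean()  # negative log-likelihood
    loss.backward()
    optimizer.step()

print(mu.item())  # ~2.0, close to the sample mean of the data
```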


Normalizing Flows in the General Univariate Case

Now that we have built some intuition for normalizing flows, let’s put everything together. Suppose that we have i.i.d. samples from a data distribution $$x^{(1)}, \ldots, x^{(n)} \overset{\text{i.i.d.}}{\sim} p_{\text{data}}(x)$$ and that our aim is to estimate this distribution with another distribution $p(x)$.

The Normalizing Flows method starts with a simple density $p(z)$ whose analytical form is known, for instance a standard normal distribution $p(z) = \mathcal{N}(0, 1)$. Then, it defines a one-to-one differentiable transformation $g_{\theta}(z)$, depending on parameters $\theta$, that maps samples from this simple distribution to new samples whose distribution is given by the change of variable formula $$ \log p(x; \theta) = \log p(g^{-1}_{\theta}(x)) + \log \left|\frac{\partial}{\partial x} g^{-1}_{\theta}(x)\right|. $$
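To make the recipe concrete, the sketch below instantiates $g_{\theta}$ as an affine transformation $g_{\theta}(z) = a z + b$ with $\theta = (a, b)$; this particular choice of flow is only an illustration, not something fixed by the recipe above.

```python
import torch

base = torch.distributions.Normal(0.0, 1.0)

def log_prob(x, a, b):
    # For g_theta(z) = a * z + b: g^{-1}(x) = (x - b) / a and
    # d g^{-1}/dx = 1 / a, so the change of variable formula gives
    # log p(x; theta) = log p((x - b) / a) - log|a|.
    z = (x - b) / a
    return base.log_prob(z) - torch.log(torch.abs(a))

# Sanity check: for this flow, p(x; theta) is N(b, a^2) in closed form.
a, b = torch.tensor(1.5), torch.tensor(-0.7)
x = torch.tensor(0.3)
print(log_prob(x, a, b))                             # via change of variable
print(torch.distributions.Normal(b, a).log_prob(x))  # analytic, identical
```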

Finally, it finds the parameter values that maximize the likelihood of the data, for instance with gradient ascent iterations

$$ \theta_{t} \longleftarrow \theta_{t-1} + \gamma \left. \frac{\partial}{\partial \theta} \frac{1}{n} \sum_{i=1}^n \log p(x^{(i)}; \theta) \right|_{\theta = \theta_{t-1}}. $$
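Putting the pieces together, here is a sketch of the full procedure for the affine flow from the previous snippet, implementing the gradient ascent update by hand; the hyperparameters are again arbitrary illustrative choices.

```python
import torch

torch.manual_seed(0)
x = 2.0 + torch.randn(1000)  # i.i.d. samples from p_data(x) = N(2, 1)

base = torch.distributions.Normal(0.0, 1.0)
log_a = torch.zeros(1, requires_grad=True)  # parametrize a = exp(log_a) > 0
b = torch.zeros(1, requires_grad=True)

gamma = 0.1  # step size
for _ in range(500):
    a = torch.exp(log_a)
    z = (x - b) / a
    # Average log-likelihood via the change of variable formula.
    log_lik = (base.log_prob(z) - torch.log(a)).mean()
    grads = torch.autograd.grad(log_lik, [log_a, b])
    with torch.no_grad():
        # theta_t <- theta_{t-1} + gamma * d/d theta log p(x; theta_{t-1})
        log_a += gamma * grads[0]
        b += gamma * grads[1]

print(torch.exp(log_a).item(), b.item())  # ~1.0 and ~2.0
```

Parametrizing the scale through its logarithm keeps $a$ positive throughout the optimization, which is why the sketch updates `log_a` rather than `a` directly.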
