Normalizing Flows
General Idea
Let $(\mathsf{Y}, \mathcal{Y})$ be a measurable space and $\pi$ a probability distribution on it that we call the data distribution. Let $Y_1, \ldots, Y_N$ be IID random variables on this space distributed according to $\pi$, for which we observe realizations $y_1, \ldots, y_N$. Our aim is to generate novel samples that are approximately distributed according to $\pi$.
In the Normalizing Flows framework, we introduce an auxiliary (latent) measurable space $(\mathsf{Z}, \mathcal{Z})$ with a family of probability measures $\{\nu_\phi\,:\,\phi\in\Phi\}$ on it, and a family of bijective functions $\{f_\theta\,:\,\theta\in\Theta\}$ from $\mathsf{Z}$ to $\mathsf{Y}$, such that the data generating process works as follows
\begin{align} z &\sim \nu_{\phi^*} \\ y &= f_{\theta^*}(z), \end{align} where $\phi^*\in\Phi$ and $\theta^*\in\Theta$. That is, the data we observe is generated by first sampling from a fixed distribution in the family $\{\nu_\phi\,:\,\phi\in\Phi\}$ and then feeding this sample through a fixed bijection in the family $\{f_\theta\,:\,\theta\in\Theta\}$.
The aim in Normalizing Flows is to learn $\phi^*$ and $\theta^*$ by minimizing some loss function, typically the negative log-likelihood of the observed data.
Notice that the assumption above means that the likelihood distribution of a single observation $y_i$ is $\nu_\phi^{f_\theta}$, i.e. the pushforward of $\nu_\phi$ by $f_\theta$, so that (writing $\nu_\phi^{f_\theta}(y_i)$ for its density at $y_i$) the likelihood of the whole dataset is \begin{align} \prod_{i=1}^N \nu_\phi^{f_\theta}(y_i). \end{align}
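Concretely, assuming for illustration that $\mathsf{Z} = \mathsf{Y} = \mathbb{R}^d$, that $\nu_\phi$ has density $p_\phi$ and that $f_\theta$ is a diffeomorphism, the change of variables formula gives the density of the pushforward and hence the log-likelihood that is maximized during training: \begin{align} \nu_\phi^{f_\theta}(y) = p_\phi\big(f_\theta^{-1}(y)\big)\,\big|\det J_{f_\theta^{-1}}(y)\big|, \qquad \log\mathcal{L}(\theta,\phi) = \sum_{i=1}^N \Big[\log p_\phi\big(f_\theta^{-1}(y_i)\big) + \log\big|\det J_{f_\theta^{-1}}(y_i)\big|\Big]. \end{align}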
Notice that the training phase and the data generation phase use the map in opposite directions: training requires the normalizing direction ($\mathsf{Y}\to\mathsf{Z}$), while generating new samples requires the generative direction ($\mathsf{Z}\to\mathsf{Y}$):
- Training: Requires repeated likelihood evaluations $\implies$ requires $f_\theta^{-1}$ and $\log|\det J_{f_\theta^{-1}}|$
- Generation: Requires evaluating $f_\theta$.
Depending on the application, one may choose to parametrize either $f$ or $f^{-1}$ with the network, since whichever direction is modelled, computing the other may be difficult in practice. Typically, for density estimation the Normalizing Flow network models $f^{-1}$, whereas for Variational Inference the inverse is not needed (one only evaluates the density of one's own samples) and so one models $f$. A minimal sketch of the two directions is given below.
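As an illustration only (not taken from any of the referenced papers; all names are made up), the following sketch uses a single element-wise affine bijection with a standard normal base distribution: the log-likelihood needed for training calls the inverse and its log-determinant, while generation calls the forward map.

```python
import numpy as np

class AffineFlow:
    """Element-wise affine bijection y = exp(log_scale) * z + shift."""

    def __init__(self, dim):
        self.log_scale = np.zeros(dim)   # learnable parameters (theta)
        self.shift = np.zeros(dim)

    def forward(self, z):
        """Generative direction Z -> Y."""
        return np.exp(self.log_scale) * z + self.shift

    def inverse(self, y):
        """Normalizing direction Y -> Z."""
        return (y - self.shift) * np.exp(-self.log_scale)

    def log_det_jac_inverse(self, y):
        """log |det J_{f^{-1}}(y)|, constant for an affine map."""
        return -np.sum(self.log_scale)

def log_likelihood(flow, y):
    """Change-of-variables log-density under a standard normal base."""
    z = flow.inverse(y)
    log_base = -0.5 * np.sum(z**2, axis=-1) - 0.5 * z.shape[-1] * np.log(2 * np.pi)
    return log_base + flow.log_det_jac_inverse(y)

def sample(flow, n, dim, seed=0):
    """Generation: sample the base distribution, then push it through f."""
    rng = np.random.default_rng(seed)
    return flow.forward(rng.standard_normal((n, dim)))
```

Training would then maximize `log_likelihood` over `log_scale` and `shift` (the optimizer is omitted here); stacking several such bijections and summing their log-determinants gives a deeper flow.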
Types of Flows
- Element-wise: $f(z) = (h(z_1), \ldots, h(z_d))^\top$ where $h:\mathbb{R}\to\mathbb{R}$ is a (typically non-linear) bijection, and $\mathsf{Z}=\mathbb{R}^d$. Problem: cannot model correlations between dimensions.
- Linear: $f(z) = Az + b$ with $A$ invertible, but computing $\det(A)$ (and $A^{-1}$) is expensive for large $d$. Options:
- $A$ triangular, or a convex combination of $K$ triangular matrices $\sum_{k=1}^K \omega_k A_k$. Problem: sensitive to ordering of dimensions.
- $A$ permutation to remove ordering. Problem: must be fixed by user, limited flexibility.
- $A$ orthogonal, e.g. Householder Flows, where one layer performs $\left(I - 2 \frac{vv^\top}{\|v\|^2}\right) z$ (see the sketch after this list).
- LU factorization: $f(z) = PLUz + b$ where $L$ is lower-triangular with ones on the diagonal, $U$ is upper-triangular with non-zero diagonal entries and $P$ is a permutation matrix. Again, $P$ must be fixed by the user.
- QR decomposition: $A = Q(R + \text{diag}(s))$ where $\text{diag}(s)$ is used to stabilize the decomposition, $Q=Q_1Q_2\ldots Q_n$ and $Q_i = I - 2\frac{v_iv_i^\top}{v_i^\top v_i}$. Basically, to avoid fixing a permutation matrix, they use a QR decomposition where $Q$ is modelled through Householder transformations.
- Convolutions: several approaches, but all try to make convolutions whose inverse and determinant are tractable.
- Planar: they stretch and contract the distribution along some directions, $f(z) = z + u h(w^\top z + b)$ where $h$ is a smooth non-linearity. Basically it’s a one-layer NN with a single hidden unit and a residual connection; the determinant is cheap thanks to the matrix determinant lemma (see the sketch after this list). Problem: a single hidden unit means limited expressivity.
- Sylvester: $f(z) = z + Uh(W^\top z + b)$ where $U$ and $W$ are $d\times m$ and $m$ is the number of “hidden units”. When $m$ is small, the determinant computation is efficient thanks to Sylvester's determinant identity. Problem: hard to invert in general.
- Radial: Modify the distribution around a center point $z_0$: $f(z) = z + \frac{\beta}{\alpha + \|z - z_0\|}(z - z_0)$. Problem: hard to invert.
- Coupling: partition $z = (z^A, z^B)$ and set $f(z) = [h(z^A, T(z^B)), z^B]$ where $h$, known as the coupling function, is a bijection in its first argument, and $T$ is a conditioner acting only on $z^B$. Typically the coupling function is element-wise, while $T$ can be arbitrarily complex and typically is a Neural Network (see the affine coupling sketch after this list).
- Autoregressive Flows: $y = f(z)$ where each output dimension depends on the previous ones, $y_t = h(z_t, T_t)$, where $h$ is a bijection in its first argument and the conditioner $T_t$ is an arbitrary function of either the previous inputs $z_{1:t-1}$ or the previous outputs $y_{1:t-1}$ ($T_1$ is constant); the choice determines which direction is sequential (see the autoregressive sketch after this list). E.g.
- Masked Autoregressive Flows (MAF): the conditioner acts on the previous outputs, $y_t = h(z_t, T_t(y_{1:t-1}))$, so the normalizing direction (and hence the likelihood) can be computed in parallel, but the generative direction must be computed sequentially, which makes sampling slow.
- Inverse Autoregressive Flows (IAF): the conditioner acts on the previous inputs, $y_t = h(z_t, T_t(z_{1:t-1}))$, so the generative direction is parallel and cheap, while the normalizing direction (needed to evaluate the density of arbitrary data) is sequential. Notice that while most flows focus on making the normalizing direction cheap (for cheap maximum-likelihood training), IAF instead makes generation cheap; this is handy in Stochastic Variational Inference, where one only needs the density of samples one has just generated, whereas MAFs should be preferred for density estimation.
- Residual: $f(z) = z + F(z)$ where $F$ is a feed-forward neural network. Motivation: the residual connection is a discretization of a first-order ODE $\dot{x}_t = F(x_t, \theta_t)$. Problem: there is no closed-form inverse, but the inverse is guaranteed to exist when $F$ is a contraction ($\mathrm{Lip}(F) < 1$) and can then be found via fixed-point iterations (see the sketch after this list).
- Continuous: Instead of discretizing the ODE like residual flows, let $\Phi^t(z)$ be the solution of the ODE at time $t$ with initial condition $x(0) = z$. The family of maps $\{\Phi^t\}$ is a one-parameter group of diffeomorphisms of $\mathbb{R}^d$ parametrized by $t\in [0, 1]$, also known as a smooth flow.
- Neural ODE (NODE): Model $y = f(z)$ using $f=\Phi^{1}$: an infinitely deep NN with input $z$ and continuous weights $\theta(t)$. Invertibility follows from existence and uniqueness of solutions of the ODE. Training uses the adjoint sensitivity method (i.e. backprop in continuous time).
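The Householder layer mentioned in the linear-flows item above is simple enough to spell out; here is a minimal sketch (function names are mine, chosen for illustration):

```python
import numpy as np

def householder_forward(z, v):
    """Apply H z with H = I - 2 v v^T / ||v||^2 (an orthogonal reflection).
    z has shape (batch, d), v has shape (d,)."""
    coeff = 2.0 * (z @ v) / (v @ v)        # (batch,)
    return z - np.outer(coeff, v)          # H z, row-wise

def householder_inverse(y, v):
    """H is its own inverse (H H = I), so the inverse is the same reflection."""
    return householder_forward(y, v)

# |det H| = 1 for any reflection, so log|det J| = 0;
# a flow stacks several such layers with different vectors v.
```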
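A planar layer, as referenced in the list above, together with its log-determinant via the matrix determinant lemma; a sketch under the usual choice $h=\tanh$ (again, names are mine):

```python
import numpy as np

def planar_forward(z, u, w, b):
    """f(z) = z + u * tanh(w^T z + b), applied row-wise to a batch."""
    a = z @ w + b                          # (batch,)
    return z + np.outer(np.tanh(a), u)     # (batch, d)

def planar_log_det(z, u, w, b):
    """J_f = I + u h'(a) w^T is a rank-one update of the identity, so the
    matrix determinant lemma gives det J_f = 1 + h'(a) u^T w."""
    a = z @ w + b
    h_prime = 1.0 - np.tanh(a) ** 2
    # A sufficient condition for invertibility with h = tanh is w^T u >= -1.
    return np.log(np.abs(1.0 + h_prime * (u @ w)))
```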
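A RealNVP-style affine coupling layer is a common concrete instance of the coupling item above; a minimal sketch, assuming an even number of dimensions split in half, with a made-up element-wise conditioner standing in for a real neural network:

```python
import numpy as np

def conditioner(z_b):
    """Stand-in for a neural network: maps z_B to a log-scale s and shift t
    of the same shape as z_A (here z is split in half, so shapes match)."""
    return np.tanh(z_b), z_b ** 2

def coupling_forward(z):
    """y_A = z_A * exp(s(z_B)) + t(z_B),  y_B = z_B  (element-wise affine h)."""
    d = z.shape[1] // 2
    z_a, z_b = z[:, :d], z[:, d:]
    s, t = conditioner(z_b)
    return np.concatenate([z_a * np.exp(s) + t, z_b], axis=1)

def coupling_inverse(y):
    """The inverse is explicit because y_B = z_B is left untouched."""
    d = y.shape[1] // 2
    y_a, y_b = y[:, :d], y[:, d:]
    s, t = conditioner(y_b)
    return np.concatenate([(y_a - t) * np.exp(-s), y_b], axis=1)

def coupling_log_det(z):
    """log |det J_f(z)| = sum over dimensions of the log-scales s(z_B)."""
    s, _ = conditioner(z[:, z.shape[1] // 2:])
    return np.sum(s, axis=1)
```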
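To make the MAF/IAF asymmetry in the autoregressive item concrete, here is a sketch of an affine autoregressive bijection whose conditioner (a made-up stand-in, not a real masked network) looks at previous outputs $y_{1:t-1}$: the normalizing direction only needs the observed $y$ (parallelizable in principle, written as a loop for clarity), while sampling must loop over dimensions.

```python
import numpy as np

def cond(prev):
    """Stand-in conditioner: maps y_{1:t-1} (possibly empty) to (mu_t, log_sigma_t)."""
    if prev.size == 0:
        return 0.0, 0.0
    return np.mean(prev), 0.1 * np.tanh(np.sum(prev))

def maf_inverse(y):
    """Normalizing direction y -> z: every conditioner input is already observed."""
    z = np.empty_like(y)
    for t in range(len(y)):
        mu, log_sigma = cond(y[:t])
        z[t] = (y[t] - mu) * np.exp(-log_sigma)
    return z

def maf_forward(z):
    """Generative direction z -> y: inherently sequential, since the
    conditioner for y_t needs the previously generated y_{1:t-1}."""
    y = np.empty_like(z)
    for t in range(len(z)):
        mu, log_sigma = cond(y[:t])
        y[t] = z[t] * np.exp(log_sigma) + mu
    return y
```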
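Finally, the fixed-point inversion mentioned in the residual-flow item: if $F$ is a contraction ($\mathrm{Lip}(F) < 1$), the Banach fixed-point theorem guarantees that the iteration below converges to $f^{-1}(y)$. The residual function here is a made-up contraction rather than a trained network.

```python
import numpy as np

def residual_fn(z):
    """Stand-in for the feed-forward network F; the 0.5 factor keeps Lip(F) < 1."""
    return 0.5 * np.tanh(z)

def residual_forward(z):
    """f(z) = z + F(z)."""
    return z + residual_fn(z)

def residual_inverse(y, n_iter=50):
    """Solve z = y - F(z) by fixed-point iteration z_{k+1} = y - F(z_k);
    converges geometrically when F is a contraction."""
    z = y.copy()
    for _ in range(n_iter):
        z = y - residual_fn(z)
    return z
```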
References
- Improving Variational Auto-Encoders using convex combination linear Inverse Autoregressive Flow: linear flows with $A = \sum_{k=1}^K \omega_k A_k$.
- Improving Variational Auto-Encoders using Householder Flow: linear flows using orthogonal matrices constructed as products of $K$ Householder transformations.
- Emerging Convolutions for Generative Normalizing Flows: linear flows using the $QR$ decomposition with $Q$ constructed as a product of Householder transformations.