Normalizing Flows
General Idea
Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the data space, carrying an unknown data distribution $p_X$ from which our observations are drawn. In the Normalizing Flows framework, we introduce another fictional measurable space $\mathcal{Z} \subseteq \mathbb{R}^d$ carrying a simple base distribution $p_Z$ (typically a standard Gaussian), together with a diffeomorphism $T : \mathcal{Z} \to \mathcal{X}$, and we assume the data is generated as $x = T(z)$ with $z \sim p_Z$.

The aim in Normalizing Flows is to learn $T$ and $p_Z$ by minimizing some loss function, typically the negative log-likelihood of the observed data.

Notice that the assumption above means that the likelihood of a single observed data point $x$ is given by the change of variables formula:

$$ p_X(x) = p_Z\big(T^{-1}(x)\big)\,\big|\det J_{T^{-1}}(x)\big|. $$
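As a sanity check of the formula, here is a minimal sketch with an assumed element-wise map $T(z) = e^z$ (my choice for illustration), which pushes a standard Gaussian base forward to a log-normal distribution:

```python
import numpy as np

# log p_X(x) = log p_Z(T^{-1}(x)) + log |det J_{T^{-1}}(x)| for T(z) = exp(z)
def log_pZ(z):                      # standard Gaussian base density (log)
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def T_inv(x):                       # normalizing direction T^{-1}(x) = log(x)
    return np.log(x)

def log_abs_det_jac_T_inv(x):       # d/dx log(x) = 1/x, so log|det| = -log(x)
    return -np.log(x)

x = np.array([0.5, 1.0, 2.0])
log_px = log_pZ(T_inv(x)) + log_abs_det_jac_T_inv(x)
# matches the log-normal log-density: -log(x) - 0.5*log(x)**2 - 0.5*log(2*pi)
```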
Notice that the training phase and the data generation phase are very different, because training requires the normalizing direction ($T^{-1} : \mathcal{X} \to \mathcal{Z}$) while generation requires the generative direction ($T : \mathcal{Z} \to \mathcal{X}$):

- Training: requires repeated likelihood evaluations, i.e. evaluating $T^{-1}$ and $\det J_{T^{-1}}$.
- Generation: requires sampling $z \sim p_Z$ and evaluating $T(z)$.

Depending on the application one may wish to model either direction efficiently.
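To make the two directions concrete, here is a minimal sketch assuming a one-dimensional affine flow $x = T(z) = \mu + e^{s} z$ with a standard Gaussian base (the flow, data, and hyperparameters are illustrative): training repeatedly evaluates $T^{-1}$ together with the log-determinant term, while generation only evaluates $T$ on fresh base samples.

```python
import math
import torch

# Illustrative 1D affine flow x = T(z) = mu + exp(log_sigma) * z, base p_Z = N(0, 1).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

data = 3.0 + 0.5 * torch.randn(1024)            # toy observed data

for _ in range(2000):
    # Training: normalizing direction z = T^{-1}(x), plus log|det J_{T^{-1}}| = -log_sigma.
    z = (data - mu) * torch.exp(-log_sigma)
    log_px = -0.5 * z**2 - 0.5 * math.log(2 * math.pi) - log_sigma
    loss = -log_px.mean()                       # negative log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generation: generative direction x = T(z) on fresh base samples.
with torch.no_grad():
    samples = mu + torch.exp(log_sigma) * torch.randn(4096)
```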
Types of Flows
- Element-wise: $T(z) = \big(h(z_1), \dots, h(z_d)\big)$, where $h : \mathbb{R} \to \mathbb{R}$ is non-linear and invertible, applied to each dimension separately. Problem: no correlation between dimensions.
- Linear: $T(z) = Az + b$ with $A \in \mathbb{R}^{d \times d}$ invertible, but $\det A$ is expensive to compute for large $d$. Options: $A$ triangular, or a convex combination of triangular matrices. Problem: sensitive to the ordering of dimensions; a permutation can be used to remove the ordering. Problem: the permutation must be fixed by the user, which limits flexibility; alternatively $A$ orthogonal, e.g. Householder Flows, where one layer performs a Householder reflection.
- LU factorization: $A = PLU$, where $L$ is lower-triangular with ones on the diagonal, $U$ is upper-triangular with non-zero diagonal entries, and $P$ is a permutation matrix. Again, $P$ must be fixed by the user.
- QR decomposition: $A = QR$, where the parametrization is chosen to stabilize the decomposition, $Q$ is orthogonal and $R$ is upper-triangular. Basically, to avoid fixing a permutation matrix, they use a QR decomposition where $Q$ is modelled through Householder transforms.
- Convolutions: several approaches, but all try to make convolutions whose inverse and determinant are tractable.
- Planar: they stretch and contract the distribution along some directions, $T(z) = z + u\, h(w^\top z + b)$, where $h$ is a smooth non-linearity. Basically it is a one-layer NN with a single hidden unit and a residual connection (a minimal sketch is given after this list). Problem: a single hidden unit means limited expressivity.
- Sylvester: $T(z) = z + U\, h(W^\top z + b)$, where $U$ and $W$ are $d \times m$ matrices and $m$ is the number of "hidden units". When $m$ is small, the determinant computation is efficient. Problem: hard to invert in general.
- Radial: modify the distribution around a center point $z_0$, $T(z) = z + \frac{\beta}{\alpha + \lVert z - z_0 \rVert}\,(z - z_0)$. Problem: hard to invert.
- Coupling: $T(z) = \big(h(z_A;\, \Theta(z_B)),\; z_B\big)$ with $z = (z_A, z_B)$, where $h$ is a bijection known as the coupling function and $\Theta$ is a conditioner acting only on $z_B$. Typically the coupling function is element-wise. $\Theta$ can be arbitrarily complex and typically it is a Neural Network (a minimal sketch is given after this list).
- Autoregressive Flows: $z_i = h\big(x_i;\, \Theta_i(x_{1:i-1})\big)$, where each output dimension depends on the previous input entries, $h$ is a bijection parametrized by $\Theta_i$, the $\Theta_i$ are arbitrary functions and $\Theta_1$ is constant. E.g.
  - Masked Autoregressive Flows (MAF): computing the inverse is challenging since it requires computing it sequentially.
  - Inverse Autoregressive Flows (IAF): here the forward computation is sequential, but the inverse is cheap. Notice that while most flows focus on having the normalizing direction be cheap (for cheap training), IAF instead models the generative direction, so that training is expensive but generation is cheap. This is handy in Stochastic Variational Inference, whereas MAFs should be preferred in density estimation.
- Residual: $T(z) = z + F(z)$, where $F$ is a FFNN. Motivation: the residual connection is the discretization of a first-order ODE $\frac{\mathrm{d}z(t)}{\mathrm{d}t} = F(z(t))$. Problem: no closed-form inverse, but it can be found via fixed-point iterations.
- Continuous: instead of discretizing the ODE like residual flows, let $z(t)$ be the solution at time $t$ and $z(0)$ be the initial condition. The map $\Phi^t : z(0) \mapsto z(t)$ is a group of diffeomorphisms on $\mathbb{R}^d$ parametrized by $t$, also known as a smooth flow.
  - Neural ODE (NODE): model $\frac{\mathrm{d}z(t)}{\mathrm{d}t} = F(z(t), \theta(t))$ using a neural network: an infinitely deep NN with input $z(0)$ and continuous weights $\theta(t)$. Invertibility follows from existence and uniqueness of solutions of the ODE. They are trained using the adjoint sensitivity method (i.e. backprop in continuous time).
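As referenced in the Planar bullet above, here is a minimal sketch of a single planar layer and its log-determinant via the matrix determinant lemma. Parameter values are arbitrary, and the constraint $u^\top w \ge -1$ needed for invertibility with $h = \tanh$ is not enforced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
u, w, b = rng.normal(size=d), rng.normal(size=d), 0.1   # illustrative parameters

def planar_forward(z):
    a = np.tanh(w @ z + b)                 # h(w^T z + b), h = tanh
    x = z + u * a                          # T(z) = z + u h(w^T z + b)
    # matrix determinant lemma: det(I + u psi^T) = 1 + u^T psi,
    # with psi = h'(w^T z + b) * w and h' = 1 - tanh^2
    log_det = np.log(np.abs(1.0 + (1.0 - a**2) * (u @ w)))
    return x, log_det

x, log_det = planar_forward(rng.normal(size=d))
```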
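And the sketch referenced in the Coupling bullet: an affine, element-wise coupling function with a toy conditioner standing in for a neural network. The names, the half-split of $z$, and the conditioner itself are purely illustrative.

```python
import numpy as np

def conditioner(z_B):
    # stand-in for an arbitrary neural network acting only on z_B
    s = np.tanh(z_B)            # log-scales
    t = z_B ** 2                # shifts
    return s, t

def coupling_forward(z):
    z_A, z_B = np.split(z, 2)
    s, t = conditioner(z_B)
    x_A = z_A * np.exp(s) + t   # element-wise affine coupling function h
    log_det = np.sum(s)         # Jacobian is triangular; det = prod(exp(s))
    return np.concatenate([x_A, z_B]), log_det

def coupling_inverse(x):
    x_A, x_B = np.split(x, 2)
    s, t = conditioner(x_B)     # x_B == z_B, so the conditioner can be recomputed
    z_A = (x_A - t) * np.exp(-s)
    return np.concatenate([z_A, x_B])

z = np.random.default_rng(0).normal(size=4)
x, log_det = coupling_forward(z)
assert np.allclose(coupling_inverse(x), z)
```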
References
- Improving Variational Auto-Encoders using convex combination linear Inverse Autoregressive Flow: linear flow with $A$ given by a convex combination of triangular matrices.
- Improving Variational Auto-Encoders using Householder Flow: linear flow using orthogonal matrices that are constructed as products of Householder transformations.
- Emerging Convolutions for Generative Normalizing Flows: linear flows using the QR decomposition, with $Q$ constructed using products of Householder transformations.