Normalizing Flows

General Idea

Let $(Y, \mathcal{Y})$ be a measurable space and $\pi$ a probability distribution on it, which we call the data distribution. Let $Y_1, \ldots, Y_N$ be IID random variables on this space distributed according to $\pi$, for which we observe realizations $y_1, \ldots, y_N$. Our aim is to generate novel samples that are approximately distributed according to $\pi$.

In the Normalizing Flows framework, we introduce another, fictitious measurable space $(Z, \mathcal{Z})$ with a family of probability measures $\{\nu_\phi : \phi \in \Phi\}$ on it, and a family of bijective functions $\{f_\theta : \theta \in \Theta\}$, such that the data-generating process works as follows:

$$z \sim \nu_\phi, \qquad y = f_\theta(z),$$ where $\phi \in \Phi$ and $\theta \in \Theta$. That is, the data we observe is generated by first sampling from a fixed distribution in the family $\{\nu_\phi : \phi \in \Phi\}$ and then feeding this sample through a fixed bijection in the family $\{f_\theta : \theta \in \Theta\}$.

The aim in Normalizing Flows is to learn $\phi$ and $\theta$ by minimizing some loss function, typically the negative log-likelihood of the observed data.

Notice that the assumption above means that the likelihood for a single observed data point $y_i$ is the pushforward of $\nu_\phi$ by $f_\theta$, i.e. $f_{\theta\#}\nu_\phi = \nu_\phi \circ f_\theta^{-1}$, and so the likelihood of the whole dataset is $\prod_{i=1}^{N} p_{\theta,\phi}(y_i)$, where $p_{\theta,\phi}$ denotes the density of $f_{\theta\#}\nu_\phi$.
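For concreteness, assuming $Y = Z = \mathbb{R}^d$, that $\nu_\phi$ admits a density $p_{\nu_\phi}$, and that $f_\theta$ is a diffeomorphism, the change-of-variables formula gives this density explicitly:

$$\log p_{\theta,\phi}(y) = \log p_{\nu_\phi}\big(f_\theta^{-1}(y)\big) + \log\big|\det J_{f_\theta^{-1}}(y)\big|.$$

This is the quantity that gets evaluated repeatedly during training.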

Notice that the training phase and the data generation phase are quite different, because training requires the normalizing direction ($Y \to Z$) while generating new samples requires the generative direction ($Z \to Y$):

  1. Training: requires repeated likelihood evaluations, which in turn require $f_\theta^{-1}$ and $\log\big|\det J_{f_\theta^{-1}}\big|$.
  2. Generation: requires evaluating $f_\theta$.

Depending on the application, one may wish to model either $f$ or $f^{-1}$ directly, since whichever of the two is modelled, computing its inverse may be difficult in practice. Typically, for density estimation the Normalizing Flow network models $f^{-1}$, whereas for Variational Inference the inverse is not necessary and so one models $f$.
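As a concrete (and deliberately trivial) illustration of the two directions, here is a minimal sketch using a single element-wise affine bijection $f_\theta(z) = a \odot z + b$ with a standard normal base distribution; all names and values are purely illustrative and not tied to any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
a, b = np.exp(rng.normal(size=d)), rng.normal(size=d)   # theta = (a, b), with a > 0

def f(z):                        # generative direction: Z -> Y
    return a * z + b

def f_inv(y):                    # normalizing direction: Y -> Z
    return (y - b) / a

def log_likelihood(y):
    # change of variables with a standard-normal base distribution nu_phi;
    # here log|det J_{f^{-1}}(y)| = -sum(log a) is constant in y
    z = f_inv(y)
    log_base = -0.5 * np.sum(z ** 2) - 0.5 * d * np.log(2 * np.pi)
    return log_base - np.sum(np.log(a))

# Training would minimize the negative log-likelihood over theta (and phi),
# which only uses f_inv; generation only uses the forward map f:
y_new = f(rng.standard_normal(d))
print(log_likelihood(y_new))
```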

Types of Flows

  1. Element-wise: $f(z) = (h(z_1), \ldots, h(z_d))$ where $h : \mathbb{R} \to \mathbb{R}$ is an invertible non-linearity and $Z = \mathbb{R}^d$. Problem: no correlation between dimensions is modelled.
  2. Linear: $f(z) = Az + b$ with $A$ invertible, but $\det(A)$ is expensive to compute for large $d$. Options:
    • $A$ triangular, or a convex combination of $K$ triangular matrices, $A = \sum_{k=1}^{K} \omega_k A_k$. Problem: sensitive to the ordering of the dimensions.
    • $A$ a permutation matrix, to remove the dependence on ordering. Problem: it must be fixed by the user, so flexibility is limited.
    • $A$ orthogonal, e.g. Householder Flows, where one layer performs $\big(I - 2\frac{vv^\top}{\|v\|^2}\big)z$ (see the sketches after this list).
    • LU factorization: $f(z) = PLUz + b$ where $L$ is lower-triangular with ones on the diagonal, $U$ is upper-triangular with non-zero diagonal entries, and $P$ is a permutation matrix. Again, $P$ must be fixed by the user.
    • QR decomposition: $A = Q(R + \mathrm{diag}(s))$ where $\mathrm{diag}(s)$ is used to stabilize the decomposition, $Q = Q_1 Q_2 \cdots Q_n$ and $Q_i = I - 2\frac{v_i v_i^\top}{v_i^\top v_i}$. Basically, to avoid fixing a permutation matrix, a QR decomposition is used where $Q$ is modelled through Householder transformations.
    • Convolutions: several approaches exist, all aiming to construct convolutions whose inverse and determinant are tractable.
  3. Planar: stretch and contract the distribution along certain directions, $f(z) = z + u\,h(w^\top z + b)$ where $h$ is a smooth non-linearity. Basically it is a one-layer NN with a single hidden unit and a residual connection (see the sketches after this list). Problem: a single hidden unit means limited expressivity.
    • Sylvester: $f(z) = z + U\,h(W^\top z + b)$ where $U, W$ are $d \times m$ and $m$ is the number of “hidden units”. When $m$ is small, the determinant computation is efficient. Problem: hard to invert in general.
  4. Radial: modify the distribution around a centre point $z_0$: $f(z) = z + \frac{\beta}{\alpha + \|z - z_0\|}\,(z - z_0)$. Problem: hard to invert.
  5. Coupling: $f(z) = \big[h\big(z_A, T(z_B)\big),\, z_B\big]$ where $h$ is a bijection known as the coupling function and $T$ is a conditioner acting only on $z_B$. Typically the coupling function is element-wise, while $T$ can be arbitrarily complex and is usually a Neural Network (see the sketches after this list).
  6. Autoregressive Flows: $y = f(z)$ where each output dimension $y_t$ depends on the previous entries, $y_t = h\big(z_t, T_t(z_{1:t-1})\big)$, where $h$ is a bijection parametrized by $\theta$, the $T_t$ are arbitrary functions and $T_1$ is constant. E.g.
    • Masked Autoregressive Flows (MAF): computing the inverse is expensive, since it must be done sequentially, one dimension at a time.
    • Inverse Autoregressive Flows (IAF): here the normalizing computation is sequential, but the generative direction is cheap. Notice that while most flows focus on making the normalizing direction cheap (for cheap training), IAF instead models the generative direction, so that training is expensive but generation is cheap. This is handy in Stochastic Variational Inference, whereas MAFs should be preferred for density estimation.
  7. Residual: $f(z) = z + F(z)$ where $F$ is a feed-forward NN. Motivation: a residual connection is the discretization of a first-order ODE $\dot{x}_t = F(x_t, \theta_t)$. Problem: no closed-form inverse, but the inverse can be found via fixed-point iterations (provided $F$ is a contraction).
  8. Continuous: instead of discretizing the ODE as Residual Flows do, let $\Phi_t(z)$ be the solution of the ODE at time $t$ with initial condition $x(0) = z$. The family $\{\Phi_t\}_{t \in [0,1]}$ is a one-parameter group of diffeomorphisms on $\mathbb{R}^d$, also known as a smooth flow.
    • Neural ODE (NODE): model $y = f(z)$ using $f = \Phi_1$: an infinitely deep NN with input $z$ and continuous weights $\theta(t)$. Invertibility follows from existence and uniqueness of solutions of the ODE. Training uses the adjoint sensitivity method (i.e. backpropagation in continuous time).
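The sketches below are minimal, library-free illustrations of three of the flow types above; all parameter values are random and purely illustrative. First, one Householder layer (linear flows with $A$ orthogonal, item 2): the matrix $H = I - 2vv^\top/\|v\|^2$ is orthogonal and its own inverse, so the log-determinant term vanishes.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
v = rng.normal(size=d)
H = np.eye(d) - 2.0 * np.outer(v, v) / (v @ v)    # Householder reflection

z = rng.standard_normal(d)
y = H @ z                                         # one flow layer
assert np.isclose(np.abs(np.linalg.det(H)), 1.0)  # |det H| = 1, so log|det| = 0
assert np.allclose(H @ y, z)                      # H is an involution: H^{-1} = H
```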
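Next, a planar flow layer (item 3), $f(z) = z + u\,h(w^\top z + b)$ with $h = \tanh$; the log-determinant follows from the matrix determinant lemma, and invertibility requires the standard constraint $u^\top w \ge -1$, which this sketch does not enforce.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
u, w, b = rng.normal(size=d), rng.normal(size=d), 0.5

def planar(z):
    a = np.tanh(w @ z + b)
    y = z + u * a
    # det(I + h'(w^T z + b) u w^T) = 1 + h'(.) u^T w, with h'(x) = 1 - tanh(x)^2
    log_abs_det = np.log(np.abs(1.0 + (1.0 - a ** 2) * (u @ w)))
    return y, log_abs_det

y, log_det = planar(rng.standard_normal(d))
```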
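Finally, an affine coupling layer (a common special case of item 5), where the conditioner $T$ is a small hypothetical MLP producing a log-scale and a shift for $z_A$. The Jacobian is triangular, so its log-determinant is just the sum of the log-scales, and the inverse is cheap because the unchanged half $z_B$ lets us recompute the conditioner.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
W1, b1 = rng.normal(size=(8, d // 2)), np.zeros(8)
W2, b2 = rng.normal(size=(d, 8)), np.zeros(d)

def conditioner(z_b):
    # hypothetical one-hidden-layer MLP acting only on z_B
    h = np.tanh(W1 @ z_b + b1)
    out = W2 @ h + b2
    return out[: d // 2], out[d // 2:]                # log_scale, shift for z_A

def forward(z):
    z_a, z_b = z[: d // 2], z[d // 2:]
    log_s, t = conditioner(z_b)
    y_a = z_a * np.exp(log_s) + t                     # element-wise coupling function h
    return np.concatenate([y_a, z_b]), np.sum(log_s)  # log|det J_f| = sum(log_s)

def inverse(y):
    y_a, y_b = y[: d // 2], y[d // 2:]
    log_s, t = conditioner(y_b)                       # y_B == z_B, so T is recomputable
    return np.concatenate([(y_a - t) * np.exp(-log_s), y_b])

z = rng.standard_normal(d)
y, log_det = forward(z)
assert np.allclose(inverse(y), z)                     # bijectivity check
```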

References

  1. Improving Variational Auto-Encoders using convex combination linear Inverse Autoregressive Flow: linear flow with $A = \sum_{k=1}^{K} \omega_k A_k$.
  2. Improving Variational Auto-Encoders using Householder Flow: linear flow using orthogonal matrices constructed as products of $K$ Householder transformations.
  3. Emerging Convolutions for Generative Normalizing Flows: linear flows using the QR decomposition, with $Q$ constructed as a product of Householder transformations.