Improved Variational Inference with Inverse Autoregressive Flow
Annotated Paper Link: Google Drive
Notes for Kingma et al. (2016).
- The paper introduces Inverse Autoregressive Flow (IAF), a new type of normalizing flow built from autoregressive models that allows for efficient variational inference in high-dimensional latent spaces
- The flows rely on Gaussian autoregressive models: the mean and variance at each step are computed by a neural network that takes the previous steps as input.
- Sampling from this model:
- Sample a random noise vector \(\epsilon \sim \mathcal{N}(0, I)\), then apply the following to obtain the corresponding vector \(\mathbf{y}\):
- \(y_0 = \mu_0 + \sigma_0 \odot \epsilon_0\) for the first step
- \(y_t = \mu_t(y_{1:t-1}) + \sigma_t(y_{1:t-1}) \odot \epsilon_t\) for the subsequent steps where \(\mu_t\) and \(\sigma_t\) are functions of the previous steps
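The sequential sampling procedure above can be sketched as follows. This is a minimal illustration in numpy: the strictly lower-triangular linear maps `W_mu` and `W_s` are hypothetical stand-ins for the autoregressive networks \(\mu_t\) and \(\sigma_t\), which in the paper would be learned (e.g. MADE-style) networks.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # dimensionality (chosen for illustration)

# Stand-ins for the autoregressive networks: mu_t and sigma_t may depend
# only on y_{1:t-1}, enforced here by strictly lower-triangular weights.
W_mu = rng.normal(size=(D, D)) * np.tri(D, k=-1)
W_s = rng.normal(size=(D, D)) * np.tri(D, k=-1)

def mu(y):
    return W_mu @ y

def sigma(y):
    return np.exp(0.1 * (W_s @ y))  # exponentiate to keep scales positive

# Sequential sampling: each y_t needs y_{1:t-1}, so this loop cannot be
# parallelized over t.
eps = rng.normal(size=D)
y = np.zeros(D)
for t in range(D):
    y[t] = mu(y)[t] + sigma(y)[t] * eps[t]
```

Note that sampling is inherently sequential (D network evaluations), which is exactly why the paper uses the *inverse* direction inside the flow.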
- Applicability in normalizing flow:
- The inverse is easy to compute in parallel
- \(\mathbf{\epsilon} = (\mathbf{y} - \mu(\mathbf{y})) / \sigma(\mathbf{y})\) where subtraction and division are element-wise operations
- The Jacobian determinant is easy to compute
- \(\log \det \left| \frac{\partial \mathbf{\epsilon}}{\partial \mathbf{y}} \right| = \sum_{t=1}^D -\log \sigma_t(\mathbf{y}_{1:t-1})\) where \(D\) is the dimensionality of the latent space
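In contrast to sampling, the inverse and the log-determinant require no loop over dimensions. A sketch, reusing the same hypothetical strictly lower-triangular stand-ins for \(\mu\) and \(\sigma\):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4

# Same hypothetical autoregressive stand-ins as before.
W_mu = rng.normal(size=(D, D)) * np.tri(D, k=-1)
W_s = rng.normal(size=(D, D)) * np.tri(D, k=-1)

def mu(y):
    return W_mu @ y

def sigma(y):
    return np.exp(0.1 * (W_s @ y))

# Forward (sequential) sampling, for reference.
eps = rng.normal(size=D)
y = np.zeros(D)
for t in range(D):
    y[t] = mu(y)[t] + sigma(y)[t] * eps[t]

# Inverse: a single vectorized pass given the full y -- no loop over t.
eps_rec = (y - mu(y)) / sigma(y)

# The Jacobian d(eps)/d(y) is triangular, so its log-determinant is just
# the sum of the log-diagonal: -sum_t log sigma_t(y_{1:t-1}).
log_det = -np.sum(np.log(sigma(y)))
```

Because \(\mu\) and \(\sigma\) only consume \(\mathbf{y}\), one batched network evaluation yields all of \(\epsilon\) at once.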
- The Inverse Autoregressive Flow:
- Pass \(\mathbf{x}\) to an Encoder Neural Network to generate \(\mu_0\) and \(\sigma_0\) for the first step, plus a context \(h\) which will be used by the later transformation steps
- Initialize \(\mathbf{z}\) with \(\mathbf{z} = \mu_0 + \sigma_0 \odot \epsilon\) where \(\epsilon\) is a random noise vector with a simple distribution (e.g. \(\mathcal{N}(0, I)\))
- For \(t\) in \(1\) to \(T\) (number of flow steps):
- Pass \(h\) and \(\mathbf{z}\) to a transformation neural network to generate \(\mu_t\) and \(\sigma_t\)
- Update \(\mathbf{z}\) with \(\mathbf{z} = \mu_t + \sigma_t \odot \mathbf{z}\)
- The density of the final output is:
- \(\log q(\mathbf{z}_T | \mathbf{x}) = - \sum_{i=1}^D \left( \frac{1}{2} \epsilon_i^2 + \frac{1}{2} \log 2\pi + \sum_{t=0}^T \log \sigma_{t, i} \right)\)
- As with the underlying autoregressive model, each flow step is easy to compute in parallel across dimensions
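The full IAF procedure above can be sketched as follows. This is a toy numpy version: the encoder outputs and the per-step networks are replaced by fixed/random placeholders (in the paper they would be learned networks conditioned on \(h\) and \(\mathbf{z}\)).

```python
import numpy as np

rng = np.random.default_rng(1)
D, T = 4, 3  # latent dimensionality and number of flow steps (illustrative)

eps = rng.normal(size=D)

# Step 0: the encoder network would produce mu_0, sigma_0 and context h
# from the input x; fixed placeholders stand in for them here.
mu0, sigma0 = np.zeros(D), np.ones(D)
z = mu0 + sigma0 * eps

# Start the density with the base-Gaussian term and the t=0 scale term.
log_q = -np.sum(0.5 * eps**2 + 0.5 * np.log(2 * np.pi) + np.log(sigma0))

for t in range(1, T + 1):
    # A real IAF step runs an autoregressive network on (z, h) to get
    # mu_t and sigma_t; random per-step values stand in for that here.
    mu_t = rng.normal(size=D)
    sigma_t = np.exp(0.1 * rng.normal(size=D))
    # Update z and accumulate -log sigma_t into the density.
    z = mu_t + sigma_t * z
    log_q -= np.sum(np.log(sigma_t))
```

After the loop, `log_q` matches the closed-form density of \(\mathbf{z}_T\) given above, since each step's triangular Jacobian contributes \(-\sum_i \log \sigma_{t,i}\).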
- The IAF can be used as the variational posterior, \(q(\mathbf{z} | \mathbf{x})\), in a VAE.
- In the inverse step of the IAF (used during training to compute the KL divergence), the autoregressive model is applied in the reverse direction, which allows for efficient parallel computation. That is, you are given the full \(\mathbf{y}\) and compute \(\epsilon\) from the relevant portions of \(\mathbf{y}\), not the other way around. This is what makes training efficient. For generation, the flow model isn't used at all, since this is a VAE.