Autoregressive Image Generation without Vector Quantization

Annotated Paper Link: Google Drive Link

These are notes for Li et al. (2024).

This paper shows that autoregressive image generation does not require discrete tokens: it operates on continuous representations (no vector quantization as in VQ-VAE) by using a small diffusion model to represent the per-token distribution \(p(x_i|z_i)\). The ARM's role is to produce the conditioning vector \(z_i\) from the previous tokens \(x_{<i}\).

Method

First of all, generative models don’t require a categorical probability distribution \(p(x|z)\). Instead, the distribution only needs two properties: (1) a loss function that measures the difference between the estimated distribution and the true distribution, and (2) a sampler that can draw \(x \sim p(x|z)\).
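As a toy illustration of these two requirements (my own example, not from the paper), a simple Gaussian output head satisfies both: its negative log-likelihood serves as the loss, and sampling is just mean plus noise.

```python
import numpy as np

class GaussianHead:
    """Toy continuous token distribution p(x|z) = N(x; mu(z), I).

    Illustrates the two required ingredients: a loss and a sampler.
    """

    def __init__(self, dim):
        # Hypothetical linear map from conditioning vector z to the mean.
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def mu(self, z):
        return z @ self.W

    def loss(self, z, x):
        # Negative log-likelihood up to a constant: squared error to the mean.
        return float(np.mean((x - self.mu(z)) ** 2))

    def sample(self, z, rng):
        # Draw x ~ N(mu(z), I).
        return self.mu(z) + rng.standard_normal(z.shape)

rng = np.random.default_rng(1)
head = GaussianHead(dim=4)
z = rng.standard_normal((2, 4))
x = head.sample(z, rng)
print(x.shape)  # (2, 4)
```

The paper replaces this simple head with a diffusion model, which can represent far richer (e.g. multimodal) distributions while keeping the same loss/sampler interface.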

For diffusion, the loss can be defined as

\[ L(z, x) = \mathbb{E}_{\epsilon, t} \left[ \| \epsilon - \epsilon_\theta(x_t \mid t, z) \|^2 \right], \qquad x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \]

where \(\epsilon \sim \mathcal{N}(0, I)\) and \(t\) is a sampled timestep.

Sampling is done via the reverse diffusion process: start from Gaussian noise \(x_T \sim \mathcal{N}(0, I)\) and iteratively denoise, conditioned on \(z\).
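A minimal numpy sketch of this loss and one DDPM-style reverse step; `eps_model` here is a stand-in for the paper's learned noise predictor, and the linear beta schedule is the standard DDPM choice, not something specified in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
# Linear beta schedule and cumulative alpha_bar, as in standard DDPM.
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(x_t, z, t):
    # Stand-in noise predictor conditioned on z; a real model is a small
    # network taking (x_t, t, z). Here we return a z-dependent guess.
    return 0.1 * x_t + 0.1 * z

def diffusion_loss(z, x):
    """One Monte Carlo sample of E_{eps,t} ||eps - eps_theta(x_t | t, z)||^2."""
    t = rng.integers(T)
    eps = rng.standard_normal(x.shape)
    # Forward noising: x_t = sqrt(alpha_bar_t) x + sqrt(1 - alpha_bar_t) eps.
    x_t = np.sqrt(alpha_bar[t]) * x + np.sqrt(1.0 - alpha_bar[t]) * eps
    return float(np.mean((eps - eps_model(x_t, z, t)) ** 2))

def reverse_step(x_t, z, t):
    """One step x_t -> x_{t-1} of the DDPM reverse process."""
    eps_hat = eps_model(x_t, z, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

z = rng.standard_normal(8)
x = rng.standard_normal(8)
loss = diffusion_loss(z, x)
x_t = rng.standard_normal(8)
x_prev = reverse_step(x_t, z, T - 1)
```

Running the full sampler means iterating `reverse_step` from \(t = T-1\) down to \(t = 0\), starting from pure noise.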

We first produce \(z_i\) with a network \(z_i = f(x_1, \ldots, x_{i-1})\), where \(f\) is the autoregressive model, and then sample \(x_i\) from \(p(x_i|z_i)\). The diffusion loss is applied to \(p(x_i|z_i)\) and backpropagated through \(z_i\) into the ARM.
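Putting the pieces together, generation alternates between the AR network and the per-token sampler. A schematic sketch, where `f` and `sample_token` are placeholders (the real \(f\) is a Transformer and the real sampler is the reverse diffusion process):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TOKENS, STEPS = 4, 3, 50

def f(prev_tokens):
    # Placeholder autoregressive network: maps x_1..x_{i-1} to the
    # conditioning vector z_i. The paper uses a Transformer here.
    if not prev_tokens:
        return np.zeros(DIM)  # a learned start condition in practice
    return np.tanh(np.mean(prev_tokens, axis=0))

def sample_token(z):
    # Placeholder for the reverse diffusion sampler of p(x_i | z_i):
    # start from Gaussian noise and (here, trivially) denoise toward z.
    x = rng.standard_normal(DIM)
    for _ in range(STEPS):
        x = 0.9 * x + 0.1 * z
    return x

tokens = []
for i in range(N_TOKENS):
    z_i = f(tokens)          # z_i = f(x_1, ..., x_{i-1})
    x_i = sample_token(z_i)  # x_i ~ p(x_i | z_i)
    tokens.append(x_i)
```

At training time, only the forward noising and loss from the previous snippet are needed: the diffusion loss on each \(x_i\) is differentiable in \(z_i\), so gradients flow into \(f\).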