Autoregressive Image Generation without Vector Quantization

Annotated Paper Link: Google Drive Link

These are notes for Li et al. (2024).

This paper shows that autoregressive image generation does not require discrete tokens: it operates on continuous representations (no vector quantization as in VQ-VAE) by using a small diffusion model to represent the per-token distribution \(p(x_i|z_i)\). The ARM's role is to produce the conditioning vector \(z_i\) from the previous tokens \(x_{<i}\).

Method

First of all, generative models don’t require a categorical probability distribution \(p(x|z)\). Instead, the distribution only needs two properties: (1) a loss function that measures the difference between the estimated distribution and the true distribution, and (2) a sampler that can draw \(x \sim p(x|z)\).
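As a toy illustration of these two requirements (my own example, not from the paper), a simple Gaussian output head satisfies both: its negative log-likelihood serves as the loss, and sampling is just mean plus noise.

```python
import numpy as np

class GaussianHead:
    """Toy continuous token distribution p(x|z) = N(x; mu(z), I).

    Illustrates the two required ingredients: a loss and a sampler.
    """

    def __init__(self, dim):
        # Hypothetical linear map from conditioning vector z to the mean.
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def mu(self, z):
        return z @ self.W

    def loss(self, z, x):
        # Negative log-likelihood up to a constant: squared error to the mean.
        return float(np.mean((x - self.mu(z)) ** 2))

    def sample(self, z, rng):
        # Draw x ~ N(mu(z), I).
        return self.mu(z) + rng.standard_normal(z.shape)

rng = np.random.default_rng(1)
head = GaussianHead(dim=4)
z = rng.standard_normal((2, 4))
x = head.sample(z, rng)
print(x.shape)  # (2, 4)
```

The paper replaces this simple head with a diffusion model, which can represent far richer (e.g. multimodal) distributions while keeping the same loss/sampler interface.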

For diffusion, the loss can be defined as

\[ L(z, x) = \mathbb{E}_{\epsilon, t} \left[ \| \epsilon - \epsilon_\theta(x_t \mid t, z) \|^2 \right], \qquad x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \]

where \(\epsilon \sim \mathcal{N}(0, I)\) and \(t\) is a sampled timestep.

Sampling is done via the reverse diffusion process: start from Gaussian noise \(x_T \sim \mathcal{N}(0, I)\) and iteratively denoise, conditioned on \(z\).
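A minimal numpy sketch of this loss and one DDPM-style reverse step; `eps_model` here is a stand-in for the paper's learned noise predictor, and the linear beta schedule is the standard DDPM choice, not something specified in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
# Linear beta schedule and cumulative alpha_bar, as in standard DDPM.
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(x_t, z, t):
    # Stand-in noise predictor conditioned on z; a real model is a small
    # network taking (x_t, t, z). Here we return a z-dependent guess.
    return 0.1 * x_t + 0.1 * z

def diffusion_loss(z, x):
    """One Monte Carlo sample of E_{eps,t} ||eps - eps_theta(x_t | t, z)||^2."""
    t = rng.integers(T)
    eps = rng.standard_normal(x.shape)
    # Forward noising: x_t = sqrt(alpha_bar_t) x + sqrt(1 - alpha_bar_t) eps.
    x_t = np.sqrt(alpha_bar[t]) * x + np.sqrt(1.0 - alpha_bar[t]) * eps
    return float(np.mean((eps - eps_model(x_t, z, t)) ** 2))

def reverse_step(x_t, z, t):
    """One step x_t -> x_{t-1} of the DDPM reverse process."""
    eps_hat = eps_model(x_t, z, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

z = rng.standard_normal(8)
x = rng.standard_normal(8)
loss = diffusion_loss(z, x)
x_t = rng.standard_normal(8)
x_prev = reverse_step(x_t, z, T - 1)
```

Running the full sampler means iterating `reverse_step` from \(t = T-1\) down to \(t = 0\), starting from pure noise.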

We first produce \(z_i\) with a network \(z_i = f(x_1, \ldots, x_{i-1})\), where \(f\) is the autoregressive model, and then sample \(x_i\) from \(p(x_i|z_i)\). The diffusion loss is applied to \(p(x_i|z_i)\) and backpropagated through \(z_i\) into the ARM.
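Putting the pieces together, generation alternates between the AR network and the per-token sampler. A schematic sketch, where `f` and `sample_token` are placeholders (the real \(f\) is a Transformer and the real sampler is the reverse diffusion process):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TOKENS, STEPS = 4, 3, 50

def f(prev_tokens):
    # Placeholder autoregressive network: maps x_1..x_{i-1} to the
    # conditioning vector z_i. The paper uses a Transformer here.
    if not prev_tokens:
        return np.zeros(DIM)  # a learned start condition in practice
    return np.tanh(np.mean(prev_tokens, axis=0))

def sample_token(z):
    # Placeholder for the reverse diffusion sampler of p(x_i | z_i):
    # start from Gaussian noise and (here, trivially) denoise toward z.
    x = rng.standard_normal(DIM)
    for _ in range(STEPS):
        x = 0.9 * x + 0.1 * z
    return x

tokens = []
for i in range(N_TOKENS):
    z_i = f(tokens)          # z_i = f(x_1, ..., x_{i-1})
    x_i = sample_token(z_i)  # x_i ~ p(x_i | z_i)
    tokens.append(x_i)
```

At training time, only the forward noising and loss from the previous snippet are needed: the diffusion loss on each \(x_i\) is differentiable in \(z_i\), so gradients flow into \(f\).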