An Image is Worth 32 Tokens for Reconstruction and Generation
Annotated Paper Link: Google Drive
Notes for Yu et al. (2024)
This paper introduces TiTok, an efficient image tokenizer that represents an image as a compact 1D sequence of latent tokens (as few as 32) rather than a 2D grid of patch tokens. Because the latent sequence is so short, sampling is substantially faster than with 2D tokenizers, and TiTok achieves competitive or better generation quality on ImageNet.
Method
During tokenization, they patchify the image with a patch embedding layer, producing patch tokens \(\mathbf{P} \in \mathbb{R}^{\frac{H}{f} \times \frac{W}{f} \times D}\), and concatenate them with \(K\) learnable latent tokens \(\mathbf{L} \in \mathbb{R}^{K \times D}\). Both are fed into a ViT encoder \(Enc\), and only the latent-token positions of the encoder output are kept. Note that this decouples the latent size \(K\) from the image resolution. The image latent representation is
\[ \mathbf{Z}_{1D} = Enc(\mathbf{P} \oplus \mathbf{L}) \]
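The tokenization step can be sketched in PyTorch as follows. This is a minimal toy, not the paper's implementation: the shapes, layer sizes, and the use of `nn.TransformerEncoder` as a stand-in for the ViT encoder are all my own assumptions.

```python
import torch
import torch.nn as nn

# Toy sketch of TiTok's 1D tokenization (hypothetical sizes).
# Patch embeddings P are concatenated with K learnable latent tokens L,
# passed through a ViT-style encoder, and only the K latent positions
# of the output are kept as Z_1D.

H = W = 32          # toy image size
f = 8               # patch size, so (H/f)*(W/f) = 16 patch tokens
K, D = 4, 64        # number of latent tokens and embedding dim

patch_embed = nn.Conv2d(3, D, kernel_size=f, stride=f)   # patchify
latent_tokens = nn.Parameter(torch.randn(K, D))          # L
encoder = nn.TransformerEncoder(                         # stand-in for Enc
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)

def tokenize(img: torch.Tensor) -> torch.Tensor:
    """img: (B, 3, H, W) -> Z_1D: (B, K, D)."""
    B = img.shape[0]
    P = patch_embed(img).flatten(2).transpose(1, 2)      # (B, N, D)
    L = latent_tokens.unsqueeze(0).expand(B, -1, -1)     # (B, K, D)
    out = encoder(torch.cat([P, L], dim=1))              # (B, N+K, D)
    return out[:, -K:]                                   # keep only latent tokens
```

Note that the output shape depends only on \(K\) and \(D\), not on the image resolution, which is the decoupling the paper highlights.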
In the detokenization phase, they incorporate a sequence of mask tokens \(\mathbf{M} \in \mathbb{R}^{\frac{H}{f} \times \frac{W}{f} \times D}\). The image is then reconstructed using a ViT decoder \(Dec\) as
\[ \hat{\mathbf{I}} = Dec(Quant(\mathbf{Z}_{1D}) \oplus \mathbf{M}) \]
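A matching detokenization sketch, again with hypothetical sizes and a simple nearest-neighbour vector quantizer standing in for \(Quant\); the actual architecture and codebook details differ in the paper.

```python
import torch
import torch.nn as nn

# Toy sketch of TiTok detokenization: quantized latents Quant(Z_1D) are
# concatenated with (H/f)*(W/f) learnable mask tokens M, and a ViT-style
# decoder regresses pixels from the mask-token outputs.

H = W = 32
f = 8
N = (H // f) * (W // f)    # 16 mask tokens, one per patch
K, D, V = 4, 64, 256       # latent tokens, embedding dim, codebook size

codebook = torch.randn(V, D)                              # VQ codebook
mask_tokens = nn.Parameter(torch.randn(N, D))             # M
decoder = nn.TransformerEncoder(                          # stand-in for Dec
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
to_pixels = nn.Linear(D, 3 * f * f)                       # patch -> pixels

def quantize(z: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour quantization of (B, K, D) latents."""
    idx = torch.cdist(z, codebook.expand(z.shape[0], -1, -1)).argmin(-1)
    return codebook[idx]

def detokenize(z: torch.Tensor) -> torch.Tensor:
    """z: (B, K, D) -> reconstructed image (B, 3, H, W)."""
    B = z.shape[0]
    M = mask_tokens.unsqueeze(0).expand(B, -1, -1)        # (B, N, D)
    out = decoder(torch.cat([quantize(z), M], dim=1))     # (B, K+N, D)
    patches = to_pixels(out[:, K:])                       # (B, N, 3*f*f)
    img = patches.view(B, H // f, W // f, 3, f, f)
    return img.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, H, W)
```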
For image generation, they rely on the MaskGIT generation framework, simply replacing its VQGAN tokenizer with TiTok.
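For context, MaskGIT-style generation decodes all token positions in parallel over a few steps, keeping the most confident predictions at each step and re-masking the rest on a cosine schedule. A toy sketch over TiTok's \(K\) latent token ids, with the transformer prior stubbed out by random logits (all names and sizes here are my own assumptions):

```python
import math
import torch

K, V, T = 32, 1024, 8        # latent tokens, codebook size, decoding steps
MASK = V                     # extra id marking a masked position

def predict_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the masked transformer prior: (B, K) -> (B, K, V)."""
    return torch.randn(tokens.shape[0], K, V)

def maskgit_sample(batch: int = 1) -> torch.Tensor:
    tokens = torch.full((batch, K), MASK)
    for t in range(T):
        probs = predict_logits(tokens).softmax(-1)
        conf, pred = probs.max(-1)                       # per-position confidence
        conf = torch.where(tokens == MASK, conf, torch.inf)  # keep decided tokens
        tokens = torch.where(tokens == MASK, pred, tokens)   # accept predictions
        # cosine schedule: re-mask the least confident positions
        n_mask = math.floor(K * math.cos(math.pi / 2 * (t + 1) / T))
        if n_mask > 0:
            lowest = conf.topk(n_mask, largest=False).indices
            tokens.scatter_(1, lowest, MASK)
    return tokens                                        # (B, K) token ids
```

Because TiTok's sequence has only \(K\) tokens instead of a full 2D grid (e.g. 256+ tokens for a VQGAN), each decoding step attends over a far shorter sequence, which is where the speedup comes from.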
They use a two-stage training technique: a warm-up stage in which TiTok is trained to predict proxy codes from a pretrained MaskGIT-VQGAN tokenizer rather than raw pixels, followed by a stage that fine-tunes the decoder on pixel reconstruction.