Language Model Beats Diffusion – Tokenizer is Key to Visual Generation

Annotated Paper Link: Google Drive

These are my notes for the Yu et al. (2024) paper.

Introduction

The contributions of this paper are:

  • New Video Tokenizer
  • Novel lookup-free quantization (LFQ) approach
  • First evidence that a language model can outperform diffusion models in visual generation tasks.
  • A video compressor better than HEVC and close to VVC.

This paper shows that a good visual representation of images is the secret to making language models work for visual generation. The authors show this by creating a new video/image tokenizer that beats diffusion models trained on the same data with similar budgets on datasets like ImageNet.

This paper is very close to MAGVIT (Yu et al. (2023)) and VQ-VAE (Oord et al. (2018)).

Method

In a typical visual tokenization setup, the encoder \(E\) maps a video \(V \in \mathbb{R}^{T\times H\times W\times 3}\) to a latent representation \(Z \in \mathbb{R}^{T'\times H'\times W'\times d}\), where \(d\) is the embedding dimension. The quantizer \(q\) then maps each latent vector to an entry of a discrete codebook \(C \in \mathbb{R}^{K\times d}\), where \(K\) is the codebook size. The decoder \(D\) reconstructs the video from the quantized representation.
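As a shape-level illustration of this standard VQ pipeline (all sizes here are hypothetical stand-ins, and random arrays replace the learned encoder/decoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to illustrate the shapes involved.
T, H, W = 16, 128, 128          # input video: frames x height x width x RGB
T2, H2, W2, d = 4, 16, 16, 8    # latent grid and embedding dimension
K = 1024                        # codebook size

V = rng.random((T, H, W, 3))    # video V
Z = rng.standard_normal((T2, H2, W2, d))  # latent Z = E(V), stand-in values
C = rng.standard_normal((K, d))           # codebook C

# VQ step: each latent vector is assigned its nearest codebook entry,
# which requires a distance computation against all K entries.
dists = ((Z[..., None, :] - C) ** 2).sum(-1)  # (T', H', W', K)
tokens = dists.argmin(-1)                     # discrete token grid (T', H', W')
```

The `argmin` over all \(K\) entries is the lookup that LFQ, described next, removes.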

Lookup-Free Quantization (LFQ)

A common misconception is that improving reconstruction equates to improving the generation quality of the language model. The paper shows evidence of the opposite. To avoid a drop in generation quality, a common approach to accommodate a larger codebook size is decreasing the embedding dimension.

LFQ builds on this approach and squashes the output to a single integer. That is, \(C\) is an integer set with \(|C| = K\). While in the VQ-VAE model the quantizer must look up all \(K\) \(d\)-dimensional codebook entries to find the closest one, LFQ eliminates the need for such an embedding lookup.

I’ve annotated the section of the paper that describes the details of the LFQ approach with the text “LFQ Explanation”.

The variant described in the paper is the simplest form, with independent binary dimensions.
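A minimal sketch of this binary-dimension LFQ in plain NumPy (function name and shapes are my own, not the paper's):

```python
import numpy as np

def lfq_quantize(z):
    """Lookup-free quantization of a latent vector z of shape (log2(K),).

    Each dimension is quantized independently to {-1, +1} by its sign,
    so the token index is read directly off the sign bits -- no
    nearest-neighbour search over a K x d codebook is needed.
    """
    bits = (z > 0).astype(np.int64)     # per-dimension binary code
    q = 2.0 * bits - 1.0                # quantized vector in {-1, +1}^d
    # Interpret the bits as a base-2 integer to get the token index.
    index = int((bits * (2 ** np.arange(len(bits)))).sum())
    return q, index

# Example: a 3-dimensional latent gives a codebook of size K = 2^3 = 8.
q, idx = lfq_quantize(np.array([0.3, -1.2, 0.7]))  # bits 1,0,1 -> index 5
```

Because the code is read off dimension-wise, the cost per token is linear in \(\log_2 K\) rather than in \(K\), which is what lets the codebook grow large without an expensive lookup.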

Visual Tokenizer Model Improvement

The paper describes multiple updates to the architecture of the visual tokenizer that they have made.

Honestly, I don’t fully understand all the details, but have annotated some of the relevant sections. I will avoid writing details about them for the time being.

Experiments

They performed experiments on 3 different tasks:

  • Visual Generation
  • Video Compression
  • Video Understanding