Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Annotated Paper Link: Google Drive

Notes for Tian et al. (2024)

This paper introduces Visual AutoRegressive modeling (VAR). The core idea is to use an autoregressive model (ARM) to predict the next resolution (scale) of an image instead of the next pixel value, so generation proceeds coarse to fine. The paper reports that this ARM surpasses the Diffusion Transformer (DiT) in image quality, inference speed, data efficiency, and scalability. The model also follows scaling laws like those seen in LLMs and shows zero-shot generalization (tested on image in-painting and class-conditional editing).
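To make the coarse-to-fine idea concrete, here is a minimal toy sketch of a next-scale generation loop. It is not the paper's implementation: `toy_predictor` is a hypothetical stand-in for the trained model (it just upsamples the previous map and adds noise), the scale schedule `(1, 2, 4, 8)` is an assumption chosen for illustration, and the maps hold plain floats rather than the discrete tokens a real model would use.

```python
import numpy as np


def upsample_nearest(x, size):
    """Nearest-neighbour resize of a square 2D array to (size, size)."""
    idx = np.arange(size) * x.shape[0] // size
    return x[np.ix_(idx, idx)]


def toy_predictor(previous_maps, size, rng):
    """Hypothetical stand-in for the autoregressive model.

    Receives all coarser maps generated so far and returns the whole
    (size, size) map for the next scale in a single step. This toy only
    looks at the most recent (finest) map.
    """
    if not previous_maps:
        return rng.normal(size=(size, size))
    coarse = upsample_nearest(previous_maps[-1], size)
    return coarse + 0.1 * rng.normal(size=(size, size))


def generate_coarse_to_fine(scales=(1, 2, 4, 8), seed=0):
    """Generate one map per scale, each conditioned on the coarser ones."""
    rng = np.random.default_rng(seed)
    maps = []
    for size in scales:
        maps.append(toy_predictor(maps, size, rng))
    return maps


if __name__ == "__main__":
    maps = generate_coarse_to_fine()
    print([m.shape for m in maps])
```

With this schedule the loop runs 4 autoregressive steps to reach an 8×8 map, whereas predicting one pixel at a time would take 64 steps. That difference in step count gives some intuition for the reported inference-speed advantage.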