57 Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

These are notes for Zhou et al. (2024).

This paper introduces a multi-modal (text and image) model that uses a discrete-distribution loss for text (next-token prediction) and a continuous-distribution loss for images (diffusion loss). For text, the loss is calculated per token, while for images, it is calculated per image. The overall loss function is a weighted sum of the two losses.
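To make the weighted sum concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the weight parameter `lam`, and the use of a mean-squared error on predicted noise as the per-image diffusion term are all assumptions made for illustration.

```python
import math


def text_loss(token_log_probs):
    """Per-token loss: average negative log-likelihood of the correct next tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def image_loss(predicted_noise, true_noise):
    """Per-image loss: mean squared error between predicted and true noise.

    Assumption: a noise-prediction objective stands in for the diffusion loss.
    """
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted_noise, true_noise)]
    return sum(squared_errors) / len(squared_errors)


def combined_loss(token_log_probs, images, lam):
    """Weighted sum of the two losses: text term plus lam times the image term.

    `images` is a list of (predicted_noise, true_noise) pairs, one per image.
    `lam` is a hypothetical name for the weighting coefficient.
    """
    text_term = text_loss(token_log_probs)
    image_term = sum(image_loss(p, t) for p, t in images) / len(images)
    return text_term + lam * image_term


# Toy example: two text tokens and one two-value "image".
example_log_probs = [math.log(0.5), math.log(0.25)]
example_images = [([1.0, 0.0], [0.0, 0.0])]
example_total = combined_loss(example_log_probs, example_images, lam=2.0)
```

The text term is averaged over tokens and the image term over images, which mirrors the per-token versus per-image distinction above; `lam` controls how much the image term contributes relative to the text term.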