Background

Overview of Generative Models

The main task in generative modeling is to build a model \(P_{\theta}\), from a set of data points sampled from the true distribution \(P_{data}\), that approximates \(P_{data}\) (\(P_{\theta}\) should be as close as possible to \(P_{data}\) according to some distance measure).

High-Level View of Generative Models Tasks

We want to learn a probability distribution \(p(x)\) over data \(x\) (e.g., images, text, protein sequences). Such a distribution can be used for

  • Generation: Sample \(x_{new} \sim p(x)\)

  • Density Estimation: \(p(x)\) should be high when \(x\) resembles the training data and low otherwise (useful for anomaly detection)

  • Unsupervised Representation Learning: Feature extraction

The Curse of Dimensionality

Representing \(p(x)\) is difficult for high-dimensional data (such as images and text). It is easier for simpler data. For example, a categorical distribution (an \(n\)-sided die) can be parameterized using \(n-1\) parameters: one for the probability of each of the first \(n-1\) outcomes, with the last probability determined by the constraint that they sum to 1.

This isn’t the case for high-dimensional data. A single 8-bit RGB pixel, for example, can take \(256*256*256\) values, so its distribution alone needs \(256*256*256 - 1\) parameters. For a full image, the number of parameters required for the joint distribution grows exponentially with the number of pixels.
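To make these counts concrete, here is a small sketch (the 8-bit RGB pixel encoding is an assumption about the data; any tabular categorical distribution over \(k\) outcomes needs \(k-1\) free parameters):

```python
# Parameter counts for tabular (fully general) categorical distributions.
# A distribution over k outcomes needs k - 1 free parameters,
# because the probabilities must sum to 1.

def tabular_params(num_outcomes: int) -> int:
    """Free parameters of a categorical distribution over `num_outcomes` values."""
    return num_outcomes - 1

# A single 8-bit RGB pixel can take 256**3 distinct values.
print(f"one RGB pixel: {tabular_params(256 ** 3):,} parameters")  # 16,777,215

# The joint distribution over d such pixels has (256**3)**d outcomes,
# so the parameter count grows exponentially with the number of pixels.
for d in (1, 2, 3):
    print(f"{d} pixels: {tabular_params((256 ** 3) ** d):,} parameters")
```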

Independence Assumption

One way to simplify the structure is to assume independence

\[ p(x_1, ..., x_n) = p(x_1)p(x_2)...p(x_n) \]

For binary pixels, this would require only \(n\) parameters (because each \(p(x_i)\) needs only one parameter) to model all \(2^n\) possible images.

The problem is that this assumption is far too strong. For example, when generating an image, you cannot look at the value of any other pixel when sampling a given pixel.
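A minimal sketch of this fully independent model for binary images (the 4×4 image size, NumPy, and the randomly chosen parameters are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# n binary pixels, one Bernoulli parameter per pixel: n parameters in total.
n = 16  # e.g., a 4x4 binary image
theta = rng.uniform(size=n)  # p(x_i = 1); in practice estimated from data

# Sampling under p(x) = prod_i p(x_i): each pixel is drawn on its own,
# without looking at any other pixel, so samples tend to look like noise.
sample = (rng.uniform(size=n) < theta).astype(int)
print(sample.reshape(4, 4))

# Evaluating the log-density is just a sum of per-pixel terms.
log_p = np.sum(sample * np.log(theta) + (1 - sample) * np.log(1 - theta))
print("log p(x) =", log_p)
```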

Conditional Independence Assumptions

Using the chain rule, you can factorize a multivariate distribution as

\[ p(x_1, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1,x_2)...p(x_n|x_1,...,x_{n-1}) \]1

The number of parameters, however, is still exponential

\[ 1 + 2 + ... + 2^{n-1} = 2^n-1 \]2

This is calculated using the following logic (a quick numeric check follows the list)

  • \(p(x_1)\) needs 1 parameter

  • \(p(x_2|x_1 = 0)\) needs 1 parameter and \(p(x_2|x_1 = 1)\) needs 1 parameter for a total of 2 parameters

  • \(...\)
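A quick numeric check of this count, under the footnoted assumption that each \(x_i\) is binary:

```python
# For binary x_1..x_n, the table for p(x_i | x_1, ..., x_{i-1}) has 2**(i-1)
# rows, each holding one free parameter, so the chain-rule factorization needs
# 1 + 2 + 4 + ... + 2**(n-1) parameters in total.

def chain_rule_params(n: int) -> int:
    return sum(2 ** (i - 1) for i in range(1, n + 1))

for n in (2, 3, 10, 20):
    assert chain_rule_params(n) == 2 ** n - 1  # matches the closed form
    print(n, chain_rule_params(n))             # still exponential in n
```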

Assuming conditional independence (\(X_{i+1} \perp \{X_1, ..., X_{i-1}\} | X_i\)) can simplify the math

\[ p(x_1, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_2)...p(x_n|x_{n-1}) \]

This makes the number of required parameters linear in \(n\): \(1 + 2(n-1) = 2n-1\).
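A sketch of the resulting first-order Markov model over binary variables (the parameter layout and random initialization are mine; the count matches the \(2n-1\) above):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10
p_x1 = rng.uniform()                   # 1 parameter: p(x_1 = 1)
p_next = rng.uniform(size=(n - 1, 2))  # p(x_{i+1} = 1 | x_i in {0, 1}): 2(n-1) parameters

# Total free parameters: 1 + 2(n - 1) = 2n - 1, linear in n.
print("parameters:", 1 + 2 * (n - 1))

# Sampling walks left to right; each variable conditions only on its predecessor.
x = [int(rng.uniform() < p_x1)]
for i in range(n - 1):
    x.append(int(rng.uniform() < p_next[i, x[-1]]))
print(x)
```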

However, these models are still weak; we need more context to make better estimates. In language modeling, for example, the previous word alone isn’t enough to predict the next word.

Bayesian Network

A Bayesian network adds more flexibility relative to the previous model by letting each variable depend on a set of other variables (not strictly a single variable). That is, it specifies \(p(x_i|x_{A_i})\), where \(x_{A_i}\) is a set of random variables. In this case, the joint parameterization is

\[ p(x_1, ..., x_n) = \prod_i p(x_i|x_{A_i}) \]

More formally, a Bayesian Network is a DAG where the nodes are random variables and the edges are the dependence relationships between these variables. Each node has one conditional probability distribution specifying the variable’s probability conditioned on its parents’ values.

Now the number of parameters is exponential in the number of parents, not the number of variables.
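A small sketch of evaluating such a joint on a toy network (the classic rain/sprinkler/wet-grass structure and its probabilities are illustrative, not from the text):

```python
# Toy Bayesian network over binary variables: Rain -> WetGrass <- Sprinkler.
# Each node stores p(node = 1 | parents); the joint is the product of the CPDs.
# WetGrass has 2 parents, so its table needs 2**2 = 4 parameters: exponential
# in the number of parents, not in the total number of variables.

p_rain = 0.2
p_sprinkler = 0.5
p_wet = {(0, 0): 0.01, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.99}  # p(wet=1 | rain, sprinkler)

def joint(rain: int, sprinkler: int, wet: int) -> float:
    """p(rain, sprinkler, wet) = p(rain) * p(sprinkler) * p(wet | rain, sprinkler)."""
    pr = p_rain if rain else 1 - p_rain
    ps = p_sprinkler if sprinkler else 1 - p_sprinkler
    pw = p_wet[(rain, sprinkler)] if wet else 1 - p_wet[(rain, sprinkler)]
    return pr * ps * pw

# Sanity check: the joint sums to 1 over all 2**3 assignments.
print(sum(joint(r, s, w) for r in (0, 1) for s in (0, 1) for w in (0, 1)))  # 1.0
print(joint(1, 0, 1))  # p(rain, no sprinkler, wet grass)
```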

Discriminative vs Generative Models

To show the differences, we will use an example of using Naive Bayes for single-label prediction: classify emails as spam (\(Y=1\)) or not (\(Y=0\)). In this task, we are asked to model \(p(y, x_1, ..., x_n)\), where \(y\) is the label, \(1:n\) are the indices of the words in the vocabulary, and \(X_i\) is a random variable that is \(1\) if word \(i\) appears in the email and \(0\) otherwise.

What Naive Bayes does is assume that words are conditionally independent given \(Y\), making the joint distribution

\[ p(y, x_1, ..., x_n) = p(y) \prod_{i=1}^n p(x_i | y) \]

After estimating the parameters from the training data, prediction is done using the Bayes rule

\[ p(Y=1 | x_1, ..., x_n) = \frac{p(Y=1) \prod_{i=1}^np(x_i|Y=1)}{\sum_{y\in\{0,1\}}p(Y=y)\prod_{i=1}^np(x_i|Y=y)} \]

The independence assumption made here is that, given the label, one word’s appearance in an email is independent of any other word’s appearance.
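A sketch of a Bernoulli Naive Bayes spam classifier along these lines (the toy dataset and the Laplace smoothing are my additions; the prediction step is the Bayes-rule formula above):

```python
import numpy as np

# Toy training data: rows are emails, columns are vocabulary words (1 = word appears).
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

# Generative training: estimate p(Y) and p(x_i = 1 | Y) by counting
# (Laplace smoothing avoids zero probabilities for unseen word/label pairs).
p_y1 = y.mean()
p_word = {c: (X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in (0, 1)}

def predict_spam_prob(x: np.ndarray) -> float:
    """Bayes rule: p(Y=1 | x) is proportional to p(Y=1) * prod_i p(x_i | Y=1)."""
    def score(c: int) -> float:
        prior = p_y1 if c == 1 else 1 - p_y1
        return prior * np.prod(np.where(x == 1, p_word[c], 1 - p_word[c]))
    return score(1) / (score(1) + score(0))

print(predict_spam_prob(np.array([1, 0, 0])))  # high: looks like the spam rows
```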

This model is a generative one. It attempts to calculate \(p(Y, \mathbf{X})\) using \(p(Y, \mathbf{X}) = p(Y)p(\mathbf{X}|Y)\). A discriminative model, on the other hand, uses \(p(Y, \mathbf{X}) = p(\mathbf{X})p(Y|\mathbf{X})\).

In the generative case, we need to learn both \(p(Y)\) and \(p(\mathbf{X}|Y)\) and then compute \(p(Y|\mathbf{X})\) using Bayes rule. In the discriminative case, you just need to estimate the conditional probability \(p(Y|\mathbf{X})\); you don’t have to model \(p(\mathbf{X})\) at all, while the generative model also has to learn the distribution of the input feature vector.

Discriminative models usually assume a functional form for the probability of the target conditioned on the features. That is,

\[ p(Y=1 | \mathbf{x}, \mathbf{\alpha}) = f(\mathbf{x}, \mathbf{\alpha}) \]

For example, logistic regression assumes \(f\) is a sigmoid applied to a linear function of the features. Neural networks make this more flexible by adding non-linearities.
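A minimal sketch of that functional form for logistic regression, with the weights playing the role of \(\mathbf{\alpha}\) (the feature vector and weights below are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Logistic regression: p(Y=1 | x, alpha) = sigmoid(w . x + b).

    Only the conditional p(Y | X) is modeled; nothing is said about p(X),
    so this model cannot generate new inputs x.
    """
    return sigmoid(w @ x + b)

x = np.array([0.5, -1.2, 3.0])   # feature vector (placeholder values)
w = np.array([0.8, 0.1, -0.4])   # learned weights (placeholder values)
print(p_y1_given_x(x, w, b=0.2))
```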

To sum it all up, the generative model learns the full joint distribution, making it able to predict anything from anything. Discriminative models learn only the relationship between inputs and the target. This limits the uses of the discriminative model but makes the task much simpler.


  1. This is the simplification used by autoregressive models, which are very popular with LLMs. That is, predicting the next token from the previously seen tokens.↩︎

  2. Assuming the total number of values \(x\) could take is 2↩︎