Maximum Likelihood Learning
We are given a dataset \(\mathcal{D}\) of \(m\) samples from \(P_{data}\), which we assume to be IID. We are also given a family of models \(\mathcal{M}\). The task is to learn a model \(P_{\theta} \in \mathcal{M}\) that captures \(P_{data}\) from the sampled data.
This task is hard because \(m\) is typically far too small compared to the number of possible states of the variables; with \(n\) binary variables, for example, there are \(2^n\) possible joint states.
To extract the best model, we first need to define what "best" means: we need a measure of similarity between the two distributions \(P_{\theta}\) and \(P_{data}\).
Kullback-Leibler Divergence (KL-Divergence)
\[ D(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \]
This quantity is non-negative, equal to zero iff \(p=q\), and asymmetric (\(D(p||q) \neq D(q||p)\) in general).
It measures the expected number of extra bits required to describe samples from \(p\) using a compression code based on \(q\) instead of \(p\) (when the logarithm is taken base 2).
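As a quick illustration, here is a minimal Python sketch that computes the KL divergence between two discrete distributions given as probability vectors; the specific vectors `p` and `q` are made-up examples, not from the text.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) * log(p(x) / q(x)), with the convention 0 * log(0/q) = 0.

    np.log gives the result in nats; use np.log2 for the bits interpretation above.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))  # non-negative (~0.025)
print(kl_divergence(q, p))  # differs from the above: KL is asymmetric
print(kl_divergence(p, p))  # 0.0: zero iff p = q
```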
The KL divergence can be used for our task because
\[ D(P_{data} || P_{\theta}) = \textbf{E}_{\mathbf{x} \sim P_{data}} \left[ \log \frac{P_{data}(\mathbf{x})}{P_{\theta}(\mathbf{x})} \right] \]
This can be simplified to
\[ D(P_{data} || P_{\theta}) = \textbf{E}_{\mathbf{x} \sim P_{data}} [\log {P_{data}(\mathbf{x})}] - \textbf{E}_{\mathbf{x} \sim P_{data}} [\log {P_{\theta}(\mathbf{x})}] \]
Since the first term is constant in \(\theta\), we can simply minimize
\[ - \textbf{E}_{\mathbf{x} \sim P_{data}} [\log {P_{\theta}(\mathbf{x})}] \]
or equivalently maximize
\[ \textbf{E}_{\mathbf{x} \sim P_{data}} [\log {P_{\theta}(\mathbf{x})}] \]
In other words, minimizing the KL-divergence is equivalent to maximizing the log-likelihood.
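To make this equivalence concrete, the following sketch numerically checks, for a made-up discrete \(P_{data}\) and a hypothetical one-parameter model family (both are illustrative assumptions, not from the text), that the \(\theta\) minimizing the KL divergence is the same \(\theta\) maximizing the expected log-likelihood.

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])  # a made-up "data" distribution over 3 states

def model(theta):
    """A hypothetical model family: mixes p_data with the uniform distribution."""
    uniform = np.ones(3) / 3
    return theta * p_data + (1 - theta) * uniform

thetas = np.linspace(0.0, 1.0, 101)
kl = np.array([np.sum(p_data * np.log(p_data / model(t))) for t in thetas])
ell = np.array([np.sum(p_data * np.log(model(t))) for t in thetas])  # E_{P_data}[log P_theta]

# kl + ell is the constant first term E_{P_data}[log P_data], as in the decomposition above,
# so the same theta minimizes the KL divergence and maximizes the expected log-likelihood.
assert np.allclose(kl + ell, np.sum(p_data * np.log(p_data)))
assert np.argmin(kl) == np.argmax(ell)
print(thetas[np.argmin(kl)])  # 1.0: this family contains p_data at theta = 1
```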
We can’t directly calculate this expectation because we don’t have access to \(P_{data}\). What we can compute is the empirical log-likelihood, which approximates the expectation using the samples in \(\mathcal{D}\):
\[ \textbf{E}_{\mathcal{D}} [\log {P_{\theta}(\mathbf{x})}] = \frac{1}{|\mathcal{D}|} \sum_{\mathbf{x} \in \mathcal{D}} \log P_{\theta}(\mathbf{x}) \]
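As an illustration, the sketch below computes the empirical log-likelihood for a Bernoulli model \(P_{\theta}(x) = \theta^x (1-\theta)^{1-x}\) on simulated coin flips and recovers the maximum-likelihood parameter by grid search; the sample size, seed, and true parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=1000)  # m = 1000 IID samples, P_data = Bernoulli(0.7)

def empirical_log_likelihood(theta, data):
    """(1/|D|) * sum over x in D of log P_theta(x) for the Bernoulli model."""
    return np.mean(data * np.log(theta) + (1 - data) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
lls = [empirical_log_likelihood(t, data) for t in thetas]
theta_mle = thetas[np.argmax(lls)]

# For a Bernoulli, the maximizer has the closed form theta = sample mean,
# which converges to the true parameter 0.7 as m grows.
print(theta_mle, data.mean())
```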