50 Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks

Annotated Paper Link: Google Drive

This is a 1999 paper by the Bengio brothers, Yoshua and Samy Bengio (Bengio & Bengio, 1999).

This paper uses a neural network with one hidden layer to model conditional probabilities of the form \(P(x_i \mid x_1, \ldots, x_{i-1})\), where each \(x_i\) is a discrete variable. By the chain rule, multiplying these conditionals gives the joint probability of the whole sequence: \(P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1})\).
This is basically a very early autoregressive model, one that can capture dependencies among all of the variables (\(O(n^2)\) dependencies).
The output for the \(i^{\text{th}}\) variable depends on all of the previous \(i-1\) variables.
The paper uses a single hidden layer with a tanh non-linearity, followed by a second linear layer and a softmax over the possible values of \(x_i\).
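
To make the architecture concrete, here is a minimal sketch in plain Python of the forward pass only. It is not the paper's implementation. For readability it assumes one small independent network per variable, which may differ from how the paper shares parameters, and it uses random, untrained weights. All names (`init_conditional`, `conditional_probs`, `log_joint`) and the sizes `N_VARS`, `N_CLASSES`, `N_HIDDEN` are made up for illustration.

```python
import math
import random
from itertools import product


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(value, num_classes):
    return [1.0 if c == value else 0.0 for c in range(num_classes)]


def init_conditional(i, num_classes, hidden, rng):
    """Random weights for the network modeling x_i given x_0..x_{i-1} (0-based).

    Assumption: a separate network per variable, chosen here for clarity.
    """
    in_dim = i * num_classes  # one-hot encoding of every earlier variable

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    def vec(size):
        return [rng.uniform(-0.5, 0.5) for _ in range(size)]

    return {"W1": mat(hidden, in_dim), "b1": vec(hidden),
            "W2": mat(num_classes, hidden), "b2": vec(num_classes)}


def conditional_probs(params, prefix, num_classes):
    """Linear layer -> tanh -> linear layer -> softmax over the values of x_i."""
    inputs = [v for x in prefix for v in one_hot(x, num_classes)]
    hidden = [math.tanh(sum(w * v for w, v in zip(row, inputs)) + b)
              for row, b in zip(params["W1"], params["b1"])]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(params["W2"], params["b2"])]
    return softmax(logits)


def log_joint(model, xs, num_classes):
    """Chain rule: sum of log-conditionals, each conditioned on all earlier values."""
    total = 0.0
    for i, params in enumerate(model):
        probs = conditional_probs(params, xs[:i], num_classes)
        total += math.log(probs[xs[i]])
    return total


# Illustrative sizes only; nothing here comes from the paper's experiments.
N_VARS, N_CLASSES, N_HIDDEN = 4, 3, 5
rng = random.Random(0)
model = [init_conditional(i, N_CLASSES, N_HIDDEN, rng) for i in range(N_VARS)]

# Sanity check: the joint probabilities of all 3**4 sequences should sum to 1.
total_mass = sum(math.exp(log_joint(model, list(xs), N_CLASSES))
                 for xs in product(range(N_CLASSES), repeat=N_VARS))
print(f"total probability mass: {total_mass:.6f}")
```

Because every conditional is a proper softmax distribution, the product of conditionals is automatically a normalized joint distribution, whatever the weights are. Training would then amount to maximizing `log_joint` over the data, which this sketch leaves out.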