Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks
Annotated Paper Link: Google Drive
This is a 1999 paper by the Bengio brothers, Yoshua and Samy (Bengio & Bengio, 1999).
- The paper uses a neural network with one hidden layer to model the conditional probabilities \(P(x_i | x_1, ..., x_{i-1})\) of discrete variables \(x_i\). By the chain rule, the product of these conditionals gives the joint probability of the sequence: \(P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i | x_1, ..., x_{i-1})\).
- This is essentially a very early autoregressive model, one that can model dependencies among all the variables (\(O(n^2)\) dependencies).
- The output for the \(i^{\text{th}}\) variable depends on all of the previous \(i-1\) variables.
- The paper uses a single hidden layer with a non-linearity (tanh), followed by another linear layer and a softmax.
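
The points above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's exact architecture or training procedure. It assumes every variable takes one of `K` values, that the hidden layer is split into one group of `H` units per variable, and that binary masks on the weights are what enforce the ordering. The function names (`make_masks`, `init_params`, `conditionals`, `log_joint`) and the random initialisation are hypothetical choices made for this example.

```python
import numpy as np

def make_masks(n, K, H):
    """Binary masks so that the output for variable i only sees variables before i."""
    in_var = np.repeat(np.arange(n), K)    # variable index of each one-hot input unit
    hid_grp = np.repeat(np.arange(n), H)   # group index of each hidden unit
    out_var = np.repeat(np.arange(n), K)   # variable index of each output unit
    # Hidden group g may read input variable j only if j <= g.
    M1 = (in_var[:, None] <= hid_grp[None, :]).astype(float)   # shape (n*K, n*H)
    # The output for variable i may read hidden group g only if g < i.
    M2 = (hid_grp[:, None] < out_var[None, :]).astype(float)   # shape (n*H, n*K)
    return M1, M2

def init_params(n, K, H, seed=0):
    """Random, untrained weights (an assumption made purely for illustration)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.5, (n * K, n * H)),
        "b1": np.zeros(n * H),
        "W2": rng.normal(0.0, 0.5, (n * H, n * K)),
        "b2": np.zeros(n * K),
    }

def conditionals(x, params, masks, n, K):
    """Return an (n, K) array whose row i is P(x_i = . | x_1, ..., x_{i-1})."""
    M1, M2 = masks
    onehot = np.eye(K)[x].reshape(-1)                           # one-hot encode the sequence
    h = np.tanh(onehot @ (params["W1"] * M1) + params["b1"])    # single tanh hidden layer
    logits = (h @ (params["W2"] * M2) + params["b2"]).reshape(n, K)
    logits = logits - logits.max(axis=1, keepdims=True)         # numerically stable softmax
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def log_joint(x, params, masks, n, K):
    """log P(x_1, ..., x_n) as the sum of the log conditionals (chain rule)."""
    p = conditionals(x, params, masks, n, K)
    return float(np.log(p[np.arange(n), x]).sum())

if __name__ == "__main__":
    n, K, H = 4, 3, 5
    masks = make_masks(n, K, H)
    params = init_params(n, K, H)
    x = np.array([0, 2, 1, 1])
    print(conditionals(x, params, masks, n, K))
    print(log_joint(x, params, masks, n, K))
```

Because each row of `conditionals` is a proper distribution that depends only on earlier variables, the joint probabilities of all \(K^n\) configurations sum to one without any explicit normalisation step. The first variable has no predecessors, so its distribution comes from the output bias alone.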