Deep Learning in NLP

Word Embedding

Imagine using one-hot encoding to encode all the words in a language. The result would be hundreds of thousands of very sparse vectors. Word embedding allows the representation of words as lower-dimensional, dense vectors in a way that captures some of the input semantics. Each vector defines a point in space: when high-dimensional word vectors are projected down to 3-dimensional space, semantic relationships between words appear as geometrical relationships between points.
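As a toy illustration (the dense vector values below are invented for the example, not learned embeddings), the following sketch contrasts one-hot vectors, which are all orthogonal, with dense vectors whose geometry reflects word similarity:

```python
import numpy as np

# Vocabulary of 4 words; one-hot vectors are sparse and mutually orthogonal.
vocab = ["king", "queen", "apple", "banana"]
one_hot = np.eye(len(vocab))

# Hypothetical 3-dimensional dense embeddings (hand-picked for illustration;
# real embeddings are learned from a corpus).
embeddings = {
    "king":   np.array([0.90, 0.80, 0.10]),
    "queen":  np.array([0.85, 0.90, 0.15]),
    "apple":  np.array([0.10, 0.20, 0.90]),
    "banana": np.array([0.15, 0.10, 0.85]),
}

def cosine(u, v):
    """Cosine similarity: 1 for identical directions, 0 for orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors carry no similarity information: every pair is orthogonal.
print(cosine(one_hot[0], one_hot[1]))                   # 0.0
# Dense embeddings place related words close together in the vector space.
print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much smaller
```

Here the royal-word pair ends up nearly parallel while unrelated words point in different directions; this is the geometric structure that learned embeddings exhibit at much higher dimension.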

There are several general purpose libraries that offer algorithms for generating word embeddings.

  • Word2Vec
    • A classic library offering the continuous bag of words (CBOW) and skip-gram algorithms for generating continuous word vectors. Simple to use, but it may not capture complex relationships between words.
    • Gensim is a library focused on topic modeling and provides an implementation with a tutorial here: Word2vec. You can also use Gensim with NLTK; the documentation gives an example: NLTK & Gensim
    • TensorFlow: a notebook tutorial is available here: word2vec. To test on different datasets, find more data in TensorFlow Datasets
    • A PyTorch tutorial with code on word embeddings covers N-gram and CBOW models.
  • GloVe: Global Vectors for Word Representation
    • This algorithm uses global co-occurrence statistics to generate word vectors. It can capture some semantic relationships better than Word2Vec, but it may be slower to train on large datasets.
    • The project site includes code, pre-trained vectors, and the original publication: GloVe
    • Pretrained GloVe models are also available through Gensim
  • fastText
    • An algorithm that uses subword information (character n-grams) to handle rare words and out-of-vocabulary terms. It is efficient and handles diverse languages well.
    • Download pretrained models at fasttext.cc
    • A pretrained fastText model is also available through Gensim
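The libraries above differ in their APIs, but the skip-gram idea they implement is simple: slide a window over the text and learn to predict context words from each center word. As a minimal, library-independent sketch, the following generates the (center, context) training pairs a skip-gram model learns from:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center word, context word) training pairs, as used to
    train a skip-gram model: each word predicts its neighbors within
    `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

CBOW inverts the direction of prediction: the context words jointly predict the center word, using the same windows.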

A variety of NLP tasks can be performed by training deep neural networks. Examples include next-word prediction for text generation and part-of-speech (POS) tagging (which requires labeled input).

Recurrent neural networks allow the sequential processing of text to capture meaning and context across long distances between words. We can train a language model using an RNN with word embeddings as input, and softmax probabilities of the next word in the sequence as output. Context is captured in the hidden layers.
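A minimal NumPy sketch of this recurrence, with randomly initialized (untrained) parameters and a hypothetical 5-word vocabulary, shows the structure: an embedding lookup, a hidden state updated at each step, and a softmax over the vocabulary for the next word:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 5, 4, 3

# Randomly initialized parameters (a real language model would learn these).
E   = rng.normal(size=(vocab_size, embed_dim))   # word embedding table
W_x = rng.normal(size=(hidden_dim, embed_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
W_o = rng.normal(size=(vocab_size, hidden_dim))  # hidden-to-output weights

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Process a token sequence; the hidden state h carries context forward.
h = np.zeros(hidden_dim)
for token_id in [0, 3, 1]:           # an arbitrary 3-token input sequence
    x = E[token_id]                  # embedding lookup
    h = np.tanh(W_x @ x + W_h @ h)   # recurrent update: context lives in h
    p = softmax(W_o @ h)             # probability distribution over next word

print(p.shape, round(float(p.sum()), 6))  # (5,) 1.0
```

Training would adjust the weights so that `p` assigns high probability to the word that actually follows; sampling from `p` repeatedly generates text.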

RNNs can struggle to retain context over long sequences: for example, by the time the model needs to generate a pronoun, the subject it refers to may already have been forgotten. Long short-term memory (LSTM) is a specialized RNN architecture that solves this type of problem by "remembering" selected values along with their position in the sequence.

Recurrent Neural Network (RNN)
Long short-term memory (LSTM)
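A sketch of a single LSTM step in NumPy (biases omitted and weights random for brevity; this follows the standard gate equations rather than any particular library's implementation). The gates decide what to forget from the cell state, what new information to write, and what to expose as the hidden state:

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on [h_prev, x] concatenated.
# Biases are omitted here to keep the sketch short.
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden_dim, hidden_dim + input_dim))
                      for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z)          # forget gate: what to drop from memory
    i = sigmoid(W_i @ z)          # input gate: what new info to store
    c_tilde = np.tanh(W_c @ z)    # candidate memory content
    c = f * c_prev + i * c_tilde  # cell state: the "long-term memory"
    o = sigmoid(W_o @ z)          # output gate: what memory to expose
    h = o * np.tanh(c)            # new hidden state
    return h, c

h = c = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):  # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)  # (3,) (3,)
```

Because the cell state `c` is updated additively (scaled by the forget gate) rather than squashed through `tanh` at every step, values can persist across many steps, which is what lets an LSTM keep a subject "in mind" long enough to pick the right pronoun.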

Tutorials on text generation with RNNs

Deep learning applied to NLP has had much success in machine translation, where we translate a sequence of words from a source language to a target language. Because natural languages differ in grammatical structure, a direct word-for-word vocabulary translation is insufficient; a successful translation requires modeling both languages. Sequence-to-sequence models use two neural networks: one for the source language and one for the target language.
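A toy encoder-decoder sketch in NumPy makes the two-network structure concrete. The weights are random and untrained, and the BOS/EOS token ids are hypothetical; the point is only how the encoder compresses the source sequence into a context vector that conditions the decoder:

```python
import numpy as np

rng = np.random.default_rng(2)
src_vocab, tgt_vocab, dim = 6, 7, 4

# Encoder RNN: reads the source sentence and compresses it into one vector.
E_src = rng.normal(size=(src_vocab, dim))
W_enc = rng.normal(size=(dim, 2 * dim))

def encode(src_ids):
    h = np.zeros(dim)
    for t in src_ids:
        h = np.tanh(W_enc @ np.concatenate([E_src[t], h]))
    return h  # final hidden state serves as the context vector

# Decoder RNN: generates target words one at a time, seeded by the context.
E_tgt = rng.normal(size=(tgt_vocab, dim))
W_dec = rng.normal(size=(dim, 2 * dim))
W_out = rng.normal(size=(tgt_vocab, dim))
BOS, EOS = 0, 1  # hypothetical begin/end-of-sequence token ids

def decode_greedy(context, max_len=5):
    h, token, out = context, BOS, []
    for _ in range(max_len):
        h = np.tanh(W_dec @ np.concatenate([E_tgt[token], h]))
        token = int(np.argmax(W_out @ h))  # greedy: pick the most likely word
        if token == EOS:
            break
        out.append(token)
    return out

translation = decode_greedy(encode([2, 3, 4]))
print(translation)
```

With trained weights, the decoder's output ids would map back to target-language words; attention mechanisms later improved on this design by letting the decoder look at all encoder states instead of a single context vector.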

These types of models can also be used for NLP tasks like text summarization, image captioning, question-answering, and conversation.

Evolution of seq2seq models

Neural machine translation with Seq2Seq & Attention