NLP Text Representation Method: One-Hot Encoding

Let’s start with a basic idea of text representation: map each word in the vocabulary (V) of the text corpus to a unique ID (integer value), then represent each sentence or document in the corpus as a V-dimensional vector. To understand this better, let’s take a toy corpus with only four documents—D1, D2, D3, D4 as an example.
D1: dog bites man
D2: man bites dog
D3: dog eats meat
D4: man eats food
Note: All examples use lowercased text with punctuation removed.

Vocabulary of this corpus (V): [dog, bites, man, eats, meat, food]. We can organize the vocabulary in any order. In this example, we simply take the order in which the words appear in the corpus. Every document in this corpus can now be represented with a vector of size six.
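
As a quick sketch (plain Python, not from the book), the vocabulary and word IDs can be built by scanning the documents in order of first appearance:

# Toy corpus: four documents, lowercased, punctuation removed.
corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# Assign each new word an integer ID from 1 to |V| in order of first appearance.
vocab = {}
for doc in corpus:
    for word in doc.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

print(vocab)
# {'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}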

In one-hot encoding, each word w in the corpus vocabulary is given a unique integer ID between 1 and |V|. Each word is then represented by a |V|-dimensional binary vector of 0s and 1s.

Let me walk you through an example of how to perform one-hot encoding.

Sentence: 'man bites dog'
Map each of the six words to unique IDs in order of first appearance: dog = 1, bites = 2, man = 3, eats = 4, meat = 5, food = 6.

Under this scheme, each word in the sentence is represented by a six-dimensional vector, as follows:
man   : [0 0 1 0 0 0] # "man" has ID 3, so the 3rd position is 1 and the rest are 0.
bites : [0 1 0 0 0 0] # "bites" has ID 2, so the 2nd position is 1 and the rest are 0.
dog   : [1 0 0 0 0 0] # "dog" has ID 1, so the 1st position is 1 and the rest are 0.

Thus our sentence is represented as [[0 0 1 0 0 0], [0 1 0 0 0 0], [1 0 0 0 0 0]]. This matrix is then fed into the ML model as the feature representation of the text.

Here is a code snippet illustrating the scheme (a minimal sketch in plain Python; the one_hot_encode helper is just for illustration, not a standard library function):
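
# One-hot encode a sentence using the vocabulary and IDs from our toy corpus
# (repeated here as a literal so the snippet is self-contained).
vocab = {"dog": 1, "bites": 2, "man": 3, "eats": 4, "meat": 5, "food": 6}

def one_hot_encode(text):
    """Return one |V|-dimensional binary vector per word in the text."""
    vectors = []
    for word in text.split():
        vec = [0] * len(vocab)
        if word in vocab:
            vec[vocab[word] - 1] = 1  # IDs are 1-based, list indices are 0-based
        vectors.append(vec)
    return vectors

print(one_hot_encode("man bites dog"))
# [[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]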

Now that we understand the scheme, let’s discuss some of its pros and cons. On the positive side, one-hot encoding is intuitive to understand and straightforward to implement. However, it suffers from a few shortcomings:

1. The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies. This results in a sparse representation where most of the entries in the vectors are zeroes, making it computationally inefficient to store, compute with, and learn from (sparsity also leads to overfitting).

2. This representation does not give a fixed-length representation for text, i.e., if a text has 10 words, you get a longer representation for it as compared to a text with 5 words. For most learning algorithms, we need the feature vectors to be of the same length.

3. It treats words as atomic units and has no notion of similarity between words. For example, consider three words: run, ran, and apple. Run and ran have similar meanings, unlike run and apple. But if we take their respective one-hot vectors and compute the Euclidean distance between them, they're all equally far apart (a distance of sqrt(2)); see the sketch after this list. Thus, one-hot vectors are very poor at capturing the meaning of a word in relation to other words.

4. Say we train a model using our above corpus. At runtime, we get the sentence “man eats fruits.” The training data didn’t include “fruits,” so there is no way to represent it in our model. This is known as the out-of-vocabulary (OOV) problem. A one-hot encoding scheme cannot handle this; the only way out is to retrain the model: expand the vocabulary, give an ID to the new word, and so on.
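
The last two points are easy to verify concretely. Below is a minimal sketch (plain Python; the one-hot vectors for run, ran, and apple are hypothetical, and the vocabulary is the one from our toy corpus):

import math

# Hypothetical one-hot vectors for run, ran, and apple in a three-word vocabulary.
run, ran, apple = [1, 0, 0], [0, 1, 0], [0, 0, 1]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(run, ran), euclidean(run, apple), euclidean(ran, apple))
# 1.4142... 1.4142... 1.4142... -- every pair of distinct one-hot vectors is sqrt(2) apart

# OOV: the toy vocabulary has no ID for "fruits", so there is no valid
# one-hot vector for it without expanding the vocabulary and retraining.
vocab = {"dog": 1, "bites": 2, "man": 3, "eats": 4, "meat": 5, "food": 6}
print("fruits" in vocab)  # False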

Reference: "Practical Natural Language Processing" by Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta & Harshit Surana 
