NLP Text Representation Method: One-Hot Encoding

Let’s start with a basic idea of text representation: map each word in the vocabulary (V) of the text corpus to a unique ID (an integer value), then represent each sentence or document in the corpus as a V-dimensional vector. To understand this better, let’s take a toy corpus with only four documents, D1 through D4, as an example:

D1: dog bites man
D2: man bites dog
D3: dog eats meat
D4: man eats food

Note: all examples use lowercased text with punctuation ignored.

The vocabulary of this corpus (V) is: [dog, bites, man, eats, meat, food]. We can organize the vocabulary in any order; in this example, we simply take the order in which the words first appear in the corpus. Every document in this corpus can now be represented with a vector of size six.

In one-hot encoding, each word w in the corpus vocabulary is given a unique integer ID between 1 and |V|. Each word is then represented by a V-dimensional binary vector of 0s and 1s. Let me walk you...
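The scheme above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the `one_hot` and `encode` helper names are my own, and the vocabulary is built in order of first appearance, as described in the text.

```python
# Toy corpus from the text: four documents, lowercased, no punctuation.
corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# Build the vocabulary: each word gets a unique integer ID from 1 to |V|,
# in order of first appearance in the corpus.
vocab = {}
for doc in corpus:
    for word in doc.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

def one_hot(word):
    """Return the |V|-dimensional binary vector for a single word."""
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1  # IDs are 1-based, list indices 0-based
    return vec

def encode(doc):
    """Represent a document as a list of one-hot vectors, one per word."""
    return [one_hot(w) for w in doc.split()]

print(vocab)
# {'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}
print(encode("dog bites man"))
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
```

Note that under this scheme a document is not a single vector but a sequence of |V|-dimensional vectors, one per word, so D1 ("dog bites man") becomes a 3 × 6 binary matrix.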