Posts

Showing posts from April, 2023

NLP Text Representation Method: One-Hot Encoding

Let’s start with a basic idea of text representation: map each word in the vocabulary (V) of the text corpus to a unique ID (an integer value), then represent each sentence or document in the corpus as a |V|-dimensional vector. To understand this better, let’s take a toy corpus with only four documents, D1, D2, D3, and D4, as an example.

D1: dog bites man
D2: man bites dog
D3: dog eats meat
D4: man eats food

Note: All examples are lowercased and ignore punctuation.

Vocabulary of this corpus (V): [dog, bites, man, eats, meat, food]. We can organize the vocabulary in any order. In this example, we simply take the order in which the words appear in the corpus. Every document in this corpus can now be represented with a vector of size six.

In one-hot encoding, each word w in the corpus vocabulary is given a unique integer ID between 1 and |V|. Each word is then represented by a |V|-dimensional binary vector of 0s and 1s. Let me walk you...
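The scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the post's own code: the helper names `build_vocab`, `one_hot`, and `encode_document` are made up for this example, IDs are assigned in order of first appearance as in the excerpt, and returning an all-zero vector for a word outside the vocabulary is an assumption of this sketch.

```python
# Toy corpus from the excerpt: already lowercased, no punctuation.
docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]


def build_vocab(documents):
    """Map each word to a unique integer ID in 1..|V|, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def one_hot(word, vocab):
    """Return a |V|-dimensional binary vector with a 1 at position (ID - 1).

    Assumption of this sketch: an out-of-vocabulary word gets an all-zero vector.
    """
    vec = [0] * len(vocab)
    if word in vocab:
        vec[vocab[word] - 1] = 1
    return vec


def encode_document(doc, vocab):
    """Represent a document as a list of one-hot vectors, one per word."""
    return [one_hot(word, vocab) for word in doc.split()]


vocab = build_vocab(docs)
d1 = encode_document(docs[0], vocab)

print(vocab)
for word, vec in zip(docs[0].split(), d1):
    print(word, vec)
```

With this ordering, "dog" gets ID 1, "bites" ID 2, and "man" ID 3, so D1 becomes three six-dimensional vectors, each with a single 1.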

Why is NLP Challenging?

In this article, I'll talk about what makes NLP a challenging problem domain. So let's explore:

Ambiguity

Ambiguity means uncertainty of meaning. Most human languages are inherently ambiguous. Consider the following sentence: “I made her duck.” This sentence has multiple meanings. The first one is: I cooked a duck for her. The second meaning is: I made her bend down to avoid an object. Here, the ambiguity comes from the use of the word “made.” Which of the two meanings applies depends on the context in which the sentence appears. If the sentence appears in a story about a mother and a child, then the first meaning will probably apply. But if the sentence appears in a book about sports, then the second meaning will likely apply. The figure below contains a few examples of ambiguity in language from the Winograd Schema Challenge.

Common Knowledge

A key aspect of any human language is “common knowledge,” i.e., all the facts that most humans are aware of. In any conversation, it is assumed t...