Natural Language Processing Demystified Part 1 - Processing Raw Text


A simple Google search defines natural language processing as the following:

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction.


Wat?



If you find that description intimidating, I don't blame you. However, for most of the problems where natural language processing is used, all you need to know is how to count. Most of you are now thinking 'it can't be that easy', and that's true, because before we can count we have to know what we are counting. That is where text processing comes in. Take this example sentence:

The quick brown fox jumped over the lazy dog

One of you out there just counted the number of words in this sentence and got nine. I appreciate your enthusiasm.

However, it is slightly more complicated than that. Before we begin our counting experiment, I will briefly explain a few NLP (natural language processing) terms.

Corpus:  The text you are working with. In this case it is our sentence 'The quick brown fox jumped over the lazy dog'. In most cases a corpus contains many sentences or even paragraphs of text.

Stop Words:  Words that occur very frequently in a language but usually do not carry much information. In English this includes words like 'the', 'and' and 'over' in most situations.

So the first thing we want to do is take our corpus and remove all the stop words. What we are left with is:

quick brown fox jumped lazy dog
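
Here is a minimal sketch of that step in Python. The stop-word list below is a tiny, hand-picked set made up for this example (an assumption, not a standard list); the lists shipped with NLP libraries are much longer.

```python
# Hypothetical, hand-picked stop-word list for illustration only.
STOP_WORDS = {"the", "and", "over", "a", "an", "of", "in"}

def remove_stop_words(text):
    """Lowercase the text, split it on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

corpus = "The quick brown fox jumped over the lazy dog"
filtered = remove_stop_words(corpus)
print(" ".join(filtered))  # quick brown fox jumped lazy dog
```

Note that the text is lowercased first, so 'The' and 'the' are treated as the same stop word.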

Some of you just counted the number of words in this corpus and got six. You need to learn to be a little more patient. Before we can start our counting experiment, we need to learn the following additional NLP terms:

Stemming:  Removing parts of a word to get its base, like lazy→laz (which will then also match lazily, laziness, etc.).

Lemmatization:  Similar to stemming in that it can be used to find roots (like jumped→jump), but it uses language context, so it can also do things like better→good.

Why use stemming and lemmatization? When interpreting a sentence, grammatical details such as tense (past, present, future) and pluralization are generally less important than the underlying concepts. In other words, I care more that the dog has the property "laziness" and the fox has the properties "quick" and "brown" than I do about the fact that the corpus is in the past tense. After applying stemming and lemmatization, our corpus now looks like this:


quick brown fox jump laz dog
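
To make the idea concrete, here is a toy sketch in Python. Everything in it is an assumption made up for this example: the lemma table holds only the two words mentioned above, and the suffix rules cover only the 'lazy' family. Established stemmers and lemmatizers use far more elaborate rules and dictionaries, and may produce different stems than this one does.

```python
# Toy lemma table (hypothetical): a real lemmatizer relies on a full
# dictionary and grammatical context rather than a two-entry lookup.
LEMMAS = {"jumped": "jump", "better": "good"}

# Toy suffix rules (hypothetical), checked longest first.
SUFFIXES = ("iness", "ily", "y")

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(word):
    """Use a known lemma if there is one; otherwise fall back to stemming."""
    return LEMMAS.get(word, stem(word))

words = "quick brown fox jumped lazy dog".split()
normalized = [normalize(word) for word in words]
print(" ".join(normalized))  # quick brown fox jump laz dog
```

With these rules, 'lazily' and 'laziness' also reduce to 'laz', which is exactly why the stem is useful: all three words end up being counted as the same thing.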



Great! Now we can start counting!

But the answer is still not six.

Part 2 - Bag of Words