NLP - Preprocessing of Text
As data scientists, we may use NLP for sentiment analysis (classifying words as having positive or negative connotation) or to make predictions in classification models, among other things. Typically, whether we're given the data or have to scrape it, the text arrives in its natural human format of sentences, paragraphs, tweets, etc. From there, before we can dig into the analysis, we have to do some cleaning to break the text down into a format the computer can easily understand.
These steps transfer text from human language into a machine-readable format for further processing. We will also discuss text preprocessing tools. After a text is obtained, we start with text normalization, which includes:
- Tokenization
- Remove stop words
- Remove sparse terms and particular words
- Stemming
Tokenization: the process of breaking up the original raw text into component pieces, otherwise known as tokens.
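As a minimal sketch, tokenization can be done with NLTK's word_tokenize (the sample sentence below is just illustrative):

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize

text = "Natural Language Processing breaks raw text into tokens."
tokens = word_tokenize(text)
print(tokens)
# ['Natural', 'Language', 'Processing', 'breaks', 'raw', 'text', 'into', 'tokens', '.']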
Remove stop words:
“Stop words” are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. These words do not carry important meaning and are usually removed from texts. It is possible to remove stop words using the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.
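For example, a small sketch of stop word removal with NLTK's built-in English stop word list (the tokens below are illustrative):

import nltk
nltk.download('stopwords')  # one-time download of the stop word lists
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

tokens = ["this", "is", "a", "sample", "sentence", "for", "stop", "word", "removal"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['sample', 'sentence', 'stop', 'word', 'removal']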
scikit-learn also provides a stop word list:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
It’s also possible to use spaCy, a free open-source library:
from spacy.lang.en.stop_words import STOP_WORDS
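Whichever list is used, the removal step is the same: drop any token that appears in the set. A quick sketch with spaCy's STOP_WORDS (the tokens are illustrative):

from spacy.lang.en.stop_words import STOP_WORDS

tokens = ["the", "quick", "brown", "fox", "is", "on", "a", "log"]
filtered = [t for t in tokens if t not in STOP_WORDS]
print(filtered)  # ['quick', 'brown', 'fox', 'log']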
Remove sparse terms and particular words:
In some cases, it's necessary to remove sparse terms or particular words from texts. This can be done with the same stop word removal technique, since any set of words can be treated as the stop word list.
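A hedged sketch of one way to do this: count token frequencies across the corpus and treat rare tokens as a custom stop word list (the corpus and threshold below are made up for illustration):

from collections import Counter

# tiny illustrative corpus of already-tokenized documents
docs = [
    ["nlp", "makes", "text", "useful"],
    ["text", "cleaning", "makes", "nlp", "easier"],
    ["tokenization", "helps", "nlp"],
]

counts = Counter(token for doc in docs for token in doc)
min_freq = 2  # assumed threshold: keep tokens appearing at least twice
sparse_terms = {t for t, c in counts.items() if c < min_freq}

cleaned = [[t for t in doc if t not in sparse_terms] for doc in docs]
print(cleaned)
# [['nlp', 'makes', 'text'], ['text', 'makes', 'nlp'], ['nlp']]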
Stemming:
One of the most common and effective stemming tools is Porter's algorithm, developed by Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set of mapping rules.
- In the first phase, simple suffix mapping rules are applied; for example, SSES is reduced to SS and IES to I, so "caresses" becomes "caress" and "ponies" becomes "poni".
- Later phases apply more sophisticated rules that consider the length and complexity of the word before applying a mapping, so that very short stems are left unchanged (as sketched below).
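A minimal sketch of Porter stemming using NLTK's PorterStemmer (the word list is illustrative):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runner", "caresses", "ponies", "easily", "fairness"]
for word in words:
    print(word, "->", stemmer.stem(word))
# e.g. running -> run, caresses -> caress, ponies -> poni, fairness -> fair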
spaCy does not include a stemmer, because its developers consider lemmatization a much more effective way of reducing words to a base form.
Lemmatization:
In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words. For example, the lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'.
Lemmatization is typically seen as much more informative than simple stemming, which is why the spaCy library has opted to offer only lemmatization rather than stemming.
Lemmatization looks at the surrounding text to determine a given word's part of speech; it does not categorize phrases.
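A brief sketch of lemmatization with spaCy (this assumes the small English model has been installed with: python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is installed
doc = nlp("The mice were running because the cats were nearby.")
for token in doc:
    print(token.text, "->", token.lemma_)
# e.g. mice -> mouse, were -> be, running -> run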
For a deeper understanding, try the code shared on my GitHub link: