NLP - Preprocessing of Text
As data scientists, we may use NLP for sentiment analysis (classifying words as having positive or negative connotation) or to make predictions in classification models, among other things. Typically, whether we're given the data or have to scrape it, the text arrives in its natural human format of sentences, paragraphs, tweets, etc. From there, before we can dig into the analysis, we have to do some cleaning to break the text down into a format the computer can easily understand.
These steps transfer text from human language into a machine-readable format for further processing. We will also discuss text preprocessing tools. After a text is obtained, we start with text normalization, which includes:
- Tokenization
- Remove stop words
- Remove sparse terms and particular words
- Stemming
Tokenization: the process of breaking up the original raw text into component pieces, otherwise known as tokens.
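As a minimal sketch, tokenization can be done with NLTK's word_tokenize (the sample sentence below is just illustrative):

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize

text = "Natural Language Processing breaks raw text into tokens."
tokens = word_tokenize(text)
print(tokens)
# ['Natural', 'Language', 'Processing', 'breaks', 'raw', 'text', 'into', 'tokens', '.']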
Remove stop words:
“Stop words” are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. These words do not carry important meaning and are usually removed from texts. It is possible to remove stop words using the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.
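For example, a small sketch of stop word removal with NLTK's built-in English stop word list (the tokens below are illustrative):

import nltk
nltk.download('stopwords')  # one-time download of the stop word lists
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

tokens = ["this", "is", "a", "sample", "sentence", "for", "stop", "word", "removal"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['sample', 'sentence', 'stop', 'word', 'removal']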
scikit-learn also provides a stop word list:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
It’s also possible to use spaCy, a free open-source library:
from spacy.lang.en.stop_words import STOP_WORDS
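Whichever list is used, the removal step is the same: drop any token that appears in the set. A quick sketch with spaCy's STOP_WORDS (the tokens are illustrative):

from spacy.lang.en.stop_words import STOP_WORDS

tokens = ["the", "quick", "brown", "fox", "is", "on", "a", "log"]
filtered = [t for t in tokens if t not in STOP_WORDS]
print(filtered)  # ['quick', 'brown', 'fox', 'log']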
Remove sparse terms and particular words:
In some cases, it's necessary to remove sparse terms or particular words from texts. This can be done with the same stop word removal technique, since any set of words can be treated as the stop word list.
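A hedged sketch of one way to do this: count token frequencies across the corpus and treat rare tokens as a custom stop word list (the corpus and threshold below are made up for illustration):

from collections import Counter

# tiny illustrative corpus of already-tokenized documents
docs = [
    ["nlp", "makes", "text", "useful"],
    ["text", "cleaning", "makes", "nlp", "easier"],
    ["tokenization", "helps", "nlp"],
]

counts = Counter(token for doc in docs for token in doc)
min_freq = 2  # assumed threshold: keep tokens appearing at least twice
sparse_terms = {t for t, c in counts.items() if c < min_freq}

cleaned = [[t for t in doc if t not in sparse_terms] for doc in docs]
print(cleaned)
# [['nlp', 'makes', 'text'], ['text', 'makes', 'nlp'], ['nlp']]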
Stemming:
One of the most common and effective stemming tools is Porter's algorithm, developed by Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set of mapping rules.
- In the first phase, simple suffix mapping rules are applied; for example, SSES is reduced to SS and IES to I, so "caresses" becomes "caress" and "ponies" becomes "poni".
- Later phases apply more sophisticated rules that consider the length and complexity of the word before applying a mapping, so that very short stems are left unchanged (as sketched below).
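A minimal sketch of Porter stemming using NLTK's PorterStemmer (the word list is illustrative):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runner", "caresses", "ponies", "easily", "fairness"]
for word in words:
    print(word, "->", stemmer.stem(word))
# e.g. running -> run, caresses -> caress, ponies -> poni, fairness -> fair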
spaCy does not include a stemmer, because its developers consider lemmatization a much more effective way of reducing words to a base form.
Lemmatization:
In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words. For example, the lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'.
Lemmatization is typically seen as much more informative than simple stemming, which is why the spaCy library has opted to offer only lemmatization rather than stemming.
Lemmatization looks at the surrounding text to determine a given word's part of speech; it does not categorize phrases.
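A brief sketch of lemmatization with spaCy (this assumes the small English model has been installed with: python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is installed
doc = nlp("The mice were running because the cats were nearby.")
for token in doc:
    print(token.text, "->", token.lemma_)
# e.g. mice -> mouse, were -> be, running -> run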
For a deeper understanding, try the code shared on my GitHub link: