NLP - Preprocessing of Text
As data scientists, we may use NLP for sentiment analysis (classifying words as having a positive or negative connotation) or to make predictions in classification models, among other things. Typically, whether we’re given the data or have to scrape it, the text will be in its natural human format of sentences, paragraphs, tweets, etc. Before we can dig into analysis, we have to clean the text and break it down into a format the computer can easily understand. These steps transfer text from human language to a machine-readable format for further processing. We will also discuss text preprocessing tools.

After a text is obtained, we start with text normalization. Text normalization includes:

- Tokenization
- Remove stop words
- Remove sparse terms and particular words
- Stemming

Tokenization: The process of breaking up the original raw text into component pieces, otherwise known as tokens.

Remove stop words: “Stop words” a...