Introduction to NLP

Satish Gunjal
8 min read · Oct 15, 2020
In short, NLP is an AI technique used for text analysis. Whenever we have lots of text data to analyze, we can use NLP.

Introduction

We generate tons of data every day. Our WhatsApp chats, phone calls, emails and SMSes contain unstructured data which is easy for us to understand but not so easy for machines. In fact, around 80% of available data is in an unstructured format, and considering the growth of faceless apps like chatbots, this share is only going to increase. The majority of this unstructured data is in text format. Humans can analyze and process unstructured text/audio data, but it takes a lot of time and the quality varies. So there is a need for an automated system that can do it, and that's where the Natural Language Processing (NLP) technique of Artificial Intelligence (AI) comes to the rescue.

So where does NLP stand in the realm of AI?

NLP overlaps with machine learning (ML): once we convert unstructured data to a structured format, we can use ML statistical tools and algorithms to solve the problem.

What is NLP?

In short, NLP is an AI technique used for text analysis. For the nerds out there, here is a more formal definition of NLP.

“Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.”

So whenever we have lots of text data to analyze, we can use NLP. Apart from text analysis, NLP is also used for a variety of other tasks. A few important use cases of NLP are:

  • Text Classification: Using NLP we can classify a given corpus of text into different groups based on labels or keywords.
  • Sentiment Analysis: To identify the sentiment (positive or negative) in a given text. Very useful for movie/product/service reviews.
  • Relationship Extraction: Can be used to retrieve important relationship data from text, e.g. the relationship between a place and a person.
  • Chatbots: NLP is one of the core building blocks of chatbot platforms like Google Dialogflow and Amazon Lex.
  • Speech Recognition: NLP is used to simplify speech recognition and make it less time-consuming.
  • Question Answering: Using NLP we can analyze a given text corpus and build a model that can answer user questions.
  • Named Entity Recognition (NER): We can identify important information (entities) like dates, times, places, people, etc. in text using NLP (a short sketch follows this list).
  • Optical Character Recognition: Given an image representing printed text, determine the corresponding text.
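
As an illustration of NER, here is a minimal sketch using spaCy. It assumes spaCy is installed along with its small English model en_core_web_sm; the sample sentence is an illustrative assumption, and the labels printed may vary by model version.

    import spacy

    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Sundar Pichai visited New Delhi on 15 October 2020.")

    # Each detected entity carries its text span and a predicted label
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Typical output: 'Sundar Pichai PERSON', 'New Delhi GPE', '15 October 2020 DATE'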

Understanding the Text Data is Hard

The languages we use do not always follow strict rules; consider the sentences below, for example.

“Let’s eat grandma.”, “kids are really sweet.”, “I’d kill for a bath.”

What do you think a computer program will make of the above sentences? Parsing natural language input with computers is a very difficult problem. As with any complex problem, in order to solve it we are going to split it into small pieces and then chain them together for the final analysis. This process is called building a pipeline in machine learning terminology, and we are going to do the same thing to solve natural language processing problems.

Before we go into the details of the pipeline steps, let's try to understand how our text data is formatted. Our input text data may be unstructured, but every sentence is a collection of words and every document is a collection of sentences. Every text corpus at its core is just a collection of words.

We can have a text corpus of any kind of data, with one or more documents. In the case of emails we may have a separate document for each email, and in the case of reviews we may have one single document with tab-separated data for each review.

NLP Workflow

Irrespective of our text data format, the steps used to solve NLP problems remain more or less the same. The major steps we follow while solving NLP problems are as below.

In the text preprocessing step we remove all the clutter and noise from the text. Then we perform exploratory data analysis to understand the data. Based on our understanding from the data analysis, we create new features in the feature engineering step. Once we have well-formatted data with features, we create an ML model as per our requirement. In the last step we test our model and deploy it in production.

Text Preprocessing

Text preprocessing is a very important step in the NLP workflow; without it, we can't analyze the text data. Below are the three major steps in text preprocessing.

Noise Removal

Any text which is not relevant to the context of the data and the task we want to perform is considered noise. The most common noise in text data consists of HTML tags, stop words, punctuation, white space and URLs. So in this step we remove all these noisy elements from the text. Libraries such as spaCy and NLTK also ship standard lists of some of these noisy elements, such as stop words. If required, we can also build our own list.
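
For example, here is a minimal noise removal sketch using regular expressions and NLTK's stop word list; the sample text and the exact cleaning rules are illustrative assumptions, not a one-size-fits-all recipe.

    import re
    from nltk.corpus import stopwords  # one-time setup: nltk.download("stopwords")

    text = "<p>Check out https://example.com for the BEST deals!!</p>"

    text = re.sub(r"<[^>]+>", " ", text)           # strip HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)       # strip punctuation and digits
    tokens = text.lower().split()                  # lowercase and split on white space

    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]
    print(tokens)  # ['check', 'best', 'deals']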

Text Normalization

At a higher level, normalization is used to reduce the dimensionality of the features so that machine learning models can process the data efficiently. Text data contains multiple representations of the same word. For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. These variations are useful in speech but not very useful for text analysis. During text normalization we convert all the variants of a word into their normalized form. In this step we perform tokenization, lemmatization, stemming, and sentence segmentation.

  • Tokenization: Tokenization is one of the first steps in any NLP pipeline. It is nothing but splitting the raw text into small chunks of words or sentences, called tokens. For more details please refer to Tokenization in NLP.
  • Lemmatization: Lemmatization removes the inflected ending from a word and returns its base/root/dictionary form. This base form of the word is known as the lemma.
  • Stemming: A cruder alternative to lemmatization. Stemming involves simply lopping off easily identified prefixes and suffixes to get the base form of the word. For example, ‘connect’ is the base form of ‘connection’; here ‘ion’ is just a suffix. A short sketch of all three steps follows this list.
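
Here is a minimal sketch of tokenization, stemming and lemmatization using NLTK; the sample sentence is an illustrative assumption, and the printed forms may vary slightly by NLTK version.

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    # One-time setup: nltk.download("punkt"); nltk.download("wordnet")

    sentence = "The players were playing well"

    # Tokenization: split the raw text into word tokens
    tokens = nltk.word_tokenize(sentence)

    # Stemming: crude suffix chopping; can produce non-dictionary forms
    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in tokens])  # e.g. 'players' -> 'player', 'playing' -> 'play'

    # Lemmatization: returns a valid dictionary form (here treating tokens as verbs)
    lemmatizer = WordNetLemmatizer()
    print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # e.g. 'were' -> 'be'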

Object Standardization

Text data often contains words or phrases which are not present in the standard dictionaries of libraries like spaCy or NLTK. We have to handle all such words with the help of custom code. In this step we fix non-standard words with the help of regular expressions and a custom lookup table.
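
A minimal sketch of object standardization follows, using a custom lookup table plus a regular expression; the slang entries and the character-repeat heuristic are illustrative assumptions.

    import re

    # Custom lookup table for words absent from standard dictionaries
    lookup = {"rt": "retweet", "dm": "direct message", "awsm": "awesome", "luv": "love"}

    def standardize(text):
        # Replace each known non-standard token; keep unknown words as-is
        words = [lookup.get(w.lower(), w) for w in text.split()]
        text = " ".join(words)
        # Collapse runs of 3+ repeated characters, e.g. "soooo" -> "soo"
        return re.sub(r"(.)\1{2,}", r"\1\1", text)

    print(standardize("RT I luv this awsm phone soooo much"))
    # retweet I love this awesome phone soo much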

Exploratory Data Analysis

In the case of unstructured text data, exploratory data analysis plays an extremely important role. In this step we visualize and explore the data to generate insights. Based on our understanding, we try to summarize the main characteristics of the data for feature generation.

Feature Engineering

In this step we convert the preprocessed data into features for machine learning models to work on. We can use the techniques below to extract features from text data.

  • Syntactic Parsing: Once we have the tokens we can predict the part of speech (noun, verb, adjective, etc.) for each of them. Knowing the role of each word in a sentence helps us understand its meaning. We use dependency grammar and part of speech (POS) tags for syntactic analysis.
  • Entity Extraction: A more advanced form of language processing that is used to identify parameter values, such as places, people and organizations, in input text. This is very useful for picking out the important topics or key sections of a text input.
  • Statistical Features: Using techniques like Term Frequency-Inverse Document Frequency (TF-IDF) we can convert text data into a numerical format. We can also use word count, sentence count, punctuation count, etc. to create count/density based features.
  • Word Embedding: Word embedding techniques represent each word as a vector. A popular model like Word2Vec can be used to perform this task. These word vectors can then be used as features in machine learning models (a TF-IDF sketch follows this list).
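
As an illustration of statistical features, here is a minimal TF-IDF sketch using scikit-learn; the toy corpus is an illustrative assumption, and get_feature_names_out assumes a reasonably recent scikit-learn version.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus: one "document" per review
    corpus = [
        "the movie was great",
        "the movie was terrible",
        "great acting and a great story",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)  # sparse matrix: rows = documents, columns = terms

    print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
    print(X.shape)                             # (3, number_of_unique_terms)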

Model Building & Deployment

  • The first step in model building is to create separate training and test data sets. This makes sure our model gets tested on unseen data.
  • Choose an algorithm as per the task at hand. For example, if we are working on a classification problem we can choose from a variety of classification algorithms like Logistic Regression, Support Vector Machines, Naïve Bayes, etc.
  • Create a pipeline which will feed the data to the model. The same pipeline can then be used on real-world data.
  • Once the pipeline is built, fit it on the training dataset and evaluate the model on the test dataset.
  • We can use a variety of metrics to score the model. Once we get a satisfactory score, deploy the model in production. A minimal end-to-end sketch follows this list.
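
Putting these steps together, here is a minimal end-to-end sketch using a scikit-learn Pipeline with TF-IDF features and Logistic Regression; the toy data is an illustrative assumption, so the score itself is meaningless, only the structure matters.

    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Toy data: texts labeled positive (1) or negative (0)
    texts = ["great movie", "terrible film", "loved it", "awful acting",
             "wonderful story", "boring plot", "really enjoyable", "worst ever"]
    labels = [1, 0, 1, 0, 1, 0, 1, 0]

    # Separate training and test sets so the model is evaluated on unseen data
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=42)

    # Pipeline: TF-IDF features feeding a Logistic Regression classifier
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression()),
    ])
    model.fit(X_train, y_train)

    # Evaluate on the held-out test set
    print(accuracy_score(y_test, model.predict(X_test)))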

Libraries for NLP

  • Scikit-learn: Scikit-learn is a free software machine learning library for the Python programming language.
  • Natural Language Toolkit (NLTK): The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language.
  • spaCy (Industrial-Strength Natural Language Processing): spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.
  • Gensim: Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. Gensim is implemented in Python and Cython.
  • Stanford CoreNLP — NLP services and packages by Stanford NLP Group: CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations.
  • TextBlob (Simplified Text Processing): TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks. A quick example of how simple it is follows this list.
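
As a quick taste of TextBlob's API, here is a one-line sentiment check; the sample sentence is an illustrative assumption.

    from textblob import TextBlob  # pip install textblob

    blob = TextBlob("The movie was great and the acting was wonderful.")
    # Polarity ranges from -1.0 (most negative) to 1.0 (most positive)
    print(blob.sentiment.polarity)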


Satish Gunjal

Data Scientist, Machine Learning, Deep Learning, NLP, CX Architect, Conversational AI, Analytics & Process Optimization. www.satishgunjal.com