What is Natural Language Processing? - Great Learning
We use cookies to give you the best online experience. By using our website, you agree to our use of cookies in accordance with our cookie policy. Learn More

What is Natural Language Processing?

Reading Time: 8 minutes

All of us have come across Google’s keyboard which suggests auto-corrects, word predicts (next word that would be used) and more. Grammarly is a great tool for content writers and professionals to make sure their articles look professional. It uses ML algorithms to suggest the right amounts of gigantic vocabulary, tonality, and much more, to make sure that the content written is professionally apt, and captures the total attention of the reader. Translation systems use language modelling to work efficiently with multiple languages. By the end of the article, we would have realized that we are surrounded by NLP applications everywhere.

How did the field of NLP start?

People involved with language characterization and understanding of patterns in languages are called linguists. Computational linguistics kicked off as the amount of textual data started to explode tremendously. Wikipedia is the greatest textual source there is. The field of computational linguistics began with an early interest in understanding the patterns in data, Parts-of Speech(POS) tagging, easier processing of data for various applications in the banking and finance industries, educational institutions, etc. 

What is NLP?

NLP is the application of computational linguistics to build real-world applications which work with languages comprising of varying structures. We are trying to teach the computer to learn languages, and then also expect it to understand it, with suitable efficient algorithms. 

Why is advancement in the field of NLP necessary?

NLP is the process of enhancing capabilities of computers to understand human language. Databases are highly structured forms of data. Internet, on the other hand, is completely unstructured with minimal components of structure in it. In such a case, understanding human language and modelling it is the ultimate goal under NLP. For example, Google Duplex and Alibaba’s voice assistant are on the journey to mastering non-linear conversations. Non-linear conversations are somewhat close to the human’s manner of communication. We talk about cats in the first sentence, suddenly jump to talking tom, and then refer back to the initial topic. The person listening to this understands the jump that takes place. Computers currently lack this capability. 

What is the scope of NLP?

Currently, NLP professionals are in a lot of demand, for the amount of unstructured data available is increasing at a very rapid pace. Underneath this unstructured data lies tons of information that can help companies grow and succeed. For example, monitoring tweet patterns can be used to understand the problems existing in the societies, and it can also be used in times of crisis. Thus, understanding and practicing NLP is surely a guaranteed path to get into the field of machine learning. For beginners, creating a NLP portfolio would highly increase the chances of getting into the field of NLP.

What are some of the applications of NLP?

  1. Grammarly, Microsoft Word, Google Docs
  2. Search engines like DuckDuckGo, Google
  3. Voice assistants – Alexa, Siri
  4. News feeds- Facebook,Google News
  5. Translation systems – Google translate

What are some of the open-source softwares available in the domain of NLP?

  1. NLTK
  2. Stanford toolkit
  3. Gensim
  4. Open NLP

We will understand traditional NLP, a field which was run by the intelligent algorithms that were created to solve various problems. With the advance of deep neural networks, NLP has also taken the same approach to tackle most of the problems today. In this article we will cover traditional algorithms to ensure the fundamentals are understood.

We look at the basic concepts such as regular expressions, text-preprocessing, POS-tagging and parsing.

What are regular expressions?

Regular expressions is the process of effective matching of patterns in strings. Patterns are used extensively to get meaningful information from large amounts of unstructured data. There are various regular expressions involved. Let us consider them one by one:

  • (a period): All characters except for \n are matched
  • \w: All [a-z A-Z 0-9] characters are matched with this expression
  • $: Every expression ends with $
  • \: used to nullify the speciality of the special character.
  • \W (upper case W) matches any non-word character.
  • \s: This expression (lowercase s) matches a single white space character – space, newline,

return, tab, form [\n\r\t\f].

  • \r: This expression is used for a return character.
  • \S: This expression matches any non-white space character.
  • \t: This expression performs a tab operation.
  • \n: Used to express a newline character.
  • \d: Decimal digit [0-9].
  • ^: Used at the start of the string.

What is text wrangling?

We will define it as the pre-processing done before obtaining a machine readable and formatted text from raw data. 

Some of the processes under text wrangling are: 

  1. text cleansing
  2. specific pre-processing
  3. tokenization
  4. stemming or lemmatization 
  5. stop word removal 

Text cleansing

Text collected from various sources has a lot of noise due to the unstructured nature of the text. Upon parsing of the text from the various data sources, we need to make sense of the unstructured property of the raw data. Therefore, text cleansing is used in the majority of the cleaning to be performed. 

What factors decide the quality and quantity of text cleansing?

  1. Parsing performance
  2. Source of data source
  3. External noise

Consider a second case, where we parse a PDF. There could be noisy characters, non ASCII characters, etc. Before proceeding onto the next set of actions, we should remove these to get a clean text to process further. If we are dealing with xml files, we are interested in specific elements of the tree. In the case of databases we manipulate splitters and are interested in specific columns. We will look at splitters in the coming section.

How do we define Text cleansing?

In conclusion, processes done with an aim to clean the text and to remove the noise surrounding the text can be termed as text cleansing. Data munging and data wrangling are also used to talk about the same. They are used interchangeably in a similar context.

Sentence splitter

NLP applications require splitting large files of raw text into sentences to get meaningful data. Intuitively, a sentence is the smallest unit of conversation. How do we define something like a sentence for a computer? Before that, why do we need to define this smallest unit? We need it because it simplifies the processing involved. For example, the period can be used as splitting tool, where each period signifies one sentence. To extract dialogues from a paragraph, we search for all the sentences between inverted commas and double-inverted commas. . A typical sentence splitter can be something as simple as splitting the string on (.), to something as complex as a predictive classifier to identify sentence boundaries:


Token is defined as the minimal unit that a machine understands and processes at a time. All the text strings are processed only after they have undergone tokenization, which is the process of splitting the raw strings into meaningful tokens. The task of tokenization is complex due to various factors such as

  1. need of the NLP application
  2. the complexity of the language itself

For example, in English it can be as simple as choosing only words and numbers through a regular expression. For dravidian languages on the other hand, it is very hard due to vagueness present in the morphological boundaries between words.

What is Stemming?

Stemming is the process of obtaining the root word from the word given. Using efficient and well-generalized rules, all tokens can be cut down to obtain the root word, also known as the stem. Stemming is a purely rule-based process through which we club together variations of the token. For example, the word sit will have variations like sitting and sat. It does not make sense to differentiate between sit and sat in many applications, thus we use stemming to club both grammatical variances to the root of the word. Stemming is in use for its simplicity. But in the case of dravidian languages with many more alphabets, and thus many more permutations of words possible, the possibility of the stemmer identifying all the rules is very low. In such cases we use the lemmatization instead. Lemmatization is a robust, efficient and methodical way of combining grammatical variations to the root of a word.

What are the various types of stemmers?

  1. Porter Stemmer
  2. Lovins Stemmer
  3. Dawson Stemmer
  4. Krovetz Stemmer
  5. Xerox Stemmer

Porter Stemmer: Porter stemmer makes use of larger number of rules and achieves state-of-the-art accuracies for languages with lesser morphological variations. For complex languages, custom stemmers need to be designed, if necessary. On the contrary, a basic rule-based stemmer, like removing –s/es or -ing or -ed can give you a precision of more than 70 percent .

There exists a family of stemmers known as Snowball stemmers that is used for multiple languages like Dutch, English, French, German, Italian, Portuguese, Romanian, Russian, and so on. 

In modern NLP applications usually stemming as a pre-processing step is excluded as it typically depends on the domain and application of interest. When NLP taggers, like Part of Speech tagger (POS), dependency parser, or NER are used, we should avoid stemming as it modifies the token and thus can result in an unexpected result.

What is Lemmatization?

Lemmatization is a methodical way of converting all the grammatical/inflected forms of the root of the word. Lemmatization makes use of the context and POS tag to determine the inflected form(shortened version) of the word and various normalization rules are applied for each POS tag to get the root word (lemma).

A few questions to ponder about would be.

  • What is the difference between Stemming and lemmatization?
  • What would the rules be for a rule-based stemmer for your native language?
  • Would it be simpler or difficult to do so?

What is stop word removal?

Stop words are the most commonly occurring words, that seldom add weightage and meaning to the sentences. They act as bridges and their job is to ensure that sentences are grammatically correct. It is one of the most commonly used pre-processing steps across various NLP applications. Thus, removing the words that occur commonly in the corpus is the definition of stop-word removal. Majority of the articles and pronouns are classified as stop words. Many tasks like information retrieval and classification are not affected by stop words. Therefore, stop-word removal is not required in such a case. On the contrary, in some NLP applications stop word removal has a major impact. The stop word list for a language is a hand-curated list of words that occur commonly. Stop word lists for most languages are available online. Many ways exist to automatically generate the stop word list. A simple way to obtain the stop word list is to make use of the word’s document frequency. Words presence across the corpus is used as an indicator for classification of stop-words. Research has ascertained that we obtain the optimum set of stop words for a given corpus. NLTK comes with a loaded list for 22 languages.

One should consider answering the following questions.

  • How does stop-word removal help?
  • What are some of the alternatives for stop-word removal?

What is rare word removal?

Some of the words that are very unique in nature like names, brands, product names, and some of the noise characters also need to be removed for different NLP tasks. Use of names in the case of text classification isn’t a feasible option to use. Even though we know Adolf Hitler is associated with bloodshed, his name is an exception. Usually, names, do not signify the emotion and thus nouns are treated as rare words and replaced by a single token. The rare words are application dependent, and must be chosen uniquely for different applications.

What is spell correction?

Finally, spellings should be checked for in the given corpus. The model should not be trained with wrong spellings, as the outputs generated will be wrong. Thus, spelling correction is not a necessity but can be skipped if the spellings don’t matter for the application.

In the next article, we will refer to POS tagging, various parsing techniques and applications of traditional NLP methods. We learned the various pre-processing steps involved and these steps may differ in terms of complexity with a change in the language under consideration. Therefore, understanding the basic structure of the language is the first step involved before starting any NLP project. We need to ensure, we understand the natural language, before we can teach the computer.

Leave a Reply

Notify of
Subscribe to Our Blog