Natural Language Processing | Text Preprocessing | spaCy vs NLTK

Rishi Kumar
Published in Nerd For Tech · 5 min read · Aug 6, 2021


Natural Language Processing is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. There are some basic preprocessing steps for text data:

  1. Tokenization
  2. Stemming
  3. Lemmatization
  4. Stop Words

What is spaCy?

spaCy is an open-source Natural Language Processing library launched in 2015. It is designed to handle NLP tasks effectively with the most efficient implementations of common algorithms. For many NLP tasks, spaCy has only one implemented method, choosing the most efficient algorithm currently available. This means you often don't have the option to choose other algorithms.

What is NLTK?

NLTK (Natural Language Toolkit) is a very popular open-source library launched in 2001. It also provides many functionalities, but includes less efficient implementations.

spaCy vs NLTK

For very common NLP tasks, spaCy is much faster and more efficient than NLTK, at the cost of the user not being able to choose algorithmic implementations. However, spaCy does not include pre-created models for some applications, such as sentiment analysis, which is typically easier to perform with NLTK.

NLTK vs spaCy Processing Steps

spaCy takes less time than NLTK for various operations like tokenizing, tagging, and parsing.

Import spacy and create an object called nlp as shown below.
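
A minimal sketch, assuming the small English model en_core_web_sm has already been downloaded:

    import spacy

    # Load the small English pipeline; install it first with:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")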

When we run nlp, our text enters a processing pipeline that first breaks down the text and then performs a series of operations to tag, parse and describe the data.
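
For example, continuing with the nlp object created above (the sample text is my own, and the exact component names vary by model version):

    doc = nlp("spaCy was launched in 2015. NLTK was launched in 2001.")

    # The components that ran on the text (tagger, parser, lemmatizer, ner, ...)
    print(nlp.pipe_names)

    # Each token now carries the annotations those components produced
    for token in doc:
        print(token.text, token.pos_, token.dep_)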

Tokenization:

Tokenization is a common task in Natural Language Processing (NLP). Tokens are the building blocks of natural language. Tokenization is a way of separating a piece of text into smaller units called tokens.


spaCy is intelligent enough that when we pass text into the nlp object, it is processed through the pipeline mentioned above. In the snippet below, the for loop performs word tokenization and the list comprehension performs sentence tokenization.
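
A sketch of both steps (the example text is my own):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("spaCy is an open-source NLP library. It was launched in 2015.")

    # Word tokenization: iterating over a Doc yields Token objects
    for token in doc:
        print(token.text)

    # Sentence tokenization: doc.sents yields one Span per sentence
    sentences = [sent.text for sent in doc.sents]
    print(sentences)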

NLTK has various tokenization algorithms.

The sentence tokenizer (sent_tokenize) returns sentence tokens. Look at the length of the token list: it is one, which shows that the text contains only one sentence token.
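
A sketch, assuming the punkt tokenizer models have been downloaded:

    import nltk
    from nltk.tokenize import sent_tokenize

    # nltk.download('punkt')  # run once to fetch the tokenizer models
    text = "spaCy is much faster than NLTK for common NLP tasks."
    sent_tokens = sent_tokenize(text)
    print(sent_tokens)
    print(len(sent_tokens))  # 1 -> the text contains a single sentence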


The word tokenizer (word_tokenize) returns word tokens; the length of the returned list gives the number of word tokens in the sentence. NLTK also offers alternatives such as the Tok-tok tokenizer, shown below.

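A sketch of word_tokenize, with the rule-based Tok-tok tokenizer as an alternative (the example text is my own):

    from nltk.tokenize import word_tokenize
    from nltk.tokenize.toktok import ToktokTokenizer

    text = "spaCy is much faster than NLTK for common NLP tasks."
    word_tokens = word_tokenize(text)
    print(word_tokens)
    print(len(word_tokens))  # number of word tokens in the text

    # Tok-tok is a fast, purely rule-based tokenizer
    toktok = ToktokTokenizer()
    print(toktok.tokenize(text))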

Stemming:

Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization, so stemming methods are available only in the NLTK library.

Porter Stemmer:

One of the most common and effective stemming tools is Porter’s Algorithm developed by Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set of mapping rules.

In the first phase, simple suffix matching rules are defined, such as:
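
    SSES → SS   (caresses → caress)
    IES  → I    (ponies → poni)
    SS   → SS   (caress → caress)
    S    →      (cats → cat)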

From a given set of stemming rules only one rule is applied, based on the longest suffix S1. Thus, ‘caresses’ reduces to ‘caress’ but not ‘cares’.

More sophisticated phases consider the length/complexity of the word before applying a rule. For example:
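
    (m>0) EED → EE   (agreed → agree, but feed is unchanged)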

Here m>0 describes the “measure” of the stem, such that the rule is applied to all but the most basic stems.

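A sketch of NLTK's PorterStemmer (the sample words are my own):

    from nltk.stem.porter import PorterStemmer

    p_stemmer = PorterStemmer()
    words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']
    for word in words:
        # e.g. running --> run, easily --> easili (a known Porter quirk)
        print(f"{word} --> {p_stemmer.stem(word)}")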

Snowball Stemmer:

This is somewhat of a misnomer, as Snowball is the name of a stemming language developed by Martin Porter. The algorithm used here is more accurately called the “English Stemmer” or “Porter2 Stemmer”. It offers a slight improvement over the original Porter stemmer, both in logic and speed. Since nltk uses the name SnowballStemmer, we’ll use it here.

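A sketch of the SnowballStemmer; note that it takes a language argument:

    from nltk.stem.snowball import SnowballStemmer

    s_stemmer = SnowballStemmer(language='english')
    for word in ['generous', 'generation', 'generously', 'generate']:
        # e.g. generously --> generous, where Porter gives the cruder 'gener'
        print(f"{word} --> {s_stemmer.stem(word)}")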

Lancaster Stemmer:

The Lancaster stemming algorithm is another algorithm that you can use. It is the most aggressive stemmer of the bunch. However, if you use NLTK's implementation, you can add your own custom rules to this algorithm very easily.

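A sketch of the LancasterStemmer (the sample words are my own):

    from nltk.stem import LancasterStemmer

    l_stemmer = LancasterStemmer()
    for word in ['run', 'runner', 'running', 'maximum', 'presumably']:
        # Lancaster chops aggressively: e.g. maximum --> maxim
        print(f"{word} --> {l_stemmer.stem(word)}")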

Lemmatization:

In contrast to stemming, lemmatization looks beyond word reduction, and considers a language’s full vocabulary to apply a morphological analysis to words. The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. Further, the lemma of ‘meeting’ might be ‘meet’ or ‘meeting’ depending on its use in a sentence.

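A sketch of lemmatization in spaCy; each token's lemma is exposed as token.lemma_ (the sample sentence is my own):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I was meeting mice near the office.")

    for token in doc:
        # e.g. was --> be, mice --> mouse, meeting --> meet (used as a verb here)
        print(f"{token.text:<10} {token.pos_:<6} {token.lemma_}")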

NLTK's lemmatizer takes an optional part-of-speech argument; if you don't give the POS tag, it treats the word as a noun and may not lemmatize it at all. As you can see in the snippet below, after giving the proper POS tag the lemmatization performs better.

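A sketch of NLTK's WordNetLemmatizer, assuming the WordNet corpus has been downloaded:

    import nltk
    from nltk.stem import WordNetLemmatizer

    # nltk.download('wordnet')  # run once to fetch the WordNet data
    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize('running'))           # running (treated as a noun)
    print(lemmatizer.lemmatize('running', pos='v'))  # run
    print(lemmatizer.lemmatize('mice'))              # mouse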

Stop words:

Words like “a” and “the” appear so frequently that they don’t require tagging as thoroughly as nouns, verbs and modifiers. We call these stop words, and they can be filtered from the text to be processed. spaCy holds a built-in list of some 326 English stop words.

To load, add, and remove stop words in spaCy:
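
A sketch; the custom stop word "btw" is just an example of my own:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # View the built-in stop-word list
    print(len(nlp.Defaults.stop_words))  # around 326 words
    print(nlp.vocab['the'].is_stop)      # True

    # Add a custom stop word
    nlp.Defaults.stop_words.add('btw')
    nlp.vocab['btw'].is_stop = True

    # Remove a stop word
    nlp.Defaults.stop_words.remove('btw')
    nlp.vocab['btw'].is_stop = False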

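NLTK keeps its stop words in nltk.corpus.stopwords. A sketch, assuming the stopwords and punkt data have been downloaded:

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # nltk.download('stopwords'); nltk.download('punkt')  # run once
    stop_words = set(stopwords.words('english'))
    print(len(stop_words))  # size of the English stop-word list

    text = "This is a simple sentence for the stop word demo."
    filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
    print(filtered)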

These are the different ways basic text preprocessing can be done with the help of the spaCy and NLTK libraries. spaCy performs efficiently for large tasks. I hope you got some insight into the basic text preprocessing steps followed for NLP tasks.

Click this GitHub link to refer to the notebook file.


Rishi Kumar

I'm a passionate and disciplined data science enthusiast, working at Logitech as a Data Scientist.