Natural Language Processing | Feature Extraction Techniques.

Rishi Kumar · Published in Nerd For Tech · Aug 13, 2021 · 10 min read


Most classic machine learning and deep learning algorithms can’t take in raw text. Instead, we need to perform feature extraction from the raw text in order to pass numerical features to machine learning algorithms.

Bag of Words Model — TF

This is perhaps the simplest vector space representation model for unstructured text. A vector space model is simply a mathematical model that represents unstructured text (or any other data) as numeric vectors, such that each dimension of the vector is a specific feature attribute. The Bag of Words model represents each text document as a numeric vector where each dimension is a specific word from the corpus, and the value can be its frequency in the document, its occurrence (denoted by 1 or 0), or even a weighted value. The model is so named because each document is represented literally as a 'bag' of its own words, disregarding word order, sequence and grammar.

Count Vectorizer

The example below should make things clearer. Each column or dimension in the feature vectors represents a word from the corpus, and each row represents one of our documents. The value in any cell represents the number of times that word (the column) occurs in the specific document (the row). Hence, if a corpus consists of N unique words across all the documents, we would have an N-dimensional vector for each document.
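As a rough sketch (the toy corpus and variable names below are purely illustrative), scikit-learn's CountVectorizer builds exactly this kind of document-term matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus of three short documents (illustrative only).
corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the blue sky is bright",
]

# Fit the Bag of Words model: one column per unique word, one row per document.
cv = CountVectorizer()
bow_matrix = cv.fit_transform(corpus)

print(cv.get_feature_names_out())  # vocabulary (the columns); use get_feature_names() on scikit-learn < 1.0
print(bow_matrix.toarray())        # raw term counts (rows = documents)
```

Each cell of `bow_matrix` holds the raw count of a word in a document, which is the term frequency (TF) referred to in the heading above.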

Bag of N-Grams Model

A word is just a single token, often known as a unigram or 1-gram. We already know that the Bag of Words model doesn't consider the order of words. But what if we also want to take into account phrases or collections of words that occur in a sequence? N-grams help us achieve that. An N-gram is basically a collection of word tokens from a text document such that these tokens are contiguous and occur in a sequence. Bi-grams indicate n-grams of order 2 (two words), tri-grams indicate n-grams of order 3 (three words), and so on. The Bag of N-Grams model is hence just an extension of the Bag of Words model that lets us leverage N-gram based features. The following example depicts bi-gram based features in each document feature vector.

This gives us feature vectors for our documents, where each feature consists of a bi-gram representing a sequence of two words and values represent how many times the bi-gram was present in our documents.
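As a sketch (same illustrative toy corpus as before), the same vectorizer can produce bi-gram features by setting `ngram_range`:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the blue sky is bright",
]

# ngram_range=(2, 2) keeps only bi-grams; (1, 2) would keep unigrams and bi-grams.
bv = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bv.fit_transform(corpus)

print(bv.get_feature_names_out())  # bi-gram features such as 'blue sky', 'sun is'
print(bigram_matrix.toarray())     # how often each bi-gram appears in each document
```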

Drawbacks of using a BOW model:

  • If new sentences contain new words, our vocabulary size would increase and, thereby, the length of the vectors would increase too.
  • Additionally, the vectors would also contain many 0s, thereby resulting in a sparse matrix (which is what we would like to avoid).
  • We are retaining no information on the grammar of the sentences nor on the ordering of the words in the text.

TF-IDF

The TF-IDF model tries to combat these issues by using a scaling or normalizing factor in its computation. TF-IDF stands for Term Frequency-Inverse Document Frequency, which uses a combination of two metrics in its computation, namely: term frequency (tf) and inverse document frequency (idf).

Mathematically, we can define TF-IDF as tfidf(w, D) = tf(w, D) × idf(w).

Here, tfidf(w, D) is the TF-IDF score for word w in document D.
→ The term tf(w, D) is the term frequency of the word w in document D, which can be obtained from the Bag of Words model.
→ The term idf(w) is the inverse document frequency of the word w, computed as the log transform of the total number of documents N in the corpus C divided by the document frequency of w, i.e. the number of documents in the corpus in which w occurs: idf(w) = log(N / df(w)).

The TF-IDF based feature vectors for each of our text documents show scaled and normalized values as compared to the raw Bag of Words model values.
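A minimal sketch with scikit-learn's TfidfVectorizer (same illustrative toy corpus) shows the effect: words that appear in every document, such as 'the', receive lower weights than rarer, more distinctive words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the blue sky is bright",
]

# TfidfVectorizer combines term counts (tf) with idf scaling and, by default,
# L2-normalizes each document vector.
tv = TfidfVectorizer()
tfidf_matrix = tv.fit_transform(corpus)

print(tv.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))  # scaled, normalized weights per document
```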

→ Bag of Words just creates a set of vectors containing the counts of word occurrences in each document, while the TF-IDF model also captures which words are more important and which are less important.

→ Bag of Words vectors are easy to interpret. However, TF-IDF usually performs better in machine learning models.

→ While both Bag-of-Words and TF-IDF have been popular in their own regard, there still remained a void where understanding the context of words was concerned. Detecting the similarity between the words ‘spooky’ and ‘scary’, or translating our given documents into another language, requires a lot more information on the documents.

→ This is where word embedding techniques such as Word2Vec, Continuous Bag of Words (CBOW), Skip-gram, etc. come in.

Word2Vec Model

This model was created by Google in 2013. It is a predictive deep learning based model that computes high-quality, distributed and continuous dense vector representations of words, which capture contextual and semantic similarity. Essentially, these are unsupervised models that can take in massive textual corpora, create a vocabulary of possible words and generate dense word embeddings for each word in the vector space representing that vocabulary.

Usually you can specify the size of the word embedding vectors, and the total number of vectors is essentially the size of the vocabulary. This makes the dimensionality of this dense vector space much lower than that of the high-dimensional sparse vector space built using traditional Bag of Words models.

There are two different model architectures that can be leveraged by Word2Vec to create these word embedding representations:

  • The Continuous Bag of Words (CBOW) Model
  • The Skip-gram Model

The Continuous Bag of Words (CBOW) Model:

The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words).

Considering a simple sentence, “the quick brown fox jumps over the lazy dog”, the model works on pairs of (context_window, target_word); with a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on.

Thus the model tries to predict the `target_word` based on the `context_window` words.
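As a quick sketch of how such training pairs can be generated (the helper function below is illustrative, not gensim's internal code), with one context word taken from each side of the target:

```python
# Generate (context_window, target_word) pairs for CBOW-style training.
sentence = "the quick brown fox jumps over the lazy dog".split()

def cbow_pairs(tokens, half_window=1):
    """One context word on each side, i.e. a two-word context window."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - half_window):i] + tokens[i + 1:i + 1 + half_window]
        pairs.append((context, target))
    return pairs

for context, target in cbow_pairs(sentence):
    print(context, "->", target)   # e.g. ['quick', 'fox'] -> brown
```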

The Skip-gram Model

The Skip-gram model architecture usually tries to achieve the reverse of what the CBOW model does. It tries to predict the source context words (surrounding words) given a target word (the center word).

Consider our simple sentence from earlier, “the quick brown fox jumps over the lazy dog”. With the CBOW model, we get pairs of (context_window, target_word); with a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on.

Now considering that the skip-gram model’s aim is to predict the context from the target word, the model typically inverts the contexts and targets, and tries to predict each context word from its target word. Hence the task becomes to predict the context [quick, fox] given target word ‘brown’ or [the, brown] given target word ‘quick’ and so on.

Thus the model tries to predict the context window words based on the target word.
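A matching sketch for skip-gram simply inverts the pairs, so each (target, context word) combination becomes one training example (again an illustrative helper, not the library's internals):

```python
# Generate (target_word, context_word) pairs for skip-gram-style training.
sentence = "the quick brown fox jumps over the lazy dog".split()

def skipgram_pairs(tokens, half_window=1):
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - half_window):i] + tokens[i + 1:i + 1 + half_window]
        for context_word in context:
            pairs.append((target, context_word))
    return pairs

for target, context_word in skipgram_pairs(sentence):
    print(target, "->", context_word)   # e.g. brown -> quick, brown -> fox
```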

Robust Word2Vec Model with Gensim

The gensim framework, created by Radim Řehůřek, provides a robust, efficient and scalable implementation of the Word2Vec model. We will leverage it on our sample toy corpus. In our workflow, we will tokenize our normalized corpus and then focus on the following five parameters of the Word2Vec model to build it.

→ size: The word embedding dimensionality (renamed vector_size in gensim 4.x)
→ window: The context window size
→ min_count: The minimum word count
→ sample: The downsample setting for frequent words
→ sg: The training algorithm; 1 for skip-gram, otherwise CBOW

We will build a simple Word2Vec model on the corpus and visualize the embeddings.
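A minimal sketch with gensim follows (the toy corpus is illustrative; note that in gensim 4.x the `size` parameter is called `vector_size`):

```python
from gensim.models import Word2Vec

# Tokenize the normalized corpus (a toy corpus here, purely for illustration).
corpus = [
    "the sky is blue and beautiful",
    "the quick brown fox jumps over the lazy dog",
    "i love this blue and beautiful sky",
]
tokenized_corpus = [doc.split() for doc in corpus]

w2v_model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,  # word embedding dimensionality ("size" in gensim < 4.0)
    window=5,         # context window size
    min_count=1,      # minimum word count
    sample=1e-3,      # downsample setting for frequent words
    sg=1,             # 1 for skip-gram, otherwise CBOW
)

print(w2v_model.wv["sky"][:5])           # first few dimensions of the 'sky' vector
print(w2v_model.wv.most_similar("sky"))  # nearest neighbours by cosine similarity
```

The embeddings can then be visualized by reducing them to two dimensions with PCA or t-SNE.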

Cosine Similarity:

Cosine similarity is used to measure how similar two word vectors are to each other; it is essentially a measure of the angle between the two vectors.

We can also perform vector arithmetic with the word vectors.

new_vector = king - man + woman

This creates a new vector for which we can then attempt to find the most similar vectors.

→ The new vector is closest to the vector for 'queen'.

Cosine similarity is the cosine of the angle between the two vectors, and cosine distance is 1 − cosine similarity. The larger the angle between two vectors, the lower the cosine similarity (and the higher the cosine distance); the smaller the angle, the higher the cosine similarity (and the lower the cosine distance).
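As a sketch with pre-trained vectors (the model name "word2vec-google-news-300" assumes the gensim-data downloader; it is a large download, and any pre-trained KeyedVectors would work):

```python
import numpy as np
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pre-trained Word2Vec KeyedVectors

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine_similarity(wv["spooky"], wv["scary"])
print("cosine similarity:", sim)      # close to 1 for semantically similar words
print("cosine distance:  ", 1 - sim)  # 1 - cosine similarity

# Vector arithmetic: king - man + woman; the closest result is typically 'queen'.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```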

Using `most_similar`, we can also retrieve the top three most similar words for each word in the vocabulary.

The GloVe Model:

GloVe stands for Global Vectors. It is an unsupervised learning model which can be used to obtain dense word vectors similar to Word2Vec. However, the technique is different: training is performed on an aggregated global word-word co-occurrence matrix, giving us a vector space with meaningful sub-structures. This method was developed at Stanford, and I recommend reading the original paper, 'GloVe: Global Vectors for Word Representation' by Pennington et al., which is an excellent read to get some perspective on how this model works.

The basic methodology of the GloVe model is to first create a huge word-context co-occurrence matrix consisting of (word, context) pairs such that each element in this matrix represents how often a word occurs in the context (which can be a sequence of words). The idea then is to apply matrix factorization.

Considering the Word-Context (WC) matrix, Word-Feature (WF) matrix and Feature-Context (FC) matrix, we try to factorize WC ≈ WF × FC.

In other words, we aim to reconstruct WC from WF and FC by multiplying them. For this, we typically initialize WF and FC with some random weights, multiply them to get WC′ (an approximation of WC) and measure how close WC′ is to WC. We repeat this multiple times, using Stochastic Gradient Descent (SGD) to minimize the reconstruction error. Finally, the Word-Feature matrix (WF) gives us the word embedding for each word, where F can be preset to a specific number of dimensions.
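The sketch below illustrates only this factorization idea, on a random toy matrix with plain gradient descent; the actual GloVe objective is a weighted least-squares fit to the logarithm of the co-occurrence counts.

```python
import numpy as np

rng = np.random.default_rng(42)

n_words, n_contexts, n_features = 6, 6, 3
WC = rng.integers(0, 5, size=(n_words, n_contexts)).astype(float)  # fake co-occurrence counts

WF = rng.normal(scale=0.1, size=(n_words, n_features))     # Word-Feature matrix
FC = rng.normal(scale=0.1, size=(n_features, n_contexts))  # Feature-Context matrix

lr = 0.01
for step in range(2000):
    error = WF @ FC - WC        # reconstruction error (WC' - WC)
    grad_WF = error @ FC.T      # gradient of the squared error w.r.t. WF
    grad_FC = WF.T @ error      # gradient of the squared error w.r.t. FC
    WF -= lr * grad_WF
    FC -= lr * grad_FC

print(np.abs(WF @ FC - WC).mean())  # mean reconstruction error shrinks over the iterations
# Each row of WF is now a dense "embedding" for one word (toy example only).
```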

Implementation of the GloVe model:
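In practice, rather than training GloVe from scratch, it is common to load pre-trained vectors. A minimal sketch using the gensim-data downloader ("glove-wiki-gigaword-100" is one of its published model names):

```python
import gensim.downloader as api

# Load 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
glove_vectors = api.load("glove-wiki-gigaword-100")

print(glove_vectors["ice"][:5])                     # first few dimensions of 'ice'
print(glove_vectors.most_similar("ice", topn=3))    # nearest neighbours
print(glove_vectors.similarity("spooky", "scary"))  # cosine similarity of two words
```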

The FastText Model:

The FastText model was first introduced by Facebook in 2016 as an extension and supposed improvement of the vanilla Word2Vec model. It is based on the paper 'Enriching Word Vectors with Subword Information' (https://arxiv.org/pdf/1607.04606.pdf) by Bojanowski et al., which is an excellent read to gain an in-depth understanding of how this model works. Overall, FastText is a framework for learning word representations and for performing robust, fast and accurate text classification. The framework is open-sourced by Facebook on GitHub (https://github.com/facebookresearch/fastText) and claims to provide the following:

→ Recent state-of-the-art English word vectors.
→ Word vectors for 157 languages trained on Wikipedia and Crawl.
→ Models for language identification and various supervised tasks.

The Word2Vec model typically ignores the morphological structure of each word and considers a word as a single entity. The FastText model, in contrast, considers each word as a bag of character n-grams. This is also called a subword model in the paper.

We add special boundary symbols < and > at the beginning and end of words. This enables us to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to its character n-grams).

Taking the word where and n = 3 (tri-grams) as an example, it will be represented by the character n-grams <wh, whe, her, ere, re> and the special sequence <where> representing the whole word. Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her taken from the word where.

In practice, the paper recommends extracting all n-grams for 3 ≤ n ≤ 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes. We typically associate a vector representation (embedding) with each n-gram of a word.
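A small sketch of this subword extraction (a hypothetical helper, not FastText's internal code), shown here with tri-grams only for brevity:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Extract character n-grams from a word padded with the boundary symbols '<' and '>'."""
    padded = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    grams.add(padded)  # the whole word, e.g. '<where>', is kept as well
    return sorted(grams)

print(char_ngrams("where", max_n=3))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']  -- note '<where>' differs from the tri-gram 'her'
```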

We can then represent a word by the sum (or average) of the vector representations of its n-grams. Because n-grams are shared across words, rare words have a better chance of getting a good representation, since their character-based n-grams also occur in other words of the corpus.
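A minimal sketch with gensim's FastText implementation (the toy corpus is illustrative); because vectors are assembled from character n-grams, even a misspelled or out-of-vocabulary word gets a usable representation:

```python
from gensim.models import FastText

sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the slow brown dog sleeps under the lazy fox".split(),
]

ft_model = FastText(
    sentences=sentences,
    vector_size=50,    # embedding dimensionality
    window=3,
    min_count=1,
    min_n=3, max_n=6,  # character n-gram lengths used for the subword model
    sg=1,
)

# 'quik' never appears in the corpus, but its character n-grams overlap with 'quick'.
print(ft_model.wv.most_similar("quik", topn=3))
```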

These are the main feature extraction and embedding techniques used in NLP.

Thanks for reading to the end. Refer to this notebook for a practical implementation. LINK
