Bag Of Words & TF-IDF

Creating feature vectors using Bag of Words & TF-IDF in NLP


Our machines are hungry for data, but they are not intelligent enough to understand whether the data being fed to them is useful or garbage. It is our responsibility as humans to make sure the data is clean, processed and useful, so that our machines make sensible decisions as an outcome.

With this wisdom at our side, today we will uncover how preprocessed data in NLP is converted into a feature vector: the acceptable format of food that our deep learning or traditional machine learning models can be trained on.

In Part 3 of this NLP series, we performed a hands-on Python lab to understand the critical text preprocessing steps that have to be carried out before the data can be prepared for model training: Hands-On Lab On Text Preprocessing in NLP Using Python (tokenization, stemming, lemmatization and POS tagging explained using NLTK & Python, on medium.com).

There we covered various text preprocessing techniques such as:

  • Tokenization
  • Stemming
  • Lemmatization
  • Stop word removal
  • Part of speech tagging

So today we will cover how we convert those preprocessed words and sentences in the corpus/document into a feature vector using:

  • B.O.W: Bag Of Words
  • TF-IDF: Term Frequency — Inverse Document Frequency

What Is A Bag Of Words Feature Vector?

The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

So basically it is a count of each word in your document. It comprises two things:

  • A vocabulary of known words.
  • A measure of the presence of known words.

Example Of Bag Of Words:

Let’s understand this with a simple example.

Suppose we have two documents, d1 and d2, in the given corpus with the following sets of words:

d1 : “Pramod has played district and state level cricket matches”

d2 : “Pranjal has played only matches at the state level”

Tokenizing The Sentences In The Given Documents

Let’s break the given set of words in documents d1 and d2 into tokens. The resulting token lists are:

d1 tokens: Pramod, has, played, district, and, state, level, cricket, matches

d2 tokens: Pranjal, has, played, only, matches, at, the, state, level
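As an illustration, here is a minimal sketch of how this tokenization could be reproduced with NLTK's word_tokenize (the two document strings below are the example sentences from above; a real pipeline would also apply the other preprocessing steps from Part 3):

```python
import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # run once to fetch the tokenizer models

d1 = "Pramod has played district and state level cricket matches"
d2 = "Pranjal has played only matches at the state level"

d1_tokens = word_tokenize(d1)
d2_tokens = word_tokenize(d2)

print(d1_tokens)  # ['Pramod', 'has', 'played', 'district', 'and', 'state', 'level', 'cricket', 'matches']
print(d2_tokens)  # ['Pranjal', 'has', 'played', 'only', 'matches', 'at', 'the', 'state', 'level']
```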

Once we are done creating a vocabulary out of the given documents in the corpus, the next step is to create a list or dictionary of each word together with its occurrence count (frequency) in that vocabulary.

Creating A Frequency Vector: Dictionary Of Words

The way we do it is by counting the occurrences of each word across the documents. For our two documents, the dictionary of word counts looks like this:

Pramod: 1, Pranjal: 1, has: 2, played: 2, district: 1, and: 1, only: 1, at: 1, the: 1, state: 2, level: 2, cricket: 1, matches: 2
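Here is a minimal sketch of building that dictionary in Python with collections.Counter (the token lists are the ones from the tokenization step above):

```python
from collections import Counter

d1_tokens = ["Pramod", "has", "played", "district", "and",
             "state", "level", "cricket", "matches"]
d2_tokens = ["Pranjal", "has", "played", "only", "matches",
             "at", "the", "state", "level"]

# Corpus-level word frequencies: the "dictionary of words"
word_counts = Counter(d1_tokens + d2_tokens)
print(word_counts.most_common())
# [('has', 2), ('played', 2), ('state', 2), ('level', 2), ('matches', 2), ('Pramod', 1), ...]
```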

What About Non-Informative Words & A Large Corpus?

As you can see in the above dictionary of words, some entries are of very little importance. Such uninformative words, like stop words, can be removed during preprocessing even before the text is converted to a feature vector.

Also, if the corpus comprises a huge set of words, then the length of the feature vector will grow accordingly, which can be resource-intensive and expensive on the performance side. So, as a best practice, we always rely on the NLP text preprocessing steps before we go on to create a feature vector using the BOW model.
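As a small illustration of that best practice, here is a minimal sketch of dropping English stop words from the first document's tokens with NLTK before building the feature vector:

```python
import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once to fetch the stop-word lists

stop_words = set(stopwords.words('english'))

d1_tokens = ["Pramod", "has", "played", "district", "and",
             "state", "level", "cricket", "matches"]

# Keep only the informative tokens
filtered_d1 = [t for t in d1_tokens if t.lower() not in stop_words]
print(filtered_d1)
# ['Pramod', 'played', 'district', 'state', 'level', 'cricket', 'matches']
```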

Creating a BOW Model Out Of Dictionary:

Suppose, based on frequency, we select the following set of words:

“Pramod”, “played”, “state”, “level”, “cricket”, “matches”

So the BOW model can be represented as a matrix of rows and columns, where:

Rows: the documents in the given corpus

Columns: the selected list of words from the dictionary, based on frequency.

The BOW model representation looks like this:

d1 → Pramod: 1, played: 1, state: 1, level: 1, cricket: 1, matches: 1

d2 → Pramod: 0, played: 1, state: 1, level: 1, cricket: 0, matches: 1

Here “Pramod” and “cricket” are present only in d1 and not in d2, so they are represented as 1 in d1 and 0 in d2. Words like “played”, “state”, “level” and “matches” are present in both documents, which is why they are represented as 1 in both rows. Hope this helps you understand the underlying mechanism of how BOW functions.
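For comparison, here is a minimal sketch of building a similar matrix with scikit-learn's CountVectorizer (binary=True records presence/absence instead of raw counts; note that CountVectorizer lowercases the text and uses its own tokenization by default, so its vocabulary will differ slightly from the hand-built one above):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Pramod has played district and state level cricket matches",
    "Pranjal has played only matches at the state level",
]

# binary=True -> 1 if the word is present in the document, 0 otherwise
vectorizer = CountVectorizer(binary=True)
bow_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn versions
print(bow_matrix.toarray())                # one row per document, one column per word
```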

What Are The Limitations Of The Bag Of Words Model For Feature Representation?

The bag of words model is the simplest way of representing the given set of sentences in the corpus as a feature vector of fixed length. But it comes with its own limitations:

  • You can lose information: the location or position of words within a sentence is ignored.
  • BOW can ignore some important words: a really crucial word that carries relevant information may be neglected by the BOW model simply because it occurs infrequently.
  • BOW is not good with semantics: BOW fails to understand the semantic relationship between words. For example, soccer & football share the same context, but our BOW model misses this and represents them as separate, unrelated entries in the feature vector.

TF-IDF:

As we discussed, the BOW model, though simple, has many shortcomings: it may include some words of least importance purely on the basis of frequency counts, and it may lose important ones. The word with the highest count is not necessarily the word carrying the most vital information about the given document.

So what is the way to understand which words in the vocabulary are relevant enough to be picked for feature representation, and which ones are not? Well, TF-IDF has an answer to this big question.

TF-IDF has a mechanism for penalizing words that are extremely frequent in each and every document.

How Does TF-IDF Work?

TF-IDF rescales the frequency of common words based on how often they appear across all the documents in a given corpus, so that the scores of words like “the” and “is”, which are frequent in every document, are penalized. For this it utilizes the concept of Term Frequency - Inverse Document Frequency, which we often call TF-IDF.

  • Term Frequency: is a scoring of the frequency of the word in the current document.

Formula :

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

  • Inverse Document Frequency: is a scoring of how rare the word is across all the documents in a corpus. It measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as “is”, “of”, and “that”, may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log(Total number of documents / Number of documents with term t in it), where the logarithm is commonly taken base 10 (as in the example below) or base e.

Example TF-IDF:

Consider a document containing 500 words wherein the word Dog appears 4 times. The term frequency (i.e., tf) for Dog is:

TF(Dog) = 4/500 = 0.008

Now, suppose we have 10 million documents and the word Dog appears in five thousand of these. Then the inverse document frequency is:

IDF(Dog) = log(10,000,000 / 5,000) = log(2,000) ≈ 3.3 (base-10 logarithm)

So, TF-IDF(Dog) = 0.008 × 3.3 ≈ 0.026

The higher the score, the more relevant that word is in that particular document.
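As a quick sanity check, here is the same calculation as a minimal sketch in plain Python (base-10 logarithm assumed, as in the example above):

```python
import math

# Term frequency: "Dog" appears 4 times in a 500-word document
tf = 4 / 500                          # 0.008

# Inverse document frequency: "Dog" appears in 5,000 of 10,000,000 documents
idf = math.log10(10_000_000 / 5_000)  # ~3.3

tfidf = tf * idf
print(round(tfidf, 4))  # 0.0264
```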

What’s Next In Part 5 Of This NLP Series?

So, now that we have covered both the BOW model & the TF-IDF model for representing documents as feature vectors, we will move one step further and write our second hands-on Python lab to understand how these two core models of feature extraction work using Python libraries.

We will make use of the TfidfVectorizer() function from the scikit-learn library to easily implement the above TF-IDF model.
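As a small preview of that lab, here is a minimal sketch of what the scikit-learn call could look like, using the two example documents from earlier (the full walkthrough comes in Part 5):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Pramod has played district and state level cricket matches",
    "Pranjal has played only matches at the state level",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(tfidf_matrix.toarray())              # TF-IDF weight of each word in each document
```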

So stay tuned for :

BOW & TF-IDF Model Hands-On Python Lab Using The Scikit-learn Library

Time to sign off with this food for thought:

“The only way to make this human life worth living is by being humble enough to accept that you don’t know enough. Only then will the zeal to learn more and share more stay alive and make your life full of possibility.”

So keep learning and keep supporting,

Thanks a lot…..
