Decision Tree — A Revolutionary Algorithm

Rishi Kumar · Published in Nerd For Tech · 6 min read · Jul 16, 2021


The Decision Tree is the first level of the tree-based approaches. It is also known as Classification and Regression Tree (CART).

Let’s start off with a thought experiment to give a little motivation for why we would use the Decision Tree method.

Imagine a scenario where I play tennis every Saturday and I always invite a friend to come with me. Sometimes my friend shows up, sometimes not. For him it depends on a variety of factors, such as weather, temperature, humidity, wind, etc. I start keeping track of these features and whether or not he showed up to play with me.

Here is the data: each of the columns represents a feature and each of the rows represents a certain day. The final column has the label indicating whether or not my friend came to play with me. You can see that on the first day the temperature was mild, the outlook sunny, the humidity 80%, it was not windy, and my friend came out to play.

However, on day 2 the temperature was hot, the outlook sunny, the humidity 75%, there was wind, and my friend didn’t come to play. From this data I can build a model that helps predict whether or not he will show up to play, taking these factors into account. An intuitive way to do this is through a decision tree.

Generally, we use the following terminology for a Decision Tree. In the above tree we have:

Root node: The node that performs the first split.

Decision nodes: Intermediate nodes, which split on the value of a certain attribute.

Leaves: Terminal nodes that predict the outcome.

You might be thinking: how do these splits occur? That’s where Information Gain (based on Entropy) or the Gini Index plays an important role in splitting nodes. These are the mathematical methods for choosing the best split.

First, let me explain Entropy, which is needed in order to understand Information Gain.

Entropy is a measure of uncertainty or impurity: it measures how pure a split is. Low entropy means little mixing of class labels, whereas high entropy means heavy mixing of class labels.
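For reference, for a set S whose samples fall into classes with proportions p_1, …, p_k, the entropy is

E(S) = -\sum_{i=1}^{k} p_i \log_2 p_i

A pure node has entropy 0, and an evenly mixed two-class node has entropy 1.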

There are two steps for calculating the Information Gain of each attribute (combined in the formula after this list):

→ Calculate entropy of the target.

→ Calculate the entropy of every attribute.
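Putting these two steps together, the Information Gain of splitting the target T on an attribute X is

IG(T, X) = E(T) - \sum_{v \in \mathrm{values}(X)} \frac{|T_v|}{|T|} \, E(T_v)

where T_v is the subset of rows in which X takes the value v. The attribute whose split leaves the purest subsets gives the highest gain.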

To illustrate how Information Gain works, let me take the above dataset as an example. First, we need to calculate the entropy of the target, which is done as follows.
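To make the arithmetic concrete, here is a minimal Python sketch of that target-entropy calculation. It assumes, purely for illustration, the classic Play Tennis split of 9 "Yes" and 5 "No" days; substitute the counts from your own table.

```python
import math

# Assumed target distribution (9 "Yes", 5 "No" out of 14 days) -- an illustrative
# placeholder, not necessarily the exact counts in the article's table.
p_yes, p_no = 9 / 14, 5 / 14

# E(target) = -p_yes*log2(p_yes) - p_no*log2(p_no)
target_entropy = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
print(round(target_entropy, 3))  # ~0.94 for this split
```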

Then, we need to calculate the entropy of the attributes. To do that, we first create a frequency table for each attribute.

Frequency table
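As a quick sketch, such a frequency table can be produced with pandas’ crosstab. The column names and the handful of rows below are hypothetical placeholders rather than the article’s full dataset.

```python
import pandas as pd

# Hypothetical mini-dataset; replace with the full table of days.
df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy"],
    "Play":    ["No",    "No",    "Yes",      "Yes",   "No"],
})

# Counts of each (attribute value, target label) pair -- the frequency table.
print(pd.crosstab(df["Outlook"], df["Play"]))
```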

Next, calculate the entropy of each attribute.

Entropy calculation of outlook
Information Gain of Outlook.

Similarly, Information Gain is calculated for each attribute.
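Putting the two steps together, here is a minimal sketch of the attribute-entropy and Information Gain computation. The column names ("Play", "Outlook", etc.) are assumptions, and df is a pandas DataFrame holding the table of days.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def attribute_entropy(df, attribute, target="Play"):
    """Weighted average entropy of the target within each value of the attribute."""
    total = len(df)
    return sum((len(sub) / total) * entropy(sub[target].tolist())
               for _, sub in df.groupby(attribute))

def information_gain(df, attribute, target="Play"):
    """Entropy of the target minus the weighted entropy after splitting on the attribute."""
    return entropy(df[target].tolist()) - attribute_entropy(df, attribute, target)

# The attribute with the highest Information Gain becomes the root node, e.g.:
# best = max(["Outlook", "Temperature", "Humidity", "Windy"],
#            key=lambda f: information_gain(df, f))
```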

From the Information Gain values calculated above, we can conclude that the Outlook feature has the highest I.G. value, which means it gives the purest split (the least mixing of target labels), so we create the root node of the tree with Outlook.

Construction of tree

Let’s repeat the same process, now excluding Outlook.

From the above table, we can see there is no mixture of labels when the scenario is Overcast: the person always comes to play, so we can attach that outcome to the branch straight away.

Decision tree.

In the next step, we calculate the Information Gain of each remaining feature on the subset where the outlook is Sunny.

Information Gain

From the above calculation, the feature Windy has the highest I.G. value, so we extend the decision tree with that variable.

Construction of Decision Tree.

Next, we need to find the decision node for the case where the outlook is Rainy.

Information Gain calculation

Humidity has the highest Information Gain, so the Humidity feature is used as the decision node in the next step of the construction.

Final structure of decision tree.

Looking at the final structure of the decision tree, we can easily predict the outcome for various scenarios.
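As a sketch, the finished tree can be read off as a chain of if/else rules. The branch structure below follows the splits derived above (Outlook at the root, Windy under Sunny, Humidity under Rainy); apart from the Overcast branch, the leaf labels and the humidity threshold are illustrative assumptions, since the actual values come from the final figure.

```python
def will_friend_play(outlook, windy, humidity):
    """Walk the finished tree as nested if/else rules.

    Split order follows the article (Outlook -> Windy / Humidity). Except for
    the Overcast branch, the leaves and the humidity threshold are assumed
    purely for illustration.
    """
    if outlook == "Overcast":
        return "Yes"                                # pure subset: always plays
    if outlook == "Sunny":
        return "No" if windy else "Yes"             # assumed leaves
    if outlook == "Rainy":
        return "Yes" if humidity <= 75 else "No"    # assumed threshold and leaves
    return None

print(will_friend_play("Sunny", windy=False, humidity=80))  # "Yes" under these assumptions
```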

The Gini Index is used as the default method to construct a decision tree in many implementations.

Gini Index formula
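For reference, for a node whose samples fall into classes with proportions p_1, …, p_k, the Gini Index is

Gini = 1 - \sum_{i=1}^{k} p_i^2

A pure node has a Gini value of 0, and an evenly mixed two-class node has 0.5.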

The steps to build the tree using the Gini Index are the same as with entropy. Since we take the sum of the squared probabilities of each class rather than their log values, it is computationally faster than the entropy method.
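A minimal sketch of the Gini computation, mirroring the entropy helper above; note that it only squares the class proportions and involves no logarithm.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a sequence of class labels: 1 - sum(p_i^2)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(round(gini(["Yes"] * 9 + ["No"] * 5), 3))  # ~0.459 for the illustrative 9/5 split
```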

For a continuous target, the best split is calculated by the Standard Deviation Reduction (SDR) method.

SDR formula

This follows the same approach as entropy, except that the standard deviation of the target values is calculated instead of the entropy.
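In formula form, with S(T) the standard deviation of the target values and T_v the subset of rows where attribute X takes the value v,

SDR(T, X) = S(T) - \sum_{v \in \mathrm{values}(X)} \frac{|T_v|}{|T|} \, S(T_v)

The attribute with the largest reduction is chosen for the split, exactly as with Information Gain.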

Advantages:

  • A decision tree is very easy to implement and easy to understand.
  • Trees handle the data easily and make predictions quickly.

Disadvantages:

  • Trees generally do not have the same level of predictive accuracy as other regression and classification approaches.
  • They often overfit, which means the tree gives good results on the training data but bad results on the test data.

Conclusion:

  • The Decision Tree is a very old concept and the base for more advanced models such as Random Forest and Boosting methods. In order to understand all those algorithms, we need to understand how the decision tree works.
  • I hope you got a basic understanding of how the decision tree works. Thanks for reading the article.
  • Refer to this link for a Decision Tree implementation in Python; a minimal scikit-learn sketch also follows below.
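Independently of the linked implementation, here is a minimal scikit-learn sketch. The toy table, column names, and hyperparameters are illustrative assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical tennis-style table; replace with your own data.
df = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast"],
    "Windy":    [False,   True,    False,      True,    False,   True],
    "Humidity": [80,      75,      78,         70,      85,      65],
    "Play":     ["Yes",   "No",    "Yes",      "No",    "Yes",   "Yes"],
})

X = pd.get_dummies(df[["Outlook", "Windy", "Humidity"]])  # one-hot encode the categorical column
y = df["Play"]

# criterion="gini" is the default; criterion="entropy" switches to Information Gain.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.predict(X.iloc[:2]))
```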

Rishi Kumar · Nerd For Tech
I'm a passionate and disciplined Data Science enthusiast, working with Logitech as a Data Scientist.