Random Forest — a Sturdy Algorithm

Rishi Kumar · Nerd For Tech · Jul 23, 2021

A single Decision Tree has several problems: it is built by a greedy algorithm, it tends to overfit, its prediction accuracy can be low, and the calculations can become complex when there are many class labels. To overcome these problems we use Random Forest.

Random Forest Classifier

An ensemble is a machine learning paradigm in which several weak learners are combined to form a strong learner. There are several types of ensemble methods, such as bagging, boosting, and stacking; Random Forest falls under the bagging method.

Bootstrap Aggregating, or Bagging, is a machine learning meta-algorithm designed to improve the stability and accuracy of algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting.

Sampling the original dataset with replacement until the sample reaches the original size is called bootstrapping. A decision tree is built on each bootstrapped dataset, but only a random subset of variables is considered at each step. Bagging reduces variance and lowers the chance of overfitting. The data not included in a bootstrapped dataset is called the out-of-bag dataset, and the proportion of out-of-bag samples that are incorrectly classified is called the "out-of-bag error".
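Below is a minimal sketch of this sampling step using NumPy; the tiny dataset and the random seed are purely illustrative, not part of the original post.

```python
# A minimal sketch of bootstrapping: sample the rows with replacement
# up to the original size; rows that are never drawn form the
# out-of-bag (OOB) set. The toy "dataset" here is just row indices.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10
X = np.arange(n_samples)                                 # stand-in for a real feature matrix

boot_idx = rng.integers(0, n_samples, size=n_samples)    # sample with replacement
oob_idx = np.setdiff1d(np.arange(n_samples), boot_idx)   # rows never drawn

print("bootstrap sample indices:", boot_idx)
print("out-of-bag indices:      ", oob_idx)
```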

Random Forest improves on bagged trees with a small tweak that decorrelates the trees: at each split it considers only a random subset of the variables and chooses the splitting variable from that subset. This reduces the variance of the predictor and lowers the chance of overfitting. Random Forest is also useful for feature selection, since each decision tree is grown on different randomly selected features and those features can be scored, as sketched below.
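As a rough illustration of feature scoring, the sketch below fits a forest on a scikit-learn toy dataset and ranks features by their impurity-based importances; the dataset and parameter values are assumptions for the example.

```python
# A brief sketch of feature selection using a fitted forest's
# impurity-based feature_importances_ (toy dataset for illustration).
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Rank features by their mean importance across all trees
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```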

The forest aggregates the predictions from all of its estimators. For a classification problem, the final output is decided by majority voting; for a regression task, the mean of the individual predictions is returned as the final predicted value.
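A minimal sketch of this aggregation step (not the library's internal code; the per-tree predictions below are made up for illustration):

```python
# Majority vote for classification, mean for regression.
import numpy as np

# Suppose five trees predicted these class labels for one sample
tree_class_preds = np.array([1, 0, 1, 1, 0])
final_class = np.bincount(tree_class_preds).argmax()    # majority vote -> 1

# Suppose five trees predicted these values in a regression task
tree_reg_preds = np.array([3.2, 2.9, 3.5, 3.1, 3.0])
final_value = tree_reg_preds.mean()                     # average -> 3.14

print(final_class, final_value)
```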

Hyperparameter:

One of the main roles of a data scientist is to properly tune the hyperparameters available in the model. Random Forest exposes several hyperparameters; the main ones are:

→ n_estimators: The number of decision trees in the forest.

→ max_features: The number of features to consider when looking for the best split.

→ max_depth: The maximum depth of the tree.

→ min_samples_split: The minimum number of samples required to split an internal node.

→ min_samples_leaf: The minimum number of samples required to be at a leaf node.

There are many more hyperparameters available; refer to the documentation to learn more about the hyperparameters of Random Forest.

For classification problems, the number of features considered at each split is typically m = sqrt(p), where p is the total number of features.

For regression problems, the number of features considered at each split is typically m = p/3.
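The sketch below shows how these hyperparameters can be set with scikit-learn's RandomForestClassifier; the dataset and the specific values are illustrative assumptions, not recommended defaults.

```python
# A short example of setting the hyperparameters listed above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(
    n_estimators=200,        # number of trees in the forest
    max_features="sqrt",     # m = sqrt(p), the usual choice for classification
    max_depth=8,             # cap tree depth to control complexity
    min_samples_split=4,     # minimum samples needed to split a node
    min_samples_leaf=2,      # minimum samples required at a leaf
    random_state=42,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```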

The performance of each model is tested on its out-of-bag samples: every tree in the forest is evaluated on the samples it never saw during training, and averaging these results provides an estimated accuracy of the bagged model. This estimate is often called the OOB estimate of performance, and it correlates well with cross-validation estimates.
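In scikit-learn this estimate is exposed through oob_score_; a minimal sketch, assuming the same toy dataset as in the example above:

```python
# With oob_score=True, each sample is scored only by the trees that
# did not see it during training, and the results are averaged.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

clf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,          # must be True for OOB scoring
    oob_score=True,          # evaluate on out-of-bag samples
    random_state=42,
).fit(X, y)

print("OOB estimate of accuracy:", clf.oob_score_)
print("OOB error:", 1 - clf.oob_score_)
```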

Advantages:

  • It is much less prone to overfitting than a single decision tree.
  • It's one of the most accurate learning algorithms available.
  • No feature scaling is required.
  • It runs efficiently on large datasets.

Disadvantages:

  • Feature importance is biased toward features with many categories.
  • It takes more time to train and predict than a single decision tree.

Conclusion:

  • Random Forests are an effective tool for prediction.
  • Forests give results competitive with boosting and adaptive bagging, yet do not progressively change the training set.
  • Random inputs and random features produce good results in prediction.
  • It requires very little feature engineering.

Refer to this GitHub link to learn about the practical implementation.
