Evaluating Performance - Classification

Rishi Kumar · Published in Nerd For Tech · Jul 10, 2021


Error metrics to evaluate classification tasks.

The key classification metrics we need to know are:

  1. Accuracy
  2. Recall
  3. Precision
  4. F1-Score

Typically, in any classification task your model can achieve only one of two results:

→ Either your model was correct in its prediction.

→ or your model was incorrect in its prediction.

For the purpose of explaining the metrics, let’s take a binary classification scenario, where we have only two classes. The same process can be extended to multiple classes.

I’m going to use the scenario of predicting whether an image is a Dog or a Cat.

We feed a test image to the trained model and compare the predicted output with the test image’s label to decide whether the prediction is correct or wrong. We repeat this process for all the images in our X_test data. At the end, we will have a count of correct matches and incorrect matches. The key realization we need to make is that in the real world not all correct and incorrect matches hold equal value. Also, in the real world a single metric won’t tell the complete story, which is why the four metrics mentioned above are used to evaluate the model. We can organize the predicted values compared to the real values in a confusion matrix.
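As a minimal sketch of this evaluation loop (assuming a scikit-learn-style `model` with a `predict` method plus `X_test` and `y_test` arrays, none of which come from this article’s code), counting correct and incorrect matches could look like this:

```python
# Minimal sketch, not the article's code: compare predictions on X_test
# with the true labels and count correct vs. incorrect matches.
# `model`, `X_test`, and `y_test` are assumed to already exist.
y_pred = model.predict(X_test)          # predicted class for every test image

correct = sum(p == t for p, t in zip(y_pred, y_test))
incorrect = len(y_test) - correct

print(f"correct: {correct}, incorrect: {incorrect}")
```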

Accuracy:

Accuracy in classification problems is the number of correct predictions made by a model divided by the total number of predictions. For example, if the X_test set has 100 images and our model correctly predicted 80 images, then we have 80/100, 0.8 or 80 % accuracy.

or, more simply:

Accuracy = Number of Correct Predictions / Total Number of Predictions

Accuracy is useful when target classes are well balanced. In our example, we would have roughly the same amount of cat images as we have dog images.

Accuracy is not a good choice with unbalanced classes. Imagine we had 99 dog images and 1 image of a cat. If our model simply always predicted ‘dog’, we would get 99 % accuracy. In this situation we’ll want to understand recall and precision.
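A quick sketch of that imbalanced scenario (assuming scikit-learn is available; the labels below are made up to match the 99-dog / 1-cat example):

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced test set: 99 dog images and 1 cat image.
y_true = ["dog"] * 99 + ["cat"]
y_pred = ["dog"] * 100          # a "model" that always predicts dog

print(accuracy_score(y_true, y_pred))   # 0.99 -> 99 % accuracy, yet it never finds the cat
```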

Recall (aka Sensitivity):

The ability of a model to find all the relevant cases within a dataset. In simple words: when it’s actually ‘yes’, how often does the model predict ‘yes’?

The precise definition of recall is the number of true positives divided by the number of true positives plus the number of false negatives.

or, simply:

Recall = True Positives / (True Positives + False Negatives)
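Continuing the same made-up always-dog example (a sketch assuming scikit-learn, with ‘cat’ treated as the positive class):

```python
from sklearn.metrics import recall_score

# Same hypothetical labels as above: 99 dogs, 1 cat, model always says "dog".
y_true = ["dog"] * 99 + ["cat"]
y_pred = ["dog"] * 100

# Recall = TP / (TP + FN) = 0 / (0 + 1) = 0.0: the model never finds the cat.
print(recall_score(y_true, y_pred, pos_label="cat"))
```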

Precision:

The ability of a classification model to identify only the relevant data points. In simple words: when the model predicts ‘yes’, how often is it correct?

Precision is defined as the number of true positives divided by the number of true positives plus the number of false positives.

or, simply:

Precision = True Positives / (True Positives + False Positives)
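And the matching sketch for precision on the same hypothetical labels (again assuming scikit-learn; `zero_division=0` just silences the warning when the model never predicts the positive class):

```python
from sklearn.metrics import precision_score

y_true = ["dog"] * 99 + ["cat"]
y_pred = ["dog"] * 100

# Precision = TP / (TP + FP); the always-dog model never predicts "cat",
# so there are no positive predictions at all and the score is reported as 0.0.
print(precision_score(y_true, y_pred, pos_label="cat", zero_division=0))
```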

Often we have a trade-off between recall and precision. While recall expresses the ability to find all relevant instances in a dataset, precision expresses the proportion of the data points our model says are relevant that actually are relevant.
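One way to see this trade-off (an illustrative sketch, assuming scikit-learn and some made-up predicted probabilities, neither of which comes from the article) is to sweep the model’s decision threshold and watch how precision and recall change against each other:

```python
from sklearn.metrics import precision_recall_curve

# Hypothetical predicted probabilities for the positive class (1 = cat).
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.60, 0.55, 0.90]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```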

F1-Score:

In cases where we want to find an optimal blend of precision and recall, we can combine the two metrics using what is called the F1 Score.

The F1 Score is the harmonic mean of precision and recall, taking both metrics into account in the following equation:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The reason we use the harmonic mean instead of a simple average is that it punishes extreme values. A classifier with a precision of 1.0 and a recall of 0.0 has a simple average of 0.5 but an F1 score of 0.
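A tiny sketch of that extreme case (plain Python, nothing from the article’s code):

```python
# Extreme case from the text: precision = 1.0, recall = 0.0.
precision, recall = 1.0, 0.0

simple_average = (precision + recall) / 2                                             # 0.5
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0   # 0.0

print(simple_average, f1)
```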

We can also view all correctly classified versus incorrectly classified images in the form of a confusion matrix.

A confusion matrix might look confusing, but it won’t after you fully read this article. To explain it, let’s use a disease-screening example:

  1. A true positive is someone actually having the disease and the model correctly predicting that they have it.
  2. A true negative is someone not having the disease and the model correctly predicting that they do not have it.
  3. A false positive is when the person does not have the disease but the model predicts that they do; we are falsely saying they are positive for the disease.
  4. A false negative is essentially the opposite: the person does have the disease, but the model reports that they do not.

False positives and false negatives are also known as Type I and Type II errors in statistics.
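As a small sketch of how those four counts are usually read off in code (assuming scikit-learn and hypothetical screening labels, where 1 = has the disease):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical screening labels: 1 = has the disease, 0 = does not.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

# For binary labels, sklearn's matrix unravels as TN, FP, FN, TP
# (rows are true classes, columns are predicted classes).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```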

Conclusion:

The main point to remember with the confusion matrix and the various calculated metrics is that they are all fundamentally ways of comparing predicted values with true values. What constitutes ‘good’ metrics will really depend on the specific situation.
