Accuracy, Precision, Recall, and F-Score

How do you measure how well your machine learning model is doing? There are four main metrics for evaluating the performance of a classification model: accuracy, precision, recall, and F-Score (or F Score). In this post, we’ll cover how to calculate each of these metrics and when to use them.

Before we get into learning what all of these metrics are, we first have to cover the confusion matrix. When your machine learning model makes a prediction, it is either right or wrong. A confusion matrix keeps track of how often the model predicted each class correctly or incorrectly.

Table of Contents:

  • Confusion Matrices for Accuracy, Precision, Recall, and F Score
    • Confusion Matrix Example
  • Accuracy as a Machine Learning Model Metric
    • How to Measure Model Accuracy
    • When to Use Accuracy as a Metric
  • Precision as a Machine Learning Model Metric
    • How to Measure Model Precision
    • When to Use Precision as a Metric
  • Recall as a Machine Learning Model Metric
    • How to Measure Model Recall
    • When to Use Recall as a Metric
  • F-Score as a Machine Learning Model Metric
    • How to Measure Model F Score 
    • When to Use F Score as a Metric
  • Summary of Accuracy, Precision, Recall, and F-Score

Confusion Matrices for Accuracy, Precision, Recall, and F Score

Before we dive into confusion matrices, I just want to say that I think these are poorly named. To create a confusion matrix, you draw a two-by-two grid comparing the predicted and actual outcomes. An illustrated image is below, along with the term used for each of the boxes.

[Image: Confusion Matrix + Terms]

A common question that arises here is – what happens if my model predicts multiple classes? If your model predicts more than one class, the “negative” or “false” prediction just becomes “not the class we’re predicting for”.
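To make that concrete, here is a minimal Python sketch (with made-up labels and a hypothetical class of interest, "cat") of collapsing a multi-class problem into a one-vs-rest binary problem:

```python
# A minimal sketch: pick one class of interest ("cat") and collapse
# everything else into the "negative" class.
labels = ["cat", "dog", "bird", "cat", "dog"]
predictions = ["cat", "cat", "bird", "dog", "dog"]

binary_labels = [1 if y == "cat" else 0 for y in labels]
binary_predictions = [1 if y == "cat" else 0 for y in predictions]

print(binary_labels)       # [1, 0, 0, 1, 0]
print(binary_predictions)  # [1, 1, 0, 0, 0]
```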

You may also see the term False Positive referred to as a Type 1 Error and the term False Negative referred to as a Type 2 Error. These are the same things with different names.

Confusion Matrix Example

Let’s take a look at a contrived confusion matrix example so we can get a working knowledge of how these matrices are used. Let’s assume we have a model that predicts two classes and makes 20 predictions in total.

What would a confusion matrix look like if the truth was that there were 10 positives and 10 negatives, and our model predicted 2 positives as negatives and 1 negative as a positive? To fill in the confusion matrix, we’d start with the errors we were given: predicting 2 actual positives as negatives gives us 2 false negatives, and predicting 1 actual negative as a positive gives us 1 false positive. Then we fill in the remaining number in each row, which gives us 8 true positives and 9 true negatives.

[Image: Example Confusion Matrix]
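If you prefer code to pictures, here is a minimal Python sketch (no libraries assumed) that builds those four counts from label lists constructed to match the contrived example:

```python
# Contrived example: 10 actual positives, 10 actual negatives.
# The model predicts 2 of the positives as negatives and 1 of the
# negatives as a positive.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 9

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

print(tp, tn, fp, fn)  # 8 9 1 2
```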

Accuracy as a Machine Learning Model Metric

Accuracy measures how often your model “gets it right” compared to the total number of predictions. In terms of metrics, it’s the total fraction of predictions that match the actual values.

How to Measure Model Accuracy

Using a confusion matrix, accuracy can be calculated as the number of true positives + the number of true negatives divided by the total number of predictions made.

Based on the example confusion matrix above, the accuracy of our example model would be (8+9)/20 or 85%. Not a great model, but we’re just using it as an example to illustrate how to measure accuracy, precision, recall, and F score.
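As a quick sanity check, here is the same calculation as a minimal Python sketch, using the cell counts from the example confusion matrix above:

```python
# Accuracy = (true positives + true negatives) / total predictions,
# using the counts from the example confusion matrix.
tp, tn, fp, fn = 8, 9, 1, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```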

When to Use Accuracy as a Metric

Knowing how to measure accuracy is just as important as knowing when to use accuracy as a metric. In many cases, accuracy is a perfectly reasonable metric, but what if I were predicting whether or not there would be a tsunami in Seattle on any given day? I could just predict “no” every day and be right over 99% of the time. However, that would be a completely useless model.

Accuracy is a good metric to use if you are looking at a sample set that has closely balanced classes. If you have highly unbalanced classes, optimizing for accuracy encourages the model to simply predict the most common class. A concrete example of a balanced problem is a neural network that predicts the MNIST digits, where the ten digit classes appear in roughly equal numbers.
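Here is a minimal sketch of the tsunami-style failure mode, using made-up numbers: a “model” that always predicts the majority class scores 99% accuracy while never catching the event we actually care about.

```python
# 99 ordinary days and 1 tsunami day; the "model" always predicts "no tsunami".
y_true = [0] * 99 + [1]
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99 -- high accuracy, zero ability to detect the event
```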

Precision as a Machine Learning Model Metric

Precision and accuracy are two terms that are often mixed up in everyday speech. In regular speech, accuracy refers to how close your prediction is to the actual truth, and precision refers to how close your predictions are to each other. In machine learning metrics, precision is measured over the positive predictions: it is the fraction of positive predictions that turn out to be correct.

How to Measure Model Precision

Unlike measuring your dart throwing precision, we’re not going to ask the machine to make the same prediction multiple times. We’re going to measure using the numbers we have in our confusion matrix. A machine learning model’s precision is the number of true positives divided by the total number of positive predictions.

Based on the confusion matrix we have above, there are 8 true positives. We can find that we have 9 positive predictions by summing the number of true positives and false positives. This gives us a precision of 8/9 or about 88.89%.
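In code, a minimal sketch using the same counts as before looks like this:

```python
# Precision = true positives / (true positives + false positives),
# using the counts from the example confusion matrix above.
tp, fp = 8, 1

precision = tp / (tp + fp)
print(precision)  # 0.888... (about 88.89%)
```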

When to Use Precision as a Metric

The main strength of precision as a metric is that it measures how confident we can be that a positive prediction is correct. If we need to be highly confident that a predicted positive really is positive, then we should prioritize precision as a metric. A real life example of this would be a model that predicts how safe a submarine would be in deep sea conditions. We want to be as sure as possible that our sailors will return alive.

Recall as a Machine Learning Model Metric

While accuracy and precision are two commonly used words that have slightly altered meanings when applied to machine learning model metrics, recall isn’t as commonly used as a measuring term. Usually we hear recall in terms of cars or other products in the market being “recalled”.

How to Measure Model Recall

The name “recall” suggests that we’re measuring how well our model “remembers” something. In this case, that something is the actual set of positive values. As a machine learning metric, recall is measured as the total number of true positives divided by the total number of actual positive values.

In our example above, there are 8 true positives and 10 actually positive values. This gives us a recall of 8/10 or 80%. This is not a great recall score, but once again, we are just using the numbers in the given confusion matrix for example purposes.
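The same calculation as a minimal Python sketch, with the example counts:

```python
# Recall = true positives / (true positives + false negatives),
# using the counts from the example confusion matrix above.
tp, fn = 8, 2

recall = tp / (tp + fn)
print(recall)  # 0.8
```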

When to Use Recall as a Metric

If we were using recall as a metric with the confusion matrix we have, we would have to continue tweaking our model. 80% is a pretty abysmal recall score.

Recall is a good metric to use when we are most concerned with catching as many of the actual positives as possible. A real life example would be a machine learning model that screens medical images for early stage cancer, where missing a true case (a false negative) is far more costly than a false alarm.

F-Score as a Machine Learning Model Metric

Unlike accuracy, precision, or recall, the name F-Score (also called F1-Score) doesn’t really give any hints as to how to calculate it or what it represents.

The F-Score is the harmonic mean of precision and recall. The harmonic mean of two numbers strikes a balance between them. We calculate the harmonic mean of a and b as 2*a*b/(a+b).
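As a quick check on that formula, here is a small sketch using our example precision and recall values; Python’s standard library harmonic_mean gives the same answer:

```python
from statistics import harmonic_mean

# Harmonic mean of two numbers: 2*a*b / (a + b).
a, b = 8 / 9, 8 / 10  # example precision and recall from above

print(2 * a * b / (a + b))    # about 0.842
print(harmonic_mean([a, b]))  # same value
```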

How to Measure Model F Score 

Plugging precision and recall into the formula above results in 2 * precision * recall / (precision + recall). When we turn this into the stats we have in the confusion matrix, we need to use all of the numbers except for true negatives. This harmonic mean simplifies to 2 * the number of true positives/(2 * the number of true positives + the number of false negatives + the number of false positives).

This simplified formula using only the stats we have in the confusion matrix shows us that F Score is a comparison of true positives to the total number of positives, both predicted and actual. Our example F Score would be 2*8/(2*8+2+1)=16/19 or 84%. Better than recall, but worse than precision. As expected.
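A minimal sketch of both forms of the calculation, using the example counts, confirms they agree:

```python
# F score from the confusion-matrix counts, plus the precision/recall
# form as a cross-check; both use the example counts from above.
tp, fp, fn = 8, 1, 2

f_score = (2 * tp) / (2 * tp + fn + fp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f_score)                                        # 0.842...
print(2 * precision * recall / (precision + recall))  # same value
```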

When to Use F Score as a Metric

From the formula, we know that F-Score is a balance of precision and recall. We use F Score in cases where we need to have both good precision and good recall. This points to using F Score when we are concerned with getting the highest number of true positives and we want to be as sure as possible that our prediction is correct. 

A real life example would be investing: you want to find as many winning investments as possible, and you want to be as sure as possible that each pick really is a winner.

Summary of Accuracy, Precision, Recall, and F Score

Accuracy, Precision, Recall, and F Score are the four big machine learning model metrics. It’s impossible to say which of these metrics is “best” because each of them has its own use case.

Accuracy is best used when we want the most number of predictions that match the actual values across balanced classes. Precision is best used when we want to be as sure as possible that our predictions are correct. Recall is best used when we want to maximize how often we correctly predict positives. Finally, F-Score is a combination of precision and recall. It is used when we want to be as sure as possible about our predictions while maximizing the number of correctly predicted positives.

