
The F1-Score

Reading time: ~10 min

By now, you've seen the constant tug-of-war between precision and recall. Wouldn't it be incredibly convenient if there was a single score that balanced both of them?

As you probably guessed, such a metric exists! It is called the F1-score. Instead of having to juggle two different numbers, the F1-score gives you a single percentage that tells you how well your AI is performing overall, making sure it isn't secretly failing at either precision or recall.

Because of this balance, the F1-score is widely used as a single headline metric for ranking different AI models to see which one genuinely performs the best.

So, how exactly does this magical balancing act work?

The Intuition

Think of the F1-score as a strict teacher grading a group project. The final grade isn't just an average; you only get an A if both partners did a great job.

If precision is perfect (100%) but recall is terrible (0%), a normal average would give a passing score of 50%. The F1-score, however, heavily penalizes this imbalance and drags the final score much closer to 0%. The only way to achieve a high F1-score is to have strong performance in both precision and recall!

The Mathematics

For those curious about the underlying mechanics, the F1-score is calculated using what is called the harmonic mean. Because we are dealing with ratios, the harmonic mean ensures that lower values heavily pull down the final average.

F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
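The formula above is easy to translate into code. Here is a minimal sketch (the function name `f1_score` is our own choice, not from any particular library) that contrasts the harmonic mean with a plain arithmetic average:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (defined as 0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A plain average hides a total failure in one metric:
print((1.0 + 0.0) / 2)       # arithmetic mean: 0.5
print(f1_score(1.0, 0.0))    # F1-score: 0.0 — the imbalance is punished
print(f1_score(0.8, 0.6))    # balanced performance: ~0.686
```

Notice how precision of 100% cannot compensate for recall of 0%: the F1-score collapses all the way to zero.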

By substituting our formulas for precision and recall, we get an alternative, equivalent formula based purely on the original confusion matrix outcomes:

F1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}
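We can verify that the two formulas agree by computing the F1-score both ways from the same (made-up) confusion-matrix counts:

```python
def f1_from_rates(precision: float, recall: float) -> float:
    """F1 from precision and recall (the harmonic-mean form)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 computed directly from confusion-matrix outcomes."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical counts for illustration:
tp, fp, fn = 40, 10, 20
precision = tp / (tp + fp)   # 0.8
recall = tp / (tp + fn)      # ~0.667

print(f1_from_rates(precision, recall))  # ~0.727
print(f1_from_counts(tp, fp, fn))        # ~0.727 — same value
```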

Let's look at the F1-score in action. You can try adjusting the values yourself to see how a terrible score in just one metric drastically drags the entire F1-score down.

Does it have any flaws?

Beware the True Negatives!

The F1-score is a fantastic tool, but it has a blind spot: it completely ignores True Negatives.

Because the formula only relies on True Positives (TP), False Positives (FP), and False Negatives (FN), it never asks how many times the AI correctly ignored something negative. If knowing the number of correctly identified healthy patients is highly important to your case, you might need to look beyond the F1-score.
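This blind spot is easy to demonstrate: two models with identical TP, FP, and FN counts but wildly different True Negative counts (the numbers below are made up) receive the exact same F1-score, even though their accuracy differs dramatically:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (F1-score, accuracy) for one confusion matrix."""
    f1 = 2 * tp / (2 * tp + fp + fn)            # TN never appears here
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # TN matters here
    return f1, accuracy

# Same TP/FP/FN, very different TN:
print(metrics(30, 5, 10, 5))     # F1 = 0.8, accuracy = 0.70
print(metrics(30, 5, 10, 5000))  # F1 = 0.8, accuracy ≈ 0.997
```

The F1-score is identical in both cases because True Negatives simply never enter the formula.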

Additionally, the F1-score treats precision and recall as equally important. In real-world scenarios—like our AI doctor trying to catch rare diseases—a False Negative might be far more dangerous than a False Positive. In those specific cases where one error is worse than the other, relying entirely on the F1-score can be misleading.

Sina