🎓 Learning corner

Classification metrics

Use these metrics when a system decides whether something belongs to a class: cat or not cat, name or not name, sensitive or not sensitive.

Background

Imagine a system that looks at 100 photos and predicts either Cat or Not Cat. The four boxes below are the starting point for precision, recall, and F1.

Cat detector confusion matrix (100 photos):

                Predicted Cat       Predicted Not Cat
Real cat        20 detected cats    10 missed cats
Real non-cat    5 false alarms      65 correct non-cats

Correct decisions

20 real cats were detected (true positives, TP), and 65 real non-cats were correctly rejected (true negatives, TN).

Mistakes

5 non-cats were false alarms (false positives, FP), and 10 real cats were missed (false negatives, FN).
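To make the four boxes concrete, here is a minimal Python sketch that counts them from two lists of labels. The function name `confusion_counts` and the list layout are assumptions made for illustration; only the counts (20, 5, 10, 65) come from the example above.

```python
def confusion_counts(actual, predicted):
    """Count the four boxes for boolean labels (True means Cat).

    Hypothetical helper for illustration; returns (tp, fp, fn, tn).
    """
    tp = fp = fn = tn = 0
    for is_cat, said_cat in zip(actual, predicted):
        if is_cat and said_cat:
            tp += 1  # real cat, detected
        elif not is_cat and said_cat:
            fp += 1  # non-cat flagged as cat: false alarm
        elif is_cat and not said_cat:
            fn += 1  # real cat, missed
        else:
            tn += 1  # non-cat, correctly rejected
    return tp, fp, fn, tn


# Rebuild the 100-photo example: 20 detected cats, 10 missed cats,
# 5 false alarms, 65 correctly rejected non-cats.
actual = [True] * 20 + [True] * 10 + [False] * 5 + [False] * 65
predicted = [True] * 20 + [False] * 10 + [True] * 5 + [False] * 65

counts = confusion_counts(actual, predicted)
print(counts)  # (20, 5, 10, 65)
```

The four counts always add up to the number of photos, here 100.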

🎯 Precision

Precision looks only at the items the system selected. In the cat example, the system predicted 25 photos as cats. Of those, 20 really were cats and 5 were false alarms.

Precision example showing 20 correct cat predictions and 5 false alarms among 25 predicted cats.
Precision = TP / (TP + FP)
Precision = 20 / (20 + 5) = 80%

High precision means: when the system makes a claim, users can usually trust it. This matters when false alarms are costly.
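The precision calculation can be sketched in a few lines of Python. The function name `precision` and the choice to return 0.0 when nothing was selected are assumptions for this sketch, not a fixed convention.

```python
def precision(tp: int, fp: int) -> float:
    """Share of selected items that were correct: TP / (TP + FP)."""
    selected = tp + fp
    if selected == 0:
        # Assumption: report 0.0 when the system selected nothing,
        # since the ratio is undefined in that case.
        return 0.0
    return tp / selected


# Cat example: 20 correct cat predictions, 5 false alarms.
cat_precision = precision(tp=20, fp=5)
print(f"{cat_precision:.1%}")  # 80.0%
```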

🔍 Recall

Recall looks at all the truly relevant items. In the cat example, there were 30 real cats. The system found 20 and missed 10.

Recall example showing 20 detected cats and 10 missed cats among 30 real cat photos.
Recall = TP / (TP + FN)
Recall = 20 / (20 + 10) ≈ 66.7%

High recall means: the system does not miss many of the things we care about. This matters when missing an item is more harmful than reviewing an extra false alarm.
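The recall calculation follows the same pattern, with missed items in the denominator instead of false alarms. As before, the function name `recall` and the 0.0 fallback for an empty denominator are assumptions for this sketch.

```python
def recall(tp: int, fn: int) -> float:
    """Share of truly relevant items that were found: TP / (TP + FN)."""
    relevant = tp + fn
    if relevant == 0:
        # Assumption: report 0.0 when there were no relevant items,
        # since the ratio is undefined in that case.
        return 0.0
    return tp / relevant


# Cat example: 20 detected cats, 10 missed cats.
cat_recall = recall(tp=20, fn=10)
print(f"{cat_recall:.1%}")  # 66.7%
```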

⚖️ F1 Score

F1 combines precision and recall into a single score: their harmonic mean, which is pulled toward the lower of the two values. It is high only when the system is both careful and complete.

F1 score illustration balancing 80 percent precision and 66.7 percent recall to produce F1 of 72.7 percent.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 = 2 × (0.80 × 0.667) / (0.80 + 0.667) ≈ 0.727 = 72.7%

F1 is useful when both kinds of mistakes matter: false alarms waste time, but missed items are also risky.
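As a rough sketch, F1 can be computed either from precision and recall or directly from the counts; the two forms are algebraically the same, since 2 × P × R / (P + R) simplifies to 2 × TP / (2 × TP + FP + FN). The function names `f1_score` and `f1_from_counts` and the 0.0 fallbacks are assumptions for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: report 0.0 when both inputs are zero
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Same score computed from the counts: 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Cat example: precision 20/25, recall 20/30.
cat_f1 = f1_score(20 / 25, 20 / 30)
print(f"{cat_f1:.1%}")                      # 72.7%
print(f"{f1_from_counts(20, 5, 10):.1%}")   # 72.7%

# A lopsided system: perfect precision but only 10% recall.
lopsided_f1 = f1_score(1.0, 0.1)
print(f"{lopsided_f1:.1%}")                 # 18.2%
```

The last line shows why F1 is stricter than a plain average: precision of 100% and recall of 10% average to 55%, but the F1 score is only about 18%.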