
Why Do We Need the Precision-Recall Curve?

 


What are precision and recall, and why are they used?

For binary classification problems, the task is usually to differentiate the positive class from the negative class. When a binary classifier makes predictions, there will always be true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). A classifier that produces more true positives and true negatives (in other words, fewer false positives and false negatives) has better predictive power and incurs less business cost.

As a data scientist, I usually start by defining the cost of a false negative and the cost of a false positive from a business standpoint, so that I can optimize the threshold accordingly and balance the trade-off.
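To make that concrete, here is a minimal sketch of that threshold search. The costs, labels, and predicted probabilities below are all made-up toy values, not numbers from any real project:

```python
# Toy sketch: cost_fp, cost_fn, y_true, and y_prob are all made-up values.
import numpy as np

cost_fp = 1.0    # assumed business cost of one false positive
cost_fn = 10.0   # assumed business cost of one false negative

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])                          # toy labels
y_prob = np.array([0.10, 0.30, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90])  # toy predicted probabilities

thresholds = np.linspace(0.0, 1.0, 101)
costs = []
for t in thresholds:
    y_pred = (y_prob >= t).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives at this threshold
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives at this threshold
    costs.append(cost_fp * fp + cost_fn * fn)

best_t = thresholds[int(np.argmin(costs))]
print(f"threshold with the lowest total cost: {best_t:.2f}")
```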

In a nutshell, both precision and recall are metrics for evaluating predictive performance. Below are the detailed definitions of precision and recall; a small code sketch follows the list.

  • precision = TP/(TP + FP): it tells you how sure you can be of your positive predictions. For example, a precision score of 0.8 means that among all the data points you predicted as positive, 80% are true positives.
    • You care more about precision when false positives are intolerable. For example, emails whose spam scores exceed the threshold are treated as spam, but mistakenly labeling a non-spam message as spam (a false positive) hurts users because they may miss important messages. (To increase precision and reduce false positives, we can raise the threshold.)
  • recall = TP/(TP + FN): it tells you what proportion of actual positives are identified. For example, a recall score of 0.8 means that among all the positive data points, you correctly labeled 80% as positive.
    • You care more about recall when false negatives are intolerable. For example, patients whose predicted risk exceeds the threshold are treated as having cancer, but mistakenly labeling a patient who has cancer as healthy (a false negative) is harmful because they can't receive immediate treatment and may die of the disease. (To increase recall and reduce false negatives, we can lower the threshold.)
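Here is the promised sketch of these formulas in action, using scikit-learn and the same kind of toy labels and probabilities as above (none of these numbers come from a real model). It also shows how moving the threshold trades precision against recall:

```python
# Toy data again; the thresholds are arbitrary illustration points.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.30, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90])

for t in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= t).astype(int)
    p = precision_score(y_true, y_pred)   # TP / (TP + FP)
    r = recall_score(y_true, y_pred)      # TP / (TP + FN)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")

# Raising the threshold pushes precision up (fewer false positives),
# lowering it pushes recall up (fewer false negatives).
```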

How’s the precision-recall curve different from the ROC curve, and when would you use one over the other?

A precision-recall curve shows the relationship between precision and recall at every possible threshold, while the ROC curve shows the relationship between recall (true positive rate) and FPR (false positive rate) at every possible threshold.

Graphically, the closer the precision-recall curve is to the upper right corner, the better the model. By contrast, the closer the ROC curve is to the upper left corner, the better the model.
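If you want to plot both curves for your own classifier, a minimal sketch with scikit-learn and matplotlib looks like this. The y_true and y_prob arrays stand for your test labels and predicted probabilities; the toy values here are only placeholders:

```python
# Placeholder data; in practice y_true/y_prob come from your model on a test set.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, roc_curve, auc

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.30, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90])

precision, recall, _ = precision_recall_curve(y_true, y_prob)
fpr, tpr, _ = roc_curve(y_true, y_prob)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(recall, precision)
ax1.set(xlabel="recall", ylabel="precision",
        title=f"PR curve (AUC = {auc(recall, precision):.3f})")
ax2.plot(fpr, tpr)
ax2.set(xlabel="false positive rate (FPR)", ylabel="recall (TPR)",
        title=f"ROC curve (AUC = {auc(fpr, tpr):.3f})")
plt.tight_layout()
plt.show()
```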

If you have a severely imbalanced dataset with only a few positive examples, the ROC curve can be misleading: it tends to overestimate the model's predictive power.

For example, I built a logistic regression model (which I will proudly call OnePunch) to classify fraudulent transactions, and the dataset was severely skewed, with only 1% of the observations in the "fraud" (positive) class. When I used both the ROC curve and the precision-recall curve to evaluate OnePunch, they gave me totally different pictures of model performance. I've displayed these two curves below, along with the area under each curve.

[Figure: ROC curve & PR curve for the same model]

It's very clear that the ROC curve gives OnePunch a very high rating: the curve hugs the upper left corner and its AUC is 0.879, a pretty high score. The PR curve, however, does not look good, and its AUC is only 0.609, so the PR curve is telling me that OnePunch is far less impressive than the ROC curve suggests.
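The original fraud data obviously isn't shared here, but you can reproduce the same gap on a synthetic dataset with roughly 1% positives (a sketch only; the exact AUC numbers will differ from the ones above):

```python
# Synthetic stand-in for the fraud data: ~1% positive class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.99, 0.01], flip_y=0.02, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

print("ROC AUC:", roc_auc_score(y_te, prob))                      # typically looks high
print("PR  AUC (average precision):", average_precision_score(y_te, prob))  # usually much lower
```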

Why is that?

It is because precision and recall both focus on the positive (minority) class and never use the huge number of true negatives, whereas the ROC curve's false positive rate, FP/(FP + TN), is kept artificially small by that large TN count. So the precision-recall curve is not inflated by class imbalance, and it more truthfully reflects the model's ability to find the rare positive class.

In short, use the precision-recall curve rather than the ROC curve to evaluate your model when you have a severely imbalanced dataset. When the dataset is roughly balanced, the ROC curve is usually enough.
