
Why Do We Need the Precision-Recall Curve?

 


What are precision and recall, and why are they used?

For binary classification problems, the task is usually to differentiate the positive class from the negative class. When a binary classifier makes predictions, there will always be true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). A classifier that produces more true positives and true negatives (in other words, fewer false positives and false negatives) has better predictive power and incurs less business cost.

As a data scientist, I usually start by defining the cost of a false negative and the cost of a false positive from a business standpoint, so that I can optimize the threshold accordingly and balance the trade-off.
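To make that concrete, here is a minimal sketch of that threshold search. The costs, labels, and predicted probabilities below are all made-up toy values, not numbers from any real project:

```python
# Toy sketch: cost_fp, cost_fn, y_true, and y_prob are all made-up values.
import numpy as np

cost_fp = 1.0    # assumed business cost of one false positive
cost_fn = 10.0   # assumed business cost of one false negative

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])                          # toy labels
y_prob = np.array([0.10, 0.30, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90])  # toy predicted probabilities

thresholds = np.linspace(0.0, 1.0, 101)
costs = []
for t in thresholds:
    y_pred = (y_prob >= t).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives at this threshold
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives at this threshold
    costs.append(cost_fp * fp + cost_fn * fn)

best_t = thresholds[int(np.argmin(costs))]
print(f"threshold with the lowest total cost: {best_t:.2f}")
```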

In a nutshell, both precision and recall are metrics for evaluating predictive performance. Below are the detailed definitions of precision and recall; a small code sketch follows the list.

  • precision = TP/(TP + FP): it tells you how sure you can be of your positive predictions. For example, a precision score of 0.8 means that among all the data points you predicted as positive, 80% are true positives.
    • You care more about precision when false positives are intolerable. For example, emails whose spam scores exceed the threshold are treated as spam, but mistakenly labeling a non-spam message as spam (a false positive) hurts users because they may miss important messages. (To increase precision and reduce false positives, we can raise the threshold.)
  • recall = TP/(TP + FN): it tells you what proportion of actual positives are identified. For example, a recall score of 0.8 means that among all the positive data points, you correctly labeled 80% as positive.
    • You care more about recall when false negatives are intolerable. For example, patients whose predicted risk exceeds the threshold are treated as having cancer, but mistakenly labeling a patient who has cancer as healthy (a false negative) is harmful because they can't receive immediate treatment and may die of the disease. (To increase recall and reduce false negatives, we can lower the threshold.)
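Here is the promised sketch of these formulas in action, using scikit-learn and the same kind of toy labels and probabilities as above (none of these numbers come from a real model). It also shows how moving the threshold trades precision against recall:

```python
# Toy data again; the thresholds are arbitrary illustration points.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.30, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90])

for t in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= t).astype(int)
    p = precision_score(y_true, y_pred)   # TP / (TP + FP)
    r = recall_score(y_true, y_pred)      # TP / (TP + FN)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")

# Raising the threshold pushes precision up (fewer false positives),
# lowering it pushes recall up (fewer false negatives).
```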

How’s the precision-recall curve different from the ROC curve, and when would you use one over the other?

A precision-recall curve shows the relationship between precision and recall at every possible threshold, while the ROC curve shows the relationship between recall (true positive rate) and FPR (false positive rate) at every possible threshold.

Graphically, the closer the precision-recall curve is to the upper right corner, the better the model. By contrast, the closer the ROC curve is to the upper left corner, the better the model.
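If you want to plot both curves for your own classifier, a minimal sketch with scikit-learn and matplotlib looks like this. The y_true and y_prob arrays stand for your test labels and predicted probabilities; the toy values here are only placeholders:

```python
# Placeholder data; in practice y_true/y_prob come from your model on a test set.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, roc_curve, auc

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.30, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90])

precision, recall, _ = precision_recall_curve(y_true, y_prob)
fpr, tpr, _ = roc_curve(y_true, y_prob)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(recall, precision)
ax1.set(xlabel="recall", ylabel="precision",
        title=f"PR curve (AUC = {auc(recall, precision):.3f})")
ax2.plot(fpr, tpr)
ax2.set(xlabel="false positive rate (FPR)", ylabel="recall (TPR)",
        title=f"ROC curve (AUC = {auc(fpr, tpr):.3f})")
plt.tight_layout()
plt.show()
```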

If you have a severely imbalanced dataset with only a few positive examples, the ROC curve can be misleading: it tends to overestimate the model's predictive power.

For example, I built a logistic regression model (which I will proudly call OnePunch) to classify fraudulent transactions, and the dataset was severely skewed, with only 1% of the observations in the "fraud" (positive) class. When I used both the ROC curve and the precision-recall curve to evaluate OnePunch, they gave me totally different pictures of model performance. I've displayed these two curves below, along with the area under each curve.

[Figure: ROC curve & PR curve for the same model]

It's very clear that the ROC curve gives OnePunch a very high rating: the curve hugs the upper left corner and its AUC is 0.879, a pretty high score. The PR curve, however, does not look good, and its AUC is only 0.609, so the PR curve is telling me that OnePunch is far less impressive than the ROC curve suggests.
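The original fraud data obviously isn't shared here, but you can reproduce the same gap on a synthetic dataset with roughly 1% positives (a sketch only; the exact AUC numbers will differ from the ones above):

```python
# Synthetic stand-in for the fraud data: ~1% positive class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.99, 0.01], flip_y=0.02, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

print("ROC AUC:", roc_auc_score(y_te, prob))                      # typically looks high
print("PR  AUC (average precision):", average_precision_score(y_te, prob))  # usually much lower
```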

Why is that?

It is because precision and recall both focus on the positive (minority) class and never use the huge number of true negatives, whereas the ROC curve's false positive rate, FP/(FP + TN), is kept artificially small by that large TN count. So the precision-recall curve is not inflated by class imbalance, and it more truthfully reflects the model's ability to find the rare positive class.

In short, use the precision-recall curve rather than the ROC curve to evaluate your model when you have a severely imbalanced dataset. When the dataset is roughly balanced, the ROC curve is usually enough.
