
Why Do We Need the Precision-Recall Curve?

 


What are precision and recall, and why are they used?

For binary classification problems, the task is usually to distinguish the positive class from the negative class. Whenever a binary classifier makes predictions, there will be true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). A classifier that produces more true positives and true negatives (fewer false positives and false negatives, in other words) has better predictive power and incurs less business cost.

As a data scientist, I usually start by defining the cost of a false negative and the cost of a false positive from a business standpoint, so that I can optimize the threshold accordingly and balance the trade-off.
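To make that idea concrete, here is a minimal sketch (the cost values, the helper name pick_min_cost_threshold, and the grid of candidate thresholds are all hypothetical, not part of the original post): it scans thresholds and keeps the one with the lowest total cost of false positives and false negatives.

```python
import numpy as np

def pick_min_cost_threshold(y_true, y_score, cost_fp, cost_fn):
    """Scan candidate thresholds and return the one with the lowest total cost."""
    best_threshold, best_cost = None, np.inf
    for threshold in np.linspace(0.0, 1.0, 101):
        y_pred = (y_score >= threshold).astype(int)
        fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives at this threshold
        fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives at this threshold
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Example usage: false negatives are assumed to be 10x as costly as false positives.
# pick_min_cost_threshold(y_true, y_score, cost_fp=1.0, cost_fn=10.0)
```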

In a nutshell, both precision and recall are metrics for evaluating predictive performance. Below are the detailed definitions of precision and recall.

  • precision = TP/(TP + FP): it tells you how sure you can be of your positive predictions. For example, a precision score of 0.8 means that among all the data points you predicted as positive, 80% are true positives.
    • You care more about precision when false positives are intolerable. For example, emails whose spam scores exceed the threshold are treated as spam; mistakenly labeling a non-spam message as spam (a false positive) is harmful to users because they may miss important messages. (If we want to increase precision and reduce false positives, we can raise the threshold.)
  • recall = TP/(TP + FN): it tells you what proportion of the actual positives are identified. For example, a recall score of 0.8 means that among all the positive data points, you correctly labeled 80% of them as positive.
    • You care more about recall when false negatives are intolerable. For example, patients whose test scores exceed the threshold are treated as having cancer; mistakenly labeling a patient with cancer as healthy (a false negative) is harmful because they cannot receive immediate treatment and may die of the disease. (If we want to increase recall and reduce false negatives, we can lower the threshold.) A small sketch after this list shows how moving the threshold trades precision against recall.
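Here is a small, self-contained sketch (the labels and scores are made-up toy values) showing that as the threshold rises, precision tends to go up while recall goes down:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.55, 0.9, 0.45, 0.3])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)  # label as positive if score >= threshold
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_true, y_pred):.2f}  "
          f"recall={recall_score(y_true, y_pred):.2f}")
```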

How’s the precision-recall curve different from the ROC curve, and when would you use one over the other?

A precision-recall curve shows the relationship between precision and recall at every possible threshold, while a ROC curve shows the relationship between recall (true positive rate) and FPR (false positive rate) at every possible threshold.

Graphically, the closer the precision-recall curve is to the upper right corner, the better the prediction model. Conversely, the closer the ROC curve is to the upper left corner, the better the prediction model.
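As a quick illustration of how the two curves are computed, here is a minimal sketch using scikit-learn's precision_recall_curve and roc_curve (the labels and scores are toy values; in practice they would come from your classifier):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve, auc

# Hypothetical labels and predicted probabilities for the positive class
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.55, 0.9, 0.45, 0.3])

# Precision-recall curve: precision vs. recall over all thresholds
precision, recall, _ = precision_recall_curve(y_true, y_score)

# ROC curve: recall (TPR) vs. false positive rate over all thresholds
fpr, tpr, _ = roc_curve(y_true, y_score)

pr_auc  = auc(recall, precision)  # area under the PR curve
roc_auc = auc(fpr, tpr)           # area under the ROC curve
print(f"PR AUC = {pr_auc:.3f}, ROC AUC = {roc_auc:.3f}")
```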

If you have a severely imbalanced dataset with only a few positive examples, the ROC curve can be misleading: it tends to overestimate the model's predictive power.

For example, I built a logistic regression model (I will proudly call this model OnePunch) for binary classification of fraudulent transactions, and the dataset was severely skewed, with only 1% of the examples in the "fraud" (positive) class. When I used both the ROC curve and the precision-recall curve to evaluate OnePunch, they gave me very different pictures of model performance. I've displayed the two curves below, along with the area under each curve.



[Figure: ROC curve and PR curve for the same model]

It's very clear that the ROC curve gives OnePunch a very high rating: the curve is close to the upper left corner and the AUC is 0.879, which is a pretty high score. However, the PR curve does not look good, and its AUC is only 0.609, so the PR curve is essentially saying that OnePunch is only slightly better than random guessing.

Why is that?

It is because precision and recall both focus on the positive (minority) class and never use the true negatives, so the precision-recall curve is not inflated by the huge number of easy-to-classify negatives in an imbalanced dataset; it more faithfully reflects the predictive ability of the model.
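If you want to reproduce this gap yourself, here is a minimal sketch of a similar experiment, using a synthetic, roughly 1%-positive dataset from scikit-learn's make_classification as a stand-in for the fraud data (the exact AUC values will differ from the ones reported above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic stand-in for the fraud data: ~1% positive class
X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("ROC AUC:", roc_auc_score(y_test, y_score))
print("PR  AUC (average precision):", average_precision_score(y_test, y_score))
```

On data this skewed, the ROC AUC typically looks much more flattering than the PR AUC (average precision), which is exactly the pattern in the figure above.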

In short, use the precision-recall curve rather than the ROC curve to evaluate a predictive model when you have a severely imbalanced dataset. When the dataset is roughly balanced, the ROC curve is usually enough.
