A Closer Look at AUROC and AUPRC under Class Imbalance
Matthew B. A. McDermott (Harvard Medical School, Department of Biomedical Informatics), Lasse Hyldig Hansen (Cognitive Science, Aarhus University, Denmark), Haoran Zhang (Massachusetts Institute of Technology), Giovanni Angelotti (IRCCS Humanitas Research Hospital, Artificial Intelligence Center, Milan, Italy), Jack Gallifant (Massachusetts Institute of Technology)
This paper critically examines the widely held belief in machine learning (ML) that the area under the precision-recall curve (AUPRC) is superior to the area under the receiver operating characteristic (AUROC) for binary classification tasks in class-imbalanced scenarios. Through novel mathematical analysis, it demonstrates that AUPRC is not inherently superior and may even be detrimental due to its tendency to overemphasize improvements in subpopulations with more frequent positive labels, potentially exacerbating algorithmic biases.
Using Atomic Mistakes
Atomic mistakes occur when neighboring samples, when ordered by model score, are out of order with respect to the classification label. AUROC improves by a constant amount no matter which atomic mistake is corrected; AUPRC improves by more for mistakes in higher-scoring regions, owing to its dependence on the model's firing rate (Theorem 1). A small numerical sketch of this contrast follows the figure below.
Different types of mistakes a model can learn to fix. y = 0 is the negative class and y = 1 is the positive class; a = 0 is subgroup 1 and a = 1 is subgroup 2.
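To make Theorem 1 concrete, here is a minimal sketch (ours, not the paper's code; the label sequence is an arbitrary illustrative example) using scikit-learn's roc_auc_score and average_precision_score. Every fixed atomic mistake raises AUROC by exactly 1/(n_pos * n_neg), while the AUPRC gain depends on where in the ranking the fix occurs.

```python
# Minimal sketch of the Theorem 1 contrast. Labels are listed in ascending
# score order, so a 1 immediately followed by a 0 is an atomic mistake;
# "fixing" it swaps the two samples' ranks.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

labels = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1])  # arbitrary example
scores = np.linspace(0.0, 1.0, len(labels))               # distinct, ascending

def metrics(y):
    return roc_auc_score(y, scores), average_precision_score(y, scores)

base_auroc, base_auprc = metrics(labels)
for i in range(len(labels) - 1):
    if labels[i] == 1 and labels[i + 1] == 0:             # atomic mistake
        fixed = labels.copy()
        fixed[i], fixed[i + 1] = 0, 1                     # fix it
        auroc, auprc = metrics(fixed)
        # dAUROC is 1/(n_pos * n_neg) = 1/36 for every fix; dAUPRC varies
        # with the rank at which the fix occurs.
        print(f"fix at rank {i:2d}: dAUROC={auroc - base_auroc:+.4f}, "
              f"dAUPRC={auprc - base_auprc:+.4f}")
```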
Which mistake you should prioritize fixing first depends on how the model is used. In a classification setting, where you do not know whether the sample of interest falls in a high-scoring or low-scoring region, you want a metric that values improvements uniformly across the score range, like AUROC. In a single-stream retrieval setting, where you select the top-k samples regardless of group membership and act on those, a metric that favors fixes in high-scoring regions, like AUPRC, is most aligned with impact; a small illustration follows. But if you care about retrieving the top-k samples from multiple distinct subpopulations within your dataset, AUPRC is dangerous, as it will favor the high-prevalence subpopulation.
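Here is a hypothetical single-stream example (the label sequence and cutoff k are invented for illustration): only fixes that straddle the top-k cutoff change precision@k, which is why a metric that concentrates value in high-scoring regions aligns with this setting.

```python
# Hypothetical single-stream retrieval illustration; labels and k are invented.
import numpy as np

labels = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1])  # ascending score order
k = 4

def precision_at_k(y, k):
    return y[-k:].mean()                      # top-k = k highest-scored samples

def fix(y, i):
    out = y.copy()
    out[i], out[i + 1] = out[i + 1], out[i]   # swap adjacent ranks
    return out

print("baseline P@4:               ", precision_at_k(labels, k))          # 0.75
print("fix low-score mistake (1):  ", precision_at_k(fix(labels, 1), k))  # 0.75
print("fix near-cutoff mistake (7):", precision_at_k(fix(labels, 7), k))  # 1.0
```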
Optimizing AUPRC Introduces Disparities
Comparison of the impact of optimizing overall AUROC versus overall AUPRC on the per-group AUROC and AUPRC of two groups in a synthetic setting, using the sequential atomic-mistake-fixing optimization procedure. Left: fixing atomic mistakes to optimize overall AUROC. Right: fixing atomic mistakes to optimize overall AUPRC.
These figures demonstrate the impact of the optimization metric on subpopulation disparity. In particular, on the right we observe a notable disparity introduced when optimizing under the AUPRC metric: performance across the high- and low-prevalence subpopulations diverges significantly as the optimization process favors the group with higher prevalence. In comparison, when optimizing for overall AUROC (left), the AUROC and AUPRC of both groups increase together. A minimal re-creation of this procedure is sketched below.
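The following sketch (ours, not the authors' code) re-creates the sequential fixing procedure under assumptions of our own choosing: the group assignments, the prevalences of 0.4 and 0.05, and greedy selection of the single best fix per step are all illustrative, not the paper's exact setup. At each step it fixes whichever atomic mistake most improves the chosen overall metric, then reports per-group AUROC and AUPRC.

```python
# Sequential atomic-mistake fixing on a two-group synthetic population.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 200
groups = rng.integers(0, 2, n)                  # a=0 low-, a=1 high-prevalence
labels = (rng.random(n) < np.where(groups == 1, 0.4, 0.05)).astype(int)
scores = np.linspace(0.0, 1.0, n)               # ascending; index = rank

def greedy_fix(labels, groups, target_metric, steps=60):
    y, g = labels.copy(), groups.copy()
    for _ in range(steps):
        base, best_gain, best_i = target_metric(y, scores), 0.0, None
        for i in range(n - 1):
            if y[i] == 1 and y[i + 1] == 0:     # atomic mistake at rank i
                y[[i, i + 1]] = y[[i + 1, i]]   # trial fix
                gain = target_metric(y, scores) - base
                y[[i, i + 1]] = y[[i + 1, i]]   # undo
                if gain > best_gain:
                    best_gain, best_i = gain, i
        if best_i is None:                      # no mistakes left to fix
            break
        # Commit: the two samples trade ranks, so labels AND groups both move.
        y[[best_i, best_i + 1]] = y[[best_i + 1, best_i]]
        g[[best_i, best_i + 1]] = g[[best_i + 1, best_i]]
    return y, g

for name, metric in [("AUROC", roc_auc_score), ("AUPRC", average_precision_score)]:
    y, g = greedy_fix(labels, groups, metric)
    for grp in (0, 1):
        m = g == grp
        print(f"optimize overall {name} -> group a={grp}: "
              f"AUROC={roc_auc_score(y[m], scores[m]):.3f}, "
              f"AUPRC={average_precision_score(y[m], scores[m]):.3f}")
```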
Our work builds upon insights from other work that has examined the robustness of models and metrics across subpopulations, including fine-grained analyses of the varied mechanisms that cause subpopulation shifts and of how algorithms generalize across such diverse shifts at scale.
This work is not yet peer-reviewed. The preprint can be cited as follows:
Matthew B. A. McDermott, Lasse Hyldig Hansen, Haoran Zhang, Giovanni Angelotti, and Jack Gallifant. "A Closer Look at AUROC and AUPRC under Class Imbalance." arXiv preprint arXiv:2401.06091 (2024).
@misc{mcdermott2024closer,
  title={A Closer Look at AUROC and AUPRC under Class Imbalance},
  author={Matthew B. A. McDermott and Lasse Hyldig Hansen and Haoran Zhang and Giovanni Angelotti and Jack Gallifant},
  year={2024},
  eprint={2401.06091},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}