A Closer Look at AUROC and AUPRC under Class Imbalance
Matthew B. A. McDermott (Harvard Medical School, Department of Biomedical Informatics), Lasse Hyldig Hansen (Cognitive Science, Aarhus University, Denmark), Haoran Zhang (Massachusetts Institute of Technology), Giovanni Angelotti (IRCCS Humanitas Research Hospital, Artificial Intelligence Center, Milan, Italy), Jack Gallifant (Massachusetts Institute of Technology)
This paper critically examines the widely held belief in machine learning (ML) that the area under the precision-recall curve (AUPRC) is superior to the area under the receiver operating characteristic (AUROC) for binary classification tasks in class-imbalanced scenarios. Through novel mathematical analysis, it demonstrates that AUPRC is not inherently superior and may even be detrimental due to its tendency to overemphasize improvements in subpopulations with more frequent positive labels, potentially exacerbating algorithmic biases.
Using Atomic Mistakes
Atomic mistakes occur when neighboring samples, when ordered by model score, are out of order with respect to the classification label. AUROC improves by a constant amount no matter which atomic mistake is corrected; AUPRC improves by more for mistakes in higher-scoring regions, owing to its dependence on the model's firing rate (Theorem 1). A small numerical sketch of this contrast follows the figure below.
Different types of mistakes a model can learn to fix. y = 0 is the negative class and y = 1 is the positive class; a = 0 is subgroup 1 and a = 1 is subgroup 2.
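To make Theorem 1 concrete, here is a minimal sketch (ours, not the paper's code; the label sequence is an arbitrary illustrative example) using scikit-learn's roc_auc_score and average_precision_score. Every fixed atomic mistake raises AUROC by exactly 1/(n_pos * n_neg), while the AUPRC gain depends on where in the ranking the fix occurs.

```python
# Minimal sketch of the Theorem 1 contrast. Labels are listed in ascending
# score order, so a 1 immediately followed by a 0 is an atomic mistake;
# "fixing" it swaps the two samples' ranks.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

labels = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1])  # arbitrary example
scores = np.linspace(0.0, 1.0, len(labels))               # distinct, ascending

def metrics(y):
    return roc_auc_score(y, scores), average_precision_score(y, scores)

base_auroc, base_auprc = metrics(labels)
for i in range(len(labels) - 1):
    if labels[i] == 1 and labels[i + 1] == 0:             # atomic mistake
        fixed = labels.copy()
        fixed[i], fixed[i + 1] = 0, 1                     # fix it
        auroc, auprc = metrics(fixed)
        # dAUROC is 1/(n_pos * n_neg) = 1/36 for every fix; dAUPRC varies
        # with the rank at which the fix occurs.
        print(f"fix at rank {i:2d}: dAUROC={auroc - base_auroc:+.4f}, "
              f"dAUPRC={auprc - base_auprc:+.4f}")
```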
Which mistake you should prioritize fixing first depends on how the model is used. In a classification setting, where you do not know whether the sample of interest falls in a high-scoring or low-scoring region, you want a metric that values improvements uniformly across the score range, like AUROC. In a single-stream retrieval setting, where you select the top-k samples regardless of group membership and act on those, a metric that favors fixes in high-scoring regions, like AUPRC, is most aligned with impact; a small illustration follows. But if you care about retrieving the top-k samples from multiple distinct subpopulations within your dataset, AUPRC is dangerous, as it will favor the high-prevalence subpopulation.
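Here is a hypothetical single-stream example (the label sequence and cutoff k are invented for illustration): only fixes that straddle the top-k cutoff change precision@k, which is why a metric that concentrates value in high-scoring regions aligns with this setting.

```python
# Hypothetical single-stream retrieval illustration; labels and k are invented.
import numpy as np

labels = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1])  # ascending score order
k = 4

def precision_at_k(y, k):
    return y[-k:].mean()                      # top-k = k highest-scored samples

def fix(y, i):
    out = y.copy()
    out[i], out[i + 1] = out[i + 1], out[i]   # swap adjacent ranks
    return out

print("baseline P@4:               ", precision_at_k(labels, k))          # 0.75
print("fix low-score mistake (1):  ", precision_at_k(fix(labels, 1), k))  # 0.75
print("fix near-cutoff mistake (7):", precision_at_k(fix(labels, 7), k))  # 1.0
```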
Optimizing AUPRC Introduces Disparities
Comparison of the impact of optimizing overall AUROC versus overall AUPRC on the per-group AUROC and AUPRC of two groups in a synthetic setting, using the sequential atomic-mistake-fixing optimization procedure. Left: fixing atomic mistakes to optimize overall AUROC. Right: fixing atomic mistakes to optimize overall AUPRC.
These figures demonstrate the impact of the optimization metric on subpopulation disparity. In particular, on the right we observe a notable disparity introduced when optimizing under the AUPRC metric: performance across the high- and low-prevalence subpopulations diverges significantly as the optimization process favors the group with higher prevalence. In comparison, when optimizing for overall AUROC (left), the AUROC and AUPRC of both groups increase together. A minimal re-creation of this procedure is sketched below.
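The following sketch (ours, not the authors' code) re-creates the sequential fixing procedure under assumptions of our own choosing: the group assignments, the prevalences of 0.4 and 0.05, and greedy selection of the single best fix per step are all illustrative, not the paper's exact setup. At each step it fixes whichever atomic mistake most improves the chosen overall metric, then reports per-group AUROC and AUPRC.

```python
# Sequential atomic-mistake fixing on a two-group synthetic population.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 200
groups = rng.integers(0, 2, n)                  # a=0 low-, a=1 high-prevalence
labels = (rng.random(n) < np.where(groups == 1, 0.4, 0.05)).astype(int)
scores = np.linspace(0.0, 1.0, n)               # ascending; index = rank

def greedy_fix(labels, groups, target_metric, steps=60):
    y, g = labels.copy(), groups.copy()
    for _ in range(steps):
        base, best_gain, best_i = target_metric(y, scores), 0.0, None
        for i in range(n - 1):
            if y[i] == 1 and y[i + 1] == 0:     # atomic mistake at rank i
                y[[i, i + 1]] = y[[i + 1, i]]   # trial fix
                gain = target_metric(y, scores) - base
                y[[i, i + 1]] = y[[i + 1, i]]   # undo
                if gain > best_gain:
                    best_gain, best_i = gain, i
        if best_i is None:                      # no mistakes left to fix
            break
        # Commit: the two samples trade ranks, so labels AND groups both move.
        y[[best_i, best_i + 1]] = y[[best_i + 1, best_i]]
        g[[best_i, best_i + 1]] = g[[best_i + 1, best_i]]
    return y, g

for name, metric in [("AUROC", roc_auc_score), ("AUPRC", average_precision_score)]:
    y, g = greedy_fix(labels, groups, metric)
    for grp in (0, 1):
        m = g == grp
        print(f"optimize overall {name} -> group a={grp}: "
              f"AUROC={roc_auc_score(y[m], scores[m]):.3f}, "
              f"AUPRC={average_precision_score(y[m], scores[m]):.3f}")
```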
Our work builds upon insights from other work that has examined the robustness of models and metrics across subpopulations, including fine-grained analyses of the varied mechanisms that cause subpopulation shifts and of how algorithms generalize across such diverse shifts at scale.
This work is not yet peer-reviewed. The preprint can be cited as follows:
Matthew B. A. McDermott, Lasse Hyldig Hansen, Haoran Zhang, Giovanni Angelotti, and Jack Gallifant. "A Closer Look at AUROC and AUPRC under Class Imbalance." arXiv preprint arXiv:2401.06091 (2024).
@misc{mcdermott2024closer,
  title={A Closer Look at AUROC and AUPRC under Class Imbalance},
  author={Matthew B. A. McDermott and Lasse Hyldig Hansen and Haoran Zhang and Giovanni Angelotti and Jack Gallifant},
  year={2024},
  eprint={2401.06091},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}