Adult: Mechanical Circulatory Support
Limitations of receiver operating characteristic curve on imbalanced data: Assist device mortality risk scores

https://doi.org/10.1016/j.jtcvs.2021.07.041

Abstract

Objective

In the left ventricular assist device domain, the receiver operating characteristic curve is a commonly applied metric of classifier performance. However, the receiver operating characteristic can provide a distorted view of a classifier's ability to predict short-term mortality because of the overwhelmingly greater proportion of patients who survive, that is, imbalanced data. This study illustrates the ambiguity of the receiver operating characteristic in evaluating 2 classifiers of 90-day left ventricular assist device mortality and introduces the precision recall curve as a supplemental metric that is more representative of left ventricular assist device classifiers' performance on the minority class.

Methods

This study compared the receiver operating characteristic and precision recall curves of 2 classifiers of 90-day left ventricular assist device mortality, the HeartMate Risk Score and a Random Forest, in a test group of 800 patients recorded in the Interagency Registry for Mechanically Assisted Circulatory Support who received a continuous-flow left ventricular assist device between 2006 and 2016 (mean age, 59 years; 146 female vs 654 male patients). In this group, the 90-day mortality rate is only 8%.

Results

The receiver operating characteristic indicates similar performance of the Random Forest and HeartMate Risk Score classifiers, with areas under the curve of 0.77 and 0.63 for Random Forest and HeartMate Risk Score, respectively. This is in contrast to their precision recall curves, with areas under the curve of 0.43 versus 0.16 for Random Forest and HeartMate Risk Score, respectively. The precision recall curve for the HeartMate Risk Score showed that precision rapidly decreased to only 10% with slightly increasing sensitivity.

Conclusions

The receiver operating characteristic can portray an overly optimistic performance of a classifier or risk score when applied to imbalanced data. The precision recall curve provides better insight about the performance of a classifier by focusing on the minority class.
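The abstract's central claim can be reproduced in miniature. The sketch below uses hypothetical, randomly generated scores (not the study's data) for a test set with the same 8% positive rate: the ROC AUC, which equals the probability that a randomly chosen death outscores a randomly chosen survivor, can look respectable while precision at a working cutoff is dragged down by the large pool of survivors.

```python
import random

random.seed(0)

# Hypothetical imbalanced test set: 8% positives (DEAD), 92% negatives (SURV),
# mirroring the study's 8% 90-day mortality rate. Scores are illustrative only.
n_pos, n_neg = 64, 736
pos_scores = [random.gauss(0.6, 0.2) for _ in range(n_pos)]  # deaths score higher on average
neg_scores = [random.gauss(0.4, 0.2) for _ in range(n_neg)]

# ROC AUC = probability a random positive outscores a random negative
# (ties counted as half a win).
wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
roc_auc = wins / (n_pos * n_neg)

# Precision at one cutoff: even with a decent ROC AUC, the 736 negatives
# generate enough false positives to keep precision low.
cutoff = 0.6
tp = sum(s >= cutoff for s in pos_scores)  # true DEAD
fp = sum(s >= cutoff for s in neg_scores)  # False DEAD
precision = tp / (tp + fp)
recall = tp / n_pos

print(f"ROC AUC   ~ {roc_auc:.2f}")
print(f"precision ~ {precision:.2f} at recall {recall:.2f}")
```

With these assumptions the ROC AUC lands around 0.75 while precision stays near 0.2, the same qualitative gap the study reports between the ROC and precision recall views.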

Section snippets

Comparison of Two Classifiers for 90-Day Mortality

This study compares the performance of 2 classifiers for predicting 90-day mortality after LVAD implantation: the well-known HeartMate Risk Score (HMRS) and a Random Forest (RF) that was derived de novo from large multicenter registry data. The HMRS, a logistic regression-based score, was derived from and validated within 1122 patients with 13% 90-day mortality who received a HeartMate II as a bridge to transplant or destination therapy and computes the 90-day risk scores for mortality based

Limitations of the Receiver Operating Characteristic Curve Due to the Imbalanced Left Ventricular Assist Device Mortality Rate

Figure 5, A, shows the ROC curves for the 2 classifiers, HMRS and RF, for prediction of 90-day mortality after LVAD implantation. The color of the curves corresponds to the values of cutoff thresholds for each classifier, shown in their corresponding legends (from 0.01 to 0.52 for RF vs −1.8 to 6.50 for HMRS). The dominant color in the ROC curve for HMRS is green corresponding to the compact (tall and narrow) distribution of scores around the mean of 1.71 as shown in Figure 2, B. Therefore, a
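As described above, each point on an ROC curve corresponds to one cutoff threshold: sweeping the cutoff traces out (1 − specificity, sensitivity) pairs, which is what the colored thresholds in Figure 5, A, encode. A minimal sketch with hypothetical scores and labels (not the study's data):

```python
# Hypothetical classifier scores and true labels (1 = DEAD, 0 = SURV).
scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.50, 0.60, 0.80]
labels = [0,    0,    0,    0,    1,    0,    1,    1]

def roc_point(cutoff):
    """False-positive rate and sensitivity when predicting DEAD for score >= cutoff."""
    tp = sum(s >= cutoff and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= cutoff and y == 0 for s, y in zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos  # (1 - specificity, sensitivity)

# Sweep the cutoffs from lowest to highest, as the colored legend in
# Figure 5, A, does for each classifier's threshold range.
curve = [roc_point(c) for c in sorted(set(scores))]
```

At the lowest cutoff every patient is labeled DEAD, giving the (1, 1) corner; at the highest cutoff only the top-scoring patient is, moving the point toward the left edge. Note that nothing in this construction depends on how many survivors there are relative to deaths, which is precisely why the ROC is insensitive to imbalance.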

Discussion

The clinical utility of a risk score or classifier for mortality after LVAD implantation depends greatly on the degree of separability between the predicted probabilities of the 2 classes: DEAD versus SURV (Figure 2, A). Overlap between the distributions of the 2 classes creates an intermediate range of probabilities that is associated with both classes. This results in 2 types of errors: False DEAD (alive patients who are incorrectly labeled as DEAD) and False SURV (dead patients who are incorrectly labeled as SURV).
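In standard terminology, False DEAD are false positives and False SURV are false negatives, so precision and sensitivity (recall) follow directly from the error counts. A small illustrative calculation with hypothetical counts (not the study's data):

```python
# Hypothetical confusion counts at one cutoff (illustrative only).
true_dead = 40    # deaths correctly labeled DEAD (true positives)
false_dead = 120  # alive patients incorrectly labeled DEAD (false positives)
false_surv = 24   # dead patients incorrectly labeled SURV (false negatives)

# Precision: of all DEAD calls, how many were actually deaths.
precision = true_dead / (true_dead + false_dead)
# Recall (sensitivity): of all actual deaths, how many were caught.
recall = true_dead / (true_dead + false_surv)

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Because the False DEAD count is drawn from the large survivor majority, precision degrades quickly under imbalance even when recall is acceptable; this is the behavior the precision recall curve makes visible and the ROC hides.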

Conclusions

The ROC has become an entrenched evaluation tool for assessing the performance of classifiers and risk scores in the medical arena. However, when the data are highly imbalanced, the ROC can provide a misleadingly optimistic view of a classifier's performance. In such circumstances, it is imperative to use the PRC to evaluate prediction of the minority class. Figure 7 depicts a summary of the study, showing the effect of imbalanced data on the outcome of the RF classifier.


    Data for this study were provided by the International Registry for Mechanical Circulatory Support, funded by the National Heart, Lung, and Blood Institute, National Institutes of Health, and The Society of Thoracic Surgeons. This work was supported by the National Institutes of Health under Grant R01HL122639.

The requirement for informed written consent was waived for this study.
