The Lancet

Volume 401, Issue 10372, 21–27 January 2023, Pages 215-225

Articles
Machine learning-based marker for coronary artery disease: derivation and validation in two longitudinal cohorts

https://doi.org/10.1016/S0140-6736(22)02079-7

Summary

Background

Binary diagnosis of coronary artery disease does not preserve the complexity of disease or quantify its severity or the associated risk of death; hence, a quantitative marker of coronary artery disease is warranted. We evaluated a quantitative marker of coronary artery disease derived from the probabilities of a machine learning model.

Methods

In this cohort study, we developed and validated a coronary artery disease-predictive machine learning model using 95 935 electronic health records and assessed its probabilities as in-silico scores for coronary artery disease (ISCAD; range 0 [lowest probability] to 1 [highest probability]) in participants in two longitudinal biobank cohorts. We measured the association of ISCAD with clinical outcomes—namely, coronary artery stenosis, obstructive coronary artery disease, multivessel coronary artery disease, all-cause death, and coronary artery disease sequelae.
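As an illustration of how model probabilities could be grouped into the quartiles and deciles used in the analyses, the following is a minimal sketch. The function name `quantile_bins`, the example scores, and the rank-based binning rule are assumptions made for illustration; the text here does not describe how quantile boundaries or ties were handled.

```python
from typing import List


def quantile_bins(scores: List[float], n_bins: int) -> List[int]:
    """Assign each score to a rank-based bin numbered 1..n_bins (1 = lowest scores).

    Assumption: bins are formed by rank so that they are as equal in size as
    possible; tied scores are split by their original order.
    """
    if not scores:
        return []
    # Indices of the scores sorted from lowest to highest (stable for ties).
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(scores) + 1
    return bins


# Hypothetical probabilities in [0, 1], not data from the study.
example_scores = [0.02, 0.10, 0.35, 0.50, 0.61, 0.77, 0.93, 0.98]
quartiles = quantile_bins(example_scores, 4)
print(quartiles)
```

Calling the same function with `n_bins=10` would give decile membership, which could then be used as the grouping variable when outcome prevalence is compared across groups.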

Findings

Among 95 935 participants, 35 749 were from the BioMe Biobank (median age 61 years [IQR 18]; 14 599 [41%] were male and 21 150 [59%] were female; 5130 [14%] had diagnosed coronary artery disease) and 60 186 were from the UK Biobank (median age 62 [15] years; 25 031 [42%] male and 35 155 [58%] female; 8128 [14%] with diagnosed coronary artery disease). The model predicted coronary artery disease with an area under the receiver operating characteristic curve of 0·95 (95% CI 0·94–0·95; sensitivity of 0·94 [0·94–0·95] and specificity of 0·82 [0·81–0·83]) and 0·93 (0·92–0·93; sensitivity of 0·90 [0·89–0·90] and specificity of 0·88 [0·87–0·88]) in the BioMe validation and holdout sets, respectively, and 0·91 (0·91–0·91; sensitivity of 0·84 [0·83–0·84] and specificity of 0·83 [0·82–0·83]) in the UK Biobank external test set. ISCAD captured coronary artery disease risk from known risk factors, pooled cohort equations, and polygenic risk scores. Coronary artery stenosis increased quantitatively with ascending ISCAD quartiles (increase per quartile of 12 percentage points), including risk of obstructive coronary artery disease, multivessel coronary artery disease, and stenosis of major coronary arteries. Hazard ratios (HRs) and prevalence of all-cause death increased stepwise over ISCAD deciles (decile 1: HR 1·0 [95% CI 1·0–1·0], 0·2% prevalence; decile 6: 11 [3·9–31], 3·1% prevalence; and decile 10: 56 [20–158], 11% prevalence). A similar trend was observed for recurrent myocardial infarction. 12 (46%) undiagnosed individuals with high ISCAD (≥0·9) had clinical evidence of coronary artery disease according to the 2014 American College of Cardiology/American Heart Association Task Force guidelines.
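For readers less familiar with the discrimination metrics reported above, the sketch below shows one way the area under the receiver operating characteristic curve, sensitivity, and specificity can be computed from labels and predicted probabilities. The toy data, the function names, and the 0.5 classification threshold are assumptions for illustration only; the text here does not state the threshold or software the authors used.

```python
from typing import List, Tuple


def sensitivity_specificity(
    labels: List[int], scores: List[float], threshold: float
) -> Tuple[float, float]:
    """Return (sensitivity, specificity), calling score >= threshold positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return tp / (tp + fn), tn / (tn + fp)


def auroc(labels: List[int], scores: List[float]) -> float:
    """AUROC as the probability that a random case outscores a random control.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0 for c in cases for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: 1 = diagnosed case, 0 = control.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]
toy_auc = auroc(toy_labels, toy_scores)
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores, threshold=0.5)
print(round(toy_auc, 3), round(toy_sens, 3), round(toy_spec, 3))
```

The pairwise comparison is quadratic in cohort size and is written for clarity; at the scale of tens of thousands of records a rank-based implementation would be the more practical choice.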

Interpretation

Electronic health record-based machine learning was used to generate an in-silico marker for coronary artery disease that can non-invasively quantify atherosclerosis and risk of death on a continuous spectrum, and identify underdiagnosed individuals.

Funding

National Institutes of Health.

Introduction

Detection of coronary artery disease enables initiation of preventive measures, including lifestyle modifications and lipid-lowering therapies, to prevent cardiovascular disease.1, 2, 3 However, coronary artery disease is a complex disease with many contributing factors and varied clinical manifestations.4, 5 Quantitative differences in the amount of coronary stenosis and plaque composition result in gradations of risk for myocardial infarction and death.6, 7 This phenotypic spectrum of coronary artery disease is missed with the binary classification of coronary artery disease as case versus control. Misclassification of coronary artery disease is also possible, whereby individuals without a diagnosis of coronary artery disease have evidence of disease.8, 9, 10 Missed diagnosis of coronary artery disease might lead to myocardial infarction, stroke, and death.11, 12, 13, 14, 15

Risk factors, including hypertension, diabetes, smoking, and dyslipidaemia, can inform the screening and diagnosis of coronary artery disease.3, 16 These variables are included in risk assessment tools that predict coronary artery disease events, such as the Framingham Risk Score,17 SCORE2,18 and pooled cohort equations (PCEs).19 However, these tools use a small number of predictors and discard large amounts of data contained in electronic health records (EHRs); for example, most vital signs, laboratory tests, medications, symptoms, and other clinical features are not used. Millions of these heterogeneous clinical data points are accrued by patients longitudinally through EHR-based health systems but are difficult to analyse or interpret without the use of machine learning.20, 21, 22, 23, 24

Research in context

Evidence before this study

On July 2, 2022, we searched PubMed without language or date restrictions for studies reporting on the development and validation of machine learning-based models for coronary artery disease, including atherosclerosis, death, and myocardial infarction. The following terms and related terms were used when searching: (“machine learning”, “artificial intelligence”, or “random forest”) and (“coronary artery disease”, “atherosclerosis”, “plaque”, or “myocardial infarction”). We identified several machine learning models in the past decade that predict coronary artery disease. However, these studies used machine learning models as a classification tool to simply predict the case-control status of coronary artery disease (binary framework of disease) and none used models to capture coronary artery disease on a spectrum of disease probabilities (quantitative framework of disease). Many of the studies were based on a limited set of features or predetermined risk factors. Hence, assessments of the clinical utility of coronary artery disease-predictive machine learning models are scarce. Therefore, we investigated probabilities generated by a machine learning model as an in-silico marker for coronary artery disease. Its clinical utility to quantify atherosclerotic plaque burden, survival, and risk of myocardial infarction on a continuum was assessed in a longitudinal multi-ethnic cohort, and underdiagnosed individuals with coronary artery disease were identified as an example of its intervenability. Our multimodal model analyses millions of diverse clinical datapoints of diagnoses, laboratory test results, medications, and vitals contained in the electronic health records (EHRs) of participants.

Added value of this study

To our knowledge, this study is the first that constructs a quantitative marker for coronary artery disease risk, severity, and prognosis from a machine learning model trained on clinical data from EHRs. Individuals with common diseases occupy a spectrum of disease that represents an individual's combination of risk factors and pathogenic processes; quantitative differences in coronary stenosis, for example, result in gradations of risk of death. Quantification of where an individual falls on the disease spectrum is needed for clinical screening and management. We developed and externally tested a coronary artery disease-predictive machine learning model using 95 935 EHRs in the multi-ethnic BioMe Biobank and UK Biobank, and from it generated an in-silico score for coronary artery disease (ISCAD). We found that coronary stenosis from angiography data increased quantitatively with ascending ISCAD, including risk of obstructive coronary artery disease, multivessel coronary artery disease, and stenosis of each major coronary artery, such as the left main and proximal left anterior descending arteries. All-cause death increased stepwise over ascending ISCAD and sequelae, such as recurrent myocardial infarction, rose in gradations with ISCAD. ISCAD showed greater associations with these coronary artery disease outcomes than did conventional risk scores of pooled cohort equations and polygenic risk scores. We identified participants with high ISCAD who had no coronary artery disease diagnosis and found that almost 50% of them had clinical evidence of underdiagnosed coronary artery disease on manual chart review.

Implications of all the available evidence

Our study shows a reconceptualisation of coronary artery disease—including atherosclerosis, death, and sequelae—as a spectrum of disease that is quantifiable with artificial intelligence trained on clinical data. This in-silico marker derived from machine learning captured coronary artery disease pathophysiology and clinical outcomes on a continuum. The model is holistic in drawing on a wide array of clinical information from population-based biobanks, inclusive in representing diverse populations, and faithful in preserving the complexity of disease. The implementation of machine learning-based quantitative markers for coronary artery disease might help to define the disease state and clinical outcomes in patients, while optimising the detection of disease and reducing underdiagnosis.

Machine learning models have been developed to accurately predict 5-year or 10-year risk of coronary artery disease on the basis of EHR data.25, 26 We recently developed an EHR-based model that outperforms PCEs and conventional risk factors in predicting 1-year coronary artery disease status.27 However, these models are primarily tested as a classification tool to predict case-control status of disease (binary framework) and do not attempt to measure disease on a continuous scale (quantitative framework). Individuals occupy a spectrum of coronary artery disease, rather than rigid categories of case versus control, and evaluation of coronary artery disease in a quantitative manner might better represent this spectrum and improve personalised care.28, 29, 30 In this cohort study, we examined whether a quantitative in-silico score for coronary artery disease (ISCAD) derived from a machine learning model has clinical utility as a marker in the detection, risk stratification, and prognosis of coronary artery disease. Conventionally, markers are molecules or anthropometrics measured in the body as an in-vivo indicator of disease.31 We sought to examine ISCAD, an amalgam of clinical data points in EHRs, as an in-silico marker for coronary artery disease. We evaluated the association of ISCAD with clinical outcomes of coronary artery disease—namely, atherosclerotic plaque burden, all-cause death, and coronary artery disease sequelae (including recurrent myocardial infarction)—and identified underdiagnosed individuals who had high ISCAD and EHR evidence of disease but did not have a corresponding diagnosis.
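The idea of identifying potentially underdiagnosed individuals (a high score without a corresponding diagnosis) can be sketched as a simple filter. The `Participant` record, its field names, and the example cohort are hypothetical; the 0.9 cutoff mirrors the high-ISCAD threshold (≥0·9) reported in the Summary, and any flagged record would still require manual chart review, as in the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Participant:
    participant_id: str      # hypothetical identifier
    iscad: float             # model probability in [0, 1]
    has_cad_diagnosis: bool  # whether a coronary artery disease diagnosis is recorded


def flag_for_chart_review(
    participants: List[Participant], threshold: float = 0.9
) -> List[Participant]:
    """Return participants with a high score but no recorded diagnosis."""
    return [
        p for p in participants
        if p.iscad >= threshold and not p.has_cad_diagnosis
    ]


# Hypothetical cohort, not data from the study.
cohort = [
    Participant("A", 0.95, False),
    Participant("B", 0.97, True),
    Participant("C", 0.42, False),
    Participant("D", 0.90, False),
]
flagged = flag_for_chart_review(cohort)
print([p.participant_id for p in flagged])
```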

Study design

In this cohort study, we trained, validated, and externally tested a coronary artery disease-predictive machine learning model using clinical features extracted from EHRs in two large biobanks. This model was adapted from a previous model27 for the short-term risk prediction of coronary artery disease in a binary framework based on EHR data. In the present study, probability scores from the model were instead evaluated as a quantitative coronary artery disease marker.

We trained and validated

Results

The study population included 95 935 participants from the two biobanks with EHR data to train, validate, and externally test the machine learning model (figure 1). EHRs used for training, validation, and holdout evaluation were from 35 749 participants in the BioMe Biobank (median age 61 years [IQR 18]; 14 599 [41%] were male and 21 150 [59%] were female; 5130 [14%] had diagnosed coronary artery disease). The model was trained and validated on EHR data for 20 497 participants from the

Discussion

In this cohort study, we sought to evaluate the performance of a novel in-silico quantitative marker for coronary artery disease, generated from a machine learning model trained on EHR data in two large biobanks, to capture coronary artery disease risk, atherosclerosis, and all-cause death in a diverse population. The primary finding was that an artificial intelligence-derived marker could capture the clinical risk of PCEs and genetic risk of polygenic risk scores (PRSs) for coronary artery disease, and non-invasively

Data sharing

The dataset from UK Biobank analysed in the study is available via application to the Access Management System at https://bbams.ndph.ox.ac.uk/ams/. Further information regarding the BioMe Biobank and its dataset is available at https://icahn.mssm.edu/research/ipm/programs/biome-biobank. Code used for the analyses is available at https://data.mendeley.com/datasets/ncmpmvv5tm/draft?a=a7dbf30b-17c9-401b-9cbd-4be2dc067dca.

Declaration of interests

RD reported receiving grants from AstraZeneca; grants and non-financial support from Goldfinch Bio; being a scientific co-founder, consultant, and equity holder for Pensieve Health; and being a consultant for Variant Bio, outside of the submitted work. GNN reported being a scientific co-founder, consultant, advisory board member, and equity owner of Renalytix AI; a scientific co-founder and equity holder for Pensieve Health; a consultant for Variant Bio; and received grants from Goldfinch Bio

References (51)

  • GD Kitsios et al. Heterogeneity of the phenotypic definition of coronary artery disease and its impact on genetic association studies. Circ Cardiovasc Genet (2011)
  • KAA Fox et al. The myth of ‘stable’ coronary artery disease. Nat Rev Cardiol (2020)
  • TM Maddox et al. Nonobstructive coronary artery disease and risk of myocardial infarction. JAMA (2014)
  • D-W Park et al. Extent, location, and clinical significance of non-infarct-related coronary artery disease among patients with ST-elevation myocardial infarction. JAMA (2014)
  • TD Sequist et al. Missed opportunities in the primary care management of early acute ischemic heart disease. Arch Intern Med (2006)
  • M Turkay et al. Missed opportunities for coronary heart disease diagnoses: primary care experience. Croat Med J (2007)
  • C Araújo et al. Missed opportunities in symptomatic patients before a first acute coronary syndrome: the EPIHeart cohort study. Cardiology (2018)
  • F Sanchis-Gomar et al. Epidemiology of coronary heart disease and acute coronary syndrome. Ann Transl Med (2016)
  • C Özcan et al. Coronary artery disease severity and long-term cardiovascular risk in patients with myocardial infarction: a Danish nationwide register-based cohort study. Eur Heart J Cardiovasc Pharmacother (2018)
  • T Jernberg et al. Cardiovascular risk in post-myocardial infarction patients: nationwide real world data demonstrate the importance of a long-term perspective. Eur Heart J (2015)
  • M Zeitouni et al. Risk factor burden and long-term prognosis of patients with premature coronary artery disease. J Am Heart Assoc (2020)
  • RJ Myerburg et al. Sudden cardiac death caused by coronary heart disease. Circulation (2012)
  • PWF Wilson et al. Prediction of coronary heart disease using risk factor categories. Circulation (1998)
  • S Hageman et al. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. Eur Heart J (2021)
  • A Rajkomar et al. Machine learning in medicine. N Engl J Med (2019)