Estimation of low-density lipoprotein cholesterol levels using machine learning

https://doi.org/10.1016/j.ijcard.2022.01.029

Highlights

  • ML (machine learning) algorithms derived LDL-C values with higher correlation and better accuracy compared with conventional methods.

  • ML estimations remained accurate even at high TG levels, an area where the performance of conventional methods was inadequate.

  • We believe that ML algorithms could be incorporated into electronic health records to replace the Friedewald or Martin equations.

Abstract

Background

Low-density lipoprotein-cholesterol (LDL-C) is used as a threshold and target for treating dyslipidemia. Although the Friedewald equation is widely used to estimate LDL-C, it is known to be inaccurate in the case of high triglycerides (TG) or non-fasting states. We aimed to propose a novel method to estimate LDL-C using machine learning.

Methods

Using a large, single-center electronic health record database, we derived an ML algorithm to estimate LDL-C from standard lipid profiles. From 1,029,572 cases with both standard lipid profiles (total cholesterol, high-density lipoprotein-cholesterol, and TG) and direct LDL-C measurements, 823,657 tests were used to derive LDL-C estimation models. Patient characteristics such as sex, age, height, weight, and other laboratory values were additionally used to create separate data sets and algorithms.
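The counts above are consistent with a random 80/20 split of the test results (0.8 × 1,029,572 ≈ 823,657). The following is a minimal sketch of such a split, not the authors' code: the function name `split_derivation_validation`, the 80% fraction, and the fixed seed are illustrative assumptions.

```python
import random


def split_derivation_validation(records, derivation_fraction=0.8, seed=0):
    """Randomly partition records into a derivation set and a validation set.

    The fraction and seed are assumptions for illustration; the source only
    reports the resulting set sizes.
    """
    shuffled = list(records)
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(shuffled)
    n_derivation = int(len(shuffled) * derivation_fraction)
    return shuffled[:n_derivation], shuffled[n_derivation:]


# Stand-in record IDs; real records would be individual lipid test results.
derivation, validation = split_derivation_validation(range(1_029_572))
print(len(derivation), len(validation))
```

Splitting at the level of test results, as sketched here, matches the description that results were randomly selected; splitting by patient would be a different design choice.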

Results

Machine learning with gradient boosting (LDL-CX) and neural network (LDL-CN) showed better correlation with directly measured LDL-C, compared with conventional methods (r = 0.9662, 0.9668, 0.9563, 0.9585; for LDL-CX, LDL-CN, Friedewald [LDL-CF], and Martin [LDL-CM] equations, respectively). The overall bias of LDL-CX (−0.27 mg/dL, 95% CI −0.30 to −0.23) and LDL-CN (−0.01 mg/dL, 95% CI −0.04 to 0.03) was significantly smaller than that of either LDL-CF (−3.80 mg/dL, 95% CI −3.80 to −3.60) or LDL-CM (−2.00 mg/dL, 95% CI −2.00 to −1.94), especially at high TG levels.
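The two summary statistics reported here, correlation and bias, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' analysis: bias is taken as the mean of (estimated − measured), the 95% CI uses a normal approximation, and the sample values are made up.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def bias_with_ci(estimated, measured, z=1.96):
    """Mean difference (estimated - measured) with a normal-approximation CI.

    The sign convention and CI method are assumptions; the source does not
    state them.
    """
    diffs = [e - m for e, m in zip(estimated, measured)]
    mean_diff = statistics.fmean(diffs)
    half_width = z * statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff - half_width, mean_diff + half_width


# Made-up values in mg/dL, for illustration only (not study data).
measured = [100.0, 130.0, 160.0, 85.0]
estimated = [98.0, 127.0, 161.0, 82.0]

r = pearson_r(estimated, measured)
bias, ci_low, ci_high = bias_with_ci(estimated, measured)
print(f"r = {r:.4f}, bias = {bias:.2f} mg/dL (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Under this convention, a negative bias means the estimate is lower on average than the directly measured value.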

Conclusions

Machine learning algorithms were superior in estimating LDL-C compared with the conventional Friedewald or the more contemporary Martin equations. Through external validation and modification, machine learning could be incorporated into electronic health records to replace conventional LDL-C estimation.

Introduction

Dyslipidemia is defined as an abnormally high or low amount of one or more kinds of lipid in the blood. It is a major cause of atherosclerosis and, along with hypertension, diabetes, and smoking, is a strong risk factor for the development of cardiovascular disease [1]. Moreover, the prevalence of dyslipidemia has increased over the years as obesity has become more common. Among the many forms of lipids in the human body, low-density lipoprotein-cholesterol (LDL-C) is the most notable because of its wide use as a marker for diagnosis and treatment [2]. The benefits of lowering LDL-C for primary and secondary prevention of coronary heart disease have been supported by numerous major clinical trials [3]. Therefore, current guidelines focus on reducing LDL-C as the primary goal of treatment [4,5], and specify target values according to each individual's atherosclerotic cardiovascular disease (ASCVD) risk, which is also derived from lipid values [4].

Historically, plasma LDL-C could only be measured after ultracentrifugation. In 1972, based on an analysis of 448 patients, Friedewald proposed the following formula to estimate plasma LDL-C concentration from other routine cholesterol measurements:

LDL-C = Total cholesterol − High-density lipoprotein-cholesterol (HDL-C) − (Triglycerides / 5) [6].
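As a worked illustration of the formula above, a minimal sketch in Python; the function name `friedewald_ldl_c` and the input values are illustrative assumptions, and all quantities are taken to be in mg/dL.

```python
def friedewald_ldl_c(total_cholesterol, hdl_c, triglycerides):
    """Estimate LDL-C (mg/dL) with the Friedewald equation.

    All inputs are assumed to be in mg/dL; triglycerides / 5 approximates
    the cholesterol carried by very-low-density lipoprotein.
    """
    return total_cholesterol - hdl_c - triglycerides / 5


# Illustrative values, not patient data: 200 - 50 - 150 / 5
example_ldl_c = friedewald_ldl_c(200, 50, 150)
print(example_ldl_c)  # 120.0
```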

Although almost 50 years have passed since its first use, the original Friedewald equation is still widely used in everyday practice. However, there are several limitations that need to be acknowledged when using the Friedewald equation. For instance, the estimation of LDL-C becomes inaccurate when triglyceride (TG) levels are over 400 mg/dL [7–9]. Also, it has been reported that results of the calculation vary according to whether the patient has diabetes, liver disease, or is in a non-fasting state [10–12]. Due to these limitations, there have been constant attempts to make adjustments to the Friedewald equation or to develop a more sophisticated equation to estimate LDL-C [13,14]. Recently, Martin et al. have shown a concordance rate of 94.1%, using adjustable TG:VLDL ratios derived from 900,605 samples [15].
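The idea of an adjustable TG:VLDL ratio can be sketched by turning the fixed divisor of 5 into a parameter. This is a hedged illustration only: the names `estimate_ldl_c` and `tg_within_friedewald_range` are hypothetical, the divisor of 6.0 in the example is an arbitrary value chosen for demonstration, and the rule by which Martin et al. select the ratio for a given patient is not reproduced here.

```python
FRIEDEWALD_TG_LIMIT = 400  # mg/dL; above this the fixed-ratio estimate is reported as inaccurate


def estimate_ldl_c(total_cholesterol, hdl_c, triglycerides, tg_vldl_ratio=5.0):
    """Estimate LDL-C (mg/dL) using an adjustable TG:VLDL ratio.

    With tg_vldl_ratio=5.0 this reduces to the Friedewald equation.
    How a ratio is chosen per patient is outside the scope of this sketch.
    """
    if tg_vldl_ratio <= 0:
        raise ValueError("tg_vldl_ratio must be positive")
    return total_cholesterol - hdl_c - triglycerides / tg_vldl_ratio


def tg_within_friedewald_range(triglycerides):
    """Return True when TG does not exceed the 400 mg/dL limit."""
    return triglycerides <= FRIEDEWALD_TG_LIMIT


# Illustrative values only; 6.0 is a hypothetical ratio, not a published one.
fixed_ratio = estimate_ldl_c(200, 50, 150)           # 200 - 50 - 150 / 5
adjusted_ratio = estimate_ldl_c(200, 50, 150, 6.0)   # 200 - 50 - 150 / 6
print(fixed_ratio, adjusted_ratio, tg_within_friedewald_range(450))
```

The same lipid profile yields a different LDL-C estimate as the ratio changes, which is why a single fixed divisor can misestimate LDL-C for some patients.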

In an era when LDL-C goals are being set at unprecedented levels, accurate identification of LDL-C is important for treating dyslipidemia and preventing cardiovascular disease. Thus, we aimed to establish novel machine learning (ML) models to precisely estimate LDL-C using a large electronic health record (EHR) database.

Data and model preparation

Data of lipid profile tests were acquired from a large single-center EHR database. Those with results of both standard lipid profiles (total cholesterol, HDL-C, TG) and directly measured LDL-C were included in the analysis. Tests which were not performed on a single blood sample were excluded. To increase the generalizability of data, all results regardless of age and sex were included in the analysis. Three different models were constructed, according to the number of features used in the

Baseline characteristics

Retrospective analysis of the EHR database from October 1999 to February 2019 revealed 1,029,572 test results from around 398,960 patients. Baseline characteristics for patients in the derivation and validation set for the primary model are summarized in Table 1. From total lipid samples, 823,657 results were randomly selected and used as a derivation set, while 205,915 results were used to test the acquired ML algorithm. The mean age of patients was 63 years, and 60% were males. The mean TC,

Discussion

Using an EHR database, we were able to apply ML algorithms to propose a novel method for estimating LDL-C. Unlike previous studies which mainly focused on fine-tuning the Friedewald equation by adjusting coefficients or introducing other lipid profiles [13–15,24], we aimed to derive LDL-C using standard lipid values and easily acquired patient characteristics. As a result, estimation of LDL-C using XGBoost and NN algorithms resulted in better correlation with directly measured LDL-C,

Study limitations

The limitations of this study are as follows. First, our estimation was derived and validated using patient records from a single center with a high proportion of patients who were referred from primary or secondary clinics. Generalization of our current algorithm may not be appropriate unless external validation is performed using an independent cohort. In the future, we plan to apply our algorithm to other data using a common data model (CDM) to improve accuracy and generalizability. Second,

Conclusion

Through ML, we were able to accurately estimate LDL-C regardless of individual patient conditions. The novel method personalizes the estimation of LDL-C, giving each patient a precise, tailored diagnosis without the need for increased expenses.

Funding sources

None.

Declaration of Competing Interest

The authors report no relationships that could be construed as a conflict of interest.

References (34)

  • D.M. DeLong et al. A comparison of methods for the estimation of plasma low- and very low-density lipoprotein cholesterol. The Lipid Research Clinics Prevalence Study. JAMA (1986)

  • Z. Reiner. Triglyceride-rich lipoproteins and novel targets for anti-atherosclerotic therapy. Korean Circ. J. (2018)

  • P.S. Bachorik et al. National Cholesterol Education Program recommendations for measurement of low-density lipoprotein cholesterol: executive summary. The National Cholesterol Education Program Working Group on Lipoprotein Measurement. Clin. Chem. (1995)

  • F. Razi et al. LDL-cholesterol measurement in diabetic type 2 patients: a comparison between direct assay and popular equations. J. Diabet. Metabol. Disord. (2017)

  • C. Matas et al. Limitations of the Friedewald formula for estimating low-density lipoprotein cholesterol in alcoholics with liver disease. Clin. Chem. (1994)

  • Y. Chen et al. A modified formula for calculating low-density lipoprotein cholesterol values. Lipids Health Dis. (2010)

  • S.S. Martin et al. Comparison of a novel method vs the Friedewald equation for estimating low-density lipoprotein cholesterol levels from the standard lipid profile. JAMA (2013)
    1

    These authors contributed equally to the work.

    2

    These authors take responsibility for all aspects of the reliability and freedom from bias of the data presented and their discussed interpretation.

    3

    These authors take responsibility for all aspects of the reliability.
