
Triage-driven diagnosis of Barrett’s esophagus for early detection of esophageal adenocarcinoma using deep learning

Abstract

Deep learning methods have been shown to achieve excellent performance on diagnostic tasks, but how to optimally combine them with expert knowledge and existing clinical decision pathways is still an open challenge. This question is particularly important for the early detection of cancer, where high-volume workflows may benefit from (semi-)automated analysis. Here we present a deep learning framework to analyze samples of the Cytosponge-TFF3 test, a minimally invasive alternative to endoscopy, for detecting Barrett’s esophagus, which is the main precursor of esophageal adenocarcinoma. We trained and independently validated the framework on data from two clinical trials, analyzing a combined total of 4,662 pathology slides from 2,331 patients. Our approach exploits decision patterns of gastrointestinal pathologists to define eight triage classes of varying priority for manual expert review. By substituting manual review with automated review in low-priority classes, we can reduce pathologist workload by 57% while matching the diagnostic performance of experienced pathologists.
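To make the triage idea concrete, the following minimal sketch simulates the workload/performance trade-off of substituting automated review for manual review, class by class. It is not the authors' pipeline: the triage-class assignments, call accuracies and patient data are synthetic placeholders chosen only to illustrate the mechanism.

```python
# Illustrative sketch of triage-driven substitution (all data synthetic).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
triage_class = rng.integers(1, 9, n)   # eight triage classes, 1 = lowest priority (assumed)
truth = rng.integers(0, 2, n)          # ground-truth Barrett's status (synthetic)
model_call = np.where(rng.random(n) < 0.90, truth, 1 - truth)    # synthetic automated calls
expert_call = np.where(rng.random(n) < 0.97, truth, 1 - truth)   # synthetic expert calls

def sensitivity(y_true, y_pred):
    return np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1)

# Substitute automated review for manual review in classes 1..k, lowest priority first
for k in range(9):
    automated = triage_class <= k
    final = np.where(automated, model_call, expert_call)
    print(f"classes 1..{k} automated | manual workload {1 - automated.mean():.0%} "
          f"| sensitivity {sensitivity(truth, final):.3f}")
```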


Fig. 1: Cytosponge procedure, triage scheme and data summary.
Fig. 2: Tile- and patient-level classification of Cytosponge-TFF3 samples.
Fig. 3: Application of quality control and diagnostic confidence class scheme to the internal validation cohort.
Fig. 4: Triage-driven approach with incremental triage class substitution scheme on internal validation set.
Fig. 5: Triage model applied to the external validation cohort and simulation of cohort variation.


Data availability

The dataset is governed by data usage policies specified by the data controller (University of Cambridge, Cancer Research UK). We are committed to complying with Cancer Research UK’s Data Sharing and Preservation Policy. Whole-slide images used in this study will be available for non-commercial research purposes upon approval by a Data Access Committee according to institutional requirements. Applications for data access should be directed to rcf29@cam.ac.uk. Data derived from the raw images are freely available at a public repository: https://github.com/markowetzlab/cytosponge-triage. The code and included data enable replication of the results and figures in this manuscript.

Code availability

The source code of this work is freely available at a public repository: https://github.com/markowetzlab/cytosponge-triage.


Acknowledgements

This research was supported by Cancer Research UK (F.M.: C14303/A17197), the Medical Research Council (R.C.F.: RG84369) and Cambridge University Hospitals NHS Foundation Trust. BEST2 was funded by Cancer Research UK (12088 and 16893). M.G. acknowledges support from an Enrichment Fellowship from the Alan Turing Institute. M.C.O. acknowledges support from a Borysiewicz Fellowship from the University of Cambridge and a Junior Research Fellowship from Trinity College, Cambridge. F.M. is a Royal Society Wolfson Research Merit Award holder. We thank M. Schneider, R. Drews, P. Martinez-Gonzalez and T. Whitmarsh for valuable input on this work. The authors thank the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014) and the Experimental Cancer Medicine Centre for their support and for providing the infrastructure for the research procedures in Cambridge. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care. In addition, we thank the Human Research Tissue Bank at Addenbrooke's Hospital, which is supported by the UK National Institute for Health Research Cambridge Biomedical Research Centre. Finally, we thank the BEST2 trial team, the Histopathology core facility at the Cancer Research UK Cambridge Institute and Pathognomics Ltd. for their support.

Author information


Contributions

M.G. conceived and led the analysis. M.C.O. and A.B. contributed to the analysis. M.G. and A.B. wrote the code for analysis. M.O. and R.C.F. were involved in the collection and labeling of the data. R.C.F. conceived the study. R.C.F. and F.M. directed the project. M.G. and F.M. wrote the manuscript with the assistance and feedback of all other co-authors.

Corresponding authors

Correspondence to Rebecca C. Fitzgerald or Florian Markowetz.

Ethics declarations

Competing interests

The Cytosponge device technology and the associated TFF3 biomarker are licensed to Covidien GI solutions (now owned by Medtronic) by the Medical Research Council. M.G., M.C.O. and F.M. are named inventors on a patent pertaining to technology applied in this work. R.C.F. and M.O. are named inventors on patents pertaining to the Cytosponge and associated technology. M.G., M.O. and R.C.F. are shareholders of Cyted Ltd., a company working on early detection technology.

Additional information

Peer review information Nature Medicine thanks Marnix Jansen, Nasir Rajpoot and Pratik Shah for their contribution to the peer review of this work. Javier Carmona was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Differential increase of training partition size for ResNet-18.

Training subset refers to the relative proportion of the training partition used in the model training phase. Development subset refers to the relative proportion of the training partition used in the model development phase. The peak development weighted recall (a) and precision (b) correspond to the best-performing cohort for each training run. The size of the development set was fixed at 15 patients. For each patient, an average of 3,500 tiles was used. For both H&E and TFF3, no substantial increase in performance metrics was observed beyond a training subset size of 50 patients. Individual Cytosponge H&E sections are already highly heterogeneous, which limits the value gained by increasing the size of the training dataset. We opted to retain all annotated data in the training set to maximize the chances of capturing the whole spectrum of data variability, and thereby the robustness of the model. The H&E model benefited more from an increased number of patients than the TFF3 model, a difference associated with the greater complexity of detecting diverse tissue morphologies on H&E versus brown goblet cells on TFF3. In TFF3 slides, regions were extensively annotated by pathologists, and this ground truth served as the comparator for the recall reported in both panels.

Extended Data Fig. 2 Comparison of pathologist landmarks with saliency maps extracted from VGG-16 architectures.

Additional examples of saliency maps for hematoxylin & eosin stain (squamous cells and columnar epithelium) and trefoil factor 3 (positive goblet cells). Landmarks selected by an experienced pathologist are shown as overlays with red borders on pathology tile images. For all classes, there was visual agreement between the areas highlighted by the pathologist and the saliency map activations.
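The mechanics of extracting such a map can be sketched in PyTorch as follows. This is a plain gradient-saliency variant, not the authors' code: the VGG-16 weights here are untrained, and `tile.png` is a hypothetical input file standing in for a Cytosponge tile.

```python
# Minimal gradient-saliency sketch for a VGG-16 tile classifier (illustrative only).
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg16(weights=None)      # in practice: the trained tile classifier
model.eval()

preprocess = transforms.Compose([transforms.Resize((224, 224)),
                                 transforms.ToTensor()])
tile = preprocess(Image.open("tile.png").convert("RGB")).unsqueeze(0)
tile.requires_grad_(True)

logits = model(tile)
target = logits.argmax(dim=1).item()    # e.g. the 'positive goblet cells' class
logits[0, target].backward()

# Per-pixel saliency: maximum absolute gradient over the three colour channels
saliency = tile.grad.abs().max(dim=1).values.squeeze(0)   # shape: 224 x 224
```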

Extended Data Fig. 3 Determination of probability thresholds used to obtain tile counts.

Both plots show the AUC-ROC for individual probability thresholds (after softmax) that are used to decide whether a tile falls into the relevant class. a, AUC-ROC for quality control (QC) ground truth determined by the pathologist compared with the number of tiles containing columnar epithelium at individual probability thresholds. b, AUC-ROC for diagnosis ground truth determined by endoscopy (with confirmed IM on pathology) compared with the number of tiles containing positive goblet cells at individual probability thresholds.
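The threshold sweep itself is straightforward to sketch: for each softmax threshold, count each patient's tiles exceeding it and score the counts against the ground-truth labels. In the snippet below, the labels and tile probabilities are randomly generated placeholders, so the printed AUC values are meaningless; only the procedure mirrors the caption.

```python
# Sketch of the per-threshold AUC-ROC computation (synthetic placeholder data).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 100)                 # e.g. endoscopy-confirmed IM (synthetic)
# per patient: one softmax probability for the relevant class per tile
tile_probs = [rng.random(rng.integers(500, 5000)) for _ in labels]

for threshold in np.round(np.arange(0.50, 1.00, 0.05), 2):
    counts = [np.sum(p > threshold) for p in tile_probs]
    print(f"threshold {threshold:.2f}: AUC-ROC {roc_auc_score(labels, counts):.3f}")
```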

Extended Data Fig. 4 Performance of all deep learning architectures on the calibration cohort.

a, ROC analysis of the number of tiles containing columnar epithelium on H&E compared with pathologist ground truth from Cytosponge. b, ROC analysis of the number of tiles containing positive goblet cells on TFF3 compared with pathologist ground truth from Cytosponge. c, ROC analysis of the number of tiles containing positive goblet cells on TFF3 compared with endoscopy (with confirmed IM) ground truth. A weak dependency of AUC on architecture complexity can be observed.

Extended Data Fig. 5 Performance of all deep learning architectures on the internal validation cohort.

a, ROC analysis of the number of tiles containing columnar epithelium on H&E compared with pathologist ground truth from Cytosponge. b, ROC analysis of the number of tiles containing positive goblet cells on TFF3 compared with pathologist ground truth from Cytosponge. c, ROC analysis of the number of tiles containing positive goblet cells on TFF3 compared with endoscopy (with confirmed IM) ground truth. As in the calibration cohort, a weak dependency of AUC on architecture complexity can be observed.

Extended Data Fig. 6 Application of quality control and diagnostic confidence class scheme to calibration cohort.

The lines indicate operating points chosen by three different expert observers. a, Quality ground truth by pathologist from Cytosponge (top) compared with the number of columnar epithelium (CE) tiles on H&E detected by VGG-16 (bottom). For the first operating point, E#2 and E#3 agreed, whereas E#1 selected a higher cut-off; majority voting resulted in the lower cut-off being chosen. For the second operating point, all three observers (E#1, E#2 and E#3) agreed on the same threshold; the line drawn by E#1 effectively resulted in the same operating point as E#2 and E#3. b, Diagnosis ground truth by pathologist from Cytosponge (top) and endoscopy (with confirmed IM on biopsy) ground truth (middle) compared with the number of TFF3-positive tiles detected by ResNet-18 (bottom). For both the first and second operating points, E#1, E#2 and E#3 agreed; the line drawn by E#3 for the second operating point effectively resulted in the same operating point as E#1 and E#2.
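The majority-vote resolution of observer cut-offs is simple enough to sketch directly; the cut-off values below are invented for illustration and are not those chosen in the study.

```python
# Sketch of majority voting over observers' operating points (invented values).
from collections import Counter

cutoffs = {"E#1": 20, "E#2": 10, "E#3": 10}   # tiles with columnar epithelium (hypothetical)
value, n_votes = Counter(cutoffs.values()).most_common(1)[0]
print(f"chosen operating point: {value} tiles ({n_votes} of {len(cutoffs)} votes)")
```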

Extended Data Fig. 7 Performance of semi-automated, triage-driven model on external validation cohort.

a, Cumulative substitution scheme starting with fully manual review, followed by substitution with automated review of class 1, then classes 1 and 2, and so on. b, Cumulative substitution scheme starting with fully manual review, followed by substitution with automated review of class 8, then classes 8 and 7, and so on.


About this article

Cite this article

Gehrung, M., Crispin-Ortuzar, M., Berman, A.G. et al. Triage-driven diagnosis of Barrett’s esophagus for early detection of esophageal adenocarcinoma using deep learning. Nat Med 27, 833–841 (2021). https://doi.org/10.1038/s41591-021-01287-9
