Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Letter
  • Published:

Cancer prognosis with shallow tumor RNA sequencing

Abstract

Disrupted molecular pathways are often robustly associated with disease outcome in cancer1,2,3. Although biologically informative transcriptional pathways can be revealed by RNA sequencing (RNA-seq) at up to hundreds of folds reduction in conventionally used coverage4,5,6, it remains unknown how low-depth sequencing datasets perform in the challenging context of developing transcriptional signatures to predict clinical outcomes. Here we assessed the possibility of cancer prognosis with shallow tumor RNA-seq, which would potentially enable cost-effective assessment of much larger numbers of samples for deeper biological and predictive insights. By statistically modeling the relative risk of an adverse outcome for thousands of subjects in The Cancer Genome Atlas7,8,9,10,11,12,13, we present evidence that subsampled tumor RNA-seq data with a few hundred thousand reads per sample provide sufficient information for outcome prediction in several types of cancer. Analysis of predictive models revealed robust contributions from pathways known to be associated with outcomes. Our findings indicate that predictive models of outcomes in cancer may be developed with dramatically increases in sample numbers at low cost, thus potentially enabling the development of more realistic predictive models that incorporate diverse variables and their interactions. This strategy could also be used, for example, in longitudinal analysis of multiple regions of a tumor alongside treatment for quantitative modeling and prediction of outcome in personalized oncology.

This is a preview of subscription content, access via your institution

Access options

Rent or buy this article

Prices vary by article type

from$1.95

to$39.95

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Statistically modeling disease outcome of TCGA subjects using tumor RNA-seq data.
Fig. 2: Linear models of tumor RNA-seq data have significant power to complement classical clinical parameters in predicting an outcome for several types of cancer.
Fig. 3: A substantial reduction in sequencing depth does not affect predictive model building and performance.

Similar content being viewed by others

Data availability

No custom software was developed for this report and all data are publicly available. A detailed description of the methodological procedures is available in the Methods, including packages and functions used, which are also publicly available. The code used to reproduce the core analysis is described at https://github.com/PedroMilanezAlmeida/TumorShallowSeq.

References

  1. Hahn, W. C. & Weinberg, R. A. Modelling the molecular circuitry of cancer. Nat. Rev. Cancer 2, 331–341 (2002).

    Article  CAS  PubMed  Google Scholar 

  2. Vogelstein, B. et al. Cancer genome landscapes. Science 339, 1546–1558 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  3. Vogelstein, B. & Kinzler, K. W. Cancer genes and the pathways they control. Nat. Med. 10b, 789–799 (2004).

    Article  Google Scholar 

  4. Heimberg, G., Bhatnagar, R., El-Samad, H. & Thomson, M. Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell Syst. 2, 239–250 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Kliebenstein, D. J. Exploring the shallow end; estimating information content in transcriptomics studies. Front. Plant Sci. 3, 213 (2012).

    PubMed  PubMed Central  Google Scholar 

  6. Pollen, A. A. et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–1058 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Liu, J. et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell 173, 400–416.e11 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Hutter, C. & Zenklusen, J. C. The cancer genome atlas: creating lasting value beyond its data. Cell 173, 283–285 (2018).

    Article  CAS  PubMed  Google Scholar 

  9. Dupuy, A. & Simon, R. M. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J. Natl Cancer Inst. 99, 147–157 (2007).

    Article  PubMed  Google Scholar 

  10. Simon, N., Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J. Stat. Softw. 39, 1–13 (2011).

    Article  PubMed  PubMed Central  Google Scholar 

  11. Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. Royal Stat. Soc. B 67, 301–320 (2005).

    Article  Google Scholar 

  12. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).

    Article  PubMed  PubMed Central  Google Scholar 

  13. Therneau, T. M. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model (Springer, 2000).

  14. Ding, L. et al. Perspective on oncogenic processes at the end of the beginning of cancer genomics. Cell 173, 305–320.e10 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Tothill, R. W. et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin. Cancer Res. 14, 5198–5208 (2008).

    Article  CAS  PubMed  Google Scholar 

  16. Dai, H. et al. A cell proliferation signature is a marker of extremely poor outcome in a subpopulation of breast cancer patients. Cancer Res. 65, 4059–4066 (2005).

    Article  CAS  PubMed  Google Scholar 

  17. Budhu, A. et al. Prediction of venous metastases, recurrence, and prognosis in hepatocellular carcinoma based on a unique immune response signature of the liver microenvironment. Cancer Cell 10, 99–111 (2006).

    Article  CAS  PubMed  Google Scholar 

  18. Merlos-Suárez, A. et al. The intestinal stem cell signature identifies colorectal cancer stem cells and predicts disease relapse. Cell Stem Cell 8, 511–524 (2011).

    Article  PubMed  Google Scholar 

  19. Cuzick, J. et al. Prognostic value of an RNA expression signature derived from cell cycle proliferation genes in patients with prostate cancer: a retrospective study. Lancet Oncol. 12, 245–255 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Rosenwald, A. et al. The proliferation gene expression signature is a quantitative integrator of oncogenic events that predicts survival in mantle cell lymphoma. Cancer Cell 3, 185–197 (2003).

    Article  CAS  PubMed  Google Scholar 

  21. Thorsson, V. et al. The immune landscape of cancer. Immunity 48, 812–830.e14 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Gentles, A. J. et al. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat. Med. 21, 938–945 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Sanchez-Vega, F. et al. Oncogenic signaling pathways in the cancer genome atlas. Cell 173, 321–337.e10 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Ein-Dor, L., Kela, I., Getz, G., Givol, D. & Domany, E. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 21, 171–178 (2005).

    Article  CAS  PubMed  Google Scholar 

  25. Venet, D., Dumont, J. E. & Detours, V. Most random gene expression signatures are significantly associated with breast cancer outcome. PLoS Comput. Biol. 7, e1002240 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Statnikov, A., McVoy, L., Lytkin, N. & Aliferis, C. F. Improving development of the molecular signature for diagnosis of acute respiratory viral infections. Cell Host Microbe 7, 100–1 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Zaas, A. K. et al. Response: improving development of the molecular signature for diagnosis of acute respiratory viral infections. Cell Host Microbe 7, 102 (2010).

    Article  CAS  Google Scholar 

  28. Senft, D., Leiserson, M. D. M., Ruppin, E. & Ronai, Z. A. Precision oncology: the road ahead. Trends Mol. Med. 23, 874–898 (2017).

    Article  PubMed  PubMed Central  Google Scholar 

  29. Kumar-Sinha, C. & Chinnaiyan, A. M. Precision oncology in the age of integrative genomics. Nat. Biotechnol. 36, 46–60 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Cieślik, M. & Chinnaiyan, A. M. Cancer transcriptome profiling at the juncture of clinical translation. Nat. Rev. Genet. 19, 93–109 (2018).

    Article  PubMed  Google Scholar 

  31. McGranahan, N. & Swanton, C. Clonal heterogeneity and tumor evolution: past, present, and the future. Cell 168, 613–628 (2017).

    Article  CAS  PubMed  Google Scholar 

  32. Gerlinger, M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 366, 883–892 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Robinson, D. G. & Storey, J. D. subSeq: determining appropriate sequencing depth through efficient read subsampling. Bioinformatics 30, 3424–3426 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Dang, C. V. Links between metabolism and cancer. Genes Dev. 26, 877–890 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Kurtz, D. M. et al. Dynamic risk profiling using serial tumor biomarkers for personalized outcome prediction. Cell 178, 699–713.e19 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2019).

  37. RStudio, Inc., RStudio Team. RStudio: Integrated Development Environment for R (RStudio, 2016).

  38. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  39. Colaprico, A. et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 44, e71 (2016).

    Article  PubMed  Google Scholar 

  40. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  41. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).

    PubMed  PubMed Central  Google Scholar 

  42. Microsoft & Weston, S. foreach: Provides Foreach Looping Construct for R (CRAN, 2017).

  43. Wickham, H., François, R., Henry, L. & Müller, K. dplyr: A Grammar of Data Manipulation (CRAN, 2019).

  44. Tang, Y., Horikoshi, M. & Li, W. X. ggfortify: unified interface to visualize statistical results of popular R packages. R Journal 8, 474–485 (2016).

    Article  Google Scholar 

  45. Wickham, H. Ggplot2: Elegant Graphics for Data Analysis (Springer, 2009).

  46. Auguie, B. gridExtra: Miscellaneous Functions for “Grid” Graphics (CRAN, 2017).

  47. Cancer Genome Atlas Research Network. et al. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).

    Article  PubMed Central  Google Scholar 

  48. Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 28 https://doi.org/10.18637/jss.v028.i05 (2008).

  49. Jr, F. E. H. rms: Regression Modeling Strategies (CRAN, 2019).

  50. Liu, Y. et al. Comparative molecular analysis of gastrointestinal adenocarcinomas. Cancer Cell 33, 721–735.e8 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Liberzon, A. et al. The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. Weiner, J. tmod: Feature Set Enrichment Analysis for Metabolomics and Transcriptomics (CRAN, 2018).

Download references

Acknowledgements

We would like to thank S. J. Chanock for early critical comments and for reviewing the manuscript, as well as E. Condiff, A. Gola, H. Wong and S. Uderhardt for fruitful suggestions and discussions. This work was supported by the Intramural Research Program of the National Institute of Allergy and Infectious Disease (NIAID), National Institutes of Health (NIH) and the Institutes supporting the Trans-NIH Center for Human Immunology. This study used the Office of Cyber Infrastructure and Computational Biology (OCICB) High Performance Computing (HPC) cluster at NIAID.

Author information

Authors and Affiliations

Authors

Contributions

P.M.-A.: study conception and analysis design, acquisition, analysis and interpretation of the data, draft and revision of the manuscript; A.J.M.: analysis design, result interpretation and manuscript revision; R.N.G.: discussion of study design and results, draft and revision of the manuscript; J.S.T.: study and analysis design, result interpretation, draft and revision of the manuscript.

Corresponding author

Correspondence to Pedro Milanez-Almeida.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Javier Carmona was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Linear models of tumor RNA-seq data have significant power to predict RRE in several types of cancer: p-values.

Cox regression models, penalized to select any linear combination of gene expression levels across the whole genome, were trained and validated on RNA-seq samples from TCGA (always for each cancer type separately). For comparison, FOXM1/KLRB1 expression-based predictor as well as molecular subtyping by the top molecular feature associated with outcome in TCGA were also analyzed (see list of molecular subtypes below). Density distributions of p-values derived from testing the statistical significance (likelihood ratio test) of predicting outcome using the predicted RRE derived with penalized Cox (red) or FOXM1/KLRB1 (orange) or using molecular subtyping (blue) before adjusting for false discovery rate (FDR). Two-sided statistics were used. Numbers are the upper limit of 95% confidence intervals of the median of p-values after adjusting for FDR as by Benjamini-Hochberg (BH). The threshold for statistical significance was set at the upper limit ≤0.05. Error bars are 95% confidence intervals of the median after adjustment for FDR. Only the p-values of models that passed a test for proportional hazards assumption at alpha=0.05 are shown. Numbers on the far right are the total number of samples available for each cancer type. Cancer type abbreviation: cancer type (molecular subtype classification) – LGG: brain lower grade glioma (supervised DNA methylation cluster); KIRC: kidney renal clear cell carcinoma (mRNA cluster); KIRP: kidney renal papillary cell carcinoma (methylation cluster [Laird group]); MESO: mesothelioma (ncRNA cluster k4); PRAD: prostate adenocarcinoma (BRCA2 CNA); LUAD: lung adenocarcinoma (fusions); ACC: adrenocorticalcarcinoma (mRNA K4); LIHC: liver hepatocellular carcinoma (TP53 mutation signature [sum of expression of targets]); BLCA: bladder urothelial carcinoma (total number of single nucleotide variants); HNSC: head-neck squamous cell carcinoma (copy number); UVM: uveal melanoma (miRNA cluster); PAAD: pancreatic adenocarcinoma (RPPA clusters); LAML: acute myeloid leukemia (not available); UCEC: uterine corpus endometrial carcinoma (CNA cluster k4); CESC: cervical squamous cell carcinoma and endocervical adenocarcinoma (GEXP APOBEC3F); BRCA: breast invasive carcinoma (pan gyn clusters); THCA: thyroid carcinoma (TAtoGC); COAD: colon adenocarcinoma (MS frac stage1 AAtoAC); UCS: uterine carcinosarcoma (RB1 mutation); ESCA: esophageal carcinoma (MS frac stage1 AAtoAC); GBM: glioblastoma multiforme (IDH codel subtype); THYM: thymoma (GTF2I mutation); LUSC: lung squamous cell carcinoma (non-silent mutations); OV: ovarian cancer (not available); READ: rectum adenocarcinoma (arm CN 1q); STAD: stomach carcinoma (MLH1 methylation); TGCT: testicular germ cell tumors (T:A>C:G%).

Extended Data Fig. 2 Linear models of tumor RNA-seq data have significant power to predict RRE in several types of cancer: c-indices.

Same as in Extended Data Fig. 1, but showing density distributions of c-index found by univariate Cox models for each variable, after correction with the Dxy value. Numbers inside the boxes are medians. Error bars are 95% confidence intervals of the median. Where curves are missing, the corresponding parameter was not available. Numbers on the far right are the total number of samples available for each cancer type. Only the c-indices of models that passed a test for proportional hazards assumption at alpha = 0.05 are shown. Two-sided statistics were used.

Extended Data Fig. 3 Assessment of penalized Cox models using independent data.

Genes with mean coefficients not extremely close to zero (genes within the two bins closest to zero, out of 100 bins, were excluded; see Supplementary Fig. 1) were separated into two classes according to the sign of their mean coefficient (positive or negative). The distribution of their meta z score in the recent PRECOG meta-analysis of tumor gene expression was then retrieved and visualized. Numbers inside the boxes are BH-adjusted p-values from a one-sided Wilcoxon rank sum test of whether the mean ranks of meta z scores were equal when comparing genes with positive vs negative mean coefficients in penalized Cox. Numbers at the bottom of the plots show the number of genes that had either positive or negative mean coefficients in penalized Cox for each cancer type after exclusion of those extremely close to zero (n).

Extended Data Fig. 4 Cell division/proliferation, immune response and epithelial mesenchymal transition are among the pathways enriched in the penalized Cox regression-derived models.

The absolute mean of the coefficients from all one hundred rounds of training was used as ranking of genes in pathway enrichment analysis among the 50 Hallmark MsigDB gene sets (H-MsigDB). The CERNO test was used to derive and BH-correct p-values (FDR). The union of the top 5 most significantly enriched pathways by p-value for each cancer type are shown. Please see the Methods for details on additional pathway enrichment analysis procedures that we performed.

Extended Data Fig. 5 Reduction in sample sizes to save cost could lead to models of considerably poorer performance.

a, Schematic representation of data split to simulate reduced sample sizes available for training. 20% of the data were kept hidden to serve as test set (“Validation”, blue), while the remaining data were used for training (80% of all data), or half of the remaining data were used for training (40% of all) or a quarter of the remaining data were used in training (20% of all). The split into training and test data was repeated a hundred times, as was the whole training and testing procedure (i.e., training with penalized Cox followed by prediction of RRE and non-penalized, univariate Cox regression with test samples). b, Median p-values, c-indices and AICs from validation models (two-sided statistics) for LGG, LUAD and HNSC – representative cancer types across different levels of performance and relative large total number of samples (n = 506, 500 and 499, respectively). Please note how relative performance becomes worse much faster when progressing from left to right (i.e., reducing sample size by 2-fold steps) within each heatmap than when progressing from top to bottom (i.e., reducing sequencing depth by 10-fold steps).

Extended Data Fig. 6 Assessment of penalized Cox models using independent data after training on subsampled data.

As in Extended Data Fig. 3, genes with positive and negative mean coefficients not very close to zero (Supplementary Fig. 13) were analyzed with regard to their meta z score previously derived in a meta-analysis of tumor gene expression22. Numbers inside the boxes are BH-adjusted p-values from a one-sided Wilcoxon rank sum test of whether the mean ranks of meta z scores were equal when comparing genes with positive vs negative mean coefficients in penalized Cox. One representative example from the 10 subsampling replicates of 100-fold reduction is shown. Numbers at the bottom of the plots show the number of genes that had either positive or negative mean coefficients in penalized Cox for each cancer type after exclusion of those extremely close to zero (n).

Extended Data Fig. 7 Cell division/proliferation, immune response, epithelial mesenchymal transition, and metabolic pathways are among the most enriched in models derived from training on subsampled datasets.

As in Extended Data Fig. 4, the absolute mean of the coefficients from all one hundred rounds of training on one representative example from the 10 subsampling replications of 100-fold reduction was used as ranking of genes in pathway enrichment analysis among the 50 Hallmark MsigDB gene sets. The CERNO test was used to derive and BH-adjust p-values (FDR). The union of the top 5 most significantly enriched pathways by p-value for each cancer type are shown. Clustering of pathways was based on negative log10 adjusted p-values. Please see the Methods for details on additional pathway enrichment analysis procedures that we performed.

Supplementary information

Supplementary Information

Supplementary Figs. 1–16.

Reporting Summary

Supplementary Table

Supplementary Table 1

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Milanez-Almeida, P., Martins, A.J., Germain, R.N. et al. Cancer prognosis with shallow tumor RNA sequencing. Nat Med 26, 188–192 (2020). https://doi.org/10.1038/s41591-019-0729-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41591-019-0729-3

This article is cited by

Search

Quick links

Nature Briefing: Cancer

Sign up for the Nature Briefing: Cancer newsletter — what matters in cancer research, free to your inbox weekly.

Get what matters in cancer research, free to your inbox weekly. Sign up for Nature Briefing: Cancer