Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting

Peter C. Austin*, Frank E. Harrell, Ewout W. Steyerberg

*Corresponding author for this work

Research output: Contribution to journalArticleAcademicpeer-review

29 Citations (Scopus)


Machine learning approaches are increasingly suggested as tools to improve prediction of clinical outcomes. We aimed to identify when machine learning methods perform better than a classical learning method. We hereto examined the impact of the data-generating process on the relative predictive accuracy of six machine and statistical learning methods: bagged classification trees, stochastic gradient boosting machines using trees as the base learners, random forests, the lasso, ridge regression, and unpenalized logistic regression. We performed simulations in two large cardiovascular datasets which each comprised an independent derivation and validation sample collected from temporally distinct periods: patients hospitalized with acute myocardial infarction (AMI, n = 9484 vs. n = 7000) and patients hospitalized with congestive heart failure (CHF, n = 8240 vs. n = 7608). We used six data-generating processes based on each of the six learning methods to simulate outcomes in the derivation and validation samples based on 33 and 28 predictors in the AMI and CHF data sets, respectively. We applied six prediction methods in each of the simulated derivation samples and evaluated performance in the simulated validation samples according to c-statistic, generalized R2, Brier score, and calibration. While no method had uniformly superior performance across all six data-generating process and eight performance metrics, (un)penalized logistic regression and boosted trees tended to have superior performance to the other methods across a range of data-generating processes and performance metrics. This study confirms that classical statistical learning methods perform well in low-dimensional settings with large data sets.

Original languageEnglish
Pages (from-to)1465-1483
Number of pages19
JournalStatistical Methods in Medical Research
Issue number6
Publication statusPublished - 13 Apr 2021

Bibliographical note

Funding Information:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by ICES, which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC). This research was supported by an operating grant from the Canadian Institutes of Health Research (CIHR) (PJT – 166161). PCA is supported in part by Mid-Career Investigator awards from the Heart and Stroke Foundation. FEH’s work on this paper was supported by CTSA award No. UL1 TR002243 from the National Center for Advancing Translational Sciences. Its contents are solely the responsibility of the authors and do not necessarily represent official views of the National Center for Advancing Translational Sciences or the National Institutes of Health.

Publisher Copyright:
© The Author(s) 2021.


Dive into the research topics of 'Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting'. Together they form a unique fingerprint.

Cite this