Investigating the impact of development and internal validation design when training prognostic models using a retrospective cohort in big US observational healthcare data

Jenna M. Reps*, Patrick Ryan, P. R. Rijnbeek

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Objective: The internal validation of prediction models aims to quantify the generalisability of a model. We aim to determine the impact, if any, that the choice of development and internal validation design has on internal performance bias and model generalisability in big data (n ∼ 500 000).

Design: Retrospective cohort.

Setting: Primary and secondary care; three US claims databases.

Participants: 1 200 769 patients pharmaceutically treated for their first occurrence of depression.

Methods: We investigated the impact of the development/validation design across 21 real-world prediction questions. Model discrimination and calibration were assessed. We trained LASSO logistic regression models using US claims data and internally validated the models using eight different designs: 'no test/validation set', 'test/validation set', and 3-fold, 5-fold or 10-fold cross-validation, each with and without a test set. We then externally validated each model in two new US claims databases and estimated the internal validation bias per design by empirically comparing the differences between the estimated internal performance and the external performance.

Results: The differences between the models' internal estimated performances and external performances were largest for the 'no test/validation set' design, indicating that even with large data this design causes models to overfit. The seven alternative designs included some validation process to select the hyperparameters and a fair testing process to estimate internal performance; they produced similar internal performance estimates and performed similarly when externally validated in the two external databases.

Conclusions: Even with big data, it is important to use some validation process to select the optimal hyperparameters and to fairly assess internal performance using a test set or cross-validation.
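To make the designs concrete, below is a minimal sketch of one of the eight designs compared in the Methods: a held-out test set with 5-fold cross-validation used to select the LASSO regularisation hyperparameter. This is not the authors' pipeline; the synthetic data, scikit-learn's LogisticRegressionCV and all parameter values are illustrative assumptions.

```python
# Sketch of the 'test set + 5-fold cross-validation' design, assuming
# synthetic data and scikit-learn (illustrative only; the study applied
# LASSO logistic regression to large US claims cohorts).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large observational cohort with a rare outcome.
X, y = make_classification(n_samples=20_000, n_features=100,
                           n_informative=20, weights=[0.95],
                           random_state=0)

# Hold out a test set for a fair internal performance estimate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 5-fold cross-validation on the training data selects the LASSO
# regularisation strength (the hyperparameter-validation step).
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1",
                             solver="liblinear", scoring="roc_auc",
                             max_iter=1000)
model.fit(X_train, y_train)

# Internal discrimination is estimated on the untouched test set; the
# 'no test/validation set' design would instead score the training data,
# which the Results show yields optimistically biased estimates.
auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train AUC {auc_train:.3f} vs held-out test AUC {auc_test:.3f}")
```

In this setup, the gap between the training AUC and the held-out test AUC is a rough analogue of the internal validation bias the paper estimates by comparing internal and external performance.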

Original language: English
Article number: e050146
Journal: BMJ Open
Volume: 11
Issue number: 12
Publication status: Published - 24 Dec 2021

Bibliographical note

Funding Information:
Funding This work was supported by the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No 806968. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA.

Competing interests:
JMR and PR are employees of Janssen Research and Development and shareholders of Johnson & Johnson. PRR reports grants from the Innovative Medicines Initiative and from Janssen Research and Development during the conduct of the study.

Publisher Copyright:
© 2021 BMJ Publishing Group. All rights reserved.
