Development and validation of colorectal cancer risk prediction tools: A comparison of models

Duco T. Mülder*, Rosita van den Puttelaar, Reinier G.S. Meester, James F. O'Mahony, Iris Lansdorp-Vogelaar

*Corresponding author for this work

Research output: Contribution to journal, Article, Academic, peer-review


Background: Identification of individuals at elevated risk can improve cancer screening programmes by permitting risk-adjusted screening intensities. Previous work introduced a prognostic model using sex, age and two preceding faecal haemoglobin concentrations to predict the risk of colorectal cancer (CRC) in the next screening round. Using data from three screening rounds, this model attained an area under the receiver-operating-characteristic curve (AUC) of 0.78 for predicting advanced neoplasia (AN). We validated this existing logistic regression (LR) model and attempted to improve it by applying a more flexible machine-learning approach.

Methods: We trained the existing LR model and a newly developed random forest (RF) model on updated data from 219,257 third-round participants of the Dutch CRC screening programme up to 2018. For both models, we performed two separate out-of-sample validations using 1,137,599 third-round participants after 2018 and 192,793 fourth-round participants from 2020 onwards. We evaluated the AUC and the relative risks of the predicted high-risk groups for the outcomes AN and CRC.

Results: For third-round participants after 2018, the AUC for predicting AN was 0.77 (95% CI: 0.76–0.77) using LR and 0.77 (95% CI: 0.77–0.77) using RF. For fourth-round participants, the AUCs were 0.73 (95% CI: 0.72–0.74) and 0.73 (95% CI: 0.72–0.74) for the LR and RF models, respectively. Among third-round participants, for both models, the 5% with the highest predicted risk had a risk of AN 7 times the population average, whereas the lowest 80% had a risk below the population average.

Conclusion: The LR model is a valid risk prediction method in stool-based screening programmes. Although predictive performance declined marginally, the LR model still effectively predicted risk in subsequent screening rounds. The RF model did not improve CRC risk prediction compared with the LR model, probably because of the limited number of available explanatory variables. The LR model remains the preferred prediction tool because of its interpretability.
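The two evaluation measures named in the abstract, the AUC and the relative risk of a predicted high-risk group, can be illustrated with a minimal sketch. This is not the authors' code: the function names `auc` and `relative_risk_top` are hypothetical, the toy scores and outcomes are invented for illustration and are not study data, and the relative risk is assumed here to mean the outcome rate in the top fraction divided by the overall outcome rate.

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied scores starting at i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def relative_risk_top(scores, labels, fraction):
    """Outcome rate among the highest-scoring fraction, relative to the overall rate."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate


# Invented toy data: predicted risks and observed outcomes (1 = outcome found).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print("AUC:", auc(toy_scores, toy_labels))
print("Relative risk, top 20%:", relative_risk_top(toy_scores, toy_labels, 0.2))
```

In a setting like the one described, `scores` would be the predicted probabilities from the LR or RF model on a validation cohort and `fraction` would be 0.05 for the highest-risk 5%.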

Original language: English
Article number: 105194
Journal: International Journal of Medical Informatics
Publication status: Published - Oct 2023

Bibliographical note

Publisher Copyright: © 2023 The Authors

