Abstract
Clinical prediction models are essential tools in modern healthcare, supporting diagnosis, prognosis, and clinical decision-making. Numerous methods, including regression and machine learning approaches, are employed to develop these models, yet selecting the best method remains challenging. Limited guidance exists regarding the comparative performance of these methods, especially under varying data conditions encountered in practice. This thesis aims to systematically evaluate and compare regression and machine learning methods for clinical prediction modelling, focusing on model performance across different validation frameworks and data scenarios.
The work comprised empirical and simulation-based studies. First, risk prediction models for thyroid cancer recurrence were developed using real clinical data. Multiple modelling approaches were implemented, including logistic regression (full model and backward-selected model), shrinkage methods (LASSO, ridge, and elastic net), and machine learning approaches (classification and regression trees, random forest, gradient boosting machine, extreme gradient boosting, neural network, and support vector machine). Model performance was assessed using repeated cross-validation, optimism-corrected bootstrapping, and temporal validation, with performance metrics covering discrimination, calibration, and overall prediction error. Two extensive simulation studies were then conducted. The first used multiple data-generating mechanisms derived from a real dataset, while the second applied an independent simulation framework that systematically varied key factors such as sample size, event rate, number of candidate predictors, correlation among variables, and proportion of noise predictors. These studies enabled a comprehensive evaluation of model performance under diverse and clinically relevant scenarios.
Overall, shrinkage regression methods achieved higher discrimination than the other approaches across most scenarios and demonstrated more stable performance, although calibration remained less reliable in settings with small samples and low event rates. Machine learning approaches, particularly random forests and support vector machines, showed greater sensitivity to data characteristics and often required adequate sample sizes and event rates to perform reliably. They were also more sensitive to hyperparameter tuning and validation strategy and were more prone to overfitting, random forests in particular. Internal validation methods (repeated cross-validation and optimism-corrected bootstrapping) provided more stable performance estimates than temporal validation. In some cases, temporal validation produced unstable and misleading performance estimates due to temporal shifts in patient characteristics, reduced sample size from data-splitting, and the exclusion of recent data from model development. Across all methods, relying solely on discrimination risked misleading conclusions, emphasising the need for careful assessment of calibration and multiple clinically relevant performance metrics.
Simulation findings highlight the importance of aligning modelling strategies with data characteristics and selecting evaluation methods that comprehensively assess model performance. The results provide practical guidance for selecting appropriate modelling approaches in clinical prediction, particularly when data are limited or event rates are low, to improve the utility and generalisability of clinical prediction models.