Applying machine learning and deep learning to analyse clinical cancer epigenetic data
Doctoral Thesis

Doctor of Philosophy - PhD, University of Otago
University of Otago
06/05/2026
DOI:
https://doi.org/10.82348/our-archive.00134
Handle:
https://hdl.handle.net/10523/50713

Abstract

Keywords: cancer epigenetics, DNA methylation, cell-free DNA (cfDNA), non-invasive cancer diagnosis, deep learning, Transformer models, DNAmBERT, k-mer tokenization, methylation haplotypes, multi-cancer detection, transfer learning, biomarker discovery, computational oncology, high-throughput sequencing.

Epigenetic alterations, particularly DNA methylation changes, play a critical role in cancer development and progression, offering valuable biomarkers for diagnosis, prognosis, and therapeutic strategies. With advances in high-throughput sequencing technologies, the analysis of DNA methylation data has become increasingly crucial. However, the complexity, heterogeneity, and scale of clinical epigenetic datasets present significant analytical challenges.

This PhD thesis makes three primary contributions to the field of computational cancer epigenetics, focusing on the analysis of clinical cancer epigenetic data—specifically DNA methylation patterns derived from circulating cell-free DNA (cfDNA)—for non-invasive cancer diagnostics.

First, we develop and evaluate DNAmBERT, a novel Transformer-based deep learning model designed for non-invasive cancer diagnosis using cfDNA methylation data. DNAmBERT integrates k-mer tokenized DNA sequences with read-level methylation haplotype information through a structured concatenated input representation. This joint modelling of sequence and epigenetic context enables robust binary classification of tumour-derived versus normal cfDNA across diverse sequencing platforms and cancer types. The model consistently achieves high performance in terms of accuracy, sensitivity, specificity, and AUC.
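
The idea of concatenating k-mer tokens with a read-level methylation haplotype can be illustrated with a minimal Python sketch. The abstract does not specify the exact input layout, so everything here is an assumption made for illustration: the function names (`kmer_tokens`, `methylation_haplotype`, `build_input`), the BERT-style `[CLS]` and `[SEP]` markers, the `M`/`U` symbols for methylated and unmethylated CpG sites, and the choice of k = 3. It should not be read as the thesis's actual implementation.

```python
from typing import List, Set


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a DNA read into overlapping k-mers with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def methylation_haplotype(seq: str, methylated_positions: Set[int]) -> str:
    """Encode every CpG site in the read, in read order.

    'M' marks a CpG whose cytosine index is in methylated_positions,
    'U' marks an unmethylated CpG. The M/U alphabet is hypothetical.
    """
    seq = seq.upper()
    states = []
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "CG":
            states.append("M" if i in methylated_positions else "U")
    return "".join(states)


def build_input(seq: str, methylated_positions: Set[int], k: int = 3) -> List[str]:
    """Concatenate sequence tokens and haplotype tokens into one input.

    Assumed layout: [CLS] k-mers [SEP] haplotype symbols [SEP].
    """
    tokens = ["[CLS]"] + kmer_tokens(seq, k) + ["[SEP]"]
    tokens += list(methylation_haplotype(seq, methylated_positions)) + ["[SEP]"]
    return tokens


if __name__ == "__main__":
    # Read with CpG sites at indices 1 and 4; only the first is methylated.
    print(build_input("ACGTCG", {1}))
```

In this sketch the sequence segment and the haplotype segment share a single token list, so a Transformer encoder could attend across both; how the two segments are actually aligned or embedded in DNAmBERT is not stated in the abstract.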

Second, we extend DNAmBERT for multi-cancer detection. By training the model across several cancer types—including colorectal, lung, and liver—we demonstrate its ability to accurately distinguish among multiple malignancies and detect early-stage cancers, underscoring its utility in pan-cancer screening applications.

Third, we apply a transfer learning framework that enables DNAmBERT to generalize across cancer types. The model is initially pre-trained on specific cancer types and subsequently fine-tuned on further cancer types for both binary and multi-cancer classification. This approach leverages shared epigenetic signatures and adapts the pre-trained representations to new contexts, achieving strong performance under data-scarce conditions and demonstrating its versatility for diverse clinical applications.

To support these contributions, we conducted a comprehensive systematic literature review of deep learning (DL) applications in cancer epigenetics. This review highlights DL applications for classification, CpG methylation state prediction, biomarker discovery, and survival prediction using methylation and multi-omics data, and it informs the design and scope of DNAmBERT.

Overall, this thesis demonstrates that integrating advanced Transformer-based architectures with rich cfDNA-derived epigenetic features can significantly enhance the accuracy, interpretability, and scalability of non-invasive cancer diagnostic tools.

Thesis (PDF, 8.44 MB)
Embargoed Access, Embargo ends: 30/05/2027 2: Abstract Only
