Machine learning models for complex trait prediction

Complex traits are heritable traits that do not obey monogenic Mendelian
inheritance laws. These traits, such as height, cannot be explained by a
single genetic variant, but are instead influenced by many variants across
the genome. While understanding the relationship between genetics and
complex traits is essential to understanding many heritable diseases, such
traits often remain poorly understood. In many cases it is not even known
which single nucleotide variants (SNVs) are responsible for a complex trait.
While simple additive models can explain some of the statistical variance of
these traits, they often fall short of a full explanation. This has led to
theories that either rare variants with large effects or interactions
between variants are responsible for many complex traits. The gap between
the known heritability of a trait and the heritability explained by current
models is known as the ‘missing heritability’. Assuming interactions, rather
than rare variants, are responsible for missing heritability, it is possible
in theory to identify them with appropriate models of the relationship
between genetics and traits.

Most current models of this relationship for disease-related traits do not
account for any form of interaction. To overcome this limitation, we develop
and evaluate several machine learning models for the prediction of complex
traits. Among these are several classic general-purpose machine learning
models and three new approaches we developed with genomics in mind. The new
approaches are regression with pairwise interaction terms, transformer
encoders designed to be trained on SNV data, and a method using
differentiable fuzzy logic based on models of gene regulation.
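
As an illustration of the first of these ideas, the following minimal sketch fits a least-squares regression with one product term per pair of SNVs. It is not the model developed in this work: the function names, the 0/1/2 genotype coding, and the toy data are assumptions made for the example.

```python
import itertools
import numpy as np


def add_pairwise_interactions(genotypes):
    """Append one product column per unordered pair of SNV columns."""
    n_snvs = genotypes.shape[1]
    pairs = list(itertools.combinations(range(n_snvs), 2))
    products = [genotypes[:, i] * genotypes[:, j] for i, j in pairs]
    return np.column_stack([genotypes] + products), pairs


def fit_least_squares(features, trait):
    """Ordinary least squares with an intercept; coef[0] is the intercept."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, trait, rcond=None)
    return coef


# Toy data (hypothetical): 4 SNVs coded as 0/1/2 minor-allele counts.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(500, 4)).astype(float)

# One additive effect (SNV 0) and one interaction (SNV 1 x SNV 2).
trait = (
    0.5 * genotypes[:, 0]
    + 1.0 * genotypes[:, 1] * genotypes[:, 2]
    + rng.normal(0.0, 0.1, size=500)
)

features, pairs = add_pairwise_interactions(genotypes)
coef = fit_least_squares(features, trait)

# Coefficients after the intercept and the 4 additive terms belong to pairs.
interaction_coefs = dict(zip(pairs, coef[1 + genotypes.shape[1]:]))
strongest_pair = max(interaction_coefs, key=lambda p: abs(interaction_coefs[p]))
```

The number of product terms grows quadratically with the number of SNVs, so a sketch like this is only practical for small variant sets unless the pairs are pre-filtered or the fit is regularised.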

Using hyperuricaemia, gout, and antibiotic resistance as complex traits, we
test these methods on both simulated and real datasets. As well as
evaluating the accuracy of each method, we simulate several datasets of
varying complexity to evaluate each method’s scalability. In particular, we
focus on each method’s performance with respect to the amount of training
data available, the underlying function’s complexity, and the level of noise.
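
The three factors named above can be varied independently in a simulation. The sketch below is a hypothetical generator written for illustration, not the simulation used in this work: `simulate_dataset`, its parameters, and the grid values are all assumptions.

```python
import numpy as np


def simulate_dataset(n_samples, n_snvs, n_interactions, noise_sd, seed=0):
    """Simulate 0/1/2 genotypes and a trait with additive and pairwise effects.

    n_samples controls the amount of training data, n_interactions the
    complexity of the underlying function, and noise_sd the level of noise.
    """
    rng = np.random.default_rng(seed)
    genotypes = rng.integers(0, 3, size=(n_samples, n_snvs)).astype(float)

    # Additive part: one effect size per SNV.
    additive_effects = rng.normal(0.0, 1.0, size=n_snvs)
    trait = genotypes @ additive_effects

    # Epistatic part: each interaction multiplies two distinct SNVs.
    interacting_pairs = []
    for _ in range(n_interactions):
        i, j = rng.choice(n_snvs, size=2, replace=False)
        interacting_pairs.append((int(i), int(j)))
        trait = trait + rng.normal(0.0, 1.0) * genotypes[:, i] * genotypes[:, j]

    # Gaussian noise on top of the genetic signal.
    trait = trait + rng.normal(0.0, noise_sd, size=n_samples)
    return genotypes, trait, interacting_pairs


# A small grid over sample size, number of interactions, and noise level.
grid = [(n, k, s) for n in (100, 1000) for k in (0, 5) for s in (0.1, 1.0)]
datasets = {cfg: simulate_dataset(cfg[0], 20, cfg[1], cfg[2]) for cfg in grid}
```

Each entry of `datasets` can then be split into training and test sets and passed to every method under comparison, so that accuracy is reported per combination of sample size, complexity, and noise.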

Our regression model is able to correctly identify genes involved in
simulated epistasis. Across both the simulations and an analysis of gout in
the UK Biobank, we determine that linear models have excellent performance
when very few training samples are available. Transformer-based neural
networks perform equally well in most cases, and both outperform the other
classic methods.

Abstract