Abstract
Complex traits are heritable traits that do not obey monogenic Mendelian
inheritance laws. These traits, such as height, cannot be
explained by a single genetic variant, but instead are influenced by many
variants across the genome. While understanding the relationship between
genetics and complex traits is essential to understanding many heritable
diseases, such traits often remain poorly understood. In many cases it is
not even known which single nucleotide variants (SNVs) are responsible for
a complex trait. While simple additive models can explain some of the
statistical variance of these traits, they often fall short of a full explanation.
This has led to theories that either rare variants with large effects
or interactions between variants are responsible for many complex traits.
The gap between the known heritability of a trait and the heritability explained
by current models is known as the ‘missing heritability’. Assuming interactions,
rather than rare variants, are responsible for missing heritability,
it is possible in theory to identify them with appropriate models of the
relationship between genetics and traits.
Most current models of this relationship for disease-related traits do not
account for any form of interaction. To overcome this limitation, we develop
and evaluate several machine learning models for the prediction of
complex traits. Among these are several classic general-purpose machine
learning models, and three new approaches we developed with genomics
in mind. These include regression with pairwise interaction terms,
transformer encoders designed to be trained on SNV data, and a method using
differentiable fuzzy logic based on models of gene regulation.
Using hyperuricaemia, gout, and antibiotic resistance as complex traits, we test these
methods on both simulated and real datasets. In addition to evaluating the
accuracy of each method, we simulate several datasets of
varying complexity to evaluate each method’s scalability. In particular, we
focus on each method’s performance with respect to the amount of training
data available, the underlying function’s complexity, and the level of noise.
Our regression model is able to correctly identify genes involved in
simulated epistasis. Across both the simulations and an analysis of gout in the
UK Biobank, we find that linear models have excellent performance
when very few training samples are available. Transformer-based neural
networks perform equally well in most cases, and both outperform other
classic methods.