Abstract
The aim of this thesis is to investigate strategies for selecting the data used in Bayesian analysis of genotyping by sequencing (GBS) data. Each selection of data leads to a different posterior distribution on the model parameters. Methods for analysing the resulting posterior distributions will be discussed and compared, and the most applicable method will be applied to a set of simulated genetic markers.
Traditionally, GBS data sets are constructed so that each marker is a polymorphic (non-constant) site, for example a single nucleotide polymorphism (SNP). However, there is evidence that this may not be the optimal approach. It may in fact be better to include a certain proportion of sites that are not filtered for polymorphism and may therefore be constant.
To assess whether this is the case, we begin by analysing simplified problems with simpler distributions. These problems will be studied both analytically and using Monte Carlo samples. The purpose of this decision-making process is to determine the optimal proportion of each class of data point to include in the marker data set. The chosen method will then be applied to a simulated marker data set, and the results analysed to show that there appears to be an optimal mixture of data, which should be used in any future phylogenetic analysis.