Abstract
The study of microorganisms is a challenging area of research that has greatly benefited from the development of a wide range of sequencing technologies. The use of sequence data has enabled a number of significant advances in research into microorganisms; however, it has also led to a number of new challenges. While there has been a rapid increase in the total number of sequences in public databases, these sequences are not equally distributed across organisms. The bias towards specific organisms has led to challenges in using sequence databases to identify and classify organisms and genetic features that were not the primary focus of the data gathering.
This thesis addresses three sequence classification challenges using a combination of existing tools for sequence analysis and novel methods. Using these methods, a software tool (PredVirusHost) was created that can identify archaeal viruses in viral metagenomes, outperforming existing viral classifiers with an accuracy greater than 88%. The underlying mechanisms of the first phase of CRISPR-Cas systems were investigated, and primed adaptation was shown to occur throughout bacteria and archaea. Bacterial small noncoding RNAs were identified from RNA-Seq datasets, and through the use of a number of metrics, it was possible to show that more than 90% of RNA expression in noncoding regions is likely the result of spurious promoters.