The Twilight Zone of Nucleotide Homology
Graduate Thesis/Dissertation   Open access

Stephanie McGimpsey
Master of Science - MSc, University of Otago
University of Otago
2019
Handle:
https://hdl.handle.net/10523/9302

Abstract

Keywords: Twilight zone; Sequence alignment; Bioinformatics; Genomics; Nucleotide homology; Homology; Blastn; nhmmer; ggsearch; ssearch; Computational biology; Stephanie McGimpsey; Paul Gardner; Twilight zone of nucleotide homology
Homology search tools are important for inferring homology across the large number of genomes now sequenced. These tools use sequence similarity to assign a score to a pair of sequences, and homology is inferred from that score. The relationship between sequence similarity and homology can break down at certain levels of similarity. The twilight zone is the range of pairwise identity in which a known pair of homologs has a 50% chance or less of being inferred as homologous from the alignment score. The twilight zone for nucleotide homology has previously been calculated using databases that were small or biased. The aim of this research was therefore to calculate the twilight zone of nucleotide homology using a carefully designed database of homologous sequences. A database of core ncRNA and mRNA genes from a wide range of genus-representative bacteria was generated, from which sequence pairs were chosen. The database was used to locate the twilight zone of nucleotide homology for four alignment algorithms: BLASTn, ggsearch, nhmmer and ssearch. The effect of G+C content and sequence length on the location of the twilight zone was also examined. The twilight zone was shown to lie between 38% and 50% pairwise identity for all alignment algorithms tested. Both sequence length and G+C content shift the twilight zone for all four alignment algorithms. This research has shown that, between 38% and 50% pairwise identity, homology should not be inferred from the alignment score alone, as there is a greater chance of incorrectly inferring homology than of correctly inferring it. Furthermore, the analyses have shown that a parametric approach to database design is required to further balance the database used for the twilight zone calculation.
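The definition used in the abstract (a range of pairwise identity where known homologs are detected 50% of the time or less) can be illustrated with a minimal Python sketch. This is not the thesis's pipeline: the function names, the 5-point bin width, and the choice to compute identity over ungapped aligned columns are all assumptions made for illustration, and the toy data in the usage note is invented.

```python
from collections import defaultdict


def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: other denominators (e.g. full alignment length) are also in
    common use; the thesis's exact definition is not stated in the abstract.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a sequence, ignoring gap characters."""
    s = seq.upper().replace("-", "")
    return 100.0 * sum(c in "GC" for c in s) / len(s) if s else 0.0


def detection_rate_by_bin(pairs, bin_width=5):
    """Fraction of known homolog pairs detected, per identity bin.

    pairs: iterable of (percent_identity, detected) where detected is True
    when a search tool reported the pair as a significant hit.
    Returns {bin_start: fraction_detected}, ordered by bin_start.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for pid, detected in pairs:
        bin_start = int(pid // bin_width) * bin_width
        totals[bin_start] += 1
        hits[bin_start] += bool(detected)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


def twilight_bins(rates, cutoff=0.5):
    """Identity bins where the detection rate is at or below the cutoff."""
    return [b for b, rate in rates.items() if rate <= cutoff]
```

For example, with invented observations `[(42, False), (44, True), (43, False), (61, True), (63, True)]`, `detection_rate_by_bin` gives a detection rate of one third for the 40-45% bin and 1.0 for the 60-65% bin, so `twilight_bins` reports only the 40-45% bin. Repeating the calculation on subsets grouped by `gc_content` or sequence length is one way to see how those factors might shift the zone.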
PDF: McGimpseyStephanie2019MSc.pdf
