dc.contributor.advisor: Brown, Christopher Michael
dc.contributor.author: Pollock, Robert
dc.identifier.citation: Pollock, R. (2011). Messenger RNA Data Mining (Thesis, Doctor of Philosophy). University of Otago. Retrieved from
dc.description.abstract: Messenger RNA (mRNA) is the molecular intermediary carrying genetic information from the nucleus to the cytoplasm. All the information about cellular localisation, efficiency of translation and message stability must be found within the sequence of the mRNA molecule. Very little is known about the sequence elements that must direct many of these processes. De novo computational methods have been extremely useful in characterising the molecular determinants of function at the protein level. Similarly, much has been learned about the functional elements at or near transcription start sites (TSSs). Computational methods have not yet been as successful in finding elements within the messenger RNA sequence, although there is much evidence to suggest such elements exist. Conservation in genomic sequence corresponding to the mRNA 3' untranslated region (3'UTR) is greater than it is in the intergenic region. In addition, there are a large number of proteins in the eukaryotic cell that contain RNA binding domains. In this study, predicted 3'UTR regions for all genes from the budding yeast Saccharomyces cerevisiae were extracted from the downstream region of each gene. Yeast makes a powerful platform for this purpose because of the vast repository of functional data available for cross-checking results, and thereby validating methodology. In this work, the 3'UTR sequences of yeast were interrogated using computational methodology to find statistically significant patterns within the sequences. Such patterns (or sequence motifs) can be over-represented words (either contiguous words or regular expressions) or collections of locally aligned sequences (typically expressed as matrices of positional letter counts). Two methods were chosen here: the first, MEME, searches for statistically significant local alignments; the second, TEIRESIAS, enumerates words (allowing for wildcard letters) present in the sequence database, subject to certain constraints.
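To make the idea of enumerating words with wildcard letters concrete, the following is a minimal brute-force sketch in Python. It is not the TEIRESIAS algorithm and does not reproduce its constraints; the function name `enumerate_patterns`, its parameters (`width`, `max_wildcards`, `min_support`) and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

WILDCARD = "."  # assumed notation for "any letter"


def enumerate_patterns(sequences, width=4, max_wildcards=1, min_support=2):
    """Map each fixed-width pattern to the set of sequences containing it.

    Illustrative brute force only: every window of `width` letters is
    expanded into variants with up to `max_wildcards` wildcard positions.
    """
    support = defaultdict(set)
    # Wildcards are only placed at interior positions, so every pattern
    # begins and ends with a literal letter.
    interior = range(1, width - 1)
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - width + 1):
            word = seq[start:start + width]
            for n_wild in range(max_wildcards + 1):
                for positions in combinations(interior, n_wild):
                    chars = list(word)
                    for p in positions:
                        chars[p] = WILDCARD
                    support["".join(chars)].add(idx)
    # Keep only patterns present in at least `min_support` sequences.
    return {p: s for p, s in support.items() if len(s) >= min_support}


# Hypothetical toy data, not taken from the thesis.
toy_utrs = ["ATGTAAT", "CCTGTAA", "TGCAAGG"]
patterns = enumerate_patterns(toy_utrs, width=4, max_wildcards=1, min_support=3)
```

On this toy input no contiguous four-letter word occurs in all three sequences, but two one-wildcard patterns do, which shows why allowing wildcards can expose signals that exact word counting misses.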
The goal of this project was to find significant words using as naïve and unsupervised an approach as is possible, utilising little or no functional data upstream of the analysis. In doing so, it was possible to illustrate the power of computational methodology to decipher the molecular details of cellular control, with little or no additional information. This approach represents a significant challenge for computational biology tools. The sequence dataset was searched for elements under two scenarios: the first completely unsupervised (using all of the sequence data), and the second partly supervised. The partly supervised approach utilised mRNA half-life values, obtained from a genome-wide study of mRNA stability, to bin the sequences before conducting the search. This study examines the performance of both MEME and TEIRESIAS when applied to these difficult problems, and considers ways of improving the sensitivity of the analysis (to find the maximum number of motifs). The effect of repetitive and duplicated sequences, and of the search parameters, on the output of TEIRESIAS is examined. In addition, a methodology is developed for converting text-pattern motifs (from TEIRESIAS) into alignments that can be compared with MEME results. The partially supervised approach was conducted using MEME. MEME was not able to find many motifs when executed with default parameters, typically finding only low-complexity elements, duplicated sequences, and alignments of only 2-3 sequences. After a concerted effort was made to optimise MEME, which is documented here, some 18,000 statistically significant alignments were found. After filtering and testing of the motifs, 3311 motifs comprising 1324 motif clusters were found to be associated with stability. Using the fully unsupervised approach, MEME was also able to find a large number of motifs, though (notably) slightly fewer than were found using the partially supervised approach.
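One plausible way to turn a text pattern into an alignment is to collect every site in the sequences that matches the pattern and then count letters column by column. The Python sketch below illustrates that idea only; it is not the methodology developed in the thesis, and the function names, the use of "." as the wildcard, and the toy sequences are assumptions.

```python
import re
from collections import Counter


def pattern_to_alignment(pattern, sequences):
    """Collect all sites matching `pattern`, including overlapping ones.

    The pattern is interpreted as a regular expression, so "." acts as
    the wildcard letter (an assumption of this sketch).
    """
    # A lookahead with a capture group reports overlapping occurrences.
    regex = re.compile(f"(?=({pattern}))")
    return [m.group(1) for seq in sequences for m in regex.finditer(seq)]


def count_matrix(alignment, alphabet="ACGT"):
    """Return one row of letter counts per alignment column."""
    matrix = []
    for column in zip(*alignment):
        counts = Counter(column)
        matrix.append([counts.get(letter, 0) for letter in alphabet])
    return matrix


# Hypothetical toy data, not taken from the thesis.
toy_utrs = ["ATGTAAT", "CCTGTAA", "TGCAAGG"]
sites = pattern_to_alignment("TG.A", toy_utrs)
matrix = count_matrix(sites)
```

Here the wildcard column of the resulting matrix records which letters actually occupy that position (two T and one C), which is the kind of positional information an alignment-based motif carries and a bare text pattern does not.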
TEIRESIAS was also able to find many statistically significant patterns using the unsupervised method. The large number of motifs found during this analysis were tested for likely biological validity in several ways: by testing whether or not motif-containing sequences were biased towards stable or unstable sequences; by testing whether or not motif-containing sequences were enriched in functional and localisation classes of genes (Gene Ontology); and by testing for motif enrichment in the target mRNAs associated with known RNA binding proteins. These approaches confirmed that the motifs found were likely to have biological significance, and led to their consideration as candidate components of functional elements.
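Enrichment questions of this kind (for example, whether motif-containing sequences fall in one gene class more often than expected by chance) are commonly framed as a hypergeometric tail probability. The abstract does not state which statistical test was used, so the Python sketch below is an assumed illustration rather than the thesis's procedure; the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, class_size, drawn, overlap):
    """P(X >= overlap) for sampling `drawn` items without replacement.

    population: total number of sequences
    class_size: sequences belonging to the class of interest
    drawn:      sequences containing the motif
    overlap:    motif-containing sequences that are also in the class
    """
    total = comb(population, drawn)
    upper = min(class_size, drawn)
    favourable = sum(
        # math.comb returns 0 when the second argument exceeds the first,
        # so impossible splits contribute nothing to the sum.
        comb(class_size, k) * comb(population - class_size, drawn - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Hypothetical numbers: 10 sequences, 4 in the class, 3 carry the motif,
# and 2 of those 3 are in the class.
p_value = hypergeom_upper_tail(population=10, class_size=4, drawn=3, overlap=2)
```

A small probability suggests the overlap is larger than chance would produce; with thousands of motifs tested, some correction for multiple testing would also be needed before drawing conclusions.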
dc.publisher: University of Otago
dc.rights: All items in OUR Archive are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.
dc.subject: Data Mining
dc.subject: post-transcriptional control
dc.title: Messenger RNA Data Mining
dc.type: Thesis (Doctor of Philosophy, University of Otago)
otago.openaccess: Abstract Only

Files in this item


There are no files associated with this item.

This item is not available in full-text via OUR Archive.

If you would like to read this item, please apply for an inter-library loan from the University of Otago via your local library.

If you are the author of this item, please contact us if you wish to discuss making the full text publicly available.
