Bioinformatic analysis of RNA from diverse species

Ambarish Biswas

Although most living cells contain a full complement of genes, gene expression differs between cells. In addition to being translated, RNA molecules play key roles in other important cellular functions. Gene expression in eukaryotic cells is controlled post-transcriptionally by combinations of regulatory elements in mRNAs. Such elements include structural and functional sequence motifs, and binding sites for miRNAs and proteins. These are usually located in the untranslated regions (UTRs) of mRNA sequences, particularly the 3’UTRs. Characterisation of these elements can provide detailed insight into the complex mechanisms that regulate gene expression. However, identifying and visualising these cis-regulatory elements, including the true overlaps between them, is a challenging task. The Scan for Motifs (SFM) web application simplifies the identification of a wide range of regulatory elements in alignments of vertebrate 3’UTRs as well as in any user-supplied sequence. SFM identifies both RNA Binding Protein (RBP) sites and miRNA targets. The regulatory elements can be filtered by False Discovery Rate (FDR) estimates. The output provides an interactive graphical representation that highlights potential regulatory elements and the overlaps between them, complemented with simple statistics and cross-references to their sources.
In eukaryotes, RNA interference (RNAi) is now a well-studied and established method for inhibiting gene expression. An analogous system, the CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)/Cas (CRISPR-associated) system, is present in bacteria and archaea, where it provides adaptive immunity against phage infection. CRISPR RNAs (crRNAs) are small noncoding RNAs that play an important role in this RNA-guided defence system in prokaryotes. Specific prediction methods found crRNA-encoding loci in nearly half of sequenced bacterial species and ~90% of sequenced archaeal species.
The bacterial and archaeal CRISPR/Cas adaptive immune system targets specific protospacer nucleotide sequences in invading organisms. This requires base pairing between the processed CRISPR RNA and the target protospacer. For type I and II CRISPR/Cas systems, protospacer adjacent motifs (PAMs) are essential for target recognition, and for type III systems, mismatches in the flanking sequences are important in the antiviral response. In this study, the properties of each class of CRISPR were examined, and this information was used to build a tool (CRISPRTarget) that predicts the most likely targets of CRISPR RNAs. The tool can be used to discover targets in newly sequenced genomic or metagenomic data. To test its utility, the features and targets of the well-characterised Streptococcus thermophilus and Sulfolobus solfataricus type II and III CRISPR/Cas systems were discovered. Finally, in Pectobacterium species, new CRISPR targets were identified, establishing a model of temperate phage exposure and subsequent inhibition by the type I CRISPR/Cas systems.
CRISPR arrays consist of repeat sequences separated by specific spacer sequences. Generally one strand is transcribed, producing long pre-crRNAs, which are processed into short crRNAs that base pair with invading nucleic acids to facilitate their destruction. No current software for the discovery of CRISPR loci predicts the direction of crRNA transcription. A new algorithm (CRISPRDirection) was developed that accurately predicts the strand of the resulting crRNAs. The method accepts FASTA or multi-FASTA sequences, repeat sequences, or a complete annotation file as input. CRISPRDirection uses parameters calculated from the CRISPR repeat predictions and flanking sequences, which are combined by weighted voting. The prediction may optionally use prior coding sequence annotation. CRISPRDirection correctly predicted the orientation of 94% of a reference set of arrays.
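The weighted-voting idea behind the strand prediction can be illustrated with a minimal sketch. It is not the CRISPRDirection implementation: the feature names, vote directions, and weights below are hypothetical placeholders, and the only thing taken from the description above is that several independent indicators each cast a vote that is combined using weights.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    name: str        # hypothetical feature name, for illustration only
    direction: int   # +1 = forward strand, -1 = reverse strand, 0 = abstain
    weight: float    # hypothetical weight, not a published value


def predict_direction(votes):
    """Combine signed, weighted votes into a strand call and a score."""
    score = sum(v.direction * v.weight for v in votes)
    if score > 0:
        return "forward", score
    if score < 0:
        return "reverse", score
    return "undetermined", score


# Illustrative votes; every name and number here is an assumption.
votes = [
    Vote("repeat_motif", +1, 0.5),
    Vote("flank_composition", +1, 0.25),
    Vote("repeat_degeneracy", -1, 0.5),
    Vote("cds_annotation", 0, 1.0),  # optional evidence, abstaining here
]

label, score = predict_direction(votes)
print(label, score)
```

An abstaining vote contributes nothing to the score, which is one simple way to model optional evidence such as prior coding sequence annotation.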
Existing CRISPR detection algorithms do not utilise recently identified features of CRISPR structure, expression, or direction of RNA transcription. A series of routines was developed and implemented as CRISPRDetect to detect and refine CRISPR arrays. The algorithm's default parameters are optimised, but users can tune them. CRISPRDetect discovers putative arrays, extends each array by detecting additional repeats, and refines its internal structure in an array-specific manner. It also predicts the direction of transcription by calculating and using parameters relating to the structure and evolution of the arrays. CRISPRDetect enables more accurate detection of arrays and is suitable for inclusion in genome annotation pipelines. It is available as an interactive web server and as a command-line executable. Additionally, an interactive database named CRISPRBank was developed, which contains CRISPR-specific information from all published bacterial and archaeal genomes. All these tools and associated files can be accessed at the bioanalysis server, which to date offers the most comprehensive and diverse set of tools to aid CRISPR analysis.
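The repeat-spacer structure that array detection relies on can be shown with a toy example. This is a simplified sketch, not the CRISPRDetect algorithm: it finds only exact repeat copies, the function names and default parameters are hypothetical, and the sequences are made up and far shorter than real repeats and spacers.

```python
def find_arrays(seq, repeat_len=8, min_spacer=4, max_spacer=12, min_repeats=3):
    """Find exact repeats whose copies are separated by spacer-sized gaps.

    Returns a list of (repeat, [start positions]) tuples.
    """
    arrays = []
    i = 0
    while i + repeat_len <= len(seq):
        repeat = seq[i:i + repeat_len]
        starts = [i]
        pos = i
        while True:
            # The next copy must start after a gap of min_spacer..max_spacer.
            lo = pos + repeat_len + min_spacer
            hi = pos + repeat_len + max_spacer
            nxt = seq.find(repeat, lo, hi + repeat_len)
            if nxt == -1:
                break
            starts.append(nxt)
            pos = nxt
        if len(starts) >= min_repeats:
            arrays.append((repeat, starts))
            i = starts[-1] + repeat_len  # continue after the detected array
        else:
            i += 1
    return arrays


def extract_spacers(seq, starts, repeat_len):
    """Return the sequences lying between consecutive repeat copies."""
    return [seq[a + repeat_len:b] for a, b in zip(starts, starts[1:])]


# Made-up toy sequence: flank + three repeat copies separated by two spacers.
repeat = "GTTTTAGA"
toy = "AT" + repeat + "ACGTAC" + repeat + "TGCATG" + repeat + "CA"

arrays = find_arrays(toy)
for rep, starts in arrays:
    print(rep, starts, extract_spacers(toy, starts, len(rep)))
```

Real arrays contain mutated repeats and variable spacer lengths, so a practical detector has to tolerate mismatches and then refine the repeat boundaries, which this sketch deliberately leaves out.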
