RSmatch: aligning RNA secondary structures and finding RNA motifs


Introduction

Many ribonucleic acids (RNAs), including non-coding RNAs and cis elements in mRNAs, play important roles in gene regulation. Some of their functions are attributable to the structures they adopt; such functional structures are also called RNA motifs. Like sequence elements, RNA structure elements can be identified by comparing RNAs containing similar structures. The RSmatch package is designed to provide a lightweight approach to comparing RNA structures, thereby uncovering functional structure elements. Compared with other tools for RNA structure comparison, RSmatch is fast: it requires time quadratic in the sizes of the two given structures.

RSmatch uses two scoring schemes: a position-independent scheme and a position-dependent scheme. The position-independent scheme entails two scoring matrices, one for single-stranded regions and the other for double-stranded regions. This scoring scheme is used in pair-wise comparisons and database searches. The position-dependent scheme, also known as a profile, scores individual structure positions and is used by the multiple structure alignment and iterative database search functions. RSmatch provides both global and local alignment options, although the latter is more useful in most cases. In addition, RSmatch can take pattern-based structures as input. Please check the following publication for details:

Liu, J., Wang, J.T., Hu, J., and Tian, B. A method for aligning RNA secondary structures and its application to RNA motif detection. BMC Bioinformatics 2005, 6:89.

In the current version (1.2), RSmatch provides the following functions: (1) regular database search, (2) multiple structure alignment, (3) iterative database search, (4) pair-wise structure alignment, (5) slide folding.

  1. For a regular database search, the package finds RNA structures in a database that locally or globally match a given query structure. This function can also be used to detect motif occurrences in an RNA structure database when the query structure is a known motif with a defined pattern.
  2. For multiple structure alignment, RSmatch constructs a multiple alignment for a given set of RNA structures by progressively expanding the alignment one structure at a time. This function is useful when a small set of RNAs is functionally related by a shared motif.
  3. For iterative database search, RSmatch repeatedly conducts database searches using a position-specific scoring matrix and updates the matrix using the latest result. This function can be much more sensitive than the regular database search, but at the cost of computing time.
  4. For pair-wise structure alignment, RSmatch can take sequences, which are subsequently folded by the Vienna RNA package and compared by RSmatch functions. RSmatch can also use a sliding window method to fold different regions of the input RNA to enhance sensitivity. In addition, RNA structures at both minimum free energy (MFE) and sub-optimal energies can be used for alignment.
  5. For slide folding, RSmatch takes RNA sequences as input, folds them with a sliding window according to the user's settings, and outputs the folded structures.
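
The sliding window idea behind slide folding can be sketched in a few lines of Python. This is only an illustration, not RSmatch's own code: the function name `sliding_windows` is hypothetical, the defaults mirror the -W (100 nt) and -S (0.5) options listed under Options, and anchoring a final window to the end of the sequence is an assumption made here so that no bases are dropped.

```python
def sliding_windows(seq, window=100, step_ratio=0.5):
    """Yield (start, end, subsequence) tuples; coordinates are 0-based, end-exclusive.

    Illustrative sketch only. `window` and `step_ratio` play the roles of the
    -W and -S options; the handling of the last window is an assumption.
    """
    if window <= 0 or not 0 < step_ratio <= 1:
        raise ValueError("window must be positive and step_ratio in (0, 1]")
    step = max(1, int(window * step_ratio))
    if len(seq) <= window:
        # A short sequence is covered by a single window.
        yield 0, len(seq), seq
        return
    start = 0
    while start + window < len(seq):
        yield start, start + window, seq[start:start + window]
        start += step
    # Assumption: anchor one last full-size window to the end of the sequence.
    yield len(seq) - window, len(seq), seq[-window:]


if __name__ == "__main__":
    for start, end, _ in sliding_windows("A" * 250):
        print(start, end)
```

Each window would then be folded separately (for example by the Vienna RNA package) before the resulting structures are compared.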

As we are continuing to polish this software, your feedback will be highly appreciated. Please contact Mugdha Khaladkar, Dr. Bin Tian, or Dr. Jason T. L. Wang with comments, suggestions, or queries.


Download & Installation

Version 1.2 of RSmatch can be downloaded from here. The RSmatch package is implemented in Java and Perl and runs under a UNIX/Linux operating system. It needs a Java environment to run. Please make sure your Java version is no older than 1.4; otherwise, please download a newer version of Java from java.sun.com. If the input data are RNA sequences (which must be in the FASTA format), you also need to download and install the Vienna RNA package.

If the input data are RNA structures, follow these instructions to install and run RSmatch1.2. 

[A]   Install RSmatch

  1. Download RSmatch1.2 (right click the mouse and choose "save target as" to download).
  2. Extract the tar file to your installation directory, e.g. /home/RNA, by typing tar xvzf RSmatch1.2.tar
  3. A directory named "release1.2" under /home/RNA will appear. Switch to it by typing cd release1.2
  4. Type RSmatch1.2 to run the program.

If the input data are RNA sequences in the FASTA format, follow these instructions to install RSmatch1.2 and Vienna RNA package v1.4.

[B]   Install Vienna RNA v1.4 & RSmatch1.2

  1. Download Vienna RNA package v1.4 and put it under the /home/RNA directory.
  2. Unpack the Vienna RNA package by typing gunzip < ViennaRNA-1.4.tar.gz | tar xvf -
  3. A directory named "ViennaRNA-1.4" under /home/RNA will appear. Switch to it by typing cd ViennaRNA-1.4
  4. Install the Vienna software by typing make all ; make install
  5. Set up the environment variable "VIENNA_HOME". If your command shell is bash, add export VIENNA_HOME=/home/RNA/ViennaRNA-1.4 to your .bashrc file (there must be no spaces around the equals sign). If you use csh, add setenv VIENNA_HOME /home/RNA/ViennaRNA-1.4 to your .cshrc file (setenv takes no equals sign). You need to log out and log in again for the change to take effect.
  6. Install and run RSmatch1.2 by following the instructions in [A] above. RSmatch1.2 will automatically invoke Vienna RNA v1.4 to fold the input sequences into structures and then align the structures.

Usage instructions

[A]   Input:

There are two types of input data. The first type is the nested parenthesized notation representing an RNA secondary structure. Each structure takes three lines: a header line, a primary sequence line, and a structure notation line. A sample structure looks like this:

>NM_003234:3394-3493    Homo sapiens transferrin receptor (p90, CD71) (TFRC), mRNA
GCTTTCTGTCCTTTTGGCACTGAGATATTTATTGTTTATTTATCAGTGACAGAGTTCACTATAAATGGTGTTTTTTTAATAGAATATAATTATCGGAAGC
((((((.((((....)).))...((((.........(((((((.(((((......))))))))))))(((((((......)))))))...))))))))))
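
As a rough illustration of how such a three-line record can be checked before it is handed to RSmatch, the following Python sketch reads one record, verifies that the sequence and structure lines have equal length and that the parentheses are balanced, and lists the base pairs. The function name `parse_structure_record` is hypothetical, and the sketch assumes that only "(", ")" and "." occur in the structure line, as in the sample above.

```python
def parse_structure_record(lines):
    """Parse one three-line record: header, sequence, structure notation.

    Returns (name, sequence, pairs), where pairs is a sorted list of
    0-based (i, j) positions that are base-paired. Illustrative sketch only.
    """
    header, sequence, structure = (line.strip() for line in lines)
    if not header.startswith(">"):
        raise ValueError("header line must start with '>'")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure lengths differ")
    fields = header[1:].split(None, 1)
    name = fields[0] if fields else ""
    pairs, stack = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)          # remember the opening position
        elif ch == ")":
            if not stack:
                raise ValueError("unmatched ')' at position %d" % i)
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unmatched '(' at position %d" % stack[-1])
    return name, sequence, sorted(pairs)


if __name__ == "__main__":
    record = [">toy hairpin", "GGGAAACCC", "(((...)))"]
    print(parse_structure_record(record))
```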

The second type is the FASTA format for RNA sequences. For sequence data, RSmatch1.2 will automatically invoke Vienna RNA v1.4 to fold the sequences into structures and then align the structures. A sample sequence in the FASTA format looks like this:

>NM_003234:3394-3493    Homo sapiens transferrin receptor (p90, CD71) (TFRC), mRNA
GCTTTCTGTCCTTTTGGCACTGAGATATTTATTGTTTATTTATCAGTGACAGAGTTCACTATAAATGGTG
TTTTTTTAATAGAATATAATTATCGGAAGC
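
Because a FASTA sequence may be wrapped over several lines, as in the sample above, it can help to join the lines of each record before inspecting the data. The Python sketch below shows one simple way to do this; the function name `read_fasta` is hypothetical and the code is not part of RSmatch.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) tuples from FASTA-formatted lines.

    Wrapped sequence lines are concatenated; blank lines are ignored.
    Illustrative sketch only.
    """
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    demo = [">seq1 demo", "ACGU", "GGCC", "", ">seq2", "UUUU"]
    for header, sequence in read_fasta(demo):
        print(header, len(sequence))
```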

[B]   Output:

The output of RSmatch1.2 gives detailed alignment information. The Stockholm format is adopted to display the output of multiple structure alignment.
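
For readers who want to post-process a multiple structure alignment, the Python sketch below reads a minimal Stockholm file into a dictionary of aligned sequences plus any per-column (#=GC) annotation lines. It is an assumption-laden illustration: the function name `read_stockholm` is hypothetical, and which annotation lines RSmatch actually writes (for example a consensus structure line) is not specified here, so the SS_cons tag in the demo is only an example.

```python
def read_stockholm(lines):
    """Parse a minimal Stockholm alignment.

    Returns (sequences, column_annotations), both dictionaries mapping a
    name or tag to its aligned text. Interleaved blocks are concatenated.
    Illustrative sketch only.
    """
    sequences, annotations = {}, {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("# STOCKHOLM"):
            continue
        if line == "//":                      # end of the alignment
            break
        if line.startswith("#=GC"):
            parts = line.split(None, 2)
            if len(parts) == 3:
                _, tag, text = parts
                annotations[tag] = annotations.get(tag, "") + text.strip()
        elif line.startswith("#"):
            continue                          # other markup is skipped
        else:
            parts = line.split(None, 1)
            if len(parts) == 2:
                name, text = parts
                sequences[name] = sequences.get(name, "") + text.strip()
    return sequences, annotations


if __name__ == "__main__":
    demo = [
        "# STOCKHOLM 1.0",
        "seqA  GGGAAACCC",
        "seqB  GGCAAAGCC",
        "#=GC SS_cons  (((...)))",
        "//",
    ]
    print(read_stockholm(demo))
```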

[C]   Options:

You can find the general syntax of the command by typing RSmatch1.2.

The general syntax is as follows:

RSmatch1.2 [options]
General options:
  -p [dsearch | isearch | mrsa | prsa | slide]
     choose a program:
       dsearch        simple database search;
       isearch        iterative database search;
       mrsa           multiple RNA structure alignment;
       prsa           pair-wise RNA structure alignment;
       slide          slide folding RNA sequences;
  -D <database>       FASTA-formatted sequence database.
  -d <database>       secondary structure database. 
  -g <penalty>        gap penalty.
  -o <output>         output file; default is 'result.out'.
  -r <range>          range of folding free energy (kcal/mol) used to select alternative RNA structures;
                      default is 0.
  -S <ratio>          sliding step length, expressed as a ratio of <W_length>; default is 0.5.
  -W <W_length>       sliding window size; default is 100 nt.
  -z F                turn off slide folding.
Options for 'dsearch' and 'isearch':
  -n <topN>           output top 'topN' hits.
  -Q <query>          query sequence in FASTA format.
  -q <query>          query structure.
Options for 'dsearch' and 'prsa':
  -s <score_matrix>   file containing position independent score matrices; default is 'scoreMat.structure'.
Options for 'dsearch':
  -G <global alignment>
     T:     global alignment
     F:     local alignment
     default: F
  -m <query type>     query type:
                        0: real structure without IUB code;
                        1: pattern structure containing IUB code.
                        default: 0
Options for 'isearch':
  -R <repeat>         number of iterations.
Options for 'prsa':
  -F <factor>         the window-size decreasing rate. A series of window sizes is generated for folding sequences.
                      The <factor> is the ratio of two contiguous window sizes.
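
To make the -F option more concrete, the Python sketch below generates a decreasing series of window sizes in which each size is the previous one multiplied by the factor. This is only a guess at the general idea, not RSmatch's implementation: the function name `window_sizes`, the assumption that the factor lies between 0 and 1, the rounding down to whole nucleotides, and the `minimum` stopping size are all hypothetical.

```python
def window_sizes(start=100, factor=0.5, minimum=10):
    """Return a decreasing list of window sizes (in nt).

    Assumptions (not taken from RSmatch): 0 < factor < 1, sizes are rounded
    down to integers, and the series stops below `minimum`.
    """
    if not 0 < factor < 1:
        raise ValueError("factor must be between 0 and 1")
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    sizes = []
    size = start
    while size >= minimum:
        sizes.append(size)
        size = int(size * factor)   # next, smaller window
    return sizes


if __name__ == "__main__":
    print(window_sizes(100, 0.5, 10))
```

Under these assumptions, a starting window of 100 nt with a factor of 0.5 gives the sizes 100, 50, 25 and 12.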

[D]   Examples: