RNASeq Analysis: mRNA and the Spliceosome

Novoalign has some specialised functions to help with mRNA projects and the identification of splice forms.

mRNA exists in alternative splice forms and the user is faced with the problem of how to represent these in the reference sequences that will be used for alignment. We need reads that cross splice junctions to align to the appropriate junction.

The model adopted in Novoalign is to build a set of reference sequences comprised of the following

  1. A sequence per gene comprised of all exons
  2. A set of alternative splice junction sequences. These are comprised of x bases each side of a junction where x is 1 or 2bp shorter than the read length. Every possible splice junction is constructed. Reads that cross splice junctions will align to these junction sequences. The splice junction sequences together with the parent gene sequence form a sequence group withing Novoalign, we’ll see how these groups come into play later.
  3. The genome with genes masked as N’s. The genome is included so that we can align reads from any undocumented coding regions.

Standard Paired End Processing

One feature of Novoalign paired end processing is that we have a structural variation penalty which is used to decide whether we should report alignment of a proper fragment for a pair or should we report the best alignment to each read even if they don’t form a proper fragment. It’s quite possible that a fragment crosses a structural variation of some sort, possibly a large deletion, an inversion, or transposition. Paired end alignment follows the process:

  1. For Read1 we use a seeded alignment process to find alignment locations each with a Read1 alignment score.
  2. Then for each location found we do a Needleman-Wunsch alignment of the second read against a region starting from the Read1 alignment and extending 6 standard deviations beyond mean fragment length.
  3. The best alignment for Read2 will define the pair score for Read1/Read2. All the alignments are added to a collection for Read1.
  4. This process is repeated using Read2 seeded alignment and then Needleman-Wunsch for Read1, creating a collection of Read2/Read1 pairs.

We then decide whether we have a “proper pair” or not. To do this we use a structural variation penalty as follows.

We have a proper pair if the score of the best pair (Read1/Read2 or Read2/Read1 combined score including fragment length penalty) is less than the structural variation penalty (default 70) plus best single-end Read1 score plus best single-end Read2 score.

If we have a proper pair we combine the Read1/Read2 & Read2/Read1 lists, removing duplicates and sorting by alignment score. At this point we have a list of one or more proper pair alignments. This list is passed to reporting which can report one or more alignments depending on the options.

If we didn’t have a ‘proper pair’ then we report individual alignments to each read in the pair.

Note. The default structural variation penalty of 70 was chosen based on the frequency of structural variations and large insert deletes seen in the genomes of Craig Venter and James Watson. If Psv is the probability that a fragment crosses a structural variation then the SV penalty should be set to -log 10(Psv)

Paired End Processing with Sequence Groups

As mentioned before, when doing mRNA alignments we include a full gene sequence and a set of splice junction sequences that cover all the possible splice junctions for the gene.

We then slightly modify the paired end alignment process by simply reducing the structural variation penalty for the case when the Read1 and Read2 align to different sequences but where the sequences are in the same group.

In fact we now have three structural variation penalties:

  1. A penalty that applies where the reads aligned to different sequences that are in the same group (gene)
  2. A penalty that applies if the two reads align to the same sequence but are further apart than expected for a proper fragment.
  3. A penalty that applies to other cases.

This allows a low penalty to be set for read pairs that align to different sequences in the same gene group.

Novoalign options for structural variations

-v 99 Sets the structural variation penalty for chimeric fragments. This form uses a single penalty for all pairs that do not fit the fragment length distribution. Default penalty is 70. If Psv is the probability of a structural variation (that might result in a chimeric fragment) in the genome being sequenced vs the reference genome then the SV penalty is -10log10(Psv). Individual alignments will be reported if their combined score less the structural variation penalty is better than the best pair.
-v 99 99 Sets the structural variation penalties for chimeric fragments. In this form the first penalty applies to chimera where the alignments for the two reads of a pair lie in the same reference sequence. The second penalty applies to chimera that cross reference sequences.
-v 99 99 99 regex Sets the structural variation penalties for chimeric fragments. The three penalties are for:
1. Penalty for SVs within a group of sequences as defined by the regular expression.
2. Penalty for SVs within a single sequence
3..Penalty for SVs across different sequences and groups.
“regex” defines a regular expression applied to headers of indexed sequences. The regular expression defines one field that identifies a group name within the sequence headers.

Example of Sequence Headers and Regular Expression


For these headers we want to extract the Gene code and chromosome as the group name. The regular expression used in this case is “[>]([^_]*)” – this will extract the field starting after the ‘>’ and stopping before the first underscore.

Creating the Reference Sequence Set for RNASeq

We recommend the Useq(external link) package (minimum version 8.2.X ) installed on your system before commencing the tutorial below. We also demonstrate this worked example with a human dataset.

The basic process is as follows

  1. Make Transcripts and splice junction sequences from your reference genome. The Useq MakeTranscriptome(external link) program is used to do this.
  2. Align your RNASeq reads to the whole genome + splice junctions using novoalign (or novoalignCS for SOLiD). Select SAM output format using “-o SAM”
  3. Calculate expression values per transcript using alignments and an R package such as DeSeq
  4. Determine which splice junctions are represented in the dataset by summarizing alignments to these junctions

Click here(external link) to start downloading the UCSC refFlat annotations for Human hg19. Click “get output” to start downloading the refFlat.txt.gz file which we use below. Decompress this file once it is downloaded.

Start off by creating a set of exome splice junctions with MakeTranscriptome. Assuming we have 50bp reads we would do the following:

# -n 60000 specifies the maximum number of splices per gene
java -jar pathTo/USeq/USeq_8.X.X/Apps/MakeTranscriptome   -f ./seqs  -u refFlat.txt -r 45 -n 60000 -s

Where the seqs directory contains the following files for the human genome. In general you would set the radius as N-6 where N is your read length.

chr10.fasta  chr12.fasta  chr14.fasta  chr16.fasta  chr18.fasta  chr1.fasta   chr21.fasta  chr2.fasta  chr4.fasta  chr6.fasta  chr8.fasta  chrX.fasta
chr11.fasta  chr13.fasta  chr15.fasta  chr17.fasta  chr19.fasta  chr20.fasta  chr22.fasta  chr3.fasta  chr5.fasta  chr7.fasta  chr9.fasta  chrY.fasta

And the refFlat.txt file has the following format

#geneName	name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds
WASH7P	NR_024540	chr1	-	14361	29370	29370	29370	11	14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320,	14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370,
FAM138A	NR_026818	chr1	-	34610	36081	36081	36081	3	34610,35276,35720,	35174,35481,36081,
FAM138F	NR_026820	chr1	-	34610	36081	36081	36081	3	34610,35276,35720,	35174,35481,36081,

Useq MakeTranscriptome should produce the following two FASTA files

refFlatRad45Num60kMin10Splices.fasta  refFlatRad45Num60kMin10Transcripts.fasta

With refFlatRad45Num60kMin10Splices.fasta having the corresponding headers:


These files can then be used to build a database for searching with novoalign/novoalignCS using the full genome, transcripts and junction records.

Note. USEQ may generate duplicate splice junction records which is an error in novoindex after release V3.07.00. You should remove any duplicate splice junctions before building the index. You can use the Perl script below.

novoindex Transcriptome.nix refFlatRad45Num60kMin10Splices.fasta  refFlatRad45Num60kMin10Transcripts.fasta

#For colorspace

novoindex -i -c Transcriptome.ncx refFlatRad45Num60kMin10Splices.fasta refFlatRad45Num60kMin10Transcripts.fasta
use strict;
use Bio::SeqIO;
my %unique;
my $file = "mm10genome-single-line.fa";
my $seqio = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $outseq = Bio::SeqIO->new(-file => ">$file.uniq", -format => "fasta");

while(my $seqs = $seqio->next_seq) {
    my $id = $seqs->display_id;
    my $seq = $seqs->seq;
    unless(exists($unique{$seq})) {
        $unique{$seq} +=1;

It is also advisable to add the full human genome with genes masked with “N” characters to your database. You will need to generate this file using the USeq tool MaskExonsInFastaFiles(external link) You need to have the UCSC Refflat gene table that you used to build the transcriptome database and your directory of genome fasta files.

Include the gene masked genome in the Novoindex build with:

novoindex Transcriptome.nix geneMaskedGenome.fasta refFlatRad45Num60kMin10Splices.fasta  refFlatRad45Num60kMin10Transcripts.fasta 

#For colorspace

novoindex Transcriptome.nix geneMaskedGenome.fasta refFlatRad45Num60kMin10Splices.fasta  refFlatRad45Num60kMin10Transcripts.fasta

The reference transcriptome with splice junctions are now ready for searching with novoalign.

RNASeq Alignment

Now it is time to do the alignment to our genome databases and junction sequences

#Write output to this directory
mkdir -p novo
novoalign -d Transcriptome.nix   \
-a \
-f read1.fastq read2.fastq \
-o SAM   \
-k   \
-r Random  \
-v 0 70 70 "[>]([^:]*)"    > novo/alignments.sam 2> novo/alignment_stats.txt

We now need to convert the splice junction coordinates back to genome coordinates using USeq SamTranscriptomeParser(external link)

java -jar pathTo/USeq/USeq_8.X.X/Apps/SamTranscriptomeParser -f ./novo/alignments.sam  -a 50000 -n 100 -u -s  novo/alignments_Converted.sam

#output is novo/alignments_Converted.sam.gz
#Unmapped reads are saved with -u option in  the same file in SAM format

There have been some reports that this stage does not work well if the “-s file.sam” is omitted and this relates to conversion to Picard by USeq.

At this point you have alignments where read pairs are mapped to genes, splice junctions and any other part of the reference genome. These alignments are now in genome coordinates with tags for what gene each read maps to . See the “GN” SAM tags.

This is quite comprehensize in terms of search space for RNASeq data.
To continue with expression analysis convert the resultant file to BAM format:

#We convert and sort with novosort
samtools view -uS  novo/alignments_Converted.sam 2> convert.err | novosort   -  > novo/final.bam

We now have file named “novo/final.bam” containing our sorted alignments against the transcriptome, genome and splice junctions. This file can be used to calculate gene expression (with Cufflinks, DeSeq or USeq) and splicing information.



Spliced alignment with NovoAlign.


Loading posts...