Novoalign has some specialised functions to help with mRNA projects and the identification of splice forms.
mRNA exists in alternative splice forms and the user is faced with the problem of how to represent these in the reference sequences that will be used for alignment. We need reads that cross splice junctions to align to the appropriate junction.
The model adopted in Novoalign is to build a set of reference sequences comprised of the following
- A sequence per gene comprised of all exons
- A set of alternative splice junction sequences. These are comprised of x bases each side of a junction where x is 1 or 2bp shorter than the read length. Every possible splice junction is constructed. Reads that cross splice junctions will align to these junction sequences. The splice junction sequences together with the parent gene sequence form a sequence group withing Novoalign, we’ll see how these groups come into play later.
- The genome with genes masked as N’s. The genome is included so that we can align reads from any undocumented coding regions.
Standard Paired End Processing
One feature of Novoalign paired end processing is that we have a structural variation penalty which is used to decide whether we should report alignment of a proper fragment for a pair or should we report the best alignment to each read even if they don’t form a proper fragment. It’s quite possible that a fragment crosses a structural variation of some sort, possibly a large deletion, an inversion, or transposition. Paired end alignment follows the process:
- For Read1 we use a seeded alignment process to find alignment locations each with a Read1 alignment score.
- Then for each location found we do a Needleman-Wunsch alignment of the second read against a region starting from the Read1 alignment and extending 6 standard deviations beyond mean fragment length.
- The best alignment for Read2 will define the pair score for Read1/Read2. All the alignments are added to a collection for Read1.
- This process is repeated using Read2 seeded alignment and then Needleman-Wunsch for Read1, creating a collection of Read2/Read1 pairs.
We then decide whether we have a “proper pair” or not. To do this we use a structural variation penalty as follows.
We have a proper pair if the score of the best pair (Read1/Read2 or Read2/Read1 combined score including fragment length penalty) is less than the structural variation penalty (default 70) plus best single-end Read1 score plus best single-end Read2 score.
If we have a proper pair we combine the Read1/Read2 & Read2/Read1 lists, removing duplicates and sorting by alignment score. At this point we have a list of one or more proper pair alignments. This list is passed to reporting which can report one or more alignments depending on the options.
If we didn’t have a ‘proper pair’ then we report individual alignments to each read in the pair.
Paired End Processing with Sequence Groups
As mentioned before, when doing mRNA alignments we include a full gene sequence and a set of splice junction sequences that cover all the possible splice junctions for the gene.
We then slightly modify the paired end alignment process by simply reducing the structural variation penalty for the case when the Read1 and Read2 align to different sequences but where the sequences are in the same group.
In fact we now have three structural variation penalties:
- A penalty that applies where the reads aligned to different sequences that are in the same group (gene)
- A penalty that applies if the two reads align to the same sequence but are further apart than expected for a proper fragment.
- A penalty that applies to other cases.
This allows a low penalty to be set for read pairs that align to different sequences in the same gene group.
Novoalign options for structural variations
-v 99 | Sets the structural variation penalty for chimeric fragments. This form uses a single penalty for all pairs that do not fit the fragment length distribution. Default penalty is 70. If Psv is the probability of a structural variation (that might result in a chimeric fragment) in the genome being sequenced vs the reference genome then the SV penalty is -10log10(Psv). Individual alignments will be reported if their combined score less the structural variation penalty is better than the best pair. |
-v 99 99 | Sets the structural variation penalties for chimeric fragments. In this form the first penalty applies to chimera where the alignments for the two reads of a pair lie in the same reference sequence. The second penalty applies to chimera that cross reference sequences. |
-v 99 99 99 regex | Sets the structural variation penalties for chimeric fragments. The three penalties are for: 1. Penalty for SVs within a group of sequences as defined by the regular expression. 2. Penalty for SVs within a single sequence 3..Penalty for SVs across different sequences and groups. “regex” defines a regular expression applied to headers of indexed sequences. The regular expression defines one field that identifies a group name within the sequence headers. |
Example of Sequence Headers and Regular Expression
>SH2D6-chr2_I_19_2_85516418_85517139 >SH2D6-chr2_I_20_2_85517188_85517484 >SH2D6-chr2_I_21_2_85517621_85517700 >ENSG00000207207-chr2_I_0_2_85515101_85515180 >ENSG00000207207-chr2_I_1_2_85515213_85515292 >ENSG00000222310-chr2_I_0_2_85523089_85523168 >ENSG00000222310-chr2_I_1_2_85523383_85523462 >AC016753.9-chr2_I_0_2_85523098_85523177 >AC016753.9-chr2_I_1_2_85523385_85523464 >ENSG00000199936-chr2_I_0_2_85540846_85540925 >ENSG00000199936-chr2_I_1_2_85540738_85540817 >MAT2A-chr2_I_0_2_85619759_85619838
For these headers we want to extract the Gene code and chromosome as the group name. The regular expression used in this case is “[>]([^_]*)” – this will extract the field starting after the ‘>’ and stopping before the first underscore.
Creating the Reference Sequence Set for RNASeq
We recommend the Useq package (minimum version 8.2.X ) installed on your system before commencing the tutorial below. We also demonstrate this worked example with a human dataset.
The basic process is as follows
- Make Transcripts and splice junction sequences from your reference genome. The Useq MakeTranscriptome program is used to do this.
- Align your RNASeq reads to the whole genome + splice junctions using novoalign (or novoalignCS for SOLiD). Select SAM output format using “-o SAM”
- Calculate expression values per transcript using alignments and an R package such as DeSeq
- Determine which splice junctions are represented in the dataset by summarizing alignments to these junctions
Click here to start downloading the UCSC refFlat annotations for Human hg19. Click “get output” to start downloading the refFlat.txt.gz file which we use below. Decompress this file once it is downloaded.
Start off by creating a set of exome splice junctions with MakeTranscriptome. Assuming we have 50bp reads we would do the following:
# -n 60000 specifies the maximum number of splices per gene java -jar pathTo/USeq/USeq_8.X.X/Apps/MakeTranscriptome -f ./seqs -u refFlat.txt -r 45 -n 60000 -s
Where the seqs directory contains the following files for the human genome. In general you would set the radius as N-6 where N is your read length.
chr10.fasta chr12.fasta chr14.fasta chr16.fasta chr18.fasta chr1.fasta chr21.fasta chr2.fasta chr4.fasta chr6.fasta chr8.fasta chrX.fasta chr11.fasta chr13.fasta chr15.fasta chr17.fasta chr19.fasta chr20.fasta chr22.fasta chr3.fasta chr5.fasta chr7.fasta chr9.fasta chrY.fasta
And the refFlat.txt file has the following format
#geneName name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds WASH7P NR_024540 chr1 - 14361 29370 29370 29370 11 14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320, 14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370, FAM138A NR_026818 chr1 - 34610 36081 36081 36081 3 34610,35276,35720, 35174,35481,36081, FAM138F NR_026820 chr1 - 34610 36081 36081 36081 3 34610,35276,35720, 35174,35481,36081,
Useq MakeTranscriptome should produce the following two FASTA files
refFlatRad45Num60kMin10Splices.fasta refFlatRad45Num60kMin10Transcripts.fasta
With refFlatRad45Num60kMin10Splices.fasta having the corresponding headers:
>LINC00261:chr20:22545709-22545754_22548431-22548476 >LINC00261:chr20:22547398-22547443_22548431-22548476 >LINC00261:chr20:22545709-22545754_22547320-22547365 >LINC00261:chr20:22548478-22548523_22559147-22559192 >LINC00261:chr20:22545709-22545754_22559147-22559192 >SLC9A8:chr20:48429440-48429485_48471974-48472019 >SLC9A8:chr20:48429440-48429485_48467346-48467381_48500382-48500392 >SLC9A8:chr20:48466172-48466217_48503288-48503333 >SLC9A8:chr20:48466207-48466217_48467346-48467381_48504365-48504410
These files can then be used to build a database for searching with novoalign/novoalignCS using the full genome, transcripts and junction records.
Note. USEQ may generate duplicate splice junction records which is an error in novoindex after release V3.07.00. You should remove any duplicate splice junctions before building the index. You can use the Perl script below.
novoindex Transcriptome.nix refFlatRad45Num60kMin10Splices.fasta refFlatRad45Num60kMin10Transcripts.fasta #For colorspace novoindex -i -c Transcriptome.ncx refFlatRad45Num60kMin10Splices.fasta refFlatRad45Num60kMin10Transcripts.fasta
use strict;
use Bio::SeqIO;
my %unique;
my $file = "mm10genome-single-line.fa";
my $seqio = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $outseq = Bio::SeqIO->new(-file => ">$file.uniq", -format => "fasta");
while(my $seqs = $seqio->next_seq) {
my $id = $seqs->display_id;
my $seq = $seqs->seq;
unless(exists($unique{$seq})) {
$outseq->write_seq($seqs);
$unique{$seq} +=1;
}
}
It is also advisable to add the full human genome with genes masked with “N” characters to your database. You will need to generate this file using the USeq tool MaskExonsInFastaFiles You need to have the UCSC Refflat gene table that you used to build the transcriptome database and your directory of genome fasta files.
Include the gene masked genome in the Novoindex build with:
novoindex Transcriptome.nix geneMaskedGenome.fasta refFlatRad45Num60kMin10Splices.fasta refFlatRad45Num60kMin10Transcripts.fasta #For colorspace novoindex Transcriptome.nix geneMaskedGenome.fasta refFlatRad45Num60kMin10Splices.fasta refFlatRad45Num60kMin10Transcripts.fasta
The reference transcriptome with splice junctions are now ready for searching with novoalign.
RNASeq Alignment
Now it is time to do the alignment to our genome databases and junction sequences
#Write output to this directory mkdir -p novo novoalign -d Transcriptome.nix \ -a \ -f read1.fastq read2.fastq \ -o SAM \ -k \ -r Random \ -v 0 70 70 "[>]([^:]*)" > novo/alignments.sam 2> novo/alignment_stats.txt
We now need to convert the splice junction coordinates back to genome coordinates using USeq SamTranscriptomeParser
java -jar pathTo/USeq/USeq_8.X.X/Apps/SamTranscriptomeParser -f ./novo/alignments.sam -a 50000 -n 100 -u -s novo/alignments_Converted.sam #output is novo/alignments_Converted.sam.gz #Unmapped reads are saved with -u option in the same file in SAM format
There have been some reports that this stage does not work well if the “-s file.sam” is omitted and this relates to conversion to Picard by USeq.
At this point you have alignments where read pairs are mapped to genes, splice junctions and any other part of the reference genome. These alignments are now in genome coordinates with tags for what gene each read maps to . See the “GN” SAM tags.
This is quite comprehensize in terms of search space for RNASeq data.
To continue with expression analysis convert the resultant file to BAM format:
#We convert and sort with novosort samtools view -uS novo/alignments_Converted.sam 2> convert.err | novosort - > novo/final.bam
We now have file named “novo/final.bam” containing our sorted alignments against the transcriptome, genome and splice junctions. This file can be used to calculate gene expression (with Cufflinks, DeSeq or USeq) and splicing information.