miRNA Alignment
Novoalign is well suited to aligning microRNA against whole genomes and has two features especially designed for the purpose:
- Novoalign can strip 3′ adapters from reads before aligning them to the reference genome. This is required because miRNA are typically shorter than the read length and hence some of the 3′ adapter will have been included in the read. A default adapter sequence based on Solexa adapters is provided or you can specify an adapter sequence.
Aligning of adapter sequence to the read uses base qualities, essential as we typically have lower quality bases in 3′ end of the read where the adapter is, and also allows for a degree of mismatches. Approximate 80% identity against the adapter is required. - Precursor miRNA is usually found in a hair pin structure which means that there is likely to both a forward and reverse complement alignment of the miRNA in close proximity. Novoalign can look for this secondary structure and add an additional score to each alignment for the best nearby alignment on the opposite strand. This may help resolve the most likely source of the precursor miRNA.
Command Line Options
Several novoalign command line options come into play when aligning miRNA:
-a | Enables adapter stripping. The default adapter sequence is ‘TCGTATGCCGTCTTCTGCTTG’ or you can use -a sequence to specify the adapter sequence. | |
-m
|
Turns on detection of hair-pin structure and adds a score for the complementary strand alignment to the report | |
-l
|
Specifies a minimum length of read to align. For miRNA use -l17 or -l18. Any read shorter than this length (after adapter stripping) will not be aligned and will be reported as a low quality read. Note, that the default minimum length is calculated based on genome size and is typically around 22 for Human size genomes. This is too long for miRNA so you need to set the -l option. |
|
-t
|
Sets the score threshold for aligning reads and controls the number of mismatches allowed. Typically for miRNA one would be looking for exact matches so you can set the threshold to a low value such as -t30. Having a threshold of 30 allows one mismatch. | |
-h
|
Sets homopolymer filter score. This can be used to filter out Poly-A sequences. Any read that matches a homopolymer with score less than or equal to this threshold will skipped. Default 20, For miRNA and RNA suggest setting this to 60 or 90. | |
-r
|
Sets reporting option to allow reporting more than one alignment location per read. You could use -rA or -rE, -rA will print all alignments with score within 5 points of the best alignment; and -rE will print all alignments with a score less than or equal to the threshold specified by the -t option. |
A typical command to align miRNA reads would be:
- novoalign -d hg36.ndx -f s_1_0000_sequence.txt -a -m -l 18 -h90 -r A
TruSeq Adaptors
The Illumina Truseq small RNA kits have a different adaptor to earlier kits
RNA 3. Adapter (RA3)
- CTGGAATTCTCGGGTGCCAAGG
You will need to specify this in the -a option:
- novoalign -d hg36.ndx -f s_1_0000_sequence.txt -a CTGGAATTCTCGGGTGCCAAGG -m -l 18 -h90 -r A
Note. | Having the wrong adapter sequence is often the cause of a low alignment yield and should be the first thing you check. |
Other Adapter Sequences
This adapter sequence has been seen on some small RNA reads:
- CTGTAGGCACC
we are not sure what kit generates this.
SAM Report Format
When the -m option is used two tags are added to the alignment record.
Tag | Type | Description |
ZH | i | Hairpin score for miRNA alignment (-m option) |
ZL | i | In miRNA mode (-m option) this is the alignment location for adjacent opposite strand alignment. |
Native Report Format
# novoalign (2.03.01MT) - short read aligner with qualities. # (C) 2008 NovoCraft # Licensed to Novocraft Technologies Sdn Bhd # novoalign -d /wd5/db/fly -f SRR001342.fastq -l 13 -a CTGTAGGCACCATCAATTCGTATGCCGTCTTCTGCT -rA -h 90 -m # Interpreting input files as Sanger FASTQ. # Index Build Version: 2.1 # Hash length: 14 # Step size: 1 @SRR001342.1 S TGATGATGATGCTGATGATGATGATGGG IIIIIIIIIIIIIIIIIIIIIIIIIII+ U 30 7 378 >gi|116010291 10390561 R . . . 17T>G @SRR001342.17 S GAAGATTGCAGGTTCGAGTCCTGTCACGG IIIIIIIIIII%IIIIIIIIIIII9;III R 2 1 374 >gi|116010291 13908888 F . . . @SRR001342.17 S GAAGATTGCAGGTTCGAGTCCTGTCACGG IIIIIIIIIII%IIIIIIIIIIII9;III R 2 1 370 >gi|116010291 13910277 F . . . @SRR001342.17 S GAAGATTGCAGGTTCGAGTCCTGTCACGG IIIIIIIIIII%IIIIIIIIIIII9;III R 2 1 362 >gi|116010291 13910882 F . . . @SRR001342.17 S GAAGATTGCAGGTTCGAGTCCTGTCACGG IIIIIIIIIII%IIIIIIIIIIII9;III R 2 1 358 >gi|116010291 13911084 F . . . @SRR001342.17 S GAAGATTGCAGGTTCGAGTCCTGTCACGG IIIIIIIIIII%IIIIIIIIIIII9;III R 2 1 358 >gi|116010291 13911286 F . . . @SRR001342.17 S GAAGATTGCAGGTTCGAGTCCTGTCACGG IIIIIIIIIII%IIIIIIIIIIII9;III R 2 1 374 >gi|116010291 21103159 F . . . @SRR001342.17 S GAAGATTGCAGGTTCGAGTCCTGTCACGG IIIIIIIIIII%IIIIIIIIIIII9;III R 2 1 354 >gi|116010444 19347948 R . . . @SRR001342.22 S TCCCATATTGTCTAGTGGTTAGGGTCACCGGCCG IIIIIIIIIIIIIIIIIIIIIII&I%8(4=5/%& U 67 6 437 >gi|116010443 22840863 R . . . 1G>C 2A>G 8G>T 9T>G 11T>C @SRR001342.26 S TAGGAACTTCATACCGTGCTCT IIIIIIIIIIIIIIIIIIIIII U 0 22 270 >gi|116010443 10358381 F . . . @SRR001342.63 S AAAATAAACAAATAGGG IIIIIIIIIIIIIIIII U 0 21 154 >gi|116010291 18213481 F . . . @SRR001342.101 S TGGAATGTAAAGCAGTATGGAG IIIIIIIIIIIIIIII3IIIII U 30 43 296 >gi|116010444 20487496 F . . . 13A>C @SRR001342.105 S TCTGGGTGTTGCGTTGTGTGTA IIIIIIIIIIIIIIIIIIIIII U 0 73 254 >gi|116010290 646768 F . . . @SRR001342.187 S GGATTTGAAAAGGT IIIIIIIIII<III R 0 2 172 >gi|116010443 766011 R . . . @SRR001342.187 S GGATTTGAAAAGGT IIIIIIIIII<III R 0 2 156 >gi|116010443 17092433 F . . . @SRR001342.488 S AAGTGGGAGATATT IIIIIIIIIII+II U 0 2 120 >gi|116010443 239195 F . . . @SRR001342.489 S QC @SRR001342.490 S QC ... # Read Sequences: 298515 # Aligned: 3547 # Unique Alignment: 1739 # Gapped Alignment: 76 # Quality Filter: 185451 # Homopolymer Filter: 80434 # Elapsed Time: 26,594s # Done.
Looking at one line in detail
@SRR001342.1 S TGATGATGATGCTGATGATGATGATGGG IIIIIIIIIIIIIIIIIIIIIIIIIII+ U 30 7 378 >gi|116010291 10390561 R . . . 17T>G
@SRR001342.1 | The read header | |
S | Indicates Single end read. Other values ‘L’ & ‘R’ for paired end reads. | |
TGATGATGATGCTGATGATGATGATGGG | The read sequence. | |
IIIIIIIIIIIIIIIIIIIIIIIIIII+ | Base qualities using Phred scale and Sanger Fastq coding | |
U | Indicates a unique alignment location. Other values ‘R’ for multiple alignment locations, ‘NM’ for no alignment found, and ‘QC’ for low quality read. | |
30 | The alignment score. | |
7 | The alignment quality | |
378 | The alignment score for nearby reverse complement of the read. i.e. Possible hairpin structure. | |
>gi|116010291 | The sequence header of reference sequence that we aligned to. | |
10390561 | One based offset of alignment in the reference sequence. | |
R | Indicates alignment used reverse complement of the read. i.e. Complementary strand. | |
. . . | Three fields used in paired end reads. (as of V2.07.03 these fields report the location of the nearby reverse complement of the read) | |
17T>G | List of base mismatches, inserts and deletions. |
Performance
The above example processed reads at rate better than 10,000 reads/sec using a Dual core AMD Athlon work station. Performance may be slower on larger genomes.