MicroRNA

miRNA Alignment

Novoalign is well suited to aligning microRNA against whole genomes and has two features especially designed for the purpose:

Novoalign can strip 3′ adapters from reads before aligning them to the reference genome. This is required because miRNA are typically shorter than the read length and hence some of the 3′ adapter will have been included in the read. A default adapter sequence based on Solexa adapters is provided or you can specify an adapter sequence.
Aligning of adapter sequence to the read uses base qualities, essential as we typically have lower quality bases in 3′ end of the read where the adapter is, and also allows for a degree of mismatches. Approximate 80% identity against the adapter is required.
Precursor miRNA is usually found in a hair pin structure which means that there is likely to both a forward and reverse complement alignment of the miRNA in close proximity. Novoalign can look for this secondary structure and add an additional score to each alignment for the best nearby alignment on the opposite strand. This may help resolve the most likely source of the precursor miRNA.

Command Line Options

Several novoalign command line options come into play when aligning miRNA:

-a	Enables adapter stripping. The default adapter sequence is ‘TCGTATGCCGTCTTCTGCTTG’ or you can use -a sequence to specify the adapter sequence.
-m	Turns on detection of hair-pin structure and adds a score for the complementary strand alignment to the report
-l	Specifies a minimum length of read to align. For miRNA use -l17 or -l18. Any read shorter than this length (after adapter stripping) will not be aligned and will be reported as a low quality read. Note, that the default minimum length is calculated based on genome size and is typically around 22 for Human size genomes. This is too long for miRNA so you need to set the -l option.
-t	Sets the score threshold for aligning reads and controls the number of mismatches allowed. Typically for miRNA one would be looking for exact matches so you can set the threshold to a low value such as -t30. Having a threshold of 30 allows one mismatch.
-h	Sets homopolymer filter score. This can be used to filter out Poly-A sequences. Any read that matches a homopolymer with score less than or equal to this threshold will skipped. Default 20, For miRNA and RNA suggest setting this to 60 or 90.
-r	Sets reporting option to allow reporting more than one alignment location per read. You could use -rA or -rE, -rA will print all alignments with score within 5 points of the best alignment; and -rE will print all alignments with a score less than or equal to the threshold specified by the -t option.

A typical command to align miRNA reads would be:

: novoalign -d hg36.ndx -f s_1_0000_sequence.txt -a -m -l 18 -h90 -r A

TruSeq Adaptors

The Illumina Truseq small RNA kits have a different adaptor to earlier kits

RNA 3. Adapter (RA3)

: CTGGAATTCTCGGGTGCCAAGG

You will need to specify this in the -a option:

: novoalign -d hg36.ndx -f s_1_0000_sequence.txt -a CTGGAATTCTCGGGTGCCAAGG -m -l 18 -h90 -r A

Note.	Having the wrong adapter sequence is often the cause of a low alignment yield and should be the first thing you check.

Other Adapter Sequences

This adapter sequence has been seen on some small RNA reads:

: CTGTAGGCACC

we are not sure what kit generates this.

SAM Report Format

When the -m option is used two tags are added to the alignment record.

Tag	Type	Description
ZH	i	Hairpin score for miRNA alignment (-m option)
ZL	i	In miRNA mode (-m option) this is the alignment location for adjacent opposite strand alignment.

Native Report Format

# novoalign (2.03.01MT) - short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed to Novocraft Technologies Sdn Bhd
#  novoalign -d /wd5/db/fly -f SRR001342.fastq -l 13 -a CTGTAGGCACCATCAATTCGTATGCCGTCTTCTGCT -rA -h 90 -m 
# Interpreting input files as Sanger FASTQ.
# Index Build Version: 2.1
# Hash length: 14
# Step size: 1
@SRR001342.1	S	TGATGATGATGCTGATGATGATGATGGG	IIIIIIIIIIIIIIIIIIIIIIIIIII+	U	30	7	378	>gi|116010291	10390561	R	.	.	.	17T>G
@SRR001342.17	S	GAAGATTGCAGGTTCGAGTCCTGTCACGG	IIIIIIIIIII%IIIIIIIIIIII9;III	R	2	1	374	>gi|116010291	13908888	F	.	.	.
@SRR001342.17	S	GAAGATTGCAGGTTCGAGTCCTGTCACGG	IIIIIIIIIII%IIIIIIIIIIII9;III	R	2	1	370	>gi|116010291	13910277	F	.	.	.
@SRR001342.17	S	GAAGATTGCAGGTTCGAGTCCTGTCACGG	IIIIIIIIIII%IIIIIIIIIIII9;III	R	2	1	362	>gi|116010291	13910882	F	.	.	.
@SRR001342.17	S	GAAGATTGCAGGTTCGAGTCCTGTCACGG	IIIIIIIIIII%IIIIIIIIIIII9;III	R	2	1	358	>gi|116010291	13911084	F	.	.	.
@SRR001342.17	S	GAAGATTGCAGGTTCGAGTCCTGTCACGG	IIIIIIIIIII%IIIIIIIIIIII9;III	R	2	1	358	>gi|116010291	13911286	F	.	.	.
@SRR001342.17	S	GAAGATTGCAGGTTCGAGTCCTGTCACGG	IIIIIIIIIII%IIIIIIIIIIII9;III	R	2	1	374	>gi|116010291	21103159	F	.	.	.
@SRR001342.17	S	GAAGATTGCAGGTTCGAGTCCTGTCACGG	IIIIIIIIIII%IIIIIIIIIIII9;III	R	2	1	354	>gi|116010444	19347948	R	.	.	.
@SRR001342.22	S	TCCCATATTGTCTAGTGGTTAGGGTCACCGGCCG	IIIIIIIIIIIIIIIIIIIIIII&I%8(4=5/%&	U	67	6	437	>gi|116010443	22840863	R	.	.	.	1G>C 2A>G 8G>T 9T>G 11T>C
@SRR001342.26	S	TAGGAACTTCATACCGTGCTCT	IIIIIIIIIIIIIIIIIIIIII	U	0	22	270	>gi|116010443	10358381	F	.	.	.
@SRR001342.63	S	AAAATAAACAAATAGGG	IIIIIIIIIIIIIIIII	U	0	21	154	>gi|116010291	18213481	F	.	.	.
@SRR001342.101	S	TGGAATGTAAAGCAGTATGGAG	IIIIIIIIIIIIIIII3IIIII	U	30	43	296	>gi|116010444		20487496	F	.	.	.	13A>C
@SRR001342.105	S	TCTGGGTGTTGCGTTGTGTGTA	IIIIIIIIIIIIIIIIIIIIII	U	0	73	254	>gi|116010290		646768	F	.	.	.
@SRR001342.187	S	GGATTTGAAAAGGT	IIIIIIIIII<III	R	0	2	172	>gi|116010443	766011	R	.	.	.
@SRR001342.187	S	GGATTTGAAAAGGT	IIIIIIIIII<III	R	0	2	156	>gi|116010443	17092433	F	.	.	.
@SRR001342.488	S	AAGTGGGAGATATT	IIIIIIIIIII+II	U	0	2	120	>gi|116010443	239195	F	.	.	.
@SRR001342.489	S			QC
@SRR001342.490	S			QC
...
#     Read Sequences:   298515
#            Aligned:     3547
#   Unique Alignment:     1739
#   Gapped Alignment:       76
#     Quality Filter:   185451
# Homopolymer Filter:    80434
#       Elapsed Time: 26,594s
# Done.

Looking at one line in detail

@SRR001342.1	S	TGATGATGATGCTGATGATGATGATGGG	IIIIIIIIIIIIIIIIIIIIIIIIIII+	U	30	7	378	>gi|116010291	10390561	R	.	.	.	17T>G

@SRR001342.1	The read header
S	Indicates Single end read. Other values ‘L’ & ‘R’ for paired end reads.
TGATGATGATGCTGATGATGATGATGGG	The read sequence.
IIIIIIIIIIIIIIIIIIIIIIIIIII+	Base qualities using Phred scale and Sanger Fastq coding
U	Indicates a unique alignment location. Other values ‘R’ for multiple alignment locations, ‘NM’ for no alignment found, and ‘QC’ for low quality read.
30	The alignment score.
7	The alignment quality
378	The alignment score for nearby reverse complement of the read. i.e. Possible hairpin structure.
>gi\|116010291	The sequence header of reference sequence that we aligned to.
10390561	One based offset of alignment in the reference sequence.
R	Indicates alignment used reverse complement of the read. i.e. Complementary strand.
. . .	Three fields used in paired end reads. (as of V2.07.03 these fields report the location of the nearby reverse complement of the read)
17T>G	List of base mismatches, inserts and deletions.

Performance

The above example processed reads at rate better than 10,000 reads/sec using a Dual core AMD Athlon work station. Performance may be slower on larger genomes.

Documentation

MicroRNA

miRNA Alignment

Command Line Options

TruSeq Adaptors

Other Adapter Sequences

SAM Report Format

Native Report Format

Performance

LATEST NEWS

Novocraft and Basepair Inc. Announce Strategic Partnership to Deliver Advanced Genomic Pipelines in the Cloud

Novoalign V4.03.01

Novoalign V4.03.00 and Novosort V3.00.00

Contact Us