We’ve tested Novoalign on several sets of reads from ION Torrent including reads from 314 & 316 Chips.
The test data is available on the ION Torrent Community web site in the developer section.
Using the 71.fastq reads from the O104H4 dataset, 90% of reads aligned and about 87% of aligned reads had inserts or deletions.
The run log was
novoalign -d ../../ref/ecoliO104H4TY-2482 -f ../71.fastq -o SAM -K qcal.csv -g 20 -x5 --hpstats indels.tsv <br /> ...<br /> # Read Sequences: 124035 <br /> # Aligned: 112053 <br /> # Unique Alignment: 111365 <br /> # Gapped Alignment: 96212 <br /> # Quality Filter: 140 <br />
In V2.07.12 we add a new option to collect insert/deletion counts as a function of homopolymer run length. Just add option –hpstats to the Novoalign command.
There is an R script in the release that can then chart the Phred scale probablity of Indels.
IONTorrent.R -f -r .pdf
Here’s the first two charts from the R script using all reads from the O104H4 data set …
The charts give the phred scale (-10log10(P)) probability of an insert or delete in a homopolymer run as function of position in the read and the run length. The lower indel probabilities (hiigher phred score) near the ends of the reads are an artefact of the alignment dynamic programming and soft clipping. The x-axis is the start location of the homolymer in the read. For deletions the chart is by length of run in the reference sequence and the Lt=1 line is a single base deletion, Lt=2 is a run of 2 in the reference which has been aligned with a deletion. The Insertions chart is keyed on length of the run in the read and Lq=1 is a single base insertion that doesn’t match adjacent bases. Others are insertions that match adjacent bases.
Alignment clipping, lower base qualities (for 3′ end) and the nature of the Needleman-Wunsch alignment algorithm will result in less insertions and deletions being reported near the ends of the reads resulting in the higher phred scores near the 3′ & 5′ ends of the reads.
The other 4 charts produced by IONTorrent.R look at the performance based on the x&y coordinates of the microwells on the chip.
|Note. The coordinates are parsed from the read headers by Novoalign and it’s possible that changes in header format may break this. If this happens could you please send us a sample of the read headers with vendor documentation.|
The charts starting from top left in clockwise direction are,
- The number of mapped reads.
- The phred scaled likelihood of a single base insertion that does not match either of the adjacent bases. A missing entry indicates that there were insufficient mapped reads to calculate the phred quality
- The phred scaled likelihood of an insertion that matches one of it’s adjacent bases, i.e. a run length error that increases the length.
- The phred scaled likelihood of a deletion. i.e. a run length error that decreases the length.
We also looked at indel rates for individual bases. Raw indel rates differed between AT & CG but when normalised to a probability the difference was not statistially significant.
In another test we looked at the length of the run length errors, errors of length >1 were less likely than errors of length 1.
|Insertion Length||Deletion Length|
In our tests we used a gap open penalty of 20 and a gap extend penalty of 5, this will give a penalty of 25 for a 1bp indel and 30 for a 2bp indel. We tried different gap penalties and lower penalties can map more reads however this can have a negative affect on variant calling.
The results reflect the indels in alignments found by Novoalign and obviously don’t include indel errors in reads that failed to align and so are very likely to be an underestimate of the true indel error rate.. Also, as other aligners such as TMAP, BWA and Bowtie vary in their ability to align reads with indels and they way the clip alignments they will produce different results.