Quality calibration is the process of re-evaluating base qualities using the actual counts of mismatches from alignments.
Using Quality Calibration in Novoalign offers several benefits:
- More accurate alignments with improved quality scores
- Faster alignment
- Option to do away with the phiX Control lane
- Eliminates the need to run calibration as an extra step in your pipeline
- Provides base-specific calibration, which is unavailable in most external calibration systems
- Known SNPs are not counted as mismatches if encoded in the reference as IUPAC ambiguous codes.
Quality Calibration is available in Novoalign V2.04 and later.
The calibration is base specific, which means two things:
- We keep mismatch counts based on the actual base called, so we can detect situations where, say, T is over-called and likely to be wrong but calls of A, C and G are likely to be correct.
- Rather than count “mismatches”, we maintain counts for each of the bases aligned. This allows us to detect situations where a wrong call of, say, a T is more likely to be an A than a C. We can then calculate mismatch penalties specific to each base at each position in a read.
These counts are used to calculate an actual mismatch probability or penalty as a function of the position in the read, the “as called” base quality, the base called, and the base aligned. The mismatch probability is then used in Novoalign's alignment process, in place of the “as called” base quality, to set penalties for the alignment dynamic programming.
Categories used for counting mismatches are:
- The read within the pair (0 for first read, 1 for second read)
- The base position in the read, zero based.
- The “as called” quality
- The base called
For each combination, Novoalign maintains a count of the number of alignments to each of the four bases: MA, MC, MG and MT. Only ungapped alignments with a quality greater than 60 (120 for paired-end reads) are used to count mismatches.
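The counting scheme above can be pictured as a table keyed by the four categories, with four counters per key. The following is a minimal Python sketch of that idea only; the function and variable names are hypothetical and this is not Novoalign's internal data structure.

```python
from collections import defaultdict

BASES = "ACGT"

def new_counts():
    # Key: (read_in_pair, position, called_quality, called_base)
    # Value: [MA, MC, MG, MT], the number of alignments to each reference base.
    return defaultdict(lambda: [0, 0, 0, 0])

def record(counts, read_in_pair, position, quality, called, aligned):
    """Add one aligned base to the counts for its category."""
    counts[(read_in_pair, position, quality, called)][BASES.index(aligned)] += 1

def mismatches(cell, called):
    """Total mismatches for a category: every alignment not to the called base."""
    return sum(cell) - cell[BASES.index(called)]

counts = new_counts()
record(counts, 0, 3, 30, "T", "T")  # called T, aligned to T (a match)
record(counts, 0, 3, 30, "T", "A")  # called T, aligned to A (a mismatch)
```

Keeping a counter per aligned base, rather than a single mismatch total, is what allows a penalty to be derived for each called base and aligned base pair.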
The first step in calculating calibrated qualities for each category is to bin counts across read position and quality value. Binning increases the counts and smooths fluctuations. Bins are 5 bases long and span a variable number of quality values: at low qualities a bin covers a single quality value, in the mid range bins are 3 quality values wide, and above a quality of 30 they are 5 wide. There is a bin for each base position and quality value, so each mismatch count is added to multiple overlapping bins. This design eliminates edge effects between bins.
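The overlapping bins can be thought of as a sliding window centred on each (position, quality) pair. The sketch below illustrates this in Python for a single read-in-pair and called base. Several details are assumptions made for illustration: the window is taken to be centred, the boundary between "low" and "mid range" qualities (`low_cutoff=10`) is not stated in this document, and all names are hypothetical.

```python
def quality_half_width(quality, low_cutoff=10, high_cutoff=30):
    """Half-width of the quality window (low_cutoff is an assumed value)."""
    if quality < low_cutoff:
        return 0  # bin covers a single quality value
    if quality <= high_cutoff:
        return 1  # bin is 3 quality values wide
    return 2      # bin is 5 quality values wide

def binned_counts(counts, position, quality, position_half_width=2):
    """Sum raw [MA, MC, MG, MT] counts over the window centred on (position, quality).

    counts maps (position, quality) -> [MA, MC, MG, MT].
    """
    hw = quality_half_width(quality)
    total = [0, 0, 0, 0]
    for p in range(position - position_half_width, position + position_half_width + 1):
        for q in range(quality - hw, quality + hw + 1):
            cell = counts.get((p, q))
            if cell:
                total = [t + c for t, c in zip(total, cell)]
    return total
```

Because every (position, quality) pair gets its own window, a raw count contributes to several neighbouring bins, which is why there are no hard edges between them.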
The second step adds priors to the counts of calls and mismatches. A prior helps stabilise the calibrated quality values when counts are low. The prior is a minimum value for the mismatch count: if the actual mismatch count is below the prior, we add extra mismatches to bring the count up to the prior, and then add a corresponding number of extra matches based on the “as called” quality.
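One way to read the prior step is sketched below in Python. It assumes that the "corresponding number of extra matches" is chosen so that the added counts on their own reproduce the “as called” error rate; the prior value of 2 and the function name are hypothetical, since this document does not state them.

```python
def apply_prior(matches, mismatches, called_quality, prior=2.0):
    """Return (matches, mismatches) after topping the mismatch count up to the prior."""
    if mismatches >= prior:
        return matches, mismatches  # enough data already, nothing is added
    extra_mismatches = prior - mismatches
    # Error probability implied by the "as called" Phred quality.
    p_error = 10 ** (-called_quality / 10)
    # Matches that would accompany extra_mismatches at that error rate.
    extra_matches = extra_mismatches * (1 - p_error) / p_error
    return matches + extra_matches, prior

# With no observations and an "as called" quality of 20, the bin starts at
# roughly 198 matches and 2 mismatches, an error rate of about 1 in 100.
m, mm = apply_prior(0, 0, 20)
```

Under this reading, an empty bin starts at the “as called” quality and drifts toward the observed error rate as real counts accumulate.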
Novoalign then calculates four base penalties, Pi = min(30, -10 log10(Mi/N)) for i in A, C, G, T, where Mi is the number of times an alignment matched base i and N is the total number of calls for this bin. Each penalty is therefore capped at 30. Base penalties are used in the alignment scoring process.
A Phred-scaled quality value is also calculated as P = min(93, -10 log10(M/N)), where M is the total number of mismatches and N is the total number of calls for the bin, so the value is capped at 93. This calibrated quality value is reported as the base quality.
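As a worked illustration of the two formulas, the Python sketch below computes the four base penalties and the calibrated quality for one bin. It treats 30 and 93 as upper limits on the results and returns the limit when a count is zero; both choices are interpretations made for this sketch, and the function names are hypothetical.

```python
import math

BASES = "ACGT"

def capped_phred(count, total, cap):
    """-10*log10(count/total), limited to cap; a zero count returns the cap."""
    if total <= 0 or count <= 0:
        return float(cap)
    return min(float(cap), -10.0 * math.log10(count / total))

def base_penalties(aligned_counts, cap=30):
    """Penalty for aligning the called base against each of A, C, G, T."""
    n = sum(aligned_counts)
    return {b: capped_phred(m, n, cap) for b, m in zip(BASES, aligned_counts)}

def calibrated_quality(aligned_counts, called, cap=93):
    """Phred-scaled quality from the total mismatches in the bin."""
    n = sum(aligned_counts)
    mismatch_total = n - aligned_counts[BASES.index(called)]
    return capped_phred(mismatch_total, n, cap)

# Bin for called base T: 990 alignments to T, 10 to A, none to C or G.
bin_counts = [10, 0, 0, 990]
penalties = base_penalties(bin_counts)
quality = calibrated_quality(bin_counts, "T")
```

In this example a T aligned against an A costs about 20, a T against C or G costs the full 30, and the reported quality for the call is about 20.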
Using Quality Calibration
Quality calibration works for read files in the following formats:
|Solexa & Illumina FASTQ|
|FASTA||Every base is assumed to have a starting quality of 30.|
|FASTA with separate quality file|
It does not work with prb files.
The simplest way to use quality calibration is to add the option -k to the Novoalign command line. This turns on calibration based on the actual alignments in the run. Because of the priors, the calibration starts off neutral; as more alignments are added, it gradually shifts to reflect the actual mismatch counts.
Novoalign also has the ability to save the mismatch count data and then use this as input to the calibration of a following run of Novoalign. Scenarios where this might be used include:
- Using mismatch counts from phiX lane to calibrate another lane
- Running an initial Novoalign pass at a low threshold to get mismatch statistics for use in a following run, possibly at a higher threshold. This would remove some start-up effects seen in a single-pass run.
Operation is controlled by two command line options:
|-k [infile]||Enables quality calibration. The quality calibration data (mismatch counts) are either read from the named file or accumulated from actual alignments. Default is no calibration. Note: quality calibration does not work with reads in prb format.|
|-K [file]||Accumulates mismatch counts for quality calibration by position in the read and called base quality. Mismatch counts are written to the named file after all reads are processed. When used with -k option the mismatch counts include any counts read from the input quality calibration file.|
These two options can be used in several combinations:
|-k||Turns on calibration with mismatch counting. Effects of calibration can be seen after a few thousand reads have been aligned. Calibration data is recalculated periodically as more reads are aligned.|
|-k infile||Turns on calibration with mismatch counts read from infile. Mismatch counts from alignments are not used.|
|-K outfile||Turns on mismatch counting without calibration. At the end of the run the mismatch counts are written to the outfile ready for use as input in another run.|
|-k -K outfile||Turns on calibration with mismatch counting. At the end of the run the mismatch counts are written to the outfile ready for use as input in another run.|
|-k infile -K outfile||Turns on calibration and mismatch counting. Initial mismatch counts are loaded from infile, new alignments are added to the counts, and then at the end of the run the mismatch counts are written to the outfile ready for use as input in another run. Calibration data is recalculated periodically as more reads are aligned.|
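The two-pass scenario, where one run collects counts and a following run uses them, can be scripted by building the two command lines. The Python sketch below only assembles argument lists and does not run anything. The file names are placeholders, the helper name is hypothetical, and the use of -d for the index and -f for the read files is assumed from typical Novoalign usage, so check it against your version.

```python
def novoalign_args(index, reads, calibrate=True, calib_in=None, calib_out=None):
    """Build a Novoalign argument list using the -k and -K options described above."""
    args = ["novoalign", "-d", index, "-f", *reads]
    if calibrate:
        args.append("-k")
        if calib_in:
            args.append(calib_in)    # -k infile: calibrate from saved counts
    if calib_out:
        args += ["-K", calib_out]    # -K outfile: save counts at the end of the run
    return args

# Pass 1: count mismatches only and write them to a file.
first_pass = novoalign_args("ref.nix", ["lane1.fastq"],
                            calibrate=False, calib_out="lane1.counts")
# Pass 2: calibrate from the saved counts.
second_pass = novoalign_args("ref.nix", ["lane1.fastq"],
                             calib_in="lane1.counts")
```

Either list could then be passed to `subprocess.run` with the output redirected as usual.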
Quality Calibration and SNPs
A mismatch may be due to a SNP rather than a sequencing error. It is preferable that SNPs are not counted as mismatches for quality calibration, because they artificially reduce the quality of the bases and may then lower the quality of consensus calls. The effect is generally small, but it is worth avoiding, at least for documented SNPs.
To avoid counting known SNPs as mismatches, use a reference genome in which SNPs have been encoded as IUPAC ambiguous codes. For human, you can download such a genome at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/snp132Mask/ or you can build your own using novoutil IUPAC.
If the base in a read matches the IUPAC ambiguous code then it will not be counted as a mismatch and it will not result in reduced base quality.
Encoding known SNPs as IUPAC codes also improves alignment accuracy at these locations and eliminates allelic bias.
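The matching rule for IUPAC ambiguous codes can be illustrated with a short Python sketch. The code table is the standard IUPAC nucleotide alphabet; the function name is hypothetical, and this illustrates the rule described above rather than Novoalign's implementation.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def is_mismatch(read_base, ref_code):
    """True if the read base is not one of the bases allowed by the reference code."""
    return read_base.upper() not in IUPAC[ref_code.upper()]
```

For example, with a known A/G SNP encoded as R in the reference, a read base of A or G is not counted as a mismatch, while C or T still is.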
Quality Calibration and Novoalign Report
There is no change to the report format; the quality string displayed now contains the calibrated qualities.