Novo2maq – Converting a Novoalign report into MAQ map format.
MAQ includes a useful set of utilities for assembly, SNP calling, indel calling, and viewing alignments. Novo2maq lets you take advantage of these tools by converting a Novoalign report into the MAQ map format.
The process is fairly straightforward, with a couple of options and one “feature” that catches a few users out.
- novo2maq [-s on] [-r] out.map in.list in.novo
|-s on||Enables the Smith-Waterman scoring check of indels, so that only indels with a few good bases on both sides are recorded as MAQ indel alignments.|
|-r||Produces a short report on stdout listing the number of alignments for each reference sequence.|
|out.map||is the file name for the output MAQ map file.|
|in.list||is a list of reference sequence headers to be selected. This file specifies which reference sequences are converted to the MAQ map format and also allows the sequence headers to be renamed. Each line in the file has the format: <refheader> <tab> <replacement header>. Any reference sequence not listed in this file will not be converted to the MAQ map file. Use ‘-’ rather than a filename to specify that all sequences are to be selected and converted. Note that the order of the headers in in.list should match the order that will be used with maq pileup.|
|in.novo||is the Novoalign report file to be converted. Use ‘-’ to specify that the report is to be read from stdin.|
Novoalign can handle alignments with inserts and deletes on both single-end and paired-end reads, whilst MAQ only handles indels on paired-end alignments. This isn’t a problem for Novo2maq: indels called on single-end reads can be converted to MAQ map format, and MAQ indelpe can then be used to call the indels.
Smith-Waterman Test on Indels
Novoalign uses global Needleman-Wunsch alignment, which, unlike Smith-Waterman, includes all bases of the read in the alignment and in the calculation of score and alignment quality. The Smith-Waterman test in Novo2maq computes a local alignment for the read and checks whether any indels in the read fall inside that local alignment. If they do not, the read isn’t flagged as containing an indel. The effect is that the only indels flagged in the MAQ map file are those with at least a few aligned bases on each side, which may reduce false-positive indel calls.
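To illustrate the idea, here is a minimal Python sketch of such a check. It is not Novo2maq's implementation: the function name, the linear gap penalty, and the scoring values (match +2, mismatch -3, gap -5) are assumptions chosen only for illustration. It runs a plain Smith-Waterman alignment of the read against a reference window and reports whether the best local alignment contains a gap.

```python
def local_alignment_has_indel(read, ref, match=2, mismatch=-3, gap=-5):
    """Return True if the best Smith-Waterman local alignment contains a gap.

    Scores are illustrative assumptions, not Novo2maq's real parameters.
    """
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def substitution(i, j):
        # Score for aligning read base i-1 against reference base j-1.
        return match if read[i - 1] == ref[j - 1] else mismatch

    # Fill the matrix; scores are floored at zero so that poorly matching
    # ends of the read drop out of the alignment (local, not global).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + substitution(i, j)
            up = score[i - 1][j] + gap      # read base with no reference base
            left = score[i][j - 1] + gap    # reference base with no read base
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell, preferring the diagonal on ties.
    # Any non-diagonal step means the local alignment includes an indel.
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            i, j = i - 1, j - 1
        else:
            return True
    return False
```

With these assumed scores, a 2-base deletion flanked by 10 and 8 matching bases stays inside the local alignment and is flagged, while the same deletion only 2 bases from the end of the read is clipped off and is not.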
The <in.list> file
This file is used for two purposes:
- To select alignments for specific reference sequences
- To convert reference sequence names
Each line is in the format:
<reference sequence header> <tab> <replacement header>
Only alignments to sequences listed in this file will be converted!
>gi|42406306|ref|NC_000019.8|NC_000019	>Chr19
>gi|51511721|ref|NC_000005.8|NC_000005	>Chr5
>gi|51511724|ref|NC_000008.9|NC_000008	>Chr8
>gi|51511727|ref|NC_000011.8|NC_000011	>Chr11
>gi|51511729|ref|NC_000013.9|NC_000013	>Chr13
>gi|51511730|ref|NC_000014.7|NC_000014	>Chr14
>gi|51511731|ref|NC_000015.8|NC_000015	>Chr15
>gi|51511732|ref|NC_000016.8|NC_000016	>Chr16
>gi|51511734|ref|NC_000017.9|NC_000017	>Chr17
>gi|51511735|ref|NC_000018.8|NC_000018	>Chr18
>gi|51511747|ref|NC_000020.9|NC_000020	>Chr20
>gi|51511750|ref|NC_000021.7|NC_000021	>Chr21
>gi|89161185|ref|NC_000001.9|NC_000001	>Chr1
>gi|89161187|ref|NC_000010.9|NC_000010	>Chr10
>gi|89161190|ref|NC_000012.10|NC_000012	>Chr12
>gi|89161199|ref|NC_000002.10|NC_000002	>Chr2
>gi|89161203|ref|NC_000022.9|NC_000022	>Chr22
>gi|89161205|ref|NC_000003.10|NC_000003	>Chr3
>gi|89161207|ref|NC_000004.10|NC_000004	>Chr4
>gi|89161210|ref|NC_000006.10|NC_000006	>Chr6
>gi|89161213|ref|NC_000007.12|NC_000007	>Chr7
>gi|89161216|ref|NC_000009.10|NC_000009	>Chr9
>gi|89161218|ref|NC_000023.9|NC_000023	>ChrX
>gi|89161220|ref|NC_000024.8|NC_000024	>ChrY
If you get this file wrong you’ll end up with no records converted, or with some reference sequences missing, so take care.
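Because a mistake in this file silently drops alignments, it can be worth checking it before running the conversion. The following is a minimal Python sketch, not part of Novo2maq: the function names `parse_in_list` and `missing_headers` are hypothetical, and it assumes you already have the reference headers exactly as they appear in the Novoalign report. Whether Novo2maq itself tolerates blank lines is not known; the sketch simply skips them.

```python
def parse_in_list(text):
    """Parse in.list content into ordered (header, replacement) pairs."""
    pairs = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        # Each line must be exactly <header><tab><replacement header>.
        if len(fields) != 2 or not fields[0] or not fields[1]:
            raise ValueError(
                f"line {lineno}: expected <header><tab><replacement>, got {line!r}"
            )
        pairs.append((fields[0], fields[1]))
    return pairs


def missing_headers(pairs, reference_headers):
    """Return the in.list headers that match no reference header exactly."""
    known = set(reference_headers)
    return [header for header, _ in pairs if header not in known]
```

An empty result from `missing_headers` means every header in the list matches a reference header exactly; any header it returns would select no alignments.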
Starting with V2.03.12 this file is optional. If you enter a ‘-’ in place of the file name, all alignments will be converted.
There is also an option in novoutil to print out all the reference sequence headers in a format usable as the in.list file.
- novoutil n2mhdrs novoindex >in.list
It’s possible to improve parallelism in your processes if you use Novo2maq to split your alignments into files per chromosome and then run the maq tools separately against each chromosome. This would be most useful on very large projects where you have multiple lanes of data. The basic process would be:
- Create an in.list file for each chromosome
- Run novo2maq for each chromosome/lane
- Use MAQ mapmerge to merge the lanes for each chromosome, producing one map file per chromosome
- Run the MAQ tools against each chromosome
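The steps above could be scripted along the following lines. This is a hedged Python sketch rather than a tested pipeline: the helper names, the file-naming scheme (`<chrom>.list`, `<lane>.novo`, `<lane>.<chrom>.map`), and the lane names are all hypothetical. The `novo2maq` argument order follows the usage line in this document, and the `maq mapmerge` call assumes the form `maq mapmerge out.map in1.map in2.map ...`. The functions only build the per-chromosome in.list contents and the command lines; writing the files and running the commands (for example with `subprocess.run`) is left to the caller.

```python
def split_in_list(text):
    """Map each replacement name to a one-line in.list for that sequence."""
    per_chrom = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        header, replacement = line.split("\t")
        name = replacement.lstrip(">")  # e.g. ">Chr19" -> "Chr19"
        per_chrom[name] = f"{header}\t{replacement}\n"
    return per_chrom


def novo2maq_commands(chroms, lanes):
    """One novo2maq command per chromosome/lane pair (hypothetical file names)."""
    return [
        ["novo2maq", f"{lane}.{chrom}.map", f"{chrom}.list", f"{lane}.novo"]
        for chrom in chroms
        for lane in lanes
    ]


def mapmerge_commands(chroms, lanes):
    """One maq mapmerge command per chromosome, merging all lanes."""
    return [
        ["maq", "mapmerge", f"{chrom}.map"]
        + [f"{lane}.{chrom}.map" for lane in lanes]
        for chrom in chroms
    ]
```

Because each chromosome's commands touch only that chromosome's files, the per-chromosome groups can be dispatched to separate processes or cluster jobs.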