Input data
Novoalign accepts many input read formats. The two most commonly used with NGS platforms are
- FASTA
- FASTQ
Licensed versions of Novoalign can also read both formats directly from gzip-compressed files.
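Before aligning, it can be useful to check how many reads an input file holds, whether or not it is compressed. The following is a minimal sketch, not part of Novoalign: `count_fastq_reads` is a hypothetical helper name, and it assumes the common four-lines-per-record FASTQ layout and that compressed files end in `.gz`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Novoalign): count the records in a FASTQ file.
# Assumes four lines per record (no wrapped sequence lines) and that
# gzip-compressed files carry a .gz suffix.
count_fastq_reads() {
    file=$1
    case "$file" in
        *.gz) gzip -cd "$file" ;;   # decompress to stdout, leave the file untouched
        *)    cat "$file" ;;
    esac | awk 'END { print int(NR / 4) }'
}
```

For example, `count_fastq_reads SRR040810_1.fastq.gz` prints the number of reads in that file.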
Database Index formatting
Format a human reference database for Illumina GAII/HiSeq read mapping:
novoindex hg18.nix hg18.fasta
References can be in multiple files and may be specified with wildcard expressions
novoindex hg18_version2.nix chr*.fasta
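On a Unix command line an unquoted wildcard such as chr*.fasta is expanded by the shell before novoindex starts, so it helps to preview what the pattern matches. The sketch below only prints the command and never runs novoindex; `preview_novoindex` is a hypothetical helper name introduced for illustration.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the novoindex command line the shell
# would build after expanding a wildcard. It does not run novoindex.
# Usage: preview_novoindex INDEX_NAME FASTA_FILE...
preview_novoindex() {
    index=$1
    shift
    # An unmatched pattern is passed through literally, so test that the
    # first argument is a real file before printing anything.
    if [ "$#" -eq 0 ] || [ ! -e "$1" ]; then
        echo "no reference files matched" >&2
        return 1
    fi
    echo "novoindex $index $*"
}
```

Running `preview_novoindex hg18_version2.nix chr*.fasta` lists the files in the order the shell sorts them, which is typically lexicographic, so chr10.fasta appears before chr2.fasta.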
Using IUPAC codes for known SNPs
novoutil iupac dbsnp.vcf chr*.fasta > hg18.dbsnp.fa
novoindex hg18.nix hg18.dbsnp.fa
Format a genome database for SOLiD mapping. Unlike with Bioscope or other mapping tools, the entire human reference genome can be indexed in one step, which takes a few minutes.
novoindex -c hg18.ncx hg18.fasta
Format a genome database for bisulfite sequence mapping
novoindex -b hg18.nbx hg18.fasta
Alignment
Single-end or fragment Illumina read mapping
# A database index and read file are always required
novoalign -d hg18.nix -f SRR040810_1.fastq.gz -o SAM > alignments.sam 2> log.txt
Report the first 100,000 alignments and quit
novoalign -d hg18.nix -f SRR040810_1.fastq.gz -# 100K -o SAM > alignments.sam 2> log.txt
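After a run like the ones above, a quick sanity check is to count how many SAM records are mapped and unmapped. This is a minimal sketch under the SAM format conventions (header lines start with "@", FLAG is column 2, bit 0x4 marks an unmapped read); `sam_mapped_summary` is a hypothetical helper name, not a Novoalign tool.

```shell
#!/bin/sh
# Hypothetical helper: print "mapped unmapped" record counts for a SAM file.
# It counts records, not distinct reads, so a read reported at several
# locations is counted once per record.
sam_mapped_summary() {
    awk -F '\t' '
        /^@/ { next }                               # skip header lines
        int($2 / 4) % 2 == 1 { unmapped++; next }   # FLAG bit 0x4 is set
        { mapped++ }
        END { printf "%d %d\n", mapped, unmapped }
    ' "$1"
}
```

For example, `sam_mapped_summary alignments.sam` prints two numbers: mapped records first, then unmapped records.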
Aligning Illumina reads
Align Illumina paired-end reads to a reference genome. The expected fragment size distribution for this sequencing run has mean = 300 and standard deviation = 50
# Note that novoalign accepts gzip-compressed input read files
novoalign -d hg18.nix -f SRR040810_1.fastq.gz SRR040810_2.fastq.gz -i 300,50 -o SAM > alignments.sam 2> log.txt
Align Illumina mate-pair/jumping library reads to a reference genome. The expected fragment size distribution for this library has mean = 4000 and standard deviation = 500
novoalign -d hg18.nix -f Mate1.fastq Mate2.fastq -i MP 4000,500 -o SAM >alignments.sam 2> log.txt
Align Illumina mate-pair/jumping library reads to a reference genome. Two fragment length distributions can be given to the aligner: the first (mean,sd = 4000,500) describes the first fragmentation step, and the second (mean,sd = 250,80) describes the secondary fragmentation. This mode handles the mixture of mate-pair and paired-end reads that results from incomplete biotin enrichment.
novoalign -d hg18.nix -f Mate1.fastq Mate2.fastq -i MP 4000,500 250,80 -o SAM >alignments.sam 2> log.txt
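The mean and standard deviation given to "-i" usually come from library preparation. When they are unknown, one option is to estimate them from a pilot paired-end alignment. The sketch below assumes such a SAM file already exists and reads the TLEN field (column 9 in the SAM format); `fragment_stats` is a hypothetical helper name, and outlier pairs are not filtered, so they can inflate the standard deviation.

```shell
#!/bin/sh
# Hypothetical helper: estimate fragment length mean and standard deviation
# from the TLEN field (column 9) of a paired-end SAM file.
# Only positive TLEN values are used, so each pair is counted once.
fragment_stats() {
    awk -F '\t' '
        /^@/ { next }                                   # skip header lines
        $9 > 0 { n++; sum += $9; sumsq += $9 * $9 }
        END {
            if (n == 0) { print "no pairs found"; exit 1 }
            mean = sum / n
            var = sumsq / n - mean * mean               # population variance
            if (var < 0) var = 0                        # guard against rounding
            printf "%.0f,%.0f\n", mean, sqrt(var)
        }
    ' "$1"
}
```

The output is printed as mean,sd, the same shape as the values passed to "-i" in the commands above.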
Align ABI SOLiD mate pair reads in CSFASTA format to a reference genome
# Assumes Rosalind_20080729_2_Chris5_F3_QV.qual and Rosalind_20080729_2_Chris5_R3_QV.qual exist in the same folder as the csfasta files!
novoalignCS -d ecoli.ncx -f Rosalind_20080729_2_Chris5_F3.csfasta Rosalind_20080729_2_Chris5_R3.csfasta -i MP 4000,600 -o SAM >alignments.sam 2>log.txt
Align ABI SOLiD reads in CSFASTQ format
# writes output to STDOUT by default
novoalignCS -d hg18.ncx -f SRR040808.fastq.gz -o SAM >SRR040808.sam
Align 454 paired-end reads
# set the mate orientation using "-i ++" as in the case of 454 mate pairs
novoalign -d hg18.nix -f 454_reads1.fastq 454_reads2.fastq -i ++ 5000 800 -o SAM >alignments.sam 2>log.txt
In general the “-i” option may be used to specify the orientation model of read pairs for novoalign to process.
Align PacBio (Pacific Biosciences) reads
Novoalign can be used with PacBio reads, but use novoalign version 3.0 or higher to achieve good performance. Because PacBio SMRT reads have higher error rates, the gap open “-g” and gap extend “-x” penalties need to be adjusted to suit. In our testing, -g 20 -x 0 gave good results.
# adjust gap penalties
novoalign -d hg18.nix -f PacBio_reads.fastq -g 20 -x 0 -o SAM >alignments.sam 2>log.txt
You can use similar gap penalties to align Illumina reads to PacBio reads as a first step in correcting the PacBio reads.