Basic Short Read Mapping

Input data

Novoalign accepts many input read formats. The two most commonly used with NGS platforms are:

FASTA
FASTQ

Novoalign accepts both formats, and licensed versions can also read these files when they are gzip-compressed.
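Before starting a long alignment it can help to confirm which of the two formats a read file actually uses. The following is a minimal sketch, not part of Novoalign: the function name detect_read_format is hypothetical, and it assumes that FASTA records begin with ">" and FASTQ records begin with "@" on the first line of the file. Files with leading comment lines will be reported as "unknown".

```shell
# Hypothetical helper (not a Novoalign tool): guess the read format from the
# first character of the file, decompressing gzip input on the fly.
detect_read_format() {
    file="$1"
    case "$file" in
        *.gz) first=$(gzip -dc "$file" | head -c 1) ;;   # gzip-compressed input
        *)    first=$(head -c 1 "$file") ;;              # plain-text input
    esac
    case "$first" in
        ">") echo "FASTA" ;;
        "@") echo "FASTQ" ;;
        *)   echo "unknown" ;;
    esac
}
```

For example, detect_read_format SRR040810_1.fastq.gz should print FASTQ for a well-formed file.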

Database Index formatting

Format a human reference database for Illumina GAII/HiSeq read mapping:

novoindex  hg18.nix  hg18.fasta

Reference sequences can be spread across multiple files and may be specified with wildcard expressions:

novoindex hg18_version2.nix   chr*.fasta
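When a wildcard expands to many files, it is easy to pick up an unintended file or the same sequence twice. The sketch below is one way to inspect what the wildcard matches before indexing. It is not part of Novoalign: the function names are hypothetical, and it assumes the sequence name is the text between ">" and the first whitespace on each header line.

```shell
# Hypothetical helpers (not Novoalign tools) for inspecting FASTA files
# matched by a wildcard before they are passed to novoindex.

# Print each sequence name, one per line, across all given files.
list_fasta_names() {
    grep -h '^>' "$@" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Print any sequence name that occurs more than once across the files.
duplicate_fasta_names() {
    list_fasta_names "$@" | sort | uniq -d
}
```

For example, duplicate_fasta_names chr*.fasta prints nothing when every sequence name is unique.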

Using IUPAC codes for known SNPs

novoutil iupac dbsnp.vcf  chr*.fasta > hg18.dbsnp.fa
novoindex hg18.nix hg18.dbsnp.fa

Format a genome database for SOLiD mapping. Unlike Bioscope or other mapping tools, the entire human reference genome can be indexed in one step, which takes a few minutes.

novoindex -c hg18.ncx  hg18.fasta

Format a genome database for bisulfite sequence mapping

novoindex -b hg18.nbx  hg18.fasta

Alignment

Single-end or fragment Illumina read mapping

# A database index and read file are always required
novoalign -d hg18.nix -f SRR040810_1.fastq.gz -o SAM > alignments.sam 2> log.txt
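The command above writes alignments to standard output and the run log to standard error, so the two redirections keep them in separate files. The wrapper below is a minimal sketch of that pattern with a pre-flight check that both required inputs exist. The function name run_novoalign_single and the NOVOALIGN environment variable are hypothetical conveniences, not part of Novoalign; only the -d, -f and -o SAM options shown in this section are used.

```shell
# Hypothetical wrapper (not shipped with Novoalign): check inputs, then run a
# single-end alignment with SAM on stdout and the log on stderr.
# Usage: run_novoalign_single INDEX READS OUT_SAM LOG
run_novoalign_single() {
    index="$1"; reads="$2"; out="$3"; log="$4"
    # A database index and a read file are always required.
    [ -r "$index" ] || { echo "missing index: $index" >&2; return 1; }
    [ -r "$reads" ] || { echo "missing reads: $reads" >&2; return 1; }
    # NOVOALIGN is an assumed override for the executable path.
    "${NOVOALIGN:-novoalign}" -d "$index" -f "$reads" -o SAM > "$out" 2> "$log"
}
```

For example, run_novoalign_single hg18.nix SRR040810_1.fastq.gz alignments.sam log.txt is equivalent to the command above when both input files are present.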

Process only the first 100,000 reads and quit

novoalign -d hg18.nix -f SRR040810_1.fastq.gz -# 100K -o SAM > alignments.sam 2> log.txt

Aligning Illumina reads

Align Illumina paired-end reads to a reference genome. The expected size distribution for this sequencing run has mean = 300 and standard deviation = 50.

# Note that novoalign accepts gzip-compressed input read files
novoalign -d hg18.nix -f SRR040810_1.fastq.gz SRR040810_2.fastq.gz -i 300,50 -o SAM > alignments.sam 2> log.txt
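If the mean and standard deviation for "-i" are not known in advance, one rough way to estimate them is from the template length field (TLEN, column 9) of an existing SAM file, for instance from a small pilot alignment. The sketch below assumes that approach; it is not a Novoalign feature, and the function name estimate_insert_size is hypothetical. It uses only positive TLEN values, so each pair is counted once, and reports the population standard deviation without filtering outliers.

```shell
# Hypothetical helper (not a Novoalign tool): print "mean,sd" of the positive
# TLEN values (SAM column 9) in a SAM file, rounded to whole bases.
estimate_insert_size() {
    awk -F '\t' '
        /^@/ { next }                                   # skip SAM header lines
        $9 > 0 { n++; sum += $9; sumsq += $9 * $9 }     # one mate per pair
        END {
            if (n == 0) { print "NA"; exit }
            mean = sum / n
            var = sumsq / n - mean * mean               # population variance
            if (var < 0) var = 0                        # guard against rounding
            printf "%.0f,%.0f\n", mean, sqrt(var)
        }
    ' "$1"
}
```

The printed value has the same mean,sd layout as the argument to "-i", but it should be treated as a starting point and checked against the library preparation protocol.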

Align Illumina mate-pair/jumping library reads to a reference genome. The expected size distribution for this sequencing run has mean = 4000 and standard deviation = 500.

novoalign -d hg18.nix -f Mate1.fastq  Mate2.fastq  -i MP 4000,500 -o SAM >alignments.sam 2> log.txt

Align Illumina mate-pair/jumping library reads to a reference genome. Two fragment length distributions can be given to the aligner: the first, with mean,sd = 4000,500, describes the first fragmentation step, and the second, with mean,sd = 250,80, describes the secondary fragmentation. This mode allows for the mixture of mate-pair and paired-end reads that results from incomplete biotin enrichment.

novoalign -d hg18.nix -f Mate1.fastq  Mate2.fastq  -i MP 4000,500  250,80 -o SAM >alignments.sam 2> log.txt

Align ABI SOLiD mate pair reads in CSFASTA format to a reference genome

# Assumes Rosalind_20080729_2_Chris5_F3_QV.qual and Rosalind_20080729_2_Chris5_R3_QV.qual exist in the same folder as the csfasta files!
novoalignCS -d ecoli.ncx -f Rosalind_20080729_2_Chris5_F3.csfasta Rosalind_20080729_2_Chris5_R3.csfasta -i MP 4000,600 -o SAM >alignments.sam 2>log.txt

Align ABI SOLiD reads in CSFASTQ format

# writes output to STDOUT by default
novoalignCS -d hg18.ncx  -f SRR040808.fastq.gz -o SAM  >SRR040808.sam

Align 454 paired-end reads

# set the mate orientation using "-i ++" as in the case of 454 mate pairs
novoalign -d hg18.nix  -f 454_reads1.fastq  454_reads2.fastq -i ++ 5000 800 -o SAM >alignments.sam 2>log.txt

In general the “-i” option may be used to specify the orientation model of read pairs for novoalign to process.

Align PacBio (Pacific Biosciences) reads

Novoalign can be used with PacBio reads; however, use Novoalign version 3.0 or higher to achieve good performance. Because of the higher error rates in PacBio SMRT reads, the gap open (“-g”) and gap extend (“-x”) penalties need to be adjusted. In our testing, -g 20 -x 0 gives good results.

# adjust gap penalties
novoalign -d hg18.nix  -f PacBio_reads.fastq -g 20 -x 0 -o SAM >alignments.sam 2>log.txt

You can use similar gap penalties to align Illumina reads to PacBio reads as a first step in correcting the PacBio reads.
