Novoalign detects mismatches and short indels in short reads from next-generation sequencing platforms. The recipes below contain command line usage to generate list of single nucleotide polymorphisms SNPs and Indels for further analysis.
Generate sorted alignments in BAM format
Initially we need to align our file(s) to the reference genome:
novoalign -d hg18.nix -f reads1.fastq -o SAM 2> stats.txt > novo.sam samtools view -bS novo.sam | samtools sort - reads1
novoalignCS -d hg18.ncx -f reads1.csfastq -o SAM 2> stats.txt > novocs.sam samtools view -bS novocs.sam | samtools sort - reads1For paired-end or mate-pair reads adjust the parameters accordingly.
If we have multiple alignment files these all need to be sorted individually and merged with either samtools or Picard.
Remove PCR duplicates
It is highly recommended to remove PCR duplicates before proceeding with the SNP and Indel calls. We use samtools as an example to remove PCR duplicates.
#remove PCR duplicates from single-end reads samtools rmdup -s reads1.bam reads1.sorted.bam #For Paired reads samtools rmdup reads1.bam reads1.pe.sorted.bam
SNP and Microindel calling
In this example we use the samtools pileup SNP caller and accept reads with a minimum mapping quality of 20:
#The hg18.fasta reference genome sequence is required. samtools.pl is also required samtools view -u -q 20 reads1.bam | samtools pileup -vc -f hg18.fasta - | samtools.pl varFilter -D 1000 -G 50 > variations.list
Filtering and final list generation
A simple awk script can be used to filter our list of variations where the indel SNP quality is greater than 50 or 60 respectively:
awk '($3=="*"&&$6>=50)||($3!="*"&&$6>30)' variations.list > variations.filtered.list
Exporting to other formats
The samtools distribution contains a helper script called sam2vcf.pl that may be used to convert to the Variant Call Format (VCF):
sam2vcf.pl variations.filtered.list -r hg18.fasta > variations.vcf