Novoalign SAM Format Validation

A key feature of Novoalign is that it reports alignments in the Sequence Alignment/Map (SAM) format. Many SAM/BAM readers and tools exist to process this output.

The Picard library can be used to validate Novoalign output and to run various processing jobs on it.

Generate the Alignments

Run Novoalign/NovoalignCS as per usual and select SAM output format:

#Use the $'@RG\tID:readgroup\tPU:platform unit\tLB:library' argument to specify the Read Group (RG) header required by Picard
novoalign -d hg18.nix -f SRR040810_1.fastq.gz -o SAM $'@RG\tID:readgroup\tPU:platform unit\tLB:library' 2>novoalign.stats > alignments.sam
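
Before handing the file to Picard, it can be worth confirming that the read group actually made it into the output. The sketch below is not part of Novoalign or Picard: `check_read_group` is a hypothetical helper that uses only grep and awk, and it assumes the aligner writes an `RG:Z:` tag on every alignment record in addition to the `@RG` header line.

```shell
#!/bin/sh
# Hypothetical helper (not a Novoalign or Picard command): report whether
# a SAM file has an @RG header line and an RG:Z: tag on every record.
check_read_group() {
    sam="$1"
    if ! grep -q '^@RG' "$sam"; then
        echo "no @RG header line in $sam"
        return 1
    fi
    # Alignment records are the non-empty lines that do not start with '@';
    # optional tags begin at column 12 of the tab-separated record.
    missing=$(awk -F '\t' '
        !/^@/ && NF {
            found = 0
            for (i = 12; i <= NF; i++) if ($i ~ /^RG:Z:/) found = 1
            if (!found) n++
        }
        END { print n + 0 }' "$sam")
    if [ "$missing" -gt 0 ]; then
        echo "$missing records without an RG tag in $sam"
        return 1
    fi
    echo "read group present in $sam"
}
```

After sourcing the function, `check_read_group alignments.sam` prints a one-line verdict and returns a non-zero status if the header line or any per-record tag is missing.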

Convert the SAM file to BAM format using samtools:

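#creates SRR040810_1.bam (samtools 0.1.x syntax; the second argument to sort is an output prefix)
samtools view -bS alignments.sam | samtools sort - SRR040810_1
#with samtools 1.3 or later the prefix form is no longer accepted; use instead:
#samtools sort -o SRR040810_1.bam alignments.sam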

Alternatively, convert to BAM format using Picard:

java -jar /path_to_picard/SamFormatConverter.jar I=alignments.sam O=SRR040810_1.bam

Run Picard Validation

Now that we have our BAM file, we can run the Picard validation as below (ValidateSamFile accepts either SAM or BAM input):

java -jar /path_to_picard/ValidateSamFile.jar I=SRR040810_1.bam O=validate_report.txt
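
The validation report can be long, so a quick tally is often the first thing to look at. The sketch below assumes the report was written in ValidateSamFile's default verbose mode, where each problem is reported on a line beginning with `ERROR` or `WARNING`; `count_validation_issues` is a hypothetical helper name, not a Picard utility.

```shell
#!/bin/sh
# Hypothetical helper: tally ERROR and WARNING lines in a verbose-mode
# ValidateSamFile report. Returns non-zero if any ERROR line was found.
count_validation_issues() {
    report="$1"
    awk '
        /^ERROR/   { e++ }
        /^WARNING/ { w++ }
        END {
            printf "errors=%d warnings=%d\n", e, w
            exit (e > 0)
        }' "$report"
}
```

For example, `count_validation_issues validate_report.txt` prints the two counts on one line. ValidateSamFile also documents a MODE=SUMMARY option that produces a per-type count directly; the helper above is only meant for reports already generated in verbose mode.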

Other Picard utility examples

Once the SAM and BAM files are present, other Picard utilities can be run on them. Note that many Picard operations require a reference genome file in FASTA format to be supplied on the command line.

BAM alignment file indexing:

#Builds a BAM index
java -jar /path_to_picard/BuildBamIndex.jar I=SRR040810_1.bam O=SRR040810_1.bam.bai

PCR duplicate marking in a BAM file:

#Marks PCR duplicates in a coordinate-sorted BAM file and writes the result to the .rmdup.bam file; add REMOVE_DUPLICATES=true to drop the duplicates instead of only flagging them
java -jar /path_to_picard/MarkDuplicates.jar I=SRR040810_1.bam O=SRR040810_1.rmdup.bam M=duplicate_report.txt

Generate alignment statistics report from a sorted BAM file:

# The reference genome, hg18.fasta, is required here; SRR040810_1.bam is the sorted BAM produced by samtools sort above
java -jar /path_to_picard/CollectAlignmentSummaryMetrics.jar I=SRR040810_1.bam R=hg18.fasta O=alignment_metrics.txt

Note that to run Picard tools successfully, the reference genome sequences in FASTA format must match the reference sequences used to build the Novoalign database index. Picard sequence dictionaries are also required for commands that take “R=hg18.fasta”. See the Picard CreateSequenceDictionary utility for how to create one.
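
A mismatch between the FASTA file and the indexed reference usually shows up first as differing sequence names. The sketch below is a rough pre-check under stated assumptions: `compare_reference_names` is a hypothetical helper, it takes the FASTA name to be the text after `>` up to the first whitespace, and it compares names only, not sequence lengths or content.

```shell
#!/bin/sh
# Hypothetical helper: list sequence names found in only one of the FASTA
# reference and the SAM header (@SQ SN: fields). Names only, not lengths.
compare_reference_names() {
    fasta="$1"
    sam="$2"
    fa_names=$(mktemp)
    sq_names=$(mktemp)
    # FASTA name: first whitespace-delimited token with the leading '>' removed.
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | sort -u > "$fa_names"
    # SAM header: the SN: field of each @SQ line.
    awk -F '\t' '/^@SQ/ {
        for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4)
    }' "$sam" | sort -u > "$sq_names"
    # comm -3 keeps only names unique to one of the two sorted lists.
    only_one=$(comm -3 "$fa_names" "$sq_names")
    rm -f "$fa_names" "$sq_names"
    if [ -n "$only_one" ]; then
        echo "reference names differ:"
        printf '%s\n' "$only_one"
        return 1
    fi
    echo "reference names match"
}
```

With the files from this page the call would be `compare_reference_names hg18.fasta alignments.sam`. For the dictionary itself, the per-tool jar layout used above suggests a command of the form `java -jar /path_to_picard/CreateSequenceDictionary.jar R=hg18.fasta O=hg18.dict`; treat the output name hg18.dict as an assumption and check it against the Picard documentation for your version.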

The full collection of Picard utilities may be used on Novoalign SAM/BAM output.

