NovoAlign SAM Format Validation

A key feature of Novoalign is that it can report alignments in the Sequence Alignment/Map (SAM) format. Many programs read SAM and its binary equivalent, BAM, so this output can be processed by a wide range of downstream tools.

The Picard library can be used to validate Novoalign's SAM output and to run various processing jobs on it.

Generate the Alignments

Run Novoalign/NovoalignCS as usual and select the SAM output format:

#Pass $'@RG\tID:readgroup\tPU:platform unit\tLB:library' after -o SAM to set the Read Group (RG) header required by Picard
#The $'...' quoting is a bash feature that turns each \t into a real tab character
novoalign -d hg18.nix -f SRR040810_1.fastq.gz -o SAM $'@RG\tID:readgroup\tPU:platform unit\tLB:library' 2>novoalign.stats > alignments.sam
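
Before handing the file to Picard, it can be worth confirming that the read group actually made it into the output. The following is a minimal sketch using only grep and awk; it assumes that each alignment record should carry an RG:Z: tag matching the @RG header, and it runs on a tiny hand-written SAM file (a hypothetical stand-in for alignments.sam, not real Novoalign output).

```shell
# Hypothetical two-record SAM file standing in for alignments.sam
sam=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n' > "$sam"
printf '@SQ\tSN:chr1\tLN:1000\n' >> "$sam"
printf '@RG\tID:readgroup\tPU:platform unit\tLB:library\n' >> "$sam"
printf 'read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:readgroup\n' >> "$sam"
printf 'read2\t0\tchr1\t20\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"

# Count @RG lines in the header
rg_headers=$(grep -c '^@RG' "$sam")

# Count alignment records that have no RG:Z: tag among the optional
# fields (column 12 onward); header lines start with @ and are skipped
missing_rg=$(awk -F'\t' '
    !/^@/ {
        found = 0
        for (i = 12; i <= NF; i++) if ($i ~ /^RG:Z:/) found = 1
        if (!found) n++
    }
    END { print n + 0 }' "$sam")

echo "RG header lines: $rg_headers"
echo "records without RG tag: $missing_rg"
rm -f "$sam"
```

To run the same check on a real file, set sam=alignments.sam and drop the mktemp, printf and rm lines.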

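Convert the SAM file to a sorted BAM file using samtools:

#creates SRR040810_1.bam
#With samtools 1.3 or later, use instead: samtools view -bS alignments.sam | samtools sort -o SRR040810_1.bam -
samtools view -bS alignments.sam | samtools sort - SRR040810_1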

Alternatively, convert to BAM format using Picard:

java -jar /path_to_picard/SamFormatConverter.jar I=alignments.sam O=SRR040810_1.bam

Run Picard Validation

Picard's ValidateSamFile accepts either SAM or BAM input. The example below validates the SAM file directly and writes the findings to validate_report.txt:

java -jar /path_to_picard/ValidateSamFile.jar I=alignments.sam O=validate_report.txt
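
A long validation report is easier to act on once it is summarised. The sketch below counts findings by severity; it assumes the report lists each finding on its own line prefixed with ERROR or WARNING, which should be checked against the output of your Picard version. The sample report is a hypothetical stand-in for validate_report.txt and its messages are placeholders, not real Picard output.

```shell
# Hypothetical stand-in for validate_report.txt (placeholder messages)
report=$(mktemp)
printf 'ERROR: placeholder message one\n' > "$report"
printf 'WARNING: placeholder message two\n' >> "$report"
printf 'ERROR: placeholder message three\n' >> "$report"

# Count findings by severity prefix
errors=$(grep -c '^ERROR' "$report")
warnings=$(grep -c '^WARNING' "$report")
echo "errors=$errors warnings=$warnings"

# Treat any ERROR line as a failed validation; warnings alone pass
if [ "$errors" -gt 0 ]; then
    verdict="invalid"
else
    verdict="ok"
fi
echo "$verdict"
rm -f "$report"
```

Treating warnings as non-fatal is a design choice made for this sketch; a stricter pipeline could fail on either count.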

Other Picard utility examples

Once the SAM and BAM files exist, other Picard utilities can be run on them. Note that many Picard operations require a reference genome file in FASTA format to be supplied on the command line. Some examples follow.

BAM alignment file indexing:

#Builds a BAM index
java -jar /path_to_picard/BuildBamIndex.jar I=SRR040810_1.bam O=SRR040810_1.bam.bai

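PCR duplicate marking and removal in a BAM file:

#Marks PCR sequence duplicates in a coordinate-sorted BAM file, writing the flagged records to the .rmdup.bam file and the metrics to duplicate_report.txt
#Add REMOVE_DUPLICATES=true to drop the duplicates from the output instead of only flagging them
java -jar /path_to_picard/MarkDuplicates.jar I=SRR040810_1.bam O=SRR040810_1.rmdup.bam M=duplicate_report.txt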
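
In the SAM format a duplicate is recorded by setting bit 0x400 (decimal 1024) in the FLAG column. As a minimal sketch of how to check the result independently of the metrics file, the awk script below counts records with that bit set. It runs on a small hand-written SAM fragment (hypothetical records, not real output) and uses integer arithmetic because POSIX awk has no bitwise operators.

```shell
# Hypothetical records: FLAG 0 (not a duplicate), 1024 (duplicate),
# 1040 (duplicate on the reverse strand: 1024 + 16)
sam=$(mktemp)
printf 'read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n' > "$sam"
printf 'read2\t1024\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"
printf 'read3\t1040\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"

# Bit 0x400 is set when int(FLAG / 1024) is odd
counts=$(awk -F'\t' '
    !/^@/ {
        total++
        if (int($2 / 1024) % 2 == 1) dups++
    }
    END { print dups + 0, total + 0 }' "$sam")

duplicates=${counts% *}
total=${counts#* }
echo "duplicates=$duplicates of $total records"
rm -f "$sam"
```

On a real BAM file the same awk script can read from a pipe, for example samtools view SRR040810_1.rmdup.bam | awk ..., since samtools view prints the records as SAM text.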

Generate alignment statistics report from a sorted BAM file:

# The reference genome, hg18.fasta is required here
java -jar /path_to_picard/CollectAlignmentSummaryMetrics.jar I=SRR040810_1.sorted.bam R=hg18.fasta O=alignment_metrics.txt

Note that Picard tools will only run successfully if the reference genome sequences in the FASTA file match the reference sequences used to build the Novoalign database index. Commands that take “R=hg18.fasta” also require a Picard sequence dictionary for that reference. See the Picard CreateSequenceDictionary utility for how to create one.
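
A quick way to spot a reference mismatch is to compare the sequence names in the FASTA file with the SN: fields of the @SQ lines in the SAM header. The sketch below does this with awk and sort on small hand-written files (hypothetical stand-ins for hg18.fasta and alignments.sam). It assumes the sequence name is the first whitespace-delimited word of each FASTA header, and it compares names only, not sequence lengths or order.

```shell
# Hypothetical stand-ins for hg18.fasta and the header of alignments.sam
fasta=$(mktemp)
sam=$(mktemp)
printf '>chr1 hypothetical description\nACGT\n>chr2\nGGCC\n' > "$fasta"
printf '@HD\tVN:1.0\n@SQ\tSN:chr2\tLN:4\n@SQ\tSN:chr1\tLN:4\n' > "$sam"

# Sequence names from the FASTA headers (first word, leading ">" removed)
fasta_names=$(awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | sort)

# Sequence names from the SN: field of each @SQ header line
sam_names=$(awk -F'\t' '
    /^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }
' "$sam" | sort)

if [ "$fasta_names" = "$sam_names" ]; then
    result="match"
else
    result="mismatch"
fi
echo "reference names: $result"
rm -f "$fasta" "$sam"

# (Not run here) Creating the dictionary, in the jar-per-tool style used above:
# java -jar /path_to_picard/CreateSequenceDictionary.jar R=hg18.fasta O=hg18.dict
```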

The full collection of Picard utilities can be used on Novoalign SAM/BAM output.
