Novoalign can report alignments in the Sequence Alignment/Map (SAM) format, and many programs exist that read and process SAM/BAM files.
The Picard library can be used to validate Novoalign output and to run various processing jobs on it.
Generate the Alignments
Run Novoalign/NovoalignCS as per usual and select SAM output format:
# Use $'@RG\tID:readgroup\tPU:platform unit\tLB:library' to specify the Read Group (RG) tag required by Picard
novoalign -d hg18.nix -f SRR040810_1.fastq.gz -o SAM $'@RG\tID:readgroup\tPU:platform unit\tLB:library' 2>novoalign.stats > alignments.sam
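Because Picard relies on the read group, it can be worth confirming that the @RG line actually reached the SAM header before going further. The following is a minimal sketch: `has_read_group` is a hypothetical helper name, and the three-line header written to a temporary file is a toy stand-in for alignments.sam, not real Novoalign output.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the SAM header has an @RG line with an ID field.
has_read_group() {
    awk -F'\t' '
        # Look for an ID: field on any @RG header line.
        /^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^ID:/) found = 1 }
        # Header lines start with "@"; stop at the first alignment record.
        !/^@/  { exit }
        END    { exit(found ? 0 : 1) }
    ' "$1"
}

# Toy header for illustration only; in practice pass alignments.sam.
sample=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000\n@RG\tID:readgroup\tPU:platform unit\tLB:library\n' > "$sample"

if has_read_group "$sample"; then
    echo "read group present"
else
    echo "read group missing"
fi
rm -f "$sample"
```

On a real run the call would simply be `has_read_group alignments.sam`.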
Convert the SAM file to BAM format using samtools:
# Creates SRR040810_1.bam
samtools view -bS alignments.sam | samtools sort - SRR040810_1
Alternatively, convert to BAM format using Picard:
java -jar /path_to_picard/SamFormatConverter.jar I=alignments.sam O=SRR040810_1.bam
Run Picard Validation
Now that we have our BAM file, we can run the Picard validation on it as below (ValidateSamFile accepts either SAM or BAM input):
java -jar /path_to_picard/ValidateSamFile.jar I=SRR040810_1.bam O=validate_report.txt
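The validation report can be long, so a quick tally is often a useful first look. The sketch below assumes that each problem is reported on its own line beginning with `ERROR` or `WARNING`; that layout is an assumption and should be checked against a real validate_report.txt. `summarise_report` is a hypothetical helper, and the report contents here are placeholder text, not real Picard messages.

```shell
#!/bin/sh
# Hypothetical helper: count lines starting with ERROR or WARNING in a report.
summarise_report() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    errors=$(grep -c '^ERROR' "$1" || true)
    warnings=$(grep -c '^WARNING' "$1" || true)
    echo "errors=$errors warnings=$warnings"
}

# Placeholder report for illustration only; in practice pass validate_report.txt.
report=$(mktemp)
printf 'ERROR: example error message\nWARNING: example warning message\nWARNING: another example warning\n' > "$report"

summarise_report "$report"   # prints: errors=1 warnings=2
rm -f "$report"
```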
Other Picard utility examples
Other Picard utilities can be run once the SAM and BAM files are present. Note that many Picard operations require a reference genome file in FASTA format to be supplied on the command line.
BAM alignment file indexing:
# Builds a BAM index
java -jar /path_to_picard/BuildBamIndex.jar I=SRR040810_1.bam O=SRR040810_1.bam.bai
PCR duplicate marking in a BAM file:
# Marks PCR sequence duplicates in a BAM file and writes this information to the .rmdup.bam file
# (add REMOVE_DUPLICATES=true to drop the duplicates instead of only flagging them)
java -jar /path_to_picard/MarkDuplicates.jar I=SRR040810_1.bam O=SRR040810_1.rmdup.bam M=duplicate_report.txt
Generate alignment statistics report from a sorted BAM file:
# The reference genome, hg18.fasta, is required here
java -jar /path_to_picard/CollectAlignmentSummaryMetrics.jar I=SRR040810_1.sorted.bam R=hg18.fasta O=alignment_metrics.txt
Note that, to run Picard tools successfully, the reference genome sequences in FASTA format must match the reference sequences used to build the database index for Novoalign. A Picard sequence dictionary is also required for commands where “R=hg18.fasta” is used. See the Picard CreateSequenceDictionary utility for how to create one.
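A mismatch between the FASTA file and the index is easiest to catch by comparing sequence names before running Picard. The sketch below compares the names in a FASTA file with the @SQ entries of a SAM header, in file order; it does not check sequence lengths. The helper names and the two toy files are hypothetical, and the commented CreateSequenceDictionary line assumes the same per-tool jar layout used in the commands above.

```shell
#!/bin/sh
# Dictionary creation, assuming the per-tool jar layout used elsewhere on this page:
# java -jar /path_to_picard/CreateSequenceDictionary.jar R=hg18.fasta O=hg18.dict

# Hypothetical helper: sequence names from a FASTA file (text after ">" up to the first space).
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Hypothetical helper: SN: values from the @SQ header lines of a SAM file.
sam_names() {
    awk -F'\t' '
        /^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }
        !/^@/  { exit }
    ' "$1"
}

# Succeeds only if both files list the same names in the same order.
same_references() {
    [ "$(fasta_names "$1")" = "$(sam_names "$2")" ]
}

# Toy inputs for illustration only; in practice use hg18.fasta and alignments.sam.
fa=$(mktemp); sam=$(mktemp)
printf '>chr1 toy sequence\nACGTACGT\n>chr2\nTTGGCCAA\n' > "$fa"
printf '@HD\tVN:1.0\n@SQ\tSN:chr1\tLN:8\n@SQ\tSN:chr2\tLN:8\n' > "$sam"

if same_references "$fa" "$sam"; then
    echo "reference names match"
else
    echo "reference names differ"
fi
rm -f "$fa" "$sam"
```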
The full collection of Picard utilities may be used on Novoalign SAM/BAM output.