Required database indexes
Databases for Exome mapping can be either of the following
- Whole genome sequences
- Transcript/gene sequences with introns
- Make fasta files of these sequences and index them with novoindex
#Whole Genome database for Illumina/454 mapping novoindex hg19.nix hg19.fasta
#Whole Genome database for Illumina/454 mapping novoindex -c hg19.ncx hg19.fasta
For SOLiD mapping create a colorspace index
#Exon database for Colorspace mapping novoindex -c hg19.exons.ncx exons.fasta
In the examples below we use the whole genome sequence database.
Exome command line examples
Some whole-exome targeting strategies may leave a 5′ or 3′ terminal adaptor on the end of the read. These can be removed by Novoalign. It is highly recommended that you install the samtools package. We use novoalign’s read quality calibration to improve the quality of alignments.
Novoalign can be used with all major NGS platforms i.e. Illumina (paired,single and mate-pair) and Ion Torrent, as well as PacBio reads. PacBio reads may need to be split into 1Kb chunks to align more efficiently
A 3′ adaptor is present on all reads. Strip these off and align the read. Generate SAM output format:
#All alignment statistics are written to STDERR when the SAM output format is specified novoalign -k -d hg18.nix -f Reads1.2.fastq Reads1.2.fastq -a -o SAM "@RG\tID:1111\tSM:mySample\tPL:illumina" 2> stats.txt | samtools -bS - > Reads1.bam
The code above aligns paired-end reads with the adaptor specified with “-a”. For Default Illumina adaptors use “-a” without any args.
Novoalign will auto-detect the FASTQ encoding in version 3 or higher. Otherwise use “-F STDFQ” for standard Sanger FASTQ format.
For single-end read applications e.g. Ion Torrent data, we can use Novoalign to align the reads. In the case of a 5′ adaptor present on the short read sequences:
#Create BAM output for all read alignments. Reads with the 5' adaptor string are preprocessed to remove the adaptor sequence novoalign -k -d hg19.nix -f IonReads.fastq -a -o SAM 2> stats.txt | samtools -bS - > Reads134.bam
FASTA format is also accepted by Novoalign.
Report all alignments for every read in a given sample
#Use the "-r" option to control the reporting strategy to display all alignments. Speed performance is affected when using this option. novoalign -d hg19_exons.nix -f Reads134.fastq -a -r All -o SAM 2> stats.txt | samtools -bS - > Reads134.bam
Sorting the BAM file with multiple cores
We use Novosort to perform an efficient sort of the BAM file and optionally remove or mark PCR duplicates at the same time. This feature replaces one or more Picard/Samtools calls into 1 command that reduces the number of steps needed for your pipeline. Be sure to use the latest version of Novosort for doing the following.
To sort without removing duplicates, using 24 cores and 14Gb of memory:
novosort -m 14g -t . -c 24 Reads134.bam -i -o Sorted.bam
To sort and remove duplicates, using 24 cores and 14Gb of memory:
novosort -m 14g -t . -c 24 --removeduplicates --keeptags Reads134.bam -i -o Sorted.bam
To sort-merge previously sorted files and remove duplicates…
novosort -m 14g -t . -c 24 --removeduplicates --keeptags lane1.bam lane2.bam lane3.bam -i -o Sorted.bam
Novosort also handles binary BAM-formatted input from STDIN.
Ion Torrent (LifeTech) Alignment
Novoalign may be used to align Ion Torrent reads to a reference genome. These are typically single-end reads
novoalign -d hg19.nix -f Ion_reads.fastq.gz -o SAM -g 20 -x 6 | samtools view -bS – > Ion_reads.bam
Align Ion Torrent reads and write output in BAM format lowering gap open (“-g”) and extend(“-x”) penalties.
Colorspace (SOLiD) Alignment
Align ABI SOLiD colorspace reads and write output in SAM format
novoalignCS -d hg18.ncx -f SRR040807.fastq.gz -o SAM > SRR040807.sam 2> stats.txt
Align ABI SOLiD colorspace reads and write output in SAM format
novoalignCS -d hg18.ncx -f SRR040807.fastq.gz -o SAM 2> stats.txt | samtools view -bS - >SRR040807.bam
Remove PCR duplicates
It is recommended to remove or mark PCR duplicates before calculating gene/exon coverage. We use novosort to remove PCR and optical duplicates or reads with identical mapping coordinates.
novosort -m 8g -c 9 --removeduplicates --keeptags SRR040807.bam -o SRR040807.deduped.bam
Finding alignments on target
If we have mapped the sequencing reads to a standard reference genome e.g. Human, Mouse, Arabidopsis then it’s quite likely that gene annotations are available for this genome. Depending on the exome capture method there should exist list of all targets using the same reference genome. In our example below we use the Nimblegen seqcap annotations for the human genome UCSC hg18 version.
Convert the seqcap annotations to BED format for use with the BedTools package. We use the sorted and deduplicated BAM file generated above to calculate the coverage of exome capture targets.
# Note the BED file must be sorted. intersectBed is part of the BedTools package intersectBed -wa -abam SRR040807.deduped.bam -b Exome_annotations.bed > SRR040807.ontarget.bam
Note that the BED file used can be any annotation so long as it matches up to the reference sequence database that the reads were mapped to with novoalign.
Calculate gene/exon coverage
In cases where a reference genome is well annotated the annotations may be used to calculate gene coverage. The annotations below are examples of human Refseq gene annotations extracted from the UCSC genome database:
#gene_annotations.bed #BED-formatted gene annotations chr1 1298972 1300443 NM_001127229 0 - 1299043 1299999 0 4 173,446,86,204, 0,270,975,1267, chr1 1298973 1300425 NM_001127230 0 - 1299043 1299999 0 4 172,446,86,108, 0,269,974,1344, chr1 1310954 1324581 NM_030937 0 - 1312473 1324549 0 11 1871,93,112,142,105,100,65,121,110,75,320, 0,2065,2242,4518,4747,5054,7684,9682,12521,12885,13307,
We can calculate coverage of our short reads using the BEDTools coverageBed program
#calculate gene coverage coverageBed -abam SRR040807.ontarget.bam -b gene_annotations.bed > SRR040807.coverage.txt
Other programs such as Picard or GATK may be used to accomplish the same task.