Exome Sequencing Analysis

Required database indexes

Databases for Exome mapping can be either of the following

Whole genome sequences
Transcript/gene sequences with introns
Make fasta files of these sequences and index them with novoindex

#Whole Genome database for Illumina/454 mapping
novoindex  hg19.nix hg19.fasta

#Whole Genome database for Illumina/454 mapping
novoindex -c  hg19.ncx hg19.fasta

For SOLiD mapping create a colorspace index

#Exon database for Colorspace mapping
novoindex -c hg19.exons.ncx exons.fasta

In the examples below we use the whole genome sequence database.

Exome command line examples

Some whole-exome targeting strategies may leave a 5′ or 3′ terminal adaptor on the end of the read. These can be removed by Novoalign. It is highly recommended that you install the samtools package. We use novoalign’s read quality calibration to improve the quality of alignments.
Novoalign can be used with all major NGS platforms i.e. Illumina (paired,single and mate-pair) and Ion Torrent, as well as PacBio reads. PacBio reads may need to be split into 1Kb chunks to align more efficiently

A 3′ adaptor is present on all reads. Strip these off and align the read. Generate SAM output format:

#All alignment statistics are written to STDERR when the SAM output format is  specified
novoalign -k -d hg18.nix  -f Reads1.2.fastq Reads1.2.fastq  -a  -o SAM  "@RG\tID:1111\tSM:mySample\tPL:illumina" 2> stats.txt | samtools -bS -  > Reads1.bam

The code above aligns paired-end reads with the adaptor specified with “-a”. For Default Illumina adaptors use “-a” without any args.

Novoalign will auto-detect the FASTQ encoding in version 3 or higher. Otherwise use “-F STDFQ” for standard Sanger FASTQ format.

For single-end read applications e.g. Ion Torrent data, we can use Novoalign to align the reads. In the case of a 5′ adaptor present on the short read sequences:

#Create BAM output for all read alignments. Reads with the 5' adaptor string are preprocessed to remove the adaptor sequence
novoalign -k -d hg19.nix  -f IonReads.fastq -a -o SAM 2> stats.txt | samtools -bS -  > Reads134.bam

FASTA format is also accepted by Novoalign.

Report all alignments for every read in a given sample

#Use the "-r" option to control the reporting strategy to display all alignments. Speed performance is affected when using this option.
novoalign -d hg19_exons.nix  -f Reads134.fastq -a -r All -o SAM 2> stats.txt | samtools -bS -  > Reads134.bam

Sorting the BAM file with multiple cores

We use Novosort to perform an efficient sort of the BAM file and optionally remove or mark PCR duplicates at the same time. This feature replaces one or more Picard/Samtools calls into 1 command that reduces the number of steps needed for your pipeline. Be sure to use the latest version of Novosort for doing the following.

To sort without removing duplicates, using 24 cores and 14Gb of memory:

novosort  -m 14g -t . -c 24  Reads134.bam -i -o Sorted.bam

To sort and remove duplicates, using 24 cores and 14Gb of memory:

novosort  -m 14g -t . -c 24 --removeduplicates --keeptags Reads134.bam -i -o Sorted.bam

To sort-merge previously sorted files and remove duplicates…

novosort  -m 14g -t . -c 24 --removeduplicates --keeptags lane1.bam lane2.bam lane3.bam  -i -o Sorted.bam

Novosort also handles binary BAM-formatted input from STDIN.

Ion Torrent (LifeTech) Alignment

Novoalign may be used to align Ion Torrent reads to a reference genome. These are typically single-end reads

novoalign -d hg19.nix -f Ion_reads.fastq.gz -o SAM -g 20 -x 6 | samtools view -bS – > Ion_reads.bam

Align Ion Torrent reads and write output in BAM format lowering gap open (“-g”) and extend(“-x”) penalties.

Colorspace (SOLiD) Alignment

Align ABI SOLiD colorspace reads and write output in SAM format

novoalignCS -d hg18.ncx  -f SRR040807.fastq.gz -o SAM  > SRR040807.sam 2> stats.txt

Align ABI SOLiD colorspace reads and write output in SAM format

novoalignCS -d hg18.ncx  -f SRR040807.fastq.gz -o SAM  2> stats.txt | samtools view -bS - >SRR040807.bam

Remove PCR duplicates

It is recommended to remove or mark PCR duplicates before calculating gene/exon coverage. We use novosort to remove PCR and optical duplicates or reads with identical mapping coordinates.

novosort -m 8g -c 9 --removeduplicates --keeptags  SRR040807.bam -o SRR040807.deduped.bam

Finding alignments on target

If we have mapped the sequencing reads to a standard reference genome e.g. Human, Mouse, Arabidopsis then it’s quite likely that gene annotations are available for this genome. Depending on the exome capture method there should exist list of all targets using the same reference genome. In our example below we use the Nimblegen seqcap annotations for the human genome UCSC hg18 version.

Convert the seqcap annotations to BED format for use with the BedTools package. We use the sorted and deduplicated BAM file generated above to calculate the coverage of exome capture targets.

# Note the BED file must be sorted. intersectBed  is part of the BedTools package
 intersectBed -wa -abam SRR040807.deduped.bam  -b Exome_annotations.bed > SRR040807.ontarget.bam

Note that the BED file used can be any annotation so long as it matches up to the reference sequence database that the reads were mapped to with novoalign.

Calculate gene/exon coverage

In cases where a reference genome is well annotated the annotations may be used to calculate gene coverage. The annotations below are examples of human Refseq gene annotations extracted from the UCSC genome database:

#gene_annotations.bed
#BED-formatted gene annotations
chr1	1298972	1300443	NM_001127229	0	-	1299043	1299999	0	4	173,446,86,204,	0,270,975,1267,
chr1	1298973	1300425	NM_001127230	0	-	1299043	1299999	0	4	172,446,86,108,	0,269,974,1344,
chr1	1310954	1324581	NM_030937	0	-	1312473	1324549	0	11	1871,93,112,142,105,100,65,121,110,75,320,	0,2065,2242,4518,4747,5054,7684,9682,12521,12885,13307,

We can calculate coverage of our short reads using the BEDTools coverageBed program

#calculate gene coverage
coverageBed -abam SRR040807.ontarget.bam -b gene_annotations.bed > SRR040807.coverage.txt

Other programs such as Picard or GATK may be used to accomplish the same task.

Documentation

Exome Sequencing Analysis

Required database indexes

Exome command line examples

Sorting the BAM file with multiple cores

Ion Torrent (LifeTech) Alignment

Colorspace (SOLiD) Alignment

Remove PCR duplicates

Finding alignments on target

Calculate gene/exon coverage

LATEST NEWS

Novocraft and Basepair Inc. Announce Strategic Partnership to Deliver Advanced Genomic Pipelines in the Cloud

Novoalign V4.03.01

Novoalign V4.03.00 and Novosort V3.00.00

Contact Us