Bisulphite Treated Reads

Alignment of Bisulphite treated sequences

From V2.1, Novoalign can align bisulphite treated sequences against a reference genome in a single pass that aligns against both C-T and G-A in-silco indexed reference sequences.

The process involves:

Novoindex builds a double index structure using both C-T and G-A converted k-mers
Novoalign will then align using both indexes and using both forward and complementary directions.

There is also an option for penalising unconverted cytosine in CHG or CHH positions.

Note.

With most Illumina Whole Genome Bisulphite reads we only see two of the four possible strands. This happens because different adapter/primer sequences are used for 5′ & 3′ ends of the fragments and, as 5′ adapter is used to attach fragments to the slides, only two of the possible 4 strands are present in the reads. If your sample prep has preserved strand direction then reads will be G rich C poor as in the following example. In this case, you should use the option -b2 and there is no need to use the unconverted cytosine penalty (-u option).

Note. Read 2 of paired reads will be reverse complement and should be C rich G poor if -b2 is applicable.

Example of Illumina Bi-seq reads (Read 1 of pairs) where only two of four possible strands are present. The option -b2 should be used with these reads.

@unknown_0006:2:26:1334:1994#0/1
CTTGTGGTAAAGATGGTTTGTTTTTAGATGGGTAGGATATATATTATTAGGAAATTTTATTTTTAGTAGGTTGGGAAGTATTATGTTTTATGATAAAAAAA
@unknown_0006:2:26:1334:532#0/1
TTTAAAGTAGTTAAAAGTAATGTGTTATTATTTTTATTTAGTTAATATTTATAAGTTTGTAAAGTTAGAGTGTATTTAAAAGTTAGGTGATATGTTATTTG
@unknown_0006:2:26:1334:1549#0/1
AATATATTTGGTTTTTGTTGGATTTTAGTAGATTTTGTTGGATGGTAAATTGTATTATTTTGTAATTTTTATTTTGTAAATGTATATTTTTGTTATTGTTT
@unknown_0006:2:26:1334:2019#0/1
AATATAGAAATAGTAGTTTTATAGTGTGATAATAAGATTTTAGTTATGTTATTTTATAGTTAGAATGGTTAGTTTATATTAATAGTGTGTGTATATGTATT
@unknown_0006:2:26:1334:1107#0/1
AAAATGTTAAAAGAAGTTATTTAGAGGAGAGGGAAGAAATTTGGATTTATATTTTAAAAAAAGGAAGATTATGATAAGGAATAATTTTTTTTAATATTTTA
@unknown_0006:2:26:1334:152#0/1
GATGGTTTTAATTTTTTGATTTCGTGATTTATTTATTTCGGTTTTTTAAAGTGTTGAGATTATAGGTGGGTGGGTTATGAGGGTAGGAGATTGAGATTATC

Targeted bisulphite sequencing projects that use amplicon PCR before addition of Illumina adaptors typically produce alignments in teh 4 possible orientations and would be run with -b4 -u8 options set.

Novoindex

Novoindex is used to build the bisulphite index structure. This involves building two hash table, one using C-T converted k-mers and the other with G-A conversion.

The bisulphite k-mer hash table is more memory efficient than a normal index as after conversion we have only 3 possible nucleotides in the index k-mers. This allows index k-mer length to be increased compared to a normal index. However as we build two indexes memory utilisation will likely increase compared to standard index structure.

A bisulphite mode index comprises five tables, the first two being doubled up for the CT and GA indexes.:

Two k-mer hash tables of size 4*3^k bytes
Two sequence offset tables of size 4N/s bytes where N is the length of the sequences being indexed and s is the step size.
A compressed sequence file of size N/2 bytes.

Novoindex Command Line Options for Bisulphite Mode

-b	Enables bisulphite indexing mode.
-k	Sets the k-mer size. It’s best to let this default unless you need to reduce memory utilisation. Typical values would be in range 17-19, with 19 as the maximum allowed.
-s	Sets the step size for the index. With s=1 every k-mer is indexed, at s=2 only every second k-mer is indexed. Using s>1 reduces memory and increases processing time but it does not reduce sensitivity or specificity. It’s probably best to let s default. Set it to 2 or higher if you need to reduce memory utilisation.

Example:

: novoindex -b a.thaliana.nbx refseq/a.thaliana/*.fas

Novoalign

Novoalign will automatically switch into bisulphite alignment mode if the reference sequence index was created with the -b option. In this mode, alignment for each read will involve four searches:

Forward against CT index
Reverse complement against CT index
Forward against GA index
Reverse complement against GA index

The search process is done iteratively starting with a search for an exact match and gradually increasing the mismatch threshold until an alignment is found.

Scoring of alignments is adjusted so that there is no penalty for the alignment of thymines (which are possibly converted cytosines) to cytosines in the reference sequence.

Novoalign Command Line Options for Bisulphite Mode

-t 99,99	Sets the alignment threshold. We suggest setting -t 20,3 for Bi-Seq reads.
–hlimit 99	Limits the alignment score threshold for low complexity reads. We suggest setting –hlimit 6 for Bi-Seq
-b 2	Sets mode where alignment process is forward against CT index and complementary against the GA index. Use this option if your sample preparation has preserved the strand of the reads (Examination of read sequences will show they are typically G rich, C poor).
-u 99	Defines a penalty for unconverted cytosines at CHG or CHH. Default 0. For plants try 6, and for vertebrates 12. Please refer to reference manual for more details. Using an unconverted cytosine penalty has the secondary effect of improving performance and should be used if you haven’t used -b 2.
-H 99	Trims low quality 3′ bases. We suggest using -H 20

Note. We’ve assumed that your sample prep process can produce the four fragments shown in the earlier illustration. If your process has selected for only two of these and excludes the complementary PCR products, then use the ‘-b 2’ option.

Novoalign Report (Native Mode)

# novoalign (2.03.02MT) - short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed to Novocraft Technologies Sdn Bhd
#  novoalign -d /wd5/db/ath.nbx -f SRR01330.100k.fastq -b 2 -t120 
# Interpreting input files as Sanger FASTQ.
# Index Build Version: 2.1
# Methylation Mode ON - For bisulphite treated sequences.
# Hash length: 17
# Step size: 1
...
@SRR013309.99976 :1:4:932:709 length=50 S       TTTGGGCTTTCCGAAAAGGTATCACATGCTAAGTTTGGCTTCACGGTTTT      IIIIIIIIIIIIIIIIIIIIIIIIIIIIII07IIIII&?II*%/-%3I@$       R       90
@SRR013309.99977 :1:4:839:938 length=50 S       TGATTTTAATATGATTTTATTTAAATGGAGTATTTTTATTATTTTTATTT      IIIIIIIIIIIIIIIIIIIIIIIIIII907I@IIIII6II9+II0I,%(%      U       65      119     >Chr2   17017872     R       .       .       .       GA 1T>A 2C>A 3T>A 5G#A 6C>A 8G#A 9C>A 11G#A 12G#A 14G#A 17G#A 20G#A 30G#A 33G#A 34G#A 35G#A 41G#A 45G#A 46G#A 47G#A
@SRR013309.99978 :1:4:542:491 length=50 S       TTTTGTTTTGATGAGTTGATGGTTGTTTATTTTAATTTATAAGTTTTTTT      IIIIIIIIIIII>IIIII4II*IIIIII5IIII)1I<'?,$%$II(&&!.      NM
@SRR013309.99979 :1:4:63:623 length=50  S       TTGAGGGATGGTATTGGATATTGGTATGAGGAATGTTTGTTTTTTTTTTT      IIIIIIIIIIIIIIIIIIIIII2-6%I3*(?)4IE(II&&IG.4AF*-#/      R       2
@SRR013309.99980 :1:4:457:700 length=50 S       TACGATAAGTTTTAGGATGAATTTTCGATTGTTCGAGACAGGTTTTCTTT      IIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIII=/50E,I8.ICI(I'*      U       26      150     >Chr3   14048749        R       .       .       .       GA 1T>A 2C>A 6G#A 8G#A 19G#A 22G#A 27G#A 29G#A 39G#A 41G#A 45G#A
@SRR013309.99981 :1:4:254:595 length=50 S       TTGGTGGGGTCGAGTAAATTTGAATTTAAATTCGAATTTTTATTGTTTTT      IIIIIIIIIIIIIIIIIIIII3I?III2IIII@0&CIII&$"II%D&I:+      U       53      25      >Chr3  15315657        F       .       .       .       CT 10C#T 27C#T 40G>T 47G>T 50G>T
@SRR013309.99982 :1:4:610:41 length=50  S       TGTGTATGATTGAGTATAAGAATTTAAATCGCACCTGGATTTTAATGGCT      IIIIIIIIIIIIIIIIIIII@8III1B<I=(I0$+F8)$AIII$&"'%&%      R       47
@SRR013309.99983 :1:4:954:102 length=50 S       TAATGTTGGTTTTAATTTAATATTTAGAATTAGAAGGTTTGTTTTTGTAT      IIIIIIIIIIIIII9IIIICI=III71:-II14,2/5III+II&II&'&%      R       2
@SRR013309.99984 :1:4:281:708 length=50 S       TTTTAGTAATTTATTATTGTATAATTGTATATTAAATTATGTGTATTTTT      IIIIIIIIIIIIIIIIII=IIIIIII&IIIDI-I>B&B:*0"%,$I%IGI      U       79      90      >Chr2   839733  R       .       .       .       GA 4C>A 7T>A 11C>A 14C>A 18C>A 19G#A 29G#A 31G#A 34G#A 36G#A 39G#A 40G#A 41G#A 44G#A
@SRR013309.99985 :1:4:124:349 length=50 S       TGATTTTTGGGAAGTTTTTGAGGAGTTTAAGTTAGATTTTTTTTTGTAGT      IIIIIIIIIIIII73IIIII'>IHIBII92&@IE*-E&+$+%I'%&##&*      NM
@SRR013309.99986 :1:4:765:355 length=50 S       TGTAAGATGTAATTTTTTGAGATTATAATAAAGTTTGTATATAATGTTTT      IIIIIIIIIIIIIIIIIII5I1IIDI02II8I/III'I4I.I+4I&%I&D      U       22      150     >Chr1   22982615        F       .       .       .       CT 13C#T 14C#T 17C#T 29C#T 34C#T 47A>T 49G>T 50C#T
@SRR013309.99990 :1:4:510:813 length=50 S       TGTTGGATAGTAATGATATAAGTTTAATGTTTTAATTATTTATGTAAATT      IIIIIIIIIIIIIIIIIIIIIIIIII;IIIIII**IIAIII5"$(*.:II      U       24      150     >Chr5   15311929        R       .       .       .       GA 2G#A 6C>A 10G#A 11G#A 14G#A 15G#A 18G#A 20G#A 26G#A 34G#A 37G#A 40G#A
@SRR013309.99991 :1:4:739:197 length=50 S       TTTGTATTATTATTATGTGGTGTTGATGTTTTTTATTTTTTTTGTTTTGT      IIIIIIIIIIIIIIIIIII)IIIII%IIIIIIIIB1IIII3II%-(%F%I      U       89      90      >Chr4   13917606        F       .       .       .       CT 3C#T 8C#T 11C#T 14C#T 21C#T 23C#T 24C#T 33C#T 34C#T 36A>T 38C#T 39C#T 40C#T 41G>T 43C#T 45A>T 46A>T 47G>T 48C#T
...
#     Read Sequences:   100000
#            Aligned:    38388
#   Unique Alignment:    28110
#   Gapped Alignment:      273
#     Quality Filter:    21280
# Homopolymer Filter:        5
#       Elapsed Time: 4467,250s
# Done.

Looking at one line in detail…

@SRR013309.99990 :1:4:510:813 length=50 S       TGTTGGATAGTAATGATATAAGTTTAATGTTTTAATTATTTATGTAAATT      IIIIIIIIIIIIIIIIIIIIIIIIII;IIIIII**IIAIII5"$(*.:II      U       24      150     >Chr5   15311929        R      .       .       .       GA 2G#A 6C>A 10G#A 11G#A 14G#A 15G#A 18G#A 20G#A 26G#A 34G#A 37G#A 40G#A

@SRR013309.99990 :1:4:510:813 length=50	The read header or identification
S	Indicates a single end read
TGTTGGATAGTAATGATATAAGTTTAATGTTTTAATTATTTATGTAAATT	The read sequence
IIIIIIIIIIIIIIIIIIIIIIIIII;IIIIII*IIAIII5″$(.:II	Sanger fastq format quality string for the read
U	Indicates a unique alignment location was found
24	Alignment score
150	Quality score for the alignment
>Chr5	Chromosome that read is mapped to
15311929	1 based offset of the alignment within the chromosome
R	Strand of the alignment
. . .	Three fields for location of alignment of the pair to this read
GA	Indicates GA index and rules were use din the alignment
2G#A 6C>A 10G#A 11G#A 14G#A 15G#A 18G#A 20G#A 26G#A 34G#A 37G#A 40G#A	Mismatches, Inserts, Deletes and unconverted cytosines.

Performance

Alignment of bisulphite treated fragments is slower than for normal DNA due to the reduction in complexity from conversion of cytosines to uracil. The results in hash indexing process seeding to more alignment locations.

In above example using 50bp reads on a 600Mbp genome and allowing 4 mismatches (not counting converted cytosines) processing rate was around 25 reads/second on a Dual core AMD work station.

Documentation

Bisulphite Treated Reads

See also..

Alignment of Bisulphite treated sequences

Novoindex

Novoindex Command Line Options for Bisulphite Mode

Novoalign

Novoalign Command Line Options for Bisulphite Mode

Novoalign Report (Native Mode)

Performance

NovoMethyl – Analyzing Methylation Status

Regard the Bisulphite Treatment

LATEST NEWS

Novocraft and Basepair Inc. Announce Strategic Partnership to Deliver Advanced Genomic Pipelines in the Cloud

Novoalign V4.03.01

Novoalign V4.03.00 and Novosort V3.00.00

Contact Us

Documentation

Bisulphite Treated Reads

See also..

Alignment of Bisulphite treated sequences

Novoindex

Novoindex Command Line Options for Bisulphite Mode

Novoalign

Novoalign Command Line Options for Bisulphite Mode

Novoalign Report (Native Mode)

Performance

NovoMethyl – Analyzing Methylation Status Regard the Bisulphite Treatment

LATEST NEWS

Novocraft and Basepair Inc. Announce Strategic Partnership to Deliver Advanced Genomic Pipelines in the Cloud

Novoalign V4.03.01

Novoalign V4.03.00 and Novosort V3.00.00

Contact Us

NovoMethyl – Analyzing Methylation Status

Regard the Bisulphite Treatment