Alignment of Bisulphite treated sequences
From V2.1, Novoalign can align bisulphite treated sequences against a reference genome in a single pass that aligns against both C-T and G-A in-silco indexed reference sequences.
The process involves:
- Novoindex builds a double index structure using both C-T and G-A converted k-mers
- Novoalign will then align using both indexes and using both forward and complementary directions.
There is also an option for penalising unconverted cytosine in CHG or CHH positions.
With most Illumina Whole Genome Bisulphite reads we only see two of the four possible strands. This happens because different adapter/primer sequences are used for 5′ & 3′ ends of the fragments and, as 5′ adapter is used to attach fragments to the slides, only two of the possible 4 strands are present in the reads. If your sample prep has preserved strand direction then reads will be G rich C poor as in the following example. In this case, you should use the option -b2 and there is no need to use the unconverted cytosine penalty (-u option).
Note. Read 2 of paired reads will be reverse complement and should be C rich G poor if -b2 is applicable.
Example of Illumina Bi-seq reads (Read 1 of pairs) where only two of four possible strands are present. The option -b2 should be used with these reads.
@unknown_0006:2:26:1334:1994#0/1 CTTGTGGTAAAGATGGTTTGTTTTTAGATGGGTAGGATATATATTATTAGGAAATTTTATTTTTAGTAGGTTGGGAAGTATTATGTTTTATGATAAAAAAA @unknown_0006:2:26:1334:532#0/1 TTTAAAGTAGTTAAAAGTAATGTGTTATTATTTTTATTTAGTTAATATTTATAAGTTTGTAAAGTTAGAGTGTATTTAAAAGTTAGGTGATATGTTATTTG @unknown_0006:2:26:1334:1549#0/1 AATATATTTGGTTTTTGTTGGATTTTAGTAGATTTTGTTGGATGGTAAATTGTATTATTTTGTAATTTTTATTTTGTAAATGTATATTTTTGTTATTGTTT @unknown_0006:2:26:1334:2019#0/1 AATATAGAAATAGTAGTTTTATAGTGTGATAATAAGATTTTAGTTATGTTATTTTATAGTTAGAATGGTTAGTTTATATTAATAGTGTGTGTATATGTATT @unknown_0006:2:26:1334:1107#0/1 AAAATGTTAAAAGAAGTTATTTAGAGGAGAGGGAAGAAATTTGGATTTATATTTTAAAAAAAGGAAGATTATGATAAGGAATAATTTTTTTTAATATTTTA @unknown_0006:2:26:1334:152#0/1 GATGGTTTTAATTTTTTGATTTCGTGATTTATTTATTTCGGTTTTTTAAAGTGTTGAGATTATAGGTGGGTGGGTTATGAGGGTAGGAGATTGAGATTATC
Targeted bisulphite sequencing projects that use amplicon PCR before addition of Illumina adaptors typically produce alignments in teh 4 possible orientations and would be run with -b4 -u8 options set.
Novoindex is used to build the bisulphite index structure. This involves building two hash table, one using C-T converted k-mers and the other with G-A conversion.
The bisulphite k-mer hash table is more memory efficient than a normal index as after conversion we have only 3 possible nucleotides in the index k-mers. This allows index k-mer length to be increased compared to a normal index. However as we build two indexes memory utilisation will likely increase compared to standard index structure.
A bisulphite mode index comprises five tables, the first two being doubled up for the CT and GA indexes.:
- Two k-mer hash tables of size 4*3k bytes
- Two sequence offset tables of size 4N/s bytes where N is the length of the sequences being indexed and s is the step size.
- A compressed sequence file of size N/2 bytes.
Novoindex Command Line Options for Bisulphite Mode
|-b||Enables bisulphite indexing mode.|
|Sets the k-mer size. It’s best to let this default unless you need to reduce memory utilisation. Typical values would be in range 17-19, with 19 as the maximum allowed.|
|Sets the step size for the index. With s=1 every k-mer is indexed, at s=2 only every second k-mer is indexed. Using s>1 reduces memory and increases processing time but it does not reduce sensitivity or specificity. It’s probably best to let s default. Set it to 2 or higher if you need to reduce memory utilisation.|
- novoindex -b a.thaliana.nbx refseq/a.thaliana/*.fas
Novoalign will automatically switch into bisulphite alignment mode if the reference sequence index was created with the -b option. In this mode, alignment for each read will involve four searches:
- Forward against CT index
- Reverse complement against CT index
- Forward against GA index
- Reverse complement against GA index
The search process is done iteratively starting with a search for an exact match and gradually increasing the mismatch threshold until an alignment is found.
Scoring of alignments is adjusted so that there is no penalty for the alignment of thymines (which are possibly converted cytosines) to cytosines in the reference sequence.
Novoalign Command Line Options for Bisulphite Mode
|-t 99,99||Sets the alignment threshold. We suggest setting -t 20,3 for Bi-Seq reads.|
|Limits the alignment score threshold for low complexity reads. We suggest setting –hlimit 6 for Bi-Seq|
|Sets mode where alignment process is forward against CT index and complementary against the GA index. Use this option if your sample preparation has preserved the strand of the reads (Examination of read sequences will show they are typically G rich, C poor).|
|Defines a penalty for unconverted cytosines at CHG or CHH. Default 0. For plants try 6, and for vertebrates 12. Please refer to reference manual for more details. Using an unconverted cytosine penalty has the secondary effect of improving performance and should be used if you haven’t used -b 2.|
|Trims low quality 3′ bases. We suggest using -H 20|
Novoalign Report (Native Mode)
# novoalign (2.03.02MT) - short read aligner with qualities. # (C) 2008 NovoCraft # Licensed to Novocraft Technologies Sdn Bhd # novoalign -d /wd5/db/ath.nbx -f SRR01330.100k.fastq -b 2 -t120 # Interpreting input files as Sanger FASTQ. # Index Build Version: 2.1 # Methylation Mode ON - For bisulphite treated sequences. # Hash length: 17 # Step size: 1 ... @SRR013309.99976 :1:4:932:709 length=50 S TTTGGGCTTTCCGAAAAGGTATCACATGCTAAGTTTGGCTTCACGGTTTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIII07IIIII&?II*%/-%[email protected]$ R 90 @SRR013309.99977 :1:4:839:938 length=50 S TGATTTTAATATGATTTTATTTAAATGGAGTATTTTTATTATTTTTATTT [email protected]+II0I,%(% U 65 119 >Chr2 17017872 R . . . GA 1T>A 2C>A 3T>A 5G#A 6C>A 8G#A 9C>A 11G#A 12G#A 14G#A 17G#A 20G#A 30G#A 33G#A 34G#A 35G#A 41G#A 45G#A 46G#A 47G#A @SRR013309.99978 :1:4:542:491 length=50 S TTTTGTTTTGATGAGTTGATGGTTGTTTATTTTAATTTATAAGTTTTTTT IIIIIIIIIIII>IIIII4II*IIIIII5IIII)1I<'?,$%$II(&&!. NM @SRR013309.99979 :1:4:63:623 length=50 S TTGAGGGATGGTATTGGATATTGGTATGAGGAATGTTTGTTTTTTTTTTT IIIIIIIIIIIIIIIIIIIIII2-6%I3*(?)4IE(II&&IG.4AF*-#/ R 2 @SRR013309.99980 :1:4:457:700 length=50 S TACGATAAGTTTTAGGATGAATTTTCGATTGTTCGAGACAGGTTTTCTTT IIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIII=/50E,I8.ICI(I'* U 26 150 >Chr3 14048749 R . . . GA 1T>A 2C>A 6G#A 8G#A 19G#A 22G#A 27G#A 29G#A 39G#A 41G#A 45G#A @SRR013309.99981 :1:4:254:595 length=50 S TTGGTGGGGTCGAGTAAATTTGAATTTAAATTCGAATTTTTATTGTTTTT [email protected]&CIII&$"II%D&I:+ U 53 25 >Chr3 15315657 F . . . CT 10C#T 27C#T 40G>T 47G>T 50G>T @SRR013309.99982 :1:4:610:41 length=50 S TGTGTATGATTGAGTATAAGAATTTAAATCGCACCTGGATTTTAATGGCT [email protected]<I=(I0$+F8)$AIII$&"'%&% R 47 @SRR013309.99983 :1:4:954:102 length=50 S TAATGTTGGTTTTAATTTAATATTTAGAATTAGAAGGTTTGTTTTTGTAT IIIIIIIIIIIIII9IIIICI=III71:-II14,2/5III+II&II&'&% R 2 @SRR013309.99984 :1:4:281:708 length=50 S TTTTAGTAATTTATTATTGTATAATTGTATATTAAATTATGTGTATTTTT IIIIIIIIIIIIIIIIII=IIIIIII&IIIDI-I>B&B:*0"%,$I%IGI U 79 90 >Chr2 839733 R . . . GA 4C>A 7T>A 11C>A 14C>A 18C>A 19G#A 29G#A 31G#A 34G#A 36G#A 39G#A 40G#A 41G#A 44G#A @SRR013309.99985 :1:4:124:349 length=50 S TGATTTTTGGGAAGTTTTTGAGGAGTTTAAGTTAGATTTTTTTTTGTAGT IIIIIIIIIIIII73IIIII'>IHIBII92&@IE*-E&+$+%I'%&##&* NM @SRR013309.99986 :1:4:765:355 length=50 S TGTAAGATGTAATTTTTTGAGATTATAATAAAGTTTGTATATAATGTTTT IIIIIIIIIIIIIIIIIII5I1IIDI02II8I/III'I4I.I+4I&%I&D U 22 150 >Chr1 22982615 F . . . CT 13C#T 14C#T 17C#T 29C#T 34C#T 47A>T 49G>T 50C#T @SRR013309.99990 :1:4:510:813 length=50 S TGTTGGATAGTAATGATATAAGTTTAATGTTTTAATTATTTATGTAAATT IIIIIIIIIIIIIIIIIIIIIIIIII;IIIIII**IIAIII5"$(*.:II U 24 150 >Chr5 15311929 R . . . GA 2G#A 6C>A 10G#A 11G#A 14G#A 15G#A 18G#A 20G#A 26G#A 34G#A 37G#A 40G#A @SRR013309.99991 :1:4:739:197 length=50 S TTTGTATTATTATTATGTGGTGTTGATGTTTTTTATTTTTTTTGTTTTGT IIIIIIIIIIIIIIIIIII)IIIII%IIIIIIIIB1IIII3II%-(%F%I U 89 90 >Chr4 13917606 F . . . CT 3C#T 8C#T 11C#T 14C#T 21C#T 23C#T 24C#T 33C#T 34C#T 36A>T 38C#T 39C#T 40C#T 41G>T 43C#T 45A>T 46A>T 47G>T 48C#T ... # Read Sequences: 100000 # Aligned: 38388 # Unique Alignment: 28110 # Gapped Alignment: 273 # Quality Filter: 21280 # Homopolymer Filter: 5 # Elapsed Time: 4467,250s # Done.
Looking at one line in detail…
@SRR013309.99990 :1:4:510:813 length=50 S TGTTGGATAGTAATGATATAAGTTTAATGTTTTAATTATTTATGTAAATT IIIIIIIIIIIIIIIIIIIIIIIIII;IIIIII**IIAIII5"$(*.:II U 24 150 >Chr5 15311929 R . . . GA 2G#A 6C>A 10G#A 11G#A 14G#A 15G#A 18G#A 20G#A 26G#A 34G#A 37G#A 40G#A
|@SRR013309.99990 :1:4:510:813 length=50||The read header or identification|
|S||Indicates a single end read|
|TGTTGGATAGTAATGATATAAGTTTAATGTTTTAATTATTTATGTAAATT||The read sequence|
|IIIIIIIIIIIIIIIIIIIIIIIIII;IIIIII**IIAIII5″$(*.:II||Sanger fastq format quality string for the read|
|U||Indicates a unique alignment location was found|
|150||Quality score for the alignment|
|>Chr5||Chromosome that read is mapped to|
|15311929||1 based offset of the alignment within the chromosome|
|R||Strand of the alignment|
|. . .||Three fields for location of alignment of the pair to this read|
|GA||Indicates GA index and rules were use din the alignment|
|2G#A 6C>A 10G#A 11G#A 14G#A 15G#A 18G#A 20G#A 26G#A 34G#A 37G#A 40G#A||Mismatches, Inserts, Deletes and unconverted cytosines.|
Alignment of bisulphite treated fragments is slower than for normal DNA due to the reduction in complexity from conversion of cytosines to uracil. The results in hash indexing process seeding to more alignment locations.
In above example using 50bp reads on a 600Mbp genome and allowing 4 mismatches (not counting converted cytosines) processing rate was around 25 reads/second on a Dual core AMD work station.