Based on Novomethyl (Beta.9.0)
Introduction
Novomethyl can analyse a set of alignments to identify methylated cytosine’s.
Novomethyl analyses Cytosines using two models
Methylation State | In this model the sample is assumed to be homogeneous and we call cytosines as either methylated or unmethylated. Each base in the genome being sequenced can be one of A, Cm, Cu, Gm, Gu or T and calling methylation status is very similar to calling a consensus sequence or SNPs with samtools pileup except that we have 6 rather than 4 states for each position. Novomethyl adopts the base calling models from progams such as samtools pileup to call methylation status.In this case we will use the basic statistical models from SOAP ♦ “SNP detection for massively parallel whole-genome resequencing” (2009) Genome Res. published online May 6, 2009 , doi:10.1101/gr.088013.108 |
Methylation Percentage | This model is designed for heterogeneous samples and uses a traditional 4 base model to call the consensus sequence and then for each cytosine it reports the % of bases that were methylated (i.e. Not converted to thyamine). This model should be used for heterogenous samples such as tumor cells. |
In both modes the reference sequence is used to set prior probabilities for the base calls and then the actual base calls are used to calculate posterior probabilities. The models can handle SNPs and heterozygous locii. It’s fully possible that cytosines in the reference are not in present in the sample or vice versa.
CT & GA alignments
DNA fragments can come off the Watson or Crick strand of the sample however our reference sequence has only one strand. Alignments of fragments from the Watson strand align in CT mode (i.e. T’s in the read can align to either a C or T in the reference sequence) and fragments from the Crick strand align in GA mode because they need to be reverse complemented to align to the reference and now Crick strand cytosine’s are the G’s in the reference.
On the Watson strand we represent methylated cytosine as Cm and unmethylated as Cu and Cytosines from the Crick strand are represented as Gm and Gu depending on methylation status.
Report Formats
There are two basic report formats, BED format and a native report format that borrows from samtools pileup format with addition of extra fields for methylation status.
The BED format report comes in two flavours:
- Track name is Cytosine Context & Methylation State and score is Phred scaled quality of the state
- Track name is Cytosine Context and the the score is the Methylation Percentage
Method
Four basic steps
- Align using Novoalign V2.07.06 or later
- Split alignments into separate CT & GA BAM files
- Use samtools mpileup to produce a raw pileup from the CT & GA files
- Use novomethyl to call methylation status
Aligning
Nothing special here, just make sure you use Novoalign V2.07.06 or later as it tags alignments based on the Watson/Crick strand. For best performance use -b2 option. we need to use a recent copy of Novoalign as it adds a tag to the sam records indication if the alignment is in CT or GA mode.
Splitting the BAM files
This is necessary as we need to separate alignments into Watson & Crick strands. This is easily done using grep on the sam report from Novoalign, we just grep for lines with tag ZB:Z:CT or ZB:Z:GA.
Example code that processes a set of single end read files:
for f in {list of read files} do novoalign -d referencegenome.nbx -t 20,3 -f ${f} -F STDFQ -b2 -a -o SAM '@RG\tID:Pooh' > ${f}.sam 2>${f}.log grep -e "^@\|ZB:Z:CT" ${f}.sam | samtools view -uS - | samtools sort - ${f}.CT& grep -e "^@\|ZB:Z:GA" ${f}.sam | samtools view -uS - | samtools sort - ${f}.GA& done wait
The option -t 20,3 sets the alignment threshold. For Bi-Seq projects we usually set this lower than the default.
The option -a turns on adapter trimming for short fragments. This is especially useful if you’ve gone for a short insert length to try and avoid fragments breaking during bisulphite treatment. For paired end data sets we’ve tested the default adapters were correct. However we suggest you check what adapters were actually used.
After samtools sort you may optionally remove duplicate fragments with samtools rmdup.
If you had multiple read files the next step is to merge the BAM files from above (but keeping CT & GA files separate)
Skip this step if you only had one file of reads.
samtools merge ct.bam *.CT.bam& samtools merge ga.bam *.GA.bam& wait
Samtools mpileup
We now have two BAM files and we can use these with samtools mpileup to produce a raw mpileup file. Note that we are not using mpileup to call consensus or bases, just to pileup the bases.
samtools mpileup -BC 0 -q 30 -f referencegenome.fa ct.bam ga.bam >biseq.mpileup.txt
The mpileup options -BC 0 are required to turn off base quality calibration. If not used, then mpileup will lower the quality of converted cytosine calls.
We’ve also shown the use of a quality limit of 30 for alignments, this is more an intuition than a recommendation.
A couple of important points:
- Don’t use an SM tag on the @RG record
- The ct.bam file should be first
The mpileup file will have separate columns for Watson & Crick strands (CT & GA mode alignments) and should look something like:
gnl|Amel_4.0|Group10.1 90 A 10 ........,, <>AB@B=BBC 11 ,,,.,.,.,C^~, CBBCCBBCA-B gnl|Amel_4.0|Group10.1 91 T 10 ........,, AABBBCACBB 11 ,,,.,.,.,., B@BBAAAB>@= gnl|Amel_4.0|Group10.1 92 C 10 TTTTTTTTtt A@BABC?BBB 11 ,,,.,.,.,., BBBACA=CB?B gnl|Amel_4.0|Group10.1 93 G 11 ........,,^~. A7BBBABB@B0 11 aaaAaAaAaAa CBCCCAACAC@ gnl|Amel_4.0|Group10.1 94 A 11 ........,,. >;5 gnl|Amel_4.0|Group10.1 96 C 11 TTTTTTTTttT BAA?BC?CBC; 11 ,,,.,.,.,., BBCAABA=BBA gnl|Amel_4.0|Group10.1 97 A 11 ........,,. >=?=@B=BCC: 11 ,,,.,.,.,., BBBBABB@ABB gnl|Amel_4.0|Group10.1 98 G 11 ........,,. B=BBBB@BBB< 11 aaaAaAaAaAa ABBCBABB
Methylation Status report (BED format) for Homogeneous Samples
If the tissue sample being sequenced is expected to be homogenous then you may want to call Methylation state rather than percentage methylated. This mode will report a Cytosine as either Methylated or Unmethylated using the “6 base” model of A,Cm,Cu,Gm,Gu & T. Cytosines are classified as either Methylated or Unmethyated with no shades of grey other than in Diploid mode where it’s possible that a Cytosine that is close to 50 methylated may be called as heterozygous for Cm & Cu.
Here we use novomethyl to produce a bed file showing methylation status of all cytosine’s that had sufficient read coverage to call the status.
novomethyl -h <biseq.mpileup.txt >Cmethyl.bed
The bed file has separate tracks for each context and methylation status. The score column indicates the quality of the call and strand indicates Watson/Crick location of the Cytosine.
gnl|Amel_4.0|Group10.1 91 92 CpGu 54 + gnl|Amel_4.0|Group10.1 92 93 CpGu 54 - gnl|Amel_4.0|Group10.1 94 95 CHHu 54 + gnl|Amel_4.0|Group10.1 95 96 CHGu 54 + gnl|Amel_4.0|Group10.1 97 98 CHGu 54 - gnl|Amel_4.0|Group10.1 98 99 CpGu 54 + gnl|Amel_4.0|Group10.1 99 100 CpGu(Het) 54 - gnl|Amel_4.0|Group10.1 101 102 CpGu 54 + gnl|Amel_4.0|Group10.1 102 103 CpGu 54 - gnl|Amel_4.0|Group10.1 104 105 CHHu 57 - gnl|Amel_4.0|Group10.1 106 107 CHHu 57 - gnl|Amel_4.0|Group10.1 108 109 CHHu 57 -
A log file is written to stderr with some statistics on Cytosine classifications. The counts of methylated/unmethylated cytosines only includes those where methylation status quality is greater than 5
# Novomethyl (Beta.4.0) Run Report # (C) 2011 NovoCraft Technologies Sdn Bhd. # Licensed to Novocraft Technologies Internal Use Only # novomethyl -h mpileup.10.3.1M.txt # Starting at Fri Mar 18 10:13:39 2011 # # Cytosine Classifications # # CT strand # 660 SNP (C in reference but not in reads.) # 0 Low Read Cover # 14405 Low Quality # 0 No Read Cover # # Homozygous Cytosines # Context Methylated Unmethylated % Methylated # CpG 49 65067 0.1% # CHG 1 24732 0.0% # CHH 12 102977 0.0% # # Heterozygous Cytosines # Context Methylated Unmethylated % Methylated # CpG 33 389 7.8% # CHG 1 170 0.6% # CHH 5 646 0.8% # # GA strand # 536 SNP (C in reference but not in reads.) # 0 Low Read Cover # 13995 Low Quality # 0 No Read Cover # # Homozygous Cytosines # Context Methylated Unmethylated % Methylated # CpG 57 65276 0.1% # CHG 2 24750 0.0% # CHH 5 101362 0.0% # # Heterozygous Cytosines # Context Methylated Unmethylated % Methylated # CpG 21 359 5.5% # CHG 0 162 0.0% # CHH 1 629 0.2% # # 99.9% Cu Conversion Efficiency (CT strand) # 99.9% Cu Conversion Efficiency (GA strand) # Note. Conversion efficiency may be inaccurate if the sample is heterogeneous. # Finished at Fri Mar 18 10:13:46 2011
Percent Methylation Report (BED format) for Heterogeneous Samples
Novomethyl is used to produce a bed file showing methylation percentage for each cytosine that had sufficient read coverage to call the status.
novomethyl -h -% <biseq.mpileup.txt >Cmethyl.bed
The bed file has separate tracks for each context and methylation status. The score column indicates the % of methylated cytosines and the strand indicates Watson/Crick location of the Cytosine. !
gnl|Amel_4.0|Group10.3 350 350 CHH 0 + gnl|Amel_4.0|Group10.3 362 362 CHG 0 + gnl|Amel_4.0|Group10.3 398 398 CHG 0 + gnl|Amel_4.0|Group10.3 410 410 CHG 40 + gnl|Amel_4.0|Group10.3 422 422 CHH 0 + gnl|Amel_4.0|Group10.3 446 446 CHH 0 + gnl|Amel_4.0|Group10.3 448 448 CHG 0 +
A log file is written to stderr. This includes a histogram of methylation rates for cytosines called with a quality >greater than -q setting and methylation status quality greater then 5.
# Novomethyl (Beta.4.0) Run Report # (C) 2011 NovoCraft Technologies Sdn Bhd. # Licensed to Novocraft Technologies Internal Use Only # novomethyl -% mpileup.10.3.1M.txt # Starting at Fri Mar 18 10:14:02 2011 # # Cytosine Classifications # # CT strand # 210 SNP (C in reference but not in reads.) # 14799 Low Read Cover # 264 Low Quality # 0 No Read Cover # % Meth CpG CHG CHH # 0 65038 24759 103078 # 1 0 0 0 # 2 0 0 0 # 3 0 0 0 # 4 0 0 1 # 5 4 2 0 # 6 18 8 11 . . . # 96 0 0 0 # 97 0 0 0 # 98 0 0 0 # 99 0 0 0 # 100 39 1 10 # # GA strand # 208 SNP (C in reference but not in reads.) # 14216 Low Read Cover # 270 Low Quality # 0 No Read Cover # % Meth CpG CHG CHH # 0 65284 24771 101513 # 1 1 0 0 # 2 0 0 0 # 3 0 0 0 . . . # 97 0 0 0 # 98 0 0 0 # 99 0 0 0 # 100 50 2 5 # # 99.9% Cu Conversion Efficiency (CT strand) # 99.9% Cu Conversion Efficiency (GA strand) # Note. Conversion efficiency may be inaccurate if the sample is heterogeneous. # Finished at Fri Mar 18 10:14:07 2011
The Cu Conversion efficiencies are tallied from Cytosines that are called as unmethylated.
Consensus & Methylation Report
In consensus report format the novomethyl report is similar to a samtools pileup report with added fields for cytosine context, status, quality and percent methylated.
novomethyl -h -o Consensus< <biseq.mpileup.txt >Cmethyl.txt
An example report is shown below followed by a description of the columns.
gnl|Amel_4.0|Group10.3 1401 T T 65 . . . . . . 7 T$T$TTTTT =??ABBA 2 TT CA gnl|Amel_4.0|Group10.3 1402 T T 58 . . . . . . 5 TTTTT B@BCB 2 TT CB gnl|Amel_4.0|Group10.3 1403 A A 64 . . . . . . 5 AAAAA @?BBA 2 AA @A gnl|Amel_4.0|Group10.3 1404 C C 56 CHGu + 66 0 0 5 5 tTttt BBCAB 2 CC B@ gnl|Amel_4.0|Group10.3 1405 C C 56 CpGu + 66 0 0 5 5 tTttt A@CCB 2 CC B@ gnl|Amel_4.0|Group10.3 1406 G G 63 CpGu - 55 0 0 2 5 GGGGG BABCB 2 AA A@ gnl|Amel_4.0|Group10.3 1407 T T 58 . . . . . . 5 TTTTT B:ACC 2 TT BA gnl|Amel_4.0|Group10.3 1408 A A 64 . . . . . . 5 AAAAA @AB@@ 2 AA A@ gnl|Amel_4.0|Group10.3 1409 T T 58 . . . . . . 5 TTTTT @ABAB 2 TT CB ... gnl|Amel_4.0|Group10.3 1553 T C/T 66 CHGu + 27 0 0 4 9 TTTTTTTTT BBBCCC?@B 10 ccTcCcTTCT CB?=BA8@A? ... gnl|Amel_4.0|Group10.3 88622 A A 82 . . . . . . 8 AAAAAAA^~A 68?@AABB 2 AA A? gnl|Amel_4.0|Group10.3 88623 A A 82 . . . . . . 8 AAAAAAAA 8;BABBCC 2 AA C7 gnl|Amel_4.0|Group10.3 88624 C C 56 CHGu + 84 0 0 8 8 tTttTTTt 7>1@B@C= 2 Cn B! gnl|Amel_4.0|Group10.3 88625 C C 15 CpGm + 10 75 6 8 8 CCCCTCTC B2>CB@CA 2 CC ;> gnl|Amel_4.0|Group10.3 88626 G G 84 CpGm - 32 100 2 2 8 GGGGGGGG A@9BBBC; 2 GG AB gnl|Amel_4.0|Group10.3 88627 A A 88 . . . . . . 9 AAAAAAAA^~A @B<B@BB@B 2 AA B= gnl|Amel_4.0|Group10.3 88628 C C 9 CpGu + 4 67 6 9 9 CCCCTCTtC B7CBAABCC 2 CC B; gnl|Amel_4.0|Group10.3 88629 G G 90 CpGm - 32 100 2 2 9 GGGGGGGGG BAA=ABCBB 2 GG AA gnl|Amel_4.0|Group10.3 88630 A A 88 . . . . . . 9 AAAAAAAAA A@BAA@BAB 2 AA @? gnl|Amel_4.0|Group10.3 88631 A A 88 . . . . . . 9 A$AAAAAAAA A)BB<BBCC 2 AA BA
A log file is also written to stderr. The log file is a combination of the status and methylation percentage reports. For sake of brevity an example is not given.
Consensus Report Columns
1. | RNAME | Reference sequence name from pileup |
2. | POS | 1-based Coordinate |
3. | REF | Reference Sequence Base. May include IUB ambiguous codes if present in the mpileup. |
4. | CONSENSUS | Consensus base, may show heterozygous base such as A/T |
5. | QUAL | Quality of the consensus call |
6. | CONTEXT | If it is a Cytosine we show the context as CpG, CHG or CHH together with methylation status indicator ‘u’ or ‘m’ if methylation quality is gretaer than 3. |
7. | STRAND | Watson-Crick location of the cytosine is shown as +/- |
8. | MQUAL | The quality of the methylation status given that we have a cytosine at this locus |
9. | MPCNT | The % of cytosines that were not converted to T’s or a period if the number of samples is too low to accurately call a percentage. |
10. | NU | The number of unconverted cytosines. |
11. | NC | The total number of cytosines. If a site is called as heterozygous C/T then the lesser of half the total C&T bases or the total T bases are assumed to result from the heterozygous T and are not included in the NC count. |
12. | NCT | The number of reads aligning on the Watson strand. The following 6 fields are directly from samtools mpileup file and we refer you to samtools documentation for further information. |
13. | CTBASES | The bases and read mapping qualities from Watson strand. |
14. | CTQUALITIES | The base qualities for Watson strand. |
15. | NGA | Next 3 fields are as above for Crick strand. |
16. | GABASES | |
17. | GAQUALITIES |
Novomethyl Command Line Option
Usage:
novomethyl options < mpileup.txt Options (with default value shown):
-1 | Sets Haploid mode. The sample is expected to be from a Haploid genome and and heterozygosity is not considered as a possible base state. |
-2 | Sets Diploid mode (default) and only reports Homozygous Cytosines. The sample is expected to be from a Diploid genome and and heterozygosity is considered as a possible base state. In BED format, only Homozygous cytosines have their methylation status or percentage reported. |
-h | Turns on Diploid mode with reporting of both homozygous and heterozygous Cytosines. The sample is expected to be from a Diploid genome and and heterozygosity is considered as a possible base state. Homozygous & Heterozygous cytosines will have their methylation status or percentage reported. |
-% | In BED format report % methylated rather than methylation status and quality. In this mode we are calling % methylation for the cytosine. This can be combined with the above Haploid & Diploid modes. If in Diploid mode (-h) and a locii is identified as Heterozygous C/[AGT] then % methylation is calculated on the C and can still show 100% methylated. |
-o Consensus | Uses consensus sequence report format. This format contains much more information however you may have to filter and process it to make a track suitable for your genome browser. |
-s 0.001 | Sets prior probability of a SNP. Used to set prior probabilities that a base is different to the reference genome. The reference base gets a prior of 1.0 all other bases get prior of Psnp. Priors also take into account ambiguous nucleotides in the reference genome. |
-e 0.98 | Sets bisulphite conversion efficiency. Used in calculation of P(dk|H) for unconverted Cytosines. Refer to SOAP paper for more details. We suggest running a test and then adjusting this to match the reported rates especially if your sample is heterogeneous. |
-m 0.5 | Sets prior probability of Cytosine being methylated. Splits prior probability of Cytosine between methylated and unmethylated states so that prior of a methylated Cytosine is P(C).Pm |
-q 10 | Sets quality limit for reporting Cytosine methylation status.(Note. A quality of 3 is probability of 0.5 that Cytosine is methylated so it’s really an unknown status.) In both state and % mode a cytosine locii will not be reported if it’s quality is less than this threshold. In state mode this is quality of cm or Cu state, in percent mode it is the quality of the C call. Quality is calculated using a 6 base model (A,Cu,Cm,Gu,Gm,T) with the basic statistical models from SOAP ♦ “SNP detection for massively parallel whole-genome resequencing” (2009) Genome Res. published online May 6, 2009 , doi:10.1101/gr.088013.108 |
-Q 10 | Lower limit on base quality for counting bases for % methylation and conversion rates. Base calls from mpileup file with quality less than this threshold will not be counted for determining % methylation or the conversion efficiency. |
-c 1 | Sets minimum number of reads covering a Cytosine (strand specific). In percent mode a Cytosine will not be called unless it has at least this minimum number of reads on CT or GA strand. |
-d dbname | Indexed reference genome(optional). This is the novoindex of the reference genome used to produce the alignments. If provided it is used for determing the cytosine context (CpG, CHG, CHH).It helps resolve context near the edges of read cover where the mpileup file does not include all the references base need to establish the context. |
-D | Debug mode (prints pileup details) The Consensus format report is possibly more useful. |
Heterozygous C/G Locii
If a locus is identified as heterozygous for C&G Novomethyl will report the site only once. The consensus is identified as C/G with appropriate quality and then methylation state and percentage is reported for either the C or the G but not both. The strand column will indicate if it is the Watson or Crick cytosine that is reported. The strand that has the higher quality for the cytosine (C or G) call is reported.