NovoMethyl – Analyzing Methylation Status

Based on Novomethyl (Beta.9.0)

Introduction

Novomethyl can analyse a set of alignments to identify methylated cytosine’s.

Novomethyl analyses Cytosines using two models

Methylation State

In this model the sample is assumed to be homogeneous and we call cytosines as either methylated or unmethylated. Each base in the genome being sequenced can be one of A, Cm, Cu, Gm, Gu or T and calling methylation status is very similar to calling a consensus sequence or SNPs with samtools pileup except that we have 6 rather than 4 states for each position. Novomethyl adopts the base calling models from progams such as samtools pileup to call methylation status.In this case we will use the basic statistical models from SOAP ♦ “SNP detection for massively parallel whole-genome resequencing” (2009) Genome Res. published online May 6, 2009 , doi:10.1101/gr.088013.108

Methylation Percentage

This model is designed for heterogeneous samples and uses a traditional 4 base model to call the consensus sequence and then for each cytosine it reports the % of bases that were methylated (i.e. Not converted to thyamine). This model should be used for heterogenous samples such as tumor cells.

In both modes the reference sequence is used to set prior probabilities for the base calls and then the actual base calls are used to calculate posterior probabilities. The models can handle SNPs and heterozygous locii. It’s fully possible that cytosines in the reference are not in present in the sample or vice versa.

CT & GA alignments

DNA fragments can come off the Watson or Crick strand of the sample however our reference sequence has only one strand. Alignments of fragments from the Watson strand align in CT mode (i.e. T’s in the read can align to either a C or T in the reference sequence) and fragments from the Crick strand align in GA mode because they need to be reverse complemented to align to the reference and now Crick strand cytosine’s are the G’s in the reference.

On the Watson strand we represent methylated cytosine as Cm and unmethylated as Cu and Cytosines from the Crick strand are represented as Gm and Gu depending on methylation status.

Report Formats

There are two basic report formats, BED format and a native report format that borrows from samtools pileup format with addition of extra fields for methylation status.

The BED format report comes in two flavours:

Track name is Cytosine Context & Methylation State and score is Phred scaled quality of the state
Track name is Cytosine Context and the the score is the Methylation Percentage

Method

Four basic steps

Align using Novoalign V2.07.06 or later
Split alignments into separate CT & GA BAM files
Use samtools mpileup to produce a raw pileup from the CT & GA files
Use novomethyl to call methylation status

Aligning

Nothing special here, just make sure you use Novoalign V2.07.06 or later as it tags alignments based on the Watson/Crick strand. For best performance use -b2 option. we need to use a recent copy of Novoalign as it adds a tag to the sam records indication if the alignment is in CT or GA mode.

Splitting the BAM files

This is necessary as we need to separate alignments into Watson & Crick strands. This is easily done using grep on the sam report from Novoalign, we just grep for lines with tag ZB:Z:CT or ZB:Z:GA.

Example code that processes a set of single end read files:

for f in {list of read files} 
do 
    novoalign -d referencegenome.nbx -t 20,3 -f ${f} -F STDFQ -b2 -a -o SAM '@RG\tID:Pooh' > ${f}.sam 2>${f}.log 
    grep -e "^@\|ZB:Z:CT" ${f}.sam | samtools view -uS - | samtools sort - ${f}.CT& 
    grep -e "^@\|ZB:Z:GA" ${f}.sam | samtools view -uS - | samtools sort - ${f}.GA& 
done 
wait

The option -t 20,3 sets the alignment threshold. For Bi-Seq projects we usually set this lower than the default.

The option -a turns on adapter trimming for short fragments. This is especially useful if you’ve gone for a short insert length to try and avoid fragments breaking during bisulphite treatment. For paired end data sets we’ve tested the default adapters were correct. However we suggest you check what adapters were actually used.

After samtools sort you may optionally remove duplicate fragments with samtools rmdup.

If you had multiple read files the next step is to merge the BAM files from above (but keeping CT & GA files separate)

Skip this step if you only had one file of reads.

samtools merge ct.bam *.CT.bam&
samtools merge ga.bam *.GA.bam&
wait

Samtools mpileup

We now have two BAM files and we can use these with samtools mpileup to produce a raw mpileup file. Note that we are not using mpileup to call consensus or bases, just to pileup the bases.

samtools mpileup -BC 0 -q 30 -f referencegenome.fa ct.bam ga.bam >biseq.mpileup.txt

The mpileup options -BC 0 are required to turn off base quality calibration. If not used, then mpileup will lower the quality of converted cytosine calls.

We’ve also shown the use of a quality limit of 30 for alignments, this is more an intuition than a recommendation.

A couple of important points:

Don’t use an SM tag on the @RG record
The ct.bam file should be first

The mpileup file will have separate columns for Watson & Crick strands (CT & GA mode alignments) and should look something like:

gnl|Amel_4.0|Group10.1  90      A       10      ........,,      <>AB@B=BBC      11      ,,,.,.,.,C^~,   CBBCCBBCA-B
gnl|Amel_4.0|Group10.1  91      T       10      ........,,      AABBBCACBB      11      ,,,.,.,.,.,     B@BBAAAB>@=
gnl|Amel_4.0|Group10.1  92      C       10      TTTTTTTTtt      A@BABC?BBB      11      ,,,.,.,.,.,     BBBACA=CB?B
gnl|Amel_4.0|Group10.1  93      G       11      ........,,^~.   A7BBBABB@B0     11      aaaAaAaAaAa     CBCCCAACAC@
gnl|Amel_4.0|Group10.1  94      A       11      ........,,.     >;5
gnl|Amel_4.0|Group10.1  96      C       11      TTTTTTTTttT     BAA?BC?CBC;     11      ,,,.,.,.,.,     BBCAABA=BBA
gnl|Amel_4.0|Group10.1  97      A       11      ........,,.     >=?=@B=BCC:     11      ,,,.,.,.,.,     BBBBABB@ABB
gnl|Amel_4.0|Group10.1  98      G       11      ........,,.     B=BBBB@BBB<     11      aaaAaAaAaAa     ABBCBABB

Methylation Status report (BED format) for Homogeneous Samples

If the tissue sample being sequenced is expected to be homogenous then you may want to call Methylation state rather than percentage methylated. This mode will report a Cytosine as either Methylated or Unmethylated using the “6 base” model of A,Cm,Cu,Gm,Gu & T. Cytosines are classified as either Methylated or Unmethyated with no shades of grey other than in Diploid mode where it’s possible that a Cytosine that is close to 50 methylated may be called as heterozygous for Cm & Cu.

Here we use novomethyl to produce a bed file showing methylation status of all cytosine’s that had sufficient read coverage to call the status.

novomethyl -h <biseq.mpileup.txt >Cmethyl.bed

The bed file has separate tracks for each context and methylation status. The score column indicates the quality of the call and strand indicates Watson/Crick location of the Cytosine.

gnl|Amel_4.0|Group10.1  91      92      CpGu    54      + 
gnl|Amel_4.0|Group10.1  92      93      CpGu    54      - 
gnl|Amel_4.0|Group10.1  94      95      CHHu    54      + 
gnl|Amel_4.0|Group10.1  95      96      CHGu    54      + 
gnl|Amel_4.0|Group10.1  97      98      CHGu    54      - 
gnl|Amel_4.0|Group10.1  98      99      CpGu    54      + 
gnl|Amel_4.0|Group10.1  99      100     CpGu(Het)    54      - 
gnl|Amel_4.0|Group10.1  101     102     CpGu    54      + 
gnl|Amel_4.0|Group10.1  102     103     CpGu    54      - 
gnl|Amel_4.0|Group10.1  104     105     CHHu    57      - 
gnl|Amel_4.0|Group10.1  106     107     CHHu    57      - 
gnl|Amel_4.0|Group10.1  108     109     CHHu    57      -

A log file is written to stderr with some statistics on Cytosine classifications. The counts of methylated/unmethylated cytosines only includes those where methylation status quality is greater than 5

#       Novomethyl (Beta.4.0) Run Report 
# (C) 2011 NovoCraft Technologies Sdn Bhd. 
# Licensed to Novocraft Technologies Internal Use Only 
# novomethyl -h mpileup.10.3.1M.txt  # Starting at Fri Mar 18 10:13:39 2011 
# 
#       Cytosine Classifications
# 
#       CT strand 
#              660      SNP (C in reference but not in reads.) 
#                0      Low Read Cover
#            14405      Low Quality 
#                0      No Read Cover 
# 
#       Homozygous Cytosines 
#       Context Methylated      Unmethylated    % Methylated 
#       CpG             49           65067             0.1% 
#       CHG              1           24732             0.0% 
#       CHH             12          102977             0.0% 
# 
#       Heterozygous Cytosines #       Context Methylated      Unmethylated    % Methylated 
#       CpG             33             389             7.8% 
#       CHG              1             170             0.6% 
#       CHH              5             646             0.8% 
# 
#       GA strand 
#              536      SNP (C in reference but not in reads.)
#                0      Low Read Cover 
#            13995      Low Quality 
#                0      No Read Cover 
# 
#       Homozygous Cytosines 
#       Context Methylated      Unmethylated    % Methylated 
#       CpG             57           65276             0.1% 
#       CHG              2           24750             0.0% 
#       CHH              5          101362             0.0% 
# 
#       Heterozygous Cytosines 
#       Context Methylated      Unmethylated    % Methylated 
#       CpG             21             359             5.5% 
#       CHG              0             162             0.0% 
#       CHH              1             629             0.2% 
# 
#       99.9%   Cu Conversion Efficiency (CT strand) 
#       99.9%   Cu Conversion Efficiency (GA strand) 
#        Note. Conversion efficiency may be inaccurate if the sample is heterogeneous. 
# Finished at Fri Mar 18 10:13:46 2011

Percent Methylation Report (BED format) for Heterogeneous Samples

Novomethyl is used to produce a bed file showing methylation percentage for each cytosine that had sufficient read coverage to call the status.

novomethyl -h -% <biseq.mpileup.txt >Cmethyl.bed

The bed file has separate tracks for each context and methylation status. The score column indicates the % of methylated cytosines and the strand indicates Watson/Crick location of the Cytosine. !

gnl|Amel_4.0|Group10.3  350     350     CHH     0       +
gnl|Amel_4.0|Group10.3  362     362     CHG     0       +
gnl|Amel_4.0|Group10.3  398     398     CHG     0       +
gnl|Amel_4.0|Group10.3  410     410     CHG     40      +
gnl|Amel_4.0|Group10.3  422     422     CHH     0       +
gnl|Amel_4.0|Group10.3  446     446     CHH     0       +
gnl|Amel_4.0|Group10.3  448     448     CHG     0       +

A log file is written to stderr. This includes a histogram of methylation rates for cytosines called with a quality >greater than -q setting and methylation status quality greater then 5.

#       Novomethyl (Beta.4.0) Run Report
# (C) 2011 NovoCraft Technologies Sdn Bhd.
# Licensed to Novocraft Technologies Internal Use Only
# novomethyl -% mpileup.10.3.1M.txt 
# Starting at Fri Mar 18 10:14:02 2011
#
#       Cytosine Classifications
#
#       CT strand
#              210      SNP (C in reference but not in reads.)
#            14799      Low Read Cover
#              264      Low Quality
#                0      No Read Cover
#       % Meth  CpG     CHG     CHH
#       0       65038   24759   103078
#       1       0       0       0
#       2       0       0       0
#       3       0       0       0
#       4       0       0       1
#       5       4       2       0
#       6       18      8       11
.
.
.
#       96      0       0       0
#       97      0       0       0
#       98      0       0       0
#       99      0       0       0
#       100     39      1       10
#
#       GA strand
#              208      SNP (C in reference but not in reads.)
#            14216      Low Read Cover
#              270      Low Quality
#                0      No Read Cover
#       % Meth  CpG     CHG     CHH
#       0       65284   24771   101513
#       1       1       0       0
#       2       0       0       0
#       3       0       0       0
.
.
.
#       97      0       0       0
#       98      0       0       0
#       99      0       0       0
#       100     50      2       5
#
#       99.9%   Cu Conversion Efficiency (CT strand)
#       99.9%   Cu Conversion Efficiency (GA strand)
#        Note. Conversion efficiency may be inaccurate if the sample is heterogeneous.
# Finished at Fri Mar 18 10:14:07 2011

The Cu Conversion efficiencies are tallied from Cytosines that are called as unmethylated.

Consensus & Methylation Report

In consensus report format the novomethyl report is similar to a samtools pileup report with added fields for cytosine context, status, quality and percent methylated.

novomethyl -h -o Consensus< <biseq.mpileup.txt >Cmethyl.txt

An example report is shown below followed by a description of the columns.

gnl|Amel_4.0|Group10.3  1401    T       T       65      .       .       .       .       .       .       7       T$T$TTTTT       =??ABBA 2       TT      CA
gnl|Amel_4.0|Group10.3  1402    T       T       58      .       .       .       .       .       .       5       TTTTT   B@BCB   2       TT      CB
gnl|Amel_4.0|Group10.3  1403    A       A       64      .       .       .       .       .       .       5       AAAAA   @?BBA   2       AA      @A
gnl|Amel_4.0|Group10.3  1404    C       C       56      CHGu    +       66      0       0       5       5       tTttt   BBCAB   2       CC      B@
gnl|Amel_4.0|Group10.3  1405    C       C       56      CpGu    +       66      0       0       5       5       tTttt   A@CCB   2       CC      B@
gnl|Amel_4.0|Group10.3  1406    G       G       63      CpGu    -       55      0       0       2       5       GGGGG   BABCB   2       AA      A@
gnl|Amel_4.0|Group10.3  1407    T       T       58      .       .       .       .       .       .       5       TTTTT   B:ACC   2       TT      BA
gnl|Amel_4.0|Group10.3  1408    A       A       64      .       .       .       .       .       .       5       AAAAA   @AB@@   2       AA      A@
gnl|Amel_4.0|Group10.3  1409    T       T       58      .       .       .       .       .       .       5       TTTTT   @ABAB   2       TT      CB
...
gnl|Amel_4.0|Group10.3  1553    T       C/T     66      CHGu    +       27      0       0       4       9       TTTTTTTTT       BBBCCC?@B       10      ccTcCcTTCT      CB?=BA8@A?
...
gnl|Amel_4.0|Group10.3  88622   A       A       82      .       .       .       .       .       .       8       AAAAAAA^~A      68?@AABB        2       AA      A?
gnl|Amel_4.0|Group10.3  88623   A       A       82      .       .       .       .       .       .       8       AAAAAAAA        8;BABBCC        2       AA      C7
gnl|Amel_4.0|Group10.3  88624   C       C       56      CHGu    +       84      0       0       8       8       tTttTTTt        7>1@B@C=        2       Cn      B!
gnl|Amel_4.0|Group10.3  88625   C       C       15      CpGm    +       10      75      6       8       8       CCCCTCTC        B2>CB@CA        2       CC      ;>
gnl|Amel_4.0|Group10.3  88626   G       G       84      CpGm    -       32      100     2       2       8       GGGGGGGG        A@9BBBC;        2       GG      AB
gnl|Amel_4.0|Group10.3  88627   A       A       88      .       .       .       .       .       .       9       AAAAAAAA^~A     @B<B@BB@B       2       AA      B=
gnl|Amel_4.0|Group10.3  88628   C       C       9       CpGu    +       4       67      6       9       9       CCCCTCTtC       B7CBAABCC       2       CC      B;
gnl|Amel_4.0|Group10.3  88629   G       G       90      CpGm    -       32      100     2       2       9       GGGGGGGGG       BAA=ABCBB       2       GG      AA
gnl|Amel_4.0|Group10.3  88630   A       A       88      .       .       .       .       .       .       9       AAAAAAAAA       A@BAA@BAB       2       AA      @?
gnl|Amel_4.0|Group10.3  88631   A       A       88      .       .       .       .       .       .       9       A$AAAAAAAA      A)BB<BBCC       2       AA      BA

A log file is also written to stderr. The log file is a combination of the status and methylation percentage reports. For sake of brevity an example is not given.

Consensus Report Columns

1.	RNAME	Reference sequence name from pileup
2.	POS	1-based Coordinate
3.	REF	Reference Sequence Base. May include IUB ambiguous codes if present in the mpileup.
4.	CONSENSUS	Consensus base, may show heterozygous base such as A/T
5.	QUAL	Quality of the consensus call
6.	CONTEXT	If it is a Cytosine we show the context as CpG, CHG or CHH together with methylation status indicator ‘u’ or ‘m’ if methylation quality is gretaer than 3.
7.	STRAND	Watson-Crick location of the cytosine is shown as +/-
8.	MQUAL	The quality of the methylation status given that we have a cytosine at this locus
9.	MPCNT	The % of cytosines that were not converted to T’s or a period if the number of samples is too low to accurately call a percentage.
10.	NU	The number of unconverted cytosines.
11.	NC	The total number of cytosines. If a site is called as heterozygous C/T then the lesser of half the total C&T bases or the total T bases are assumed to result from the heterozygous T and are not included in the NC count.
12.	NCT	The number of reads aligning on the Watson strand. The following 6 fields are directly from samtools mpileup file and we refer you to samtools documentation for further information.
13.	CTBASES	The bases and read mapping qualities from Watson strand.
14.	CTQUALITIES	The base qualities for Watson strand.
15.	NGA	Next 3 fields are as above for Crick strand.
16.	GABASES
17.	GAQUALITIES

Novomethyl Command Line Option

Usage:

novomethyl options < mpileup.txt Options (with default value shown):

-1	Sets Haploid mode. The sample is expected to be from a Haploid genome and and heterozygosity is not considered as a possible base state.
-2	Sets Diploid mode (default) and only reports Homozygous Cytosines. The sample is expected to be from a Diploid genome and and heterozygosity is considered as a possible base state. In BED format, only Homozygous cytosines have their methylation status or percentage reported.
-h	Turns on Diploid mode with reporting of both homozygous and heterozygous Cytosines. The sample is expected to be from a Diploid genome and and heterozygosity is considered as a possible base state. Homozygous & Heterozygous cytosines will have their methylation status or percentage reported.
-%	In BED format report % methylated rather than methylation status and quality. In this mode we are calling % methylation for the cytosine. This can be combined with the above Haploid & Diploid modes. If in Diploid mode (-h) and a locii is identified as Heterozygous C/[AGT] then % methylation is calculated on the C and can still show 100% methylated.
-o Consensus	Uses consensus sequence report format. This format contains much more information however you may have to filter and process it to make a track suitable for your genome browser.
-s 0.001	Sets prior probability of a SNP. Used to set prior probabilities that a base is different to the reference genome. The reference base gets a prior of 1.0 all other bases get prior of Psnp. Priors also take into account ambiguous nucleotides in the reference genome.
-e 0.98	Sets bisulphite conversion efficiency. Used in calculation of P(d_k\|H) for unconverted Cytosines. Refer to SOAP paper for more details. We suggest running a test and then adjusting this to match the reported rates especially if your sample is heterogeneous.
-m 0.5	Sets prior probability of Cytosine being methylated. Splits prior probability of Cytosine between methylated and unmethylated states so that prior of a methylated Cytosine is P(C).Pm
-q 10	Sets quality limit for reporting Cytosine methylation status.(Note. A quality of 3 is probability of 0.5 that Cytosine is methylated so it’s really an unknown status.) In both state and % mode a cytosine locii will not be reported if it’s quality is less than this threshold. In state mode this is quality of cm or Cu state, in percent mode it is the quality of the C call. Quality is calculated using a 6 base model (A,Cu,Cm,Gu,Gm,T) with the basic statistical models from SOAP ♦ “SNP detection for massively parallel whole-genome resequencing” (2009) Genome Res. published online May 6, 2009 , doi:10.1101/gr.088013.108
-Q 10	Lower limit on base quality for counting bases for % methylation and conversion rates. Base calls from mpileup file with quality less than this threshold will not be counted for determining % methylation or the conversion efficiency.
-c 1	Sets minimum number of reads covering a Cytosine (strand specific). In percent mode a Cytosine will not be called unless it has at least this minimum number of reads on CT or GA strand.
-d dbname	Indexed reference genome(optional). This is the novoindex of the reference genome used to produce the alignments. If provided it is used for determing the cytosine context (CpG, CHG, CHH).It helps resolve context near the edges of read cover where the mpileup file does not include all the references base need to establish the context.
-D	Debug mode (prints pileup details) The Consensus format report is possibly more useful.

Heterozygous C/G Locii

If a locus is identified as heterozygous for C&G Novomethyl will report the site only once. The consensus is identified as C/G with appropriate quality and then methylation state and percentage is reported for either the C or the G but not both. The strand column will indicate if it is the Watson or Crick cytosine that is reported. The strand that has the higher quality for the cytosine (C or G) call is reported.

Documentation