bdgdiff
Overview
The bdgdiff
command is part of the MACS3 suite of tools and is used
to call differential peaks from four bedGraph tracks of scores,
including treatment and control track for each condition.
Detailed Description
The bdgdiff
command takes four input bedGraph files (two treatment
and two control files) and produces three output files with
differential peaks called. Users should provide paired four bedgraph
files: for each condition, a treatment pileup signal track in bedGraph
format, and a control lambda track in bedGraph format. This
differential calling can only handle one replicate per condition, so
if you have multiple replicates, you should either combine the
replicates when callpeak
, or choose other tool that can consider
within-group variation (such as DESeq2 or edgeR). The method we use to
define the differential peaks is based on multiple likelihood tests,
based on the poisson distribution. Suppose that we have two conditions
A and B, the unique binding sites in condition A over condition B
should be more likely to be a binding event in treatment A over
treatment B, and also more likely to be a real binding site in
condition A while comparing treatment A over control A; the unique
binding sites in condition B is defined in the same way; the common
peaks of both condition should be more likely to be a real binding
sites in condition A while comparing treatment A and control A, and in
condition B while comparing treatment B over control B, and also the
likelihood test while comparing treatment A and treatment B can’t
decide which condition is stronger.
The likelihood function we used while comparing two conditions: ChIP (enrichment) or control (chromatin bias) is:
Here \(LR\) is the likelihood ratio, x is the signal (fragment pileup) we observed in condition 1, and y is the signal in condition 2. And \(ln\) is the natural logarithm.
Note: All regions on the same chromosome in the bedGraph file should be continuous so only bedGraph files from MACS3 are acceptable.
Command Line Options
The command line options for bdgdiff
are defined in /MACS3/Commands/bdgdiff_cmd.py
and /bin/macs3
files. Here is a brief overview of these options:
--t1
: MACS pileup bedGraph for condition 1. Incompatible with callpeak –SPMR output. REQUIRED--t2
: MACS pileup bedGraph for condition 2. Incompatible with callpeak –SPMR output. REQUIRED--c1
: MACS control lambda bedGraph for condition 1. Incompatible with callpeak –SPMR output. REQUIRED--c2
: MACS control lambda bedGraph for condition 2. Incompatible with callpeak –SPMR output. REQUIRED-C
or--cutoff
: log10LR cutoff. Regions with signals lower than the cutoff will not be considered as enriched regions. DEFAULT: 3 (likelihood ratio=1000)-l
or--min-len
: Minimum length of the differential region. Try a bigger value to remove small regions. DEFAULT: 200-g
or--max-gap
: Maximum gap to merge nearby differential regions. Consider a wider gap for broad marks. The maximum gap should be smaller than the minimum length (-g). DEFAULT: 100--d1
or--depth1
: Sequencing depth (# of non-redundant reads in million) for condition 1. It will be used together with –d2. See the description for –d2 below for how to assign them. Default: 1--d2
or--depth2
: Sequencing depth (# of non-redundant reads in million) for condition 2. It will be used together with –d1. DEPTH1 and DEPTH2 will be used to calculate the scaling factor for each sample, to down-scale the larger sample to the level of the smaller one. For example, while comparing 10 million condition 1 and 20 million condition 2, use –d1 10 –d2 20, then the pileup value in bedGraph for condition 2 will be divided by 2. Default: 1--verbose
: Set the verbose level of runtime messages. 0: only show critical messages, 1: show additional warning messages, 2: show process information, 3: show debug messages. DEFAULT: 2--outdir
: If specified, all output files will be written to that directory. Default: the current working directory--o-prefix
: Output file prefix. Actual files will be named as PREFIX_cond1.bed, PREFIX_cond2.bed, and PREFIX_common.bed. Mutually exclusive with -o/–ofile.-o
or--ofile
: Output filenames. Must give three arguments in order: 1. file for unique regions in condition 1; 2. file for unique regions in condition 2; 3. file for common regions in both conditions. Note: mutually exclusive with –o-prefix.
Example Usage
Here is an example of how to use the bdgdiff
command:
macs3 bdgdiff -t1 treatment1.bedGraph -c1 control1.bedGraph -t2 treatment2.bedGraph -c2 control2.bedGraph --depth1 1.0 --depth2 1.0 -o output.bedGraph --minlen 500 --maxgap 1000 --cutoff 1.0
In this example, the program will call differential peaks from the two
pairs of treatment and control bedGraph files and write the result to
output.bedGraph
. The depth of the first and second condition is set
to 1.0, the minimum length of differential peaks is set to 500, the
maximum gap between differential peaks is set to 1000, and the cutoff
for log10LR to call differential peaks is set to 1.0 (or likelihood
ratio 10).
Step-by-step Instruction for calling differential peaks
In this chatper, we will describe how to use MACS3 to identify
differential regions by comparing pileup tracks of two conditions,
starting from the alignment files. Two modules will be involved:
callpeak
and bdgdiff
( predictd
is optional ). We will use human
ChIP-seq data as example, and filenames of raw data are:
cond1_ChIP.bam and cond1_Control.bam for condition 1; cond2_ChIP.bam
and cond2_Control.bam for condition 2.
Step 1: Generate pileup tracks using callpeak module
Purpose of this step is to use callpeak
with -B option to generate
bedGraph files for both conditions. There are several things to
remember: 1. --SPMR
is not compatible with bdgdiff
, so avoid using
it; 2. prepare a pen to write down the number of non-redundant reads
of both conditions – you will find such information in runtime
message or xls output from callpeak
; 3. keep using the same
--extsize
for both conditions ( you can get it from predictd
module).
To get a uniform extension size for running callpeak
, run predictd
:
$ macs3 predictd -i cond1_ChIP.bam
$ macs3 predictd -i cond2_ChIP.bam
An easy solution is to use the average of two ‘fragment size’
predicted in callpeak
, however any reasonable value will work. For
example, you can use 200
for most ChIP-seq datasets for
transcription factors, or ‘’147’’ for most histone modification
ChIP-seq. The only requirement is that you have to keep using the same
extsize for the following commands:
$ macs3 callpeak -B -t cond1_ChIP.bam -c cond1_Control.bam -n cond1 --nomodel --extsize 120
$ macs3 callpeak -B -t cond2_ChIP.bam -c cond2_Control.bam -n cond2 --nomodel --extsize 120
Pay attention to runtime message, or extract the “tags after filtering in treatment” and “tags after filtering in control” lines from xls to see the effective sequencing depths for both conditions. In our previous command lines, ‘–to-large’ is not used, so the effective sequencing depth is the smaller number of treatment and control. For example:
$ egrep "tags after filtering in treatment|tags after filtering in control" cond1_peaks.xls
# tags after filtering in treatment: 19291269
# tags after filtering in control: 12914669
$ egrep "tags after filtering in treatment|tags after filtering in control" cond2_peaks.xls
# tags after filtering in treatment: 19962431
# tags after filtering in control: 14444786
Then actual effective depths of condition 1 and 2 are: 12914669 and 14444786. Keep record of these two numbers and we will use them later. After successfully running ‘’’callpeak’’’, you will have ‘’cond1_treat_pileup.bdg’’, ‘’cond1_control_lambda.bdg’’, ‘’cond2_treat_pileup.bdg’’, and ‘’cond2_control_lambda.bdg’’ in the working directory.
Step 2: Call differential regions
The purpose of this step is to do a three ways comparisons to find out where in the genome has differential enrichment between two conditions. A basic requirement is that this region should be at least enriched in either condition. A log10 likelihood ratio cutoff (C) will be applied in this step. Three types of differential regions will be reported: 1. those having more enrichment in condition 1 over condition 2 ( cond1_ChIP > cond1_Control and cond1_ChIP > cond2_ChIP ); 2. those having more enrichment in condition 2 over condition 1 ( cond2_ChIP > cond2_Control and cond2_ChIP > cond1_ChIP ); those having similar enrichment in both conditions ( cond1_ChIP > cond1_Control and cond2_ChIP > cond2_Control and cond1_ChIP ≈ cond1_ChIP ).
Run this:
$ macs3 bdgdiff --t1 cond1_treat_pileup.bdg --c1 cond1_control_lambda.bdg --t2 cond2_treat_pileup.bdg\
--c2 cond2_control_lambda.bdg --d1 12914669 --d2 14444786 -g 60 -l 120 --o-prefix diff_c1_vs_c2
You will get the following three files in working directory:
diff_c1_vs_c2_c3.0_cond1.bed
This file stores regions that are highly enriched in condition 1 comparing to condition 2. The last column in the file represent the log10 likelihood ratio to show how likely the observed signal in condition 1 in this region is from condition 1 comparing to condition 2. The higher the value, the greater the difference.diff_c1_vs_c2_c3.0_cond2.bed
This file stores regions that are highly enriched in condition 2 comparing to condition 1. The last column in the file represent the log10 likelihood ratio to show how likely the observed signal in condition 2 in this region is from condition 2 comparing to condition 1. Higher the value, bigger the difference.diff_c1_vs_c2_c3.0_common.bed
This file stores regions that are highly enriched in both condition 1 and condition 2, and the difference between condition 1 and condition 2 is not significant. The last column in the file represent the difference between condition 1 and condition 2 in log10 likelihood ratios.