Advanced Step-by-step peak calling using MACS3 commands

Over the years, I have got many emails from users asking if they can analyze their X-Seq (not ChIP-Seq) data using MACS, or if they can turn on or off some features in callpeak for their special needs. In most of cases, I would simply reply that they may have to find more dedicated tool for the type of your data, because the callpeak module is specifically designed and tuned for ChIP-Seq data. However, MACS3 in fact contains a suite of subcommands and if you can design a pipeline to combine them, you can control every single step and analyze your data in a highly customized way. In this tutorial, I show how the MACS3 main function callpeak can be decomposed into a pipeline containing MACS3 subcommands, including filterdup, predictd, pileup, bdgcmp, bdgopt, and bdgpeakcall (or bdgbroadcall in case of broad mark). To analyze your special data in a special way, you may need to skip some of the steps or tweak some of the parameters of certain steps. Now let’s suppose we are dealing with the two testing files CTCF_ChIP_200K.bed.gz and CTCF_Control_200K.bed.gz, that you can find in MACS3 github repository.

Note, currently this tutorial is for single-end datasets. Please modify the command line for paired-end data by yourself.

Step 1: Filter duplicates

In the first step of ChIP-Seq analysis by callpeak, ChIP and control data need to be read and the redundant reads at each genomic loci have to be removed. I won’t go over the rationale, but just tell you how this can be done by filterdup subcommand. By default, the maximum number of allowed duplicated reads is 1, or --keep-dup=1 for callpeak. To simulate this behavior, do the following:

$ macs3 filterdup -i CTCF_ChIP_200K.bed.gz --keep-dup=1 -o CTCF_ChIP_200K_filterdup.bed $ macs3 filterdup -i CTCF_Control_200K.bed.gz --keep-dup=1 -o CTCF_Control_200K_filterdup.bed

You can set different number for --keep-dup or let MACS3 automatically decide the maximum allowed duplicated reads for each genomic loci for ChIP and control separately. Check macs3 filterdup -h for detail, and remember if you go with auto way, the genome size should be set accordingly. Note, in the output, MACS3 will give you the final number of reads kept after filtering, you’d better write the numbers down since we need them when we have to scale the ChIP and control signals to the same depth. In this case, the number is 199583 for ChIP and 199867 for control, and the ratio between them is 199583/199867=.99858

Step 2: Decide the fragment length d

This is an important step for MACS3 to analyze ChIP-Seq and also for other types of data since the location of sequenced read may only tell you the end of a DNA fragment that you are interested in (such as TFBS or DNA hypersensitive regions), and you have to estimate how long this DNA fragment is in order to recover the actual enrichment. You can also regard this as a data smoothing technic. You know that macs3 callpeak will output something like, it can identify certain number of pairs of peaks and it can predict the fragment length, or d in MACS3 terminology, using cross-correlation. In fact, you can also do this using predictd module. Normally, we only need to do this for ChIP data:

$ macs3 predictd -i CTCF_ChIP_200K_filterdup.bed -g hs -m 5 50

Here the -g (the genome size) need to be set according to your sample, and the mfold parameters have to be set reasonably. To simulate the default behavior of macs3 callpeak, set -m 5 50. Of course, you can tweak it. The output from predictd will tell you the fragment length d, and in this example, it is 254. Write the number down on your notebook since we will need it in the next step. Of course, if you do not want to extend the reads or you have a better estimation on fragment length, you can simply skip this step.

Step 3: Extend ChIP sample to get ChIP coverage track

Now you have estimated the fragment length, next, we can use MACS3 pileup subcommand to generate a pileup track in BEDGRAPH format for ChIP sample. Since we are dealing with ChIP-Seq data in this tutorial, we need to extend reads in 5’ to 3’ direction which is the default behavior of pileup function. If you are dealing with some DNAse-Seq data or you think the cutting site, that is detected by short read sequencing, is just in the middle of the fragment you are interested in, you need to use -B option to extend the read in both direction. Here is the command to simulate callpeak behavior:

$ macs3 pileup -i CTCF_ChIP_200K_filterdup.bed -o CTCF_ChIP_200K_filterdup.pileup.bdg --extsize 254

The file CTCF_ChIP_200K_filterdup.pileup.bdg now contains the fragment pileup signals for ChIP sample.

Step 4: Build local bias track from control

By default, MACS3 callpeak function computes the local bias by taking the maximum bias from surrounding 1kb (set by --slocal), 10kb (set by --llocal), the size of fragment length d (predicted as what you got from predictd), and the whole genome background. Here I show you how each of the bias is calculated and how they can be combined using the subcommands.

The d background

Basically, to create the background noise track, you need to extend the control read to both sides (-B option) using pileup function. The idea is that the cutting site from control sample contains the noise representing a region surrounding it. To do this, take half of d you got from predictd, 127 (1/2 of 254) for our example, then:

$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 127 -o d_bg.bdg

The file d_bg.bdg contains the d background from control.

The slocal background

Next, you can create a background noise track of slocal local window, or 1kb window by default. Simply imagine that each cutting site (sequenced read) represent a 1kb (default, you can tweak it) surrounding noise. So:

$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 500 -o 1k_bg.bdg

Note, here 500 is the 1/2 of 1k. Because the ChIP signal track was built by extending reads into d size fragments, we have to normalize the 1kb noise by multiplying the values by d/slocal, which is 254/1000=0.254 in our example. To do so, use the bdgopt subcommand:

$ macs3 bdgopt -i 1k_bg.bdg -m multiply -p 0.254 -o 1k_bg_norm.bdg

The file1k_bg_norm.bdg contains the slocal background from control. Note, we don’t have to do this for d background because the multiplier is simply 1.

The llocal background

The background noise from larger region can be generated in the same way as slocal backgound. The only difference is the size for extension. MACS3 callpeak by default asks for 10kb (you can change this value) surrounding window, so:

$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 5000 -o 10k_bg.bdg

The extsize has to be set as 1/2 of llocal. Then, the multiplier now is d/llocal, or 0.0254 in our example.

$ macs3 bdgopt -i 10k_bg.bdg -m multiply -p 0.0254 -o 10k_bg_norm.bdg

The file 10k_bg_norm.bdg now contains the slocal background from control.

The genome background

The whole genome background can be calculated as the_number_of_control_reads\fragment_length/genome_size, and in our example, it is \(199867*254/2700000000 ~= .0188023\). You don’t need to run subcommands to build a genome background track since it’s just a single value.

Combine and generate the maximum background noise

Now all the above background noises have to be combined and the maximum bias for each genomic location need be computed. This is the default behavior of MACS3 callpeak, but you can have your own pipeline to include some of them or even make more noise (such as 5k or 50k background) then include more tracks. Here is the way to combine and get the maximum bias.

Take the maximum between slocal (1k) and llocal (10k) background:

macs3 bdgcmp -m max -t 1k_bg_norm.bdg -c 10k_bg_norm.bdg -o 1k_10k_bg_norm.bdg

Then, take the maximum then by comparing with d background:

macs3 bdgcmp -m max -t 1k_10k_bg_norm.bdg -c d_bg.bdg -o d_1k_10k_bg_norm.bdg

Finally, combine with the genome wide background using bdgopt subcommand

macs3 bdgopt -i d_1k_10k_bg_norm.bdg -m max -p .0188023 -o local_bias_raw.bdg

Now the file local_bias_raw.bdg is a BEDGRAPH file containing the raw local bias from control data.

Step 5: Scale the ChIP and control to the same sequencing depth

In order to compare ChIP and control signals, the ChIP pileup and control lambda have to be scaled to the same sequencing depth. The default behavior in callpeak module is to scale down the larger sample to the smaller one. This will make sure the noise won’t be amplified through scaling and improve the specificity of the final results. In our example, the final number of reads for ChIP and control are 199583 and 199867 after filtering duplicates, so the control bias have to be scaled down by multiplying with the ratio between ChIP and control which is 199583/199867=.99858. To do so:

$ macs3 bdgopt -i local_bias_raw.bdg -m multiply -p .99858 -o local_lambda.bdg

Now, I name the output file as local_lambda.bdg since the values in the file can be regarded as the lambda (or expected value) and can be compared with ChIP signals using the local Poisson test.

Step 6: Compare ChIP and local lambda to get the scores in pvalue or qvalue

Next, to find enriched regions and predict the so-called ‘peaks’, the ChIP signals and local lambda stored in BEDGRAPH file have to be compared using certain statistic model. To do so, you need to use bdgcmp module, which will output score for each basepair in the genome. You may wonder it may need a huge file to save score for each basepair in the genome, however the format BEDGRAPH can deal with the problem by merging nearby regions with the same score. So theoratically, the size of the output file for scores depends on the complexity of your data, and the maximum number of data points, if d, slocal, and llocal background are all used, is the minimum value of the genome size and approximately (#read_ChIP+#reads_control\*3)\*2, in our case about 1.6 million. The command to generate score tracks is

$ macs3 bdgcmp -t CTCF_ChIP_200K_filterdup.pileup.bdg -c local_lambda.bdg -m qpois -o CTCF_ChIP_200K_qvalue.bdg or

$ macs3 bdgcmp -t CTCF_ChIP_200K_filterdup.pileup.bdg -c local_bias.bdg -m ppois -o CTCF_ChIP_200K_pvalue.bdg

The CTCF_ChIP_200K_pvalue.bdg or CTCF_ChIP_200K_qvalue.bdg file contains the -log10(p-value)s or -log10(q-value)s for each basepair through local Poisson test, which means the ChIP signal at each basepair will be tested against the corresponding local lambda derived from control with Poisson model. Note, if you follow this tutorial, then you won’t get any 0 in the local lambda track because the smallest value is the whole genome background; however if you do not include genome background, you will see many 0s in local lambda which will crash the Poisson test. In this case, you need to set the pseudocount for bdgcmp through -p option. The pseudocount will be added to both ChIP and local lambda values before Poisson test. The choice of pseudocount is mainly arbitrary and you can search on the web to see some discussion. But in general, higher the pseudocount, higher the specificity and lower the sensitivity.

Step 7: Call peaks on score track using a cutoff

The final task of peak calling is to just take the scores and call those regions higher than certain cutoff. We can use the bdgpeakcall function for narrow peak calling and bdgrroadcall for broad peak calling, and I will only cover the usage of bdgpeakcall in this tutorial. It looks simple however there are extra two parameters for the task. First, if two nearby regions are both above cutoff but the region in-between is lower, and if the region in-between is small enough, we should merge the two nearby regions together into a bigger one and tolerate the fluctuation. This value is set as the read length in MACS3 callpeak function since the read length represent the resolution of the dataset. As for bdgpeakcall, you need to get the read length information from Step 1 or by checking the raw fastq file and set the -g option. Secondly, we don’t want to call too many small peaks so we need to have a minimum length for the peak. MACS3 callpeak function automatically uses the fragment size d as the parameter of minimum length of peak, and as for bdgpeakcall, you need to set the -l option as the d you got from Step 2. Last, you have to set the cutoff value. Remember the scores in the output from bdgcmp are in -log10 form, so if you need the cutoff as 0.05, the -log10 value is about 1.3. The final command is like:

$ macs3 bdgpeakcall -i CTCF_ChIP_200K_qvalue.bdg -c 1.301 -l 245 -g 100 -o CTCF_ChIP_200K_peaks.bed

The output is in fact a narrowPeak format file (a type of BED file) which contains locations of peaks and the summit location in the last column.