# Advanced Step-by-step peak calling using MACS3 commands

Over the years, many users have asked us questions about whether they
can analyze their X-Seq (not ChIP-Seq) data using MACS, or customize
the `callpeak` function to suit their specific needs. Typically, we
recommend finding a tool that's more suited to their data type, as
`callpeak` is specifically optimized for ChIP-Seq. However, MACS3 does
offer a range of subcommands that allow you to customize every step of
your analysis. In this tutorial, I will demonstrate how to break down
the main `callpeak` function into a pipeline using various MACS3
subcommands, such as `filterdup`, `predictd`, `pileup`, `bdgcmp`,
`bdgopt`, and `bdgpeakcall` (or `bdgbroadcall` for broad marks). This
approach allows you to adjust or skip steps and modify parameters to
analyze your data in a highly customized way. Additionally, you will
have a complete idea of how `callpeak` works. We'll use two test
files, `CTCF_ChIP_200K.bed.gz` and `CTCF_Control_200K.bed.gz`, which
you can find in the MACS3 GitHub repository in the `test` directory.

*Note, currently this tutorial is mainly for single-end
datasets. Please read the tutorial carefully and modify the command
line for paired-end data accordingly.*

## Step 1: Filter duplicates

In the initial step of ChIP-Seq analysis with `callpeak`, we read both
ChIP and control data and remove redundant reads from each genomic
location. While we won't delve into the reasons behind this process,
we'll explain how you can accomplish this using the `filterdup`
subcommand. By default, `callpeak` permits only one duplicate read per
location, which is set with `--keep-dup=1`. To replicate this setting,
you can do the following:

```
$ macs3 filterdup -i CTCF_ChIP_200K.bed.gz --keep-dup=1 -o CTCF_ChIP_200K_filterdup.bed`
$ macs3 filterdup -i CTCF_Control_200K.bed.gz --keep-dup=1 -o CTCF_Control_200K_filterdup.bed`
```

You can choose a different setting for `--keep-dup` or allow MACS3 to
automatically determine the maximum number of allowed duplicated reads
for each genomic loci for ChIP and control separately. For more
details, check `macs3 filterdup -h`. If you opt for the `--keep-dup
auto` setting, ensure the genome size is set appropriately. MACS3 will
report the final number of reads retained after filtering. It’s
crucial to record these numbers as we need them to normalize the ChIP
and control signals to the same depth. In this example, the numbers
are 199,583 for ChIP and 199,867 for control, resulting in a ratio of
approximately 0.99858. The output files from `filterdup` are in BED
format. Please note that if you are using Paired-end data in BAM
format as input and specify the `-f BAMPE` option, you will get the
output in `BEDPE` format. Check [this document for detail](./BEDPE.md)
on this `BEDPE` format.

## Step 2: Decide the fragment length `d`

This is a crucial step for analyzing ChIP-Seq with MACS3, as well as
other types of data. The location of a sequenced read typically
indicates only the end of a DNA fragment of interest originated from
such as transcription factor binding sites (TFBS) or DNA
hypersensitive sites. To accurately determine the enrichment, you
must estimate the actual length of this DNA fragment. This process can
also be seen as a data smoothing technique. With `macs3 callpeak`, the
output will typically include the identification of a certain number
of peak pairs and the predicted fragment length, denoted as `d` in
MACS3 terminology. This can also be accomplished using the `predictd`
subcommand, which we need to apply only to ChIP data:

```
$ macs3 predictd -i CTCF_ChIP_200K_filterdup.bed -g hs -m 5 50
```

In this process, the `-g` (genome size) needs to be set based on your
sample, and the mfold parameters for `-m` must be reasonably
established. To mimic the default behavior of macs3 `callpeak`, use
`-m 5 50`. Of course, you can adjust these parameters as needed. The
output from `predictd` will provide the fragment length `d`, which in
this example is `254`. Make sure to write this number down, as we'll
need it in the next step. However, if you prefer not to extend the
reads or if you have a more accurate estimate of the fragment length,
you can skip this step. 

Please note that for paired-end data, this step is still necessary
since this function will tell us the average fragment length defined
by the mapping locations of read pairs. For example, if you run this
on the `CTCF_PE_ChIP_chr22_50k.bedpe.gz` file in the test directory:

```
$ macs3 predictd -f BEDPE -i CTCF_PE_ChIP_chr22_50k.bedpe.gz
```

This function will print out that `# Average insertion length of all
pairs is 253 bps`. You should write down this number `253` for
building the control bias track for pair-end data in Step 4.

## Step 3: Extend ChIP sample to get ChIP coverage track

Now that you've estimated the fragment length, we can proceed to
generate a pileup track for the ChIP sample using the MACS3 `pileup`
subcommand. The following command replicates the behavior of callpeak:

`
$ macs3 pileup -i CTCF_ChIP_200K_filterdup.bed -o CTCF_ChIP_200K_filterdup.pileup.bdg --extsize 254
`

This command produces a file in BEDGRAPH format,
`CTCF_ChIP_200K_filterdup.pileup.bdg`, which contains the fragment
pileup signals for the ChIP sample. We use `--extsize 254`, where
`254` is the number obtained from the `predictd` step. In ChIP-Seq
data processing, as demonstrated in this tutorial, we extend reads in
the 5' -> 3' direction, which is the default behavior of the `pileup`
function.

If you are analyzing DNAse-Seq data, or if the cutting site detected
by short read sequencing (the 5' mapping location of the read) is
believed to be centrally located within the DNA region of interest,
you should use the `-B` option. This setting allows for the extension
of the 5' location in both directions. For example, `-B 100` would
extend the 5' cutting site on each side by 100 base pairs (200bps in
total) before generating the pileup signal.

For paired-end data, there is no need to specify `--extsize` as
`pileup` will automatically pile up the entire region between the two
5' mapping locations of the read pair. However, it is crucial to use
`-f BEDPE` or `-f BAMPE` to indicate to MACS3 that the input file
contains paired-end data.

## Step 4: Build local bias track from control

By default, the MACS3 `callpeak` function calculates local bias by
considering the maximum bias from the surrounding 1kb (set by
`--slocal`), 10kb (set by `--llocal`), the fragment length `d`
(predicted from the `predictd` function), and the whole genome
background. In this section, we demonstrate how each bias is
calculated and how they can be combined using the subcommands. 

Please note that, for paired-data data, we should treat the control as
single-end data by NOT specifying `-f BAMPE` or `-f BEDPE`. Also, we
should use the average fragment length from Step 2 as `d` in the
following steps.

### The `d` background

To create the background noise track, extend the control read to both
sides using the `-B` option in the `pileup` function. This approach
assumes that the cutting site from the control sample represent
certain noise from the exact same location. To accomplish this, use
half of `d` obtained from the `predictd` module. For instance, with
`d` equaling 254, use 127 (half of 254) as follows:

```
$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 127 -o d_bg.bdg
```

The BEDGRAPH file `d_bg.bdg` contains the background noise data (`d`
background) from the control sample.

### The slocal background

Next, you can create a background noise track for the `slocal`bps
local window, or a 1kb window by default. Imagine that each cutting
site represents a surrounding noise of 1kb (this value is
adjustable). Use the following command:

```
$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 500 -o 1k_bg.bdg
```

Here, 500 represents half of 1k, as the default `slocal` for
`callpeak`. Since the ChIP signal track was constructed by extending
reads to `d` size fragments, we need to normalize the 1kb noise by
multiplying the values by `d/slocal`, which is 254/1000 = 0.254 in our
example. To do this, apply the `bdgopt` subcommand:

```
$ macs3 bdgopt -i 1k_bg.bdg -m multiply -p 0.254 -o 1k_bg_norm.bdg
```

The file `1k_bg_norm.bdg` contains the normalized slocal background
from the control sample. Note, normalization for the `d` background is
not necessary since the multiplier is simply 1.

### The llocal background

The background noise from a larger region can be generated similarly
to the previous approach for the slocal background, with the only
difference being the extension size. By default, MACS3 `callpeak` uses
a 10kb surrounding window, but this value can be adjusted:

```
$ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 5000 -o 10k_bg.bdg
```

Here, the `extsize` should be set as half of the llocal, which is 5000
in this case. The appropriate multiplier is `d/llocal`, or 0.0254 in
our example:

```
$ macs3 bdgopt -i 10k_bg.bdg -m multiply -p 0.0254 -o 10k_bg_norm.bdg
```

The file `10k_bg_norm.bdg` now contains the normalized llocal
background from the control.

### The genome background

The whole genome background is calculated using the formula:
`number_of_control_reads * fragment_length / genome_size`. In our
example, this calculation would be:

```
199867 * 254 / 2700000000 ≈ 0.0188023
```

You don't need to execute any subcommands to create a genome
background track, as it's represented by just a single value.

### Combine and generate the maximum background noise

To compute the maximum bias for each genomic location, you can follow
the default behavior of MACS3 `callpeak` or customize your pipeline to
include additional noise backgrounds (such as 5k or 50k). Here's how
to combine the background noises and determine the maximum bias:

First, take the maximum between the slocal (1k) and llocal (10k)
backgrounds:

```
macs3 bdgcmp -m max -t 1k_bg_norm.bdg -c 10k_bg_norm.bdg -o 1k_10k_bg_norm.bdg
```

Next, compare this maximum with the `d` background:

```
macs3 bdgcmp -m max -t 1k_10k_bg_norm.bdg -c d_bg.bdg -o d_1k_10k_bg_norm.bdg
```

Finally, combine this with the genome-wide background using the
`bdgopt` subcommand:

```
macs3 bdgopt -i d_1k_10k_bg_norm.bdg -m max -p .0188023 -o local_bias_raw.bdg
```

The resulting file `local_bias_raw.bdg` is a BEDGRAPH file containing
the maximum local bias from control data.

## Step 5: Scale the ChIP and control to the same sequencing depth

To ensure accurate comparison between ChIP and control signals, both
must be scaled to the same sequencing depth. The `callpeak` module
typically scales down the larger sample to match the smaller one,
preventing inflation of smaller values and enhancing the specificity
of the results. In our example, after duplicate filtering, the final
read counts are 199,583 for ChIP and 199,867 for control. Therefore,
the control bias needs to be scaled down using the ratio between ChIP
and control, which is 199,583/199,867 = 0.99858. To accomplish this:


```
$ macs3 bdgopt -i local_bias_raw.bdg -m multiply -p .99858 -o local_lambda.bdg
```

The output file is named `local_lambda.bdg`, as it contains values
that represent the lambda (or expected value), which can be compared
with ChIP signals using the local Poisson test.

## Step 6: Compare ChIP and local lambda to get the scores in pvalue or qvalue

To identify enriched regions and predict peaks, the ChIP signals and
local lambda stored in the BEDGRAPH file must be compared using a
statistical model. This is done using the `bdgcmp` module, which
outputs a score for each base pair in the genome. Despite potentially
large data sizes, the BEDGRAPH format efficiently manages this by
merging nearby regions with identical scores. Theoretically, the size
of the output file for scores depends on the complexity of your
data. The maximum number of data points, when using `d`, `slocal`, and
`llocal` backgrounds, is the minimum between the genome size and
approximately `(number_of_ChIP_reads + number_of_control_reads*3)*2`,
which in our case is about 1.6 million. 

The commands to generate score tracks are:

```
$ macs3 bdgcmp -t CTCF_ChIP_200K_filterdup.pileup.bdg -c local_lambda.bdg -m qpois -o CTCF_ChIP_200K_qvalue.bdg
```

or

```
$ macs3 bdgcmp -t CTCF_ChIP_200K_filterdup.pileup.bdg -c local_lambda.bdg -m ppois -o CTCF_ChIP_200K_pvalue.bdg
```

The `CTCF_ChIP_200K_pvalue.bdg` or `CTCF_ChIP_200K_qvalue.bdg` file
contains the `-log10(p-values)` or `-log10(q-values)` for each base
pair, derived through a local Poisson test. This means the ChIP signal
at each base pair will be compared against the corresponding local
lambda from the control using a Poisson model. *Note*, following this
tutorial ensures that there are no zeros in the local lambda track, as
the smallest value is the whole genome background. However, if the
genome background is not included, many zeros in the local lambda can
disrupt the Poisson test. In such cases, you need to set the
`pseudocount` for `bdgcmp` using the `-p` option. This pseudocount
will be added to both ChIP and local lambda values before the
test. The choice of pseudocount is largely arbitrary, and you may find
various discussions online. Generally, a higher pseudocount increases
specificity but decreases sensitivity.

## Step 7: Call peaks on score track using a cutoff

The final step in peak calling is to identify regions that surpass a
specific score cutoff using the `bdgpeakcall` function for narrow peak
calling. Although this process may seem straightforward, it involves
two additional parameters:

1. **Merging Nearby Regions**: If two regions exceed the cutoff and
   are separated by a smaller, lower-scoring region, they should be
   merged into a single larger region to account for
   fluctuations. This merging threshold is set as the read length in
   the MACS3 `callpeak` function, reflecting the dataset's
   resolution. For `bdgpeakcall`, you must obtain the read length from
   Step 1 or by examining the raw fastq file and set this with the
   `-g` option.

2. **Minimum Peak Length**: To avoid calling excessively small peaks,
   a minimum peak length is required. The MACS3 `callpeak` function
   automatically uses the fragment size `d` for this purpose. In
   `bdgpeakcall`, set the `-l` option to the `d` value determined in
   Step 2.

Finally, set the cutoff value. Remember, the scores from `bdgcmp` are
in -log10 format. For example, to set a cutoff of 0.05, use a -log10
value of approximately 1.3. The command is as follows:


```
$ macs3 bdgpeakcall -i CTCF_ChIP_200K_qvalue.bdg -c 1.301 -l 245 -g 100 -o CTCF_ChIP_200K_peaks.bed
```

The output is essentially a narrowPeak format file (a type of BED
file), which includes the locations of peaks with the summit location
noted in the last column. If you want to explore how to better decide
the cutoff, please read the [tutorial for cutoff analysis](./cutoffanalysis.md)