MACS: Model-based Analysis for ChIP-Seq

Status License Programming languages CI x64 CI non x64 CI MacOS

PyPIdownload

Latest Release:

  • Github: Github Release

  • PyPI: PyPI ReleasePyPI Python VersionPyPI Format

  • Anaconda: Anaconda-Server Badge

  • Debian Med: Debian StableDebian Unstable

Introduction

With the improvement of sequencing techniques, chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) is getting popular to study genome-wide protein-DNA interactions. To address the lack of powerful ChIP-Seq analysis method, we presented the Model-based Analysis of ChIP-Seq (MACS), for identifying transcript factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions and MACS improves the spatial resolution of binding sites through combining the information of both sequencing tag position and orientation. MACS can be easily used for ChIP-Seq data alone, or with a control sample with the increase of specificity. Moreover, as a general peak-caller, MACS can also be applied to any “DNA enrichment assays” if the question to be asked is simply: where we can find significant reads coverage than the random background.

Changes for MACS (3.0.1)

Bugs fixed

  1. Fixed a bug that the hmmatac can’t correctly save the digested signal files. #605 #611

  2. Applied a patch to remove cython requirement from the installed system. (it’s needed for building the package). #606 #612

  3. Relax the testing script while comparing the peaks called from current codes and the standard peaks. To implement this, we added ‘intersection’ function to ‘Regions’ class to find the intersecting regions of two Regions object (similar to PeakIO but only recording chromosome, start and end positions). And we updated the unit test ‘test_Region.py’ then implemented a script ‘jaccard.py’ to compute the Jaccard Index of two peak files. If the JI > 0.99 we would think the peaks called and the standard peaks are similar. This is to avoid the problem caused by different Numpy/SciPy/sci-kit learn libraries, when certain peak coordinates may have 10bps difference. #615 #619

  4. Due to the changes in scikit-learn 1.3.0, the way hmmlearn 0.3 uses Kmeans will end up with inconsistent results between sklearn <1.3 and sklearn >=1.3. Therefore, we patched the class hmm.GaussianHMM and adjusted the standard output from hmmratac subcommand. The change is based on hmmlearn PR#545. The idea is to do the random seeding of KMeans 10 times. Now the hmmratac results should be more consistent (at least JI>0.99). #615 #620

Other

  1. We added some dependencies to MACS3. hmmratc subcommand needs hmmlearn library, hmmlearn needs scikit-learn and scikit-learn needs scipy. Since major releases have happened for bothscipy and scikit-learn, we have to set specific version requirements for them in order to make sure the output results from hmmratac are consistent.

  2. We updated our documentation website using Sphinx. https://macs3-project.github.io/MACS/

Changes for MACS (3.0.0)

  1. Call variants in peak regions directly from BAM files. The function was originally developed under code name SAPPER. Now SAPPER has been merged into MACS as the callvar command. It can be used to call SNVs and small INDELs directly from alignment files for ChIP-seq or ATAC-seq. We call fermi-lite to assemble the DNA sequence at the enriched genomic regions (binding sites or accessible DNA) and to refine the alignment when necessary. We added simde as a submodule in order to support fermi-lite library under non-x64 architectures.

  2. HMMRATAC module is added as subcommand hmmratac. HMMRATAC is a dedicated software to analyze ATAC-seq data. The basic idea behind HMMRATAC is to digest ATAC-seq data according to the fragment length of read pairs into four signal tracks: short fragments, mono-nucleosomal fragments, di-nucleosomal fragments and tri-nucleosomal fragments. Then integrate the four tracks again using Hidden Markov Model to consider three hidden states: open region, nucleosomal region, and background region. The orginal paper was published in 2019 written in JAVA, by Evan Tarbell. We implemented it in Python/Cython and optimize the whole process using existing MACS functions and hmmlearn. Now it can run much faster than the original JAVA version. Note: evaluation of the peak calling results is still underway.

  3. Speed/memory optimization. Use the cykhash to replace python dictionary. Use buffer (10MB) to read and parse input file (not available for BAM file parser). And many optimization tweaks. We added memory monitoring to the runtime messages.

  4. R wrappers for MACS – MACSr for bioconductor.

  5. Code cleanup. Reorganize source codes.

  6. Unit testing.

  7. Switch to Github Action for CI, support multi-arch testing including x64, armv7, aarch64, s390x and ppc64le. We also test on Mac OS 12.

  8. MACS tag-shifting model has been refined. Now it will use a naive peak calling approach to find ALL possible paired peaks at + and - strand, then use all of them to calculate the cross-correlation. (a related bug has been fix #442)

  9. BAI index and random access to BAM file now is supported. #449.

  10. Support of Python > 3.10 #498

  11. The effective genome size parameters have been updated according to deeptools. #508

  12. Multiple updates regarding dependencies, anaconda built, CI/CD process.

  13. Cython 3 is supported.

  14. Documentations for each subcommand can be found under /docs

Other

  1. Missing header line while no peaks can be called #501 #502

  2. Note: different numpy, scipy, sklearn may give slightly different results for hmmratac results. The current standard results for automated testing in /test directory are from Numpy 1.25.1, Scipy 1.11.1, and sklearn 1.3.0.

Install

The common way to install MACS is through PYPI) or conda. Please check the INSTALL document for detail.

MACS3 has been tested using GitHub Actions for every push and PR in the following architectures:

  • x86_64 (Ubuntu 22, Python 3.9, 3.10, 3.11)

  • aarch64 (Ubuntu 22, Python 3.10)

  • armv7 (Ubuntu 22, Python 3.10)

  • ppc64le (Ubuntu 22, Python 3.10)

  • s390x (Ubuntu 22, Python 3.10)

  • Apple chips (Mac OS 13, Python 3.11)

In general, you can install through PyPI as pip install macs3. To use virtual environment is highly recommended. Or you can install after unzipping the released package downloaded from Github, then use pip install . command. Please note that, we haven’t tested installation on any Windows OS, so currently only Linux and Mac OS systems are supported. Also, for aarch64, armv7, ppc64le and s390x, due to some unknown reason potentially related to the scientific calculation libraries MACS3 depends on, such as Numpy, Scipy, hmm-learn, scikit-learn, the results from hmmratac subcommand may not be consistent with the results from x86 or Apple chips. Please be aware.

Usage

Example for regular peak calling on TF ChIP-seq:

macs3 callpeak -t ChIP.bam -c Control.bam -f BAM -g hs -n test -B -q 0.01

Example for broad peak calling on Histone Mark ChIP-seq:

macs3 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1

Example for peak calling on ATAC-seq (paired-end mode):

macs3 callpeak -f BAMPE -t ATAC.bam -g hs -n test -B -q 0.01

There are currently 14 functions available in MACS3 serving as sub-commands. Please click on the link to see the detail description of the subcommands.

Subcommand

Description

callpeak

Main MACS3 Function to call peaks from alignment results.

bdgpeakcall

Call peaks from bedGraph file.

bdgbroadcall

Call nested broad peaks from bedGraph file.

bdgcmp

Comparing two signal tracks in bedGraph format.

bdgopt

Operate the score column of bedGraph file.

cmbreps

Combine bedGraph files of scores from replicates.

bdgdiff

Differential peak detection based on paired four bedGraph files.

filterdup

Remove duplicate reads, then save in BED/BEDPE format file.

predictd

Predict d or fragment size from alignment results. In case of PE data, report the average insertion/fragment size from all pairs.

pileup

Pileup aligned reads (single-end) or fragments (paired-end)

randsample

Randomly choose a number/percentage of total reads, then save in BED/BEDPE format file.

refinepeak

Take raw reads alignment, refine peak summits.

callvar

Call variants in given peak regions from the alignment BAM files.

hmmratac

Dedicated peak calling based on Hidden Markov Model for ATAC-seq data.

For advanced usage, for example, to run macs3 in a modular way, please read the advanced usage. There is a Q&A document where we collected some common questions from users.

Contribute

Please read our CODE OF CONDUCT and How to contribute documents. If you have any questions, suggestion/ideas, or just want to have conversions with developers and other users in the community, we recommend using the MACS Discussions instead of posting to our Issues page.

Ackowledgement

MACS3 project is sponsored by CZI EOSS. And we particularly want to thank the user community for their supports, feedbacks and contributions over the years.

Citation

2008: Model-based Analysis of ChIP-Seq (MACS)