callpeak output XLS format

This plain text file format is used as the standard output from callpeak command. It’s not a Microsoft XLS format. We use the suffix as.xls because we can trick the OS to open this plain text file with Microsoft Excel. This is a legacy feature from early MACS. In fact, this is a tsv format file with commented header lines.

Header lines

The commented header lines starting with ‘#’ include important runtime messages from macs3 callpeak command that have been used to generate this output file. Here is an example from a run based on the data in the test directory of MACS3 GitHub repository.

# This file is generated by MACS version 3.0.2
# Command line: callpeak -t CTCF_12878_5M.bed.gz -c Input_12878_5M.bed.gz -n speedtest -B --trackline
# ARGUMENTS LIST:
# name = speedtest
# format = AUTO
# ChIP-seq file = ['CTCF_12878_5M.bed.gz']
# control file = ['Input_12878_5M.bed.gz']
# effective genome size = 2.91e+09
# band width = 300
# model fold = [5, 50]
# qvalue cutoff = 5.00e-02
# The maximum gap between significant sites is assigned as the read length/tag size.
# The minimum length of peaks is assigned as the predicted fragment length "d".
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 1000 bps and 10000 bps
# Broad region calling is off
# Paired-End mode is off

# tag size is determined as 36 bps
# total tags in treatment: 4999978
# tags after filtering in treatment: 4636124
# maximum duplicate tags at the same position in treatment = 1
# Redundant rate in treatment: 0.07
# total tags in control: 4999975
# tags after filtering in control: 4895150
# maximum duplicate tags at the same position in control = 1
# Redundant rate in control: 0.02
# d = 101
# alternative fragment length(s) may be 101 bps

The first paragraph contains major parameters for this run, and the second paragraph contains important statistics of the dataset. The numbers for ‘tags after filtering’ or ‘fragments after filtering’ (for PE data), after removing duplicates, are considerred as the library size when MACS3 tries to estimate genome-wide background or to normalize signals. You can also find the estimated fragment size, for single-end dataset; for PE data, you will see the average insert length of all the read pairs after filtering.

Content of peak calls

After the commented header lines, you will see the content of peak calls. The content will start with a header line describing each columns:

chr     start   end     length  abs_summit      pileup  -log10(pvalue)  fold_enrichment -log10(qvalue)  name

The columns include:

  1. chromosome name - Name of the chromosome (or contig, scaffold, etc.), as used in the input alignment file for MACS3.

  2. start position of peak - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 1 which is different with BED-style file but similar to GFF-style file.

  3. end position of peak - The ending position of the feature in the chromosome or scaffold. The chromEnd base is included in the peak which is different with BED-style file but similar to GFF-style file.

  4. length of peak region - =end-start+1.

  5. absolute peak summit position - This is the field we record the so-called peak summit location, where the enrichment is the highest in the peak region. This is the absolute location of summit, different with the relative summit location in narrowPeak format.

  6. pileup height at peak summit

  7. p-score - It’s the -log10(pvalue) from local Poisson test for the peak summit (e.g. pvalue =1e-10, then this value should be 10)

  8. fold enrichment for this peak summit against random Poisson distribution with local lambda,

  9. q-score - It’s the -log10(qvalue) at peak summit. We used Benjamini-horchberg process over the whole genome, treating the p-value calculation on each basepair as an independent test, to get the FDR and q-value.

Note: When --broad is enabled for broad peak calling, the pileup, p-value, q-value, and fold change in this file will be the mean value across the entire peak region, since peak summit won’t be called in broad peak calling mode.