MACS3.Signal.PairedEndTrack module

Module for filtering duplicate tags from paired-end data.

This code is free software; you can redistribute it and/or modify it under the terms of the BSD License (see the file LICENSE included with the distribution).

class MACS3.Signal.PairedEndTrack.PETrackI(anno='', buffer_size=100000)

Bases: object

In-memory paired-end fragment container grouped by chromosome.

The track exposes utilities for sorting, filtering, downsampling, and pileup generation on numpy structured arrays of left/right coordinates.

add_loc(chromosome, start, end)

Append a paired-end fragment to the track.

Parameters:
  • chromosome (bytes) – Chromosome name (as bytes) that owns the fragment.

  • start (int) – Zero-based start coordinate of the fragment (5’ end).

  • end (int) – Zero-based end coordinate of the fragment (3’ end).

Notes

Fragments are stored in structured numpy arrays keyed by chromosome, and the running fragment count and total template length are updated in place.
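
The bookkeeping described in this note can be pictured with a small pure-Python sketch. It does not import MACS3: the class name ToyTrack is hypothetical, plain lists stand in for the structured numpy arrays, and the attribute names simply mirror the ones listed for this class.

```python
from collections import defaultdict


class ToyTrack:
    """Hypothetical stand-in for a paired-end track; not the MACS3 class."""

    def __init__(self):
        # chromosome name (bytes) -> list of (left, right) tuples
        self.locations = defaultdict(list)
        self.total = 0   # running fragment count
        self.length = 0  # running sum of template lengths

    def add_loc(self, chromosome, start, end):
        # Store the fragment and update both counters in place.
        self.locations[chromosome].append((start, end))
        self.total += 1
        self.length += end - start


track = ToyTrack()
track.add_loc(b"chr1", 100, 250)
track.add_loc(b"chr1", 120, 300)
track.add_loc(b"chr2", 10, 60)
print(track.total, track.length)  # 3 380
```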

annotation = None
average_template_length = None
buf_size = None
buffer_size = None
count_fraglengths()

Count observed fragment lengths across the track.

Returns:

Mapping from fragment length to observed count, useful for downstream models such as HMMRATAC.

Return type:

dict
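
As an illustration of the returned mapping, the following sketch counts fragment lengths with the standard library. The fragments dictionary is made-up example data and the code does not call MACS3.

```python
from collections import Counter

# Hypothetical fragments: chromosome (bytes) -> list of (start, end)
fragments = {
    b"chr1": [(100, 250), (120, 270), (500, 700)],
    b"chr2": [(10, 160)],
}

# Count how often each fragment length (end - start) occurs.
length_counts = Counter(
    end - start
    for frags in fragments.values()
    for start, end in frags
)
print(dict(length_counts))  # {150: 3, 200: 1}
```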

destroy()

Release numpy buffers held by the track.

All per-chromosome arrays are resized to zero so the memory footprint is returned to the allocator, and the track is marked as destroyed.

exclude(regions)

Remove fragments that overlap the provided exclusion regions.

Parameters:

regions (MACS3.Signal.Region.Regions) – Sorted region collection whose intervals should be excluded.

Notes

The operation mutates the track in place and finishes by calling finalize() to refresh cached statistics.
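
A rough sketch of the filtering idea follows. It assumes half-open intervals and a simple pairwise overlap test, returns a new list instead of mutating in place, and uses a hypothetical helper name (exclude_fragments); the actual method may differ in all of these respects.

```python
def exclude_fragments(fragments, regions):
    """Return the fragments that overlap none of the regions.

    Both arguments are lists of (start, end) tuples treated as
    half-open intervals (an assumption of this sketch).
    """
    kept = []
    for start, end in fragments:
        overlaps = any(
            start < r_end and r_start < end for r_start, r_end in regions
        )
        if not overlaps:
            kept.append((start, end))
    return kept


fragments = [(100, 200), (150, 300), (400, 500), (900, 950)]
regions = [(180, 220), (450, 460)]
kept = exclude_fragments(fragments, regions)

# Refresh the cached statistics after filtering.
total = len(kept)
length = sum(end - start for start, end in kept)
print(kept, total, length)  # [(900, 950)] 1 50
```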

filter_dup(maxnum=-1)

Limit the number of duplicate fragments at identical coordinates.

Parameters:

maxnum (int, optional) – Maximum number of fragments allowed per unique (start, end) pair. A negative value disables duplicate filtering.

Notes

Fragments exceeding maxnum for the same coordinates are dropped and the aggregate template length is adjusted accordingly.
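
The effect of maxnum can be sketched in plain Python as below. The helper name cap_duplicates is hypothetical, the input is assumed to be sorted so that identical (start, end) pairs are adjacent, and the code is an illustration of the rule, not the MACS3 implementation.

```python
def cap_duplicates(sorted_fragments, maxnum=-1):
    """Keep at most maxnum copies of each (start, end) pair.

    The input must be sorted so that duplicates sit next to each other.
    A negative maxnum disables filtering.
    """
    if maxnum < 0:
        return list(sorted_fragments)
    kept = []
    previous = None
    run = 0  # number of consecutive copies of the current pair seen so far
    for frag in sorted_fragments:
        run = run + 1 if frag == previous else 1
        previous = frag
        if run <= maxnum:
            kept.append(frag)
    return kept


fragments = sorted([
    (100, 200), (100, 200), (100, 200),
    (100, 250),
    (300, 400), (300, 400),
])
unique = cap_duplicates(fragments, maxnum=1)
print(unique)  # [(100, 200), (100, 250), (300, 400)]

# The aggregate template length shrinks along with the dropped fragments.
print(sum(end - start for start, end in unique))  # 350
```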

finalize()

Shrink backing arrays and sort fragments in place.

Each per-chromosome array is resized to the observed fragment count, sorted by the left and right coordinates, and the aggregate counters total and average_template_length are refreshed. Call this after loading data.
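
The sort-and-refresh step might look like the following in plain Python. Lists of tuples replace the structured arrays (so there is nothing to resize), and the variable names are chosen to mirror the attributes listed for this class; treat it as a sketch, not the actual code.

```python
# Hypothetical unsorted fragments per chromosome.
locations = {
    b"chr1": [(300, 450), (100, 250), (100, 200)],
    b"chr2": [(50, 100)],
}

# Tuples compare by left coordinate first, then by right coordinate.
for frags in locations.values():
    frags.sort()

# Refresh the aggregate counters.
total = sum(len(frags) for frags in locations.values())
length = sum(end - start for frags in locations.values() for start, end in frags)
average_template_length = length / total

print(locations[b"chr1"])  # [(100, 200), (100, 250), (300, 450)]
print(total, average_template_length)  # 4 112.5
```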

fraglengths()

Return all fragment lengths as a single array.

Returns:

Concatenated array of end - start for every stored fragment across chromosomes.

Return type:

numpy.ndarray

get_chr_names()

Return the set of chromosome names stored in the track.

Returns:

Chromosome names (bytes) that currently have fragments.

Return type:

set

get_locations_by_chr(chromosome)

Return the fragment array for a chromosome.

Parameters:

chromosome (bytes) – Chromosome name, provided as bytes.

Returns:

Structured array with ('l', 'i4') and ('r', 'i4') fields.

Return type:

numpy.ndarray

Raises:

Exception – If the chromosome is not present in the track.

get_rlengths()

Return the reference chromosome lengths associated with the track.

Returns:

Mapping from chromosome name (bytes) to reference length. Chromosomes without a recorded length default to INT_MAX.

Return type:

dict

is_destroyed: _fake_callable
is_sorted = None
length = None
locations = None
pileup_a_chromosome(chrom, scale_factor=1.0, baseline_value=0.0)

Compute a coverage pileup for a single chromosome.

Parameters:
  • chrom (bytes) – Chromosome name to pile up.

  • scale_factor (float, optional) – Value used to scale the resulting coverage.

  • baseline_value (float, optional) – Minimum value enforced on the coverage array.

Returns:

Two-element list [positions, values] with numpy arrays describing the pileup breakpoints and scaled coverage.

Return type:

list
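
To make the [positions, values] return shape concrete, here is a minimal sweep-line sketch. It assumes that values[i] describes the coverage on the half-open interval ending at positions[i] and starting at the previous breakpoint (or 0); that convention and the helper name toy_pileup are assumptions of this example, not taken from the MACS3 source.

```python
def toy_pileup(fragments, scale_factor=1.0, baseline_value=0.0):
    """Return [positions, values] for a list of (start, end) fragments."""
    # +1 where a fragment starts, -1 where it ends.
    events = {}
    for start, end in fragments:
        events[start] = events.get(start, 0) + 1
        events[end] = events.get(end, 0) - 1

    positions, values = [], []
    depth = 0
    previous = 0
    for pos in sorted(events):
        if pos > previous:
            # Close the interval [previous, pos) at the current depth.
            positions.append(pos)
            values.append(max(depth * scale_factor, baseline_value))
            previous = pos
        depth += events[pos]
    return [positions, values]


result = toy_pileup([(10, 30), (20, 40)])
print(result)  # [[10, 20, 30, 40], [0.0, 1.0, 2.0, 1.0]]
```

With baseline_value=1.5 the same input yields the values [1.5, 1.5, 2.0, 1.5], since every value below the baseline is raised to it.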

pileup_a_chromosome_c(chrom, ds, scale_factor_s, baseline_value=0.0)

Project paired-end fragments into pseudo single-end pileups.

Parameters:
  • chrom (bytes) – Chromosome name to pile up.

  • ds (list[int]) – Fragment lengths used to build the projections.

  • scale_factor_s (list[float]) – Scale factors paired with each entry in ds.

  • baseline_value (float, optional) – Minimum value enforced on the coverage array.

Returns:

Two-element list [positions, values] representing the merged pileup with the maximum value taken across projections.

Return type:

list
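
The "maximum value taken across projections" step can be sketched as a pointwise maximum of two step functions. The example below covers only that merge, not how the projections themselves are built. It assumes both inputs use the [positions, values] layout with values[i] ending at positions[i] and that both end at the same position; the helper name max_of_pileups is hypothetical.

```python
def max_of_pileups(a, b):
    """Merge two [positions, values] step functions by pointwise maximum."""
    (pa, va), (pb, vb) = a, b
    positions, values = [], []
    i = j = 0
    while i < len(pa) and j < len(pb):
        pos = min(pa[i], pb[j])    # next breakpoint in either input
        val = max(va[i], vb[j])    # larger of the two current values
        if values and values[-1] == val:
            positions[-1] = pos    # extend the previous run
        else:
            positions.append(pos)
            values.append(val)
        if pa[i] == pos:
            i += 1
        if pb[j] == pos:
            j += 1
    return [positions, values]


a = [[10, 20, 30], [0, 2, 1]]
b = [[15, 30], [1, 3]]
merged = max_of_pileups(a, b)
print(merged)  # [[10, 15, 30], [1, 2, 3]]
```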

pileup_bdg(scale_factor=1.0, baseline_value=0.0)

Build a bedGraphTrackI with pileups for every chromosome.

Parameters:
  • scale_factor (float, optional) – Value used to scale the coverage for each chromosome.

  • baseline_value (float, optional) – Minimum value enforced on the coverage arrays.

Returns:

BedGraph track populated with per-chromosome pileup data.

Return type:

bedGraphTrackI

pileup_bdg_hmmr(mapping, baseline_value=0.0)

Generate HMMRATAC-style pileups for every chromosome.

Parameters:
  • mapping (list) – Weight mapping produced by HMMRATAC EM training describing the short, mono-, di-, and tri-nucleosomal signals.

  • baseline_value (float, optional) – Reserved parameter for API compatibility; not currently applied.

Returns:

List of dictionaries mirroring mapping where each dictionary maps chromosome names to pileup arrays returned by pileup_from_LR_hmmratac().

Return type:

list

print_to_bed(fhd=None)

Write fragments to a three-column BEDPE-style stream.

Parameters:

fhd (io.IOBase, optional) – Writable file-like object. Defaults to sys.stdout.

Notes

Each fragment is emitted as a tab-separated line of chrom, start, and end, using decoded chromosome names and the stored integer coordinates.

rlengths = None
sample_num(samplesize, seed=-1)

Down-sample fragments in place to approximately samplesize.

Parameters:
  • samplesize (int) – Target number of fragments across all chromosomes.

  • seed (int, optional) – Deterministic seed forwarded to sample_percent().

Notes

The method converts samplesize into a sampling fraction using self.total. Ensure finalize() has been called so counts are up to date.
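
The conversion from a target count to a sampling fraction is simple arithmetic, sketched below. The helper name sampling_fraction, the cap at 1.0, and the error for an empty track are choices made for this example and are not confirmed by the documentation.

```python
def sampling_fraction(samplesize, total):
    """Fraction of fragments to keep so that about samplesize remain."""
    if total <= 0:
        raise ValueError("total must be positive; finalize the track first")
    # Capping at 1.0 is a choice made in this sketch.
    return min(1.0, samplesize / total)


print(sampling_fraction(250, 1000))   # 0.25
print(sampling_fraction(2000, 1000))  # 1.0
```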

sample_num_copy(samplesize, seed=-1)

Return a down-sampled copy with approximately samplesize fragments.

Parameters:
  • samplesize (int) – Target number of fragments across all chromosomes.

  • seed (int, optional) – Deterministic seed forwarded to sample_percent_copy().

Returns:

New track containing the sampled fragments.

Return type:

PETrackI

sample_percent(percent, seed=-1)

Down-sample fragments in place by a fixed percentage.

Parameters:
  • percent (float) – Fraction of fragments to keep per chromosome between 0 and 1 (inclusive).

  • seed (int, optional) – Deterministic seed for the RNG; a negative value uses NumPy’s global state.

Notes

Sampling is performed independently for each chromosome by shuffling the fragments, resizing the arrays, and restoring coordinate order.
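
The shuffle, truncate, and re-sort sequence can be illustrated with the standard library. This sketch uses Python's random module in place of NumPy's RNG and lists in place of arrays; the function name toy_sample_percent and the rounding rule are assumptions.

```python
import random


def toy_sample_percent(locations, percent, seed=-1):
    """Down-sample each chromosome's fragment list in place."""
    # A negative seed falls back to the module-level (global) RNG state.
    rng = random.Random(seed) if seed >= 0 else random
    for frags in locations.values():
        rng.shuffle(frags)                       # randomize the order
        keep = round(len(frags) * percent)       # rounding rule is assumed
        del frags[keep:]                         # shrink to the kept size
        frags.sort()                             # restore coordinate order


locations = {
    b"chr1": [(i * 10, i * 10 + 150) for i in range(10)],
    b"chr2": [(i * 5, i * 5 + 100) for i in range(4)],
}
toy_sample_percent(locations, 0.5, seed=42)
print(len(locations[b"chr1"]), len(locations[b"chr2"]))  # 5 2
```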

sample_percent_copy(percent, seed=-1)

Return a down-sampled copy of the track.

Parameters:
  • percent (float) – Fraction of fragments to retain per chromosome between 0 and 1.

  • seed (int, optional) – Deterministic seed used when shuffling; a negative value disables seeding.

Returns:

New track containing the sampled fragments with metadata copied over.

Return type:

PETrackI

set_rlengths(rlengths)

Attach reference chromosome lengths to the track.

Parameters:

rlengths (dict) – Mapping from chromosome name (bytes) to reference length.

Returns:

True when the length mapping has been updated.

Return type:

bool

Notes

Any chromosome stored in the track but missing from rlengths is assigned INT_MAX so downstream bounds checks can succeed.
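
The defaulting rule in this note amounts to filling in missing keys, as in the sketch below. The numeric value used for INT_MAX (the largest 32-bit signed integer) and the helper name fill_rlengths are assumptions of this example.

```python
INT_MAX = 2**31 - 1  # assumed value: largest 32-bit signed integer


def fill_rlengths(chrom_names, rlengths):
    """Return a copy of rlengths with a default for every stored chromosome."""
    filled = dict(rlengths)
    for chrom in chrom_names:
        filled.setdefault(chrom, INT_MAX)
    return filled


filled = fill_rlengths({b"chr1", b"chr2"}, {b"chr1": 1000})
print(filled[b"chr1"], filled[b"chr2"] == INT_MAX)  # 1000 True
```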

size = None
sort()

Sort fragments for each chromosome by genomic coordinate.

Fragments are ordered first by their left coordinate and then by their right coordinate. The is_sorted flag is set to True when sorting completes.

total = None
class MACS3.Signal.PairedEndTrack.PETrackII(anno='', buffer_size=100000)

Bases: object

Paired-end track for single-cell ATAC fragments with barcode metadata.

Each chromosome stores a structured array of fragment coordinates and counts alongside an integer-encoded barcode array to support barcode-aware analyses.

add_loc(chromosome, start, end, barcode, count)

Append a fragment together with its barcode and count.

Parameters:
  • chromosome (bytes) – Chromosome name (as bytes) for the fragment.

  • start (int) – Zero-based start coordinate of the fragment.

  • end (int) – Zero-based end coordinate of the fragment.

  • barcode (bytes) – Raw barcode sequence associated with the fragment.

  • count (int) – Number of occurrences represented by the fragment.

Notes

Barcodes are interned into integers via barcode_dict for compact storage and the accumulated template length is weighted by count.
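
Interning can be pictured as assigning each new barcode the next free integer. The sketch below handles a single chromosome, uses lists instead of numpy arrays, and weights both counters by count; the class name ToyBarcodeTrack and these details are assumptions, not the MACS3 code.

```python
class ToyBarcodeTrack:
    """Hypothetical single-chromosome stand-in; not the MACS3 class."""

    def __init__(self):
        self.barcode_dict = {}  # barcode bytes -> integer id
        self.fragments = []     # (start, end, count) tuples
        self.barcodes = []      # integer ids, parallel to fragments
        self.total = 0          # sum of counts
        self.length = 0         # count-weighted sum of template lengths

    def add_loc(self, start, end, barcode, count):
        # A new barcode gets the next unused integer id.
        barcode_id = self.barcode_dict.setdefault(barcode, len(self.barcode_dict))
        self.fragments.append((start, end, count))
        self.barcodes.append(barcode_id)
        self.total += count
        self.length += (end - start) * count


track = ToyBarcodeTrack()
track.add_loc(100, 250, b"AAAC", 2)
track.add_loc(120, 270, b"GGTT", 1)
track.add_loc(500, 700, b"AAAC", 3)
print(track.barcodes)             # [0, 1, 0]
print(track.total, track.length)  # 6 1050
```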

annotation = None
average_template_length = None
barcode_dict = None
barcode_last_n: typedef
barcodes = None
buf_size = None
buffer_size = None
count_fraglengths()

Count fragment lengths weighted by per-fragment counts.

Returns:

Mapping from fragment length to the total count contributed by fragments of that length.

Return type:

dict
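
The count weighting can be shown with a few made-up fragments. Each tuple below is (start, end, count), and the code is a standard-library illustration that does not call MACS3.

```python
from collections import Counter

# Hypothetical fragments as (start, end, count).
fragments = [(100, 250, 2), (120, 270, 1), (500, 700, 3)]

length_counts = Counter()
for start, end, count in fragments:
    # Each fragment contributes its count, not just 1.
    length_counts[end - start] += count

print(dict(length_counts))  # {150: 3, 200: 3}
```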

destroy()

Release fragment and barcode arrays held by the track.

All per-chromosome arrays are resized to zero, barcode mappings are cleared, and the track is marked as destroyed.

exclude(regions)

Remove fragments that overlap the provided exclusion regions.

Parameters:

regions (MACS3.Signal.Region.Regions) – Sorted region collection whose intervals should be excluded.

Notes

The operation mutates the track in place, adjusts fragment counts and lengths, and finishes by calling finalize().

finalize()

Shrink arrays, sort fragments, and refresh aggregate counters.

Each per-chromosome fragment array is resized to its observed length, sorted by ('l', 'r'), and the accompanying barcode array is reordered to match. The method updates total and average_template_length using count weights and marks the track as sorted.

Raises:

AssertionError – If no fragments are present when finalizing.

fraglengths()

Return all fragment lengths expanded by their counts.

Returns:

Array of end - start values repeated according to the stored counts.

Return type:

numpy.ndarray

get_chr_names()

Return the set of chromosome names stored in the track.

Returns:

Chromosome names (bytes) that currently have fragments.

Return type:

set

get_locations_by_chr(chromosome)

Return the fragment array for a chromosome.

Parameters:

chromosome (bytes) – Chromosome name, provided as bytes.

Returns:

Structured array with ('l', 'i4'), ('r', 'i4'), and ('c', 'u2') fields.

Return type:

numpy.ndarray

Raises:

Exception – If the chromosome is not present in the track.

get_rlengths()

Return the reference chromosome lengths associated with the track.

Returns:

Mapping from chromosome name (bytes) to reference length. Chromosomes without a recorded length default to INT_MAX.

Return type:

dict

is_destroyed: _fake_callable
is_sorted = None
length = None
locations = None
pileup_a_chromosome(chrom, scale_factor=1.0, baseline_value=0.0)

Compute a coverage pileup for a single chromosome.

Parameters:
  • chrom (bytes) – Chromosome name to pile up.

  • scale_factor (float, optional) – Value used to scale the resulting coverage.

  • baseline_value (float, optional) – Minimum value enforced on the coverage array.

Returns:

Two-element list [positions, values] with numpy arrays describing the pileup breakpoints and scaled coverage.

Return type:

list

pileup_a_chromosome_c(chrom, ds, scale_factor_s, baseline_value=0.0)

Project paired-end fragments into pseudo single-end pileups.

Parameters:
  • chrom (bytes) – Chromosome name to pile up.

  • ds (list[int]) – Fragment lengths used to build the projections.

  • scale_factor_s (list[float]) – Scale factors paired with each entry in ds.

  • baseline_value (float, optional) – Minimum value enforced on the coverage array.

Returns:

Two-element list [positions, values] representing the merged pileup with the maximum value taken across projections.

Return type:

list

pileup_bdg(scale_factor=1.0, baseline_value=0.0)

Build a bedGraphTrackI with pileups for every chromosome.

Parameters:
  • scale_factor (float, optional) – Value used to scale the coverage for each chromosome.

  • baseline_value (float, optional) – Minimum value enforced on the coverage arrays.

Returns:

BedGraph track populated with per-chromosome pileup data.

Return type:

bedGraphTrackI

pileup_bdg2()

Build a bedGraphTrackII with pileups for every chromosome.

Returns:

BedGraph track populated with per-chromosome pileup arrays and finalized.

Return type:

bedGraphTrackII

pileup_bdg_hmmr(mapping, baseline_value=0.0)

Generate HMMRATAC-style pileups for every chromosome.

Parameters:
  • mapping (list) – Weight mapping produced by HMMRATAC EM training describing the short, mono-, di-, and tri-nucleosomal signals.

  • baseline_value (float, optional) – Reserved parameter for API compatibility; not currently applied.

Returns:

List of dictionaries mirroring mapping where each dictionary maps chromosome names to pileup arrays returned by pileup_from_LR_hmmratac().

Return type:

list

rlengths = None
sample_num(samplesize, seed=-1)

Down-sample fragments in place so total counts approximate samplesize.

Parameters:
  • samplesize (int) – Target total count across all chromosomes.

  • seed (int, optional) – Deterministic seed forwarded to sample_percent().

Notes

The method converts samplesize into a sampling fraction using the current total count and reuses sample_percent().

sample_num_copy(samplesize, seed=-1)

Return a down-sampled copy whose total counts approximate samplesize.

Parameters:
  • samplesize (int) – Target total count across all chromosomes.

  • seed (int, optional) – Deterministic seed forwarded to sample_percent_copy().

Returns:

New track containing the sampled fragments.

Return type:

PETrackII

sample_percent(percent, seed=-1)

Down-sample fragments in place so counts reflect a given percentage.

Parameters:
  • percent (float) – Fraction of total counts to keep per chromosome between 0 and 1 (inclusive).

  • seed (int, optional) – Deterministic seed for the RNG; a negative value uses NumPy’s global state.

Notes

Fragments are sampled proportionally to their counts by expanding to an index vector, shuffling, and collapsing counts for the retained entries. Aggregate statistics are recomputed and the result is resorted.
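
The expand, shuffle, and collapse procedure can be sketched as follows. The example ignores barcodes, works on one chromosome, uses Python's random module in place of NumPy's RNG, and the function name sample_counts and rounding rule are assumptions.

```python
import random
from collections import Counter


def sample_counts(fragments, percent, seed=-1):
    """Down-sample (start, end, count) fragments in proportion to count."""
    rng = random.Random(seed) if seed >= 0 else random
    # Expand: one index entry per counted occurrence of each fragment.
    index = [
        i
        for i, (start, end, count) in enumerate(fragments)
        for _ in range(count)
    ]
    rng.shuffle(index)
    keep = round(len(index) * percent)  # rounding rule is assumed
    # Collapse: count how many occurrences of each fragment survived.
    retained = Counter(index[:keep])
    # Rebuild the fragments with their new counts and restore sorted order.
    return sorted(
        (fragments[i][0], fragments[i][1], n) for i, n in retained.items()
    )


fragments = [(100, 250, 4), (300, 420, 2), (500, 640, 4)]
sampled = sample_counts(fragments, 0.5, seed=7)
print(sum(count for _, _, count in sampled))  # 5
```

A fragment whose occurrences are all dropped disappears from the result, so the sampled list can be shorter than the input.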

sample_percent_copy(percent, seed=-1)

Return a down-sampled copy whose counts reflect a given percentage.

Parameters:
  • percent (float) – Fraction of total counts to keep per chromosome between 0 and 1 (inclusive).

  • seed (int, optional) – Deterministic seed for the RNG; a negative value uses NumPy’s global state.

Returns:

New track containing the sampled fragments with metadata preserved.

Return type:

PETrackII

Notes

Fragments are sampled proportionally to their counts and the returned track is sorted with reference lengths copied from the source track.

set_rlengths(rlengths)

Attach reference chromosome lengths to the track.

Parameters:

rlengths (dict) – Mapping from chromosome name (bytes) to reference length.

Returns:

True when the length mapping has been updated.

Return type:

bool

Notes

Any chromosome stored in the track but missing from rlengths is assigned INT_MAX so downstream bounds checks can succeed.

size = None
sort()

Sort fragments and barcodes for each chromosome.

Fragments are ordered first by their left coordinate and then by their right coordinate, and the barcode array is reordered alongside the fragment array. The is_sorted flag is set to True when sorting completes.

subset(selected_barcodes)

Build a new track containing only fragments from selected barcodes.

Parameters:

selected_barcodes (set) – Set of barcode byte strings to retain.

Returns:

New track restricted to the provided barcodes with metadata preserved.

Return type:

PETrackII
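
Barcode-based subsetting can be sketched by translating the selected barcodes to their integer ids and filtering the parallel arrays. The helper name subset_by_barcode and the list-based layout are assumptions of this example; it does not call MACS3.

```python
def subset_by_barcode(fragments, barcodes, barcode_dict, selected_barcodes):
    """Return (fragments, barcodes) restricted to the selected barcodes.

    fragments: list of (start, end, count) tuples
    barcodes: list of integer barcode ids, parallel to fragments
    barcode_dict: mapping from barcode bytes to integer id
    """
    # Barcodes that were never seen are silently ignored in this sketch.
    selected_ids = {
        barcode_dict[b] for b in selected_barcodes if b in barcode_dict
    }
    kept_fragments, kept_barcodes = [], []
    for frag, barcode_id in zip(fragments, barcodes):
        if barcode_id in selected_ids:
            kept_fragments.append(frag)
            kept_barcodes.append(barcode_id)
    return kept_fragments, kept_barcodes


barcode_dict = {b"AAAC": 0, b"GGTT": 1, b"CCGA": 2}
fragments = [(100, 250, 1), (120, 270, 2), (500, 700, 1), (800, 900, 3)]
barcodes = [0, 1, 0, 2]
sub_fragments, sub_barcodes = subset_by_barcode(
    fragments, barcodes, barcode_dict, {b"AAAC", b"CCGA"}
)
print(sub_fragments)  # [(100, 250, 1), (500, 700, 1), (800, 900, 3)]
print(sub_barcodes)   # [0, 0, 2]
```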

total = None