Peak Calling

Before running the pipeline that is used to calculate the metric, a few different files are required that can be obtained via MACS. Usually, peak calling with MACS is via the function callpeaks. However, this process does not give us the outputs of intermediate steps that MACS carries out. As such the script PeakCall.sh will use the fundamental functions of MACS to generate these intermediate files:

The coverage/pileup track
The bias track
The p-values track
The set of peaks before merge process
The set of peaks after merge process

If you want to understand the steps behind the MACS peak calling process, it is recommended that you read the official tutorial.

Running

To run this script, you will need to complete the setup first. After this, you can submit the script PeakCall.sh to a SLURM queue with:

sbatch -p queue .../PeakCall.sh path/to/config_file.txt

The configuration file to use for this script is given in example_config_call.txt. The queue to send the job to will depend on the HPC system you are using. You can view all available partitions with:

sinfo show partitions | awk '{print $1}' | uniq

The script is designed to work with or without a configuration file and will process a single set of reads in any format MACS can work with. It is expected that you run this script for each sample (or merged set of samples) so it is recommended that you have two configuration files (so jobs can run concurrently).

Why merged and unmerged peaks

In the final step of the peak calling process, MACS encourages the user to merge any peaks that are suitably close together. The recommended minimum gap between peaks is the read length of the dataset. The metric needs to be able to differentiate between peaks that are called due to this merging process because such base pair positions will have very low coverage. The metric heavily relies on comparing p-values and read counts, if a peak has very low p-values and read counts, this can be misleading if the merging process is not accounted for.

Cutoff analysis

In order to call peaks, a line in the sand must be drawn somewhere. This often feels arbitrary and is a difficult topic in statistics. However, MACS does allow for cutoff analysis where different thresholds are used and some metrics are displayed for each threshold used. These metrics are:

The number of peaks
The average length of the peaks
The total length of all peaks combined

Using these there are a couple of different approaches you can take to get a suitable value for the cutoff.

Elbow analysis
Using biological knowledge
The expected number of peaks
The expected average peak length

Currently the pipeline allows you to either pick a hard cutoff value or use the most stringent cutoff that gives you a certain average peak length. If you have your own idea here, a file is put in your chosen output directory under (sample_name)_cutoff_analysis.txt. The pipeline doesn't use elbow analysis as this is hard to automate ("How do you define a dramatic change?").

Time

The time MACS takes to peak call will vary with the number of reads that you have, however the process is usually very fast. For a dataset with 11,000,000 reads the total time taken was roughly 5 minutes (where the cpu was Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz).