Inputs

Bedmethyl file (required)

The bedmethyl file is passed to HyLoRD as a positional argument and contains the methylation calls for your bulk data. You can create a bedmethyl file from a modified-base-called .bam file (likely obtained through dorado or similar) by using modkit, a tool provided by ONT.

The expected format of this bedmethyl file is given here. Please ensure that the file is sorted, otherwise you may experience errors or significant slowdown. Provided you haven't modified the file since creating it with modkit's pileup command, it will already be sorted.

Reference matrix (optional)

If you have cell-sorted ONT data at your disposal, you can concatenate the datasets to create a reference matrix. The expected format for this file is as follows (the file itself should not contain a header):

chromosome	start	end	mark	cell_one	cell_two	...	cell_n
chr1	200	201	h	0.5	0.5	...	0
chr1	200	201	m	97.5	20.5	...	30
...	...	...	...	...	...	...	...

Note that the values in each cell type column are percentages between 0 and 100 (this reflects the format of bedmethyl files).

Please ensure that this file is sorted (chr1 before chr2, h before m, etc.), otherwise you may experience errors or significant slowdown.

This reference matrix can be generated by merging multiple bedmethyl files obtained from modkit. This could be accomplished with a combination of bedtools and awk, for example. The fields you want to keep from the bedmethyl files are:

1 - Chromosome field
- UCSC format, can also just be the chromosome number (no "chr" prefix)
2 - Start field
3 - End field
4 - Mark name field
11 - Fraction modified field (percent methylated)
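As an alternative to bedtools and awk, the merge can be sketched in Python. This is a minimal illustration rather than the method HyLoRD or SQUIRE uses: the function names are hypothetical, each file is assumed to hold one cell type, and sites missing from any file are simply dropped (an assumption, not a documented requirement). Rows are emitted in the order of the first file, so the result stays sorted as long as that file is sorted.

```python
import csv


def read_bedmethyl(path):
    """Map (chromosome, start, end, mark) -> percent modified for one file."""
    values = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()  # copes with tab- or space-delimited columns
            if len(fields) < 11:
                continue  # skip blank or truncated lines
            key = (fields[0], fields[1], fields[2], fields[3])
            values[key] = fields[10]  # field 11: fraction modified (percent)
    return values


def merge_bedmethyl(paths):
    """Return reference matrix rows for sites present in every input file."""
    tables = [read_bedmethyl(path) for path in paths]
    rows = []
    # Dicts preserve insertion order, so rows follow the first file's order.
    for key in tables[0]:
        if all(key in table for table in tables[1:]):
            rows.append(list(key) + [table[key] for table in tables])
    return rows


def write_matrix(rows, handle):
    """Write rows as a tab-separated, header-less reference matrix."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
```

Calling merge_bedmethyl with one path per cell type and passing the result to write_matrix with an open file handle would produce a file in the layout shown above. The order of the paths determines the order of the cell type columns.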

Remarks: The reference matrix and the CpG list (described below) can both be made easily with SQUIRE. This is a Python-based CLI made by the same author that speeds up the process considerably (no need to write your own scripts). If you aren't working with bedmethyl data for the reference matrix, SQUIRE will not immediately help, but the ideas present within the source code might.

CpG list (optional)

Providing a CpG list will greatly improve the computational performance of HyLoRD. Not every CpG site is useful in the deconvolution process (e.g. if all cell types have the same methylation status at a certain CpG, then this CpG provides no useful information). Good CpGs vary in methylation status between cell types, giving HyLoRD the information it needs to accurately predict cell proportions in the bulk dataset.

The expected format of this file is:

chromosome	start	end	mark
chr1	200	201	h
chr1	200	201	m
...	...	...	...

Make sure that this file is sorted, otherwise you may experience errors or significant slowdown.

Creating this file using a reference matrix

If you have a reference matrix to work from, you can perform a statistical test on each CpG site to determine which sites convey the most information. For example, you may elect to use a chi-squared test for each CpG site (providing read depth and fraction modified data for each cell type) and pick the 100,000 most significant CpG sites from this analysis.
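The chi-squared ranking could be sketched as follows. This is an illustration under stated assumptions, not HyLoRD's or SQUIRE's implementation: the function names are hypothetical, each site is supplied as (key, depths, percents) with one depth and one percentage per cell type, and the read depths have to come from the original bedmethyl files because the reference matrix layout above does not carry them. Sites are ranked by the Pearson chi-squared statistic of a 2 x k table (modified versus unmodified reads per cell type); when every site has the same number of cell types, the degrees of freedom are equal, so this ordering matches an ordering by p-value.

```python
def chi_squared_statistic(depths, percents):
    """Pearson chi-squared statistic for modified/unmodified counts by cell type."""
    modified = [d * p / 100.0 for d, p in zip(depths, percents)]
    unmodified = [d - m for d, m in zip(depths, modified)]
    total = sum(depths)
    total_modified = sum(modified)
    total_unmodified = total - total_modified
    if total == 0 or total_modified == 0 or total_unmodified == 0:
        return 0.0  # every read agrees, so the site cannot separate cell types
    statistic = 0.0
    for depth, mod, unmod in zip(depths, modified, unmodified):
        if depth == 0:
            continue  # a cell type without coverage contributes nothing
        expected_mod = depth * total_modified / total
        expected_unmod = depth * total_unmodified / total
        statistic += (mod - expected_mod) ** 2 / expected_mod
        statistic += (unmod - expected_unmod) ** 2 / expected_unmod
    return statistic


def top_sites(sites, n):
    """Return the keys of the n highest-scoring sites, kept in input order."""
    ranked = sorted(
        range(len(sites)),
        key=lambda i: chi_squared_statistic(sites[i][1], sites[i][2]),
        reverse=True,
    )
    keep = sorted(ranked[:n])  # restore input order so the list stays sorted
    return [sites[i][0] for i in keep]
```

Returning the selected sites in their input order means that a sorted reference matrix yields a sorted CpG list.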

Creating this file without a reference matrix

This file may be difficult to obtain if you don't have a reference matrix to work from. In that case it is recommended to take a few random subsets of the CpGs present in your bedmethyl file, run HyLoRD on each, and check how much the predicted proportions vary between subsets. Running HyLoRD on all CpGs in your dataset (likely ~2.5 million) will cause significant slowdown for very little gain.
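Drawing such a random subset could look like the sketch below. The function name and its parameters are hypothetical, and sampling individual rows (rather than whole sites with all of their marks) is an assumption made for brevity. The chosen rows are returned in file order, so a sorted bedmethyl file gives a sorted CpG list.

```python
import random


def sample_cpg_list(bedmethyl_path, n_sites, seed):
    """Return up to n_sites random (chromosome, start, end, mark) rows."""
    with open(bedmethyl_path) as handle:
        # Keep only the first four fields of every non-empty line.
        rows = [tuple(line.split()[:4]) for line in handle if line.strip()]
    rng = random.Random(seed)  # a fixed seed makes each subset reproducible
    count = min(n_sites, len(rows))
    chosen = sorted(rng.sample(range(len(rows)), count))  # keep file order
    return [rows[i] for i in chosen]
```

Using a different seed for each subset, and writing each result as tab-separated lines, gives several candidate CpG lists whose outputs can then be compared.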

There is also the option of creating a CpG list from other epigenomic datasets at your disposal. For example, cell-sorted WGBS or array data could be used (again with a chi-squared test, since some CpGs will differ more than others between cell types).

If you do choose to go down this route, you will likely only have collated CpGs that differ in 5-methyl-cytosine signal (and not 5-hydroxymethyl-cytosine signal). In this scenario you can gain a small speedup by providing the option --only-methylation-signal on the command line.

Note: You could conceivably use this WGBS/array based data as your reference matrix. Be warned however that there is not a 100% correlation between the technologies (both in sequencing and processing), so accuracy will likely suffer.

Cell type list (optional, recommended)

If you provide a reference matrix, it is recommended to provide this file to accompany it. This is a newline-separated file of cell type names that should correspond one-to-one with the cell type columns of the reference matrix. This approach was chosen over using a header in the reference matrix file for three reasons:

1) It is expected that the reference matrix is created by merging bedmethyl files (which don't contain header lines)

2) It allows the user to produce their own novel cell type methylation profiles without needing to name these cell types themselves (where these profiles would be added to the given reference matrix).

3) If not providing a reference matrix, the user may want HyLoRD to apply some custom names to each cell type proportion (instead of the default naming convention used).

An example of such a file would be:

Neuron
Oligodendrocyte
Microglial
Astrocyte

Outputs

Aside from warning/error messages, HyLoRD has one output, the predicted cell proportions of the bulk bedmethyl data provided. The format of this output is (headers not included):

Cell type name	Percentage make up of bulk data
cell_one	45
cell_two	55

This tab-separated list is printed to the standard output stream by default. If the user provides a file path with the -o/--outpath option on the command line, the table is written to that file instead.
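For downstream work, the output can be read back with a few lines of Python. This is a small sketch assuming the two-column, header-less, tab-separated layout described above. The function name is hypothetical.

```python
import csv


def read_proportions(path):
    """Parse the output table into a {cell type name: percentage} dict."""
    proportions = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 2:
                continue  # ignore blank or malformed lines
            proportions[row[0]] = float(row[1])
    return proportions
```

The returned dictionary keeps the cell types in the order they appear in the file.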

Naming

The cell type name (first field) in the output will be determined by the contents of the cell type list (if provided). For all trailing cell types (that are not covered by this list), a generic name will be given instead in the form: unknown_cell_type_i (where i is an integer).
