HyLoRD v0.2.1
A Hybrid Cell Type Deconvolution Algorithm
A bedmethyl file is given as a positional argument to HyLoRD; it contains the information about your bulk data. You can create a bedmethyl file from a modified-base called .bam file (likely obtained through dorado or similar) by using modkit, a tool provided by ONT.
The expected format of this bedmethyl file is given here. Please ensure that the file is sorted, otherwise you may experience errors or significant slowdown. Provided you haven't modified the file since creating it via modkit's pileup command, it will already be sorted.
If you have cell-sorted ONT data at your disposal, you can combine the datasets to create a reference matrix. The expected format for this file is as follows (there should be no header in the file):
chromosome | start | end | mark | cell_one | cell_two | ... | cell_n |
---|---|---|---|---|---|---|---|
chr1 | 200 | 201 | h | 0.5 | 0.5 | ... | 0 |
chr1 | 200 | 201 | m | 97.5 | 20.5 | ... | 30 |
... | ... | ... | ... | ... | ... | ... | ... |
Note that the values in each cell type column are percentages between 0 and 100 (this reflects the format of bedmethyl files).
Please ensure that this file is sorted (chr1 before chr2, h before m etc.), otherwise you may experience errors or significant slowdown.
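As a rough illustration of the layout above, the following Python sketch joins per-cell-type tables into reference matrix rows. It is not part of HyLoRD: the function name `build_reference_rows`, the in-memory dictionaries, the tab separator and the choice to drop sites that are missing from any cell type are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end, mark) identifies one row of the reference matrix.
Site = Tuple[str, int, int, str]


def build_reference_rows(tables: List[Dict[Site, float]]) -> List[str]:
    """Join per-cell-type tables into reference matrix rows (hypothetical helper).

    Each table maps a site to its percentage modified (0 to 100). Rows follow
    the order of the first table, so a sorted first input gives sorted output.
    """
    if not tables:
        return []
    rows = []
    for site in tables[0]:
        # Assumption: a site missing from any cell type is dropped entirely.
        if all(site in table for table in tables[1:]):
            chrom, start, end, mark = site
            values = "\t".join(f"{table[site]:g}" for table in tables)
            # Assumption: fields are tab separated, as in bedmethyl files.
            rows.append(f"{chrom}\t{start}\t{end}\t{mark}\t{values}")
    return rows


cell_one = {("chr1", 200, 201, "h"): 0.5, ("chr1", 200, 201, "m"): 97.5}
cell_two = {("chr1", 200, 201, "h"): 0.5, ("chr1", 200, 201, "m"): 20.5,
            ("chr1", 300, 301, "m"): 10.0}
for row in build_reference_rows([cell_one, cell_two]):
    print(row)
```

The site at position 300 is only present for one cell type, so it does not appear in the printed rows.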
This reference matrix can be generated by merging multiple bedmethyl files obtained from modkit. This could be accomplished with a mixture of bedtools and awk, for example. The fields you want to keep from the bedmethyl files are:
Providing a CpG list will greatly improve HyLoRD's computational performance. Not every CpG site is useful in the deconvolution process (for example, if all cell types have the same methylation status at a certain CpG, that CpG provides no useful information). Good CpGs vary in methylation status between cell types, giving HyLoRD the information it needs to accurately predict cell proportions in the bulk dataset.
The expected format of this file is:
chromosome | start | end | mark |
---|---|---|---|
chr1 | 200 | 201 | h |
chr1 | 200 | 201 | m |
... | ... | ... | ... |
Make sure that this file is sorted, otherwise you may experience errors or significant slowdown.
If you have a reference matrix to work from, you can perform a statistical test on each CpG site to determine which sites convey the most information. For example, you may elect to use a chi-squared test for each CpG site (providing read depth and fraction modified data for each cell type) and pick the 100,000 most significant CpG sites from this analysis.
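One way to sketch this ranking in Python is shown below. It computes a Pearson chi-squared statistic on a 2 x k table (modified versus unmodified reads, one column per cell type) and keeps the highest-scoring sites. The function names, the input layout and the choice to rank by the raw statistic instead of a p-value are assumptions for illustration; ranking by the statistic matches ranking by significance only when every site is covered in the same number of cell types.

```python
from typing import Dict, List, Sequence, Tuple

Site = Tuple[str, int, int, str]


def chi_squared_statistic(depths: Sequence[int], percents: Sequence[float]) -> float:
    """Pearson chi-squared statistic for modified vs. unmodified reads per cell type."""
    modified = [d * p / 100.0 for d, p in zip(depths, percents)]
    unmodified = [d - m for d, m in zip(depths, modified)]
    total = float(sum(depths))
    total_mod = sum(modified)
    total_unmod = sum(unmodified)
    # No reads, or no variation at all: treat the site as uninformative.
    if total == 0 or total_mod == 0 or total_unmod == 0:
        return 0.0
    statistic = 0.0
    for depth, mod, unmod in zip(depths, modified, unmodified):
        if depth == 0:
            continue  # a cell type without coverage contributes nothing
        expected_mod = depth * total_mod / total
        expected_unmod = depth * total_unmod / total
        statistic += (mod - expected_mod) ** 2 / expected_mod
        statistic += (unmod - expected_unmod) ** 2 / expected_unmod
    return statistic


def top_sites(sites: Dict[Site, Tuple[Sequence[int], Sequence[float]]], n: int) -> List[Site]:
    """Return the n sites with the largest statistic (hypothetical helper)."""
    ranked = sorted(sites, key=lambda s: chi_squared_statistic(*sites[s]), reverse=True)
    return ranked[:n]


sites = {
    ("chr1", 200, 201, "m"): ([20, 20], [100.0, 0.0]),  # differs between cell types
    ("chr1", 300, 301, "m"): ([20, 20], [50.0, 50.0]),  # identical, uninformative
}
print(top_sites(sites, 1))
```

If p-values are needed, a statistics library such as SciPy (`scipy.stats.chi2_contingency`) could replace the hand-written statistic.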
This file may be difficult to obtain if you don't have a reference matrix to work from. In that case it is recommended to take a few random subsets of the CpGs present in your bedmethyl file and check how much the predicted proportions vary between subsets. Running HyLoRD on all CpGs in your dataset (likely ~2.5 million) will cause significant slowdown for very little gain.
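The subset approach could be sketched as follows. The helpers `random_subsets` and `proportion_spread` are hypothetical and do not call HyLoRD themselves: the first draws subsets of CpG list lines while preserving their sorted order, and the second summarises how far the predicted percentages drift between runs that you have carried out separately.

```python
import random
from typing import Dict, List


def random_subsets(lines: List[str], size: int, count: int, seed: int = 0) -> List[List[str]]:
    """Draw `count` random subsets of `size` lines, each kept in input order."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(count):
        # Sample indices, then sort them so the subset stays sorted like the input.
        picked = sorted(rng.sample(range(len(lines)), size))
        subsets.append([lines[i] for i in picked])
    return subsets


def proportion_spread(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Largest minus smallest predicted percentage per cell type across runs."""
    return {
        name: max(run[name] for run in runs) - min(run[name] for run in runs)
        for name in runs[0]
    }


cpg_lines = [f"chr1\t{pos}\t{pos + 1}\tm" for pos in range(200, 210)]
for subset in random_subsets(cpg_lines, size=4, count=3):
    print(len(subset), "sites, first:", subset[0].split("\t")[1])

# Made-up percentages standing in for the results of two separate runs.
print(proportion_spread([{"cell_one": 45.0, "cell_two": 55.0},
                         {"cell_one": 50.0, "cell_two": 50.0}]))
```

A small spread across subsets suggests the subset size is large enough; a large spread suggests more CpGs, or better-chosen ones, are needed.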
There is also the option of creating a CpG list from other epigenomic datasets at your disposal. For example, cell-sorted WGBS or array data could be used (again with a chi-squared test, since some CpGs will differ more than others between cell types).
If you do choose to go down this route, you will likely only have collated CpGs that differ in 5-methyl-cytosine signal (and not 5-hydroxymethyl-cytosine signal). In this scenario you can gain a small speedup by providing the option --only-methylation-signal on the command line.
If you provide a reference matrix, it is recommended to provide a cell type list to accompany it. This is a newline-separated file of cell type names that should correspond one to one with the cell type columns of the reference matrix. This approach was chosen over using a header in the reference matrix file for three reasons:
1) It is expected that the reference matrix is created by merging bedmethyl files (which don't contain header lines).
2) It allows HyLoRD to produce novel cell type methylation profiles (which are added to the given reference matrix) without the user needing to name these cell types.
3) If not providing a reference matrix, the user may want HyLoRD to apply some custom names to each cell type proportion (instead of the default naming convention used).
An example of such a file would be:
Aside from warning/error messages, HyLoRD has one output: the predicted cell proportions of the bulk bedmethyl data provided. The format of this output is (headers not included):
Cell type name | Percentage makeup of bulk data |
---|---|
cell_one | 45 |
cell_two | 55 |
This tab-separated list will be printed to the standard output stream by default. If the user provides a file path with the -o/--outpath option on the command line, this table will be written to the provided file path.
The cell type name (first field) in the output will be determined by the contents of the cell type list (if provided). For all trailing cell types (that are not covered by this list), a generic name will be given instead in the form: unknown_cell_type_i (where i is an integer).
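For downstream use, the output described above can be read with a few lines of Python. This is a minimal sketch that assumes exactly two tab-separated fields per line and no header; `parse_proportions` is a hypothetical helper, not something shipped with HyLoRD.

```python
from typing import Dict


def parse_proportions(text: str) -> Dict[str, float]:
    """Parse 'name<TAB>percentage' lines into a dictionary."""
    proportions = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, value = line.split("\t")
        proportions[name] = float(value)
    return proportions


example = "cell_one\t45\ncell_two\t55\n"
result = parse_proportions(example)
unknown = [name for name in result if name.startswith("unknown_cell_type_")]
print(result)
print("unnamed cell types:", unknown)
```

Filtering on the `unknown_cell_type_` prefix is a simple way to separate cell types named by the cell type list from those given generic names.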