High level explanation of main pipeline

There are two different algorithms that can be used by HyLoRD. One is the reference-based approach, where all cell types are accounted for in reference matrix and --additional-cell-types is set to zero. The other is the hybrid/reference-free approach where the reference matrix is not completed/given respectively. Considering the current lack of cell sorted ONT long read sequencing data, the second approach is more likely to be useful.

Reference-based approach

In this scenario, the deconvolution algorithm consists of a single step where the following question is converted into a quadratic programming problem.

‍What cell type proportions when combined with the reference matrix best describe the bulk methylation profile (present in the bedmethyl file from ONT data)?

For information on how this problem is converted to a quadratic programming problem, visit this page.

After converting the input data into the desired form, HyLoRD offloads the solving of said quadratic programming problem to a library called qpmad. This process will then find the optimal cell type proportions that describe the input bulk bedmethyl data.

Hybrid/reference-free approach

This approach is very similar to the reference-based approach, except that there is now a couple of extra steps.

First, the reference matrix is either incomplete or missing, so we need to generate methylation (and hydroxymethylation) profiles for novel cell types. Once this step is completed, HyLoRD completes the following two steps in a loop:

1) Converts the data into the form of a quadratic programming problem and solves the QPP using qpmad

2) Updates the reference matrix using the acquired cell proportions from step 1.

These two steps repeat until at least one of the following criteria are met:

1) The maximum number of iterations is exceeded (can be user specified with --max-iterations)

2) The L2 norm of the objective function (bulk data - (reference matrix * predicted cell proportions)) is below the convergence threshold (can be user specified with --convergence-threshold).

How novel cell methylation profiles are created

This could be completed with a trivial approach like setting each CpG's methylation status to a random value (for each additional cell type required), but HyLoRD takes a sligthly more sophisticated approach. Using previously obtained bedmethyl files from modkit an approximate cumulative distribution function (CDF) was created (these CDFs used can be found in random.hpp). The decision to use an approximate CDF was because we just want a methylation profile that is plausible, not a completely accurate depiction of some cell type (as that is completely out of the question).

The CDFs for each epigenomic signal were generated in the following two steps:

Step 1 - Extract methylation

Some bedmethyl files (obtained from modkit) were combined, raising the average read depth. Then the fraction modified (field 11) was extracted where the modified base code was 'm' (methylated) and the valid coverage (score) was greater than 10. This was done with the following awk command:

awk '$4=="m" && $5>=10 {print $11}' .../bedmethyl.bed > methylations.txt

This process was repeated for the hydroxymethylation signal as well (fourth field set to 'h'). The minimum read depth of 10 (although arbitrary) was chosen with two considerations:

1) This is count based data, a low read depth gives very little statistical power. For example, a read depth of 1 only allows the fraction modified to be 0 or 1 (which can't really be trusted as a ground truth).

2) Picking a read depth that is too high would result in too many CpG sites being thrown out, which would result in an inaccurate representation of methylation proportions across the epigenome.

Step 2 - Obtain CDF

To obtain the CDF, the following python code was utilised:

import numpy as np
import pandas as pd
 
methylations = pd.read_csv("methylations.txt")
 
# values are discretised to nearest 10 to get a smaller in size cdf
methylations = np.round(methylations, -1) 
values, counts = np.unique(methylations, return_counts=True)
probabilities = counts / counts.sum()
 
# cdf maps to 0,0.1,0.2,...,1.0
cdf = np.cumsum(probabilities)

How the reference matrix is updated

The original reference matrix that is generated is highly likely to be poor quality. The methylation proportions present for the novel cell types make sense on a macro level, but not for individual CpG sites. For example, some CpG sites are methylated in the vast majority of cell types (or none at all), however the generated reference matrix will likely not reflect this quality.

Upon obtaining estimates for the proportions of each cell type, the reference matrix can be updated. To update the reference matrix, we turn to the relationship between the bulk methylation profile, the reference matrix and the cellular proportions estimated from the quadratic programming problem:

\begin{equation*} \mathbf{B} \simeq \mathit{R}\mathit{p} \end{equation*}

Where:

$\mathbf{B}$ is the bulk methylation profile
$\mathbf{p}$ is the estimated cell proportions
$\mathit{R}$ is the reference matrix

To update our reference matrix, we need to apply some matrix manipulation. First, we need to split our reference matrix and cell proportions in two. Some of the reference matrix is set in stone (methylation profiles provided by user) and we don't want to overwrite these.

\begin{equation} \mathbf{B} \simeq \mathit{R}_k\mathbf{p}_k + \mathit{R}_l\mathbf{p}_l \label{eq:split-formula} \end{equation}

Equation $\eqref{eq:split-formula}$ showcases this splitting process. The $k$ left-most columns in the reference matrix (and the first $k$ cell proportions) are the $k$ cell type profiles that are given by the user (where $k\geq 0$). The $l$ remaining columns of the reference matrix (and last $l$ cell proportions) are what's left to change.

We can now update the $l$ right most columns of the reference matrix by rearranging equation $\eqref{eq:split-formula}$ and multiplying by the psuedo-inverse of $\mathbf{p}_l$. The result of these operations is given in equation $\eqref{eq:new-reference-matrix-cols}$:

\begin{equation} \mathit{R}_l \simeq (\mathbf{B} - \mathit{R}_k\mathbf{p}_k) \bigg(\frac{\mathbf{p}_l^T}{\mathbf{p}_l^T\mathbf{p}}_l\bigg) \label{eq:new-reference-matrix-cols} \end{equation}

From here, the reference matrix can be updated by replacing the right $l$ columns with the result of the formula given in $\eqref{eq:update-reference-matrix}$.

\begin{equation} (\mathbf{B} - \mathit{R}_k\mathbf{p}_k) \bigg(\frac{\mathbf{p}_l^T}{\mathbf{p}_l^T\mathbf{p}}_l\bigg) \label{eq:update-reference-matrix} \end{equation}

Table of Contents