HyLoRD v0.2.1
A Hybrid Cell Type Deconvolution Algorithm
HyLoRD's purpose is to take some bulk methylation profile and deconvolve this by estimating the proportions of known/unknown cell types that make up the bulk profile. For the purpose of this page, we will assume that the reference matrix (the amalgamation of methylation profiles for individual cell types) has been formed already. Mathematically, we have the following relationship between the reference matrix and the bulk profile:
\begin{equation} B_i = \sum^N_{k=1} p_kR_{i,k} \label{eq:startingPoint} \end{equation}
Where:
- \(B_i\) is the methylation value of the bulk profile at CpG site \(i\),
- \(R_{i,k}\) is the methylation value of cell type \(k\) at CpG site \(i\),
- \(p_k\) is the proportion of cell type \(k\) in the bulk profile, and
- \(N\) is the number of cell types.
Equation \(\eqref{eq:startingPoint}\) is equivalent to saying that the bulk profile is just a weighted average of the methylation profiles of each cell type present in the data.
Considering \(R_{i,k}\) and \(B_i\) are both known, all that remains is to find \(p_k \: \forall k\in[N]\).
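As a sanity check, the relationship in \(\eqref{eq:startingPoint}\) can be verified numerically. The following Python sketch (with made-up reference values and proportions, purely for illustration) builds a toy bulk profile as a weighted average of cell type profiles:

```python
import numpy as np

# Toy reference matrix R: M = 4 CpG sites (rows) by N = 3 cell types (columns).
# Entries are methylation proportions in [0, 1]; the values are made up.
R = np.array([
    [0.10, 0.80, 0.50],
    [0.90, 0.20, 0.40],
    [0.30, 0.60, 0.70],
    [0.05, 0.95, 0.25],
])

# Hypothetical cell type proportions; they sum to 1.
p = np.array([0.5, 0.3, 0.2])

# B_i = sum_k p_k * R_{i,k}: the bulk profile is a weighted average of the
# per-cell-type methylation profiles, weighted by the proportions.
B = R @ p
print(B)
```

In practice \(\mathbf{B}\) and \(\mathit{R}\) are the known quantities and \(\mathbf{p}\) is unknown; this snippet simply runs the relationship forwards to show what the model assumes.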
To get good values for \(p_k\), there are several approaches one could take. For example, you could use some machine learning algorithm. For HyLoRD (and many other cell type deconvolution algorithms), we elect to go for a more direct approach by converting equation \(\eqref{eq:startingPoint}\) into the form of a quadratic programming problem. The reason behind this choice is that many efficient solvers have been created for quadratic programming problems (QPPs) over the years, and it allows us to be direct in estimating each \(p_k\). This is opposed to a machine learning approach, where one trains some model, tests it, then applies it to each possible dataset. In general, if a direct approach can be taken, it should be favoured over a machine learning approach (think linear regression).
Specifically, this page will outline the proof that we can express \(\eqref{eq:startingPoint}\) in the following form:
\begin{equation} \text{minimize} \: \frac{1}{2}\mathbf{x}^T\mathit{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x},\\ \text{subject to} \: \mathit{A}\mathbf{x} \leq \mathbf{b} \label{eq:qpp-form} \end{equation}
For QPP \(\eqref{eq:qpp-form}\), the matrices \(\mathit{Q},\mathit{A}\) and the vectors \(\mathbf{b},\mathbf{c}\) are known and \(\mathbf{x}\) is the (solution) vector being solved for.
First, we will rewrite equation \(\eqref{eq:startingPoint}\) in matrix form:
\begin{equation*} \mathbf{B} = \sum_{k=1}^Np_k\mathbf{R_k} \end{equation*}
Where:
- \(\mathbf{B} = (B_1, \dots, B_M)^T\) is the bulk profile vector,
- \(\mathbf{R_k} = (R_{1,k}, \dots, R_{M,k})^T\) is the methylation profile vector for cell type \(k\), and
- \(M\) is the number of CpG sites.
We can then write this as:
\begin{equation} \mathbf{B} = \mathit{R}\mathbf{p} \label{eq:matrix-form} \end{equation}
Here:
- \(\mathit{R}\) is the \(M \times N\) reference matrix whose \(k\)-th column is \(\mathbf{R_k}\), and
- \(\mathbf{p} = (p_1, \dots, p_N)^T\) is the vector of cell type proportions.
Below is a proof that \(\mathbf{p}^T\mathit{R}^T = \sum_{k=1}^Np_k\mathbf{R_k}^T\) (and thus, taking transposes of both sides, \(\mathit{R}\mathbf{p} = \sum_{k=1}^Np_k\mathbf{R_k}\); the transposed form is used to make the proof take up less space):
\begin{aligned} \mathbf{p}^T\mathit{R}^T &= (p_1, \dots, p_N) \cdot \begin{pmatrix} r_{1,1} & r_{2,1} & \cdots & r_{M,1} \\ r_{1,2} & r_{2,2} & \cdots & r_{M,2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1,N} & r_{2,N} & \cdots & r_{M,N} \end{pmatrix} \\ &=(p_1r_{1,1} + \cdots + p_Nr_{1,N}, \dots, p_1r_{M,1} + \cdots + p_Nr_{M,N}) \\ &=p_1(r_{1,1}, \dots, r_{M,1}) + (p_2r_{1,2} + \cdots + p_Nr_{1,N}, \dots, p_2r_{M,2} + \cdots + p_Nr_{M,N}) \\ &\cdots \\ &=p_1(r_{1,1}, \dots, r_{M,1}) + \cdots + p_N(r_{1,N}, \dots , r_{M,N}) \\ &=p_1\mathbf{R}^T_1 + \cdots + p_N\mathbf{R}^T_N \\ &=\sum^N_{k=1}p_k\mathbf{R}^T_k \end{aligned}
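The equivalence proven above is also easy to confirm numerically. This illustrative Python snippet (random data, nothing HyLoRD-specific) checks that the matrix product agrees with the explicit weighted sum of columns:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 3
R = rng.random((M, N))   # reference matrix; column k is R_k
p = rng.random(N)        # arbitrary proportions (need not sum to 1 here)

# Left-hand side: the matrix product R p.
lhs = R @ p
# Right-hand side: the explicit weighted sum of the columns of R.
rhs = sum(p[k] * R[:, k] for k in range(N))

print(np.allclose(lhs, rhs))
```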
Currently, equation \(\eqref{eq:matrix-form}\) is not in the form of a quadratic programming problem. To create a QPP from this equation, we need to generate some objective function (a value to minimise). In our case, this is actually very simple. We want to find some vector of cell proportions \(\mathbf{p}\) such that multiplying it with the reference matrix retrieves a vector as close as possible to the bulk profile.
Initially, \(\eqref{eq:bad-start-objective-function}\) may seem like an obvious objective function to start with. However, minimisation could result in a large negative number rather than a value close to 0, which isn't quite what we want. We could improve upon this by taking the modulus of the same objective function (so the best solution would make the objective function evaluate to 0). However, we won't be using this, as you quickly find yourself stuck, unable to go any further in the derivation (try it yourself). The 'correct' choice is actually to frame the problem as a least squares problem, as in \(\eqref{eq:least-squares}\) (where \(||\cdot||_2\) denotes the L2 norm).
\begin{equation} \text{minimize} \: \mathit{R}\mathbf{p} - \mathbf{B} \label{eq:bad-start-objective-function} \end{equation}
\begin{equation} \text{minimize} \: ||\mathit{R}\mathbf{p} - \mathbf{B}||^2_2 \label{eq:least-squares} \end{equation}
Note that we still need to add some constraints and apply a few matrix operations for this to completely match the desired form given by \(\eqref{eq:qpp-form}\).
First we need to expand the L2 norm used in \(\eqref{eq:least-squares}\).
\begin{aligned} ||\mathit{R}\mathbf{p} - \mathbf{B}||^2_2 &=(\mathit{R}\mathbf{p} - \mathbf{B})^T(\mathit{R}\mathbf{p} - \mathbf{B}) \\ &=(\mathbf{p}^T\mathit{R}^T - \mathbf{B}^T)(\mathit{R}\mathbf{p} - \mathbf{B}) \\ &=\mathbf{p}^T\mathit{R}^T\mathit{R}\mathbf{p} -\mathbf{B}^T\mathit{R}\mathbf{p} - \mathbf{p}^T\mathit{R}^T\mathbf{B} + \mathbf{B}^T\mathbf{B} \\ &=\mathbf{p}^T\mathit{R}^T\mathit{R}\mathbf{p} -\mathbf{B}^T\mathit{R}\mathbf{p} - (\mathbf{p}^T\mathit{R}^T\mathbf{B})^T + \mathbf{B}^T\mathbf{B} \\ &=\mathbf{p}^T\mathit{R}^T\mathit{R}\mathbf{p} -2\mathbf{B}^T\mathit{R}\mathbf{p} + \mathbf{B}^T\mathbf{B} \end{aligned}
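The expansion of the squared L2 norm can be spot-checked numerically. The following illustrative Python snippet (random data) confirms that \(||\mathit{R}\mathbf{p} - \mathbf{B}||^2_2\) equals the final quadratic expression:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 8, 4
R = rng.random((M, N))
B = rng.random(M)
p = rng.random(N)

# Squared L2 norm of the residual.
lhs = np.linalg.norm(R @ p - B) ** 2
# Expanded quadratic form: p^T R^T R p - 2 B^T R p + B^T B.
rhs = p @ R.T @ R @ p - 2 * (B @ R @ p) + B @ B

print(np.isclose(lhs, rhs))
```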
Next we can define a few variables:
- \(\mathit{Q} = \mathit{R}^T\mathit{R}\), and
- \(\mathbf{c} = -\mathit{R}^T\mathbf{B}\) (so that \(\mathbf{c}^T\mathbf{p} = -\mathbf{B}^T\mathit{R}\mathbf{p}\)).
Substituting these into our objective function \(\eqref{eq:least-squares}\) we get:
\begin{equation} \text{minimize} \: \mathbf{p}^T\mathit{Q}\mathbf{p} + 2\mathbf{c}^T\mathbf{p} + \mathbf{B}^T\mathbf{B} \label{eq:substituted-Q-c} \end{equation}
This is close to the form required by \(\eqref{eq:qpp-form}\). From here, we multiply by \(\frac{1}{2}\) (which does not change the minimiser) to get:
\begin{equation*} \text{minimize} \: \frac{1}{2}\mathbf{p}^T\mathit{Q}\mathbf{p} + \mathbf{c}^T\mathbf{p} + \frac{1}{2}\mathbf{B}^T\mathbf{B} \end{equation*}
And then notice that our objective function has a constant in it (namely \(\frac{1}{2}\mathbf{B}^T\mathbf{B}\)). Our choice of \(\mathbf{p}\) has no effect on this term of the objective function. Considering we don't actually care what the value of the objective function is, just that it is minimised, we can just remove this term entirely. This then retrieves the expected form of the QPP (as seen in equation \(\eqref{eq:objective-function}\)).
\begin{equation} \text{minimize} \: \frac{1}{2}\mathbf{p}^T\mathit{Q}\mathbf{p} + \mathbf{c}^T\mathbf{p} \label{eq:objective-function} \end{equation}
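To make the substitution concrete, this illustrative Python sketch builds \(\mathit{Q}\) and \(\mathbf{c}\) from random data and checks that the QPP objective differs from \(\frac{1}{2}||\mathit{R}\mathbf{p} - \mathbf{B}||^2_2\) only by the constant \(\frac{1}{2}\mathbf{B}^T\mathbf{B}\):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 10, 3
R = rng.random((M, N))
B = rng.random(M)
p = rng.random(N)

Q = R.T @ R       # quadratic coefficient matrix
c = -R.T @ B      # linear coefficient vector (note the minus sign)

# The QPP objective plus the dropped constant equals half the
# squared residual norm, so both have the same minimiser.
qpp = 0.5 * p @ Q @ p + c @ p
lsq = 0.5 * np.linalg.norm(R @ p - B) ** 2
print(np.isclose(qpp + 0.5 * B @ B, lsq))
```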
We could apply trivial constraints by declaring \(\mathit{A}\) as a zeros matrix and be done with it. However, there are a few constraints that should be applied to our QPP. First, there is nothing constraining the minimum and maximum values for each cell proportion in the solution. In the context of our deconvolution algorithm, this can lead to nonsensical solutions where some cell type makes up a negative proportion of the bulk profile. Further, all of the proportions should add up to 1.
HyLoRD uses qpmad to solve the QPP, and this software allows for constraints on both sides (a less-than constraint and a greater-than constraint). As a result, we can ensure that the proportions indeed add up to 1 (bounded above and below by 1 is equivalent to being equal to 1) and that each proportion is bounded above by 1 and bounded below by 0.
After adding constraints to QPP \(\eqref{eq:objective-function}\) we get:
\begin{equation} \text{minimize} \: \frac{1}{2}\mathbf{p}^T\mathit{Q}\mathbf{p} + \mathbf{c}^T\mathbf{p} \\ \text{subject to} \: 1 \leq \mathit{A}\mathbf{p} \leq 1,\: \mathbf{0} \leq \mathbf{p} \leq \mathbf{1} \label{eq:retrieved-qpp} \end{equation}
Where:
- \(\mathit{A}\) is the \(1 \times N\) matrix (row vector) of ones, so that \(\mathit{A}\mathbf{p} = \sum_{k=1}^N p_k\).
This is slightly different from the expected QPP form given in \(\eqref{eq:qpp-form}\), but we could combine the two constraints into the form \(\mathit{A}\mathbf{x} \leq \mathbf{b}\) if we wished (the form above, however, more closely matches the code for HyLoRD).
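HyLoRD itself solves this QPP with qpmad (in C++), but the formulation can be illustrated with any constrained solver. The Python sketch below uses SciPy's SLSQP (a general-purpose solver, not the one HyLoRD uses) on synthetic data with known proportions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
M, N = 200, 4

# Synthetic reference matrix and a bulk profile built from known proportions.
R = rng.random((M, N))
p_true = np.array([0.4, 0.3, 0.2, 0.1])
B = R @ p_true

Q = R.T @ R          # quadratic term
c = -R.T @ B         # linear term (so that c^T p = -B^T R p)

def objective(p):
    return 0.5 * p @ Q @ p + c @ p

result = minimize(
    objective,
    x0=np.full(N, 1.0 / N),          # start from uniform proportions
    method="SLSQP",
    bounds=[(0.0, 1.0)] * N,         # 0 <= p_k <= 1
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],  # sum to 1
)
print(result.x)  # should recover p_true up to solver tolerance
```

Because the bulk profile here is generated exactly from \(\mathbf{p}_{true}\), the solver recovers the true proportions; with real (noisy) data the recovered proportions are only an estimate.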
Because \(\mathit{Q}\) is defined as \(\mathit{R}^T\mathit{R}\), we ensure that \(\mathit{Q}\) in \(\eqref{eq:retrieved-qpp}\) is symmetric positive semi-definite. This is important as the qpmad solver implements the Goldfarb-Idnani algorithm.
In fact, for Goldfarb-Idnani, our matrix \(\mathit{Q}\) needs to be positive definite (the definition \(\mathit{Q} = \mathit{R}^T\mathit{R}\) only guarantees positive semi-definiteness). To achieve this, we need to be confident that the columns of the original reference matrix are linearly independent. Whilst we can't guarantee this, we can at least be sure that such a reference matrix is incredibly unlikely to be entered (unless intentionally). The reason for this is that our reference matrix will have thousands (or hundreds of thousands) of rows (one for each CpG) of floating point numbers (methylation measured as a proportion between 0 and 1). Due to how floating point numbers are stored computationally (and the sheer number of rows), the chance of a column being expressible as a linear combination of the other columns in the matrix is essentially zero.
In the event that the matrix is not positive-definite, you will see a message alluding to Cholesky factorisation after running HyLoRD. Please check that you don't have any duplicated columns in the reference matrix you provided. There is also the extremely unlikely case where the novel cell type generator managed to produce a cell type that is a linear combination of the others in the matrix. In such a case, running the tool again should fix this issue (unless the case occurs again of course).
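The linear independence argument can be illustrated numerically: the smallest eigenvalue of \(\mathit{Q} = \mathit{R}^T\mathit{R}\) is strictly positive exactly when the columns of \(\mathit{R}\) are linearly independent. This illustrative Python snippet contrasts a full-rank reference matrix with one containing a duplicated column:

```python
import numpy as np

rng = np.random.default_rng(7)
M = 1000

# Full-rank reference matrix: three independent random columns.
R_good = rng.random((M, 3))

# Degenerate reference matrix: the first column is duplicated, so the
# columns are linearly dependent and Q = R^T R is singular.
col = rng.random((M, 1))
R_bad = np.hstack([col, col, rng.random((M, 1))])

for name, R in [("good", R_good), ("bad", R_bad)]:
    Q = R.T @ R
    # Q is positive definite iff its smallest eigenvalue is strictly positive;
    # a (numerically) zero smallest eigenvalue signals dependent columns,
    # which is what makes a Cholesky factorisation fail.
    smallest = np.linalg.eigvalsh(Q).min()
    print(name, "smallest eigenvalue of Q:", smallest)
```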