2019 Nat Commun Bulk tissue cell type deconvolution

题目：Bulk tissue cell type deconvolution with multi-subject single-cell expression reference

摘要

Knowledge of cell type composition in disease relevant tissues is an important step towards the identiﬁcation of cellular targets of disease.

知道细胞类型组成很重要。

We present MuSiC,

MuSic, 好名字，我喜欢。

a method that utilizes cell-type speciﬁc gene expression from single-cell RNA sequencing (RNA-seq) data to characterize cell type compositions from bulk RNA-seq data in complex tissues.

输入：

cell-type speciﬁc gene expression from single-cell RNA sequencing (RNA-seq) data
bulk RNA-seq data in complex tissues

输出：cell type compositions

By appropriate weighting of genes showing cross-subject and cross-cell consistency,

首先我们需要genes showing cross-subject (个体之间) and cross-cell (细胞类型之间) consistency，对标我的工作，就是cross-sample and cross-species。然后我们赋予他们权重(of course)。那么问题来了，这个consistency如何measure?

作者也说了，这个“marker gene consistency”是这个工具的一个key concept，那下面我们一起来了解一下吧～

cross-subject consistency: guard against bias in subject selection. 对标我就是我的3个technical replicate.
cross-cell consistency: measures the cell type specificity of genes, guard against biaas in cell capture. 这个我太好measure了，就是shared-by-n.

那这个consistency有啥用呢？“up-weighting genes with low cross-subject variace (informative genes) and down-weighing genes with high cross-subject variance (non-informative genes). 这个我直接就可以用啊！上mix！

“The top ranked non-marker genes… tend to be consistently expressed across subjects, and are highly expressed”, 就是类似house keeping嘛。也就是说那些shared-by-all且highly expressed（比如我的poly A）权重很高，但是因为他们并不是某类cell type所特有，所以并不是传统意义上的marker gene。这跟我的研究很不一样。叶绿体里面kmer的abundance是由于kmer太短导致的，是一种homology，不是orthology。可能更好的方式是更长的k, 然后引入sparsity, 还需要包括每个物种所特有的kmer (目前这个30万的X里面是没有的）。

“However, we note that some of the top-ranked genes are not necessarily marker genes”

What are marker genes?

Usually:

“These methods rely on pre-selected cell type-specific marker genes”.

在文章中搜索了rank这个词，并为找到答案，那就看看Methods吧。你们准备好看公式了吗？准备好了的扣1，没准备好的明天再来。

For gene g, let Xjg be the total number of mRNA molecules in subject j of the given tissue, which is composed of K cell types. 好的，那X这个矩阵就是mix的矩阵，也就是我们现在代码里的Y.

下面将Xjg展开成与Xjgc的关系，后者是the number of mRNA molecules of gene g in cell (not cell type) c of subject j.

接着是gene g in subject j for cell type k的relative abundance的表示(Eq. 1)

那cross-cell consistency呢？

里面还提到recursive tree-guided deconvolution for closely related cell types, 这个跟我要解决两个closely-related species是一致的。single-cell sequencing真的是矩阵方法宝藏。

To-do:

去掉low cross-subject consistency markers

Huan Fan / 2022-09-01
Published under (CC) BY-NC-SA in categories 1paperAday tagged with liana single-cell-sequencing