Huan Fan

Huan Fan — 2023-07-08T08:59:38+00:00

I’d like to run PICRUST2 from R so I don’t need to run it from a terminal myself. I installed the PICRUST2 environment via conda the first time, but due to my own stupidity, I installed it inside another environment! The second time I tried to install it, for some reason the same steps for creating the conda environment did not work. Therefore I installed mamba and then created an environment for PICRUST2.

However, when I tried to activate my PICRUST2 environment with mamba activate picrust2, it complained:

Run 'mamba init' to be able to run mamba activate/deactivate and start a new shell session. Or use conda to activate/deactivate.

Therefore I added mamba init to the command, but the complaint persisted, as if the effect of mamba init did not carry over to the next command. There must be something I do not understand about the shell in which a command is run via system or system2 in R.
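One likely piece of the puzzle is that every system or system2 call starts its own short-lived shell, so anything set up in one call is gone by the next. Below is a minimal sketch in plain shell, assuming each `sh -c` stands in for one separate call from R; `DEMO_VAR` is a made-up variable used only for illustration, not anything mamba sets.

```shell
#!/bin/sh
# Each `sh -c` below stands in for one separate system() call from R (assumption).
unset DEMO_VAR

# First "call": set a variable inside a child shell, then the child exits.
sh -c 'DEMO_VAR=initialised; export DEMO_VAR'

# Second "call": a brand-new child shell does not see what the first one set.
second_call=$(sh -c 'echo "${DEMO_VAR:-unset}"')

# The same two steps chained inside ONE child shell: the state is still there.
single_call=$(sh -c 'DEMO_VAR=initialised; export DEMO_VAR; echo "${DEMO_VAR:-unset}"')

echo "separate calls: $second_call"
echo "single call:    $single_call"
```

The first line prints `unset` and the second prints `initialised`. If this is what is happening in R, then setup and the actual command would have to live in the same call to share any state.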

I looked up how to activate a conda environment in R. This post is particularly helpful.

First I tried to activate my env in R, which was successful.

  # activate PICRUST2 (conda_list() and use_condaenv() come from reticulate)
  library(reticulate)
  if ('picrust2' %in% conda_list()$name) {
    use_condaenv('picrust2', required = TRUE)
  } else {
    print('no picrust2 environment')
  }

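However, when I tried to run picrust2_pipeline.py via system, R acted as if I had never activated picrust2. In hindsight this makes sense: use_condaenv only tells reticulate which Python to load, and it does not change the environment of the shell that system starts.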

Someone else suggested indicating the python in the picrust2 env:

path2python <- '/home/user/mambaforge/envs/picrust2/bin/python'
use_python(path2python, required = TRUE)

This also did not work.

Then I tried giving the full path to picrust2_pipeline.py in the system call. This works until picrust2_pipeline.py needs further executables from the picrust2 environment, including place_seqs.py, epa-ng, gappa, hsp.py, metagenome_pipeline.py and pathway_pipeline.py (don’t ask how I know!). They are all in the same directory as picrust2_pipeline.py, and soft-linking them into a directory already on my PATH worked (totally cheating). Specifically for picrust2, you also need to install the castor package in R. Oh my, running R inside Python inside R!

Of course a more reasonable solution would be to add the picrust2 bin directory (/home/user/mambaforge/envs/picrust2/bin/) to $PATH via:

export PATH=$PATH:/path-to-picrust2/bin
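To see why the full path alone is not enough while a PATH entry is, here is a small self-contained sketch. Everything in it is hypothetical: the temporary directory stands in for the environment's bin directory, and `demo_pipeline` and `demo_helper` are made-up stand-ins for picrust2_pipeline.py and one of the tools it calls by bare name.

```shell
#!/bin/sh
# Hypothetical stand-in for /home/user/mambaforge/envs/picrust2/bin
env_bin=$(mktemp -d)

# Two fake tools: the "pipeline" calls its "helper" by bare name,
# the way a pipeline script calls the other tools installed next to it.
printf '#!/bin/sh\necho helper-ran\n' > "$env_bin/demo_helper"
printf '#!/bin/sh\ndemo_helper\n' > "$env_bin/demo_pipeline"
chmod +x "$env_bin/demo_helper" "$env_bin/demo_pipeline"

# Full path to the pipeline only: it starts, but its helper cannot be found.
without_path=$("$env_bin/demo_pipeline" 2>/dev/null || echo "helper-missing")

# Same call with the bin directory prepended to PATH for this one command.
with_path=$(PATH="$env_bin:$PATH" "$env_bin/demo_pipeline")

echo "without PATH: $without_path"
echo "with PATH:    $with_path"
rm -rf "$env_bin"
```

The sketch prepends the directory rather than appending it as in the export line above. That is a design choice, not a requirement: prepending makes the environment's own executables win over same-named ones elsewhere on PATH.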

So, my current answer to the title of this article is: no, you cannot enter a conda environment from the R console. However, one realization I had was that you can activate a mamba environment via conda, so conda activate picrust2 also works.

Huan Fan — 2023-07-08T08:59:38+00:00

For some reason nMDS always separates the samples more cleanly than other ordination methods, so I use it a lot. However, it is hard to see the variance explained by each nMDS axis. Therefore one should always consider a principal component method first (such as PCA, PCoA or CA).

My reference is mainly this book: Numerical Ecology with R, 2nd edition.

Today we will talk about CA (correspondence analysis). Some key facts:

It handles presence-absence or abundance data (frequencies or frequency-like values; dimensionally homogeneous and non-negative).
It is not influenced by double zeros.
It is asymmetrical.
No pre-transformation is needed.
The variation explained by each orthogonal axis is measured as a share of the total inertia (the sum of squares of all values in matrix Q bar).
Two scalings: 1. rows (sites) are at the centroids of the columns (species). Default.
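The total inertia mentioned in the list above can be written out as follows. This is a sketch in the usual textbook notation for CA; the symbols f, p and λ are my assumption about notation and do not appear in the notes themselves.

```latex
% f_{ij}: abundance of species j at site i; f_{++}: grand total of the table
p_{ij} = \frac{f_{ij}}{f_{++}}, \qquad
p_{i+} = \sum_j p_{ij}, \qquad
p_{+j} = \sum_i p_{ij}

% element of the matrix "Q bar": scaled deviation from the independence expectation
\bar{q}_{ij} = \frac{p_{ij} - p_{i+}\,p_{+j}}{\sqrt{p_{i+}\,p_{+j}}}

% total inertia and the share carried by axis k (\lambda_k are the CA eigenvalues)
\text{total inertia} = \sum_i \sum_j \bar{q}_{ij}^{\,2} = \sum_k \lambda_k,
\qquad
\text{share of axis } k = \frac{\lambda_k}{\sum_{k'} \lambda_{k'}}
```

So "variance explained" by a CA axis is its eigenvalue divided by the total inertia, which is the quantity that nMDS does not give you.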

Huan Fan — 2023-07-08T08:59:38+00:00

Link to the article

Title: Trade-offs between physical and chemical carbon-based leaf defence: of intraspecific variation and trait evolution

Summary

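1. Despite recent advances in studies on trade-offs between plant defence traits, little is known about whether trade-offs reflect (i) evolutionary constraints at the species level or (ii) allocation constraints at the individual level. Here, we asked to which degree physical and chemical carbon-based leaf defence traits covary within and across species. My reading: there are some recent studies on trade-offs between plant defence traits, but we still do not know whether such a trade-off reflects (i) evolutionarily constrained differences between species or (ii) resource-limited differences between individuals.

2. We assessed leaf toughness, leaf total phenolic and tannin concentrations for 51 subtropical tree species. Species trait means, sample-specific values and phylogenetically independent contrasts were used in regression analyses. Phylogenetic signals and trait evolution were assessed along the phylogeny. My reading: they picked 51 subtropical tree species and measured leaf toughness (physical defence) plus leaf total phenolic and tannin content (chemical defence). For the regressions, these traits were both averaged by species and used directly as individual-level values. (What on earth are phylogenetically independent contrasts? Since our own study is mainly about (ii) allocation constraints at the individual level and does not involve phylogeny, I will be lazy for now.) They also tested the phylogenetic signal of the three traits and their evolution along the phylogenetic tree.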

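3. Analyses of species-level trait means revealed significant negative trait covariations between physical and chemical defence traits in analyses over all species. My reading: at the species level, the species means of physical and chemical defence traits are negatively correlated. These covariations were inconsistent at the within-species level. My reading: within species, these negative correlations do not always hold.

All three defence aspects showed strong phylogenetic signals, but differed in the degree of conservatism along the phylogeny. My reading: all three traits carry a significant phylogenetic signal, but its strength differs.

Inclusion of intraspecific trait variability significantly decreased the strength of these covariations. My reading: using individual-level data (my guess is a mixed model with species as a random effect) weakens the correlations between traits. Strong negative covariations were detected between physical and chemical defence traits when phylogenetic non-independence was accounted for. My reading: once phylogenetic relationships are included, the correlation between physical and chemical defence traits becomes stronger.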

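4. Synthesis. We addressed two sources of variation (allocation and evolution) independently from each other in the context of trait interrelationships. My reading: within the framework of trait correlations, they analysed two different sources of trait variation (within species and between species). The observed negative covariations hint at the existence of a trade-off between physical and chemical defence traits. My reading: the negative correlation between traits hints at a trade-off between physical and chemical defence.

The finding that intraspecific trait variation contributed less to this relationship suggests that the trade-off is dominated by evolutionary constraints rather than by carbon allocation constraints. My reading: the trade-off is not obvious in within-species trait variation, which suggests it is driven mainly by evolution rather than by carbon allocation.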

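In the end, all I can say is that this author really knows how to write! My wild guess is that the design never intended to use species means (with individual-level data in hand, who would want to use means?), but the individual-level data did not give good significance, so species means were used instead, and the signal turned out to be strong! Only then did such a lofty explanation as (i) evolutionary constraints at the species level or (ii) allocation constraints at the individual level come about. The phylogenetic analysis is of course routine.

Adding an evolutionary angle to this kind of study is actually very meaningful, but ecologists do not necessarily think of it, and even if they do, they cannot necessarily write it up well. Respect!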

Huan Fan — 2023-07-08T08:59:38+00:00

Link to the article

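Title: Defence against vertebrate herbivores trades off into architectural and low nutrient strategies amongst savanna Fabaceae species. My reading: to resist browsing by herbivores (insects not included), some Fabaceae living in savannas evolved either physical defence or a low-nutrient strategy.

Abstract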

Herbivory contributes substantially to plant functional diversity and in ways that move far beyond direct defence trait patterns, as effective growth strategies under herbivory require modification of multiple functional traits that are indirectly related to defence. My reading: herbivory clearly contributes to plant functional diversity, and not only through patterns in directly defence-related traits; growing effectively under herbivore pressure also requires adjusting many functional traits that are only indirectly related to defence. In order to understand how herbivory has shaped plant functional diversity, we need to consider the physiology and architecture of the herbivores and how this constrains effective defence strategies. My reading: we have to look at how the herbivores’ physiology and build limit the plants’ effective defence strategies (basically, the two sides fight each other). Here we consider herbivory by mammals in savanna communities that range from semi-arid to humid conditions. My reading: this paper studies the effect of mammals on savanna plant communities across moisture conditions, from semi-arid to humid.

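We posited that the saplings of savanna trees can be grouped into two contrasting defence strategies against mammals, namely architectural defence versus low nutrient defence. My reading: saplings of savanna trees fall into two groups according to their defence strategy against herbivores, the architectural type and the "low-energy" type (just kidding).

We provide a mechanistic explanation for these different strategies based on the fact that plants are under competing selection pressures to limit herbivore damage and outcompete neighbouring plants. My reading: they offer a mechanistic explanation for the different strategies plants adopt under competing selection pressures from herbivores and from neighbouring plants:

Plant competitiveness depends on growth rate, itself a function of leaf mass fraction (LMF) and leaf nitrogen per unit mass (Nm). My reading: competitiveness depends on growth rate, and growth rate is proportional to both leaf mass fraction (LMF) and leaf nitrogen per unit mass (Nm) (Equation 1).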

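Architectural defence against vertebrates (which includes spinescence) limits herbivore access to plant leaf materials, and partly depends on leaf-size reduction, thereby compromising LMF. My reading: architectural defence (spines included) keeps herbivores from feeding on the leaves and usually comes with smaller leaf area, so LMF is lower. Low nutrient defence requires that leaf material is of insufficient nutrient value to support vertebrate metabolic requirements, which depends on low Nm. My reading: the leaves of low-nutrient defenders cannot meet the metabolic needs of vertebrates, because their Nm is low. Thus there is an enforced tradeoff between LMF and Nm, leading to distinct trait suites for each defence strategy. My reading: if growth rate is to stay high, LMF and Nm cannot both be low, so a plant can only pick one. (Can it not pick both? It can, but once resources are split two ways, it cannot outcompete those that put everything on one side. Choice matters more than effort, my friend.) We demonstrate this tradeoff by showing that numerous traits can be distinguished between 28 spinescent (architectural defenders) and non-spinescent (low nutrient defenders) Fabaceae tree species from savannas, where mammalian herbivory is an important constraint on plant growth. My reading: they compared many traits across 28 Fabaceae tree species belonging to the two defence strategies, all living in savannas with heavy herbivore pressure, and showed the LMF-Nm trade-off. (Up to this point water has not been mentioned at all...)
Distributions of the strategies along an LMF-Nm tradeoff further provides a predictive and parsimonious explanation for the uneven distribution of spinescent and non-spinescent species across water and nutrient gradients. My reading: since species sit at different points along this LMF-Nm trade-off spectrum, one can imagine that spiny plants have chosen low LMF, i.e. high Nm, and so may need to live where nutrient conditions are better (what does that have to do with water? photosynthesis?), which leads to the uneven distribution of spiny and non-spiny plants along water and nutrient gradients.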

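The last sentence really comes out of nowhere, because the part on experimental design mentioned no water or nutrient treatment. Of course, not mentioning it in the abstract does not mean it is not there. All I can say is that the PI really does not care much about the abstract.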

Huan Fan — 2023-07-08T08:59:38+00:00

I am dealing with a plant transcriptome dataset in which I am mining for microbial signals. As you might have guessed, I have tried HuMANN3-alpha (please see the previous post). With its

Huan Fan — 2023-07-08T08:59:38+00:00

"Oh you are a data scientist! Can you fix my computer?"
"No."

Today I was asked to install a Linux system on a laptop that already had Windows on it. This is not the type of work I’d like to take on at all, but I had to go to a long meeting, so I thought I could do this in the background since most of the time you just wait. But when I asked what type of Linux this person wanted (I assumed Ubuntu), I was told CentOS. I’ve never installed it before, so this post is here to remind me not to take dirty jobs like this.

Step 1 : making a USB stick as your installer

I found this helpful post on setting up a USB key to install CentOS. The only problem is that I have a Mac, so I needed this post to help with the use of dd. In short:

Download a CentOS iso.
Plug in your USB stick and find out which device it is with diskutil list. In my case it is /dev/disk2.
Unmount the disk: diskutil unmountDisk /dev/disk2
Write the image to it: sudo dd if=CentOS-7-x86_64-DVD-1810.iso of=/dev/disk2. Note that the CentOS image is much bigger than Ubuntu's, so this step takes a while (3-4 h).

Step 2 : Installation

From now on follow the actual post.

F12 for Lenovo to be able to select USB for booting.

The first warning is about disk space. I deleted one of the partitions; about 100G was needed.

Everything else is intuitive, except that the post used the minimal install, which is equivalent to the server version of Ubuntu. I selected the Gnome version since this was done on a laptop for personal use only. This explains why the file was so big: it contains all versions of CentOS.

Huan Fan — 2023-07-08T08:59:38+00:00

"How much sequencing should I get for each sample?" asked the experimental scientist.
"Depends on your sample." Answers the bioinformatician.

You got some money for a sequencing project. You’ve done your experimental design to make sure each treatment has a fair number of replicates, and then comes the million-dollar question: how much sequencing should you get?

How Should Normalization Happen

Huan Fan — 2023-02-27T00:00:00+00:00

I spent a lot of time thinking about the normalization problem in microbiome datasets.

The common practice right now is rarefaction to the lowest number of reads. But apparently that is not the best way of doing things (McMurdie 2014 PLoS Computational Biology). I tried to look for the approach that works best for my dataset, and after studying Lloréns-Rico 2021 Nature Communications very carefully, I settled on TMM (trimmed mean of M values). However, one question remains: my data preparation pipeline has many steps, such as removing low-abundance ASVs and singletons, or subsetting for a particular reason (a specific tissue only, or fungi from a particular trophic mode). When should TMM happen?

Since TMM fixes the problem of uneven library size, obviously it has to happen sooner rather than later. But the ASV-calling part is not set in stone. Should removing low-abundance ASVs happen first? To answer this question, I decided to read Nearing 2022 Nature Communications (wow, it seems like everyone has an NC paper!), titled Microbiome differential abundance methods produce different results across 38 datasets.

This paper compares different ways of identifying differentially abundant microbes. I’ve actually asked ChatGPT the same question, and it gave me the following answer:

Differential abundance analysis: This approach involves comparing the abundance of each species in the treatment group to the control group to identify which species are differentially abundant in the treatment group. This can be done using a variety of statistical methods, such as t-tests, ANOVA, or generalized linear models. The advantage of this approach is that it can identify individual species that are responding to the treatment, but it may not capture interactions between species.
Co-occurrence network analysis: This approach involves constructing a network of co-occurring species and identifying modules or clusters of species that are correlated with the treatment group. This approach can be useful for identifying groups of species that are responding to the treatment, but it may not capture individual species that are responding.
Machine learning: This approach involves using machine learning algorithms to predict the response of each species to the treatment based on its abundance and other metadata. This can be a powerful approach if there are complex interactions between species or if the response of individual species is difficult to predict based on abundance alone.
Correlation analysis: This approach involves identifying species that are positively or negatively correlated with the treatment group. This can be done using correlation coefficients such as Pearson’s correlation coefficient or Spearman’s rank correlation coefficient. The advantage of this approach is that it can identify species that are responding to the treatment, but it may not capture the direction or magnitude of the response.

But as you already know, my focus is on the pre-processing of the data, and Table 1 says it all. Basically, different statistical tests use different normalization methods (or no normalization at all).

Therefore, for my current analyses, PERMANOVA and Mantel’s test, I need to see what the best normalization method would be. It is thus important to keep the original phyloseq objects without further pre-processing.

So back to my own question: what kind of normalization do I need for

PERMANOVA test
Mantel’s test.

FAPROTAX

Huan Fan — 2022-12-04T00:00:00+00:00

Download and install

The current version is 1.2.6. Then follow their instructions (http://www.loucalab.com/archive/FAPROTAX/lib/php/index.php?section=Instructions) from which I learnt a lot.

No installation is needed. The whole package is basically a Python script plus a text-format data file. But make sure you have Cython and biom-format installed.

$ pip install Cython
$ pip install biom-format

Options

The input file is different from PICRUST2. This one needs the taxonomy annotation, similar to the one I received from Rhonin. It is clearly explained in their instruction page under Taxonomy format in the input table. See function ps2total_table in amplicon_functions.R for generating the input file from a phyloseq object.

Then we run the script.

$ path_to/FAPROTAX_1.2.6/collapse_table.py -i input_table.txt -o func_table.tsv -g path_to/FAPROTAX_1.2.6/FAPROTAX.txt -d "taxonomy" -c "#" -v --omit_columns 0 -r out_report.tsv -s sub_table --group_leftovers_as "Unassigned" -f

Help on the options:

	-i, --input_table		Path to input OTU table listing OTU abundances per sample, in classical (tabular) or BIOM format. By default columns should represent samples and rows should represent OTUs or taxa. 
	-g, --input_groups_file		Path to FAPROTAX database file, or any other similar specification of groups by which to collapse the OTU table.
	-o, --out_collapsed		Path to output function table, listing functional group abundances per sample. (optional)
	-r, --out_report		Path to output report file, listing OTUs associated with each functional group and some other summary statistics (optional).
	-d, --row_names_are_in_column		Column listing the taxonomic paths in the input OTU table (if in classical format). If column names are available as a header (see option --column_names_are_in), this specifies a column name, otherwise it specifies a column index (first column is 0).
	-s, --out_sub_tables_dir		Path to output directory, to which sub-tables of the original OTU table (one per functional group) shall be saved. Each sub-table will only list OTUs included in the particular functional group. (optional)
	--omit_columns		Comma-separated list of any column indices to ignore in the input OTU table (if in classical format). For example, if the first column lists OTU IDs (not taxonomic paths), you should pass '--omit_columns 0', otherwise the first column will be treated as another sample.
	--group_leftovers_as		Optional group name for listing all OTUs not assigned to any functional group.
	-f, --force		(Flag) Replace all existing output files without warning.

My understanding on the options:

	-i	Discussed
	-g	Came with the package
	-o	Abundance matrix for downstream analysis. But why is this optional?
	-r	Records the verbose output and reports the grouping info of each ASV, organized by the functional groups in -g.
	-d	name of the column that contains the taxonomy information.
	-s	A directory to hold separate -o style tables, one for each group found in -i.
	--omit_columns	Index of the columns that are neither taxonomy info nor sample, such as ASV ids or other metadata.
	--group_leftovers_as	In my data, only 20% of my ASVs were assigned to groups. Therefore it is worth identifying the rest. I'd like to call them "unassigned"
	-f	Very helpful

There is also a normalization option

	-n, --normalize_collapsed		How to normalize the output function table. Options include 'none' (no normalization, default), 'columns_before_collapsing' (TSS of the OTU table), 'columns_after_collapsing' (TSS of the function table), 'columns_before_collapsing_excluding_unassigned' (TSS of the OTU table restricted to functionally assigned OTUs).

TSS stands for total sum scaling, which divides feature read counts (the number of reads from a particular sample that cluster within the same OTU) by the total number of reads in each sample, i.e., it converts feature counts to proportions that sum to 1 within each sample, a.k.a. naive percentages. I don’t think it is very helpful.
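To make the definition concrete, here is a small worked sketch of TSS on a toy table. The counts, OTU names and sample names are made up, and the awk script is my own illustration of the arithmetic, not how collapse_table.py implements its -n option.

```shell
#!/bin/sh
# Made-up counts: rows are OTUs, columns are samples (tab-separated).
table=$(printf 'OTU\tS1\tS2\notu_a\t30\t5\notu_b\t70\t15\n')

# Read the whole table once, summing each sample column on the way,
# then divide every count by its column total in the END block.
tss=$(printf '%s\n' "$table" | awk -F'\t' -v OFS='\t' '
  NR == 1 { header = $0; next }
  {
    n++
    id[n] = $1
    nf = NF
    for (j = 2; j <= NF; j++) {
      count[n, j] = $j
      total[j] += $j
    }
  }
  END {
    print header
    for (i = 1; i <= n; i++) {
      line = id[i]
      for (j = 2; j <= nf; j++)
        line = line OFS sprintf("%.2f", count[i, j] / total[j])
      print line
    }
  }')

printf '%s\n' "$tss"
```

Sample S1 has 100 reads in total and S2 has 20, so otu_a becomes 0.30 and 0.25, and otu_b becomes 0.70 and 0.75. Each column now sums to 1, which shows the limitation: the fivefold difference in library size between the two samples is simply divided away.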

Output files

There are two output files.

The first one is --out_collapsed, with rows for functions and columns for samples. This can be directly used for downstream analysis. However, my pipeline for functional composition is the same as taxonomic composition, therefore I need a stratified version (instead of collapsed), which can be generated by concatenating all the files in --out_sub_tables_dir. Note that PICRUST2 can provide both collapsed and stratified results as well.

Downstream analysis

To follow my plotting convention, I will make the collapsed output file into a phyloseq object and work from it. See picrust2_allthree.R for details.

Yuan 2021 Nature Climate Change

Huan Fan — 2022-11-28T00:00:00+00:00

Title: Climate warming enhances microbial network complexity and stability

Abstract:

Unravelling the relationships between network complexity and stability under changing climate is a challenging topic in theoretical ecology that remains understudied in the field of microbial ecology. My reading: the big scientific question is the relationship between network complexity and stability; the context is climate change; the system is microbial networks.

Here, we examined the effects of long-term experimental warming on the complexity and stability of molecular ecological networks in grassland soil microbial communities. My reading: the experiment is a long-term warming experiment on grassland soil microbial communities.

Warming significantly increased network complexity, including network size, connectivity, connectance, average clustering coefficient, relative modularity and number of keystone species, as compared with the ambient control. My reading: compared with the ambient control, warming significantly increased network complexity across all of these measures.

Molecular ecological networks under warming became significantly more robust, with network stability strongly correlated with network complexity, supporting the central ecological belief that complexity begets stability. My reading: molecular ecological networks (what are these?)

Furthermore, warming significantly strengthened the relationships of network structure to community functional potentials and key ecosystem functioning. These results indicate that preserving microbial ‘interactions’ is critical for ecosystem management and for projecting ecological consequences of future climate warming.

Molecular Ecological Networks (MENs): association networks in microbial ecology are typically reconstructed on the basis of molecular markers, so the authors refer to them as molecular ecological networks (MENs). The original