from HUMAnN2 to HUMAnN3-Alfa

In this post I will first log some of my experience with HUMAnN2 and then move on to my experience of trying the bleeding-edge HuMAnN3-alfa.

Part I: HUMAnN2

Run through the standard workflow. Problems that I ran into:

  1. Pair-end reads: need to concatenate before feeding to the program
  2. The default setting uses only one core. Make sure you use the ‘–thread’ option if you wish to use multiple cores.
  3. Normalization
    • You need to do some normalization to generate the relative abundance using humann_renorm_table; the numbers reported in the output files are not in percentages.
    • A R package called MaAsLin2 for downstream analysis is available. Might be worth exploring. Now that Rstudio supports python, I might spend the majority of my time in Rstudio besides terminal.
    • You can combine results from different samples into one table by humann2_join_tables
  4. More annotation besides MetaCyc (the one used in the pathway files) using humann_regroup_table and make it human-readable via humann_rename_table
  • KEGG Orthogroups (KOs)
  • Pfam domains
  • Level-4 enzyme commission (EC) categories
  • EggNOG (including COGs)
  • Gene Ontology (GO)
  • Informative GO

Part II: HuMAnN3

Wow, alfa, usually I don’t even try beta. But I ran into some problems while installing metaphlan2, there the biom module (thought not necessary) is not supported by python2 anymore. I figure it is time to move on. So, HuMANN3-alfa, here I come!

According to the HuMAnN3 website, here are the updates in this new version.

## Major updates in v3.0

  • Designed in tandem with MetaPhlAn 3.0.
  • Based on UniProt/UniRef 2019_01 sequences and annotations.
  • Contains 2x more species pangenomes and 3x more gene families (vs. v2.0).
  • Removed version number from executables (just metaphlan and humann now).

As you can see, this is mainly a database update forced by the sunset of python2.

## Minor updates in v3.0

  • Pangenome sequences must be covered at >50% of sites to be reported (tunable).
  • Additional accuracy and performance re-tuning across all search steps.
  • Translated search updated to use DIAMOND 0.9.

To understand the first update you need to first understand the workflow of humann. There are three major steps:

  1. Identification of microbial species using MetaPhlAn. The relevant parameter at this stage is --prescreen-threshold <0.01>, I think it means only bugs with relative abundance of >1% are kept. This option was carried over from HUMAnN2. A pangenome database is constructed based on the bug list ($sample_metaphlan_bugs_list.tsv).

  2. Nucleotide-level mapping of all reads against the pangenome database constructed in step 1. This is done by bowtie2. The relevant params are

    • --nucleotide-identity-threshold <0.0> This is whether bowtie made a call for match or not. I actually don’t know how you tune this. Maybe through --bowtie-options?

I tried to understand what is going on here and you can see the discussion I had with Eric Franzosa (author of HUMAnN2 and 3). However it seems like he is referring to the second point in the updates: “Additional accuracy and performance re-tuning across all search steps.”. He talked about the split of “identity and coverage-filter params”, which I think was referring to the following params:

Let’s split them into what Eric called identity and coverage-related: Identity: [–nucleotide-identity-threshold <0.0>] tool: metaphlan/bowtie2, database: chocophlan, output: *_metaphlan_bugs_list.tsv [–translated-identity-threshold <Automatically: 50.0 or 80.0, Custom: 0.0-100.0>] tool: diamond, database: UniRef90, output: *_diamond_aligned.tsv. Here, Coverage: [–translated-subject-coverage-threshold <50.0>] [–nucleotide-subject-coverage-threshold <50.0>] [–translated-query-coverage-threshold <90.0>] [–nucleotide-query-coverage-threshold <90.0>]

In HUMAnN2, indeed there were only three parameters: [–identity-threshold <50.0>] [–translated-subject-coverage-threshold <50.0>] [–translated-query-coverage-threshold <90.0>]

So it seems like [–identity-threshold] becomes [–nucleotide-identity-threshold <0.0>] and [–translated-identity-threshold <Automatically: 50.0 or 80.0, Custom: 0.0-100.0>]. Meaning there are now two seperate thresholds for the two steps, the nucleotide one for MetaPhlAn and translated one for UniRef90. Since nucleotide one is set to 0 basically there is no filtering at that step. This is the only one that with a 90 setting, and Eric says “Translated search in v3.0 against UniRef90 is slightly more permissive (less conservative), with the identity threshold lowered from 90 to 80%.”

According to Eric, the translated search is referring to the diamond search against UniRef90, the function/pathway part. This explains the difference between subject and query. But what are the differences between identity and coverage, and what are the differences between translated and nucleotide? I would guess that identity and coverage here are referring to the We all know that humann works on two stages: 1. bug identification 2. pathway construction. Or the coverage here means the coverage of the pangenome in the chocophlan database? And identity is the same as the column in blast result?

The installation is uneventful. Just follow the instructions. Remember to use your HUMAnN3 (or just humann) also in the biobakery3 environment.

  1. Install metaphlan3
  2. Follow instructions on https://huttenhower.sph.harvard.edu/humann3/ $ conda create –name biobakery3 python=3.7 # need vpn $ conda activate biobakery3 $ conda install humann -c biobakery Testing testing $ humann_test All OK.
  3. Update databases. New databases! To upgrade your pangenome database: humann_databases –download chocophlan full /path/to/databases –update-config yes To upgrade your protein database: humann_databases –download uniref uniref90_diamond /path/to/databases –update-config yes To upgrade your annotations database: humann_databases –download utility_mapping full /path/to/databases –update-config yes

Features still missing:

  1. pair-end still not accepted. I understand that this is read-level id and does not uses pair-end information. However this would require to concatenate the two files into one to get one consensus result. (or is it? Kevin mentioned some sort of post-processing script that humann has. Time to explore?)
Huan Fan /
Published under (CC) BY-NC-SA in categories notes  tagged with Mac 
comments powered by Disqus