nifH database for dada2

I have some nifH gene amplicon sequencing data, and I have never dealt with this gene before. Here are some roadblocks I encountered.

  1. How to trim the sequences.

There are currently two ways in my dada2 pipeline: the ITS way, where the gene lengths vary so much that I only set a minimum length based on prior knowledge and suggestions, and the 16S way, where the fragment lengths are pretty much fixed and you just look at where the sequence quality drops and truncate there. What’s the case for nifH?

After removing reads containing Ns and primer removal, the quality profile of one of my samples looks like this:

[quality profile plot from plotQualityProfile()]

There seems to be some variability from 50 bp out to 250 bp, but 90% of the reads make it to 250 bp. Let’s take the 16S route and set truncLen to 250 bp.
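For the record, a minimal sketch of that call in dada2 (file names are placeholders; the commented-out line shows what the ITS route would use instead):

    library(dada2)
    # 16S-style filtering: hard truncation at 250 bp
    out <- filterAndTrim("sample_R1.fastq.gz", "filtered/sample_R1.fastq.gz",
                         truncLen = 250, maxN = 0, maxEE = 2, truncQ = 2,
                         compress = TRUE, multithread = TRUE)
    # ITS-style alternative: no truncation, keep reads above a minimum length
    # out <- filterAndTrim(..., truncLen = 0, minLen = 50)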

  2. Then comes the annotation.

I googled nifH database. The first thing I found was from the Buckley Lab: there is a fasta version, but it dates to 2012.

The second thing I found was from the Zehr Lab. It is quite up to date (April 2021) and comes with a fasta file and a metadata file. The fasta file does not carry the full taxonomic ranks for each entry, but we could fix that based on the accession numbers. The metadata file is not necessary for the classification but is very informative:

	acc,GI,full_name,author,clone,collected_by,collection_date,country,date,description,full_taxon_name,gene,host,isolation_source,lat_lon,mol_type,note,nuc_acc_and_pos,title,sequence_source,AMINO_July2012

The file comes with the suffix .nds. I do not know what that stands for, but it is basically a csv. The only catch is that some of the columns are wrapped in quotes and some are not; that is something to be fixed later.
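Reading it into R is straightforward; a minimal sketch, with the file name as a placeholder (a standard csv parser copes with the mixed quoting):

    # read.csv() handles fields that are sometimes quoted and sometimes not
    meta <- read.csv("zehr_nifh_metadata.nds", stringsAsFactors = FALSE)
    head(meta$acc)  # accession numbers, for joining to the fasta headers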

Before working on retrieving the full ranks for my fasta file, I would like to know whether I am reinventing the wheel.

From the papers I’ve read so far, people usually use an OTU pipeline, and nifH has its own ARB database (under the same link from the Zehr Lab).

I chose one paper to read with care:

Evaluation of Primers Targeting the Diazotroph Functional Gene and Development of NifMAP – A Bioinformatics Pipeline for Analyzing nifH Amplicon Data

Yes, I still read papers from Frontiers in Microbiology even though one of my NSFC reviewers thinks that I should not publish there.

They studied 8 primers (4 forward and 4 reverse), but the pair I used (Soares 2006) was not included. Umm… OK, can I still use their NifMAP pipeline? Their resources seem a bit out of date.

What else?

Oh, a 2022 New Phytologist paper! Oh well, they did qPCR, so only abundance was measured, not diversity.

Oh my, look what I found! A Bioinformatics paper with an R package! Looks promising. It is still within the same school of thought as NifMAP, in that they try to place the new sequences into an existing nifH gene tree. This is different from the 16S classification method, but it might be what we have to do with a much sparser database. I think I will do both and compare the results, so I do need to process the fasta file myself. In the meantime, let’s try this PPIT.

Apparently PPIT infers the taxonomy of newly identified ASVs, with the placement done by SEPP. Wow, Tandy Warnow! Divide and conquer! Old memories from when I was still doing phylogenomics. Thanks, Tandy, for inviting me to the phylogenomics symposium in DC in 2012!

OK so SEPP.

The installation finished without a problem. Four input files are required:

 -t TREE, --tree TREE  Input tree file (newick format) [default: None]
 -r RAXML, --raxml RAXML
	                RAxML_info file including model parameters, generated
	                by RAxML.[default: None]
 -a ALIGN, --alignment ALIGN
	                Aligned fasta file [default: None]
 -f FRAG, --fragment FRAG
	                fragment file [default: None]

I imagine the first three are reference files and the last one is my query file (in fasta format). OK, let’s give it a try. The database files can be found in the data folder of the PPIT repo, but they all come as rda files, so I need to read them in and write them out in plain-text formats. I might start a pull request adding code that saves those objects to the proper formats so people do not have to do it themselves. I imagine anyone who wants to process their own data needs them for the SEPP run, or am I missing something?
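Dumping them from R takes only a few lines. A sketch, assuming the tree is a phylo object and the alignment a named character vector; the .rda file names are guesses, and if the alignment turns out to be a DNAbin object, ape::write.FASTA() is the way to go:

    library(ape)

    # load() returns the names of the objects stored inside the .rda file
    obj <- load("ppit_nifh_reference_tree_v2.rda")
    write.tree(get(obj[1]), file = "ppit_nifh_reference_tree_v2.tre")

    obj <- load("ppit_nifh_reference_alignment_v2.rda")
    aln <- get(obj[1])  # assumed: named character vector of aligned sequences
    writeLines(paste0(">", names(aln), "\n", aln),
               "ppit_nifh_reference_alignment_v2.fa")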

    $ run_sepp.py -t ppit_nifh_reference_tree_v2.tre -r ppit_nifh_reference_RAxML_info_v2.txt -a ppit_nifh_reference_alignment_v2.fa -f myASV.fasta
[17:02:34] config.py (line 370):     INFO: Seed number: 297834
[17:02:34] algorithm.py (line 268):     INFO: Reading input alignment: <_io.TextIOWrapper name='ppit_nifh_reference_alignment_v2.fa' mode='r' encoding='UTF-8'>
[17:02:35] algorithm.py (line 274):     INFO: Reading input tree: <_io.TextIOWrapper name='ppit_nifh_reference_tree_v2.tre' mode='r' encoding='UTF-8'>
[17:02:36] algorithm.py (line 249):     INFO: Decomposition Sizes are set to alignment: 0 placement: 0
[17:02:37] tree.py (line 404):  WARNING: It was not possible to break-down the following tree according to given subset sizes: 0 , 0:
 Ruminococcus_sp__AM29-19LB-AM29-19LB-12869-13636-NZ_QTYJ01000015_1-WP_117861494_1
Traceback (most recent call last):
  File "/home/user/build/anaconda3/bin/run_sepp.py", line 26, in <module>
    ExhaustiveAlgorithm().run()
  File "/home/user/build/anaconda3/lib/python3.7/site-packages/sepp/algorithm.py", line 157, in run
    self.root_problem = self.build_subproblems()
  File "/home/user/build/anaconda3/lib/python3.7/site-packages/sepp/exhaustive.py", line 345, in build_subproblems
    maxDiam=None)
  File "/home/user/build/anaconda3/lib/python3.7/site-packages/sepp/tree.py", line 396, in decompose_tree
    decomp_strategy, pdistance, distances)
  File "/home/user/build/anaconda3/lib/python3.7/site-packages/sepp/tree.py", line 396, in decompose_tree
    decomp_strategy, pdistance, distances)
  File "/home/user/build/anaconda3/lib/python3.7/site-packages/sepp/tree.py", line 396, in decompose_tree
    decomp_strategy, pdistance, distances)
  [Previous line repeated 8 more times]
  File "/home/user/build/anaconda3/lib/python3.7/site-packages/sepp/tree.py", line 398, in decompose_tree
    decomp_strategy, pdistance, distances)
  File "/home/user/build/anaconda3/lib/python3.7/site-packages/sepp/tree.py", line 393, in decompose_tree
    (t1, t2, e) = self.bisect_tree(strategy, minSize)
  File "/home/user/build/anaconda3/lib/python3.7/site-packages/sepp/tree.py", line 348, in bisect_tree
    assert snl == tree1.n_leaves + tree2.n_leaves
AssertionError

Oops. It seems like something is wrong with the tree. No time for this now; I will go directly with the first option, which is adding proper rank info to the fasta file I found at the Zehr Lab.

I know of two formats that work for dada2::assignTaxonomy: the UNITE one and the SILVA one.

The UNITE format looks like this:

>Cryptosphaeria_sp|KU761147|SH1572218.08FU|reps|k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Diaporthales;f__Valsaceae;g__Cryptosphaeria;s__Cryptosphaeria_sp

It can be detected automatically by the function via the following code from the dada2 source:

    # excerpt from the dada2::assignTaxonomy() source
    UNITE <- FALSE
    if (all(grepl("FU\\|re[pf]s", tax[1:10]))) {
      UNITE <- TRUE
      cat("UNITE fungal taxonomic reference detected.\n")
      tax <- sapply(strsplit(tax, "\\|"), `[`, 5)  # taxonomy is field 5
      tax <- gsub("[pcofg]__unidentified;", "_DADA2_UNSPECIFIED;", tax)
      tax <- gsub(";s__(\\w+)_", ";s__", tax)
      tax <- gsub(";s__sp$", ";_DADA2_UNSPECIFIED", tax)
    }

Basically it looks for FU|reps or FU|refs in the tag to decide whether the reference is from UNITE; otherwise it looks for ‘;’ as in SILVA. If you would like to keep accession numbers in your customized database, you could go the first route. Please see my code for a working script that takes a fasta file with accession numbers and returns one with the ranks in one of these formats. Sequences without a phylum classification are discarded.
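To make that concrete, here is a hypothetical sketch for building a UNITE-style header for one entry; the make_unite_header() helper, the dummy SH field, and the example ranks are all mine, and the FU|reps token is there only so the detection above fires:

    # Hypothetical helper: UNITE-style header for a custom nifH entry, so
    # assignTaxonomy() keeps the accession number. The taxonomy must sit in
    # field 5 when split on "|"; prepend ">" when writing the fasta.
    make_unite_header <- function(acc, ranks) {
      tax <- paste0("k__", ranks["kingdom"], ";p__", ranks["phylum"],
                    ";c__", ranks["class"], ";o__", ranks["order"],
                    ";f__", ranks["family"], ";g__", ranks["genus"],
                    ";s__", ranks["species"])
      paste(ranks["species"], acc, "SH0000000.00FU", "reps", tax, sep = "|")
    }

    make_unite_header("AB123456",  # dummy accession
      c(kingdom = "Bacteria", phylum = "Proteobacteria",
        class = "Alphaproteobacteria", order = "Rhizobiales",
        family = "Rhizobiaceae", genus = "Rhizobium", species = "Rhizobium_sp"))

dada2::assignTaxonomy() can then be pointed at the rewritten fasta like at any other reference.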
