Download taxonomy
$ kraken2-build --download-taxonomy --db kraken2_db
This download is not trivial (35G), and the taxonomy folder can be shared across different kraken2 databases. I therefore usually move it outside the kraken2_db folder, label it with the date it was downloaded, and make a soft link to it inside the kraken2_db folder. I usually update this folder every six months.
kraken2_db$ mv taxonomy ../taxonomy20220725
kraken2_db$ ln -s ../taxonomy20220725 taxonomy
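The move-and-link steps above can be wrapped in a small helper so the date stamp is generated automatically. This is only a sketch: `share_taxonomy` is a made-up name, not part of kraken2, and it assumes the taxonomy folder has just been downloaded into the database folder.

```shell
# Hypothetical helper (not part of kraken2): move DB/taxonomy next to the
# database folder, stamp it with today's date, and link it back.
share_taxonomy() {
    db=$1
    stamp=$(date +%Y%m%d)                  # e.g. 20220725
    parent=$(dirname "$db")
    # Refuse to run if taxonomy is missing or is already a soft link.
    if [ ! -d "$db/taxonomy" ] || [ -L "$db/taxonomy" ]; then
        echo "no freshly downloaded taxonomy folder in $db" >&2
        return 1
    fi
    # Refuse to overwrite a folder that was stamped earlier the same day.
    if [ -e "$parent/taxonomy$stamp" ]; then
        echo "$parent/taxonomy$stamp already exists" >&2
        return 1
    fi
    mv "$db/taxonomy" "$parent/taxonomy$stamp" &&
        ln -s "../taxonomy$stamp" "$db/taxonomy"
}

# Usage: share_taxonomy kraken2_db
```

The link target is written relative to the database folder, the same as in the manual commands, so the database folder and the dated folder can be moved together.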
Download existing databases
$ kraken2-build --download-library archaea --db kraken2_db
unexpected FTP path (new server?) for*"
This is a known issue. Fixed by changing ftp to https on line 46 of
$ kraken2-build --download-library viral --db kraken2_db
$ kraken2-build --download-library fungi --db kraken2_db
$ kraken2-build --download-library bacteria --db kraken2_db
Step 1/2: Performing rsync file transfer of requested files
rsync: read error: Connection timed out (110)
rsync error: error in socket IO (code 10) at io.c(794) [receiver=3.1.2]
rsync: connection unexpectedly closed (5538447 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [generator=3.1.2]
rsync error, exiting: 3072
The files for bacteria total more than 100G, so network errors like this can occur. Just re-run the command.
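If you would rather not watch the download, the re-running can be automated. Below is a minimal sketch, assuming that kraken2-build exits with a non-zero status when rsync fails (I have not verified this for every failure mode). The `retry` function and the `RETRY_DELAY` variable are made up for illustration.

```shell
# Hypothetical helper: re-run a command until it succeeds, at most $1 times.
retry() {
    max=$1
    shift
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        echo "command failed, starting attempt $attempt of $max" >&2
        sleep "${RETRY_DELAY:-60}"         # pause before trying again
    done
}

# Usage: retry 5 kraken2-build --download-library bacteria --db kraken2_db
```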
Step 1/2: Performing rsync file transfer of requested files
Rsync file transfer complete.
Step 2/2: Assigning taxonomic IDs to sequences
Processed 32364 projects (76359 sequences, 134.83 Gbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library...sed: couldn't write 60 items to stdout: No space left on device
The masking step generates a tmp file that’s about the same size as the library.fna file, so at least 135*2G (270G) of disk space is required to run this command.
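A quick check before masking can save a failed run. Since library.fna is already on disk at this point, what remains to be checked is room for one more copy of it. This is a rough sketch only: `has_room_for_masking` is a made-up name, and it assumes the tmp file is written to the same filesystem as library.fna.

```shell
# Hypothetical helper: succeed only if the filesystem holding FILE has at
# least as much free space as FILE itself occupies (room for the tmp copy).
has_room_for_masking() {
    file=$1
    need_kb=$(du -k "$file" | cut -f1)     # space used by library.fna, in KiB
    free_kb=$(df -Pk "$(dirname "$file")" | awk 'NR == 2 { print $4 }')
    echo "need ${need_kb} KiB, free ${free_kb} KiB" >&2
    [ "$free_kb" -ge "$need_kb" ]
}

# Usage (the library/bacteria/ path is an assumption about the layout):
#   has_room_for_masking kraken2_db/library/bacteria/library.fna
```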
To avoid re-downloading library.fna, we need to pick up from where this command failed: ‘Masking low-complexity regions of downloaded library’.
for file in path_to_kraken2/kraken2/*
do
    echo "$file"
    grep Masking "$file"
done
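The same search can be done with `grep -l`, which prints only the names of the files that contain a match. A small sketch, with `files_mentioning` as a made-up helper name:

```shell
# Hypothetical helper: list the files directly inside DIR ($2) whose
# contents mention PATTERN ($1). Only file names are printed.
files_mentioning() {
    # Errors for sub-directories and unreadable files are silenced.
    grep -l -- "$1" "$2"/* 2>/dev/null
}

# Usage: files_mentioning Masking path_to_kraken2/kraken2
```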
It appeared in
. Apparently this is about the latter. Checking out the code in the bash file, it’s calling
kraken2_db$ ~/build/kraken2/ .
/home/hfan/build/kraken2/ line 17: KRAKEN2_PROTEIN_DB: unbound variable
The environment variable KRAKEN2_PROTEIN_DB is missing. Here we are not building protein databases, so it can be set to an empty string.
$ export KRAKEN2_PROTEIN_DB=''
kraken2_db$ .
It turns out that nothing was masked: library.fna.masked is empty. This is true for archaea, bacteria, fungi and viral.
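To check all libraries at once, the empty masked files can be listed with `find`. A sketch only: `list_empty_masked` is a made-up name, and the `library/<name>/library.fna.masked` layout under the database folder is my assumption about where the files sit.

```shell
# Hypothetical helper: print every library.fna.masked under DB/library
# that exists but has zero size.
list_empty_masked() {
    find "$1/library" -type f -name 'library.fna.masked' -size 0 -print
}

# Usage: list_empty_masked kraken2_db
```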
Add customized files
There’s not much restriction on the customized files to be added, except that
- each file needs to be in fasta format
- the header needs to be in this format:
Here 1234567 needs to be the taxID of the sequence you are adding. If the taxID of the sequence is unknown, you can use a taxID that shares the most recent common ancestor with it (for example, another strain under the same species or another species in the same genus) and that is not present in your kraken2 database to represent it. You need to keep a record of this, of course.
for file in dir/*.fna
do
    kraken2-build --add-to-library $file --db kraken2_db
done
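To prepare custom fasta files before adding them, the headers can be tagged in bulk. The sketch below follows the `kraken:taxid|XXX` convention described in the Kraken 2 manual, for example `>sequence16|kraken:taxid|32630`. The function name `tag_fasta_headers` is made up, and it assumes every sequence in the file gets the same taxID.

```shell
# Hypothetical helper: append "|kraken:taxid|TAXID" to the sequence ID
# (the first word of each header line) of a fasta stream on stdin.
tag_fasta_headers() {
    awk -v taxid="$1" '
        # Header lines: "&" re-inserts the matched ">ID" before the tag.
        /^>/ { sub(/^>[^ \t]*/, "&|kraken:taxid|" taxid) }
        # All lines, tagged or not, are printed.
        { print }
    '
}

# Usage: tag_fasta_headers 1234567 < my_strain.fna > my_strain.tagged.fna
```

The rest of the header line (the description after the first space) is left as it was.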
$ kraken2-build --build --threads n --db kraken2_db
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [4.092s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 61765191680 bytes
Capacity estimation complete. [30m6.427s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 16 bits reserved for taxid.
Completed processing of 93901 sequences, 138676758167 bp
Writing data to disk... complete.
Database files completed. [7h48m55.558s]
Database construction complete. [Total: 8h19m6.133s]
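The hash table estimate from step 2 is easier to read in GiB. A one-line conversion, with `bytes_to_gib` as a made-up helper name:

```shell
# Hypothetical helper: convert a byte count into GiB with one decimal place.
bytes_to_gib() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f GiB\n", b / (1024 * 1024 * 1024) }'
}

bytes_to_gib 61765191680    # prints "57.5 GiB"
```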