run_dbCAN

The standalone version of the dbCAN3 annotation tool for automated CAZyme annotation


Important

This file type requires the parsomics-plugin-dbcan plugin

Important

So far, parsomics has only been tested with run_dbCAN v3 and v4. Compatibility with v5 or later is unknown and not guaranteed.

File naming

The file names must adhere to one of the following patterns:

  • "<MAG-name>.OUT.overview.txt"
  • "<MAG-name>.overview.txt"
  • "<MAG-name>_rundbcanoverview.txt"
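
As an illustration of the patterns above, the following minimal sketch extracts the MAG name from a file name. The function name and the regular expression are hypothetical and are not taken from the parsomics source code; only the three suffixes come from the list above.

```python
import re

# Assumption: illustrative regex built from the three documented suffixes.
# This is not the pattern parsomics itself uses.
_OVERVIEW_PATTERN = re.compile(
    r"^(?P<mag>.+?)"
    r"(?:\.OUT\.overview\.txt|\.overview\.txt|_rundbcanoverview\.txt)$"
)


def mag_name_from_filename(filename):
    """Return the MAG name if the file name matches a documented pattern, else None."""
    match = _OVERVIEW_PATTERN.match(filename)
    return match.group("mag") if match else None


if __name__ == "__main__":
    for name in ("bin1.OUT.overview.txt", "bin1.overview.txt", "bin1.txt"):
        print(name, mag_name_from_filename(name))
```

A file name that matches none of the three suffixes yields None, so it can be skipped or reported before any parsing is attempted.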

File format

tip

For more information on the run_dbCAN overview.txt output file format, check out the run_dbCAN documentation and source code.

The files must include a header (i.e. column names on the first line). The columns must appear in the exact order listed below:

Column name   Column obligatoriness   Data type   Data nullability
Gene ID       Mandatory               String      Not nullable
EC#           Optional                String      Nullable
HMMER         Optional                String      Nullable
dbCAN_sub     Optional                String      Nullable
DIAMOND       Optional                String      Nullable
eCAMI         Optional                String      Nullable
Signalp       Optional (ignored)      String      N/A
#ofTools      Optional (ignored)      Integer     N/A

A few things to keep in mind:

  • Despite its name, the Gene ID column should contain the name of the protein that the annotation refers to. Also, remember that primary and foreign keys in parsomics are named with "key" rather than "id", to avoid mixing up names and keys.

  • Different versions of run_dbCAN use different sources for the Enzyme Commission Number (EC#) column. From run_dbCAN 4.0.0 onwards, EC numbers are predicted using dbCAN_sub instead of eCAMI. parsomics is able to adapt to both cases.

  • There are so many optional columns because the run_dbCAN output format is not normalized. This is further explained below.
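
To make the points above concrete, here is a minimal sketch of how a header line could be inspected to find which annotation columns are present and which tool the EC# column should be attributed to. The function name, the tab delimiter, and the rule "dbCAN_sub if present, otherwise eCAMI" are assumptions made for illustration; they are not taken from the parsomics source code.

```python
# Column names come from the table above; everything else is hypothetical.
ANNOTATION_COLUMNS = ("HMMER", "dbCAN_sub", "DIAMOND", "eCAMI")


def inspect_header(header_line, delimiter="\t"):
    """Return (annotation columns present, assumed source of the EC# column)."""
    columns = header_line.rstrip("\r\n").split(delimiter)
    if "Gene ID" not in columns:
        raise ValueError("the mandatory 'Gene ID' column is missing")

    present = [name for name in columns if name in ANNOTATION_COLUMNS]

    # Assumption: EC numbers are attributed to dbCAN_sub when that column
    # exists (run_dbCAN >= 4.0.0) and to eCAMI otherwise.
    ec_source = None
    if "EC#" in columns:
        if "dbCAN_sub" in columns:
            ec_source = "dbCAN_sub"
        elif "eCAMI" in columns:
            ec_source = "eCAMI"
    return present, ec_source


if __name__ == "__main__":
    header = "Gene ID\tEC#\tHMMER\tdbCAN_sub\tDIAMOND\t#ofTools\n"
    print(inspect_header(header))
```

This sketch does not verify column order and simply disregards the ignored columns (Signalp, #ofTools).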

info

Normalization

The run_dbCAN overview.txt file is not normalized, because it doesn't have "one property per column, one observation per row".

In the example below, notice how each row contains multiple observations (e.g. the first row contains three annotations from three different sources) and how the property of annotation source is spread across multiple columns (i.e. HMMER, dbCAN_sub, DIAMOND).

Gene ID          EC#          HMMER           dbCAN_sub   DIAMOND   #ofTools
AIFGPLGP_01443   -            CE4(38-162)     CE4_e21     CE4       3
AIFGPLGP_01587   -            GT105(88-212)   GT105_e6    -         2
AIFGPLGP_00229   2.4.2.43:1   -               CBM48_e59   -         1

A more normalized representation of the same data would look like this:

Gene ID          Source name   Description     Annotation type
AIFGPLGP_01443   HMMER         CE4(38-162)     DOMAIN
AIFGPLGP_01443   dbCAN_sub     CE4_e21         DOMAIN
AIFGPLGP_01443   DIAMOND       CE4             DOMAIN
AIFGPLGP_01587   HMMER         GT105(88-212)   DOMAIN
AIFGPLGP_01587   dbCAN_sub     GT105_e6        DOMAIN
AIFGPLGP_00229   dbCAN_sub     2.4.2.43:1      EC_NUMBER
AIFGPLGP_00229   dbCAN_sub     CBM48_e59       DOMAIN
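
The reshaping from the wide table to the normalized one can be sketched as follows. This is a minimal illustration, not the parsomics implementation: the function name is hypothetical, "-" is assumed to mark an empty cell (as in the example above), and EC numbers are attributed to dbCAN_sub, which matches run_dbCAN 4.0.0 onwards.

```python
def melt_overview_rows(rows, ec_source="dbCAN_sub",
                       sources=("HMMER", "dbCAN_sub", "DIAMOND")):
    """Turn wide rows (dicts keyed by column name) into one record per annotation.

    Each record is (gene_id, source_name, description, annotation_type).
    """
    empty = ("-", "", None)  # assumption: these values mean "no annotation"
    records = []
    for row in rows:
        gene_id = row["Gene ID"]
        ec_number = row.get("EC#")
        if ec_number not in empty:
            records.append((gene_id, ec_source, ec_number, "EC_NUMBER"))
        for source in sources:
            description = row.get(source)
            if description not in empty:
                records.append((gene_id, source, description, "DOMAIN"))
    return records


if __name__ == "__main__":
    wide = [{"Gene ID": "AIFGPLGP_01443", "EC#": "-", "HMMER": "CE4(38-162)",
             "dbCAN_sub": "CE4_e21", "DIAMOND": "CE4"}]
    for record in melt_overview_rows(wide):
        print(record)
```

Cells that hold several annotations at once are not split in this sketch; each non-empty cell becomes exactly one record.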

That is still not fully normalized though, because some annotations include start and stop coordinates in their descriptions. For example, in "CE4(38-162)", 38 and 162 are the start and stop coordinates, respectively. Since these are distinct properties, they should be in their own columns, as shown below:

Gene ID          Source name   Description   Start coordinate   Stop coordinate   Annotation type
AIFGPLGP_01443   HMMER         CE4           38                 162               DOMAIN
AIFGPLGP_01443   dbCAN_sub     CE4_e21       N/A                N/A               DOMAIN
AIFGPLGP_01443   DIAMOND       CE4           N/A                N/A               DOMAIN
AIFGPLGP_01587   HMMER         GT105         88                 212               DOMAIN
AIFGPLGP_01587   dbCAN_sub     GT105_e6      N/A                N/A               DOMAIN
AIFGPLGP_00229   dbCAN_sub     2.4.2.43:1    N/A                N/A               EC_NUMBER
AIFGPLGP_00229   dbCAN_sub     CBM48_e59     N/A                N/A               DOMAIN
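
Splitting the coordinates out of a description, as in the table above, can be sketched with a regular expression. The function name and the pattern are hypothetical, and None stands in for N/A; this is an illustration under those assumptions rather than the parsomics implementation.

```python
import re

# Assumption: coordinates appear as a trailing "(start-stop)" group of digits.
_COORDINATES = re.compile(r"^(?P<name>.+)\((?P<start>\d+)-(?P<stop>\d+)\)$")


def split_coordinates(description):
    """Return (description, start, stop); start and stop are None when absent."""
    match = _COORDINATES.match(description)
    if match is None:
        return description, None, None
    return match.group("name"), int(match.group("start")), int(match.group("stop"))


if __name__ == "__main__":
    for text in ("CE4(38-162)", "CE4_e21", "2.4.2.43:1"):
        print(split_coordinates(text))
```

Descriptions without a trailing coordinate group, such as "CE4_e21" or an EC number, are returned unchanged.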

Mapping to database

ProteinAnnotationFile

Original data                      ProteinAnnotationFile field
run_dbCAN overview.txt file path   path

ProteinAnnotationEntry

Normalized data    ProteinAnnotationEntry field
Gene ID            protein_key [1]
Source name        source_key [2]
Description        description
Start coordinate   coord_start
Stop coordinate    coord_stop
Annotation type    annotation_type

Important

The run_dbCAN and CLEAN plugins treat Enzyme Commission Number (EC#) annotations differently. The former stores the EC# as an annotation description, while the latter stores it as an annotation accession. EC numbers are accession strings in the Expasy database.

Footnotes

  1. As previously stated, the "Gene ID" column in the run_dbCAN overview.txt file contains protein names, which are used to query the primary key of the corresponding proteins in the database.

  2. The "Source name" column of the normalized data is used to query the primary key of the corresponding sources in the database.