GTDB-Tk
A toolkit for assigning objective taxonomic classifications to prokaryote genomes
File naming
The file name must end with ".summary.tsv".
File format
For more information on the GTDB-Tk output format, visit the GTDB-Tk documentation.
Results Summary (.tsv)
The file must include a header row (i.e. the column names at the top). The column names themselves can be anything, because columns are matched by position rather than by name. The file must contain the following columns, in exactly this order:
Column name | Column obligatoriness | Data type | Data nullability |
---|---|---|---|
user_genome | Mandatory | String | Not nullable |
classification | Mandatory | String | Not nullable |
closest_genome_reference | Mandatory | String | Nullable |
closest_genome_reference_radius | Mandatory | Float | Nullable |
closest_genome_taxonomy | Mandatory (ignored) | N/A | N/A |
closest_genome_ani | Mandatory | Float | Nullable |
closest_genome_af | Mandatory | Float | Nullable |
closest_placement_reference | Mandatory | String | Nullable |
closest_placement_radius | Mandatory | Float | Nullable |
closest_placement_taxonomy | Mandatory (ignored) | N/A | N/A |
closest_placement_ani | Mandatory | Float | Nullable |
closest_placement_af | Mandatory | Float | Nullable |
pplacer_taxonomy | Mandatory (ignored) | N/A | N/A |
classification_method | Mandatory | String | Not nullable |
note | Mandatory | String | Nullable |
other_related_references | Mandatory (ignored) | N/A | N/A |
msa_percent | Mandatory (ignored) | N/A | N/A |
translation_table | Mandatory (ignored) | N/A | N/A |
red_value | Mandatory | Float | Nullable |
warnings | Mandatory | String | Nullable |
Why are there mandatory columns that are ignored?
That has to do with the way the GTDB-Tk file parser is written. When the file is read, it must comply with a pre-defined schema (column order and types), even though some of these columns end up being dropped later.
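To illustrate what a positional schema of this kind implies, here is a minimal sketch written with Python's standard `csv` module. It is not the actual parsomics parser: the names `SCHEMA`, `NOT_NULLABLE`, `NULL_TOKENS` and `read_summary` are hypothetical, and treating an empty cell or `N/A` as null is an assumption.

```python
import csv

# Positional schema taken from the table above: (column, converter).
# A converter of None marks a column that must be present but is dropped.
SCHEMA = [
    ("user_genome", str),
    ("classification", str),
    ("closest_genome_reference", str),
    ("closest_genome_reference_radius", float),
    ("closest_genome_taxonomy", None),
    ("closest_genome_ani", float),
    ("closest_genome_af", float),
    ("closest_placement_reference", str),
    ("closest_placement_radius", float),
    ("closest_placement_taxonomy", None),
    ("closest_placement_ani", float),
    ("closest_placement_af", float),
    ("pplacer_taxonomy", None),
    ("classification_method", str),
    ("note", str),
    ("other_related_references", None),
    ("msa_percent", None),
    ("translation_table", None),
    ("red_value", float),
    ("warnings", str),
]
NOT_NULLABLE = {"user_genome", "classification", "classification_method"}
NULL_TOKENS = {"", "N/A"}  # assumption: how null values appear in the file


def read_summary(handle):
    """Parse an open summary file into a list of dicts, one per genome."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # a header is required; its names are not checked
    if len(header) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} columns, got {len(header)}")
    rows = []
    for line_no, cells in enumerate(reader, start=2):
        if len(cells) != len(SCHEMA):
            raise ValueError(f"line {line_no}: wrong number of columns")
        row = {}
        for (name, convert), cell in zip(SCHEMA, cells):
            if convert is None:
                continue  # ignored column: validated by position, then dropped
            if cell in NULL_TOKENS:
                if name in NOT_NULLABLE:
                    raise ValueError(f"line {line_no}: {name} must not be null")
                row[name] = None
            else:
                row[name] = convert(cell)
        rows.append(row)
    return rows
```

Because the ignored columns still occupy a position in `SCHEMA`, removing one of them from the file would shift every later column and make the row fail validation.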
Mapping to database
GTDBTkTsvFile
Original data | GTDBTkTsvFile field | Notes |
---|---|---|
GTDB-Tk file path | path | |
GTDBTkTsvEntry
Original data | GTDBTkTsvEntry field |
---|---|
user_genome | genome_key 1 |
classification | domain, phylum, klass, order, family, genus, species 2 |
closest_genome_reference or closest_placement_reference column | reference 3 |
closest_genome_reference_radius or closest_placement_radius column | radius 3 |
closest_genome_ani or closest_placement_ani column | ani 3 |
closest_genome_af or closest_placement_af column | af 3 |
classification_method | classification_method |
note | note |
red_value | red_value |
warnings | warnings |
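As a rough illustration of how the classification column can be broken down into the seven rank fields, the sketch below splits a GTDB-style lineage string (ranks separated by `;` and prefixed with `d__`, `p__`, `c__`, `o__`, `f__`, `g__`, `s__`). The function name `split_classification` is hypothetical, and storing an unassigned rank (such as a bare `s__`) as `None` is an assumption, not necessarily what parsomics does.

```python
# GTDB rank prefixes mapped to the GTDBTkTsvEntry field names listed above.
RANK_PREFIXES = {
    "d__": "domain",
    "p__": "phylum",
    "c__": "klass",
    "o__": "order",
    "f__": "family",
    "g__": "genus",
    "s__": "species",
}


def split_classification(classification):
    """Split a GTDB lineage string into one value per taxonomic rank."""
    ranks = dict.fromkeys(RANK_PREFIXES.values())  # every rank starts as None
    for part in classification.split(";"):
        part = part.strip()
        prefix, name = part[:3], part[3:]
        # Skip unknown prefixes and ranks left empty (e.g. "s__").
        if prefix in RANK_PREFIXES and name:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks


lineage = (
    "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__"
)
fields = split_classification(lineage)
print(fields["genus"], fields["species"])
```

The field is named `klass` rather than `class` because `class` is a reserved word in Python.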
Footnotes
1. The MAG name in the GTDB-Tk file name is used to query the primary key of the corresponding genome in the database.
2. The classification column is broken down into multiple fields for better readability.
3. The closest_placement_* columns are only filled when the classification method used by GTDB-Tk is ANI screen. Otherwise, the closest_genome_* columns are filled. With that in mind, parsomics includes only the metrics relevant to each classification method.
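The collapse of the two column groups into the single reference, radius, ani and af fields could look roughly like the sketch below. It assumes that at most one group is filled per row and picks the group by checking which reference is present, rather than by inspecting classification_method; the real parsomics logic may differ. The function name `select_metrics` and the placeholder reference `REF_A` are hypothetical.

```python
# Each group: (reference, radius, ani, af) column names from the summary file.
# Note that the radius column is named differently in the two groups.
METRIC_GROUPS = [
    ("closest_genome_reference", "closest_genome_reference_radius",
     "closest_genome_ani", "closest_genome_af"),
    ("closest_placement_reference", "closest_placement_radius",
     "closest_placement_ani", "closest_placement_af"),
]


def select_metrics(row):
    """Return the metrics of whichever column group is filled in this row."""
    for reference, radius, ani, af in METRIC_GROUPS:
        if row.get(reference) is not None:
            return {
                "reference": row[reference],
                "radius": row.get(radius),
                "ani": row.get(ani),
                "af": row.get(af),
            }
    # Neither group is filled: every metric stays null.
    return {"reference": None, "radius": None, "ani": None, "af": None}


row = {
    "closest_genome_reference": None,
    "closest_placement_reference": "REF_A",  # placeholder, not a real accession
    "closest_placement_radius": 95.0,
    "closest_placement_ani": 98.1,
    "closest_placement_af": 0.92,
}
print(select_metrics(row))
```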