
GTDB-Tk

A toolkit for assigning objective taxonomic classifications to prokaryote genomes


File naming

The file name must end with ".summary.tsv".

File format

Tip: For more information on the GTDB-Tk output format, see the GTDB-Tk documentation.

Results Summary (.tsv)

The file must include a header row (i.e. the column names at the top). Columns are identified by position, so the header names themselves can be anything, as long as the columns appear in exactly the following order:

| Column name | Obligatoriness | Data type | Nullability |
| --- | --- | --- | --- |
| user_genome | Mandatory | String | Not nullable |
| classification | Mandatory | String | Not nullable |
| closest_genome_reference | Mandatory | String | Nullable |
| closest_genome_reference_radius | Mandatory | Float | Nullable |
| closest_genome_taxonomy | Mandatory (ignored) | N/A | N/A |
| closest_genome_ani | Mandatory | Float | Nullable |
| closest_genome_af | Mandatory | Float | Nullable |
| closest_placement_reference | Mandatory | String | Nullable |
| closest_placement_radius | Mandatory | Float | Nullable |
| closest_placement_taxonomy | Mandatory (ignored) | N/A | N/A |
| closest_placement_ani | Mandatory | Float | Nullable |
| closest_placement_af | Mandatory | Float | Nullable |
| pplacer_taxonomy | Mandatory (ignored) | N/A | N/A |
| classification_method | Mandatory | String | Not nullable |
| note | Mandatory | String | Nullable |
| other_related_references | Mandatory (ignored) | N/A | N/A |
| msa_percent | Mandatory (ignored) | N/A | N/A |
| translation_table | Mandatory (ignored) | N/A | N/A |
| red_value | Mandatory | Float | Nullable |
| warnings | Mandatory | String | Nullable |
info

Why are there mandatory columns that are ignored?

This is a consequence of how the GTDB-Tk file parser is written. When the file is read, it must comply with a predefined schema (column order and types), even though some of those columns are dropped afterwards.
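To make the schema requirement concrete, here is a minimal sketch of a positional reader for a results summary. It is not the actual parser: the names `SCHEMA`, `NULLS` and `parse_summary` are hypothetical, and treating an empty string or `N/A` as a null value is an assumption about how missing values are written.

```python
import csv
import io

# Assumption: these strings mark a missing value in the summary file.
NULLS = {"", "N/A"}

# (column name, type or None if the column is ignored, nullable)
SCHEMA = [
    ("user_genome", str, False),
    ("classification", str, False),
    ("closest_genome_reference", str, True),
    ("closest_genome_reference_radius", float, True),
    ("closest_genome_taxonomy", None, True),
    ("closest_genome_ani", float, True),
    ("closest_genome_af", float, True),
    ("closest_placement_reference", str, True),
    ("closest_placement_radius", float, True),
    ("closest_placement_taxonomy", None, True),
    ("closest_placement_ani", float, True),
    ("closest_placement_af", float, True),
    ("pplacer_taxonomy", None, True),
    ("classification_method", str, False),
    ("note", str, True),
    ("other_related_references", None, True),
    ("msa_percent", None, True),
    ("translation_table", None, True),
    ("red_value", float, True),
    ("warnings", str, True),
]


def parse_summary(text):
    """Parse summary TSV text by column position, dropping ignored columns."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)  # header names are not checked, only their count
    if len(header) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} columns, got {len(header)}")
    rows = []
    for line_no, values in enumerate(reader, start=2):
        if len(values) != len(SCHEMA):
            raise ValueError(f"line {line_no}: wrong number of columns")
        row = {}
        for (name, kind, nullable), raw in zip(SCHEMA, values):
            if kind is None:
                continue  # mandatory in the file, but ignored afterwards
            if raw in NULLS:
                if not nullable:
                    raise ValueError(f"line {line_no}: {name} must not be null")
                row[name] = None
            else:
                row[name] = kind(raw)
        rows.append(row)
    return rows
```

Because the columns are read by position, a file with renamed headers still parses, while a file with a missing or reordered column fails or yields wrongly typed values.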

Mapping to database

GTDBTkTsvFile

| Original data | GTDBTkTsvFile field | Notes |
| --- | --- | --- |
| GTDB-Tk file path | path | |

GTDBTkTsvEntry

| Original data | GTDBTkTsvEntry field |
| --- | --- |
| user_genome | genome_key [1] |
| classification | domain, phylum, klass, order, family, genus, species [2] |
| closest_genome_reference or closest_placement_reference column | reference [3] |
| closest_genome_reference_radius or closest_placement_radius column | radius [3] |
| closest_genome_ani or closest_placement_ani column | ani [3] |
| closest_genome_af or closest_placement_af column | af [3] |
| classification_method | classification_method |
| note | note |
| red_value | red_value |
| warnings | warnings |

Footnotes

  1. The MAG name in the GTDB-Tk file name is used to query the primary key of the corresponding genome in the database.

  2. The classification column is broken down into multiple fields for better readability.

  3. The closest_placement_* columns are only filled when the classification method used by GTDB-Tk is ANI screen. Otherwise, the closest_genome_* columns are filled. With that in mind, parsomics stores only the metrics relevant to each classification method.
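Footnotes 2 and 3 can be illustrated with a short sketch. It is not the parsomics implementation: `split_classification` and `select_metrics` are hypothetical helpers. The first assumes the usual GTDB rank prefixes (`d__`, `p__`, `c__`, `o__`, `f__`, `g__`, `s__`) separated by semicolons. The second picks a column set by checking which reference is non-null, an assumed stand-in for checking the classification method directly.

```python
# Maps GTDB rank prefixes to the GTDBTkTsvEntry field names listed above.
RANK_FIELDS = {
    "d": "domain",
    "p": "phylum",
    "c": "klass",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def split_classification(classification):
    """Split a 'd__...;p__...;...' string into one field per rank."""
    fields = dict.fromkeys(RANK_FIELDS.values())  # every rank starts as None
    for part in classification.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep and prefix in RANK_FIELDS:
            fields[RANK_FIELDS[prefix]] = name or None  # empty rank -> None
    return fields


def select_metrics(row):
    """Collapse the two closest_* column sets into reference/radius/ani/af.

    Assumption: only the set relevant to the classification method is
    filled, so a non-null placement reference selects the placement set.
    """
    if row.get("closest_placement_reference") is not None:
        source, radius_key = "closest_placement", "closest_placement_radius"
    else:
        source, radius_key = "closest_genome", "closest_genome_reference_radius"
    return {
        "reference": row.get(f"{source}_reference"),
        "radius": row.get(radius_key),
        "ani": row.get(f"{source}_ani"),
        "af": row.get(f"{source}_af"),
    }
```

Note that the radius column is named differently in the two sets (`closest_genome_reference_radius` versus `closest_placement_radius`), which is why the sketch looks it up separately.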