prokka

Rapid prokaryotic genome annotation


File naming

The prokka format requires four files per MAG:

  • <MAG-name>.fna
  • <MAG-name>.ffn
  • <MAG-name>.faa
  • <MAG-name>.gff
info

FASTA files must have unambiguous file extensions that indicate what kind of sequence they hold. The accepted extensions are .fna, .ffn, and .faa, for contig, gene, and protein sequences, respectively.
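
As a rough illustration of the naming rule, the sketch below checks a list of file names for the four extensions a MAG needs and maps each FASTA extension to the kind of sequence it holds. The function names and the `FASTA_KINDS` table are hypothetical; they are not part of parsomics or prokka.

```python
from pathlib import PurePath

# Extension-to-sequence-kind mapping, taken from the text above.
FASTA_KINDS = {".fna": "contig", ".ffn": "gene", ".faa": "protein"}

# The four files expected for every MAG.
REQUIRED_EXTENSIONS = (".fna", ".ffn", ".faa", ".gff")


def sequence_kind(filename):
    """Return the kind of sequence a FASTA file holds, or None if unknown."""
    return FASTA_KINDS.get(PurePath(filename).suffix)


def missing_extensions(filenames, mag_name):
    """Return the required extensions for which this MAG has no file."""
    present = {
        PurePath(name).suffix
        for name in filenames
        if PurePath(name).stem == mag_name
    }
    return [ext for ext in REQUIRED_EXTENSIONS if ext not in present]
```

For example, `missing_extensions(["bin1.fna", "bin1.faa"], "bin1")` reports that the `.ffn` and `.gff` files are still missing.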

File format

tip

For more information on the prokka output files, visit the prokka repository.

FASTA of contigs (.fna)

The file must follow the FASTA format standard, with one additional requirement for the header line: it must start with the contig name, optionally followed by a space and a description.

FASTA of genes (.ffn)

The file must follow the FASTA format standard, with one additional requirement for the header line: it must start with the gene name, optionally followed by a space and a description.

FASTA of proteins (.faa)

The file must follow the FASTA format standard, with one additional requirement for the header line: it must start with the protein name, optionally followed by a space and a description.
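
The header rule is the same for all three FASTA types, so one small parser can illustrate it. The sketch below is a minimal example and not the parsomics implementation; `parse_fasta` is a hypothetical name, and splitting on any whitespace (rather than only a single space) is an assumption.

```python
def parse_fasta(text):
    """Yield (name, description, sequence) for each record in FASTA text."""
    name, description, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if name is not None:
                yield name, description, "".join(chunks)
            # The name runs up to the first whitespace; the rest is the description.
            header = line[1:].split(maxsplit=1)
            if not header:
                raise ValueError("FASTA header line has no sequence name")
            name = header[0]
            description = header[1] if len(header) > 1 else None
            chunks = []
        else:
            # Sequence lines are concatenated; lines before any header are ignored.
            if name is not None:
                chunks.append(line)
    if name is not None:
        yield name, description, "".join(chunks)
```

Each yielded tuple lines up with the sequence name, description, and sequence of one FASTA entry.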

GFF (.gff or .gff3)

The file must follow the General Feature Format (GFF). It must have columns representing the following data, in that order and without a header:

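| Column name | Column obligatoriness | Data type | Data nullability |
|-------------|-----------------------|-----------|------------------|
| seqid       | Mandatory             | String    | Not nullable     |
| source      | Mandatory             | String    | Nullable         |
| type        | Mandatory             | String    | Not nullable     |
| start       | Mandatory             | Integer   | Not nullable     |
| end         | Mandatory             | Integer   | Not nullable     |
| score       | Mandatory             | Float     | Nullable         |
| strand      | Mandatory             | String    | Nullable         |
| phase       | Mandatory             | Integer   | Nullable         |
| attributes  | Mandatory             | String    | Nullable         |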

GFF entries for gene fragments (e.g. CDS, exon) must include either locus_tag or ID in their attributes column. This is what parsomics uses to link GFF entries to genes.
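
To make the column rules concrete, here is a minimal sketch that splits one tab-separated GFF line into the nine columns and applies the data types and nullability listed above. It assumes that a null value is written as `.` (the GFF3 convention) and that `locus_tag` is preferred over `ID` when both are present; the source does not state which takes precedence. All function names are hypothetical.

```python
def _nullable(value, cast=str):
    """Convert a GFF field, treating '.' as a missing value."""
    return None if value == "." else cast(value)


def parse_gff_line(line):
    """Split one GFF line into its nine typed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqid, source, type_, start, end, score, strand, phase, attributes = fields
    return {
        "seqid": seqid,
        "source": _nullable(source),
        "type": type_,
        "start": int(start),
        "end": int(end),
        "score": _nullable(score, float),
        "strand": _nullable(strand),
        "phase": _nullable(phase, int),
        "attributes": _nullable(attributes),
    }


def gene_identifier(attributes):
    """Return locus_tag if present, else ID, else None (precedence is assumed)."""
    if attributes is None:
        return None
    pairs = dict(
        item.split("=", 1) for item in attributes.split(";") if "=" in item
    )
    return pairs.get("locus_tag") or pairs.get("ID")
```

A gene-fragment entry for which `gene_identifier` returns `None` would not satisfy the requirement above.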

Mapping to database

FASTAFile

| Original data        | FASTAFile field   |
|----------------------|-------------------|
| FASTA file path      | path              |
| FASTA file extension | sequence_type [1] |
| FASTA file name      | genome_key [2]    |

FASTAEntry

| Original data           | FASTAEntry field  |
|-------------------------|-------------------|
| FASTA entry ID          | sequence_name [3] |
| FASTA entry Description | description       |
| FASTA entry Sequence    | sequence          |

GFFFile

| Original data | GFFFile field  |
|---------------|----------------|
| GFF file path | path           |
| GFF file name | genome_key [4] |

GFFEntry

| Original data               | GFFEntry field |
|-----------------------------|----------------|
| GFF entry seqid column      | sequence_name  |
| GFF entry source column     | source_name    |
| GFF entry type column       | fragment_type  |
| GFF entry start column      | coord_start    |
| GFF entry end column        | coord_stop     |
| GFF entry score column      | score          |
| GFF entry strand column     | strand         |
| GFF entry phase column      | phase          |
| GFF entry attributes column | attributes [5] |

Footnotes

  1. .fna for SequenceType.CONTIG ("CONTIG"), .ffn for SequenceType.GENE ("GENE"), .faa for SequenceType.PROTEIN ("PROTEIN").

  2. The MAG name in the FASTA file name is used to query the primary key of the corresponding genome in the database.

  3. The "ID" refers to the sequence name, not the primary key! To avoid confusion, primary keys in parsomics are named key, not id.

  4. The MAG name in the GFF file name is used to query the primary key of the corresponding genome in the database.

  5. For easier access to data, this column is converted from a string to a JSONB property in the database.
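
The conversion mentioned in footnote 5 could look roughly like the sketch below, which turns a GFF attributes string into a JSON object suitable for a JSONB column. How parsomics actually performs this step is not shown here; `attributes_to_json` is a hypothetical name, and decoding percent-escapes and keeping every value as a plain string are assumptions.

```python
import json
from urllib.parse import unquote


def attributes_to_json(attributes):
    """Convert 'key=value;key=value' into a JSON object string, or None."""
    if attributes is None:
        return None
    result = {}
    for item in attributes.strip().split(";"):
        key, sep, value = item.partition("=")
        if not sep:
            # Skip empty or malformed items that have no '='.
            continue
        # GFF3 percent-encodes reserved characters such as ';' and '='.
        result[unquote(key.strip())] = unquote(value.strip())
    return json.dumps(result, sort_keys=True)
```

Once stored as JSON, individual attributes such as `locus_tag` can be queried by key instead of by string matching.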