prokka
Rapid prokaryotic genome annotation
File naming
The prokka format requires four files per MAG:
<MAG-name>.fna
<MAG-name>.ffn
<MAG-name>.faa
<MAG-name>.gff
FASTA files must have unambiguous file extensions that indicate what kind of
sequence they hold. The accepted extensions are .fna, .ffn, and .faa for
contig, gene, and protein sequences, respectively.
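The naming rule can be sketched in Python as follows. This is only an illustration of the convention described here, not code from parsomics: the function name classify_fasta and the lowercase kind labels are hypothetical, and the sketch assumes the MAG name is everything before the final extension.

```python
from pathlib import Path

# Extension-to-sequence-kind table, taken from the naming rule above.
# The lowercase labels are illustrative, not parsomics identifiers.
SEQUENCE_KIND_BY_EXTENSION = {
    ".fna": "contig",
    ".ffn": "gene",
    ".faa": "protein",
}


def classify_fasta(path):
    """Return (mag_name, sequence_kind) for a prokka FASTA file path.

    Assumption: the MAG name is the file name without its final extension.
    """
    p = Path(path)
    kind = SEQUENCE_KIND_BY_EXTENSION.get(p.suffix)
    if kind is None:
        raise ValueError(f"unrecognized FASTA extension: {p.suffix!r}")
    return p.stem, kind
```

For example, classify_fasta("results/bin.1.faa") yields the MAG name "bin.1" and the kind "protein", while an ambiguous extension such as .fasta is rejected.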
File format
For more information on the prokka output files, visit the prokka repository.
FASTA of contigs (.fna)
The file must follow the FASTA format standard, with an additional requirement for the heading: it must start with the contig name, then (optionally) a space followed by a description.
FASTA of genes (.ffn)
The file must follow the FASTA format standard, with an additional requirement for the heading: it must start with the gene name, then (optionally) a space followed by a description.
FASTA of proteins (.faa)
The file must follow the FASTA format standard, with an additional requirement for the heading: it must start with the protein name, then (optionally) a space followed by a description.
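The heading requirement shared by the three FASTA files can be illustrated with a small Python sketch. The function name split_fasta_header is hypothetical, and returning None for a missing description is an assumption made for this example rather than documented parsomics behavior.

```python
def split_fasta_header(line):
    """Split a FASTA header line into (name, description).

    The name is everything between '>' and the first space; the
    description is whatever follows, or None when it is absent.
    """
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    name, _, description = line[1:].rstrip("\n").partition(" ")
    if not name:
        raise ValueError("the header must start with a sequence name")
    return name, description.strip() or None
```

A header such as ">ABC_00001 hypothetical protein" splits into the name "ABC_00001" and the description "hypothetical protein"; a header with a space right after ">" is rejected because it does not start with a name.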
GFF (.gff or .gff3)
The file must follow the General Feature Format (GFF). It must have columns representing the following data, in this order and without a header:
Column name | Column obligatoriness | Data type | Data nullability |
---|---|---|---|
seqid | Mandatory | String | Not nullable |
source | Mandatory | String | Nullable |
type | Mandatory | String | Not nullable |
start | Mandatory | Integer | Not nullable |
end | Mandatory | Integer | Not nullable |
score | Mandatory | Float | Nullable |
strand | Mandatory | String | Nullable |
phase | Mandatory | Integer | Nullable |
attributes | Mandatory | String | Nullable |
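As a rough sketch of how the table above translates into typed values, the following Python reads one tab-separated GFF line. It assumes that a nullable column is written as "." (the GFF3 placeholder for an undefined value); the GffRecord class and parse_gff_line function are hypothetical names for illustration, not part of parsomics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GffRecord:
    seqid: str
    source: Optional[str]
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: Optional[str]
    phase: Optional[int]
    attributes: Optional[str]


def _nullable(value, cast=str):
    # Assumption: "." marks an undefined (null) value, as in GFF3.
    return None if value == "." else cast(value)


def parse_gff_line(line):
    """Parse one data line of a GFF file into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, type_, start, end, score, strand, phase, attributes = fields
    if seqid == "." or type_ == ".":
        raise ValueError("seqid and type are not nullable")
    return GffRecord(
        seqid=seqid,
        source=_nullable(source),
        type=type_,
        start=int(start),  # int() raises ValueError if start is "."
        end=int(end),
        score=_nullable(score, float),
        strand=_nullable(strand),
        phase=_nullable(phase, int),
        attributes=_nullable(attributes),
    )
```

Comment lines and directives (lines starting with "#") are not handled here; a caller would need to skip them before calling parse_gff_line.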
GFF entries for gene fragments (e.g. CDS, exon) must include either locus_tag or ID in their attributes column. parsomics uses this value to link GFF entries to genes.
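The lookup of locus_tag or ID can be sketched in Python as below. The function names are hypothetical, and the order of preference (locus_tag first, then ID) is an assumption made for this example: the text only states that one of the two must be present.

```python
from urllib.parse import unquote


def parse_attributes(attributes):
    """Turn 'key=value;key=value' into a dict, decoding %XX escapes."""
    parsed = {}
    for pair in attributes.strip().split(";"):
        key, sep, value = pair.partition("=")
        if not sep:
            continue  # skip empty or malformed pairs
        parsed[unquote(key)] = unquote(value)
    return parsed


def gene_name(attributes):
    """Return the value used to link a GFF entry to a gene."""
    parsed = parse_attributes(attributes)
    # Assumption: locus_tag is tried before ID.
    for key in ("locus_tag", "ID"):
        if key in parsed:
            return parsed[key]
    raise ValueError("attributes must contain locus_tag or ID")
```

An entry whose attributes column is "ID=ABC_00001;product=hypothetical protein" resolves to "ABC_00001", whereas an entry with neither key raises an error.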
Mapping to database
FASTAFile
Original data | FASTAFile field |
---|---|
FASTA file path | path |
FASTA file extension | sequence_type [1] |
FASTA file name | genome_key [2] |
FASTAEntry
Original data | FASTAEntry field |
---|---|
FASTA entry ID | sequence_name [3] |
FASTA entry Description | description |
FASTA entry Sequence | sequence |
GFFFile
Original data | GFFFile field |
---|---|
GFF file path | path |
GFF file name | genome_key [4] |
GFFEntry
Original data | GFFEntry field |
---|---|
GFF entry seqid column | sequence_name |
GFF entry source column | source_name |
GFF entry type column | fragment_type |
GFF entry start column | coord_start |
GFF entry end column | coord_stop |
GFF entry score column | score |
GFF entry strand column | strand |
GFF entry phase column | phase |
GFF entry attributes column | attributes [5] |
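To make the GFFEntry mapping concrete, here is a minimal Python sketch that renames GFF columns to the field names in the table above and turns the attributes string into a JSON-serializable dict. It is an illustration only: the function to_gff_entry_fields is hypothetical, and how parsomics actually builds the JSONB value is not specified here.

```python
# Column-to-field names copied from the GFFEntry table above.
GFF_COLUMN_TO_FIELD = {
    "seqid": "sequence_name",
    "source": "source_name",
    "type": "fragment_type",
    "start": "coord_start",
    "end": "coord_stop",
    "score": "score",
    "strand": "strand",
    "phase": "phase",
    "attributes": "attributes",
}


def to_gff_entry_fields(columns):
    """Rename GFF columns to GFFEntry field names.

    The attributes string is split into a dict, which is an assumed
    stand-in for the JSONB value stored in the database.
    """
    entry = {GFF_COLUMN_TO_FIELD[name]: value for name, value in columns.items()}
    raw = entry.get("attributes")
    if raw is not None:
        pairs = (p.partition("=") for p in raw.split(";") if "=" in p)
        entry["attributes"] = {key: value for key, _, value in pairs}
    return entry
```

The resulting dict can be passed to json.dumps when a JSON string is needed.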
Footnotes
1. .fna for SequenceType.CONTIG ("CONTIG"), .ffn for SequenceType.GENE ("GENE"), .faa for FragmentType.PROTEIN ("PROTEIN").
2. The MAG name in the FASTA file name is used to query the primary key of the corresponding genome in the database.
3. The "ID" refers to the sequence name, not the primary key! To avoid confusion, primary keys in parsomics are named key, not id.
4. The MAG name in the GFF file name is used to query the primary key of the corresponding genome in the database.
5. For easier access to data, this column is converted from a string to a JSONB property in the database.