InterProScan
Genome-scale protein function classification using the InterPro database
This file type requires the parsomics-plugin-interpro plugin.
As of now, parsomics has only been tested with InterProScan v5. Compatibility with v6 or later is unknown and not guaranteed.
File naming
The file names must adhere to one of the following patterns:
- "<MAG-name>_interpro_out.tsv"
- "<MAG-name>_interpro.tsv"
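As an illustration, the naming rule can be checked with a regular expression. The sketch below is not part of parsomics: the function name `mag_name_from_filename` is hypothetical, and it assumes the MAG name is whatever non-empty text precedes the suffix.

```python
import re
from pathlib import Path

# Assumption: the MAG name is any non-empty prefix before the suffix.
_INTERPRO_FILENAME = re.compile(r"^(?P<mag>.+)_interpro(?:_out)?\.tsv$")


def mag_name_from_filename(path):
    """Return the MAG name if the file name matches a pattern, else None."""
    match = _INTERPRO_FILENAME.match(Path(path).name)
    return match.group("mag") if match else None
```

Only the final path component is inspected, so the directory a file lives in does not affect the check.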
File format
For more information on the InterProScan TSV output file format, visit the InterProScan documentation.
The file must NOT include a header (i.e. it should not include column names at the top). It must have the following columns, in that exact order:
Column | Requirement | Data type | Nullability |
---|---|---|---|
Protein name | Mandatory | String | Not nullable |
Sequence MD5 digest | Mandatory (ignored) | N/A | N/A |
Sequence length | Mandatory (ignored) | N/A | N/A |
Source name | Mandatory | String | Nullable |
Annotation accession | Mandatory | String | Nullable |
Annotation description | Mandatory | String | Nullable |
Start location | Mandatory | Integer | Nullable |
Stop location | Mandatory | Integer | Nullable |
Score (e-value) | Mandatory | String | Nullable |
Status | Mandatory | String | Nullable |
Date | Mandatory (ignored) | N/A | N/A |
InterPro annotations accession | Mandatory | String | Nullable |
InterPro annotations description | Mandatory | String | Nullable |
A few things to keep in mind:
- The "Score" column is a string that represents a number in scientific notation. For example, "3.1E-52" is equivalent to $3.1 \times 10^{-52}$.
- The "Status" column indicates whether an annotation was successful ("T") or unsuccessful ("F"). This column is not added to the database, but it is not ignored either: it is used to keep only the successful annotations and is then dropped.
- Most columns are nullable. This is by design, to make the `proteinannotationentry` table as flexible as possible. However, only the last two columns ("InterPro annotations accession" and "InterPro annotations description") typically have null values in InterProScan output files. All other columns generally do not contain null values.
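The first two points can be sketched in a few lines of Python. This is an illustration only, not the plugin's actual code: the helper names `parse_score` and `keep_successful` are hypothetical, and each row is assumed to be a dict with a `"status"` key chosen here for the example.

```python
def parse_score(raw):
    """Convert a score string such as "3.1E-52" to a float; None stays None."""
    return None if raw is None else float(raw)


def keep_successful(rows):
    """Keep rows whose status is "T", then drop the status field."""
    kept = []
    for row in rows:
        if row.get("status") == "T":
            kept.append({k: v for k, v in row.items() if k != "status"})
    return kept
```

Python's `float` accepts scientific notation directly, so no custom parsing of the exponent is needed.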
Why are there mandatory columns that are ignored?
This is a consequence of how the InterProScan file parser is written. When the file is read, it must comply with a predefined schema (column order and types), even though some of these columns are dropped afterwards.
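One way to picture this is a reader that assigns fixed column names by position and then discards the ignored ones. The sketch below uses Python's standard `csv` module and is not the parser's actual implementation: the column identifiers and the function `read_interpro_tsv` are assumptions made for illustration, and values are left as raw strings.

```python
import csv

# Illustrative labels for the 13 column positions listed in the table above.
COLUMNS = [
    "protein_name", "sequence_md5", "sequence_length", "source_name",
    "accession", "description", "start", "stop", "score", "status",
    "date", "interpro_accession", "interpro_description",
]
IGNORED = {"sequence_md5", "sequence_length", "date"}


def read_interpro_tsv(lines):
    """Yield one dict per row, named by position, without ignored columns."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, fields in enumerate(reader, start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) < len(COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(COLUMNS)} columns, "
                f"got {len(fields)}"
            )
        row = dict(zip(COLUMNS, fields))  # fields beyond the 13th are not read
        yield {k: v for k, v in row.items() if k not in IGNORED}
```

Because names are assigned by position, an ignored column still has to be present in the file: removing it would shift every later column into the wrong slot.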
Mapping to database
ProteinAnnotationFile
Original data | ProteinAnnotationFile field |
---|---|
InterProScan TSV file path | path |
ProteinAnnotationEntry
Original data | ProteinAnnotationEntry field |
---|---|
Protein name | protein_key [1] |
Source name | source_key [2] |
Annotation accession | accession |
Annotation description | description |
Start location | coord_start |
Stop location | coord_stop |
Score (e-value) | score [3] |
InterPro annotations accession | details["interpro_annotation_accession"] |
InterPro annotations description | details["interpro_annotation_description"] |
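The mapping above can be pictured as a plain function from one parsed row to the entry's fields. This is a hedged sketch, not the plugin's code: `to_entry_fields`, the row keys, and the two dictionaries standing in for database lookups are all hypothetical, and the row is assumed to contain no null values.

```python
def to_entry_fields(row, protein_keys, source_keys):
    """Build ProteinAnnotationEntry field values from one parsed row.

    protein_keys and source_keys are stand-ins for database queries that
    map a name to its primary key (hypothetical, for illustration only).
    """
    return {
        "protein_key": protein_keys[row["protein_name"]],
        "source_key": source_keys[row["source_name"]],
        "accession": row["accession"],
        "description": row["description"],
        "coord_start": int(row["start"]),
        "coord_stop": int(row["stop"]),
        "score": float(row["score"]),  # string converted to float
        "details": {
            "interpro_annotation_accession": row["interpro_accession"],
            "interpro_annotation_description": row["interpro_description"],
        },
    }
```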
Footnotes
1. The protein name in the InterProScan TSV file is used to query the primary key of the corresponding protein in the database.
2. The source name in the InterProScan TSV file is used to query the primary key of the corresponding source in the database.
3. The score is converted from string to float before being entered into the database.