InterProScan

Genome-scale protein function classification using the InterPro database

Important

This file type requires the parsomics-plugin-interpro plugin

Important

So far, parsomics has only been tested with InterProScan v5. Compatibility with v6 or later is unknown and not guaranteed.

File naming

The file names must adhere to one of the following patterns:

`<MAG-name>_interpro_out.tsv`
`<MAG-name>_interpro.tsv`
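
As an illustration of the two naming patterns, here is a minimal sketch that checks a file name and extracts the MAG name from it. The regular expression and the helper `mag_name_from_filename` are hypothetical and are not part of parsomics; they only mirror the patterns listed above.

```python
import re
from typing import Optional

# Hypothetical pattern mirroring the two accepted naming schemes:
#   <MAG-name>_interpro_out.tsv
#   <MAG-name>_interpro.tsv
INTERPRO_FILE_RE = re.compile(r"^(?P<mag>.+)_interpro(?:_out)?\.tsv$")


def mag_name_from_filename(filename: str) -> Optional[str]:
    """Return the MAG name if the file name matches a pattern, else None."""
    match = INTERPRO_FILE_RE.match(filename)
    return match.group("mag") if match else None
```

For example, `mag_name_from_filename("bin_01_interpro_out.tsv")` and `mag_name_from_filename("bin_01_interpro.tsv")` both yield `"bin_01"`, while a name such as `"bin_01.tsv"` yields `None`.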

File format

Tip

For more information on the InterProScan TSV output file format, visit the InterProScan documentation.

The file must NOT include a header (i.e. it must not have column names at the top). It must have the following columns, in this exact order:

Column	Requirement	Data type	Nullability
Protein name	Mandatory	String	Not nullable
Sequence MD5 digest	Mandatory (ignored)	N/A	N/A
Sequence length	Mandatory (ignored)	N/A	N/A
Source name	Mandatory	String	Nullable
Annotation accession	Mandatory	String	Nullable
Annotation description	Mandatory	String	Nullable
Start location	Mandatory	Integer	Nullable
Stop location	Mandatory	Integer	Nullable
Score (e-value)	Mandatory	String	Nullable
Status	Mandatory	String	Nullable
Date	Mandatory (ignored)	N/A	N/A
InterPro annotations accession	Mandatory	String	Nullable
InterPro annotations description	Mandatory	String	Nullable

A few things to keep in mind:

- The "Score" column is a string that represents a number in scientific notation. For example, "3.1E-52" is equivalent to $3.1 \cdot 10^{-52}$.
- The "Status" column indicates whether an annotation was successful ("T") or unsuccessful ("F"). This column is not added to the database, but it is not ignored either: it is used to keep only successful annotations, and then it is dropped.
- Most columns are nullable. This is by design, to make the `proteinannotationentry` table as flexible as possible. In practice, only the last two columns ("InterPro annotations accession" and "InterPro annotations description") typically have null values in InterProScan output files. All other columns generally do not contain null values.
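
The points above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the parser used by the plugin: the column keys, the `parse_interpro_tsv` helper, and the sample rows are made up, and missing values are assumed to be written as `-`.

```python
import csv
import io

# Column keys in file order (hypothetical names chosen for this sketch).
COLUMNS = [
    "protein_name", "sequence_md5", "sequence_length", "source_name",
    "accession", "description", "start", "stop", "score", "status", "date",
    "interpro_accession", "interpro_description",
]
IGNORED = {"sequence_md5", "sequence_length", "date"}
NULL_MARKER = "-"  # assumption: missing values appear as "-"


def _convert(value, caster):
    """Apply caster unless the value is null."""
    return None if value is None else caster(value)


def parse_interpro_tsv(handle):
    """Read a header-less TSV, keep successful annotations, cast types."""
    records = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if len(fields) < len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        row = {
            name: (None if value == NULL_MARKER else value)
            for name, value in zip(COLUMNS, fields)
        }
        if row["status"] != "T":
            continue  # "Status" is used as a filter...
        for name in IGNORED | {"status"}:
            del row[name]  # ...and then dropped with the ignored columns
        row["start"] = _convert(row["start"], int)
        row["stop"] = _convert(row["stop"], int)
        row["score"] = _convert(row["score"], float)  # "3.1E-52" -> 3.1e-52
        records.append(row)
    return records


# Made-up rows for illustration only.
SAMPLE = (
    "prot_1\tmd5_a\t250\tExampleDB\tACC0001\tExample domain\t10\t120\t3.1E-52\tT\t01-01-2024\tIPR_EXAMPLE\tExample entry\n"
    "prot_2\tmd5_b\t300\tExampleDB\tACC0002\tOther domain\t5\t80\t1.0E-5\tF\t01-01-2024\t-\t-\n"
    "prot_3\tmd5_c\t180\tExampleDB\tACC0003\tThird domain\t1\t60\t2.0E-10\tT\t01-01-2024\t-\t-\n"
)
records = parse_interpro_tsv(io.StringIO(SAMPLE))
```

In this sketch the row with status "F" is discarded, so `records` holds two entries, and the last two fields of the `prot_3` record are `None`.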

Info

Why are there mandatory columns that are ignored?

That has to do with the way the InterProScan file parser is written. When the file is read, it must comply with a pre-defined schema (column order and types), even though some of these columns end up being dropped later.

Mapping to database

`ProteinAnnotationFile`

Original data	`ProteinAnnotationFile` field
InterProScan TSV file path	`path`

`ProteinAnnotationEntry`

Original data	`ProteinAnnotationEntry` field
Protein name	`protein_key` ¹
Source name	`source_key` ²
Annotation accession	`accession`
Annotation description	`description`
Start location	`coord_start`
Stop location	`coord_stop`
Score (e-value)	`score` ³
InterPro annotations accession	`details["interpro_annotation_accession"]`
InterPro annotations description	`details["interpro_annotation_description"]`

Footnotes

1. The protein name in the InterProScan TSV file is used to query the primary key of the corresponding protein in the database.
2. The source name in the InterProScan TSV file is used to query the primary key of the corresponding source in the database.
3. The score is converted from string to float before being entered into the database.
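
The mapping in the table above could look roughly like the following. This is a sketch only: `ProteinAnnotationEntrySketch` and `to_entry` are hypothetical stand-ins rather than the actual parsomics models, and the two dictionaries stand in for the database queries that resolve protein and source names to primary keys.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProteinAnnotationEntrySketch:
    """Hypothetical stand-in for the ProteinAnnotationEntry fields."""
    protein_key: int
    source_key: int
    accession: Optional[str]
    description: Optional[str]
    coord_start: Optional[int]
    coord_stop: Optional[int]
    score: Optional[float]
    details: dict = field(default_factory=dict)


def to_entry(row, protein_keys, source_keys):
    """Map one parsed row to an entry.

    protein_keys and source_keys are plain dicts standing in for the
    database lookups of the corresponding primary keys.
    """
    raw_score = row["score"]
    return ProteinAnnotationEntrySketch(
        protein_key=protein_keys[row["protein_name"]],
        source_key=source_keys[row["source_name"]],
        accession=row["accession"],
        description=row["description"],
        coord_start=row["start"],
        coord_stop=row["stop"],
        # The score arrives as a string and is stored as a float.
        score=None if raw_score is None else float(raw_score),
        details={
            "interpro_annotation_accession": row["interpro_accession"],
            "interpro_annotation_description": row["interpro_description"],
        },
    )


# Made-up values for illustration only.
example_row = {
    "protein_name": "prot_1", "source_name": "ExampleDB",
    "accession": "ACC0001", "description": "Example domain",
    "start": 10, "stop": 120, "score": "3.1E-52",
    "interpro_accession": "IPR_EXAMPLE", "interpro_description": "Example entry",
}
entry = to_entry(example_row, {"prot_1": 7}, {"ExampleDB": 2})
```

Keeping the two InterPro columns in a `details` dictionary follows the table above, where they map to keys of `details` rather than to dedicated fields.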
