
InterProScan

Genome-scale protein function classification using the InterPro database


Important

This file type requires the parsomics-plugin-interpro plugin

Important

As of now, parsomics has only been tested with InterProScan v5. Compatibility with v6 or later is unknown and not guaranteed.

File naming

The file names must adhere to one of the following patterns:

  • "<MAG-name>_interpro_out.tsv"
  • "<MAG-name>_interpro.tsv"
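
As an illustration of the two patterns above, the following minimal sketch checks a file name and extracts the MAG name with a regular expression. The helper `mag_name_from_filename` is hypothetical and is not part of parsomics; how parsomics itself matches file names may differ.

```python
import re
from typing import Optional

# Hypothetical pattern for this sketch: any non-empty MAG name followed by
# "_interpro.tsv" or "_interpro_out.tsv".
_INTERPRO_NAME = re.compile(r"^(?P<mag>.+)_interpro(?:_out)?\.tsv$")


def mag_name_from_filename(filename: str) -> Optional[str]:
    """Return the MAG name if the file name follows a documented pattern, else None."""
    match = _INTERPRO_NAME.match(filename)
    return match.group("mag") if match else None


# Made-up file names, for illustration only.
print(mag_name_from_filename("bin_42_interpro_out.tsv"))
print(mag_name_from_filename("bin_42_interpro.tsv"))
print(mag_name_from_filename("bin_42.tsv"))
```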

File format

Tip

For more information on the InterProScan TSV output file format, visit the InterProScan documentation.

The file must NOT include a header (i.e. it should not include column names at the top). It must have the following columns, in that exact order:

| Column property | Column obligatoriness | Data type | Data nullability |
| --- | --- | --- | --- |
| Protein name | Mandatory | String | Not nullable |
| Sequence MD5 digest | Mandatory (ignored) | N/A | N/A |
| Sequence length | Mandatory (ignored) | N/A | N/A |
| Source name | Mandatory | String | Nullable |
| Annotation accession | Mandatory | String | Nullable |
| Annotation description | Mandatory | String | Nullable |
| Start location | Mandatory | Integer | Nullable |
| Stop location | Mandatory | Integer | Nullable |
| Score (e-value) | Mandatory | String | Nullable |
| Status | Mandatory | String | Nullable |
| Date | Mandatory (ignored) | N/A | N/A |
| InterPro annotations accession | Mandatory | String | Nullable |
| InterPro annotations description | Mandatory | String | Nullable |
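
To make the column layout concrete, here is a minimal sketch that reads a header-less, tab-separated file with Python's standard `csv` module, checks that every row has the 13 documented columns, and drops the ignored ones. The column labels, the `read_interpro_tsv` function, and the sample row are assumptions made for this example; this is not the parsomics parser.

```python
import csv
import io

# Labels chosen for this sketch only; the file itself must not contain a header.
COLUMNS = [
    "protein_name", "sequence_md5", "sequence_length", "source_name",
    "accession", "description", "start", "stop", "score", "status",
    "date", "interpro_accession", "interpro_description",
]
# Columns that are mandatory in the file but dropped after reading.
IGNORED = {"sequence_md5", "sequence_length", "date"}


def read_interpro_tsv(handle):
    """Yield one dict per row, validating the column count and dropping ignored columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, fields in enumerate(reader, start=1):
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(COLUMNS)} columns, got {len(fields)}"
            )
        yield {name: value for name, value in zip(COLUMNS, fields) if name not in IGNORED}


# Made-up row with placeholder values, for illustration only.
sample = "\t".join([
    "protein_1", "md5-placeholder", "120", "SourceX", "ACC0001",
    "Example domain", "5", "110", "3.1E-52", "T", "01-01-2024",
    "IPR-placeholder", "Example entry",
]) + "\n"

rows = list(read_interpro_tsv(io.StringIO(sample)))
print(rows[0]["protein_name"], rows[0]["score"])
```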

A few things to keep in mind:

  • The "Score" column is a string that actually represents a number in scientific notation. For example: "3.1E-52", which is equivalent to $3.1 \cdot 10^{-52}$.

  • The "Status" column indicates whether an annotation was successful ("T") or unsuccessful ("F"). This column isn't added to the database, but it isn't ignored either: it is used to keep only the successful annotations, and then it is dropped.

  • Most columns are nullable. This is by design, to make the proteinannotationentry table as flexible as possible. In practice, however, only the last two columns ("InterPro annotations accession" and "InterPro annotations description") typically have null values in InterProScan output files. All other columns generally do not contain null values.
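
The score and status handling described above can be sketched as follows. Both helpers are hypothetical, and treating an empty string or "-" as a null value is an assumption of this example, not something this page states.

```python
from typing import Optional


def parse_score(raw: Optional[str]) -> Optional[float]:
    """Convert a score string such as "3.1E-52" to a float.

    Assumption: an empty string or "-" stands for a null value.
    """
    if raw is None or raw.strip() in ("", "-"):
        return None
    return float(raw)


def keep_successful(rows):
    """Keep rows whose status is "T", then drop the status field."""
    return [
        {key: value for key, value in row.items() if key != "status"}
        for row in rows
        if row.get("status") == "T"
    ]


# Made-up rows, for illustration only.
rows = [
    {"accession": "A1", "score": "3.1E-52", "status": "T"},
    {"accession": "A2", "score": "-", "status": "F"},
]
kept = keep_successful(rows)
print(kept)
print(parse_score(kept[0]["score"]))
```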

Info

Why are there mandatory columns that are ignored?

That has to do with the way the InterProScan file parser is written. When the file is read, it must comply with a pre-defined schema (column order and types), even though some of these columns end up being dropped later.

Mapping to database

ProteinAnnotationFile

| Original data | ProteinAnnotationFile field |
| --- | --- |
| InterProScan TSV file path | path |

ProteinAnnotationEntry

| Original data | ProteinAnnotationEntry field |
| --- | --- |
| Protein name | protein_key [1] |
| Source name | source_key [2] |
| Annotation accession | accession |
| Annotation description | description |
| Start location | coord_start |
| Stop location | coord_stop |
| Score (e-value) | score [3] |
| InterPro annotations accession | details["interpro_annotation_accession"] |
| InterPro annotations description | details["interpro_annotation_description"] |
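
As a rough illustration of this mapping, the sketch below turns one parsed row into a dictionary keyed by the ProteinAnnotationEntry field names. The function `to_entry_fields`, the row labels, and the two lookup dictionaries (which stand in for the database queries that resolve primary keys) are hypothetical; null handling is left out for brevity.

```python
def to_entry_fields(row, protein_keys, source_keys):
    """Map one parsed row to ProteinAnnotationEntry field names.

    protein_keys and source_keys are plain dicts standing in for the
    database lookups of primary keys (an assumption of this sketch).
    """
    return {
        "protein_key": protein_keys[row["protein_name"]],
        "source_key": source_keys[row["source_name"]],
        "accession": row["accession"],
        "description": row["description"],
        "coord_start": int(row["start"]),
        "coord_stop": int(row["stop"]),
        "score": float(row["score"]),  # string to float conversion
        "details": {
            "interpro_annotation_accession": row["interpro_accession"],
            "interpro_annotation_description": row["interpro_description"],
        },
    }


# Made-up row and keys, for illustration only.
row = {
    "protein_name": "protein_1", "source_name": "SourceX",
    "accession": "ACC0001", "description": "Example domain",
    "start": "5", "stop": "110", "score": "3.1E-52",
    "interpro_accession": "IPR-placeholder",
    "interpro_description": "Example entry",
}
entry = to_entry_fields(row, protein_keys={"protein_1": 7}, source_keys={"SourceX": 2})
print(entry["protein_key"], entry["coord_start"], entry["score"])
```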

Footnotes

  1. The protein name in the InterProScan TSV file is used to query the primary key of the corresponding protein in the database.

  2. The source name in the InterProScan TSV file is used to query the primary key of the corresponding source in the database.

  3. The score is converted from string to float before being entered into the database.