Step 3: Edit validated_file.py

This file implements a class whose purpose is to distinguish files that should be processed (valid files) from files that should not be processed (invalid files).

Explanation

How validation works

In parsomics, files are validated/invalidated based on patterns in their paths. Usually, just checking the ending of file names is enough. For that, the parsomics-core library exposes the ValidatedFileWithGenome class, which you must subclass in your plugin.

Your ValidatedFileWithGenome subclass takes care of two things:

  1. File validation. It happens when we attempt to construct an object of that class: if the object is successfully constructed, the file is valid; otherwise, the file is invalid. Behind the scenes, this validation is powered by the type validation system provided by the pydantic library.

  2. MAG referencing. This class must also implement a helper method for extracting the MAG name from the file path. This will be useful later down the line, when we need to "link" the database entries for protein annotation files to the database entries of the MAGs that they refer to.
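
The "construction is validation" idea above can be sketched with the standard library alone. This is a minimal illustration, not the parsomics-core implementation: SketchValidatedFile, VALID_TERMINATIONS and is_valid are hypothetical names, and a plain ValueError stands in for the error that pydantic would raise.

```python
from pathlib import Path


class SketchValidatedFile:
    """Hypothetical stand-in for illustration; NOT the parsomics-core class."""

    # Assumed terminations, borrowed from the InterProScan example in this guide
    VALID_TERMINATIONS = ("_interpro_out.tsv", "_interpro.tsv")

    def __init__(self, path: str) -> None:
        # Validation happens at construction time: an invalid path
        # raises, so no object is ever created for it.
        if not Path(path).name.endswith(self.VALID_TERMINATIONS):
            raise ValueError(f"Not a valid file: {path}")
        self.path = path


def is_valid(path: str) -> bool:
    """A file is valid exactly when the object can be constructed."""
    try:
        SketchValidatedFile(path)
    except ValueError:
        return False
    return True
```

With this sketch, is_valid("out/MAG_001_interpro.tsv") is true, while is_valid("out/run.log") is false because construction fails.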

warning

parsomics assumes you have one MAG per protein annotation file, but not necessarily one protein annotation file per MAG. That means all protein annotation files must refer to a MAG, but not all MAGs must have a protein annotation file.
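
As a rough illustration of this assumption, the relationship can be pictured as a mapping from each annotation file to exactly one MAG. The MAG and file names below are made up for the example, and the dictionary is only a picture of the relationship, not how parsomics stores it.

```python
# Hypothetical MAGs known to the database
mags = ["MAG_001", "MAG_002", "MAG_003"]

# Every protein annotation file refers to exactly one MAG
file_to_mag = {
    "MAG_001_interpro.tsv": "MAG_001",
    "MAG_002_interpro.tsv": "MAG_002",
}

# All files must point at a known MAG...
all_files_linked = all(mag in mags for mag in file_to_mag.values())

# ...but a MAG is allowed to have no annotation file at all
annotated = set(file_to_mag.values())
mags_without_annotation = [mag for mag in mags if mag not in annotated]
```

Here all_files_linked is true, and MAG_003 is the only MAG without an annotation file, which is allowed.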

Hands on

  1. Remove the unneeded part of the template (i.e. the part that subclasses ValidatedFile instead of ValidatedFileWithGenome). Also remove comments and triple quotes. You will end up with something like this:

    from typing import ClassVar
    from pathlib import Path

    from parsomics_core.entities.files.validated_file import ValidatedFileWithGenome

    class InterproTsvValidatedFile(ValidatedFileWithGenome):
        _VALID_FILE_TERMINATIONS: ClassVar[list[str]] = [
            <"file termination", for example ".tsv">
        ]

        @property
        def genome_name(self) -> str:
            path_obj = Path(self.path)
            pass  # continue your implementation here
  2. Add the valid file terminations to the _VALID_FILE_TERMINATIONS class variable. For InterProScan, these are "_interpro_out.tsv" and "_interpro.tsv", so you should have this:

    _VALID_FILE_TERMINATIONS: ClassVar[list[str]] = [
        "_interpro_out.tsv",
        "_interpro.tsv",
    ]
    info

    These valid file terminations need to be enough to distinguish the files that should be processed from the files that shouldn't, among those that typically end up in the output directory of the tool you are adding support for.

    For example, if a tool writes both its most important output and its logs to ".txt" files, then the ".txt" termination is not enough to distinguish important files from unimportant ones.

  3. Implement the genome_name property method. Usually, if you correctly identified the _VALID_FILE_TERMINATIONS, extracting the MAG name is as simple as removing them from the files' names. That is indeed the case for our example:

    @property
    def genome_name(self) -> str:
        file_name: str = Path(self.path).name

        # Remove valid file terminations to extract genome name
        for termination in InterproTsvValidatedFile._VALID_FILE_TERMINATIONS:
            file_name = file_name.removesuffix(termination)

        genome_name = file_name
        # removesuffix never returns None, so check for an empty name instead
        if not genome_name:
            raise Exception(
                f"Failed at extracting genome name from interpro tsv file: {self.path}"
            )

        return genome_name
  4. Stage validated_file.py and commit it
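
The note in step 2 about choosing terminations can be checked quickly before committing. The sketch below compares a generic termination with the specific ones on a made-up directory listing; the file names and the select helper are assumptions for illustration, not part of parsomics.

```python
# Hypothetical contents of a tool's output directory
output_listing = [
    "MAG_001_interpro.tsv",
    "MAG_002_interpro_out.tsv",
    "interproscan.log",
    "summary.tsv",
]


def select(names: list[str], terminations: list[str]) -> list[str]:
    """Keep only the names that end with one of the terminations."""
    return [name for name in names if name.endswith(tuple(terminations))]


# Too generic: also picks up summary.tsv, which should not be processed
too_broad = select(output_listing, [".tsv"])

# Specific enough: only the protein annotation files remain
specific = select(output_listing, ["_interpro_out.tsv", "_interpro.tsv"])
```

If the specific terminations still select files that should be ignored, they are not yet distinctive enough for the tool's output directory.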

Result

validated_file.py
from pathlib import Path
from typing import ClassVar

from parsomics_core.entities.files.validated_file import ValidatedFileWithGenome


class InterproTsvValidatedFile(ValidatedFileWithGenome):
    _VALID_FILE_TERMINATIONS: ClassVar[list[str]] = [
        "_interpro_out.tsv",
        "_interpro.tsv",
    ]

    @property
    def genome_name(self) -> str:
        file_name: str = Path(self.path).name

        # Remove valid file terminations to extract genome name
        for termination in InterproTsvValidatedFile._VALID_FILE_TERMINATIONS:
            file_name = file_name.removesuffix(termination)

        genome_name = file_name
        # removesuffix never returns None, so check for an empty name instead
        if not genome_name:
            raise Exception(
                f"Failed at extracting genome name from interpro tsv file: {self.path}"
            )

        return genome_name
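
To sanity-check the extraction logic without installing parsomics-core, the same steps can be run as a standalone function. This is a sketch under assumptions: extract_genome_name and the example paths are hypothetical, and str.removesuffix requires Python 3.9 or newer.

```python
from pathlib import Path

# Same terminations as in validated_file.py
VALID_FILE_TERMINATIONS = ["_interpro_out.tsv", "_interpro.tsv"]


def extract_genome_name(path: str) -> str:
    """Standalone mirror of the genome_name property, for quick checks."""
    file_name = Path(path).name

    # Strip whichever valid termination the file name ends with
    for termination in VALID_FILE_TERMINATIONS:
        file_name = file_name.removesuffix(termination)

    # An empty result means the file name was nothing but a termination
    if not file_name:
        raise ValueError(f"Failed at extracting genome name from: {path}")

    return file_name


name_a = extract_genome_name("/data/annotations/MAG_001_interpro_out.tsv")
name_b = extract_genome_name("/data/annotations/bin.7_interpro.tsv")
```

Here name_a is "MAG_001" and name_b is "bin.7"; a file named just "_interpro.tsv" would raise, since no genome name is left after stripping.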