Step 3: Edit validated_file.py
This file implements a class whose purpose is to distinguish files that should be processed (valid files) from files that should not be processed (invalid files).
Explanation
How validation works
In parsomics, files are validated or invalidated based on patterns in their paths. Usually, just checking the ending of file names is enough. For that, the parsomics-core library exposes the ValidatedFileWithGenome class, which you must subclass in your plugin.
Your ValidatedFileWithGenome subclass takes care of two things:
-
File validation. It happens when we attempt to construct an object of that class: if the object is successfully constructed, the file is valid; otherwise, the file is invalid. Behind the scenes, this validation is powered by the excellent type validation system provided by the pydantic library.
-
MAG referencing. This class must also implement a helper method for extracting the MAG name from the file path. This will be useful later down the line, when we need to "link" the database entries for protein annotation files to the database entries of the MAGs that they refer to.
parsomics assumes you have one MAG per protein annotation file, but not necessarily one protein annotation file per MAG. That means all protein annotation files must refer to a MAG, but not all MAGs must have a protein annotation file.
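The construct-or-fail idea described above can be illustrated without parsomics-core or pydantic. The following is a minimal sketch using only the standard library. The SketchValidatedFile class and the partition_files helper are hypothetical names made up for this illustration, and the real ValidatedFileWithGenome relies on pydantic rather than on a dataclass __post_init__ hook.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class SketchValidatedFile:
    """Hypothetical stand-in for a validated file; not the parsomics-core class."""

    path: str

    # Assumed terminations, mirroring the InterProScan example used in this step
    _VALID_FILE_TERMINATIONS: ClassVar[tuple[str, ...]] = (
        "_interpro_out.tsv",
        "_interpro.tsv",
    )

    def __post_init__(self) -> None:
        # Construction fails for paths that do not end in a valid termination
        if not self.path.endswith(self._VALID_FILE_TERMINATIONS):
            raise ValueError(f"Invalid file: {self.path}")


def partition_files(paths: list[str]) -> tuple[list[str], list[str]]:
    """Split paths into (valid, invalid) by attempting construction."""
    valid: list[str] = []
    invalid: list[str] = []
    for path in paths:
        try:
            SketchValidatedFile(path)
            valid.append(path)
        except ValueError:
            invalid.append(path)
    return valid, invalid
```

In this sketch, a file counts as valid exactly when the constructor returns without raising, which is the same contract the subclass in this step is meant to provide.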
Hands on
-
Remove the unneeded part of the template (i.e. the part that subclasses ValidatedFile instead of ValidatedFileWithGenome). Also remove comments and triple quotes. You will end up with something like this:

    from typing import ClassVar
    from pathlib import Path

    from parsomics_core.entities.files.validated_file import ValidatedFileWithGenome


    class InterproTsvValidatedFile(ValidatedFileWithGenome):
        _VALID_FILE_TERMINATIONS: ClassVar[list[str]] = [
            <"file termination", for example ".tsv">
        ]

        @property
        def genome_name(self) -> str:
            path_obj = Path(self.path)
            pass  # continue your implementation here

-
Add the valid file terminations to the _VALID_FILE_TERMINATIONS class variable. For InterProScan, these are "_interpro_out.tsv" and "_interpro.tsv", so you should have this:

    _VALID_FILE_TERMINATIONS: ClassVar[list[str]] = [
        "_interpro_out.tsv",
        "_interpro.tsv",
    ]

Info: These valid file terminations must be enough to distinguish the files that should be processed from the files that shouldn't, among those that typically end up in the output directory of the tool you are adding support for.
For example, if a tool writes both its most important output and its logs to ".txt" files, then the ".txt" termination is not enough to distinguish important files from unimportant ones.
-
Implement the genome_name property method. Usually, if you correctly identified the _VALID_FILE_TERMINATIONS, extracting the MAG name is as simple as removing them from the file names. That is indeed the case for our example:

    @property
    def genome_name(self) -> str:
        file_name: str = Path(self.path).name

        # Remove valid file terminations to extract genome name
        for termination in InterproTsvValidatedFile._VALID_FILE_TERMINATIONS:
            file_name = file_name.removesuffix(termination)
        genome_name = file_name

        # An empty name means the file name consisted only of a termination
        if not genome_name:
            raise Exception(
                f"Failed at extracting genome name from interpro tsv file: {self.path}"
            )

        return genome_name

-
Stage validated_file.py and commit it.
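To check whether a set of terminations is specific enough, it can help to try them against a realistic listing of a tool's output directory. The sketch below is only an illustration: the matching_files helper and the file names in the listing are hypothetical, and are neither part of parsomics nor taken from real InterProScan output.

```python
def matching_files(file_names: list[str], terminations: list[str]) -> list[str]:
    """Return the file names that end with any of the given terminations."""
    # str.endswith accepts a tuple of candidate suffixes
    return [name for name in file_names if name.endswith(tuple(terminations))]


# Hypothetical output directory listing
listing = [
    "mag_001_interpro.tsv",
    "mag_002_interpro_out.tsv",
    "summary.tsv",
    "run.log",
]

# Too broad: ".tsv" also matches the unrelated summary file
broad = matching_files(listing, [".tsv"])

# Specific: matches only the two annotation files
specific = matching_files(listing, ["_interpro_out.tsv", "_interpro.tsv"])
```

If the broad and specific lists differ, as they do here, the shorter termination would let unwanted files through validation.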
Result
validated_file.py
    from pathlib import Path
    from typing import ClassVar

    from parsomics_core.entities.files.validated_file import ValidatedFileWithGenome


    class InterproTsvValidatedFile(ValidatedFileWithGenome):
        _VALID_FILE_TERMINATIONS: ClassVar[list[str]] = [
            "_interpro_out.tsv",
            "_interpro.tsv",
        ]

        @property
        def genome_name(self) -> str:
            file_name: str = Path(self.path).name

            # Remove valid file terminations to extract genome name
            for termination in InterproTsvValidatedFile._VALID_FILE_TERMINATIONS:
                file_name = file_name.removesuffix(termination)
            genome_name = file_name

            # An empty name means the file name consisted only of a termination
            if not genome_name:
                raise Exception(
                    f"Failed at extracting genome name from interpro tsv file: {self.path}"
                )

            return genome_name
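
The genome name logic can be tried out in isolation, without installing parsomics-core. The sketch below copies the suffix-stripping loop into a standalone function and maps each file to the MAG name it refers to. The names extract_genome_name and link_files_to_genomes, as well as the example paths, are hypothetical and exist only for this illustration.

```python
from pathlib import Path

# Same terminations as in the class above
VALID_FILE_TERMINATIONS = ["_interpro_out.tsv", "_interpro.tsv"]


def extract_genome_name(path: str) -> str:
    """Strip the known terminations from the file name to get the MAG name."""
    file_name = Path(path).name
    for termination in VALID_FILE_TERMINATIONS:
        file_name = file_name.removesuffix(termination)  # requires Python 3.9+
    if not file_name:
        raise ValueError(f"Failed at extracting genome name from: {path}")
    return file_name


def link_files_to_genomes(paths: list[str]) -> dict[str, str]:
    """Map each annotation file path to the MAG name extracted from it."""
    return {path: extract_genome_name(path) for path in paths}
```

For instance, under these assumptions a path such as "results/mag_001_interpro.tsv" resolves to the MAG name "mag_001", which is the value that would later be used to look up the matching MAG entry.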