Glossary

Software Development

Metapackage

A software package that bundles multiple other software packages. For example, the parsomics PyPI package bundles all packages of the project.

RDBMS

An RDBMS is a Relational Database Management System. The RDBMS used in the parsomics project is PostgreSQL. Other examples of RDBMSs are MySQL, SQLite, and MariaDB.

Container engine

A container engine is a program used to run sandboxed environments called containers. The most widely used container engine is Docker, though the parsomics project uses Podman.

PyPI

The Python Package Index is the official repository for Python packages. parsomics and its components are distributed via PyPI for easy installation using tools like pip.

Semantic versioning (SemVer)

A versioning scheme that uses a three-part format: MAJOR.MINOR.PATCH. A release increments the MAJOR version for incompatible changes, MINOR for backward-compatible added functionality, and PATCH for backward-compatible bug fixes. Read more about SemVer on its official website.
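The increment rules can be sketched in a few lines of Python. This is only an illustration: the names Version, parse_version, and bump are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH strings without pre-release or build tags.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def parse_version(text: str) -> Version:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def bump(version: Version, change: str) -> Version:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    if change == "major":
        # Incompatible change: reset MINOR and PATCH.
        return Version(version.major + 1, 0, 0)
    if change == "minor":
        # Added functionality: reset PATCH.
        return Version(version.major, version.minor + 1, 0)
    if change == "patch":
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown change type: {change!r}")


current = parse_version("1.4.2")
print(bump(current, "patch"))  # 1.4.3
print(bump(current, "minor"))  # 1.5.0
print(bump(current, "major"))  # 2.0.0
```

Because Version is a tuple, two versions compare in the expected order, so parse_version("1.10.0") > parse_version("1.9.3") holds.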

Database migration

The process of applying incremental, version-controlled changes to a database schema. parsomics uses Alembic to manage database migrations.
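To make the idea concrete, here is a minimal sketch of incremental, versioned schema changes. It does not use Alembic or PostgreSQL: it relies on Python's built-in sqlite3 module and SQLite's user_version pragma, and the genome table and the MIGRATIONS list are invented for illustration.

```python
import sqlite3

# Hypothetical migrations, in order. Each entry is one small schema change.
MIGRATIONS = [
    "CREATE TABLE genome (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE genome ADD COLUMN length INTEGER",
]


def current_version(conn: sqlite3.Connection) -> int:
    """Read how many migrations this database has already received."""
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply every migration the database has not seen yet."""
    start = current_version(conn)
    for number, statement in enumerate(MIGRATIONS[start:], start=start + 1):
        conn.execute(statement)
        # Record the new schema version so the step is never applied twice.
        conn.execute(f"PRAGMA user_version = {number}")
    conn.commit()
    return current_version(conn)


conn = sqlite3.connect(":memory:")
print(migrate(conn))  # 2 (both migrations applied)
print(migrate(conn))  # 2 (nothing left to apply)
```

A dedicated migration tool adds features this sketch lacks, such as downgrade steps and migration files tracked in version control.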

Package

A self-contained unit of code distribution, typically a Python module or collection of modules, that can be installed and reused.

Dependencies

External packages or libraries required by a project to run correctly.

REST API

A web-based interface that uses standard HTTP methods (GET, POST, etc.) to access and manipulate resources through stateless requests, typically exchanging data in a structured format such as JSON. It is possible to interact with a local parsomics database through a REST API using the parsomics-api-server module.
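The sketch below shows what such requests look like when built with Python's standard library. The base URL, the "genomes" resource, and the payload are all assumptions made for illustration; they are not taken from the parsomics-api-server documentation. The requests are only constructed, not sent.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumed address of a local server


def build_get(resource: str) -> Request:
    """Build a GET request that would read a resource."""
    return Request(f"{BASE_URL}/{resource}", method="GET")


def build_post(resource: str, payload: dict) -> Request:
    """Build a POST request that would create a resource from JSON."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        f"{BASE_URL}/{resource}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "genomes" is a hypothetical resource name.
read_request = build_get("genomes")
create_request = build_post("genomes", {"name": "example"})
print(read_request.get_method(), read_request.full_url)
print(create_request.get_method(), create_request.full_url)
```

Each request carries everything the server needs to handle it (method, URL, body), which is what "stateless" means here. Passing a built request to urllib.request.urlopen would send it.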

Wizard

An interactive tool that guides users through a sequence of steps to complete a task. An example of a wizard is the parsomics setup command, which walks users through all the steps of the initial setup.

CLI

Short for Command-Line Interface, a text-based interface that allows users to interact with the software via typed commands. parsomics includes a CLI (the parsomics-cli module) for managing analyses, databases, and plugins.

Bioinformatics

dRep

dRep is a Python program for rapidly comparing large numbers of genomes. dRep can also "de-replicate" a genome set by identifying groups of highly similar genomes and choosing the best representative genome for each group.

GTDB-Tk

GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB). It is designed to work with recent advances that allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples. It can also be applied to isolate and single-cell genomes.

Prokka

Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

InterPro

InterPro is a database which integrates predictive information about proteins’ function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains.

InterProScan

Users who have novel nucleotide or protein sequences that they wish to functionally characterize can use the software package InterProScan to run the same scanning algorithms from the InterPro database.

CLEAN

CLEAN (Contrastive Learning enabled Enzyme ANnotation) is a machine learning algorithm for assigning Enzyme Commission (EC) numbers to enzymes. Its authors report better accuracy, reliability, and sensitivity than existing computational tools.

ProteInfer

ProteInfer is an approach for predicting the functional properties of protein sequences using deep neural networks.

run_dbCAN

run_dbcan is the standalone version of the dbCAN3 annotation tool for automated CAZyme annotation. It incorporates HMMER, Diamond, and dbCAN_sub for annotating CAZyme families, and integrates CAZyme Gene Clusters (CGCs) and substrate predictions.

`parsomics` Concepts

File validation

The process that differentiates files that should be processed (valid) from files that should not be processed (invalid).
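As a rough illustration of the idea, the sketch below separates valid from invalid files using an assumed rule: a file is valid when it has an expected suffix and is not empty. The rule, the suffix list, and the function names are hypothetical and are not the criteria parsomics actually applies.

```python
from pathlib import Path

# Assumed rule, for illustration only: expected suffix and non-empty content.
EXPECTED_SUFFIXES = {".gff", ".tsv"}


def is_valid(path: Path) -> bool:
    """Return True when the file should be processed under the assumed rule."""
    return (
        path.is_file()
        and path.suffix in EXPECTED_SUFFIXES
        and path.stat().st_size > 0
    )


def split_files(paths):
    """Split paths into (valid, invalid) lists, preserving input order."""
    valid = [p for p in paths if is_valid(p)]
    invalid = [p for p in paths if not is_valid(p)]
    return valid, invalid
```

Calling split_files(sorted(Path("results").iterdir())) on a hypothetical results directory would return the files to process and the files to skip.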

Modules

The parsomics project is composed of multiple repositories. Each of these repositories is called a "module". A few examples of modules are:

parsomics-core: the library that implements the code for inserting data into the parsomics database
parsomics-api-server: a REST API for interacting with the parsomics database
parsomics-cli: a CLI for parsomics that manages analyses, databases, and plugins
parsomics-plugin-interpro: a plugin that adds support for processing annotations from InterProScan

Plugins

A special kind of module that adds support for certain protein and gene annotation formats. These modules can be installed, uninstalled, and more using the parsomics plugin command of the parsomics-cli.

Fragments

An abstraction that encompasses all gene "subunits" (exon, CDS, mRNA, etc) that may appear in a GFF file.
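The sketch below shows one way such an abstraction could look: every feature line of a GFF file, whatever its type, is read into the same record. The Fragment class and its field names are hypothetical and do not reflect the parsomics data model. The column order follows the nine tab-separated columns of the GFF format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    seqid: str
    kind: str  # the GFF "type" column, e.g. "exon", "CDS", "mRNA"
    start: int
    end: int
    strand: str


def read_fragments(gff_text: str) -> list[Fragment]:
    """Turn every nine-column feature line into a Fragment."""
    fragments = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        seqid, _source, kind, start, end, _score, strand, _phase, _attrs = columns
        fragments.append(Fragment(seqid, kind, int(start), int(end), strand))
    return fragments


sample = "\n".join([
    "##gff-version 3",
    "\t".join(["contig_1", "example", "mRNA", "100", "900", ".", "+", ".", "ID=mrna1"]),
    "\t".join(["contig_1", "example", "exon", "100", "400", ".", "+", ".", "Parent=mrna1"]),
    "\t".join(["contig_1", "example", "CDS", "150", "400", ".", "+", "0", "Parent=mrna1"]),
])
print([fragment.kind for fragment in read_fragments(sample)])  # ['mRNA', 'exon', 'CDS']
```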

Parsing

The complete procedure of adding the data (and metadata) from a single file into the parsomics relational database.

Processing

The complete procedure of adding the data from all files of a certain tool run into the parsomics local relational database. One processing involves multiple parsings.

Analysis

The complete procedure of processing the files from all runs specified in the configuration file and inserting their data (and metadata) into the relational database. One analysis involves multiple processings.
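The nesting of the three terms (one analysis contains many processings, and one processing contains many parsings) can be pictured with a small sketch. The class and field names are hypothetical and only mirror the definitions above; they are not the parsomics implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Parsing:
    file_name: str  # one parsing handles exactly one file


@dataclass
class Processing:
    tool_run: str  # one processing covers all files of one tool run
    parsings: list = field(default_factory=list)


@dataclass
class Analysis:
    processings: list = field(default_factory=list)

    def file_count(self) -> int:
        """Total number of files parsed across every processing."""
        return sum(len(p.parsings) for p in self.processings)


# Hypothetical runs and file names, for illustration only.
analysis = Analysis(
    processings=[
        Processing("annotation run", [Parsing("a.gff"), Parsing("b.gff")]),
        Processing("taxonomy run", [Parsing("summary.tsv")]),
    ]
)
print(len(analysis.processings), analysis.file_count())  # 2 3
```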
