Glossary
Software Development
Metapackage
A software package that bundles multiple other software packages. For example, the parsomics PyPI package bundles all packages of the project.
RDBMS
RDBMS stands for Relational Database Management System. The RDBMS used in the parsomics project is PostgreSQL. Other examples of RDBMSs are MySQL, SQLite, and MariaDB.
Container engine
A container engine is a program used to run sandboxed environments called containers. Docker is a widely used container engine, though the parsomics project uses Podman.
PyPI
The Python Package Index is the official repository for Python packages. parsomics and its components are distributed via PyPI for easy installation using tools like pip.
Semantic versioning (SemVer)
A versioning scheme that uses a three-part format: MAJOR.MINOR.PATCH. Updates increment the MAJOR version for incompatible changes, MINOR for backward-compatible added functionality, and PATCH for backward-compatible bug fixes. Read more about SemVer on its official website.
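The increment rules above can be sketched in a few lines of Python. This is only an illustration: the `Version` class and its `parse` and `bump` methods are hypothetical names, not part of parsomics, and pre-release or build-metadata suffixes are not handled.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A MAJOR.MINOR.PATCH version (hypothetical helper for illustration)."""

    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Split "1.4.2" into its three numeric parts.
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def bump(self, part: str) -> "Version":
        # Incrementing a part resets every part to its right to zero.
        if part == "major":
            return Version(self.major + 1, 0, 0)
        if part == "minor":
            return Version(self.major, self.minor + 1, 0)
        if part == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown version part: {part}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"
```

Because the parts are stored as integers in a tuple, versions compare numerically, so `Version.parse("1.10.0")` sorts after `Version.parse("1.9.3")`, which a plain string comparison would get wrong.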
Database migration
The process of applying incremental, version-controlled changes to a database schema. parsomics uses Alembic to manage database migrations.
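To make the idea of incremental schema changes concrete, here is a minimal sketch that uses only Python's standard `sqlite3` module. It is not how Alembic or parsomics work: the `MIGRATIONS` list, the `genome` table, and the use of SQLite's `user_version` pragma to record the schema version are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical migrations, ordered by the schema version they produce.
MIGRATIONS = [
    (1, "CREATE TABLE genome (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE genome ADD COLUMN length INTEGER"),
]


def current_version(conn: sqlite3.Connection) -> int:
    # SQLite stores an integer the application can use as a schema version.
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply every migration newer than the recorded schema version."""
    for version, statement in MIGRATIONS:
        if version > current_version(conn):
            conn.execute(statement)
            conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()
    return current_version(conn)


conn = sqlite3.connect(":memory:")
migrate(conn)
```

Running `migrate` a second time changes nothing, because each step is skipped once the recorded version has caught up. That property is what lets a schema evolve safely across installations that are at different versions.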
Package
A self-contained unit of code distribution, typically a Python module or collection of modules, that can be installed and reused.
Dependencies
External packages or libraries required by a project to run correctly.
REST API
A web-based interface that uses standard HTTP methods (GET, POST, etc.) to access and manipulate resources statelessly, exchanging data in a structured format, typically JSON. It is possible to interact with a local parsomics database through a REST API using the parsomics-api-server module.
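As a rough illustration of what "standard HTTP methods" and "JSON" mean in practice, the sketch below builds (but does not send) two requests with Python's standard library. The base URL, the `/items` paths, and the payload are hypothetical placeholders and do not describe the actual endpoints of parsomics-api-server.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical address of a local server


def build_request(method, path, payload=None):
    """Build an HTTP request; a JSON body is attached only when given."""
    data = None
    headers = {"Accept": "application/json"}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(f"{BASE_URL}{path}", data=data, headers=headers, method=method)


# GET reads a resource; POST creates one from a JSON body.
read = build_request("GET", "/items/1")
create = build_request("POST", "/items", {"name": "example"})
```

Sending either request would require a running server and a call such as `urllib.request.urlopen(read)`. Each request carries everything the server needs to handle it, which is what "stateless" refers to.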
Wizard
An interactive tool that guides users through a sequence of steps to complete a task. An example of a wizard is the parsomics setup command, which walks users through all the steps of the initial setup.
CLI
Short for Command-Line Interface, a text-based interface that allows users to interact with the software via typed commands. parsomics includes a CLI (the parsomics-cli module) for managing analyses, databases, and plugins.
Bioinformatics
dRep
dRep is a Python program for rapidly comparing large numbers of genomes. dRep can also "de-replicate" a genome set by identifying groups of highly similar genomes and choosing the best representative genome for each group.
GTDB-Tk
GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB). It is designed to work with recent advances that allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples. It can also be applied to isolate and single-cell genomes.
prokka
Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.
InterPro
InterPro is a database which integrates predictive information about proteins’ function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains.
InterProScan
Users who have novel nucleotide or protein sequences that they wish to functionally characterize can use the software package InterProScan to run the scanning algorithms used by the InterPro database.
CLEAN
CLEAN (Contrastive Learning enabled Enzyme ANnotation) is a machine learning algorithm that assigns Enzyme Commission (EC) numbers and that, according to its authors, achieves better accuracy, reliability, and sensitivity than existing computational tools.
ProteInfer
ProteInfer is an approach for predicting the functional properties of protein sequences using deep neural networks.
run_dbCAN
run_dbcan is the standalone version of the dbCAN3 annotation tool for automated CAZyme (carbohydrate-active enzyme) annotation. It incorporates HMMER, DIAMOND, and dbCAN_sub for annotating CAZyme families, and integrates CAZyme Gene Cluster (CGC) and substrate predictions.
parsomics Concepts
File validation
The process that distinguishes files that should be processed (valid) from files that should not be processed (invalid).
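A minimal sketch of such a filter is shown below. The validation rules used here (an allowed file extension and a non-empty file) are assumptions chosen for illustration; the criteria parsomics actually applies are not described in this glossary, and `ALLOWED_SUFFIXES`, `is_valid`, and `split_files` are hypothetical names.

```python
from pathlib import Path

# Hypothetical rule: only these extensions are considered for processing.
ALLOWED_SUFFIXES = {".tsv", ".gff"}


def is_valid(path: Path) -> bool:
    """A file is valid if it exists, has an allowed suffix, and is not empty."""
    return (
        path.is_file()
        and path.suffix in ALLOWED_SUFFIXES
        and path.stat().st_size > 0
    )


def split_files(paths):
    """Partition paths into (valid, invalid), preserving input order."""
    valid = [p for p in paths if is_valid(p)]
    invalid = [p for p in paths if not is_valid(p)]
    return valid, invalid
```

Returning both lists, rather than only the valid one, makes it easy to report which files were skipped.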
Modules
The parsomics project is composed of multiple repositories. Each of these repositories is called a "module". A few examples of modules are:
parsomics-core: the library that implements the code for inserting data into the parsomics database
parsomics-api-server: a REST API for interacting with the parsomics database
parsomics-cli: a CLI for parsomics that manages analyses, databases, and plugins
parsomics-plugin-interpro: a plugin that adds support for processing annotations from InterProScan
Plugins
A special kind of module that adds support for certain protein and gene annotation formats. These modules can be installed, uninstalled, and more using the parsomics plugin command of the parsomics-cli.
Fragments
An abstraction that encompasses all gene "subunits" (exon, CDS, mRNA, etc.) that may appear in a GFF file.
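To show where these subunit names come from, the sketch below reads the feature type, which is the third of the nine tab-separated columns of a GFF3 record, and counts how often each type occurs. The `feature_types` function and the example records are hypothetical, and how parsomics itself models fragments is not shown here.

```python
from collections import Counter


def feature_types(gff_text: str) -> Counter:
    """Count the feature types (third column) found in GFF3 text."""
    counts = Counter()
    for line in gff_text.splitlines():
        # Skip blank lines, directives ("##...") and comments ("#...").
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature record
        counts[columns[2]] += 1
    return counts


# Made-up records for illustration only.
EXAMPLE = "\n".join([
    "##gff-version 3",
    "contig1\texample\tgene\t1\t900\t.\t+\t.\tID=gene1",
    "contig1\texample\tmRNA\t1\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "contig1\texample\tCDS\t1\t900\t.\t+\t0\tID=cds1;Parent=mrna1",
])
```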
Parsing
The complete procedure of adding the data (and metadata) from a single file into the parsomics relational database.
Processing
The complete procedure of adding the data from all files of a certain tool run into the parsomics local relational database. One processing involves multiple parsings.
Analysis
The complete procedure of processing the files from all runs specified in the configuration file and inserting their data (and metadata) into the relational database. One analysis involves multiple processings.
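The nesting of the three terms (an analysis contains processings, and a processing contains parsings) can be sketched as three small functions. Everything here is a stand-in: the function names, the shape of `config`, and the list used in place of a database are assumptions for illustration, not the parsomics implementation.

```python
def parse_file(path, database):
    """One parsing: add the data from a single file to the database."""
    database.append(path)  # stand-in for inserting real rows


def process_run(files, database):
    """One processing: parse every file produced by a single tool run."""
    for path in files:
        parse_file(path, database)
    return len(files)


def run_analysis(config, database):
    """One analysis: process every run listed in the configuration."""
    return {run: process_run(files, database) for run, files in config.items()}


# Hypothetical configuration: run name -> files produced by that run.
config = {
    "prokka_run": ["a.gff", "b.gff"],
    "interpro_run": ["a.tsv"],
}
database = []
summary = run_analysis(config, database)
```

Here `summary` records how many parsings each processing performed, and `database` ends up holding one entry per parsed file.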