Summary (2-4 sentences)
The problem: It is often challenging to query, filter, and analyze large biological sample repositories such as the SRA because the associated metadata is not standardized, consisting of free-text key-value pairs.
The approach: We developed machine learning models for mapping the free-text metadata to standardized ontology terms annotated with their relationships to the samples. To do this, we built a computational pipeline for training and evaluating our models.
The technology: Our pipeline is a YAML-configurable Snakemake workflow for dispatching HTCondor jobs that preprocess data, extract features, train ML models, evaluate them, and generate performance reports.
Availability of the Speaker
Cannot present July 10