Speaker
Yuriy Sverchkov (University of Wisconsin - Madison)
Summary (2-4 sentences)
The problem: Large biological sample repositories such as the Sequence Read Archive (SRA) are often hard to query, filter, and analyze because their sample metadata is not standardized and consists of free-text key-value pairs.
The approach: We developed machine learning models for mapping the free-text metadata to standardized ontology terms annotated with their relationships to the samples. To do this, we built a computational pipeline for training and evaluating our models.
The technology: Our pipeline is a YAML-configurable Snakemake workflow for dispatching HTCondor jobs that preprocess data, extract features, train ML models, evaluate them, and generate performance reports.
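As a rough illustration of the stage ordering described above, the following Python sketch resolves a dependency graph of pipeline stages and turns free-text key-value metadata into simple token features. It is not the authors' Snakemake workflow: the stage names, the `STAGES` mapping, and the `execution_order` and `tokenize_metadata` helpers are assumptions made for this example, and the abstract does not specify the actual rules, configuration keys, or feature representation.

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the stages it depends on.
# The names mirror the steps listed in the summary; the real workflow's
# rule names and configuration keys are not given in the abstract.
STAGES = {
    "preprocess": [],
    "extract_features": ["preprocess"],
    "train": ["extract_features"],
    "evaluate": ["train"],
    "report": ["evaluate"],
}


def execution_order(stages):
    """Return stage names in an order that respects every dependency."""
    return list(TopologicalSorter(stages).static_order())


def tokenize_metadata(record):
    """Turn free-text key-value pairs into sorted 'key=token' features.

    This is a toy stand-in for feature extraction, not the authors' method.
    """
    features = set()
    for key, value in record.items():
        for token in value.lower().split():
            features.add(f"{key.lower()}={token}")
    return sorted(features)


if __name__ == "__main__":
    print(execution_order(STAGES))
    print(tokenize_metadata({"Tissue": "Liver biopsy", "sex": "F"}))
```

In a workflow manager such as Snakemake, the same ordering would typically be inferred from each rule's input and output files rather than declared explicitly as it is here.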
Availability of the Speaker
Cannot present July 10
Primary authors
Prof. Colin Dewey (University of Wisconsin - Madison)
Prof. Mark Craven (University of Wisconsin - Madison)
Yuriy Sverchkov (University of Wisconsin - Madison)