Recent technological advances in genome sequencing have revealed an enormous diversity of lifeforms. There are now millions of available genomes, each comprising thousands of genes. The universe of newly discovered genes is expanding far faster than our ability to study them in the laboratory. Here, I will present how high-throughput computing is unlocking the function of novel genes at an unfathomable scale.
Fermilab is the first High Energy Physics institution to transition from X.509 user certificates to authentication tokens in production systems. All the experiments that Fermilab hosts are now using JSON Web Token (JWT) access tokens in their grid jobs. The tokens are defined using the WLCG Common JWT Profile. Many software components have been either created or updated for this transition, and the changes to those components are described. Most of the software is available to others as open source. There have been some glitches and learning curve issues but in general the system has been performing well and is being improved as operational problems are addressed.
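As a rough illustration of what such a token carries (a generic sketch, not Fermilab's actual tooling; the token file path is a placeholder), the claims of a WLCG-profile access token can be inspected with an ordinary JWT library:

```python
# Sketch: inspect the claims of a WLCG-profile access token (no signature check,
# for illustration only). The token file path is a placeholder.
import jwt  # PyJWT

raw_token = open("/tmp/bearer_token").read().strip()

claims = jwt.decode(raw_token, options={"verify_signature": False})

print(claims.get("iss"))       # issuer, e.g. the experiment's token issuer
print(claims.get("sub"))       # subject identity
print(claims.get("wlcg.ver"))  # WLCG Common JWT Profile version claim
print(claims.get("scope"))     # capabilities, e.g. "compute.create storage.read:/"
print(claims.get("exp"))       # expiration time (Unix epoch seconds)
```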
Panel Discussion led by Miron Livny.
NSF NCAR’s labs and programs collectively cover a breadth of research topics in Earth system science, from the effects of the Sun on Earth's atmosphere to the role of the ocean in weather and climate prediction, as well as supporting and training the next generation of Earth system scientists. However, with the current legacy "download and analyze" model followed by most of our remote users, we are not realizing the full research potential of NCAR’s wealth of datasets. Our goal is to integrate NCAR’s curated data collections with the OSDF data and compute fabric to broaden access capabilities. In this talk, we present progress on this collaboration and demonstrate geoscience workflows which ingest data from NCAR’s Research Data Archive using pelicanFS via OSDF caches.
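As a small sketch of the kind of workflow in the demonstration (the object path and dataset are hypothetical, and this is not the exact demo code), pelicanFS exposes the OSDF as an fsspec filesystem, so archive data can be opened straight into xarray:

```python
# Sketch: read a NetCDF object from the OSDF with pelicanfs + xarray.
# The federation URL is the public OSDF discovery endpoint; the object path is hypothetical.
import xarray as xr
from pelicanfs.core import PelicanFileSystem

pelfs = PelicanFileSystem("pelican://osg-htc.org")

with pelfs.open("/ncar/rda/example-dataset/example.nc", "rb") as f:
    ds = xr.open_dataset(f)
    print(ds)
```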
The Monitoring Infrastructure for Network and Computing Environment Research (MINCER) project aims to provide a foundation for in-depth insight and analysis of distributed heterogeneous computing environments, supporting and enhancing research and education in computer and network systems. Our approach is to work in conjunction with the Open Science Grid (OSG) by providing a set of MINCER containers for integration, including the necessary software to obtain measurements for different types of system experiments on available platforms. These containers will be accompanied by detailed instructions and examples illustrating their use, and will be submitted to the OSG repository for broader community access.
To ensure portability and reproducibility, we will follow best practices for constructing lightweight containers, providing examples that illustrate how to robustly specify dependencies not directly included in the container environment. Our current focus includes containers that enable performance and power measurement capabilities for workflows running on GPUs. In parallel, we are developing a separate profiling tool that leverages cyPAPI to extract compute and memory metrics from Python applications, generating roofline models for performance analysis. This tool will also be containerized to facilitate streamlined deployment and analysis across varied platforms, further supporting comprehensive insights into GPU-based workflows.
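To make the roofline idea concrete, here is a minimal generic sketch (not the cyPAPI-based tool itself; the peak numbers are placeholders) of how measured compute and memory metrics place a kernel on a roofline model:

```python
# Sketch: place a measured kernel on a roofline model.
# Peak FLOP rate and bandwidth below are placeholders for a hypothetical GPU.
PEAK_FLOPS = 19.5e12   # peak FP32 throughput, FLOP/s
PEAK_BW = 1.55e12      # peak memory bandwidth, bytes/s

def attainable(ai):
    """Attainable performance (FLOP/s) at arithmetic intensity ai (FLOP/byte)."""
    return min(PEAK_FLOPS, PEAK_BW * ai)

# Example measurement, e.g. from hardware counters: FLOPs executed, bytes moved, runtime.
flops_executed = 4.0e12
bytes_moved = 8.0e11
runtime_s = 0.5

ai = flops_executed / bytes_moved        # arithmetic intensity
achieved = flops_executed / runtime_s    # achieved FLOP/s

print(f"arithmetic intensity: {ai:.2f} FLOP/byte")
print(f"achieved {achieved/1e12:.2f} TFLOP/s of {attainable(ai)/1e12:.2f} TFLOP/s attainable")
```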
Furthermore, we are developing a container-based strategy for enabling OSG sites contributing to OSG’s OSPool to collect more detailed job-specific data using tools such as DCGM and LDMS. DCGM (Data Center GPU Manager) is an NVIDIA tool that tracks GPU metrics like utilization, memory use, and power. LDMS (Lightweight Distributed Metric Service) is a low-overhead system for collecting performance data from clusters. Using these tools will allow OSG sites to export more comprehensive data to the OSG dashboard if desired. They can also use this job-level data to better tune their systems to meet workload demands.
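As a rough taste of the job-level data involved (this sketch uses NVIDIA's NVML Python bindings rather than DCGM or LDMS themselves), the metrics of interest are per-GPU utilization, memory use, and power:

```python
# Sketch: sample per-GPU utilization, memory, and power, similar in spirit to what a
# DCGM/LDMS-based collector would export. Uses NVML bindings (pynvml), not DCGM.
import time
import pynvml  # nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(5):                                       # take a few samples
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
    power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)    # milliwatts
    print(f"gpu={util.gpu}% mem_used={mem.used / 2**20:.0f} MiB power={power_mw / 1000:.1f} W")
    time.sleep(1)

pynvml.nvmlShutdown()
```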
The Algorithms Research and Development Group (ARDG) at the National Radio Astronomy Observatory (NRAO) has been using HTCondor and compute resources from the Open Science Grid (OSG) to improve throughput in radio astronomy imaging by up to two orders of magnitude for single imaging workflows, and we are currently working to extend these imaging capabilities to multiple imaging workflows. Besides developing and maintaining software to efficiently process data and manage the imaging workflows, we have found that monitoring and diagnosing workflow executions is critical to achieving and maintaining high throughput. We will present and discuss some of the tools and methodology that we use to assess workflow health and efficiency, which enable us to visualize and solve problems, eventually also contributing to optimizing and advancing the capabilities of HTCondor and the OSG.
Since its establishment in 2006, the MIT Tier-2 computing center has been a long-standing contributor to CMS computing efforts. Recently, an opportunity arose to take part in the usage of a shared tape storage system operated by Harvard University. In the context of a pilot project to explore this system we acquired tape cartridges with a total capacity of 15 PB and successfully integrated them for use by CMS. A key challenge was the lack of direct access to the tape libraries. To address this, we developed a novel setup that is currently unique within the CMS infrastructure. In this talk, we present our technical approach and share insights that may benefit other sites considering participation in shared tape storage systems.
Computational notebooks have become a critical tool for scientific discovery, wrapping code, results, and visualization into a common package. However, moving complex notebooks between facilities is not so easy: complex workflows require precise software stacks, access to large data, and large backend computational resources. The Floability project aims to connect these two worlds, the portable notebook interface and the heavyweight execution environment, making it possible to specify, share, and execute computational workflows through the familiar notebook interface. This talk will introduce the Floability concept of a workflow "backpack" and demonstrate applications in high energy physics, machine learning, and geosciences.
The Event Workflow Management System (EWMS) enables previously impractical scientific workflows by transforming how HTCondor is used for massively parallel, short-runtime tasks. This talk explores what’s now possible from a user’s perspective. Integrated into IceCube’s Realtime Alert pipeline and powered by OSG’s national-scale compute resources, EWMS’s debut application delivers directional reconstructions of high-energy neutrinos within minutes. The system’s user-first design streamlines scientific workflows, while built-in tools provide reliable administrative control. This talk will showcase how EWMS is accelerating discovery today and explore how its capabilities could unlock new research across domains—from astrophysics to protein modeling, large-scale text mining, and beyond.
The MIT Tier-2 computing center, established in 2006, has been a long-standing contributor to CMS computing. As hardware ages and computing demands evolve, we are undertaking a major redesign of the center’s infrastructure. In this talk, we present a holistic cost analysis that includes not only hardware purchases but also power consumption, cooling, and rack space—factors often excluded from conventional cost models. Using power measurements under typical CMS workloads, we evaluate the cost-effectiveness of maintaining aging hardware versus timely replacement, and propose optimal hardware retirement and procurement policies.
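The flavor of the comparison can be sketched with a simple total-cost-of-ownership calculation; all numbers below are illustrative placeholders, not measurements from this analysis:

```python
# Sketch: compare keeping an aging server vs. replacing it, over a 3-year horizon.
# All figures are illustrative placeholders.
YEARS = 3
POWER_PRICE = 0.12            # $/kWh
PUE = 1.5                     # power usage effectiveness (cooling overhead factor)
RACK_COST_PER_U_YEAR = 150.0  # $/rack-unit/year

def tco(purchase, watts, rack_units, capacity):
    """Total cost and cost per unit of compute capacity (e.g. a benchmark score)."""
    energy_kwh = watts / 1000 * 24 * 365 * YEARS * PUE
    total = purchase + energy_kwh * POWER_PRICE + rack_units * RACK_COST_PER_U_YEAR * YEARS
    return total, total / capacity

old = tco(purchase=0, watts=450, rack_units=2, capacity=500)      # already owned, inefficient
new = tco(purchase=9000, watts=350, rack_units=1, capacity=1500)  # newer, denser, more efficient

print(f"keep old: ${old[0]:,.0f} total, ${old[1]:.2f} per capacity unit")
print(f"buy new:  ${new[0]:,.0f} total, ${new[1]:.2f} per capacity unit")
```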
While there are perhaps hundreds of petabytes of datasets available to researchers, instead of swimming in seas of data there is often a feeling of sitting in a data desert: there’s a mismatch between what sits in carefully curated repositories around the world and what’s accessible at the computational resources locally available. The Pelican Project (https://pelicanplatform.org/) aims to bridge the gap between repositories and compute by providing a software platform to connect the two sides. Pelican’s flagship instance, the Open Science Data Federation (OSDF), serves billions of objects and more than a hundred petabytes a year to national-scale resources. This tutorial, targeted at end-user data consumers and data providers, will cover the data access model of Pelican, guide participants through accessing and sharing data in an existing data federation, and consider how data movement via Pelican and the OSDF can enable their research computing.
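As a small preview of the hands-on portion (the namespace path is illustrative, not part of the tutorial materials), a federation can also be browsed and read programmatically:

```python
# Sketch: list a public OSDF namespace and fetch one object with pelicanfs.
# The namespace path and object name are illustrative.
from pelicanfs.core import PelicanFileSystem

osdf = PelicanFileSystem("pelican://osg-htc.org")

print(osdf.ls("/ospool/uc-shared/public"))        # browse a namespace
osdf.get("/ospool/uc-shared/public/example.dat",  # download a (hypothetical) object
         "example.dat")
```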
We have a common notes and action items document at:
https://docs.google.com/document/d/1y3V4HKGQ8EMUTxH9_MDCqegtA8sGeNtu6UmFk2fRyLQ/edit?usp=sharing
Experiences with running tape storage systems at ATLAS and CMS Tier-2s
We present the initial design and proposed implementation for a series of long-baseline, distributed inference experiments leveraging ARA, a platform for advanced wireless research that spans approximately 500 square kilometers near Iowa State University, including campus, the City of Ames, local research and producer farms, and neighboring rural communities in central Iowa. These experiments aim to demonstrate, characterize, and evaluate the use of distributed inference for computer vision tasks in rural and remote regions where high-capacity, low-latency wireless broadband access and backhaul networks enable edge computing devices and sensors in the field to offload compute-intensive workloads to cloud and high-performance computing systems embedded throughout the edge-to-cloud continuum. In each experiment, a distributed implementation of the MLPerf Inference benchmarks for image classification and object detection will measure standard inference performance metrics for an ARA subsystem configuration under different workload scenarios. Real-time network and weather conditions will also be monitored throughout each experiment to evaluate their impact on inference performance. Here, we highlight the role of HTCondor as the common scheduler and workload manager used to distribute the inference workload across ARA and beyond. We also discuss some of the unique challenges in deploying HTCondor on ARA and provide an update on the current status of the project.
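As an illustration of the scheduling layer (script name, arguments, and resource requests are hypothetical, not the project's actual configuration), HTCondor's Python bindings can fan an inference workload out across many workers:

```python
# Sketch: queue many parallel inference tasks (e.g. shards of an MLPerf-style benchmark)
# through HTCondor. Executable, arguments, and resource requests are hypothetical.
import htcondor

submit = htcondor.Submit({
    "executable": "run_inference.sh",       # wrapper around the benchmark harness
    "arguments": "--shard $(Process)",      # each job processes one data shard
    "request_cpus": "2",
    "request_memory": "4GB",
    "output": "out/infer.$(Process).out",
    "error": "out/infer.$(Process).err",
    "log": "infer.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(submit, count=64)    # 64 parallel shards
print("submitted cluster", result.cluster())
```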
Pegasus is a widely used scientific workflow management system built on top of HTCondor DAGMan. This talk will highlight how Pegasus is deployed within the NSF ACCESS ecosystem and the NAIRR Pilot. We will cover access point deployments, including the hosted ACCESS Pegasus platform (Open OnDemand and Jupyter), workflow execution nodes in HPC environments, and a JupyterLab-based access point within the Purdue Anvil Composable Subsystem. On the execution side, we will discuss several provisioning strategies, including HTCondor Annex, custom virtual machines on the Jetstream2 cloud, simple glidein configurations for campus clusters, and a dynamically autoscaled TestPool environment designed for workflow development and testing.
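For readers unfamiliar with Pegasus, a minimal workflow built with its Python API looks roughly like the sketch below (transformation and file names are hypothetical); the same workflow can then be planned and run from any of the access points above:

```python
# Sketch: a two-job Pegasus workflow; transformation and file names are hypothetical.
from Pegasus.api import Workflow, Job, File

raw = File("input.dat")
cleaned = File("cleaned.dat")
report = File("report.txt")

preprocess = (Job("preprocess")
              .add_args("-i", raw, "-o", cleaned)
              .add_inputs(raw)
              .add_outputs(cleaned))
analyze = (Job("analyze")
           .add_args(cleaned, "-o", report)
           .add_inputs(cleaned)
           .add_outputs(report))

wf = Workflow("example-analysis")
wf.add_jobs(preprocess, analyze)   # the dependency is inferred from the shared file
wf.write("workflow.yml")           # plan and submit on the access point
```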
We have a common notes and action items document at:
https://docs.google.com/document/d/1y3V4HKGQ8EMUTxH9_MDCqegtA8sGeNtu6UmFk2fRyLQ/edit?usp=sharing
Discuss tools and options for the next capacity challenge.
Discuss plans for SENSE/Rucio testing for USATLAS/USCMS
Discuss capacity and capability challenges. Which are we interested in pursuing? Who will participate? When to schedule?
We have a common notes and action items document at:
https://docs.google.com/document/d/1y3V4HKGQ8EMUTxH9_MDCqegtA8sGeNtu6UmFk2fRyLQ/edit?usp=sharing
Presentation/discussion canceled.
Quick overview of AI/ML in WLCG so far
Needs for AI/ML for Infrastructure AND Infrastructure for AI/ML
Funding opportunities
Next step, areas of common interest/effort?
While the DaemonStartTime and MonitorSelfAge attributes of HTCondor daemons provide some insight into the uptime and availability of the service, they are not well suited for tracking longer-term uptime/downtime statistics over the course of days, weeks, or months.
One illustration of this limitation is that if a malfunctioning node or service restarts every five minutes, the values are reset to zero each time and there's no accumulation of the total uptime across the restarts.
Longer-time-period uptime statistics are essential for contractual Service Level Agreement (SLA) management, and are an important aspect of monitoring the overall health of large HTCondor pools.
Using straightforward scripting, the startd's Cron system (Startd Cron), and an external file to store an uptime-centered ClassAd, long-term statistics can be maintained and easily queried.
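A minimal sketch of the approach (attribute, path, and job names are illustrative): a Startd Cron hook accumulates uptime in a small state file and advertises the total as a ClassAd attribute by printing it to stdout, with configuration along the lines shown in the comments.

```python
#!/usr/bin/env python3
# Startd Cron hook sketch: accumulate total uptime across daemon restarts.
# Attribute, path, and job names are illustrative. Configuration along these lines:
#   STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) UPTIME
#   STARTD_CRON_UPTIME_EXECUTABLE = /usr/local/libexec/uptime_hook.py
#   STARTD_CRON_UPTIME_PERIOD = 300
import json
import os
import time

STATE = "/var/lib/condor/uptime_state.json"
PERIOD = 300  # seconds; should match the configured period

state = {"accumulated": 0.0, "last_seen": None}
if os.path.exists(STATE):
    with open(STATE) as f:
        state = json.load(f)

now = time.time()
if state["last_seen"] is not None and now - state["last_seen"] < 2 * PERIOD:
    # The daemon has stayed up since the last sample; credit the elapsed interval.
    state["accumulated"] += now - state["last_seen"]
state["last_seen"] = now

with open(STATE, "w") as f:
    json.dump(state, f)

# Startd Cron hooks advertise attributes by printing ClassAd assignments to stdout.
print(f"AccumulatedUpTime = {int(state['accumulated'])}")
```

The accumulated value then survives restarts and can be queried like any other startd attribute, for example with condor_status -af Machine AccumulatedUpTime.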
HTCondor is the leading system for building, by means of glideins, a dynamic overlay batch scheduling system on top of resources managed by any other scheduler. One fundamental property of these setups is the late binding of containerized user workloads: from a resource provider's point of view, a compute resource is claimed before the user container image is selected. Kubernetes allows for both multi-container requests and dynamic updates to the container image being used. In this talk we show how HTCondor can exploit these features to increase both the effectiveness and the security of glideins running on top of Kubernetes-managed resources.
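One way to picture the mechanism (pod, namespace, container, and image names are hypothetical, and this is not the glidein implementation itself): Kubernetes permits patching the container image of a pod that is already running, which is what makes it possible to bind the user's image after the resource has been claimed.

```python
# Sketch: late-bind a user container image into an already-claimed glidein pod by
# patching its spec. Pod, namespace, container, and image names are hypothetical.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

patch = {"spec": {"containers": [
    {"name": "payload", "image": "registry.example.org/user/analysis:latest"}
]}}

v1.patch_namespaced_pod(name="glidein-abc123", namespace="osg", body=patch)
```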
We have a common notes and action items document at:
https://docs.google.com/document/d/1y3V4HKGQ8EMUTxH9_MDCqegtA8sGeNtu6UmFk2fRyLQ/edit?usp=sharing
Shared development and prototyping for AFs?
Storage technologies to support AFs (Carlos Gamboa ?, 10 minutes)
Joint AFs: Can we “share” AFs between experiments: Belle II and ATLAS or DUNE and CMS, etc.? (Hiro/Ofer/Lincoln ?)
Can we agree on a minimum baseline for AFs?
Standard “login”
New requirements from HEP data analysis include limited access to login nodes, resources allocated through the experiments' programs rather than the login nodes, and efficient data access for collaborative workflows. We have developed an Interactive aNalysis worKbench (INK), a web-based platform leveraging the HTCondor cluster. INK transforms traditional batch-processing resources into a user-friendly, web-accessible interface, enabling researchers to leverage the cluster computing and storage resources directly via their browsers.
A loosely coupled architecture with token-based authentication ensures security, while fine-grained permission management allows customizable access for users and experimental applications. Universal public interfaces abstract away the heterogeneity of the underlying resources. Since the first version was released in March 2025, user feedback has been highly positive.
The Compact Muon Solenoid (CMS) experiment at CERN generates and processes vast volumes of data requiring significant computing capacity. To meet these demands, CMS has adopted a federated throughput computing model distributed across a global infrastructure based on HTCondor, the CMS Submission Infrastructure. Seamless integration of heterogeneous resources from multiple sites allows for operating a unified, virtualized pool. This infrastructure currently provides access to over 500,000 CPU cores, enabling CMS to efficiently execute a wide variety of data processing and simulation workloads.
This federation, however, comes with substantial operational challenges, most notably the need for robust and scalable monitoring. To ensure reliability, performance, and rapid diagnosis of issues, we have developed a comprehensive monitoring ecosystem that spans job execution, resource availability, and system health across the entire pool. This talk will present the architecture of the CMS federated compute infrastructure, detail the role of HTCondor in enabling global workload distribution, and highlight recent developments in monitoring that are critical to operating such a large-scale system effectively.
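As a flavor of the pool-level queries that feed such monitoring (the collector address is a placeholder), the HTCondor Python bindings can count the cores advertised and claimed across the federated pool:

```python
# Sketch: count CPU cores advertised and claimed in a pool; collector address is a placeholder.
import htcondor

collector = htcondor.Collector("collector.example.cern.ch")
slots = collector.query(htcondor.AdTypes.Startd, projection=["Cpus", "State"])

total = sum(ad.get("Cpus", 0) for ad in slots)
claimed = sum(ad.get("Cpus", 0) for ad in slots if ad.get("State") == "Claimed")
print(f"{total} cores advertised, {claimed} claimed")
```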
I describe the process of deploying the National Data Platform Endpoint (formerly Point of Presence / POP) on local infrastructure to provide a data streaming service for a published science dataset where the data origin is located in Hawaii. From the perspective of a software engineer I will cover the process of deploying the endpoint into a Kubernetes cluster or using Docker Compose. I will explain how I integrated access to local data with the endpoint and describe some useful lessons that were learned. Finally, from the perspective of a science user, I will demonstrate how to use the deployed endpoint to preprocess data and stream the results to a Jupyter notebook to visualize the data.
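As a loose illustration of that last step (the endpoint URL, dataset, and variable name are entirely hypothetical, and this is not the NDP client interface), a notebook cell might stream a preprocessed subset over HTTP and plot it:

```python
# Sketch for a notebook cell: fetch a preprocessed subset from a deployed endpoint
# over HTTP and visualize it. URL, dataset, and variable names are hypothetical.
import io

import requests
import xarray as xr

resp = requests.get(
    "https://ndp-endpoint.example.edu/stream/example-dataset/subset.nc", timeout=60)
resp.raise_for_status()

ds = xr.open_dataset(io.BytesIO(resp.content))   # requires a NetCDF-capable engine
ds["sea_surface_temperature"].plot()             # hypothetical variable
```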