Jun 2 – 6, 2025
Fluno Center on the University of Wisconsin-Madison Campus
America/Chicago timezone

Integration of MINCER with Open Science Grid

Jun 4, 2025, 9:30 AM
15m
Howard Auditorium (Fluno Center)

Speaker

Irvin Lopez-Audetat (University of Texas at El Paso)

Description

The Monitoring Infrastructure for Network and Computing Environment Research (MINCER) project aims to provide a foundation for in-depth insight into, and analysis of, distributed heterogeneous computing environments, supporting and enhancing research and education in computer and network systems. Our approach is to work with the Open Science Grid (OSG) by providing a set of MINCER containers that include the software needed to obtain measurements for different types of system experiments on the available platforms. These containers will be accompanied by detailed instructions and usage examples, and will be submitted to the OSG repository for broader community access.
To ensure portability and reproducibility, we will follow best practices for constructing lightweight containers and provide examples that illustrate how to robustly specify dependencies not directly included in the container environment. Our current focus is on containers that enable performance and power measurement for workflows running on GPUs. In parallel, we are developing a separate profiling tool that uses cyPAPI to extract compute and memory metrics from Python applications and generates roofline models for performance analysis. This tool will also be containerized so that it can be deployed and used for analysis across varied platforms, giving further insight into GPU-based workflows.
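As a rough illustration of the roofline analysis mentioned above: a roofline model bounds attainable performance by the smaller of the peak compute rate and the product of peak memory bandwidth and arithmetic intensity (floating-point operations per byte moved). The sketch below is not the MINCER profiling tool. The function names, counter values, and hardware peaks are hypothetical placeholders standing in for what a counter library such as cyPAPI and vendor specifications would supply.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Arithmetic intensity in FLOP per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    if intensity <= 0:
        raise ValueError("intensity must be positive")
    memory_roof = peak_bandwidth_gbs * intensity  # GB/s * FLOP/B = GFLOP/s
    return min(peak_gflops, memory_roof)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Intensity at which a kernel stops being memory-bound."""
    return peak_gflops / peak_bandwidth_gbs


if __name__ == "__main__":
    # Hypothetical device peaks (assumed, not measured).
    PEAK_GFLOPS = 10000.0
    PEAK_BANDWIDTH_GBS = 1000.0

    # Hypothetical counter totals for one kernel (assumed, not measured).
    ai = arithmetic_intensity(flops=2.0e9, bytes_moved=8.0e9)
    bound = attainable_gflops(ai, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
    ridge = ridge_point(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"intensity={ai} FLOP/B, bound={bound} GFLOP/s, {regime}")
```

Comparing a kernel's measured throughput against this bound indicates how much headroom remains and whether memory traffic or compute is the limiting resource.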
Furthermore, we are developing a container-based strategy that enables sites contributing to OSG's OSPool to collect more detailed job-specific data using tools such as DCGM and LDMS. DCGM (Data Center GPU Manager) is an NVIDIA tool that tracks GPU metrics such as utilization, memory use, and power. LDMS (Lightweight Distributed Metric Service) is a low-overhead system for collecting performance data from clusters. With these tools, OSG sites will be able to export more comprehensive data to the OSG dashboard if desired. Sites can also use this job-level data to better tune their systems to meet workload demands.
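To make the idea of job-level GPU data concrete, the following minimal sketch samples the same kinds of metrics named above (utilization, memory use, power) and averages them. It is an assumption-laden stand-in: it calls the nvidia-smi command-line tool rather than DCGM or LDMS, and the helper names are hypothetical and not part of MINCER or OSG software.

```python
import csv
import io
import subprocess
from statistics import mean

# Field names accepted by `nvidia-smi --query-gpu=...`.
QUERY_FIELDS = ["utilization.gpu", "memory.used", "power.draw"]


def read_gpu_sample():
    """Run nvidia-smi once and return its CSV text (requires an NVIDIA GPU)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_samples(csv_text):
    """Parse CSV rows into dicts, skipping blank or non-numeric rows."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue
        try:
            values = [float(v) for v in record]
        except ValueError:
            continue  # e.g. "[N/A]" when a metric is unsupported
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def summarize(rows):
    """Average each metric over all samples collected for a job."""
    if not rows:
        return {}
    return {field: mean(r[field] for r in rows) for field in QUERY_FIELDS}
```

In use, a wrapper would call `read_gpu_sample()` periodically while the job runs, concatenate the text, and pass it through `parse_samples` and `summarize` when the job ends. A DCGM- or LDMS-based collector would replace only the sampling step.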

Primary authors

Ashkan Arabi
Dr Deepak K. Tosh
Irvin Lopez-Audetat (University of Texas at El Paso)
Dr Shirley V. Moore
