Jun 2 – 6, 2025
Fluno Center on the University of Wisconsin-Madison Campus
America/Chicago timezone

Operating a Federated HTCondor Infrastructure: Monitoring and Management for CMS Computing

Jun 5, 2025, 4:35 PM
20m
Howard Auditorium (Fluno Center on the University of Wisconsin-Madison Campus)

601 University Avenue, Madison, WI 53715-1035

Speaker

Bruno Coimbra (Fermilab)

Description

The Compact Muon Solenoid (CMS) experiment at CERN generates and processes vast volumes of data, which requires significant computing capacity. To meet these demands, CMS has adopted a federated throughput computing model built on the CMS Submission Infrastructure, a globally distributed infrastructure based on HTCondor. Integrating heterogeneous resources from multiple sites allows CMS to operate a single, unified, virtualized pool. This infrastructure currently provides access to over 500,000 CPU cores, enabling CMS to efficiently execute a wide variety of data processing and simulation workloads.

This federation, however, comes with substantial operational challenges, most notably the need for robust and scalable monitoring. To ensure reliability, performance, and rapid diagnosis of issues, we have developed a comprehensive monitoring ecosystem that spans job execution, resource availability, and system health across the entire pool. This talk will present the architecture of the CMS federated compute infrastructure, detail the role of HTCondor in enabling global workload distribution, and highlight recent developments in monitoring that are critical to operating such a large-scale system effectively.

Primary authors

Bruno Coimbra (Fermilab), Marco Mascheroni (UCSD)

Co-authors

Antonio Pérez-Calero Yzquierdo (CMS), Hyunwoo Kim (Fermilab), Mr Ralf Von Cube, Ms Vaiva Zokaite
