Description
The Compact Muon Solenoid (CMS) experiment at CERN generates and processes vast volumes of data, requiring significant computing capacity. To meet these demands, CMS has adopted a federated throughput computing model built on a global HTCondor-based infrastructure, the CMS Submission Infrastructure. By seamlessly integrating heterogeneous resources from sites around the world, it operates a single unified, virtualized pool. This infrastructure currently provides access to over 500,000 CPU cores, enabling CMS to efficiently execute a wide variety of data processing and simulation workloads.
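The unified pool described above is assembled from ClassAds published by each execute slot. As a minimal sketch (not the actual CMS tooling), pool-wide capacity can be aggregated from such ads; the slot data and the `GLIDEIN_Site` attribute values here are mock examples, and a real query would go through the HTCondor collector (e.g. via the `htcondor` Python bindings):

```python
# Sketch: aggregating pool capacity from startd (slot) ClassAds.
# In a live pool these ads would come from the HTCondor collector;
# mock ads are used here to keep the example self-contained.
from collections import Counter

# Hypothetical slot ads: each execute slot reports its site and CPU count.
slot_ads = [
    {"GLIDEIN_Site": "CERN", "Cpus": 8},
    {"GLIDEIN_Site": "CERN", "Cpus": 8},
    {"GLIDEIN_Site": "FNAL", "Cpus": 16},
    {"GLIDEIN_Site": "DESY", "Cpus": 4},
]

def pool_summary(ads):
    """Return total CPU cores and a per-site breakdown."""
    per_site = Counter()
    for ad in ads:
        per_site[ad["GLIDEIN_Site"]] += ad["Cpus"]
    return sum(per_site.values()), dict(per_site)

total, by_site = pool_summary(slot_ads)
print(total)    # 36
print(by_site)  # {'CERN': 16, 'FNAL': 16, 'DESY': 4}
```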
This federation, however, comes with substantial operational challenges, notably the need for robust and scalable monitoring. To ensure reliability, performance, and rapid diagnosis of issues, we have developed a comprehensive monitoring ecosystem that spans job execution, resource availability, and system health across the entire pool. This talk will present the architecture of the CMS federated compute infrastructure, detail the role of HTCondor in enabling global workload distribution, and highlight recent developments in monitoring that are critical to operating such a large-scale system effectively.
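One basic building block of such monitoring is rolling up job states across the pool. The sketch below illustrates this with mock job ads; the integer-to-name mapping follows HTCondor's documented `JobStatus` codes, while the job data itself is invented, and a real pipeline would pull these ads from the schedds or the collector:

```python
# Sketch: a monitoring-style roll-up of job states.
# HTCondor encodes job state in the integer JobStatus attribute;
# the mapping below covers the common values.
from collections import Counter

JOB_STATUS = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed", 5: "Held"}

# Hypothetical job ads, mocked for the example.
job_ads = [{"JobStatus": 2}, {"JobStatus": 2}, {"JobStatus": 1}, {"JobStatus": 5}]

def job_state_counts(ads):
    """Count jobs per human-readable state, as a dashboard metric would."""
    return dict(Counter(JOB_STATUS.get(ad["JobStatus"], "Unknown") for ad in ads))

print(job_state_counts(job_ads))  # {'Running': 2, 'Idle': 1, 'Held': 1}
```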