Summary (2-4 sentences)
The International Gravitational Wave Network (IGWN) extensively utilizes the globally distributed IGWN HTCondor pool, a mix of dedicated, allocated, and opportunistic HTC resources with IGWN- and PATh-operated middleware.
Maintaining stability and ensuring a consistent user experience in this diverse environment is challenging. To monitor and diagnose issues, we developed the IGWN Grid Exerciser. This tool mimics the IGWN user experience: it periodically launches DAGMan workflows with representative test jobs, gathers statistics via condor_adstash and Elasticsearch, and produces regular reports and deeper insights through Grafana dashboards.
I will provide an overview of the Grid Exerciser framework and how it helps identify and resolve various issues encountered by IGWN. I will also discuss the challenges and frustrations of troubleshooting in an often-opaque infrastructure while supporting IGWN analysts who are more familiar with dedicated environments.