Description
Researchers using HTCondor for high-throughput computing routinely submit groups of related jobs, known as Clusters, each containing anywhere from hundreds to tens of thousands of jobs. Existing tools report data job by job, which makes it difficult to diagnose Cluster-wide issues such as jobs stuck on hold, poor resource utilization, or unexpected failures. We present a Python toolkit, to be included as part of the HTCondor suite, that bridges this gap. Given a single Cluster ID, the toolkit evaluates job status distribution, runtime patterns, hold reasons, and resource utilization, then synthesizes these into a colour-coded health summary that tells researchers and facilitators whether their Cluster is running well and, if not, why. Each analysis condenses thousands of individual job records into a concise, actionable report.
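To make the idea of a Cluster-level summary concrete, the following is a minimal sketch of how per-job records might be rolled up into a single health verdict. It is not the toolkit's implementation. The function name `summarize_cluster`, the thresholds, and the colour rules are all hypothetical and chosen only for illustration. The sketch assumes each job is a plain dictionary carrying the standard job ClassAd attributes `JobStatus`, `HoldReason`, `MemoryUsage`, and `RequestMemory`, such as the records a schedd query for one Cluster ID would return.

```python
from collections import Counter

# Standard HTCondor JobStatus codes and their meanings.
JOB_STATUS = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def summarize_cluster(jobs, held_red=0.25, held_yellow=0.05, low_mem_util=0.25):
    """Aggregate per-job records into one Cluster-level report.

    The thresholds are illustrative assumptions, not values from the toolkit.
    """
    total = len(jobs)
    if total == 0:
        return {"total": 0, "health": "unknown"}

    # Job status distribution across the whole Cluster.
    status_counts = Counter(
        JOB_STATUS.get(job.get("JobStatus"), "Unknown") for job in jobs
    )

    # Hold reasons, counted only for jobs that are currently held.
    hold_reasons = Counter(
        job.get("HoldReason", "unspecified")
        for job in jobs
        if job.get("JobStatus") == 5
    )

    # Memory utilization: measured usage divided by the request (both in MB),
    # averaged over the jobs that report both values.
    ratios = [
        job["MemoryUsage"] / job["RequestMemory"]
        for job in jobs
        if job.get("MemoryUsage") is not None and job.get("RequestMemory")
    ]
    mem_util = sum(ratios) / len(ratios) if ratios else None

    # Hypothetical colour rules: many held jobs is red; some held jobs or
    # heavily over-requested memory is yellow; otherwise green.
    held_fraction = status_counts["Held"] / total
    if held_fraction >= held_red:
        health = "red"
    elif held_fraction >= held_yellow or (
        mem_util is not None and mem_util < low_mem_util
    ):
        health = "yellow"
    else:
        health = "green"

    return {
        "total": total,
        "status_counts": dict(status_counts),
        "hold_reasons": dict(hold_reasons),
        "mean_memory_utilization": mem_util,
        "health": health,
    }


# Made-up records standing in for the jobs of one Cluster.
example_jobs = (
    [{"JobStatus": 4, "MemoryUsage": 900, "RequestMemory": 1024}] * 6
    + [{"JobStatus": 2, "MemoryUsage": 800, "RequestMemory": 1024}] * 3
    + [{"JobStatus": 5, "HoldReason": "Job exceeded memory limit"}]
)
report = summarize_cluster(example_jobs)
print(report["health"], report["status_counts"], report["hold_reasons"])
```

With these made-up records, one job in ten is held, so the hypothetical rules report the Cluster as yellow and surface the single hold reason. A real report would also cover runtime patterns, which this sketch leaves out.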