Speaker
Description
While the DaemonStartTime and MonitorSelfAge attributes of HTCondor daemons give some insight into the uptime and availability of a service, they are not well suited to tracking longer-term uptime and downtime statistics over days, weeks, or months.
One illustration of this limitation is that if a malfunctioning node or service restarts every five minutes, the values are reset to zero each time and there's no accumulation of the total uptime across the restarts.
Long-term uptime statistics are essential for managing contractual Service Level Agreements (SLAs), and are an important part of monitoring the overall health of large HTCondor pools.
Using straightforward scripting, the condor_startd's Cron mechanism (Startd Cron), and an external file that stores an uptime-centered ClassAd, long-term statistics can be maintained and easily queried.
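As a rough illustration of the approach (not the speaker's actual script), the sketch below shows one way a periodically run script could accumulate uptime across restarts. It assumes the script is invoked about once a minute, keeps its state in a JSON file whose path is passed as the first argument, and treats any gap between runs longer than twice the period as downtime. The attribute names (CumulativeUptimeSeconds and so on), the 60-second period, and the gap threshold are all hypothetical choices made for the example.

```python
#!/usr/bin/env python3
"""Sketch: accumulate uptime across restarts and print it as ClassAd attributes."""
import json
import os
import sys
import time

PERIOD = 60            # assumed interval between runs, in seconds
MAX_GAP = 2 * PERIOD   # assumption: a longer gap between runs counts as downtime


def update_state(state, now, max_gap=MAX_GAP):
    """Classify the interval since the previous run as uptime or downtime."""
    if not state:
        # First run ever: start tracking from now.
        return {"first_seen": now, "last_seen": now, "up": 0, "down": 0}
    gap = max(0, now - state["last_seen"])
    new = dict(state, last_seen=now)
    if gap <= max_gap:
        new["up"] = state["up"] + gap      # script ran on schedule: service was up
    else:
        new["down"] = state["down"] + gap  # runs were missed: count the gap as down
    return new


def to_classad_lines(state):
    """Render the state as 'Attr = value' lines (attribute names are hypothetical)."""
    total = state["up"] + state["down"]
    pct = 100.0 * state["up"] / total if total else 100.0
    return [
        f"UptimeTrackingStart = {state['first_seen']}",
        f"CumulativeUptimeSeconds = {state['up']}",
        f"CumulativeDowntimeSeconds = {state['down']}",
        f"UptimePercent = {pct:.3f}",
    ]


def load_state(path):
    """Read the saved state; a missing or corrupt file restarts tracking."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_state(path, state):
    """Write to a temporary file and rename, so a crash cannot truncate the state."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def main(path):
    state = update_state(load_state(path), int(time.time()))
    save_state(path, state)
    print("\n".join(to_classad_lines(state)))


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: uptime_ad.py STATE_FILE", file=sys.stderr)
```

A script along these lines would typically be registered through the STARTD_CRON_JOBLIST configuration knob and its per-job EXECUTABLE and PERIOD settings, so that the printed attributes are merged into the machine ClassAd and can then be read with something like `condor_status -af Machine CumulativeUptimeSeconds`. Because the totals live in the external file rather than in daemon memory, a node that restarts every five minutes still accumulates its uptime instead of resetting to zero.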