Jun 2 – 6, 2025
Fluno Center on the University of Wisconsin-Madison Campus
America/Chicago timezone

Tracking HTCondor Uptime

Jun 5, 2025, 1:30 PM
20m
Howard Auditorium (Fluno Center on the University of Wisconsin-Madison Campus)

601 University Avenue, Madison, WI 53715-1035

Speaker

Michael Pelletier

Description

While the DaemonStartTime and MonitorSelfAge attributes of HTCondor daemons offer some insight into the uptime and availability of the service, they are not well suited to tracking longer-term uptime and downtime statistics over days, weeks, or months.

For example, if a malfunctioning node or service restarts every five minutes, both values reset at each restart, so total uptime is never accumulated across restarts.
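The restart scenario above can be made concrete with a small simulation. This is only an illustrative sketch: it assumes restarts are instantaneous, models the per-daemon age loosely on MonitorSelfAge, and uses hypothetical names (simulate, restart_every, sample_every) that are not part of HTCondor.

```python
def simulate(window, restart_every, sample_every):
    """Return (largest age seen, accumulated uptime) over a window, in seconds.

    Assumption: the daemon restarts instantly every `restart_every` seconds,
    so its self-reported age starts over while the service stays "up".
    """
    max_age = 0
    accumulated = 0
    for t in range(sample_every, window + 1, sample_every):
        age = t % restart_every      # age as reported by the restarted daemon
        max_age = max(max_age, age)
        accumulated += sample_every  # an external tally keeps growing
    return max_age, accumulated


# One hour, restarting every five minutes, sampled once a minute.
print(simulate(3600, 300, 60))  # prints (240, 3600)
```

The age never exceeds four minutes in any sample, while an externally kept total reaches the full hour. That gap is what a long-term uptime record has to close.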

Long-term uptime statistics are essential for managing contractual Service Level Agreements (SLAs) and are an important part of monitoring the overall health of large HTCondor pools.

With straightforward scripting, the start daemon's (condor_startd) Cron mechanism, and an external file that stores an uptime-centered ClassAd, long-term statistics can be maintained and easily queried.
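One way such a script might look is sketched below. It is not the speaker's implementation, only a minimal illustration under stated assumptions: the attribute names (UptimeTotalSeconds, DowntimeTotalSeconds, UptimeLastSample), the state file name, and the rule that a gap longer than twice the sampling period counts as downtime are all hypothetical. The script keeps its running totals in an external file of "Name = value" lines and prints the same lines to standard output, which is the form startd cron jobs use to publish attributes.

```python
#!/usr/bin/env python3
"""Sketch of an uptime accumulator meant to be run periodically by startd cron."""
import os
import sys
import time


def load_state(path):
    """Parse 'Name = integer' lines from the state file; empty dict if absent."""
    state = {}
    if os.path.exists(path):
        with open(path) as handle:
            for line in handle:
                name, sep, value = line.partition("=")
                if sep:
                    state[name.strip()] = int(value.strip())
    return state


def update_state(state, now, period, slack=2):
    """Credit the time since the last sample as uptime, or as downtime when the
    gap is too long for the job to have been running on schedule (assumption)."""
    up = state.get("UptimeTotalSeconds", 0)
    down = state.get("DowntimeTotalSeconds", 0)
    last = state.get("UptimeLastSample")
    if last is not None:
        gap = max(0, now - last)
        if gap <= period * slack:
            up += gap
        else:
            down += gap
    return {
        "UptimeTotalSeconds": up,
        "DowntimeTotalSeconds": down,
        "UptimeLastSample": now,
    }


def format_ad(state):
    """Render the state as 'Name = value' lines."""
    return "".join(f"{name} = {value}\n" for name, value in state.items())


def save_state(path, state):
    """Write to a temporary file, then rename, so a crash cannot truncate the totals."""
    tmp = path + ".tmp"
    with open(tmp, "w") as handle:
        handle.write(format_ad(state))
    os.replace(tmp, path)


def main(argv):
    path = argv[1] if len(argv) > 1 else "uptime.ad"   # hypothetical default
    period = int(argv[2]) if len(argv) > 2 else 60     # seconds between runs
    state = update_state(load_state(path), int(time.time()), period)
    save_state(path, state)
    sys.stdout.write(format_ad(state))


if __name__ == "__main__":
    main(sys.argv)
```

A script along these lines would be registered through the startd cron configuration knobs (STARTD_CRON_JOBLIST plus the per-job EXECUTABLE and PERIOD settings). Once published, the totals could be read with something like condor_status -af Machine UptimeTotalSeconds DowntimeTotalSeconds, assuming those hypothetical attribute names.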
