-
Khyathi Vagolu (UW-Madison) 6/12/26, 11:00 AM
When network disruptions or worker node failures occur, HTCondor relies on a static lease timeout, traditionally 40 minutes, before abandoning a job. This static window creates a costly trade-off: waiting too long causes massive machine idle time on unrecoverable failures, while cutting it too short prematurely kills jobs that could have successfully reconnected. Can we use AI to solve this?...
-
Ilija Vukotic (University of Chicago) 6/12/26, 11:25 AM
-
Tom Smith (Brookhaven National Laboratory) 6/12/26, 11:50 AM
-
Ron Tapia (Penn State University)
Discussion of metrics that Condor can provide about the performance of external services. Condor has a unique view of the performance of the services that it uses on behalf of jobs. Examples of external services include file transfer plugins and credmons. Individual failures are not very interesting to cluster administrators, but widespread failures affecting many jobs are. What sort of...