Speaker
Khyathi Vagolu (UW-Madison)
Description
When a network disruption or worker-node failure occurs, HTCondor waits for a static lease timeout, traditionally 40 minutes, before abandoning the job. This fixed window forces a costly trade-off: a timeout that is too long leaves machines idle after unrecoverable failures, while one that is too short kills jobs that could have reconnected successfully. Can we use AI to solve this? This talk explores how we can dynamically predict an optimal timeout duration by training a simple ML model!