> You could put a health check in the epilog of a job so that after every job 
> the node is checked. If it's in bad shape you can down it. For the normal 
> case with long running jobs this should not be a problem and only one job 
> will fail.

I came to pretty much the same conclusion, except that I put the check in the 
prolog :) This way the node is checked before every job. The health check runs 
in 1-2 s, so it should complete within the message timeout, but I have 
increased that too, just in case. If the prolog fails, the node is first set 
to draining, and because the prolog exits with a non-zero exit code the job is 
automatically rescheduled elsewhere. In the best case this means no failed 
jobs, except for those that were running at the time of the failure, if they 
are impacted. I'll see if this works. 
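
For illustration, a minimal sketch of such a prolog in Python. It is not the 
actual script: the health check path is hypothetical, the explicit scontrol 
call is only one way to drain the node (the scheduler may already do this by 
itself when a prolog fails), and it assumes slurmd exports SLURMD_NODENAME to 
the prolog environment. 

```python
#!/usr/bin/env python3
"""Prolog sketch: drain the node and return non-zero if the health check fails."""
import os
import socket
import subprocess

# Hypothetical path to the site health check script (an assumption).
HEALTH_CHECK = ["/usr/local/sbin/node_health_check"]
# The check is expected to take 1-2 s; keep well below the message timeout.
CHECK_TIMEOUT_S = 10


def node_is_healthy(cmd=HEALTH_CHECK, timeout=CHECK_TIMEOUT_S, run=subprocess.run):
    """Return True if the health check command exits 0 within the timeout."""
    try:
        return run(cmd, timeout=timeout).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        # A missing, unrunnable, or hung check counts as a failure.
        return False


def drain_command(node, reason="prolog health check failed"):
    """Build the scontrol call that marks the node as draining."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


def main(run=subprocess.run):
    """Return the prolog exit code: 0 if healthy, 1 after draining the node."""
    # Assumption: SLURMD_NODENAME is set for the prolog; otherwise use the short hostname.
    node = os.environ.get("SLURMD_NODENAME") or socket.gethostname().split(".")[0]
    if node_is_healthy(run=run):
        return 0
    try:
        run(drain_command(node), timeout=CHECK_TIMEOUT_S)
    except (OSError, subprocess.TimeoutExpired):
        pass  # still report failure so the job does not start on this node
    return 1
```

To use it as a prolog, the script would end with `raise SystemExit(main())` 
so that the return value becomes the exit code. The `run` parameter exists 
only so that the logic can be exercised without a real cluster. 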

Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why 
we do it" 
     -- Richard P. Feynman
