> You could put a health check in the epilog of a job so that after every job
> the node is checked. If it's in bad shape you can down it. For the normal
> case with long running jobs this should not be a problem and only one job
> will fail.
I came to pretty much the same conclusion, except that I put it in the prolog :)
This way the node is checked before every job. The health check runs in 1-2 s,
so it should complete within the message timeout, but I've increased that too,
just in case. If the prolog fails, the node is first set to draining, and
because the prolog exits with a non-zero exit code, the job is automatically
rescheduled elsewhere. In the best case this means no failed jobs except those
that were running at the time of the failure, if they are impacted. We'll see
if this works.
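
For illustration, here is a minimal sketch of what such a prolog wrapper could look like in Python. It is not the actual script in use: the health check path, the 10 s limit and the function names are all assumptions made up for the example. The only behaviour it relies on is the one described above, that a non-zero exit code from the prolog drains the node and sends the job elsewhere.

```python
#!/usr/bin/env python3
"""Hypothetical prolog wrapper: run a node health check before each job."""
import subprocess
import sys

# Assumption: location of the site's health check script (hypothetical path).
HEALTH_CHECK_CMD = ["/usr/local/sbin/node_health_check"]
# Assumption: the check normally finishes in 1-2 s, so 10 s is a generous cap
# that should still be kept below the scheduler's message timeout.
TIMEOUT_SECONDS = 10


def run_health_check(cmd, timeout=TIMEOUT_SECONDS):
    """Return 0 if the health check passes, 1 in every other case."""
    try:
        result = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        # A hung check is treated as a failed node.
        return 1
    except OSError:
        # Missing or non-executable check script: fail safe.
        return 1
    return 0 if result.returncode == 0 else 1


def main():
    # A non-zero exit code here is what makes the prolog fail.
    sys.exit(run_health_check(HEALTH_CHECK_CMD))
```

When installed as the prolog, the script would end with a call to `main()`. A timeout or a missing check script is deliberately counted as a failure, so a node that cannot even run the check does not receive jobs.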
Mario Kadastik, PhD
Researcher
---
"Physics is like sex, sure it may have practical reasons, but that's not why
we do it"
-- Richard P. Feynman