[slurm-dev] Re: Disable black hole nodes automatically

Magnus Jonsson Fri, 08 Feb 2013 01:34:24 -0800

You could put a health check in the epilog of a job, so that the node is checked after every job. If it's in bad shape you can mark it down. In the normal case of long-running jobs the extra check should not be a problem, and only one job will fail before the node is taken out of service.
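
As a rough sketch of what such an epilog could look like (not a tested production script): the check below only verifies that the software area is a readable, non-empty directory. The path in SOFTWARE_AREA, the reason string and the function names are placeholders I made up for illustration. It assumes scontrol is on the PATH of the epilog and that SLURMD_NODENAME is set in the epilog environment, falling back to the local hostname otherwise.

```python
#!/usr/bin/env python3
"""Epilog health check sketch: mark the node DOWN if the software area is gone."""
import os
import subprocess

# Hypothetical location of the shared software area; replace with your site's path.
SOFTWARE_AREA = "/path/to/software"


def software_area_ok(path):
    """Return True if path is a readable directory with at least one entry."""
    try:
        return len(os.listdir(path)) > 0
    except OSError:
        # Missing mount, stale handle, permission problem: treat all as unhealthy.
        return False


def build_down_command(node, reason):
    """Build the scontrol argument list that marks `node` DOWN with a reason."""
    return ["scontrol", "update", f"NodeName={node}", "State=DOWN", f"Reason={reason}"]


def main():
    # Assumption: slurmd exports SLURMD_NODENAME to the epilog; otherwise use the hostname.
    node = os.environ.get("SLURMD_NODENAME") or os.uname().nodename
    if software_area_ok(SOFTWARE_AREA):
        return 0
    # check=False: a failing scontrol call should not raise inside the epilog.
    subprocess.run(build_down_command(node, "epilog_health_check_failed"), check=False)
    return 0

# When saved as the epilog script, finish the file with: raise SystemExit(main())
```

The idea is to keep the check itself cheap, since it runs after every job, and to put the same test your 5 minute self test already uses into software_area_ok() so both agree on what "bad shape" means.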


/Magnus


On 2013-02-08 10:11, Mario Kadastik wrote:


Hi,

I'm wondering if there's a way to detect a fast churn rate for a node. Last night one
node lost the software area, so all jobs that were scheduled on it failed within a few
minutes. (The jobs use wrappers that health-check the environment, so the job exit
code was 0; the wrapper propagated the actual error code to the user's software.) We have
a self test run by Slurm every 5 minutes and it did detect the node failure, but before
it could, the node had "failed" hundreds of jobs in that 5 minute window. We
assume most jobs run for at least tens of minutes, so if Slurm sees a node churning
through jobs in less than a minute it should disable the node. Is there any way to handle
this beyond moving the self test script execution up from 5 minutes to, say, every 30 seconds?

Thanks,

Mario Kadastik, PhD
Researcher

---
   "Physics is like sex, sure it may have practical reasons, but that's not why we do it"
      -- Richard P. Feynman


--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet

