You could put a health check in the job epilog so that the node is checked after every job. If it's in bad shape, you can down it. In the normal case of long-running jobs the extra check should not be a problem, and only one job will fail.
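
As a rough sketch of such an epilog check, assuming the epilog is a Python script and that "bad shape" means the software area has gone missing: the path in SOFTWARE_AREA and the helper names are hypothetical, and the script assumes slurmd exports SLURMD_NODENAME to the epilog and that the epilog runs with enough privilege for `scontrol update`.

```python
#!/usr/bin/env python3
"""Epilog health check sketch: down the node if the software area is gone."""
import os
import subprocess

# Hypothetical path; substitute the site's real software area.
SOFTWARE_AREA = "/path/to/software/area"


def software_area_ok(path):
    """Return True if path is a listable, non-empty directory."""
    try:
        return len(os.listdir(path)) > 0
    except OSError:
        # Missing mount point, stale handle, permission problem, etc.
        return False


def down_command(node, reason):
    """Build the scontrol invocation that marks the node DOWN."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DOWN", f"Reason={reason}"]


def main():
    # Assumed to be set by slurmd for the epilog; do nothing without it.
    node = os.environ.get("SLURMD_NODENAME")
    if node and not software_area_ok(SOFTWARE_AREA):
        subprocess.run(down_command(node, "epilog: software area missing"),
                       check=False)


if __name__ == "__main__":
    main()
```

The check is kept to a single directory listing so it stays cheap enough to run after every job; a site would replace it with whatever its existing self-test script already verifies.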

/Magnus

On 2013-02-08 10:11, Mario Kadastik wrote:

Hi,

I'm wondering if there's a way to detect a fast churn rate for a node. Last night one
node lost the software area, so every job scheduled on it failed within a few minutes.
(The jobs use wrappers that health-check the environment, so the job exit code was 0;
the wrapper propagated the actual error code to the user's software.) We have a self
test that Slurm runs every 5 minutes, and it did detect the node failure, but before it
could, the node had "failed" hundreds of jobs in that 5-minute window. We assume most
jobs run for at least tens of minutes, so if Slurm sees a node churning through jobs in
less than a minute each, it should disable the node. Is there any way to handle this
beyond shortening the self-test interval from 5 minutes to, say, 30 seconds?
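
To make the heuristic above concrete, here is a minimal Python sketch of a per-node churn detector. It is not an existing Slurm feature: the class name and all three thresholds are hypothetical, and it assumes something (for example an epilog) feeds it each job's start and end time in order and persists the state between jobs.

```python
from collections import deque


class ChurnDetector:
    """Flag a node that finishes too many short jobs in a time window."""

    def __init__(self, min_runtime_s=60.0, max_short_jobs=5, window_s=300.0):
        self.min_runtime_s = min_runtime_s    # jobs shorter than this are suspicious
        self.max_short_jobs = max_short_jobs  # this many short jobs trips the detector
        self.window_s = window_s              # only short jobs this recent are counted
        self._short_job_ends = deque()        # end times of recent short jobs

    def record_job(self, start_ts, end_ts):
        """Record a finished job; return True if the node should be disabled.

        Assumes jobs are reported in non-decreasing order of end_ts.
        """
        if end_ts - start_ts < self.min_runtime_s:
            self._short_job_ends.append(end_ts)
        # Forget short jobs that have fallen out of the window.
        cutoff = end_ts - self.window_s
        while self._short_job_ends and self._short_job_ends[0] <= cutoff:
            self._short_job_ends.popleft()
        return len(self._short_job_ends) >= self.max_short_jobs
```

Counting several short jobs rather than reacting to the first one avoids disabling a node because of a single legitimately short job.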

Thanks,

Mario Kadastik, PhD
Researcher

---
   "Physics is like sex, sure it may have practical reasons, but that's not why we 
do it"
      -- Richard P. Feynman


--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
