/Magnus
On 2013-02-08 10:11, Mario Kadastik wrote:
Hi,
I'm wondering if there's a way to detect a fast churn rate for a node. Last night
one node lost its software area, so every job scheduled on it failed within a few
minutes. (The jobs use wrappers that health-check the environment, so the job exit
code was 0; the wrapper propagated the actual error code to the user's software.) We have
a self test run by Slurm every 5 minutes and it did detect the node failure, but before
it could, the node had "failed" hundreds of jobs in that 5-minute window. We
assume most jobs would run for at least tens of minutes, so if Slurm sees a node churning
through jobs in less than a minute it should disable the node. Is there any way to handle
this beyond running the self-test script more often, say every 30 seconds instead of every 5 minutes?
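To make the idea concrete, here is a rough sketch of the kind of check described above, not something Slurm provides out of the box as far as this message knows. It assumes job records are available as (node, start, end) tuples in epoch seconds, however the site collects them; the function name `find_churning_nodes` and the thresholds `window`, `max_runtime` and `min_short_jobs` are hypothetical and would need tuning.

```python
from collections import defaultdict


def find_churning_nodes(jobs, now, window=300, max_runtime=60, min_short_jobs=10):
    """Return the nodes that look like they are churning through jobs.

    jobs           -- iterable of (node, start, end) tuples, times in epoch seconds
                      (assumed input format, not a Slurm data structure)
    now            -- current time in epoch seconds
    window         -- how far back to look, in seconds
    max_runtime    -- a job shorter than this many seconds counts as "short"
    min_short_jobs -- number of short jobs in the window that flags a node
    """
    short_counts = defaultdict(int)
    for node, start, end in jobs:
        finished_recently = now - window <= end <= now
        if finished_recently and (end - start) < max_runtime:
            short_counts[node] += 1
    # Sorted so the result is stable from run to run.
    return sorted(n for n, c in short_counts.items() if c >= min_short_jobs)


if __name__ == "__main__":
    # Illustrative data only: n1 finishes twelve 5-second jobs, n2 one long job.
    sample = [("n1", 1000 + i, 1005 + i) for i in range(12)] + [("n2", 0, 1100)]
    for node in find_churning_nodes(sample, now=1200):
        # A site script could drain the node here, for example with
        #   scontrol update NodeName=<node> State=DRAIN Reason="fast job churn"
        # The sketch only prints so that it has no side effects.
        print("would drain", node)
```

Counting only short jobs, rather than all finished jobs, is meant to avoid flagging a healthy node that happens to complete many long-running jobs at the same moment.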
Thanks,
Mario Kadastik, PhD
Researcher
---
"Physics is like sex, sure it may have practical reasons, but that's not why we
do it"
-- Richard P. Feynman
-- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
