Re: [slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Alex Chekholko
In my most recent experience, I have some SSDs in compute nodes that occasionally just drop off the bus, so the compute node loses its OS disk. I haven't thought about it too hard, but the default NHC scripts do not notice that. Similarly, Paul's proposed script might need to also check that the

Re: [slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Paul Edmon
Since you can run an arbitrary script as a node health checker, I might add a script that counts failures and then closes the node if it hits a threshold. The script shouldn't need to talk to slurmctld or slurmdbd, as it should be able to watch the log on the node and see the failure. -Paul Edmon- On
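
A rough sketch of the kind of script described above, in case it helps. Everything specific here is an assumption rather than something from the thread: the log path, the two regular expressions (placeholders to be replaced with whatever your slurmd log actually prints for a failed or successful job), the threshold of 10 taken from the original question, and the choice to close the node with `scontrol update ... State=DRAIN`. Detection only reads the local log; the drain step itself does go through `scontrol`.

```python
import re
import socket
import subprocess

# Hypothetical placeholders: replace with patterns matching your slurmd log.
FAIL_RE = re.compile(r"JOB_FAILED")
OK_RE = re.compile(r"JOB_OK")
THRESHOLD = 10  # "10 consecutive failed jobs" from the original question


def trailing_failures(lines, fail_re=FAIL_RE, ok_re=OK_RE):
    """Count failures seen since the most recent success."""
    count = 0
    for line in lines:
        if ok_re.search(line):
            count = 0  # a successful job resets the streak
        elif fail_re.search(line):
            count += 1
    return count


def drain_command(node, count):
    """Build the scontrol call that drains the node with a reason."""
    return [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason=health check: {count} consecutive failed jobs",
    ]


def main(log_path="/var/log/slurmd.log"):  # assumed log location
    with open(log_path, errors="replace") as fh:
        count = trailing_failures(fh)
    if count >= THRESHOLD:
        # Assumes the short hostname matches the Slurm NodeName.
        node = socket.gethostname().split(".")[0]
        subprocess.run(drain_command(node, count), check=True)
```

To use it as a health check you would call `main()` at the bottom of the file and point the node health check setting at the script. Re-reading the whole log on every run is the simplest thing that works; on busy nodes you would probably want to remember the last file offset instead.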

[slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Gerhard Strangar
Hello, how do you implement something like "drain host after 10 consecutive failed jobs"? Unlike a host check script, which checks for known errors, I'd like to stop killing jobs just because one node is faulty. Gerhard