On 30/01/15 03:02, Mehdi Denou wrote:

> Maybe you can script something with the "HealthCheckProgram"?

That sounds like a good idea - check the load and, if it's over a
threshold (and the node is not already drained), set the node's state
to "DRAIN" with a reason of "AUTO: over load threshold".

Then, when the load is back under that limit and the node is marked as
drained with that reason (and only that reason), you should be safe to
set its state to "RESUME" to get it running Slurm jobs again.
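
As a rough illustration of the drain/resume logic above, something along
these lines might do as a starting point. It is only a sketch, not a tested
production script: the threshold value, the assumption that the Slurm
NodeName matches the short hostname, and the parsing of the
"scontrol show node" output are all my guesses and should be checked against
your own site.

```python
#!/usr/bin/env python3
"""Sketch of a load-based HealthCheckProgram; untested outside this example."""
import os
import re
import shutil
import socket
import subprocess
import sys

REASON = "AUTO: over load threshold"  # reason string suggested above
LOAD_THRESHOLD = 16.0                 # hypothetical value, tune per node


def decide(load, state, reason, threshold=LOAD_THRESHOLD):
    """Return "DRAIN", "RESUME", or None (leave the node alone)."""
    drained = "DRAIN" in state.upper()
    if load > threshold:
        # Only drain nodes that are not already drained (by us or anyone else).
        return None if drained else "DRAIN"
    # Only resume nodes that were drained with our reason and no other.
    if drained and reason == REASON:
        return "RESUME"
    return None


def parse_node(text):
    """Extract (state, reason) from `scontrol show node` output (format assumed)."""
    state = re.search(r"\bState=(\S+)", text)
    # Drop the trailing "[user@timestamp]" that scontrol appends to the reason.
    reason = re.search(r"\bReason=(.*?)(?:\s+\[[^\]]*\])?\s*$", text, re.MULTILINE)
    return (state.group(1) if state else "", reason.group(1) if reason else "")


def main():
    if shutil.which("scontrol") is None:
        print("scontrol not found; nothing to do", file=sys.stderr)
        return
    node = socket.gethostname().split(".")[0]  # assumes NodeName == short hostname
    shown = subprocess.run(["scontrol", "show", "node", node],
                           capture_output=True, text=True, check=True).stdout
    state, reason = parse_node(shown)
    action = decide(os.getloadavg()[0], state, reason)  # 1-minute load average
    if action == "DRAIN":
        subprocess.run(["scontrol", "update", f"NodeName={node}",
                        "State=DRAIN", f"Reason={REASON}"], check=True)
    elif action == "RESUME":
        subprocess.run(["scontrol", "update", f"NodeName={node}",
                        "State=RESUME"], check=True)


if __name__ == "__main__":
    main()
```

You would point HealthCheckProgram in slurm.conf at wherever you install
the script and set HealthCheckInterval so that it runs periodically. Because
the resume step only fires when the reason matches exactly, nodes drained by
an admin for some other reason are left alone.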

Of course, once your system is all Slurm, turn that off and consider
using cgroups to contain jobs.

Best of luck!
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
