On 30/01/15 03:02, Mehdi Denou wrote:

> Maybe you can script something with the "HealthCheckProgram"?

That sounds like a good idea: check the load, and if it's over a threshold (and the node is not already drained) set the node's state to "DRAIN" with a reason of "AUTO: over load threshold". Then, when the load is back under that limit and the node is marked as drained with that reason (and only that reason), you should be safe to set its state to "RESUME" to get it running Slurm jobs again.

Of course, once your system is all Slurm, turn that off and consider using cgroups to contain jobs.

Best of luck!
Chris
--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected]  Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/  http://twitter.com/vlsci
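
The drain/resume logic described above could be sketched roughly as follows. This is only an illustration, not a tested production script: the threshold value, the use of the short hostname as the Slurm node name, and the assumption that "scontrol show node" prints State= and Reason= in its multi-line format are all guesses that would need checking on a real cluster.

```python
#!/usr/bin/env python3
"""Sketch of a load-based HealthCheckProgram (all tunables are assumptions)."""
import os
import re
import shutil
import socket
import subprocess

AUTO_REASON = "AUTO: over load threshold"  # reason string from the mail above
LOAD_THRESHOLD = 8.0                       # hypothetical value, tune per node


def parse_node(text):
    """Pull State and Reason out of 'scontrol show node' output.

    Assumes Reason= sits on its own line, optionally followed by a
    '[user@timestamp]' suffix, which is stripped.
    """
    state = re.search(r"\bState=(\S+)", text)
    reason = re.search(r"\bReason=(.*?)(?:\s+\[[^\]]*\])?\s*$", text, re.MULTILINE)
    return (state.group(1) if state else "",
            reason.group(1).strip() if reason else "")


def decide(load, threshold, state, reason):
    """Return 'drain', 'resume', or None (leave the node alone)."""
    drained = "DRAIN" in state.upper()
    if load > threshold:
        # Only drain nodes that are not already drained or draining.
        return None if drained else "drain"
    # Only undo a drain that this script set, never an admin's.
    if drained and reason == AUTO_REASON:
        return "resume"
    return None


def main():
    if shutil.which("scontrol") is None:
        return  # not on a Slurm node, nothing to do
    node = socket.gethostname().split(".")[0]  # assumes hostname == NodeName
    out = subprocess.run(["scontrol", "show", "node", node],
                         capture_output=True, text=True, check=True).stdout
    state, reason = parse_node(out)
    action = decide(os.getloadavg()[0], LOAD_THRESHOLD, state, reason)
    if action == "drain":
        subprocess.run(["scontrol", "update", f"NodeName={node}",
                        "State=DRAIN", f"Reason={AUTO_REASON}"], check=True)
    elif action == "resume":
        subprocess.run(["scontrol", "update", f"NodeName={node}",
                        "State=RESUME"], check=True)


if __name__ == "__main__":
    main()
```

To try it, slurm.conf would point HealthCheckProgram at wherever the script is installed, with HealthCheckInterval controlling how often it runs. The decision is kept in a separate decide() function so it can be checked without touching scontrol; the exact-match test on the reason is what keeps the script from resuming a node that an admin drained for some other cause.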
