Look at the HealthCheckProgram parameter in slurm.conf:
HealthCheckProgram
Fully qualified pathname of a script to execute as user root
periodically on all compute nodes that are not in the DOWN state.
This may be used to verify the node is fully operational and DRAIN
the node or send email if a problem is detected. Any action to be
taken must be explicitly performed by the program (e.g. execute
"scontrol update NodeName=foo State=drain Reason=tmp_file_system_full"
to drain a node). The interval is controlled using the
HealthCheckInterval parameter. Note that the HealthCheckProgram will
be executed at the same time on all nodes to minimize its impact upon
parallel programs. This program will be killed if it does not
terminate normally within 60 seconds. By default, no program will be
executed.
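
To make that concrete, here is a rough sketch of what such a script could look like. It is not the script we run; it only illustrates the tmp_file_system_full case from the excerpt. The /tmp path, the 90% threshold, and the idea that the Slurm NodeName equals the short hostname are all assumptions you would need to adjust for your site.

```python
#!/usr/bin/env python3
"""Hypothetical HealthCheckProgram sketch: drain this node if /tmp is nearly full."""
import shutil
import socket
import subprocess

TMP_PATH = "/tmp"          # assumption: the file system worth checking
MAX_USED_FRACTION = 0.90   # assumption: arbitrary threshold, tune per site


def used_fraction(path):
    """Return the fraction of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def drain_command(node, reason):
    """Build the scontrol invocation quoted in the man page excerpt."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=drain", f"Reason={reason}"]


def main():
    # Assumption: the Slurm NodeName equals the short hostname.
    node = socket.gethostname().split(".")[0]
    if used_fraction(TMP_PATH) <= MAX_USED_FRACTION:
        return 0  # node looks healthy, take no action
    cmd = drain_command(node, "tmp_file_system_full")
    if shutil.which("scontrol") is None:
        # Not on a Slurm node: only show what would have been run.
        print(" ".join(cmd))
        return 1
    # The health check is killed after 60 seconds, so bound the call below that.
    return subprocess.run(cmd, timeout=30).returncode
```

To use something like this, you would end the file with a call to main(), install it at a path of your choosing on the compute nodes, and point HealthCheckProgram at that path. Keeping the checks cheap matters because of the 60 second limit.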
We use this along with some proactive checks in the prolog and epilog
scripts. Make sure to look at ReturnToService as well, as it will allow
a node to come back once it again passes the right checks.
--Jerry
On 10/31/11 1:43 PM, "Michael Di Domenico" <[email protected]> wrote:
>Does slurm have any plugins or abilities for black holing a bad node?
>We had a situation recently with a big queue, where a single node that
>could accept jobs but could not run jobs drained the queue for a user,
>but no work was actually done. A cursory look at the docs and an
>internet search didn't turn up anything, not even someone else asking
>the same question, which seems odd...
>