Look at the HealthCheckProgram parameter in slurm.conf:

       HealthCheckProgram
              Fully qualified pathname of a script to execute as user root
              periodically on all compute nodes that are not in the DOWN
              state. This may be used to verify the node is fully operational
              and DRAIN the node or send email if a problem is detected. Any
              action to be taken must be explicitly performed by the program
              (e.g. execute "scontrol update NodeName=foo State=drain
              Reason=tmp_file_system_full" to drain a node). The interval is
              controlled using the HealthCheckInterval parameter. Note that
              the HealthCheckProgram will be executed at the same time on all
              nodes to minimize its impact upon parallel programs. This
              program will be killed if it does not terminate normally within
              60 seconds. By default, no program will be executed.
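
A minimal sketch of such a program, in Python, might look like the following. It only illustrates the pattern described in the man page excerpt (check something, then run scontrol yourself if the check fails). The /tmp path, the 95% threshold, and the assumption that the Slurm NodeName equals the short hostname are placeholders for illustration, not anything taken from a real site.

```python
#!/usr/bin/env python3
"""Sketch of a HealthCheckProgram: drain this node if /tmp is nearly full."""
import shutil
import socket
import subprocess

TMP_PATH = "/tmp"          # assumption: the file system worth watching
MAX_USED_FRACTION = 0.95   # assumption: drain the node above 95% usage
SCONTROL_TIMEOUT = 30      # seconds; stays well under Slurm's 60 second kill limit


def used_fraction(path):
    """Return the fraction (0.0 to 1.0) of the file system holding `path` in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def drain_command(node, reason):
    """Build the scontrol invocation quoted in the man page excerpt."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=drain", f"Reason={reason}"]


def main():
    if used_fraction(TMP_PATH) <= MAX_USED_FRACTION:
        return  # node looks healthy, take no action
    # Assumption: the NodeName in slurm.conf matches the short hostname.
    node = socket.gethostname().split(".")[0]
    try:
        subprocess.run(drain_command(node, "tmp_file_system_full"),
                       check=False, timeout=SCONTROL_TIMEOUT)
    except (OSError, subprocess.SubprocessError):
        pass  # scontrol missing or hung; nothing more this sketch can do


if __name__ == "__main__":
    main()
```

The Reason string must not contain spaces unless it is quoted, which is why the example reason uses underscores. Additional checks (memory, mounts, interconnect) would follow the same shape: test, then call scontrol explicitly.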


We use this along with some proactive checks in the prolog and epilog.



Make sure to look at ReturnToService as well, as it will allow a node to
come back once it passes the right checks again.
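
For illustration, the relevant slurm.conf lines might look something like the fragment below. The script path and the 300 second interval are hypothetical placeholders. Per the slurm.conf documentation, ReturnToService=1 returns a DOWN node to service when it registers with a valid configuration only if it was set DOWN for being non-responsive, while 2 does so regardless of why it was set DOWN, and 0 leaves it DOWN until an administrator intervenes.

```
# Hypothetical path to the site's health check script
HealthCheckProgram=/usr/local/sbin/node_health.py
# Run the check every 300 seconds (0 disables it)
HealthCheckInterval=300
# Let non-responsive DOWN nodes rejoin once they register cleanly
ReturnToService=1
```

Note that ReturnToService concerns DOWN nodes. A node that the health check has drained is normally put back with "scontrol update NodeName=foo State=resume".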


--Jerry



On 10/31/11 1:43 PM, "Michael Di Domenico" <[email protected]> wrote:

>Does slurm have any plugins or abilities for black holing a bad node?
>We had a situation recently with a big queue, where a single node that
>could accept jobs but could not run jobs drained the queue for a user,
>but no work was actually done.  A cursory look at the docs and an
>internet search didn't turn up anything, not even someone else asking
>the same question, which seems odd...
>


