The problem was that if more than one job was running on the node when it
was drained, then when a running job completed another job could be started
on the node (at least for some time). The following patch fixes the problem
and will be included in Slurm v2.2.4, which we plan to release next week.

Index: src/slurmctld/node_mgr.c
===================================================================
--- src/slurmctld/node_mgr.c    (revision 22854)
+++ src/slurmctld/node_mgr.c    (working copy)
@@ -2543,12 +2543,14 @@
                                                slurm_get_slurm_user_id());
        } else if (node_ptr->run_job_cnt) {
                node_ptr->node_state = NODE_STATE_ALLOCATED | node_flags;
-               if (!IS_NODE_NO_RESPOND(node_ptr))
+               if (!IS_NODE_NO_RESPOND(node_ptr) &&
+                    !IS_NODE_FAIL(node_ptr) && !IS_NODE_DRAIN(node_ptr))
                        bit_set(avail_node_bitmap, inx);
                bit_set(up_node_bitmap, inx);
        } else {
                node_ptr->node_state = NODE_STATE_IDLE | node_flags;
-               if (!IS_NODE_NO_RESPOND(node_ptr))
+               if (!IS_NODE_NO_RESPOND(node_ptr) &&
+                    !IS_NODE_FAIL(node_ptr) && !IS_NODE_DRAIN(node_ptr))
                        bit_set(avail_node_bitmap, inx);
                if (!IS_NODE_NO_RESPOND(node_ptr) &&
                    !IS_NODE_COMPLETING(node_ptr))

________________________________________
From: [email protected] [[email protected]] On Behalf 
Of Bjørn-Helge Mevik [[email protected]]
Sent: Thursday, March 24, 2011 4:47 AM
To: [email protected]
Subject: [slurm-dev] Draining of nodes don't work!?

It seems that "scontrol update nodename=foo state=drain" does not
prevent new jobs from starting on the node.  We've verified this on our
test cluster with an unpatched slurm 2.2.3.

Because we always put a maintenance reservation on nodes we drain, we
hadn't discovered this earlier.

I guess there must be something strange with our configuration that
triggers this behaviour, otherwise it would have been discovered
already.  See the attached slurm.conf below.


Here is a demonstration:

teflon 774(1)# scontrol show node compute-0-1
NodeName=compute-0-1 Arch=x86_64 CoresPerSocket=2
   CPUAlloc=4 CPUErr=0 CPUTot=4 Features=intel,rack0,ib,sse
   Gres=(null)
   OS=Linux RealMemory=3018 Sockets=2
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=10000 Weight=794
   BootTime=2010-06-07T13:30:58 SlurmdStartTime=2011-03-24T11:17:53
   Reason=(null)

teflon 775(1)# squeue -n compute-0-1
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  10221    normal arraytes      bhm   R       1:51      1 compute-0-1
  10222    normal  10221.1      bhm   R       1:51      1 compute-0-1
  10219    normal arraytes      bhm   R       1:52      1 compute-0-1
  10220    normal  10219.1      bhm   R       1:52      1 compute-0-1
teflon 776(1)# scontrol update nodename=compute-0-1 state=drain reason=testing
teflon 777(1)# scontrol show node compute-0-1
NodeName=compute-0-1 Arch=x86_64 CoresPerSocket=2
   CPUAlloc=4 CPUErr=0 CPUTot=4 Features=intel,rack0,ib,sse
   Gres=(null)
   OS=Linux RealMemory=3018 Sockets=2
   State=ALLOCATED+DRAIN ThreadsPerCore=1 TmpDisk=10000 Weight=794
   BootTime=2010-06-07T13:30:58 SlurmdStartTime=2011-03-24T11:17:53
   Reason=testing [root@2011-03-24T12:27:39]

teflon 778(1)# squeue -n compute-0-1
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  10221    normal arraytes      bhm   R       2:29      1 compute-0-1
  10222    normal  10221.1      bhm   R       2:29      1 compute-0-1
  10219    normal arraytes      bhm   R       2:30      1 compute-0-1
  10220    normal  10219.1      bhm   R       2:30      1 compute-0-1
teflon 779(1)# scancel 10222
teflon 780(1)# squeue -n compute-0-1
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  10221    normal arraytes      bhm   R       2:41      1 compute-0-1
  10219    normal arraytes      bhm   R       2:42      1 compute-0-1
  10220    normal  10219.1      bhm   R       2:42      1 compute-0-1
  10272    normal 10221.11      bhm   R       0:03      1 compute-0-1

So job 10272 started after job 10222 was cancelled.

We thought it might be that slurm allows short jobs to start if they
will finish before the longest running job finishes (according to the
--time specification), so we cancelled the two jobs with longer --time
(10219 and 10221).  But new jobs continued to start:

teflon 781(1)# squeue -n compute-0-1
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  10324    normal 10219.21      bhm   R       0:23      1 compute-0-1
  10221    normal arraytes      bhm   R       5:26      1 compute-0-1
  10219    normal arraytes      bhm   R       5:27      1 compute-0-1
  10272    normal 10221.11      bhm   R       2:48      1 compute-0-1
teflon 782(1)# scancel 10219
teflon 783(1)# squeue -n compute-0-1
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  10373    normal 10223.28      bhm   R       0:02      1 compute-0-1
  10374    normal 10229.30      bhm   R       0:02      1 compute-0-1
  10221    normal arraytes      bhm   R       5:47      1 compute-0-1
  10272    normal 10221.11      bhm   R       3:09      1 compute-0-1
teflon 784(1)# scancel 10221
teflon 785(1)# squeue -n compute-0-1
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  10391    normal 10223.32      bhm   R       0:01      1 compute-0-1
  10393    normal 10226.32      bhm   R       0:01      1 compute-0-1
  10373    normal 10223.28      bhm   R       0:18      1 compute-0-1
  10374    normal 10229.30      bhm   R       0:18      1 compute-0-1


Is this a bug, or are we misunderstanding something here?
