The problem was that if more than one job was running on the node when it was
drained, then when a running job completed another job could be started on
the node (at least for some time). The following patch fixes the problem and
will be in Slurm v2.2.4, which we plan to release next week.
Index: src/slurmctld/node_mgr.c
===================================================================
--- src/slurmctld/node_mgr.c (revision 22854)
+++ src/slurmctld/node_mgr.c (working copy)
@@ -2543,12 +2543,14 @@
 			slurm_get_slurm_user_id());
 	} else if (node_ptr->run_job_cnt) {
 		node_ptr->node_state = NODE_STATE_ALLOCATED | node_flags;
-		if (!IS_NODE_NO_RESPOND(node_ptr))
+		if (!IS_NODE_NO_RESPOND(node_ptr) &&
+		    !IS_NODE_FAIL(node_ptr) && !IS_NODE_DRAIN(node_ptr))
 			bit_set(avail_node_bitmap, inx);
 		bit_set(up_node_bitmap, inx);
 	} else {
 		node_ptr->node_state = NODE_STATE_IDLE | node_flags;
-		if (!IS_NODE_NO_RESPOND(node_ptr))
+		if (!IS_NODE_NO_RESPOND(node_ptr) &&
+		    !IS_NODE_FAIL(node_ptr) && !IS_NODE_DRAIN(node_ptr))
 			bit_set(avail_node_bitmap, inx);
 		if (!IS_NODE_NO_RESPOND(node_ptr) &&
 		    !IS_NODE_COMPLETING(node_ptr))
________________________________________
From: [email protected] [[email protected]] On Behalf Of Bjørn-Helge Mevik [[email protected]]
Sent: Thursday, March 24, 2011 4:47 AM
To: [email protected]
Subject: [slurm-dev] Draining of nodes don't work!?
It seems that "scontrol update nodename=foo state=drain" does not
prevent new jobs from starting on the node. We've verified this on our
test cluster with an unpatched slurm 2.2.3.
Because we always put a maintenance reservation on nodes we drain, we
hadn't discovered this earlier.
I guess there must be something strange with our configuration that
triggers this behaviour, otherwise it would have been discovered
already. See the attached slurm.conf below.
Here is a demonstration:
teflon 774(1)# scontrol show node compute-0-1
NodeName=compute-0-1 Arch=x86_64 CoresPerSocket=2
CPUAlloc=4 CPUErr=0 CPUTot=4 Features=intel,rack0,ib,sse
Gres=(null)
OS=Linux RealMemory=3018 Sockets=2
State=ALLOCATED ThreadsPerCore=1 TmpDisk=10000 Weight=794
BootTime=2010-06-07T13:30:58 SlurmdStartTime=2011-03-24T11:17:53
Reason=(null)
teflon 775(1)# squeue -n compute-0-1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10221 normal arraytes bhm R 1:51 1 compute-0-1
10222 normal 10221.1 bhm R 1:51 1 compute-0-1
10219 normal arraytes bhm R 1:52 1 compute-0-1
10220 normal 10219.1 bhm R 1:52 1 compute-0-1
teflon 776(1)# scontrol update nodename=compute-0-1 state=drain reason=testing
teflon 777(1)# scontrol show node compute-0-1
NodeName=compute-0-1 Arch=x86_64 CoresPerSocket=2
CPUAlloc=4 CPUErr=0 CPUTot=4 Features=intel,rack0,ib,sse
Gres=(null)
OS=Linux RealMemory=3018 Sockets=2
State=ALLOCATED+DRAIN ThreadsPerCore=1 TmpDisk=10000 Weight=794
BootTime=2010-06-07T13:30:58 SlurmdStartTime=2011-03-24T11:17:53
Reason=testing [root@2011-03-24T12:27:39]
teflon 778(1)# squeue -n compute-0-1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10221 normal arraytes bhm R 2:29 1 compute-0-1
10222 normal 10221.1 bhm R 2:29 1 compute-0-1
10219 normal arraytes bhm R 2:30 1 compute-0-1
10220 normal 10219.1 bhm R 2:30 1 compute-0-1
teflon 779(1)# scancel 10222
teflon 780(1)# squeue -n compute-0-1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10221 normal arraytes bhm R 2:41 1 compute-0-1
10219 normal arraytes bhm R 2:42 1 compute-0-1
10220 normal 10219.1 bhm R 2:42 1 compute-0-1
10272 normal 10221.11 bhm R 0:03 1 compute-0-1
So job 10272 started after job 10222 was cancelled.
We thought it might be that slurm allows short jobs to start if they
finish before the longest running job finishes (according to the --time
specification), so we cancelled the two jobs with longer --time (10219
and 10221). But new jobs continued to start:
teflon 781(1)# squeue -n compute-0-1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10324 normal 10219.21 bhm R 0:23 1 compute-0-1
10221 normal arraytes bhm R 5:26 1 compute-0-1
10219 normal arraytes bhm R 5:27 1 compute-0-1
10272 normal 10221.11 bhm R 2:48 1 compute-0-1
teflon 782(1)# scancel 10219
teflon 783(1)# squeue -n compute-0-1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10373 normal 10223.28 bhm R 0:02 1 compute-0-1
10374 normal 10229.30 bhm R 0:02 1 compute-0-1
10221 normal arraytes bhm R 5:47 1 compute-0-1
10272 normal 10221.11 bhm R 3:09 1 compute-0-1
teflon 784(1)# scancel 10221
teflon 785(1)# squeue -n compute-0-1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10391 normal 10223.32 bhm R 0:01 1 compute-0-1
10393 normal 10226.32 bhm R 0:01 1 compute-0-1
10373 normal 10223.28 bhm R 0:18 1 compute-0-1
10374 normal 10229.30 bhm R 0:18 1 compute-0-1
Is this a bug, or are we misunderstanding something here?