Hello All,
Aargh. After downgrading from SLURM 14.11 to 14.03.10 (14.11 has problems
accumulating Raw Usage on our setup), the original problem that prompted
the upgrade from an older version is back :-/
In short, a job sent to the lower-priority partition will never be
scheduled on a node shared by three partitions. If a (free) node
belonging to three partitions is requested explicitly, the job gets stuck
in the (ReqNodeNotAvail) state.
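For completeness, a minimal batch script that triggers this on our setup
(node020 is one of our nodes belonging to all three partitions; the
partition and node names are of course specific to our cluster):

```shell
#!/bin/bash
# Minimal reproducer: request 1 CPU on a specific free node that
# belongs to three partitions, via the low-priority partition.
# Under 14.03.10 this job gets stuck with reason (ReqNodeNotAvail).
#SBATCH -p backfill
#SBATCH -w node020
#SBATCH -n 1
#SBATCH -t 00:05:00
hostname
```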
This was fixed in 14.11.0, but now we had to go back to 14.03.10, and the
problem appeared again. Since 14.11 is fairly recent, perhaps someone
remembers a change that was made to fix something, possibly related to
this?
I tried comparing the sources, but the differences were too large; I only
got as far as recognizing that for _some_ reason the
ESLURM_NODE_NOT_AVAIL error code is returned somewhere, apparently in
node_scheduler.c or step_mgr.c...
The logfile doesn't show anything I can make sense of. Here's the extract
for a job requesting 1 CPU on a free node belonging to three partitions:
[2014-12-11T18:58:35.932] debug2: initial priority for job 6149 is 13958
[2014-12-11T18:58:35.932] debug3: _pick_best_nodes: job 6149 idle_nodes 5
share_nodes 26
[2014-12-11T18:58:35.932] debug2: select_p_job_test for job 6149
[2014-12-11T18:58:35.932] debug2: sched: JobId=6149 allocated resources:
NodeList=(null)
[2014-12-11T18:58:35.932] _slurm_rpc_submit_batch_job JobId=6149 usec=1179
[2014-12-11T18:58:35.933] debug3: Writing job id 6149 to header record of
job_state file
[2014-12-11T18:58:36.775] debug3: JobId=6149 required nodes not avail
[2014-12-11T18:58:48.009] debug2: select_p_job_test for job 6149
[2014-12-11T18:59:18.009] debug2: select_p_job_test for job 6149
Here's the log for a job requesting a specific node belonging to only 2
partitions, but one that is full:
[2014-12-11T18:59:54.431] debug2: initial priority for job 6151 is 13958
[2014-12-11T18:59:54.432] debug3: _pick_best_nodes: job 6151 idle_nodes 5
share_nodes 26
[2014-12-11T18:59:54.432] debug2: select_p_job_test for job 6151
[2014-12-11T18:59:54.432] debug2: sched: JobId=6151 allocated resources:
NodeList=(null)
[2014-12-11T18:59:54.432] _slurm_rpc_submit_batch_job JobId=6151 usec=1169
[2014-12-11T18:59:54.433] debug3: Writing job id 6151 to header record of
job_state file
[2014-12-11T18:59:54.802] debug3: _pick_best_nodes: job 6151 idle_nodes 5
share_nodes 26
[2014-12-11T18:59:54.802] debug2: select_p_job_test for job 6151
[2014-12-11T19:14:18.160] debug3: sched: JobId=6151. State=PENDING.
Reason=Resources. Priority=13946. Partition=backfill.
Finally, the log for a job requesting a specific node belonging to only 2
partitions, with free cores:
[2014-12-11T19:18:41.296] debug2: initial priority for job 6152 is 13940
[2014-12-11T19:18:41.296] debug3: _pick_best_nodes: job 6152 idle_nodes 5
share_nodes 26
[2014-12-11T19:18:41.296] debug2: select_p_job_test for job 6152
[2014-12-11T19:18:41.296] debug2: sched: JobId=6152 allocated resources:
NodeList=(null)
[2014-12-11T19:18:41.296] _slurm_rpc_submit_batch_job JobId=6152 usec=1221
[2014-12-11T19:18:41.297] debug3: Writing job id 6152 to header record of
job_state file
[2014-12-11T19:18:42.271] debug3: _pick_best_nodes: job 6152 idle_nodes 5
share_nodes 26
[2014-12-11T19:18:42.271] debug2: select_p_job_test for job 6152
[2014-12-11T19:18:42.272] debug3: cons_res: _add_job_to_res: job 6152 act 0
[2014-12-11T19:18:42.272] debug3: cons_res: adding job 6152 to part backfill
row 0
[2014-12-11T19:18:42.272] debug3: sched: JobId=6152 initiated
[2014-12-11T19:18:42.272] sched: Allocate JobId=6152 NodeList=node001 #CPUs=1
[2014-12-11T19:18:42.519] debug2: prolog_slurmctld job 6152 prolog completed
Cheers,
Mikael J.
http://www.iki.fi/~mpjohans/
---------- Forwarded message ----------
Date: Thu, 27 Nov 2014 03:34:59 -0800
From: Mikael Johansson <[email protected]>
To: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: Odd (ReqNodeNotAvail) and (PartitionNodeLimit) with
multiple partitions
Hello All,
Just an update on this: it seems it was indeed just a bug in the old 2.2.7
version of SLURM; after an upgrade to 14.11.0, nodes shared by three
partitions no longer confuse SLURM.
Cheers,
Mikael J.
http://www.iki.fi/~mpjohans/
On Tue, 21 Oct 2014, Mikael Johansson wrote:
Hello All,
I had a problem with jobs being stuck in the queue and not being scheduled
even though there were unused cores on the cluster. The system has four
partitions: three different "high priority" ones and one lower-priority
"backfill" partition. A concise description of the setup in slurm.conf,
SLURM 2.2.7:
PartitionName=backfill Nodes=node[001-026] Default=NO MaxNodes=10
MaxTime=168:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO
Hidden=NO Shared=NO PreemptMode=requeue
PartitionName=short Nodes=node[005-026] Default=YES MaxNodes=6
MaxTime=002:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO
Hidden=NO Shared=NO PreemptMode=off
PartitionName=medium Nodes=node[009-026] Default=NO MaxNodes=4
MaxTime=168:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO
Hidden=NO Shared=NO PreemptMode=off
PartitionName=long Nodes=node[001-004] Default=NO MaxNodes=4
MaxTime=744:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO
Hidden=NO Shared=NO PreemptMode=off
SchedulerType=sched/builtin
PreemptType=preempt/partition_prio
PreemptMode=requeue
(I'll send more of course if needed)
The problem here is that backfill jobs will only be scheduled to run on
nodes node[001-008]; they never start on nodes node[009-026]. I tested
this by submitting a job explicitly to a specific node (node020) in two
different ways; both leave the job stuck in different, odd states:
#SBATCH -w node020:
the job gets status (ReqNodeNotAvail), and the log shows "debug2: sched:
JobId=NNNNNN allocated resources: NodeList=(null)" and "debug3:
JobId=NNNNNN required nodes not avail"
#SBATCH -x node[001-019,021-026]:
the job gets status (PartitionNodeLimit), and the log shows "debug3:
JobId=NNNNNN not runnable with present config"
I have no idea how SLURM arrives at these conclusions. In order to find out
what's going on, the following _does_ start the jobs (but breaks the
configuration, of course):
1. Increasing the priority of the backfill partition to 2, the same as
the others
2. Removing node020 from all other partitions
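In slurm.conf terms, the two changes above look like this (sketches
against the config I posted; either one alone is enough to get the job
started, and both break the intended setup):

```
# Workaround 1: raise the backfill partition to Priority=2, the same
# as the other partitions (destroys the low-priority semantics):
PartitionName=backfill Nodes=node[001-026] Default=NO MaxNodes=10
MaxTime=168:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO
Hidden=NO Shared=NO PreemptMode=requeue

# Workaround 2: take node020 out of every partition except backfill,
# e.g. for the short partition:
PartitionName=short Nodes=node[005-019,021-026] Default=YES MaxNodes=6
MaxTime=002:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO
Hidden=NO Shared=NO PreemptMode=off
```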
I also thought it might somehow be related to the fact that nodes
node[009-026] are shared by three partitions (instead of just two, like
the other nodes), which perhaps confuses SLURM 2.2.7. Removing node020
from, for example, the short partition, leaving it in only medium and
backfill, does not help, though.
However, removing node020 from the medium partition, leaving it only in the
short and backfill partitions, does work, and the job starts in backfill
without problems.
To me this sounds like an odd bug, but perhaps I'm missing something. If
it is a bug, and known to be fixed in later versions, that would be a
good reason to upgrade SLURM to something a bit more modern. At the same
time, if someone comes up with a workaround, that would, at least in the
short term, be a much easier solution to implement.
So again, all ideas and suggestions, or just explanations of the odd job
states, are most welcome!
Cheers,
Mikael J.
http://www.iki.fi/~mpjohans/