Hello All,

Aargh. After downgrading from SLURM 14.11 to 14.03.10 (14.11 has problems with accumulating Raw Usage on our setup), the original problem, and the reason for upgrading from an older version in the first place, is back :-/

In short, a job sent to the lower-priority partition will never be scheduled on a node shared by three partitions. If a (free) node belonging to three partitions is explicitly requested, the job gets stuck in the (ReqNodeNotAvail) state.

This was fixed in 14.11.0, but now that we've had to go back to 14.03.10, the problem has reappeared. As 14.11 is rather recent, perhaps someone remembers a change made there that might be related to this?


I tried to compare the sources, but the differences were too large; I only got as far as recognizing that for _some_ reason the ESLURM_NODE_NOT_AVAIL error code is returned somewhere, apparently in node_scheduler.c or step_mgr.c...
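In case it helps anyone repeat the search, here's a rough sketch of how I hunted for the error code; the source path at the bottom is just an assumption about where the 14.03 tree happens to be unpacked:

```python
# Sketch: walk a SLURM source tree and list every C source line that
# mentions ESLURM_NODE_NOT_AVAIL, to narrow down where it is returned.
import os

def find_symbol(root, symbol="ESLURM_NODE_NOT_AVAIL"):
    """Return (path, line_no, line) for lines mentioning `symbol` in .c/.h files."""
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as fh:
                for no, line in enumerate(fh, 1):
                    if symbol in line:
                        hits.append((path, no, line.strip()))
    return hits

if __name__ == "__main__":
    # Assumed unpack location -- adjust to wherever the sources actually live.
    for path, no, line in find_symbol("slurm-14.03.10/src"):
        print(f"{path}:{no}: {line}")
```
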


The logfile doesn't show anything I can make sense of. Here's an extract for a job requesting 1 CPU on a free node belonging to three partitions:

[2014-12-11T18:58:35.932] debug2: initial priority for job 6149 is 13958
[2014-12-11T18:58:35.932] debug3: _pick_best_nodes: job 6149 idle_nodes 5 share_nodes 26
[2014-12-11T18:58:35.932] debug2: select_p_job_test for job 6149
[2014-12-11T18:58:35.932] debug2: sched: JobId=6149 allocated resources: NodeList=(null)
[2014-12-11T18:58:35.932] _slurm_rpc_submit_batch_job JobId=6149 usec=1179
[2014-12-11T18:58:35.933] debug3: Writing job id 6149 to header record of job_state file
[2014-12-11T18:58:36.775] debug3: JobId=6149 required nodes not avail
[2014-12-11T18:58:48.009] debug2: select_p_job_test for job 6149
[2014-12-11T18:59:18.009] debug2: select_p_job_test for job 6149


Here's the log for a job requesting a specific node belonging to only 2 partitions, but one that is full:

[2014-12-11T18:59:54.431] debug2: initial priority for job 6151 is 13958
[2014-12-11T18:59:54.432] debug3: _pick_best_nodes: job 6151 idle_nodes 5 share_nodes 26
[2014-12-11T18:59:54.432] debug2: select_p_job_test for job 6151
[2014-12-11T18:59:54.432] debug2: sched: JobId=6151 allocated resources: NodeList=(null)
[2014-12-11T18:59:54.432] _slurm_rpc_submit_batch_job JobId=6151 usec=1169
[2014-12-11T18:59:54.433] debug3: Writing job id 6151 to header record of job_state file
[2014-12-11T18:59:54.802] debug3: _pick_best_nodes: job 6151 idle_nodes 5 share_nodes 26
[2014-12-11T18:59:54.802] debug2: select_p_job_test for job 6151
[2014-12-11T19:14:18.160] debug3: sched: JobId=6151. State=PENDING. Reason=Resources. Priority=13946. Partition=backfill.


Finally, the log for a job requesting a specific node belonging to only 2 partitions, with free cores:

[2014-12-11T19:18:41.296] debug2: initial priority for job 6152 is 13940
[2014-12-11T19:18:41.296] debug3: _pick_best_nodes: job 6152 idle_nodes 5 share_nodes 26
[2014-12-11T19:18:41.296] debug2: select_p_job_test for job 6152
[2014-12-11T19:18:41.296] debug2: sched: JobId=6152 allocated resources: NodeList=(null)
[2014-12-11T19:18:41.296] _slurm_rpc_submit_batch_job JobId=6152 usec=1221
[2014-12-11T19:18:41.297] debug3: Writing job id 6152 to header record of job_state file
[2014-12-11T19:18:42.271] debug3: _pick_best_nodes: job 6152 idle_nodes 5 share_nodes 26
[2014-12-11T19:18:42.271] debug2: select_p_job_test for job 6152
[2014-12-11T19:18:42.272] debug3: cons_res: _add_job_to_res: job 6152 act 0
[2014-12-11T19:18:42.272] debug3: cons_res: adding job 6152 to part backfill row 0
[2014-12-11T19:18:42.272] debug3: sched: JobId=6152 initiated
[2014-12-11T19:18:42.272] sched: Allocate JobId=6152 NodeList=node001 #CPUs=1
[2014-12-11T19:18:42.519] debug2: prolog_slurmctld job 6152 prolog completed


Cheers,
    Mikael J.
    http://www.iki.fi/~mpjohans/


---------- Forwarded message ----------
Date: Thu, 27 Nov 2014 03:34:59 -0800
From: Mikael Johansson <[email protected]>
To: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: Odd (ReqNodeNotAvail) and (PartitionNodeLimit) with
    multiple partitions


Hello All,

Just an update on this: it seems it was indeed just a bug in the old 2.2.7 version of SLURM; after an upgrade to 14.11.0, nodes shared by three partitions no longer confuse SLURM.

Cheers,
     Mikael J.
     http://www.iki.fi/~mpjohans/

On Tue, 21 Oct 2014, Mikael Johansson wrote:


 Hello All,

 I had a problem with jobs being stuck in the queue and not being scheduled
 even with unused cores on the cluster. The system has four partitions, three
 different "high priority" ones and one lower priority, "backfill" partition.
 A concise description of the setup in slurm.conf, SLURM 2.2.7:


 PartitionName=backfill Nodes=node[001-026] Default=NO  MaxNodes=10 MaxTime=168:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO PreemptMode=requeue
 PartitionName=short    Nodes=node[005-026] Default=YES MaxNodes=6  MaxTime=002:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO PreemptMode=off
 PartitionName=medium   Nodes=node[009-026] Default=NO  MaxNodes=4  MaxTime=168:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO PreemptMode=off
 PartitionName=long     Nodes=node[001-004] Default=NO  MaxNodes=4  MaxTime=744:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO PreemptMode=off

 SchedulerType=sched/builtin
 PreemptType=preempt/partition_prio
 PreemptMode=requeue


 (I'll send more of course if needed)
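 For what it's worth, a quick standalone sketch (plain Python, nothing SLURM-specific; node ranges copied from the excerpt above) shows how many partitions each node belongs to:

```python
# Sketch: expand the Nodes= ranges from the slurm.conf excerpt above and
# count partition membership per node. Standalone illustration, not SLURM code.

def expand(prefix, ranges):
    """Expand a hostlist-style range spec like '001-026' into node names."""
    names = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo
        width = len(lo)
        names += [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]
    return names

# Ranges copied verbatim from the slurm.conf excerpt.
partitions = {
    "backfill": expand("node", "001-026"),
    "short":    expand("node", "005-026"),
    "medium":   expand("node", "009-026"),
    "long":     expand("node", "001-004"),
}

# Invert: node -> list of partitions it belongs to.
membership = {}
for pname, nodes in partitions.items():
    for node in nodes:
        membership.setdefault(node, []).append(pname)

for node, parts in sorted(membership.items()):
    print(node, len(parts), ",".join(sorted(parts)))
```

 This puts node[009-026] in three partitions (backfill, short, medium) and node[001-008] in two, which matches exactly the set of nodes on which backfill jobs never start.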


 The problem here is that the backfill jobs will only be scheduled to run on
 nodes node[001-008]; they never start on nodes node[009-026]. I tested this
 by submitting a job explicitly to a specific node (node020) in two ways;
 both lead to the job getting stuck in a different, odd state:

 #SBATCH -w node020:
   the job gets status (ReqNodeNotAvail), and the log shows "debug2: sched:
   JobId=NNNNNN allocated resources: NodeList=(null)" and "debug3:
   JobId=NNNNNN required nodes not avail"

 #SBATCH -x node[001-019,021-026]:
   the job gets status (PartitionNodeLimit), and the log shows "debug3:
   JobId=NNNNNN not runnable with present config"


 I have no idea how SLURM arrives at these conclusions. While trying to find
 out what's going on, I found that either of the following _does_ start the
 jobs (but breaks the configuration, of course):

 1. Increasing the priority of PartitionName=backfill to the same as the
    others, 2

 2. Removing node020 from all other partitions


 I also thought it might be somehow related to the fact that nodes
 node[009-026] are shared by three partitions (instead of just 2, like the
 other nodes), which perhaps confuses SLURM 2.2.7. Removing node020 from, for
 example, the short partition, leaving it in only medium and backfill does
 not help, though.

 However, removing node020 from the medium partition, leaving it only in the
 short and backfill partitions, does work, and the job starts in backfill
 without problems.

 To me this sounds like an odd bug, but perhaps I'm missing something. If it
 is a bug, and known to be fixed in later versions, that would be a good
 reason to force an upgrade of our SLURM to something a bit more modern. But
 if someone comes up with a workaround, that would be a much easier solution
 to implement, at least in the short term.


 So again, all ideas and suggestions, or just explanations of the odd job
 states are most welcome!


 Cheers,
     Mikael J.
     http://www.iki.fi/~mpjohans/
