Hello list,
we've got a strange problem: the main scheduling loop only looks at the
top default_queue_depth jobs in the queue, and the rest remain pending with
reason "None". It looks as though a full sweep of the queue is either never
performed or aborted somehow. Low-priority jobs keep waiting for hours or
even days in the "None" pending state until the number of pending jobs is
small enough.
Here's some info about our install:
slurm 2.6.10
MessageTimeout=60
SchedulerType=sched/backfill
SchedulerParameters=bf_window=21600,bf_resolution=360,bf_interval=120,default_queue_depth=400,max_job_bf=400,bf_max_job_user=50,max_rpc_cnt=100
FastSchedule=1
We have long jobs (up to 15 days), hence the large bf_window and the high
bf_resolution. At first we tried defer and bf_continue to reduce
overhead, but that didn't improve the situation.
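For reference, the arithmetic behind the bf_window and bf_resolution values above can be sketched in a few lines of Python. This is only a sketch: it assumes, as the slurm.conf documentation describes, that bf_window is given in minutes and bf_resolution in seconds, and the helper name parse_sched_params is hypothetical and not part of Slurm.

```python
# Hypothetical helper (not part of Slurm): split a SchedulerParameters
# string into a dict. Bare flags such as "defer" are stored as True.
def parse_sched_params(value):
    params = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, val = item.partition("=")
        if not sep:
            params[key] = True
        elif val.isdigit():
            params[key] = int(val)
        else:
            params[key] = val
    return params


# The SchedulerParameters line quoted in this post.
SCHED_PARAMS = (
    "bf_window=21600,bf_resolution=360,bf_interval=120,"
    "default_queue_depth=400,max_job_bf=400,bf_max_job_user=50,max_rpc_cnt=100"
)

params = parse_sched_params(SCHED_PARAMS)

# Assumption: bf_window is in minutes, bf_resolution in seconds.
window_days = params["bf_window"] / (60 * 24)
window_steps = params["bf_window"] * 60 // params["bf_resolution"]

print(f"backfill window: {window_days:g} days")
print(f"bf_resolution-sized steps across the window: {window_steps}")
```

With the values above this gives a 15-day window covered by 3600 steps of 360 seconds each.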
If I increase default_queue_depth to, let's say, 4000, then yes, all jobs
get initiated, but that can't be the default behaviour.
sdiag output:
*******************************************************
sdiag output at Wed Jun 8 14:09:11 2016
Data since Wed Jun 8 02:00:01 2016
*******************************************************
Server thread count: 3
Agent queue size: 0
Jobs submitted: 881
Jobs started: 7575
Jobs completed: 7175
Jobs canceled: 818
Jobs failed: 0
Main schedule statistics (microseconds):
Last cycle: 73794
Max cycle: 435582
Total cycles: 9607
Mean cycle: 47972
Mean depth cycle: 330
Cycles per minute: 13
Last queue length: 4339
Backfilling stats
Total backfilled jobs (since last slurm start): 9638
Total backfilled jobs (since last stats cycle start): 5130
Total cycles: 297
Last cycle when: Wed Jun 8 14:05:41 2016
Last cycle: 1997114
Max cycle: 33999324
Mean cycle: 6327223
Last depth cycle: 52
Last depth cycle (try sched): 52
Depth Mean: 84
Depth Mean (try depth): 68
Last queue length: 4338
Queue length mean: 2512
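To put the depth figures above in proportion to the queue length, a short Python script can group the sdiag lines by section and compute the share of the queue that each scheduler reaches. This is a rough sketch: parse_sdiag is a hypothetical helper, it only understands the plain "key: value" layout shown above, and dividing a mean depth by the last queue length mixes an average with a snapshot, so the result is only indicative.

```python
# Hypothetical helper: group numeric "key: value" lines of sdiag output
# by section, so that keys repeated in both sections do not collide.
def parse_sdiag(text):
    sections = {"general": {}}
    current = "general"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Main schedule statistics"):
            current = "main"
            sections[current] = {}
        elif line.startswith("Backfilling stats"):
            current = "backfill"
            sections[current] = {}
        elif ":" in line:
            key, _, val = line.rpartition(":")
            val = val.strip()
            # Skip non-numeric values such as timestamps.
            if val.isdigit():
                sections[current][key.strip()] = int(val)
    return sections


# A subset of the sdiag lines quoted in this post.
SAMPLE = """\
Main schedule statistics (microseconds):
Mean depth cycle: 330
Last queue length: 4339
Backfilling stats
Last cycle when: Wed Jun 8 14:05:41 2016
Last depth cycle: 52
Depth Mean: 84
Last queue length: 4338
"""

stats = parse_sdiag(SAMPLE)
main, bf = stats["main"], stats["backfill"]

main_share = main["Mean depth cycle"] / main["Last queue length"]
bf_share = bf["Last depth cycle"] / bf["Last queue length"]

print(f"main scheduler: {main_share:.1%} of the queue per cycle")
print(f"backfill, last cycle: {bf_share:.1%} of the queue")
```

For the numbers in this post, the main scheduler reaches roughly 7.6% of the pending jobs per cycle and the last backfill cycle about 1.2%.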