Hello list,

we've got a strange problem: the main scheduling mechanism only looks at
the top default_queue_depth jobs in the queue, and the rest remain pending
with reason "None". It looks as though a full sweep of the queue is either
not performed or aborted somehow. Low-priority jobs keep waiting for hours
or even days in the "None" pending state until the number of pending jobs
is small enough.

Here's some info about our install:

slurm 2.6.10
MessageTimeout=60
SchedulerType=sched/backfill
SchedulerParameters=bf_window=21600,bf_resolution=360,bf_interval=120,default_queue_depth=400,max_job_bf=400,bf_max_job_user=50,max_rpc_cnt=100
FastSchedule=1

We have long jobs - up to 15 days, hence the large bf_window and the high
bf_resolution. At first we tried defer and bf_continue to reduce
overhead, but that didn't improve the situation.
If I increase default_queue_depth to, let's say, 4000, then yes, all jobs
get initiated, but that can't be the default behaviour.
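
To put numbers on the backfill settings above, here is a small Python
sketch of the arithmetic. It relies on bf_window being given in minutes
and bf_resolution in seconds; the variable names are made up for this
example, and the "slots" figure is only a rough count of time steps in
the window, not a statement about how the backfill scheduler stores them.

```python
# Values copied from the SchedulerParameters line above.
bf_window_min = 21600    # bf_window is specified in minutes
bf_resolution_s = 360    # bf_resolution is specified in seconds

# Length of the backfill planning window in days.
window_days = bf_window_min / (60 * 24)

# Rough number of bf_resolution-sized time steps inside that window.
slots = (bf_window_min * 60) // bf_resolution_s

print(f"bf_window covers {window_days} days, about {slots} time steps")
```

So the window matches our 15-day maximum job length.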

sdiag output:
*******************************************************
sdiag output at Wed Jun  8 14:09:11 2016
Data since      Wed Jun  8 02:00:01 2016
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 881
Jobs started:   7575
Jobs completed: 7175
Jobs canceled:  818
Jobs failed:    0

Main schedule statistics (microseconds):
    Last cycle:   73794
    Max cycle:    435582
    Total cycles: 9607
    Mean cycle:   47972
    Mean depth cycle:  330
    Cycles per minute: 13
    Last queue length: 4339

Backfilling stats
    Total backfilled jobs (since last slurm start): 9638
    Total backfilled jobs (since last stats cycle start): 5130
    Total cycles: 297
    Last cycle when: Wed Jun  8 14:05:41 2016
    Last cycle: 1997114
    Max cycle:  33999324
    Mean cycle: 6327223
    Last depth cycle: 52
    Last depth cycle (try sched): 52
    Depth Mean: 84
    Depth Mean (try depth): 68
    Last queue length: 4338
    Queue length mean: 2512
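
In case it helps to compare numbers, below is a rough Python sketch that
pulls the integer counters out of sdiag text and relates the main
scheduler's mean depth to the queue length. The function name parse_sdiag
and the section labels "general", "main" and "backfill" are made up for
this example; they are not part of Slurm. The sample text is a subset of
the output above.

```python
import re

def parse_sdiag(text):
    """Collect integer 'key: value' lines of sdiag text, grouped by section."""
    stats = {}
    section = "general"  # label for lines before any known section header
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Main schedule statistics"):
            section = "main"
            continue
        if stripped.startswith("Backfilling stats"):
            section = "backfill"
            continue
        # Only lines that end in a plain integer are kept; dates are skipped.
        m = re.match(r"^(.*?):\s+(\d+)$", stripped)
        if m:
            stats.setdefault(section, {})[m.group(1)] = int(m.group(2))
    return stats

SAMPLE = """\
Main schedule statistics (microseconds):
    Mean depth cycle:  330
    Last queue length: 4339

Backfilling stats
    Mean cycle: 6327223
    Depth Mean: 84
    Last queue length: 4338
"""

stats = parse_sdiag(SAMPLE)
main = stats["main"]
# Fraction of the pending queue the main scheduler examines per cycle.
coverage = main["Mean depth cycle"] / main["Last queue length"]
print(f"main scheduler examines about {coverage:.1%} of the queue per cycle")
```

With our numbers, the main scheduler examines only a small fraction of
the pending queue in each cycle, which matches what we see.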
