What are the values of bf_window and bf_resolution in your configuration?

>From the documentation of bf_window: "The default value is 1440
minutes (one day). A value at least as long as the highest allowed
time limit is generally advisable to prevent job starvation."

2017-06-16 1:16 GMT+02:00 Robbert Eggermont <[email protected]>:
>
> Hello,
>
> In our Slurm setup (now 17.02.4) I've noticed several times now that
> backfilled jobs push back the start time of the highest priority job.
> I'm not sure if this is due to a configuration error or an scheduler error,
> and since I'm having a hard time diagnosing what's happening, I was hoping
> for some insightful tips.
>
> What happens is that when the highest priority pending job needs a lot of
> resources (CPUs, ...), slurm will backfill lower priority jobs with less
> requirements but with a higher timelimit than the currently running jobs.
>
> For example, the highest priority job needs a full node, and the first node
> will become available in 6 days; our slurm will happily backfill all pending
> lower priority 2-CPU 7-day jobs on every possible node in the cluster, thus
> pushing back the highest priority job 1 day.
>
> Looking into the scheduler debugging info, I noticed some things I can't
> explain:
> 1) the highest priority job ("A") is not always scheduled to start on the
> first node ("1") that will become available;
> 2) in the same iteration, the backfill logic will start another, lower
> priority, smaller job with a timelimit longer than the expected start time
> of job "A" on the same node "1";
> 3) when "A" is scheduled to start on another node, the scheduled starting
> time remains the same (i.e. it is not updated to the time that the new node
> becomes available).
> 4) the scheduled starting time of the highest priority job ("A") is
> sometimes later than the time that the node becomes available;
>
> See below for some log entries for these events.
>
> Does anybody have an idea what's going on here, and how we can fix it?
>
> Robbert
>
>
> 1)
> JobID=1315558 has a scheduled start time on node maxwell of
> 2017-06-22T19:11:16; forcing it to another node (by draining maxwell)
> reduces the start time to 2017-06-22T16:47:43.
> (But slurm is consistent: when maxwell is resumed, the job is scheduled
> there again, with the later start time.)
>>
>> [2017-06-15T22:11:26.688] backfill: beginning
>> [2017-06-15T22:11:26.693] backfill test for JobID=1315558 Prio=22703485
>> Partition=general
>> [2017-06-15T22:11:26.694] Job 1315558 to start at 2017-06-22T19:11:16, end
>> at 2017-06-27T07:11:00 on maxwell
>> [2017-06-15T22:11:26.694] backfill: reached end of job queue
>> [2017-06-15T22:11:52.223] update_node: node maxwell state set to DRAINING
>> [2017-06-15T22:11:56.695] backfill: beginning
>> [2017-06-15T22:11:56.701] backfill test for JobID=1315558 Prio=22703485
>> Partition=general
>> [2017-06-15T22:11:56.703] Job 1315558 to start at 2017-06-22T16:47:43, end
>> at 2017-06-27T04:47:00 on parzen
>> [2017-06-15T22:11:56.703] backfill: reached end of job queue
>
> ...
>>
>> [2017-06-15T22:34:46.036] backfill: beginning
>> [2017-06-15T22:34:46.039] =========================================
>> [2017-06-15T22:34:46.039] Begin:2017-06-15T22:34:46
>> End:2017-06-15T22:41:46
>> Nodes:100plus,gauss,hopper,markov,neumann,parzen,sanger,viterbi,watson
>> [2017-06-15T22:34:46.039] =========================================
>> [2017-06-15T22:34:46.040] backfill test for JobID=1315558 Prio=22776428
>> Partition=general
>> [2017-06-15T22:34:46.040] Test job 1315558 at 2017-06-15T22:34:46 on
>> hopper,parzen,viterbi
>> [2017-06-15T22:34:46.040] Job 1315558 to start at 2017-06-22T16:47:43, end
>> at 2017-06-27T04:47:00 on parzen
>> [2017-06-15T22:34:46.040] backfill: reached end of job queue
>> [2017-06-15T22:34:55.236] update_node: node maxwell state set to ALLOCATED
>> [2017-06-15T22:35:16.041] backfill: beginning
>> [2017-06-15T22:35:16.045] =========================================
>> [2017-06-15T22:35:16.045] Begin:2017-06-15T22:35:16
>> End:2017-06-15T22:42:16
>> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson
>> [2017-06-15T22:35:16.045] =========================================
>> [2017-06-15T22:35:16.045] backfill test for JobID=1315558 Prio=22776428
>> Partition=general
>> [2017-06-15T22:35:16.045] Test job 1315558 at 2017-06-15T22:35:16 on
>> hopper,maxwell,parzen,viterbi
>> [2017-06-15T22:35:16.046] Job 1315558 to start at 2017-06-22T19:11:16, end
>> at 2017-06-27T07:11:00 on maxwell
>> [2017-06-15T22:35:16.046] backfill: reached end of job queue
>
>
> 2)
> Our highest prio job is now scheduled to start on maxwell at
> 2017-06-22T18:29:17, but then another job is backfilled on maxwell with
> EndTime=2017-06-22T21:02:10??
>>
>> [2017-06-15T23:22:10.227] backfill: beginning
>> [2017-06-15T23:22:10.231] =========================================
>> [2017-06-15T23:22:10.231] Begin:2017-06-15T23:22:10
>> End:2017-06-15T23:29:10
>> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson
>> [2017-06-15T23:22:10.231] =========================================
>> [2017-06-15T23:22:10.231] backfill test for JobID=1315558 Prio=22938897
>> Partition=general
>> [2017-06-15T23:22:10.231] Test job 1315558 at 2017-06-15T23:22:10 on
>> hopper,maxwell,parzen,viterbi
>> [2017-06-15T23:22:10.232] Job 1315558 to start at 2017-06-22T18:29:17, end
>> at 2017-06-27T06:29:00 on maxwell
>> [2017-06-15T23:22:10.232] backfill test for JobID=1314524 Prio=10000002
>> Partition=general
>> [2017-06-15T23:22:10.233] Test job 1314524 at 2017-06-15T23:22:10 on
>> gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson
>> [2017-06-15T23:22:10.233] backfill: Started JobId=1314524 in general on
>> maxwell
>> [2017-06-15T23:22:10.237] backfill: reached end of job queue
>
>
> 3 and 4)
> Our job is immediately scheduled on another node, however the scheduled
> starting time 2017-06-22T18:29:17 remains the same, although the current
> longest jobs on parzen has EndTime=2017-06-22T15:27:08.
> Almost an hour later the start time is updated (note: the job complete
> should be unrelated since the job ran on maxwell, right?).
> Then 40 minutes later the start time is updated again, pushed back (while no
> new jobs were started on parzen in the mean time, and we don't suspend;
> there was however another job complete on maxwell?).
>>
>> [2017-06-15T23:22:40.237] backfill: beginning
>> [2017-06-15T23:22:40.241] =========================================
>> [2017-06-15T23:22:40.242] Begin:2017-06-15T23:22:40
>> End:2017-06-15T23:29:40
>> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson
>> [2017-06-15T23:22:40.242] =========================================
>> [2017-06-15T23:22:40.242] backfill test for JobID=1315558 Prio=22938897
>> Partition=general
>> [2017-06-15T23:22:40.242] Test job 1315558 at 2017-06-15T23:22:40 on
>> hopper,maxwell,parzen,viterbi
>> [2017-06-15T23:22:40.243] Job 1315558 to start at 2017-06-22T18:29:17, end
>> at 2017-06-27T06:29:00 on parzen
>> [2017-06-15T23:22:40.243] backfill: reached end of job queue
>
> ...
>>
>> [2017-06-16T00:16:06.161] backfill: beginning
>> [2017-06-16T00:16:06.165] =========================================
>> [2017-06-16T00:16:06.165] Begin:2017-06-16T00:16:06
>> End:2017-06-16T00:23:06
>> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson
>> [2017-06-16T00:16:06.165] =========================================
>> [2017-06-16T00:16:06.165] backfill test for JobID=1315558 Prio=23087721
>> Partition=general
>> [2017-06-16T00:16:06.165] Test job 1315558 at 2017-06-16T00:16:06 on
>> hopper,maxwell,parzen,viterbi
>> [2017-06-16T00:16:06.166] Job 1315558 to start at 2017-06-22T18:29:17, end
>> at 2017-06-27T06:29:00 on parzen
>> [2017-06-16T00:16:06.167] backfill: reached end of job queue
>> [2017-06-16T00:16:12.763] job_complete: JobID=1316201 State=0x8003
>> NodeCnt=1 done
>> [2017-06-16T00:16:36.167] backfill: beginning
>> [2017-06-16T00:16:36.170] =========================================
>> [2017-06-16T00:16:36.170] Begin:2017-06-16T00:16:36
>> End:2017-06-16T00:23:36
>> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson
>> [2017-06-16T00:16:36.170] =========================================
>> [2017-06-16T00:16:36.170] backfill test for JobID=1315558 Prio=23101442
>> Partition=general
>> [2017-06-16T00:16:36.170] Test job 1315558 at 2017-06-16T00:16:36 on
>> hopper,maxwell,parzen,viterbi
>> [2017-06-16T00:16:36.171] Job 1315558 to start at 2017-06-22T16:41:17, end
>> at 2017-06-27T04:41:00 on parzen
>> [2017-06-16T00:16:36.171] backfill: reached end of job queue
>
> ...
>>
>> [2017-06-16T00:55:41.642] backfill: beginning
>> [2017-06-16T00:55:41.646] =========================================
>> [2017-06-16T00:55:41.646] Begin:2017-06-16T00:55:41
>> End:2017-06-16T01:02:41
>> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson
>> [2017-06-16T00:55:41.646] =========================================
>> [2017-06-16T00:55:41.646] backfill test for JobID=1315558 Prio=23186871
>> Partition=general
>> [2017-06-16T00:55:41.646] Test job 1315558 at 2017-06-16T00:55:41 on
>> hopper,maxwell,parzen,viterbi
>> [2017-06-16T00:55:41.647] Job 1315558 to start at 2017-06-22T16:41:17, end
>> at 2017-06-27T04:41:00 on parzen
>> [2017-06-16T00:55:41.694] backfill: reached end of job queue
>> [2017-06-16T00:55:47.592] job_complete: JobID=1316353 State=0x1 NodeCnt=1
>> WEXITSTATUS 0
>> [2017-06-16T00:55:47.598] job_complete: JobID=1316353 State=0x8003
>> NodeCnt=1 done
>> [2017-06-16T00:56:11.695] backfill: beginning
>> [2017-06-16T00:56:11.701] =========================================
>> [2017-06-16T00:56:11.701] Begin:2017-06-16T00:56:11
>> End:2017-06-16T01:03:11
>> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson
>> [2017-06-16T00:56:11.701] =========================================
>> [2017-06-16T00:56:11.701] backfill test for JobID=1315558 Prio=23186871
>> Partition=general
>> [2017-06-16T00:56:11.701] Test job 1315558 at 2017-06-16T00:56:11 on
>> hopper,maxwell,parzen,viterbi
>> [2017-06-16T00:56:11.703] Job 1315558 to start at 2017-06-22T18:01:33, end
>> at 2017-06-27T06:01:00 on parzen
>> [2017-06-16T00:56:11.703] backfill: reached end of job queue

Reply via email to