What are the values of bf_window and bf_resolution in your configuration? >From the documentation of bf_window: "The default value is 1440 minutes (one day). A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation."
2017-06-16 1:16 GMT+02:00 Robbert Eggermont <[email protected]>: > > Hello, > > In our Slurm setup (now 17.02.4) I've noticed several times now that > backfilled jobs push back the start time of the highest priority job. > I'm not sure if this is due to a configuration error or an scheduler error, > and since I'm having a hard time diagnosing what's happening, I was hoping > for some insightful tips. > > What happens is that when the highest priority pending job needs a lot of > resources (CPUs, ...), slurm will backfill lower priority jobs with less > requirements but with a higher timelimit than the currently running jobs. > > For example, the highest priority job needs a full node, and the first node > will become available in 6 days; our slurm will happily backfill all pending > lower priority 2-CPU 7-day jobs on every possible node in the cluster, thus > pushing back the highest priority job 1 day. > > Looking into the scheduler debugging info, I noticed some things I can't > explain: > 1) the highest priority job ("A") is not always scheduled to start on the > first node ("1") that will become available; > 2) in the same iteration, the backfill logic will start another, lower > priority, smaller job with a timelimit longer than the expected start time > of job "A" on the same node "1"; > 3) when "A" is scheduled to start on another node, the scheduled starting > time remains the same (i.e. it is not updated to the time that the new node > becomes available). > 4) the scheduled starting time of the highest priority job ("A") is > sometimes later than the time that the node becomes available; > > See below for some log entries for these events. > > Does anybody have an idea what's going on here, and how we can fix it? > > Robbert > > > 1) > JobID=1315558 has a scheduled start time on node maxwell of > 2017-06-22T19:11:16; forcing it to another node (by draining maxwell) > reduces the start time to 2017-06-22T16:47:43. > (But slurm is consistent: when maxwell is resumed, the job is scheduled > there again, with the later start time.) >> >> [2017-06-15T22:11:26.688] backfill: beginning >> [2017-06-15T22:11:26.693] backfill test for JobID=1315558 Prio=22703485 >> Partition=general >> [2017-06-15T22:11:26.694] Job 1315558 to start at 2017-06-22T19:11:16, end >> at 2017-06-27T07:11:00 on maxwell >> [2017-06-15T22:11:26.694] backfill: reached end of job queue >> [2017-06-15T22:11:52.223] update_node: node maxwell state set to DRAINING >> [2017-06-15T22:11:56.695] backfill: beginning >> [2017-06-15T22:11:56.701] backfill test for JobID=1315558 Prio=22703485 >> Partition=general >> [2017-06-15T22:11:56.703] Job 1315558 to start at 2017-06-22T16:47:43, end >> at 2017-06-27T04:47:00 on parzen >> [2017-06-15T22:11:56.703] backfill: reached end of job queue > > ... >> >> [2017-06-15T22:34:46.036] backfill: beginning >> [2017-06-15T22:34:46.039] ========================================= >> [2017-06-15T22:34:46.039] Begin:2017-06-15T22:34:46 >> End:2017-06-15T22:41:46 >> Nodes:100plus,gauss,hopper,markov,neumann,parzen,sanger,viterbi,watson >> [2017-06-15T22:34:46.039] ========================================= >> [2017-06-15T22:34:46.040] backfill test for JobID=1315558 Prio=22776428 >> Partition=general >> [2017-06-15T22:34:46.040] Test job 1315558 at 2017-06-15T22:34:46 on >> hopper,parzen,viterbi >> [2017-06-15T22:34:46.040] Job 1315558 to start at 2017-06-22T16:47:43, end >> at 2017-06-27T04:47:00 on parzen >> [2017-06-15T22:34:46.040] backfill: reached end of job queue >> [2017-06-15T22:34:55.236] update_node: node maxwell state set to ALLOCATED >> [2017-06-15T22:35:16.041] backfill: beginning >> [2017-06-15T22:35:16.045] ========================================= >> [2017-06-15T22:35:16.045] Begin:2017-06-15T22:35:16 >> End:2017-06-15T22:42:16 >> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson >> [2017-06-15T22:35:16.045] ========================================= >> [2017-06-15T22:35:16.045] backfill test for JobID=1315558 Prio=22776428 >> Partition=general >> [2017-06-15T22:35:16.045] Test job 1315558 at 2017-06-15T22:35:16 on >> hopper,maxwell,parzen,viterbi >> [2017-06-15T22:35:16.046] Job 1315558 to start at 2017-06-22T19:11:16, end >> at 2017-06-27T07:11:00 on maxwell >> [2017-06-15T22:35:16.046] backfill: reached end of job queue > > > 2) > Our highest prio job is now scheduled to start on maxwell at > 2017-06-22T18:29:17, but then another job is backfilled on maxwell with > EndTime=2017-06-22T21:02:10?? >> >> [2017-06-15T23:22:10.227] backfill: beginning >> [2017-06-15T23:22:10.231] ========================================= >> [2017-06-15T23:22:10.231] Begin:2017-06-15T23:22:10 >> End:2017-06-15T23:29:10 >> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson >> [2017-06-15T23:22:10.231] ========================================= >> [2017-06-15T23:22:10.231] backfill test for JobID=1315558 Prio=22938897 >> Partition=general >> [2017-06-15T23:22:10.231] Test job 1315558 at 2017-06-15T23:22:10 on >> hopper,maxwell,parzen,viterbi >> [2017-06-15T23:22:10.232] Job 1315558 to start at 2017-06-22T18:29:17, end >> at 2017-06-27T06:29:00 on maxwell >> [2017-06-15T23:22:10.232] backfill test for JobID=1314524 Prio=10000002 >> Partition=general >> [2017-06-15T23:22:10.233] Test job 1314524 at 2017-06-15T23:22:10 on >> gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson >> [2017-06-15T23:22:10.233] backfill: Started JobId=1314524 in general on >> maxwell >> [2017-06-15T23:22:10.237] backfill: reached end of job queue > > > 3 and 4) > Our job is immediately scheduled on another node, however the scheduled > starting time 2017-06-22T18:29:17 remains the same, although the current > longest jobs on parzen has EndTime=2017-06-22T15:27:08. > Almost an hour later the start time is updated (note: the job complete > should be unrelated since the job ran on maxwell, right?). > Then 40 minutes later the start time is updated again, pushed back (while no > new jobs were started on parzen in the mean time, and we don't suspend; > there was however another job complete on maxwell?). >> >> [2017-06-15T23:22:40.237] backfill: beginning >> [2017-06-15T23:22:40.241] ========================================= >> [2017-06-15T23:22:40.242] Begin:2017-06-15T23:22:40 >> End:2017-06-15T23:29:40 >> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson >> [2017-06-15T23:22:40.242] ========================================= >> [2017-06-15T23:22:40.242] backfill test for JobID=1315558 Prio=22938897 >> Partition=general >> [2017-06-15T23:22:40.242] Test job 1315558 at 2017-06-15T23:22:40 on >> hopper,maxwell,parzen,viterbi >> [2017-06-15T23:22:40.243] Job 1315558 to start at 2017-06-22T18:29:17, end >> at 2017-06-27T06:29:00 on parzen >> [2017-06-15T23:22:40.243] backfill: reached end of job queue > > ... >> >> [2017-06-16T00:16:06.161] backfill: beginning >> [2017-06-16T00:16:06.165] ========================================= >> [2017-06-16T00:16:06.165] Begin:2017-06-16T00:16:06 >> End:2017-06-16T00:23:06 >> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson >> [2017-06-16T00:16:06.165] ========================================= >> [2017-06-16T00:16:06.165] backfill test for JobID=1315558 Prio=23087721 >> Partition=general >> [2017-06-16T00:16:06.165] Test job 1315558 at 2017-06-16T00:16:06 on >> hopper,maxwell,parzen,viterbi >> [2017-06-16T00:16:06.166] Job 1315558 to start at 2017-06-22T18:29:17, end >> at 2017-06-27T06:29:00 on parzen >> [2017-06-16T00:16:06.167] backfill: reached end of job queue >> [2017-06-16T00:16:12.763] job_complete: JobID=1316201 State=0x8003 >> NodeCnt=1 done >> [2017-06-16T00:16:36.167] backfill: beginning >> [2017-06-16T00:16:36.170] ========================================= >> [2017-06-16T00:16:36.170] Begin:2017-06-16T00:16:36 >> End:2017-06-16T00:23:36 >> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson >> [2017-06-16T00:16:36.170] ========================================= >> [2017-06-16T00:16:36.170] backfill test for JobID=1315558 Prio=23101442 >> Partition=general >> [2017-06-16T00:16:36.170] Test job 1315558 at 2017-06-16T00:16:36 on >> hopper,maxwell,parzen,viterbi >> [2017-06-16T00:16:36.171] Job 1315558 to start at 2017-06-22T16:41:17, end >> at 2017-06-27T04:41:00 on parzen >> [2017-06-16T00:16:36.171] backfill: reached end of job queue > > ... >> >> [2017-06-16T00:55:41.642] backfill: beginning >> [2017-06-16T00:55:41.646] ========================================= >> [2017-06-16T00:55:41.646] Begin:2017-06-16T00:55:41 >> End:2017-06-16T01:02:41 >> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson >> [2017-06-16T00:55:41.646] ========================================= >> [2017-06-16T00:55:41.646] backfill test for JobID=1315558 Prio=23186871 >> Partition=general >> [2017-06-16T00:55:41.646] Test job 1315558 at 2017-06-16T00:55:41 on >> hopper,maxwell,parzen,viterbi >> [2017-06-16T00:55:41.647] Job 1315558 to start at 2017-06-22T16:41:17, end >> at 2017-06-27T04:41:00 on parzen >> [2017-06-16T00:55:41.694] backfill: reached end of job queue >> [2017-06-16T00:55:47.592] job_complete: JobID=1316353 State=0x1 NodeCnt=1 >> WEXITSTATUS 0 >> [2017-06-16T00:55:47.598] job_complete: JobID=1316353 State=0x8003 >> NodeCnt=1 done >> [2017-06-16T00:56:11.695] backfill: beginning >> [2017-06-16T00:56:11.701] ========================================= >> [2017-06-16T00:56:11.701] Begin:2017-06-16T00:56:11 >> End:2017-06-16T01:03:11 >> Nodes:100plus,gauss,hopper,markov,maxwell,neumann,parzen,sanger,viterbi,watson >> [2017-06-16T00:56:11.701] ========================================= >> [2017-06-16T00:56:11.701] backfill test for JobID=1315558 Prio=23186871 >> Partition=general >> [2017-06-16T00:56:11.701] Test job 1315558 at 2017-06-16T00:56:11 on >> hopper,maxwell,parzen,viterbi >> [2017-06-16T00:56:11.703] Job 1315558 to start at 2017-06-22T18:01:33, end >> at 2017-06-27T06:01:00 on parzen >> [2017-06-16T00:56:11.703] backfill: reached end of job queue
