Not setting the time here does not mean there is no start time, only that any time set in the previous execution of the backfill scheduler is unchanged, which is probably no longer correct.
Note there is a variable "later_start" that attempts to start jobs at
later times based upon when other jobs complete. That logic currently
does not try to start jobs upon completion of a reservation. Probably
what needs to happen is to consider these reservations and advance
"later_start" to either the time of the next job ending OR the time
when the next reservation ends, whichever comes first. That should
generate a correct start time for pending jobs affected by
reservations. This shouldn't be too difficult and I'll add it to my
to-do list, but I don't know when I will have time to work on it.

Moe

Quoting Mark Nelson <[email protected]>:

>
> Hi All,
>
> I've noticed that when using the backfill scheduler on SLURM 2.4 (latest
> git), if a job cannot start because its resources are unavailable (for
> example because of an upcoming reservation that will begin before the
> job will end) its start time gets set to "now" + whatever the backfill
> window is set to.
>
> For example, there is a whole system reservation for tomorrow morning,
> starting 9am (so it will begin in ~17 hours):
>
> ~$ scontrol show res
> ReservationName=testres StartTime=2012-08-01T09:00:00
>    EndTime=2012-08-02T09:00:00 Duration=1-00:00:00
>    Nodes=bgp[000x011] NodeCnt=2048 Features=(null) PartitionName=(null)
>    Flags=MAINT,SPEC_NODES
>    Users=root Accounts=(null) Licenses=(null) State=INACTIVE
>
> We have a job that wants half of the machine for one day, so it cannot
> start until after the reservation is complete:
>
> ~$ cat long-waiter
> #!/bin/sh
> #SBATCH --time=1-0
> #SBATCH --nodes=2048
>
> /bin/hostname
> sleep 1200
> /usr/bin/uptime
> ~$
>
> Prior to commit b86bc225f56ec8524243ae848b934844ced83e9c (Add backfill
> scheduler resolution parameter), this job was never given a start time.
> Part of this commit made the following change (to backfill.c):
>
> @@ -634,7 +646,9 @@ static int _attempt_backfill(void)
>                   job_ptr->start_time = 0;
>                   goto TRY_LATER;
>               }
> +             /* Job can not start until too far in the future */
>               job_ptr->time_limit = orig_time_limit;
> +             job_ptr->start_time = sched_start + backfill_window;
>               continue;
>           }
>
> (this code is now on line 740 of backfill.c if anyone wants to see the
> whole of _attempt_backfill() )
>
> Thus, I'm now seeing a start time for jobs that previously had no start
> time, and that I'm not sure should have a start time set:
>
> ~$ squeue --start
> JOBID PARTITION NAME     USER  ST START_TIME          NODES MIDPLANELIST(REASON)
>    27 main      long-wai markn PD 2012-08-02T16:00:08 2K    (Resources)
>
>
> What do others think?
> And also, why is it necessary to set start_time to be just outside the
> backfill window (which above is 2 days)?
>
> Many thanks!
> Mark
>
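
For anyone following along, the idea in the reply (advance "later_start" to the next job end or the next reservation end, whichever comes first) could look roughly like the C sketch below. It is only an illustration under stated assumptions: next_release_time() and its plain-array arguments are hypothetical names invented here, not code from backfill.c, and the scheduler's real job and reservation structures are not shown.

```c
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper (not part of SLURM): return the earliest time
 * strictly after "now" at which resources may be released, looking at
 * both running-job end times and reservation end times.
 * Returns 0 if no such time exists.
 */
static time_t next_release_time(time_t now,
                                const time_t *job_end, size_t job_cnt,
                                const time_t *resv_end, size_t resv_cnt)
{
    time_t best = 0;
    size_t i;

    /* Earliest job completion still in the future */
    for (i = 0; i < job_cnt; i++) {
        if (job_end[i] > now && (best == 0 || job_end[i] < best))
            best = job_end[i];
    }

    /* Earliest reservation end still in the future, if it comes sooner */
    for (i = 0; i < resv_cnt; i++) {
        if (resv_end[i] > now && (best == 0 || resv_end[i] < best))
            best = resv_end[i];
    }

    return best;
}
```

A caller would use the returned value as the next candidate for "later_start" and retry the pending job at that time. A return of 0 means nothing further is scheduled to free up, which presumably corresponds to the case where no start time can be given.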
