Moe Jette <[email protected]> writes:
> Hi Kent,
>
> In the example which fails to work properly (starting job 1253 before
> job 1252), the problem is the backfill scheduler not accounting for
> all resources and limits. Specifically, the backfill scheduler
> simulates what resources become available as various jobs begin and
> end going forward in time. It accounts for the CPUs, memory, various
> limits and job preemption. It does not currently account for the group
> limits or licenses. So when the backfill scheduler tries to determine
> when job 1252 can start, it notes the association limit, but fails to
> recognize the job will be able to start in 57 minutes (when job 1251
> terminates, affecting the group limit) and thus fails to reserve those
> resources, a reservation that would have prevented the initiation of
> job 1253.
>
> There is not a simple fix for this problem. It would require adding
> new logic to track the group limits through the future to better
> determine when and where pending jobs can be initiated.

Thanks for the detailed answer!

For now, I guess we will have to use multiple partitions instead of
GrpNodes limits on a single partition for the situations where we need
the priority to be respected between the users in the groups.

Regards,
-- 
Kent Engström, National Supercomputer Centre
[email protected], +46 13 28 4444
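
To make the behaviour described in the quoted message concrete, here is a
minimal Python sketch of a forward-in-time start-time search. It is a toy
model, not Slurm's backfill code. The cluster size (8 nodes), the group
limit (4 nodes), the job sizes, the group name and every function and
field name are assumptions made up for illustration. Only the job numbers
and the 57 minutes come from the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: int
    group: str
    nodes: int
    end: Optional[int] = None  # minutes from now; None for a pending job


def earliest_start(pending, running, total_nodes, grp_nodes, track_group_limit):
    """Return minutes until `pending` can start, or None if no time is found."""
    # Candidate times: now, plus each moment a running job ends.
    times = sorted({0} | {j.end for j in running})
    for t in times:
        active = [j for j in running if j.end > t]
        free = total_nodes - sum(j.nodes for j in active)
        if track_group_limit:
            # Project the group's node usage forward to time t.
            group_used = sum(j.nodes for j in active if j.group == pending.group)
        else:
            # Toy version of the reported behaviour: the group limit is only
            # compared with current usage and never projected forward.
            group_used = sum(j.nodes for j in running if j.group == pending.group)
        if free >= pending.nodes and group_used + pending.nodes <= grp_nodes:
            return t
    return None


# Hypothetical numbers: job 1251 holds the group's whole 4-node limit.
running = [Job(1251, "grp_a", nodes=4, end=57)]
job_1252 = Job(1252, "grp_a", nodes=4)

print(earliest_start(job_1252, running, 8, 4, track_group_limit=True))   # 57
print(earliest_start(job_1252, running, 8, 4, track_group_limit=False))  # None
```

With the limit projected forward, job 1252 gets a start time 57 minutes
out, so resources could be reserved for it. Without that projection no
start time is found, nothing is reserved, and a lower-priority job is
free to take the nodes.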
