Moe Jette <[email protected]> writes:
> Hi Kent,
>
> In the example which fails to work properly (starting job 1253 before
> job 1252), the problem is the backfill scheduler not accounting for
> all resources and limits. Specifically the backfill scheduler  
> simulates what resources become available as various jobs begin and  
> end going forward in time. It accounts for the CPUs, memory, various  
> limits and job preemption. It does not currently account for the group  
> limits or licenses. So when the backfill scheduler tries to determine  
> when job 1252 can start, it notes the association limit, but fails to  
> recognize the job will be able to start in 57 minutes (when job 1251
> terminates, affecting the group limit) and thus fails to reserve those
> resources, which would have prevented the initiation of job 1253.
>
> There is not a simple fix for this problem. It would require adding  
> new logic to track the group limits through the future to better  
> determine when and where pending jobs can be initiated.
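
To make the gap concrete, here is a rough Python sketch of the kind of
forward simulation described above. It is not Slurm code: the names
(Running, earliest_start, track_group) and the node counts are made up
for illustration, and only the job number and the 57 minutes come from
this thread. With track_group=False the group limit is checked only at
the current time, which is my reading of the behaviour described; with
track_group=True the limit is followed into the future.

```python
from dataclasses import dataclass


@dataclass
class Running:
    job_id: int
    group: str
    nodes: int
    ends_in: int  # minutes until the job terminates


def earliest_start(running, total_nodes, grp_limit, group, nodes_needed,
                   track_group=True):
    """Minutes until a pending job could start, or None if no start
    time (and hence no reservation) can be determined."""
    free = total_nodes - sum(r.nodes for r in running)
    grp_used = sum(r.nodes for r in running if r.group == group)

    # Without group tracking, the limit is only looked at "now": if it
    # blocks the job at this moment, no future start time is found.
    if not track_group and grp_used + nodes_needed > grp_limit:
        return None

    t = 0
    # Check the present (None), then each point where a job terminates.
    for r in [None] + sorted(running, key=lambda r: r.ends_in):
        if r is not None:
            t = r.ends_in
            free += r.nodes
            if r.group == group:
                grp_used -= r.nodes
        group_ok = (not track_group) or grp_used + nodes_needed <= grp_limit
        if free >= nodes_needed and group_ok:
            return t
    return None


# Hypothetical setup: 4 nodes in total, a group limit of 2 nodes, and
# job 1251 holding both of the group's nodes for another 57 minutes.
running = [Running(job_id=1251, group="a", nodes=2, ends_in=57)]
print(earliest_start(running, 4, 2, "a", 2, track_group=True))   # 57
print(earliest_start(running, 4, 2, "a", 2, track_group=False))  # None
```

In the second call no start time is found for the pending job, so
nothing is reserved for it and a lower-priority job is free to take
the idle nodes.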

Thanks for the detailed answer!

For now, I guess we will have to use multiple partitions instead of
GrpNodes limits on a single partition for the situations where we need
the priority to be respected between the users in the groups.

Regards,
-- 
Kent Engström, National Supercomputer Centre
[email protected], +46 13 28 4444
