Quoting Bjørn-Helge Mevik <[email protected]>:

Here are two patches to speed up the backfilling in slurm.  Both are
made against slurm 2.2.6, and have been used in production here for a
while now.

The first patch is really small; it just adds a call to
acct_policy_job_runnable() to check if a job should have been marked with
AssociationResourceLimit.  This avoids trying to backfill any job that
wouldn't be allowed to run now anyway.  The call to
acct_policy_job_runnable() is quick: usually about 10 usec, and
occasionally up to 100-200 usec.
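
To illustrate the idea behind the first patch, here is a minimal,
self-contained sketch in C. It is not the patch itself. The struct and
all demo_* names are hypothetical stand-ins invented for this example;
only the role of the cheap check (the one acct_policy_job_runnable()
plays in the message above) is taken from the text. The point is simply
that a cheap policy test in front of the expensive planning step keeps
the expensive step from running for jobs that could not start anyway.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued job; NOT SLURM's job record. */
struct demo_job {
    int  id;
    bool within_assoc_limits;   /* what the cheap policy check reports */
    bool fits_backfill_window;  /* what the expensive planning reports */
};

/* Counts how often the expensive step runs, so the saving is visible. */
size_t demo_expensive_calls = 0;

/* Stand-in for the cheap check (assumed to cost microseconds). */
static bool demo_policy_runnable(const struct demo_job *job)
{
    return job->within_assoc_limits;
}

/* Stand-in for the expensive backfill planning of a single job. */
static bool demo_try_backfill(const struct demo_job *job)
{
    demo_expensive_calls++;
    return job->fits_backfill_window;
}

/* One pass over the queue; returns the number of jobs that were started. */
size_t demo_backfill_pass(const struct demo_job *queue, size_t n)
{
    size_t started = 0;

    assert(queue != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Cheap test first: skip jobs that are over their limits. */
        if (!demo_policy_runnable(&queue[i]))
            continue;
        /* Only jobs that pass the policy check reach the expensive step. */
        if (demo_try_backfill(&queue[i]))
            started++;
    }
    return started;
}
```

With a queue in which half of the jobs are over their association
limits, the expensive step in this sketch runs for only the other half.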

Thanks for the patches. I can understand how this logic would be very slow for systems with large numbers of jobs. Regarding the second patch, I do not believe the "else" portion of the code is needed. If you have seen no problems running over the past week, that is a good sign that the logic is not needed. If you do see jobs stop being scheduled due to what you believe is data corruption in this area, running "scontrol reconfig" should rebuild all of the select/cons_res data structures. I have applied both patches to the SLURM version 2.3 code base, where they can be tested more before being released.

Thanks,
Moe

