Excellent work!
Your patch looks good to me and will be included in version 2.2.3 when 
available.
Thanks!
________________________________________
From: [email protected] [[email protected]] On Behalf 
Of Bjørn-Helge Mevik [[email protected]]
Sent: Wednesday, February 23, 2011 7:25 AM
To: [email protected]
Cc: [email protected]
Subject: Re: [slurm-dev] Slow backfill testing of some jobs.

"Jette, Moe" <[email protected]> writes:

> I still haven't been able to see any significant delays in backfill
> scheduling. I have attached a patch which might help you. If you do
> give it a try, please let me know what the results are.

We tried it, but it did not give any speedup for the problematic jobs.

After much code-reading, log-forensics (and a bit of statistics :-), we
found out that the slow backfill tests happen for jobs that ask for
features or resources that only very few nodes have.  In our case, that
means jobs asking for the feature hugemem (5 nodes out of 680), for a
specific rack, or for a single node.

For jobs like that, typically very many jobs have to be removed with
_rm_job_from_res() before nodes that the job can use become available.
We also discovered that it was not only the resulting large number of
calls to cr_job_test() that took time; the calls to _rm_job_from_res()
could also add up to several seconds for one backfill test.
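
To illustrate why such jobs are expensive, here is a toy model in Python.
It is not Slurm code: the function name, the data layout and the
assumption that every node runs exactly one single-node job are all
hypothetical.  It only counts how many running jobs would have to be
removed, in end-time order, before a pending job finds enough usable
nodes free.

```python
# Toy model only; this is NOT how Slurm's backfill plugin is implemented.
# All names below are hypothetical and chosen for illustration.

def removals_until_fit(running_jobs, usable_nodes, nodes_needed):
    """Count the running jobs that must be removed, earliest end time
    first, before `nodes_needed` of `usable_nodes` are free.

    running_jobs: list of (end_time, set_of_node_ids)
    Assumes each node is used by at most one running job.
    """
    busy = set()
    for _end, nodes in running_jobs:
        busy |= nodes
    free = usable_nodes - busy

    removed = 0
    for _end, nodes in sorted(running_jobs, key=lambda job: job[0]):
        if len(free) >= nodes_needed:
            break
        removed += 1                   # one simulated job removal
        free |= nodes & usable_nodes   # only usable nodes help this job
    return removed


# 680 nodes, each running one single-node job; the job on node i ends at time i.
jobs = [(i, {i}) for i in range(680)]
all_nodes = set(range(680))
hugemem = set(range(675, 680))         # assumed: the 5 hugemem nodes free up last

print(removals_until_fit(jobs, all_nodes, 1))
print(removals_until_fit(jobs, hugemem, 1))
```

In this worst-case layout a job that can run anywhere needs a single
removal, while a job restricted to the 5 hugemem nodes needs 676, each of
which would correspond to one removal plus one fit test in the real
scheduler.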

We've thus created a patch that we are now using on our production
cluster.  It reduces the backfill time from about 6 seconds to typically
less than 1 second for these jobs.  We've run it for about 30 hours now,
and virtually all jobs are tested in at most 1 second (see attached graph).

