Danny Auble wrote on May, 02 10:04:12:
> 
> On 05/02/2014 09:45 AM, Chris Harwell wrote:
> >max jobs? lightest weight method for determining number pending?
> >Hi,
> >
> >Just curious - what is the number of maximum pending jobs you have
> >seen where slurm still holds together?  I think we had troubles
> >quite a while back when numbers would get into 40-80k pd jobs, but
> >haven't looked recently and have also since moved the spool onto
> >SSD.
> I know there are sites running/pending 200k+ regularly with little

Are there any writeups about clusters doing numbers like that? Would
be interested in the configuration to support that size of queue.

> issue.  The box you are running slurmctld on and the way you have
> things configured make a difference.  (Backfill options have a
> large impact on how well things work for large job counts.)
> >
> >Do you find anything like submitting the job in hold state or
> >dependencies to substantially reduce impact such that you could
> >have an order of magnitude more? Any other tricks?
> >
> >Is this still the best reference?
> >http://slurm.schedmd.com/high_throughput.html
> Yes
> >
> > Is the 14.x series any better? We're still using 2.6.7.
> Outrageously.  There was a lot of work done to enhance the number of
> jobs the system could handle and look at in a timely fashion.
> >
> >Somewhat related, when you have a loaded up cluster, but still
> >want to monitor the number of pending jobs, what is the lightest
> >weight way to do that?
> >
> >I was thinking perhaps this?
> >/opt/slurmcl2/bin/sdiag | awk '/Last queue length:/ { print $4 }' | head -1
> >
> >Though I usually just do this:
> >squeue -o '%A' -h -r -t pd | wc -l
> I would expect sdiag to be faster since it is only looking at and
> sending a small amount of data.  squeue sends every job which could
> be very heavy.
> 
> But I wouldn't expect sdiag to always give you the stat you are
> looking for.  It only counts the jobs eligible to run.  Perhaps
> that is what you want, but consider this scenario on a 10-node system:
> 
> sbatch -N10 --exclusive --begin=tomorrow test.sh
> sbatch -N10 --exclusive test.sh
> sbatch -N10 --exclusive test.sh
> sbatch -N10 --exclusive test.sh
> 
> You get this...
> 
> sdiag | awk '/Last queue length:/ { print $4 }' | head -1
> 2
> 
> squeue -o '%A' -h -r -t pd | wc -l
> 3
> 
> The counts differ because sdiag doesn't take into account the job
> that isn't eligible yet (the one submitted with --begin=tomorrow).
> But perhaps this is exactly what you want.
> 
> Danny
> >
> >Thanks in advance,
> >Chris
> 
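
For anyone who wants to compare the two counts Danny describes, here is a
rough sketch that wraps both commands from the thread in shell functions.
The function names are made up for illustration, and it assumes sdiag and
squeue are on your PATH and that sdiag prints "Last queue length:" more
than once (which is presumably why the original pipes through head -1).
Treat it as a starting point, not a tested monitoring tool.

```shell
#!/bin/sh
# Sketch only: helper names below are hypothetical, not part of Slurm.

# Read sdiag output on stdin and print the first "Last queue length:" value.
# Exiting after the first match plays the role of "| head -1".
queue_length_from_sdiag() {
    awk '/Last queue length:/ { print $4; exit }'
}

# Count non-empty lines on stdin; always prints a number, even for no input.
count_lines() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Lightweight: only a small stats payload is requested from slurmctld.
# Per the thread, this counts only jobs that are eligible to run.
eligible_pending() {
    sdiag | queue_length_from_sdiag
}

# Heavier: squeue transfers every job record, then we count pending ones.
# This includes jobs that are not yet eligible (e.g. --begin=tomorrow).
all_pending() {
    squeue -o '%A' -h -r -t pd | count_lines
}
```

After sourcing the file, something like
echo "eligible=$(eligible_pending) all=$(all_pending)" would show both
numbers side by side; the gap between them is the number of pending jobs
that are not yet eligible.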

-- 
Chris Scheller
Unix System Administrator
Department of Biostatistics
School of Public Health
University of Michigan
Phone: (734) 615-7439
Office: M4218
