On 05/02/2014 09:45 AM, Chris Harwell wrote:
Subject: max jobs? lightest weight method for determining number pending?
Hi,
Just curious: what is the maximum number of pending jobs you have
seen where Slurm still holds together? I think we had trouble quite
a while back when the count got into the 40-80k range of pending (PD)
jobs, but we haven't looked recently and have since moved the spool
onto SSD.
I know there are sites that regularly have 200k+ jobs running or
pending with little issue. The box you run slurmctld on and the way
you have things configured make a difference (backfill options in
particular have a large impact on how well things work for large job
counts).
Do you find that anything like submitting jobs in a held state or
with dependencies substantially reduces the impact, such that you
could have an order of magnitude more? Any other tricks?
Is this still the best reference?
http://slurm.schedmd.com/high_throughput.html
Yes
Is the 14.x series any better? We're still using 2.6.7.
Outrageously so. A lot of work was done to increase the number of
jobs the system can handle and examine in a timely fashion.
Somewhat related, when you have a loaded up cluster, but still want to
monitor the number of pending jobs, what is the lightest weight way to
do that?
I was thinking perhaps this?
/opt/slurmcl2/bin/sdiag | awk '/Last queue length:/ { print $4 }' | head -1
Though I usually just do this:
squeue -o '%A' -h -r -t pd | wc -l
I would expect sdiag to be faster, since it only gathers and sends a
small amount of data. squeue sends every job, which could be very
heavy.
But I wouldn't expect sdiag to always give you the stat you are
looking for. It only counts the jobs eligible to run. Perhaps that is
what you want, but consider this scenario on a 10-node system:
sbatch -N10 --exclusive --begin=tomorrow test.sh
sbatch -N10 --exclusive test.sh
sbatch -N10 --exclusive test.sh
sbatch -N10 --exclusive test.sh
You get this...
sdiag | awk '/Last queue length:/ { print $4 }' | head -1
2
squeue -o '%A' -h -r -t pd | wc -l
3
The difference is that sdiag doesn't count the job that isn't yet
eligible (the one submitted with --begin=tomorrow). But perhaps this
is exactly what you want.
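If it helps, the two counts above can be wrapped in small shell
helpers. This is only a sketch: the function names eligible_pending and
all_pending are made up for illustration, and both read command output
on standard input so they can be tried against saved output without
touching slurmctld. It assumes the sdiag line format used above, where
the count is the fourth whitespace-separated field.

```shell
# Hypothetical helpers; names are illustrative, not part of Slurm.

# Print the first "Last queue length" value found in sdiag output.
# Stopping at the first match mirrors the `head -1` in the pipeline above.
eligible_pending() {
    awk '/Last queue length:/ { print $4; exit }'
}

# Count non-empty lines, e.g. one job ID per line from
#   squeue -o '%A' -h -r -t pd
# Prints 0 when there is no input at all.
all_pending() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Usage on a live cluster:
#   sdiag | eligible_pending
#   squeue -o '%A' -h -r -t pd | all_pending
```

Subtracting the first number from the second gives a rough count of
pending jobs that are not yet eligible, as in the --begin=tomorrow
case. It is only rough because the two commands sample the queue at
slightly different times.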
Danny
Thanks in advance,
Chris