Danny Auble wrote on May, 02 10:04:12:
>
> On 05/02/2014 09:45 AM, Chris Harwell wrote:
> >max jobs? lightest weight method for determining number pending?
> >Hi,
> >
> >Just curious - what is the number of maximum pending jobs you have
> >seen where slurm still holds together? I think we had troubles
> >quite awhile back when numbers would get into 40-80k pd jobs, but
> >haven't looked recently and have also since moved the spool onto
> >SSD.
> I know there are sites running/pending 200k+ regularly with little

Are there any writeups about clusters doing numbers like that? Would be interested in the configuration to support that size of queue.

> issue. Depending on the box you are running the slurmctld on and
> the way you have things configured would make a difference.
> (backfill options have a large impact on how well things work or not
> for large job counts)
>
> >
> >Do you find anything like submitting the job in hold state or
> >dependencies to substantially reduce impact such that you could
> >have an order of magnitude more? Any other tricks?
> >
> >Is this still the best reference?
> >http://slurm.schedmd.com/high_throughput.html
> Yes
> >
> > Is the 14.x series any better? We're still using 2.6.7.
> Outrageously. There was a lot of work done to enhance the number of
> jobs the system could handle and look at in a timely fashion.
> >
> >Somewhat related, when you have a loaded up cluster, but still
> >want to monitor the number of pending jobs, what is the lightest
> >weight way to do that?
> >
> >I was thinking perhaps this?
> >/opt/slurmcl2/bin/sdiag | awk '/Last queue length:/ { print $4 }'
> >| head -1
> >
> >Though I usually just do this:
> >squeue -o '%A' -h -r -t pd | wc -l
> I would expect sdiag to be faster since it is only looking at and
> sending a small amount of data. squeue sends every job which could
> be very heavy.
>
> But I wouldn't expect sdiag to always give you the correct stat you
> are looking for. It only returns the jobs eligible to run. Perhaps
> that is what you want, but in the scenario of a 10 node system
>
> sbatch -N10 --exclusive --begin=tomorrow test.sh
> sbatch -N10 --exclusive test.sh
> sbatch -N10 --exclusive test.sh
> sbatch -N10 --exclusive test.sh
>
> You get this...
>
> sdiag | awk '/Last queue length:/ { print $4 }' | head -1
> 2
>
> squeue -o '%A' -h -r -t pd | wc -l
> 3
>
> Since sdiag isn't taking into account the job that isn't eligible.
> But perhaps this is exactly what you want.
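
For anyone who wants both numbers side by side, the two one-liners above can be wrapped in a small script. This is only a minimal sketch: it assumes `sdiag` and `squeue` are on `PATH` and that `sdiag` prints its "Last queue length:" line in the layout the awk pattern above expects. The helper names `last_queue_length` and `count_lines` are made up for illustration and are not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not Slurm commands.

# Read sdiag-style text on stdin and print the first "Last queue length"
# value. awk's "exit" stands in for the "| head -1" used in the thread.
last_queue_length() {
    awk '/Last queue length:/ { print $4; exit }'
}

# Read squeue-style text on stdin (one job ID per line) and print how many
# non-empty lines there are.
count_lines() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Query the controller only when both commands are actually installed.
if command -v sdiag >/dev/null 2>&1 && command -v squeue >/dev/null 2>&1; then
    eligible=$(sdiag | last_queue_length)
    pending=$(squeue -o '%A' -h -r -t pd | count_lines)
    echo "eligible=${eligible:-unknown} pending=${pending}"
fi
```

On the 10 node example above this should report 2 eligible and 3 pending, the gap being the `--begin=tomorrow` job. The `squeue` call still transfers every pending job, so for frequent polling on a loaded controller the `sdiag` half alone is the lighter choice.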
>
> Danny
> >
> >Thanks in advance,
> >Chris
>

--
Chris Scheller
Unix System Administrator
Department of Biostatistics
School of Public Health
University of Michigan
Phone: (734) 615-7439
Office: M4218
