There were changes in Slurm version 2.6 with respect to lock handling
which may affect this. If you are using an earlier version of Slurm,
that would be a reason to upgrade.
Quoting Paul Edmon <[email protected]>:
I will have to try a few of those tweaks to the configuration we
have. They may help a lot.
We are running on bare metal hardware, though we are logging quite a
bit, so that likely doesn't help.
I would say high throughput would be 100 jobs completing
simultaneously, and then it trying to schedule those cores again only
to have them become available again immediately. Essentially the
master gets so busy that it won't respond to any outside probing. The
only way to get any info is to watch the log roll by, as sdiag is
also unresponsive.
Again, we will have to try some of that machine tuning stuff. It
should be helpful.
-Paul Edmon-
On 1/26/2014 7:21 PM, Moe Jette wrote:
A great deal depends upon your hardware and configuration. Slurm
should be able to handle a few hundred jobs per second when tuned
for high throughput as described here:
http://slurm.schedmd.com/high_throughput.html
If not tuned for high throughput, say with lots of logging, running
on a virtual machine, etc., then the slurmctld daemon will
definitely bog down. What sort of throughput were you seeing? Did
the jobs just exit right away?
Moe Jette
SchedMD
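[As a rough illustration of the kind of slurm.conf tuning that guide
describes, a minimal sketch might look like the excerpt below. The
parameter names are real slurm.conf options, but the values are
placeholder assumptions rather than recommendations from this thread,
and not every option or value format is available in every Slurm
release:

    # Keep controller and slurmd logging light; verbose logging is a
    # common cause of slurmctld bogging down under high job throughput.
    SlurmctldDebug=error
    SlurmdDebug=error

    # Purge completed job records after five minutes and cap how many
    # jobs slurmctld keeps in memory at once.
    MinJobAge=300
    MaxJobCount=20000

    # Defer per-job scheduling at submit time and batch-schedule new
    # jobs instead, which helps when many short jobs start and exit.
    SchedulerParameters=defer,batch_sched_delay=10
]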
Quoting Paul Edmon <[email protected]>:
So I've found that if someone submits a ton of jobs that have a
very short runtime, Slurm tends to thrash, as jobs are launching and
exiting pretty much constantly. Is there an easy way to enforce a
minimum runtime?
-Paul Edmon-
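[The thread itself does not show a built-in minimum-runtime setting.
As one hedged illustration of how such a floor could be approximated
outside of Slurm, the sketch below wraps sbatch and rejects
submissions whose requested --time falls under an assumed ten-minute
minimum. The wrapper name, the ten-minute floor, the sbatch path, and
the simple --time parsing are all assumptions for illustration, and
jobs submitted by other means (srun, salloc, or the real sbatch
directly) would bypass it:

    #!/usr/bin/env python
    # Illustrative wrapper around sbatch: refuse jobs that request
    # less than a minimum walltime.  Install it under a different
    # name, or point SBATCH at the real binary's full path, so the
    # wrapper does not end up calling itself.
    import re
    import subprocess
    import sys

    MIN_MINUTES = 10            # assumed site policy, not a Slurm default
    SBATCH = "/usr/bin/sbatch"  # assumed location of the real sbatch

    def requested_minutes(args):
        # Return the requested --time in minutes, or None if it was not
        # given or uses a format this sketch does not parse.
        for i, arg in enumerate(args):
            if arg in ("-t", "--time"):
                value = args[i + 1] if i + 1 < len(args) else ""
            elif arg.startswith("--time="):
                value = arg.split("=", 1)[1]
            else:
                continue
            m = re.match(r"^(\d+)$", value)              # minutes
            if m:
                return int(m.group(1))
            m = re.match(r"^(\d+):(\d+):(\d+)$", value)  # hours:min:sec
            if m:
                hours, minutes, seconds = (int(x) for x in m.groups())
                return hours * 60 + minutes + (1 if seconds else 0)
            return None  # unrecognized form; let sbatch validate it
        return None

    def main():
        args = sys.argv[1:]
        minutes = requested_minutes(args)
        if minutes is not None and minutes < MIN_MINUTES:
            sys.exit("error: jobs must request at least %d minutes of "
                     "walltime (--time)" % MIN_MINUTES)
        # Pass everything else through to the real sbatch unchanged.
        sys.exit(subprocess.call([SBATCH] + args))

    if __name__ == "__main__":
        main()
]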