I will have to try a few of those tweaks to the configuration we have;
they may help a lot. We are running on bare metal hardware, though we are
logging quite a bit, so that likely doesn't help.
I would say high throughput for us is on the order of 100 jobs completing
simultaneously, then the scheduler trying to fill those cores again, only
to have them become available again almost immediately. Essentially the
master gets so busy that it won't respond to any outside probing. The only
way to get any info is to watch the log roll by, as sdiag is also
unresponsive.
Again, we will have to try some of that tuning; it should be helpful.
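
For the record, the sort of change I have in mind is roughly the following
in slurm.conf. The values are just placeholders I pulled from skimming that
high-throughput page, not anything we have tested here yet:

# Cut logging back while we test
SlurmctldDebug=info
SlurmdDebug=info
# How long completed job records stay in slurmctld memory (300 is the default)
MinJobAge=300
# Give RPCs longer before clients time out against a busy slurmctld
MessageTimeout=30
# Don't try to schedule each job individually at submit time
SchedulerParameters=defer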
-Paul Edmon-
On 1/26/2014 7:21 PM, Moe Jette wrote:
A great deal depends upon your hardware and configuration. Slurm
should be able to handle a few hundred jobs per second when tuned for
high throughput as described here:
http://slurm.schedmd.com/high_throughput.html
If it is not tuned for high throughput (say, with lots of logging, or
running on a virtual machine), then the slurmctld daemon will definitely bog
down. What sort of throughput were you seeing? Did the jobs just exit
right away?
Moe Jette
SchedMD
Quoting Paul Edmon <[email protected]>:
So I've found that if someone submits a ton of jobs that have a very
short runtime, Slurm tends to thrash, as jobs are launching and exiting
pretty much constantly. Is there an easy way to enforce a minimum
runtime?
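
To make the question concrete, the sort of enforcement I am picturing is
something like the little sbatch wrapper sketched below. The wrapper idea,
the script itself, and the 10-minute threshold are purely my own
illustration, not anything Slurm ships; a server-side submit filter (a
job_submit plugin, say) would presumably be a more robust place to do this.

#!/usr/bin/env python3
# Hypothetical sbatch wrapper that refuses jobs asking for a very short
# time limit.  The script, the wrapper idea and MIN_MINUTES are all
# assumptions for illustration; they are not part of Slurm.
import os
import sys

MIN_MINUTES = 10  # assumed site policy: refuse anything shorter than this


def requested_minutes(argv):
    """Return the --time/-t request in minutes, or None if absent/unparsable.

    Handles the common forms "M", "M:S", "H:M:S" and "D-H[:M[:S]]"; a real
    filter would need to cover every format sbatch accepts (e.g. UNLIMITED).
    """
    value = None
    for i, arg in enumerate(argv):
        if arg.startswith("--time="):
            value = arg.split("=", 1)[1]
        elif arg in ("--time", "-t") and i + 1 < len(argv):
            value = argv[i + 1]
    if value is None:
        return None
    try:
        days = 0
        rest = value
        if "-" in value:
            day_part, rest = value.split("-", 1)
            days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        if "-" in value:                      # D-H, D-H:M or D-H:M:S
            hours = parts[0]
            minutes = parts[1] if len(parts) > 1 else 0
        elif len(parts) == 3:                 # H:M:S
            hours, minutes = parts[0], parts[1]
        else:                                 # M or M:S
            hours, minutes = 0, parts[0]
    except ValueError:
        return None
    return days * 1440 + hours * 60 + minutes


def main():
    args = sys.argv[1:]
    minutes = requested_minutes(args)
    if minutes is not None and minutes < MIN_MINUTES:
        sys.exit("Requested time limit (%d min) is below the %d min site "
                 "minimum; please batch short tasks together."
                 % (minutes, MIN_MINUTES))
    # Otherwise hand the submission off to the real sbatch unchanged.
    os.execvp("sbatch", ["sbatch"] + args)


if __name__ == "__main__":
    main()

A wrapper like that would at least stop the flood at submission time,
though it obviously does nothing about jobs that request a long limit and
then exit right away.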
-Paul Edmon-