Most of the discussions have actually been about supporting a higher
job throughput rate, but those same changes would increase SLURM's
ability to handle larger job counts. None of that work has moved
beyond the discussion stage.
Moe
Quoting Chris Harwell <super...@gmail.com>:
We've wondered if that was the case. Is there any plan or willingness to
implement finer-grained locking?
On Jan 18, 2012 2:02 PM, "Moe Jette" <je...@schedmd.com> wrote:
We have held some discussions on this subject and it isn't simple to
resolve. The best way to do this would probably be to establish
finer-grained locking so there can be more parallelism, say by locking
individual job records rather than the entire job list. That would impact
quite a few sub-systems, for example how we preserve job state.
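Purely as an illustration of the idea (this is nothing like the real
slurmctld code, just a sketch of per-record versus whole-list locking):

/* Illustrative only -- not slurmctld's actual data structures or code. */
#include <pthread.h>
#include <stddef.h>

struct job_record {
    unsigned int       job_id;
    int                state;
    pthread_mutex_t    lock;        /* per-record lock */
    struct job_record *next;
};

/* Today, conceptually: one lock serializes all access to the job list. */
static pthread_mutex_t job_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct job_record *job_list = NULL;

/* Finer-grained version: hold the list lock only long enough to find the
 * record, then work under the per-job lock, so threads touching different
 * jobs no longer serialize on each other. */
static void update_job_state(unsigned int job_id, int new_state)
{
    struct job_record *j;

    pthread_mutex_lock(&job_list_lock);
    for (j = job_list; j != NULL; j = j->next)
        if (j->job_id == job_id)
            break;
    if (j != NULL)
        pthread_mutex_lock(&j->lock);
    pthread_mutex_unlock(&job_list_lock);

    if (j == NULL)
        return;
    j->state = new_state;
    pthread_mutex_unlock(&j->lock);
}

Extending that pattern across slurmctld is where it touches the other
sub-systems mentioned above.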
If you could submit a smaller number of jobs that each have many job
steps, that could address your problem today (say submitting 1000 jobs each
with 1000 steps).
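Roughly, each of those jobs could be a batch script that runs its work as
job steps, something like this (the command name, step count and resources
are placeholders to adapt to your workload):

#!/bin/bash
#SBATCH --ntasks=1

# Run the work as job steps of this single job; "./my_task" and the
# count of 1000 are placeholders.
for i in $(seq 1 1000); do
    srun --ntasks=1 ./my_task $i
done

Each srun launched inside the allocation becomes a job step rather than a
separate job, so the controller has far fewer job records to track.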
Moe Jette
Quoting Yuri D'Elia <wav...@thregr.org>:
Hi everyone. I'm trying to increase the number of jobs that can be queued
with SLURM. I'm submitting a lot of very small jobs (each taking about 10
minutes) in batches of ~100k. I would like to be able to queue around 500k
to 1M jobs if possible, but I'm having a very hard time going beyond 100k
with both 2.3.1 and 2.4.
As a test, I've raised MaxJobCount to 200000 and MessageTimeout to 60, and
reduced MinJobAge to 60. Of course, SchedulerParameters already includes
"defer", and I've also tried setting max_job_bf and interval to 10 and 600
respectively.
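For reference, the relevant slurm.conf lines currently look roughly like
this:

MaxJobCount=200000
MinJobAge=60
MessageTimeout=60
SchedulerParameters=defer,max_job_bf=10,interval=600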
After going beyond ~100k jobs, slurmctld becomes CPU-bound and starts to
time out on any request. I noticed that only one CPU is being used: is
there a way to spread the work across multiple CPUs?
Is there any other feature that affects performance? I'm using cons_res and
multifactor priority along with accounting, but I would gladly switch to a
simpler scheduler and fewer features if that would let me go beyond the
current limit (which still looks to be far below my target).
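To make that concrete, this is the kind of trade I have in mind in
slurm.conf (plugin names as I understand them from the docs; the accounting
backend line is just an example of what I'd drop):

# current, feature-rich setup
SelectType=select/cons_res
PriorityType=priority/multifactor
AccountingStorageType=accounting_storage/slurmdbd

# stripped-down alternative, if it would buy a higher job count
SelectType=select/linear
PriorityType=priority/basic
AccountingStorageType=accounting_storage/none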
Thanks again.