We have held some discussions on this subject and it isn't simple to
resolve. The best approach would probably be to establish
finer-grained locking so there can be more parallelism, say by locking
individual job records rather than the entire job list. That would
impact quite a few sub-systems, for example how we preserve job state.
If you could submit a smaller number of jobs that each have many job
steps, that could address your problem today (say, submitting 1000
jobs with 1000 steps each).
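For what it's worth, here is a minimal sketch of that pattern; the
program name "my_task" and the step count are placeholders, so adjust
them for your workload:

    #!/bin/sh
    #SBATCH --ntasks=1
    # One allocation, many job steps: each srun below launches a
    # separate step inside this job, so 1000 short tasks occupy only
    # one slot in the job queue. "my_task" stands in for your real
    # program.
    for i in $(seq 1 1000); do
        srun --ntasks=1 ./my_task "$i"
    done

Submitting 1000 copies of a script like this covers the same million
tasks while keeping the queue at 1000 job records.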
Moe Jette
Quoting Yuri D'Elia <wav...@thregr.org>:
Hi everyone. I'm trying to increase the number of jobs that can be
queued with SLURM. I'm submitting a lot of very small jobs (around
10 minutes each) in batches of ~100k. I would like to be able to
queue around 500k to 1m jobs if possible, but I'm having a very hard
time going beyond 100k with both 2.3.1 and 2.4.
As a test, I've raised MaxJobCount to 200000, MessageTimeout to 60,
and reduced MinJobAge to 60. Of course, SchedulerParameters already
includes "defer", and I've also tried setting max_job_bf and interval
(to 10 and 600, respectively).
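Roughly, the relevant slurm.conf lines look like this (a sketch based
on the values above, using the 2.3/2.4 parameter names; exact syntax
may differ on your setup):

    # Settings described above (SLURM 2.3/2.4 era parameter names).
    MaxJobCount=200000
    MessageTimeout=60
    MinJobAge=60
    # "defer" skips per-job scheduling at submit time; max_job_bf and
    # interval throttle the backfill scheduler.
    SchedulerParameters=defer,max_job_bf=10,interval=600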
After going beyond ~100k jobs, slurmctld becomes CPU-bound and
starts to time out on any request. I noticed that only one CPU is
used: maybe there is a way to split the work across multiple CPUs?
Is there any other feature that affects performance? I'm using
cons_res and multifactor priority along with accounting, but I would
gladly use a simpler scheduler and fewer features if I could go
beyond the current limit (which still looks pretty far below my
target).
Thanks again.