I'm also interested in this, as I've only ever seen a single slurmctld thread, and only ever pegged at 100% of one core. It would be good if making Slurm multithreaded were on the roadmap. We expect hundreds of thousands of jobs in flight with our configuration, so we need something that can take that load (see the notes following Alan's quoted message below).
-Paul Edmon-

On 06/12/2013 12:30 PM, Alan V. Cowles wrote:
> Hey Guys,
>
> I've seen a few references to slurmctld as a multithreaded process, but
> it doesn't seem to behave that way.
>
> We had a user submit 18,000 jobs to our cluster (512 slots). Slurm shows
> all 512 slots fully loaded, those jobs running, and about 9,800 jobs
> pending, but her submissions started throwing errors around job 16500:
>
> Submitted batch job 16589
> Submitted batch job 16590
> Submitted batch job 16591
> sbatch: error: Slurm temporarily unable to accept job, sleeping and
> retrying.
> sbatch: error: Batch job submission failed: Resource temporarily
> unavailable.
>
> What we noticed at the time on our master host is that slurmctld was
> regularly pegging one CPU at 100% and had paged 16GB of virtual memory,
> while all the other CPUs were completely idle.
>
> We wondered whether the control daemon maxing out is what led to the
> submission failures. We haven't found any limits set anywhere for any
> specific job or user, and wondered if perhaps we missed a configure
> option in our original install.
>
> Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.
>
> AC
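For what it's worth, the symptoms above look less like a per-user limit and more like slurmctld's MaxJobCount ceiling: sbatch prints "Slurm temporarily unable to accept job, sleeping and retrying" when the controller refuses new jobs with EAGAIN, and MaxJobCount defaults to 10000, which is roughly the ~10,300 jobs (512 running plus ~9,800 pending) in the system when the errors started. A quick way to check, as a sketch using standard scontrol/sacctmgr invocations (the sacctmgr check only applies if accounting is configured):

    # Show the controller's active-job ceiling (default 10000)
    scontrol show config | grep -i MaxJobCount

    # Rule out per-user association limits (requires accounting)
    sacctmgr show assoc format=user,maxjobs,maxsubmitjobs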
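If that is the bottleneck, the ceiling can be raised in slurm.conf. A minimal sketch, with values that are illustrative assumptions for a workload in the hundreds of thousands of jobs, not tested recommendations (slurmctld's memory footprint grows with MaxJobCount, so size it against available RAM):

    # slurm.conf excerpt -- hypothetical values, adjust for your site
    MaxJobCount=200000   # max jobs slurmctld keeps in its active database
    MinJobAge=120        # seconds a completed job stays in memory (default 300)

If memory serves, MaxJobCount only takes effect on a slurmctld restart, not via scontrol reconfigure, so plan for a brief controller outage.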
