Alan, are you using the port range option on SlurmctldPort (e.g., SlurmctldPort=6817-6818) in slurm.conf (http://slurm.schedmd.com/slurm.conf.html)?
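For reference, the port-range form mentioned above looks like the following in slurm.conf. This is only a sketch with illustrative port numbers; per the slurm.conf documentation, giving slurmctld a range of listening ports lets the slurmd daemons spread their RPC traffic across them, which can help under heavy load:

```
# slurm.conf (sketch; port numbers are illustrative)
# With a range, slurmd daemons pick a port from the range for
# their messages to slurmctld, spreading the RPC load across
# multiple listening sockets.
SlurmctldPort=6817-6818
```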
On Wed, Jun 12, 2013 at 9:55 AM, Alan V. Cowles <[email protected]> wrote:
>
> Under the Data Objects section on the following page
> http://slurm.schedmd.com/selectplugins.html we find the statement:
>
> "Slurmctld is a multi-threaded program with independent read and write
> locks on each data structure type."
>
> Which is what led me to believe it's there, and that we perhaps missed a
> configuration option.
>
> AC
>
> On 06/12/2013 12:43 PM, Paul Edmon wrote:
> > I'm also interested in this, as I've only ever seen one slurmctld, and
> > only at 100%. It would be good if making slurm multithreaded were on the
> > path for the future. I know we will have 100,000s of jobs in flight
> > for our config, so it would be good to have something that can take that
> > load.
> >
> > -Paul Edmon-
> >
> > On 06/12/2013 12:30 PM, Alan V. Cowles wrote:
> >> Hey Guys,
> >>
> >> I've seen a few references to slurmctld as a multithreaded process,
> >> but it doesn't seem that way.
> >>
> >> We had a user submit 18000 jobs to our cluster (512 slots). It shows
> >> all 512 slots fully loaded, shows those jobs running, and shows about
> >> 9800 currently pending, but her submission threw errors around job
> >> 16500:
> >>
> >> Submitted batch job 16589
> >> Submitted batch job 16590
> >> Submitted batch job 16591
> >> sbatch: error: Slurm temporarily unable to accept job, sleeping and
> >> retrying.
> >> sbatch: error: Batch job submission failed: Resource temporarily
> >> unavailable.
> >>
> >> The thing we noticed at the time on our master host is that slurmctld
> >> was regularly pegged at 100% on one CPU and had paged 16GB of virtual
> >> memory, while all other CPUs were completely idle.
> >>
> >> We wondered whether the control daemon maxing out is what led to the
> >> submission failures, as we haven't found any limits set anywhere on any
> >> specific job or user, and wondered if perhaps we missed a configure
> >> option for this when we did our original install.
> >>
> >> Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.
> >>
> >> AC
>
