Have you looked at http://slurm.schedmd.com/high_throughput.html
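[Editor's note: the settings discussed downthread would look roughly like the following in slurm.conf. The values are illustrative, not recommendations; MaxJobCount is the parameter behind the "create_job_record: job_count exceeds limit" error, and you can check its current value with `scontrol show config`.]

```
# slurm.conf -- illustrative sketch, tune values for your site
# Let slurmctld listen on a range of ports so more RPCs can be
# serviced concurrently (one listening thread per port).
SlurmctldPort=6817-6824
SlurmdPort=6825
# Cap on jobs slurmctld keeps in memory at once; submissions beyond
# it are rejected with "Resource temporarily unavailable".
MaxJobCount=50000
```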
On Jun 12, 2013, at 11:09 AM, Alan V. Cowles <[email protected]> wrote:

> Another thing I just noticed while running through the logs: apparently there
> is a limit of some type in play, though I don't know what determines the
> value, or whether it reflects the status of slurmctld:
>
> Jun 12 13:51:17 slurmctld[10910]: error: create_job_record: job_count exceeds limit
> Jun 12 13:51:17 slurmctld[10910]: _slurm_rpc_submit_batch_job: Resource temporarily unavailable
>
> We have a maintenance outage scheduled for Monday, so perhaps I will be able
> to make the port-range changes, but I would like to figure out where this
> limit is specified as well.
>
> AC
>
> On 06/12/2013 02:00 PM, Ralph Castain wrote:
>> Worth giving it a try, I'd say.
>>
>> On Jun 12, 2013, at 10:54 AM, Alan V. Cowles <[email protected]> wrote:
>>
>>> The machine running the daemon is actually a beefy machine we acquired for
>>> another purpose that later fell through, so we decided to use it here; it
>>> has 16 physical cores. If we set a port range of, say, 8 ports (6817-6824)
>>> and moved slurmd to 6825, would that make a significant difference?
>>>
>>> AC
>>>
>>> On 06/12/2013 01:52 PM, Ralph Castain wrote:
>>>> Not isolating, but blocking. If you have more ports, I believe it will add
>>>> more threads to listen on those ports. Each RPC received blocks until it
>>>> completes, so having more ports should improve throughput.
>>>>
>>>> On Jun 12, 2013, at 10:03 AM, "Alan V. Cowles" <[email protected]> wrote:
>>>>
>>>>> No, we have it set exclusively to 6817, and SlurmdPort two lines later
>>>>> to 6818.
>>>>>
>>>>> Is it isolating to processors based on the incoming port?
>>>>>
>>>>> AC
>>>>>
>>>>> On 06/12/2013 01:00 PM, Lyn Gerner wrote:
>>>>>> Alan, are you using the port range option on SlurmctldPort (e.g.,
>>>>>> SlurmctldPort=6817-6818) in slurm.conf?
>>>>>>
>>>>>> On Wed, Jun 12, 2013 at 9:55 AM, Alan V. Cowles <[email protected]> wrote:
>>>>>>
>>>>>> Under the Data Objects section of the following page,
>>>>>> http://slurm.schedmd.com/selectplugins.html, we find the statement:
>>>>>>
>>>>>> "Slurmctld is a multi-threaded program with independent read and write
>>>>>> locks on each data structure type."
>>>>>>
>>>>>> That is what led me to believe it's multithreaded, and that we perhaps
>>>>>> missed a configuration option.
>>>>>>
>>>>>> AC
>>>>>>
>>>>>> On 06/12/2013 12:43 PM, Paul Edmon wrote:
>>>>>> > I'm also interested in this, as I've only ever seen one slurmctld, and
>>>>>> > only at 100%. It would be good if making Slurm multithreaded were on
>>>>>> > the path for the future. I know we will have 100,000's of jobs in
>>>>>> > flight for our config, so it would be good to have something that can
>>>>>> > take that load.
>>>>>> >
>>>>>> > -Paul Edmon-
>>>>>> >
>>>>>> > On 06/12/2013 12:30 PM, Alan V. Cowles wrote:
>>>>>> >> Hey Guys,
>>>>>> >>
>>>>>> >> I've seen a few references to slurmctld as a multithreaded process,
>>>>>> >> but it doesn't seem that way.
>>>>>> >>
>>>>>> >> We had a user submit 18000 jobs to our cluster (512 slots). It shows
>>>>>> >> the 512 slots fully loaded with those jobs running, and about 9800
>>>>>> >> jobs currently pending, but her submission threw errors at around
>>>>>> >> job 16500:
>>>>>> >>
>>>>>> >> Submitted batch job 16589
>>>>>> >> Submitted batch job 16590
>>>>>> >> Submitted batch job 16591
>>>>>> >> sbatch: error: Slurm temporarily unable to accept job, sleeping and
>>>>>> >> retrying.
>>>>>> >> sbatch: error: Batch job submission failed: Resource temporarily
>>>>>> >> unavailable.
>>>>>> >>
>>>>>> >> The thing we noticed at that point on our master host is that
>>>>>> >> slurmctld was regularly pegging one CPU at 100% and had paged 16GB of
>>>>>> >> virtual memory, while all the other CPUs were completely idle.
>>>>>> >>
>>>>>> >> We wondered whether the pegging of the control daemon is what led to
>>>>>> >> the submission failures, as we haven't found any limits set anywhere
>>>>>> >> for any specific job or user, and wondered whether perhaps we missed
>>>>>> >> a configure option when we did our original install.
>>>>>> >>
>>>>>> >> Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.
>>>>>> >>
>>>>>> >> AC
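[Editor's note: as the log excerpt above shows, sbatch itself sleeps and retries when the controller is saturated. For bulk submissions driven by a script, the same idea can be sketched client-side as exponential backoff. This is a hypothetical helper, not part of Slurm; `fake_sbatch` stands in for an actual `sbatch` invocation.]

```python
import time

def submit_with_backoff(submit, max_retries=5, base_delay=1.0):
    """Call `submit` until it succeeds, backing off exponentially.

    `submit` is any zero-argument callable that raises RuntimeError
    when the controller answers "Resource temporarily unavailable".
    """
    for attempt in range(max_retries):
        try:
            return submit()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt)

# Stand-in for running `sbatch job.sh`: fails twice, then succeeds.
calls = {"n": 0}
def fake_sbatch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Resource temporarily unavailable")
    return "Submitted batch job 16592"

print(submit_with_backoff(fake_sbatch, base_delay=0.01))
```

Spacing submissions out this way relieves pressure on slurmctld's single RPC-handling hot path, which is the bottleneck described in the original message.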
