Have you looked at

http://slurm.schedmd.com/high_throughput.html
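
In particular, I believe the "job_count exceeds limit" error maps to the
MaxJobCount setting in slurm.conf (its default is 10000 job records), and that
page also covers the SlurmctldPort range discussed below. A sketch, with
illustrative values rather than recommendations:

```
# slurm.conf -- illustrative values, assuming the "job_count exceeds
# limit" error comes from hitting MaxJobCount (default 10000)
MaxJobCount=50000
# Seconds a completed job stays in slurmctld's tables before purging;
# lowering this (default 300) frees job records sooner
MinJobAge=120
# A port range lets slurmctld run more listener threads
SlurmctldPort=6817-6824
SlurmdPort=6825
```

Note that MaxJobCount counts all job records still held by slurmctld
(pending, running, and recently completed), not just running jobs.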


On Jun 12, 2013, at 11:09 AM, Alan V. Cowles <[email protected]> wrote:

> Another thing I just noticed while going through the logs is that there is 
> apparently a limit of some kind in play, though I don't know what determines 
> its value, or whether it is in fact tied to the state of slurmctld:
> 
> Jun 12 13:51:17 slurmctld[10910]: error: create_job_record: job_count exceeds limit
> Jun 12 13:51:17 slurmctld[10910]: _slurm_rpc_submit_batch_job: Resource temporarily unavailable
> 
> We have a maintenance outage scheduled for Monday, so perhaps I will be able 
> to make the port-range changes, but I would like to figure out where this 
> limit is specified as well.
> 
> AC
> 
> On 06/12/2013 02:00 PM, Ralph Castain wrote:
>> Worth giving it a try, I'd say
>> 
>> On Jun 12, 2013, at 10:54 AM, Alan V. Cowles <[email protected]> wrote:
>> 
>>> Our machine running the daemon is actually a beefy machine we acquired for 
>>> another purpose that later fell through, so we decided to use it here; it 
>>> has 16 physical cores. If we set a port range of, say, 8 ports (6817-6824), 
>>> and made slurmd 6825, would that make a significant difference?
>>> 
>>> AC
>>> 
>>> On 06/12/2013 01:52 PM, Ralph Castain wrote:
>>>> Not isolating, but blocking. If you have more ports, I believe it will add 
>>>> more threads to listen on those ports. Each RPC received blocks until it 
>>>> completes, so having more ports should improve throughput.
>>>> 
>>>> 
>>>> On Jun 12, 2013, at 10:03 AM, "Alan V. Cowles" <[email protected]> 
>>>> wrote:
>>>> 
>>>>> No, we have it set exclusively to 6817, and SlurmdPort two lines later 
>>>>> to 6818.
>>>>> 
>>>>> Is it isolating to processors based on the incoming port?
>>>>> 
>>>>> AC
>>>>> 
>>>>> On 06/12/2013 01:00 PM, Lyn Gerner wrote:
>>>>>> Alan, are you using the port range option on SlurmctldPort (e.g., 
>>>>>> SlurmctldPort=6817-6818) in slurm.conf?
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jun 12, 2013 at 9:55 AM, Alan V. Cowles <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>> Under the Data Objects section on the following page
>>>>>> http://slurm.schedmd.com/selectplugins.html we find the statement:
>>>>>> 
>>>>>> "Slurmctld is a multi-threaded program with independent read and write
>>>>>> locks on each data structure type."
>>>>>> 
>>>>>> Which is what led me to believe it is multithreaded, and that we perhaps 
>>>>>> missed a configuration option.
>>>>>> 
>>>>>> AC
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 06/12/2013 12:43 PM, Paul Edmon wrote:
>>>>>> > I'm also interested in this, as I've only ever seen one slurmctld
>>>>>> > process, and only at 100%.  It would be good if making slurm
>>>>>> > multithreaded were on the roadmap.  I know we will have 100,000s of
>>>>>> > jobs in flight for our config, so it would be good to have something
>>>>>> > that can take that load.
>>>>>> >
>>>>>> > -Paul Edmon-
>>>>>> >
>>>>>> > On 06/12/2013 12:30 PM, Alan V. Cowles wrote:
>>>>>> >> Hey Guys,
>>>>>> >>
>>>>>> >> I've seen a few references to the slurmctld as a multithreaded process
>>>>>> >> but it doesn't seem that way.
>>>>>> >>
>>>>>> >> We had a user submit 18000 jobs to our cluster (512 slots); it shows
>>>>>> >> 512 slots fully loaded, shows those jobs running, and shows about 9800
>>>>>> >> currently pending, but her submissions started throwing errors around
>>>>>> >> job 16500.
>>>>>> >>
>>>>>> >> Submitted batch job 16589
>>>>>> >> Submitted batch job 16590
>>>>>> >> Submitted batch job 16591
>>>>>> >> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
>>>>>> >> sbatch: error: Batch job submission failed: Resource temporarily unavailable.
>>>>>> >>
>>>>>> >> The thing we noticed at the time on our master host is that slurmctld
>>>>>> >> was regularly pegged at 100% on one CPU and had paged 16GB of virtual
>>>>>> >> memory, while all the other CPUs were completely idle.
>>>>>> >>
>>>>>> >> We wondered whether the pegged control daemon is what led to the
>>>>>> >> submission failures, as we haven't found any limits set anywhere for
>>>>>> >> any specific job or user, and wondered if perhaps we missed a
>>>>>> >> configure option when we did our original install.
>>>>>> >>
>>>>>> >> Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.
>>>>>> >>
>>>>>> >> AC
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
