Another thing I just noticed while running through the logs is that
there is apparently a limit of some type in play, though I don't know
what determines its value, or if it is in fact tied to the status of
slurmctld:
Jun 12 13:51:17 slurmctld[10910]: error: create_job_record: job_count exceeds limit
Jun 12 13:51:17 slurmctld[10910]: _slurm_rpc_submit_batch_job: Resource temporarily unavailable
We have a maintenance outage scheduled for Monday, so perhaps I will be
able to make the port-range changes then, but I would also like to
figure out where this limit is specified.
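If I'm reading the source right, the ceiling behind that "job_count
exceeds limit" message may be the MaxJobCount parameter in slurm.conf
(I believe the default is 10000 job records held by slurmctld); this is
my assumption, not something I've confirmed yet. A sketch of the change
I'd try:

```
# slurm.conf on the controller
# Assumption: the log's limit is MaxJobCount (default 10000 records).
# Raising it costs controller memory; I believe slurmctld needs a
# restart (not just a reconfigure) to pick this up.
MaxJobCount=50000
```

If anyone knows for certain which knob this is, corrections welcome.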
AC
On 06/12/2013 02:00 PM, Ralph Castain wrote:
Worth giving it a try, I'd say
On Jun 12, 2013, at 10:54 AM, Alan V. Cowles <[email protected]> wrote:
The machine running the daemon is actually a beefy machine we acquired
for another purpose that later fell through, so we decided to use it
here; it has 16 physical cores. If we set a port range of, say, 8
ports... 6817-6824, and moved slurmd to 6825, would that make a
significant difference?
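For the record, a sketch of what I'm proposing in slurm.conf (the exact
values are just my guess at a sane layout, and I haven't tested it):

```
# slurm.conf sketch (proposal, untested): give slurmctld a range of 8
# listening ports and move slurmd out of that range so they don't
# collide. Needs to be identical on the controller and all nodes, and
# I assume the daemons need a restart to pick it up.
SlurmctldPort=6817-6824
SlurmdPort=6825
```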
AC
On 06/12/2013 01:52 PM, Ralph Castain wrote:
Not isolating, but blocking. If you have more ports, I believe it
will add more threads to listen on those ports. Each RPC received
blocks until it completes, so having more ports should improve
throughput.
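One quick way to sanity-check how many threads the daemon actually has
is to look at its lightweight-process count with ps. A sketch (I use
the current shell's PID as a stand-in here; on your controller you'd
substitute slurmctld's PID):

```shell
# Show the thread (LWP) count for a process via ps's nlwp field.
# Stand-in: the current shell's PID; on the controller, use
#   pid=$(pidof slurmctld)
# instead to see how many listener threads slurmctld is running.
pid=$$
threads=$(ps -o nlwp= -p "$pid" | tr -d ' ')
echo "PID $pid has $threads thread(s)"
```

Run it before and after widening the port range and compare.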
On Jun 12, 2013, at 10:03 AM, "Alan V. Cowles" <[email protected]> wrote:
No, we have it set exclusively to 6817, with SlurmdPort two lines later
set to 6818.
Is it isolating work to processors based on the incoming port?
AC
On 06/12/2013 01:00 PM, Lyn Gerner wrote:
Re: [slurm-dev] Re: Slurmctld multithreaded?
Alan, are you using the port range option on SlurmctldPort (e.g.,
SlurmctldPort=6817-6818) in slurm.conf
<http://slurm.schedmd.com/slurm.conf.html>?
On Wed, Jun 12, 2013 at 9:55 AM, Alan V. Cowles
<[email protected]> wrote:
Under the Data Objects section of the following page,
http://slurm.schedmd.com/selectplugins.html, we find the statement:

"Slurmctld is a multi-threaded program with independent read and write
locks on each data structure type."

Which is what led me to believe it's there, and that we perhaps missed
a configuration option.
AC
On 06/12/2013 12:43 PM, Paul Edmon wrote:
> I'm also interested in this, as I've only ever seen one slurmctld and
> only at 100%. It would be good if making Slurm multithreaded were on
> the path for the future. I know we will have 100,000's of jobs in
> flight for our config, so it would be good to have something that can
> take that load.
>
> -Paul Edmon-
>
> On 06/12/2013 12:30 PM, Alan V. Cowles wrote:
>> Hey Guys,
>>
>> I've seen a few references to slurmctld as a multithreaded process,
>> but it doesn't seem that way.
>>
>> We had a user submit 18000 jobs to our cluster (512 slots), and it
>> shows 512 fully loaded, shows those jobs running, and shows about
>> 9800 currently pending, but her submission started throwing errors
>> around job 16500.
>>
>> Submitted batch job 16589
>> Submitted batch job 16590
>> Submitted batch job 16591
>> sbatch: error: Slurm temporarily unable to accept job, sleeping and
>> retrying.
>> sbatch: error: Batch job submission failed: Resource temporarily
>> unavailable.
>>
>> The thing we noticed at the time on our master host is that slurmctld
>> was regularly pegged at 100% on one CPU and had paged 16GB of virtual
>> memory, while all the other CPUs were completely idle.
>>
>> We wondered if the pegging out of the control daemon is what led to
>> the submission failure, as we haven't found any limits set anywhere
>> for any specific job or user, and wondered if perhaps we missed a
>> configure option when we did our original install.
>>
>> Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.
>>
>> AC