Thanks very much for these suggestions - I've set a value for max_rpc_cnt
and we should see soon if this helps.
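
For reference, a minimal sketch of the kind of slurm.conf change being
discussed. It assumes the max_rpc_cnt=16 that Paul quotes below (not
necessarily the value set here) and that no other SchedulerParameters
options are in use; any existing options, such as the bf_yield_* settings,
would need to stay in the same comma-separated list:

```
# slurm.conf (fragment) - illustrative only
# The value 16 is taken from Paul's message below; tune it for your own site.
SchedulerParameters=max_rpc_cnt=16
```

Running "scontrol reconfigure" afterwards should make slurmctld re-read the
file.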

Cheers

Stuart


On 27/03/15 14:09, Paul Edmon wrote:
> 
> So we have had the same problem, usually due to the scheduler receiving tons
> of requests.  Usually this is fixed by having the scheduler slow itself down
> with the defer or max_rpc_cnt options.  We in particular use max_rpc_cnt=16.
> I actually did a test yesterday where I removed this and let it run without
> defer, as I was hoping that the newer version of slurm would be able to cope
> without it.  In our environment the thread count saturated and slurm started
> to drop requests all over the place.  This even caused some node_fail
> messages, as the nodes themselves couldn't talk to the master because it was
> so busy.
> 
> It sounds like your problem is similar to ours in that we just have too much
> traffic and demand hitting the master (we run with about 1000 cores and 800
> users hitting the scheduler).  So I advise using defer or max_rpc_cnt.
> max_rpc_cnt is good because when things are less busy the scheduler
> automatically switches out of the defer state and starts to schedule things
> more quickly, so it's adaptive.
> 
> However, an even better thing would be to have slurm auto-throttle itself
> more intelligently (max_rpc_cnt is a somewhat simple way of doing this) when
> it receives tons of traffic due to completions, user queries or just general
> busyness.  The threads tend to stack up when you have a bunch of blocking
> requests.  sdiag can help you track some of that back, but some can't be
> prevented.  So it would be good, at least from the developer end, if the
> number of blocking requests could be reduced.  In particular, in our case it
> would be good if user requests for information were answered with an
> intelligent busy signal and an estimate of return time, as our users get a
> bit punchy if requests for information don't return immediately.
> 
> Regardless, I would look at defer or max_rpc_cnt.
> 
> -Paul Edmon-
> 
> On 03/27/2015 07:32 AM, Mehdi Denou wrote:
>> Hi,
>>
>> Using gdb you can find out which thread owns the locks on the slurmctld
>> internal structures (and blocks all the others).
>> Then it will be easier to understand what is happening.
>>
>> On 27/03/2015 12:24, Stuart Rankin wrote:
>>> Hi,
>>>
>>> I am running slurm 14.11.4 on an 800-node RHEL6.6 general-purpose
>>> University cluster. Since upgrading from 14.03.3 we have been seeing the
>>> following problem and I'd appreciate any advice (maybe it's a bug, but
>>> maybe I'm missing something obvious).
>>>
>>> Occasionally the number of slurmctld threads starts to rise rapidly until
>>> it hits the hard-coded limit of 256 and stays there. The threads are in the
>>> futex_ state according to ps and logging stops (nothing out of the ordinary
>>> leaps out in the log before this happens). Naturally slurm clients then
>>> start failing with timeout messages (which isn't trivial, since it is
>>> causing some not very resilient user pipelines to fail). This condition has
>>> persisted for several hours during the night without being detected.
>>> However there is a simple workaround, which is to send a STOP signal to the
>>> slurmctld process, wait a few seconds, then resume it - this clears the
>>> logjam. Merely attaching a debugger has the same effect!
>>>
>>> I feel this must be a clue as to the root cause. I have already tried
>>> setting CommitDelay in slurmdbd.conf, increasing MessageTimeout, setting
>>> HealthCheckNodeState=CYCLE and decreasing/increasing
>>> bf_yield_interval/bf_yield_sleep without any apparent impact (please see
>>> slurm.conf attached).
>>>
>>> Any advice would be gratefully received.
>>>
>>> Many thanks -
>>>
>>> Stuart
>>>
>>>

-- 
Dr. Stuart Rankin

Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517
