So we have had the same problem, usually caused by the scheduler receiving
tons of requests. It is usually fixed by having the scheduler slow
itself down using the defer or max_rpc_cnt options. We in particular
use max_rpc_cnt=16. I actually ran a test yesterday where I removed
this and let it go without defer, hoping that the newer version
of Slurm would be able to run without it. In our environment the thread
count saturated and Slurm started dropping requests all over the place.
This even caused some node_fail messages, because the nodes themselves
couldn't talk to the master while it was so busy.
It sounds like your problem is similar to ours: too much traffic and
demand hitting the master (we run with about 1000 cores and 800 users
hitting the scheduler). So I advise using defer or max_rpc_cnt.
max_rpc_cnt is particularly good because when things are less busy the
scheduler automatically drops out of the deferred state and starts
scheduling more quickly, so it's adaptive.
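For reference, both throttling variants are set through SchedulerParameters in slurm.conf; a minimal sketch of the two approaches discussed (16 is just the value we happen to use):

```
# slurm.conf - pick one of the two approaches:

# (a) Always defer scheduling at job submit time; jobs are only
#     started by the periodic scheduling passes:
SchedulerParameters=defer

# (b) Defer only while slurmctld is busy: if more than 16 RPCs are
#     pending, the scheduler backs off, and it automatically resumes
#     normal operation once the backlog drains:
SchedulerParameters=max_rpc_cnt=16
```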
However, an even better solution would be for Slurm to throttle itself
more intelligently (max_rpc_cnt is a somewhat simple way of doing this)
when it receives a ton of traffic from job completions, user queries,
or just general busyness. The threads tend to stack up when you have a
bunch of blocking requests. sdiag can help you track some of that back,
but some of it can't be prevented. So it would be good, at least from
the developer end, if the number of blocking requests could be reduced.
In particular, in our case it would help if user requests for
information were answered with an intelligent busy signal and an
estimated return time, as our users get a bit punchy when requests for
information don't return immediately.
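To see which RPCs are piling up, sdiag breaks the controller's traffic down by message type; a rough way to sample the current rate (assuming you run this on the controller host) is:

```shell
# Show cumulative RPC counts per message type, plus scheduler
# thread and agent statistics:
sdiag

# Reset the counters, wait a minute, and sample again to see which
# message types dominate right now:
sdiag --reset
sleep 60
sdiag
```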
Regardless, I would look at defer or max_rpc_cnt.
-Paul Edmon-
On 03/27/2015 07:32 AM, Mehdi Denou wrote:
Hi,
Using gdb you can find which thread owns the locks on the slurmctld
internal structures (and blocks all the others).
Then it will be easier to understand what is happening.
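A non-interactive way to do this - dump every slurmctld thread's stack so you can see who holds the locks - might look like the following sketch (the output path and PID lookup are assumptions about your setup):

```shell
# Attach gdb briefly to the running slurmctld, dump all thread
# backtraces, and detach.  Threads parked in pthread_mutex_lock /
# futex_wait are the waiters; the thread that is NOT waiting on a
# lock is usually the one holding it.
gdb -p "$(pidof slurmctld)" -batch -ex 'thread apply all bt' \
    > /tmp/slurmctld_backtraces.txt 2>&1
```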
On 27/03/2015 12:24, Stuart Rankin wrote:
Hi,
I am running Slurm 14.11.4 on an 800-node RHEL6.6 general-purpose
University cluster. Since upgrading from 14.03.3 we have been seeing
the following problem and I'd appreciate any advice (maybe it's a bug,
but maybe I'm missing something obvious).
Occasionally the number of slurmctld threads starts to rise rapidly
until it hits the hard-coded limit of 256 and stays there. The threads
are in the futex_ state according to ps, and logging stops (nothing out
of the ordinary leaps out in the log before this happens). Naturally,
Slurm clients then start failing with timeout messages (which isn't
trivial, since it causes some not-very-resilient user pipelines to
fail). This condition has persisted for several hours during the night
without being detected. However, there is a simple workaround: send a
STOP signal to the slurmctld process, wait a few seconds, then resume
it - this clears the logjam. Merely attaching a debugger has the same
effect!
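The workaround above can be scripted with SIGSTOP/SIGCONT; a minimal sketch, demonstrated here on a dummy sleep process (on a real system you would substitute slurmctld's PID, e.g. from pidof slurmctld):

```shell
#!/bin/sh
# Stand-in for the wedged slurmctld; use its real PID in practice.
sleep 60 &
pid=$!

kill -STOP "$pid"      # freeze the process (all threads stop at once)
ps -o stat= -p "$pid"  # state now starts with 'T' (stopped)
sleep 2                # "wait a few seconds"
kill -CONT "$pid"      # resume; on slurmctld this clears the thread logjam
sleep 1
ps -o stat= -p "$pid"  # state is back to 'S' (sleeping)

kill "$pid"            # clean up the dummy process
```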
I feel this must be a clue to the root cause. I have already tried
setting CommitDelay in slurmdbd.conf, increasing MessageTimeout,
setting HealthCheckNodeState=CYCLE, and decreasing/increasing
bf_yield_interval/bf_yield_sleep, without any apparent impact (please
see the attached slurm.conf).
Any advice would be gratefully received.
Many thanks -
Stuart