We have had the same problem, usually caused by the scheduler receiving a flood of requests. It is usually fixed by having the scheduler slow itself down using the defer or max_rpc_cnt scheduler options; we in particular use max_rpc_cnt=16. I actually ran a test yesterday where I removed this and let it go without defer, hoping that the newer version of Slurm would be able to run without it. In our environment the thread count saturated and Slurm started dropping requests all over the place. This even caused some node_fail messages, as the nodes themselves couldn't talk to the master because it was so busy.

It sounds like your problem is similar to ours in that there is simply too much traffic and demand hitting the master (we run with about 1000 cores and 800 users hitting the scheduler), so I advise using defer or max_rpc_cnt. max_rpc_cnt is nice because when things are less busy the scheduler automatically comes out of the defer state and starts scheduling more quickly, so it's adaptive.
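For reference, this is roughly what that looks like in slurm.conf (a sketch based on our setup; the value will need tuning per site):

   # slurm.conf: throttle scheduling when RPCs start backing up
   SchedulerParameters=max_rpc_cnt=16
   # or, to always skip the scheduling attempt at submit time:
   # SchedulerParameters=defer

followed by an "scontrol reconfigure" to pick up the change.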

That said, an even better solution would be for Slurm to throttle itself more intelligently (max_rpc_cnt is a fairly simple way of doing this) when it receives a ton of traffic from job completions, user queries, or just general busyness. The threads tend to stack up when you have a bunch of blocking requests. sdiag can help you track some of that back, but some of it can't be prevented, so it would be good, at least on the developer end, if the number of blocking requests could be reduced. In particular, in our case it would help if user requests for information were answered with an intelligent busy signal and an estimated return time, since our users get a bit punchy when requests for information don't return immediately.
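For example, something like this is a quick way to watch the backlog (the field names are taken from our 14.x sdiag output, so check yours):

   # poll the controller's thread count and agent queue depth
   watch -n 10 'sdiag | egrep "Server thread count|Agent queue size"'
   # the per-message-type and per-user RPC sections further down in the
   # full sdiag output help show what kind of traffic is piling up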

Regardless, I would look at defer or max_rpc_cnt.

-Paul Edmon-

On 03/27/2015 07:32 AM, Mehdi Denou wrote:
Hi,

Using gdb you can identify which thread owns the locks on the slurmctld
internal structures (and blocks all the others).
Then it will be easier to understand what is happening.
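Something along these lines (a sketch; the thread and frame names will of course vary):

   # dump backtraces of every slurmctld thread, without staying attached
   gdb -p $(pidof slurmctld) -batch -ex 'thread apply all bt' > /tmp/slurmctld-bt.txt
   # threads parked in pthread_mutex_lock/futex_wait are waiters; the thread
   # that is actually doing work (e.g. deep in a slurmdbd or backfill call)
   # is the likely holder of the internal read/write locks

Note Stuart's observation below that merely attaching a debugger can clear the jam, so grab the backtraces in one batch run rather than poking around interactively.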

On 27/03/2015 12:24, Stuart Rankin wrote:
Hi,

I am running slurm 14.11.4 on an 800-node RHEL6.6 general-purpose University cluster. Since upgrading from 14.03.3 we have been seeing the following problem and I'd appreciate any advice (maybe it's a bug but maybe I'm missing something obvious).

Occasionally the number of slurmctld threads starts to rise rapidly until it hits the hard-coded limit of 256 and stays there. The threads are in the futex_ wait state according to ps, and logging stops (nothing out of the ordinary leaps out in the log before this happens). Naturally, slurm clients then start failing with timeout messages (which isn't trivial, since it is causing some not-very-resilient user pipelines to fail). This condition has persisted for several hours during the night without being detected. However, there is a simple workaround: send a STOP signal to the slurmctld process, wait a few seconds, then resume it - this clears the logjam. Merely attaching a debugger has the same effect!
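In concrete terms the workaround is roughly:

   # pause and resume the controller to clear the logjam
   kill -STOP $(pidof slurmctld)
   sleep 5
   kill -CONT $(pidof slurmctld)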

I feel this must be a clue as to the root cause. I have already tried setting CommitDelay in slurmdbd.conf, increasing MessageTimeout, setting HealthCheckNodeState=CYCLE and decreasing/increasing bf_yield_interval/bf_yield_sleep, without any apparent impact (please see slurm.conf attached).
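For context, these are the knobs in question; the values below are placeholders for illustration only, not the ones from the attached config:

   # slurmdbd.conf
   CommitDelay=1
   # slurm.conf
   MessageTimeout=30
   HealthCheckNodeState=CYCLE
   SchedulerParameters=bf_yield_interval=2000000,bf_yield_sleep=500000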

Any advice would be gratefully received.

Many thanks -

Stuart

