So we have had the same problem, usually caused by the scheduler receiving
tons of requests. It is usually fixed by having the scheduler slow
itself down using the defer or max_rpc_cnt options. We in particular
use max_rpc_cnt=16. I actually ran a test yesterday where I removed
this and let it go without defer, hoping that the newer version
of Slurm would be able to run without it. In our environment the thread
count saturated and Slurm started dropping requests all over the place.
This even caused some node_fail messages, because the nodes themselves
couldn't talk to the master while it was so busy.
It sounds like your problem is similar to ours: too much traffic and
demand hitting the master (we run with about 1000 cores and 800 users
hitting the scheduler). So I advise using defer or max_rpc_cnt.
max_rpc_cnt is particularly good because when things are less busy the
scheduler automatically drops out of the deferred state and starts
scheduling more quickly, so it's adaptive.
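For reference, both throttling variants are set through SchedulerParameters in slurm.conf; a minimal sketch of the two approaches discussed (16 is just the value we happen to use):

```
# slurm.conf - pick one of the two approaches:

# (a) Always defer scheduling at job submit time; jobs are only
#     started by the periodic scheduling passes:
SchedulerParameters=defer

# (b) Defer only while slurmctld is busy: if more than 16 RPCs are
#     pending, the scheduler backs off, and it automatically resumes
#     normal operation once the backlog drains:
SchedulerParameters=max_rpc_cnt=16
```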
However, an even better solution would be for Slurm to throttle itself
more intelligently (max_rpc_cnt is a somewhat simple way of doing this)
when it receives a ton of traffic from job completions, user queries,
or just general busyness. The threads tend to stack up when you have a
bunch of blocking requests. sdiag can help you track some of that back,
but some of it can't be prevented. So it would be good, at least from
the developer end, if the number of blocking requests could be reduced.
In particular, in our case it would help if user requests for
information were answered with an intelligent busy signal and an
estimated return time, as our users get a bit punchy when requests for
information don't return immediately.
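To see which RPCs are piling up, sdiag breaks the controller's traffic down by message type; a rough way to sample the current rate (assuming you run this on the controller host) is:

```shell
# Show cumulative RPC counts per message type, plus scheduler
# thread and agent statistics:
sdiag

# Reset the counters, wait a minute, and sample again to see which
# message types dominate right now:
sdiag --reset
sleep 60
sdiag
```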
Regardless, I would look at defer or max_rpc_cnt.
-Paul Edmon-
On 03/27/2015 07:32 AM, Mehdi Denou wrote:
Hi,
Using gdb you can find which thread owns the locks on the slurmctld
internal structures (and blocks all the others).
Then it will be easier to understand what is happening.
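A non-interactive way to do this - dump every slurmctld thread's stack so you can see who holds the locks - might look like the following sketch (the output path and PID lookup are assumptions about your setup):

```shell
# Attach gdb briefly to the running slurmctld, dump all thread
# backtraces, and detach.  Threads parked in pthread_mutex_lock /
# futex_wait are the waiters; the thread that is NOT waiting on a
# lock is usually the one holding it.
gdb -p "$(pidof slurmctld)" -batch -ex 'thread apply all bt' \
    > /tmp/slurmctld_backtraces.txt 2>&1
```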
On 27/03/2015 12:24, Stuart Rankin wrote:
Hi,
I am running Slurm 14.11.4 on an 800-node RHEL6.6 general-purpose
University cluster. Since upgrading from 14.03.3 we have been seeing
the following problem and I'd appreciate any advice (maybe it's a bug,
but maybe I'm missing something obvious).
Occasionally the number of slurmctld threads starts to rise rapidly
until it hits the hard-coded limit of 256 and stays there. The threads
are in the futex_ state according to ps, and logging stops (nothing out
of the ordinary leaps out in the log before this happens). Naturally,
Slurm clients then start failing with timeout messages (which isn't
trivial, since it causes some not-very-resilient user pipelines to
fail). This condition has persisted for several hours during the night
without being detected. However, there is a simple workaround: send a
STOP signal to the slurmctld process, wait a few seconds, then resume
it - this clears the logjam. Merely attaching a debugger has the same
effect!
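The workaround above can be scripted with SIGSTOP/SIGCONT; a minimal sketch, demonstrated here on a dummy sleep process (on a real system you would substitute slurmctld's PID, e.g. from pidof slurmctld):

```shell
#!/bin/sh
# Stand-in for the wedged slurmctld; use its real PID in practice.
sleep 60 &
pid=$!

kill -STOP "$pid"      # freeze the process (all threads stop at once)
ps -o stat= -p "$pid"  # state now starts with 'T' (stopped)
sleep 2                # "wait a few seconds"
kill -CONT "$pid"      # resume; on slurmctld this clears the thread logjam
sleep 1
ps -o stat= -p "$pid"  # state is back to 'S' (sleeping)

kill "$pid"            # clean up the dummy process
```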
I feel this must be a clue to the root cause. I have already tried
setting CommitDelay in slurmdbd.conf, increasing MessageTimeout,
setting HealthCheckNodeState=CYCLE, and decreasing/increasing
bf_yield_interval/bf_yield_sleep, without any apparent impact (please
see the attached slurm.conf).
Any advice would be gratefully received.
Many thanks -
Stuart