Thanks very much for these suggestions - I've set a value for max_rpc_cnt and we should see soon if this helps.

Cheers

Stuart

On 27/03/15 14:09, Paul Edmon wrote:
>
> So we have had the same problem, usually due to the scheduler receiving tons
> of requests. Usually this is fixed by having the scheduler slow itself down
> by using the defer or max_rpc_cnt options. We in particular use
> max_rpc_cnt=16. I actually did a test yesterday where I removed this and let
> it go without defer, as I was hoping that the newer version of slurm would be
> able to run without it. In our environment the thread count saturated and
> slurm started to drop requests all over the place. This even caused some
> node_fail messages, as nodes themselves couldn't talk to the master because
> it was so busy.
>
> It sounds like your problem is similar to ours in that we just have too much
> traffic and demand hitting the master (we run with about 1000 cores and 800
> users hitting the scheduler). So I advise using defer or max_rpc_cnt.
> max_rpc_cnt is good because when things are less busy the scheduler
> automatically switches out of defer state and starts to schedule things more
> quickly, so it's adaptive.
>
> However, an even better thing would be to have slurm auto-throttle itself
> more intelligently (which max_rpc_cnt is a somewhat simple way of doing)
> when it receives tons of traffic due to completion, user queries or just
> general busyness. The threads tend to stack up when you have a bunch of
> blocking requests. sdiag can help you track some of that back, but some
> can't be prevented. So it would be good, at least from the developer end,
> if the number of blocking requests could be reduced. In particular, in our
> case it would be good if user requests for information were responded to
> with an intelligent busy signal and an estimate of return time, as our users
> get a bit punchy if requests for information don't return immediately.
>
> Regardless, I would look at defer or max_rpc_cnt.
>
> -Paul Edmon-
>
> On 03/27/2015 07:32 AM, Mehdi Denou wrote:
>> Hi,
>>
>> Using gdb you can find out which thread owns the locks on the slurmctld
>> internal structures (and blocks all the others).
>> Then it will be easier to understand what is happening.
>>
>> On 27/03/2015 12:24, Stuart Rankin wrote:
>>> Hi,
>>>
>>> I am running slurm 14.11.4 on an 800-node RHEL6.6 general-purpose
>>> University cluster. Since upgrading from 14.03.3 we have been seeing the
>>> following problem and I'd appreciate any advice (maybe it's a bug but
>>> maybe I'm missing something obvious).
>>>
>>> Occasionally the number of slurmctld threads starts to rise rapidly until
>>> it hits the hard-coded 256 limit and stays there. The threads are in the
>>> futex_ state according to ps and logging stops (nothing out of the
>>> ordinary leaps out in the log before this happens). Naturally slurm
>>> clients then start failing with timeout messages (which isn't trivial,
>>> since it is causing some not very resilient user pipelines to fail). This
>>> condition has persisted for several hours during the night without being
>>> detected. However, there is a simple workaround, which is to send a STOP
>>> signal to the slurmctld process, wait a few seconds, then resume it - this
>>> clears the logjam. Merely attaching a debugger has the same effect!
>>>
>>> I feel this must be a clue as to the root cause. I have already tried
>>> setting CommitDelay in slurmdbd.conf, increasing MessageTimeout, setting
>>> HealthCheckNodeState=CYCLE and decreasing/increasing
>>> bf_yield_interval/bf_yield_sleep without any apparent impact (please see
>>> slurm.conf attached).
>>>
>>> Any advice would be gratefully received.
>>>
>>> Many thanks -
>>>
>>> Stuart
>>>
>>> --
>>> Dr. Stuart Rankin
>>> Senior System Administrator
>>> High Performance Computing Service
>>> University of Cambridge
>>> Email: [email protected]
>>> Tel: (+)44 1223 763517
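
The original report mentions that the saturated state went unnoticed for several hours overnight. Below is a minimal sketch of a check for it, assuming a Linux host where /proc/<pid>/status exposes a "Threads:" field. The function names, the 16-thread margin and the way the slurmctld PID is obtained are illustrative assumptions, not anything provided by slurm; the 256 figure is the limit quoted in the report.

```python
import re

# Thread ceiling quoted in the original report for slurmctld.
HARD_LIMIT = 256


def thread_count(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:\s+(\d+)$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Threads:' field found")
    return int(match.group(1))


def is_saturated(count, limit=HARD_LIMIT, margin=16):
    """True when the thread count is within `margin` of the limit.

    The margin is an arbitrary illustrative choice, not a slurm setting.
    """
    return count >= limit - margin


def check(pid):
    """Read /proc/<pid>/status (Linux only) and report saturation."""
    with open(f"/proc/{pid}/status") as fh:
        return is_saturated(thread_count(fh.read()))
```

Something like check(pid) could be run periodically with the slurmctld PID and used to raise an alert; what to do once it fires (for example the STOP/resume workaround described above) is left to the administrator.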
