Hi,

Using gdb you can find out which thread owns the locks on the slurmctld
internal structures (and is therefore blocking all the others).
That will make it easier to understand what is happening.
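
For example, here is a minimal sketch of how one might capture this. It
assumes gdb and pidof are installed and that you are allowed to attach to
the process. The helper name dump_slurmctld_threads and the output file
name are made up for illustration; they are not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): dump a backtrace of every
# thread of a running process into a file, using gdb in batch mode.
#   $1 = PID to attach to (e.g. the slurmctld PID)
#   $2 = output file (optional)
# Set GDB to use a different gdb binary.
dump_slurmctld_threads() {
    pid=$1
    out=${2:-slurmctld-threads.$pid.txt}
    # -p attaches to the PID, -batch exits after running the commands,
    # -ex runs one gdb command: a backtrace of all threads.
    "${GDB:-gdb}" -p "$pid" -batch -ex 'thread apply all bt' > "$out" 2>&1
    # Print the name of the file that now holds the backtraces.
    printf '%s\n' "$out"
}

# Typical use while the daemon is stuck (run as root or the slurmctld user):
#   dump_slurmctld_threads "$(pidof slurmctld)"
```

In the resulting file, most threads will probably be waiting inside pthread
lock or condition-variable calls; the one that is not waiting there is the
likely lock holder. Since attaching a debugger seems to clear the logjam on
your side, batch mode has the advantage of taking the backtraces at the
moment of attach, before the process is resumed.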

On 27/03/2015 12:24, Stuart Rankin wrote:
> Hi,
>
> I am running slurm 14.11.4 on an 800-node RHEL6.6 general-purpose
> University cluster. Since upgrading from 14.03.3 we have been seeing the
> following problem and I'd appreciate any advice (maybe it's a bug but
> maybe I'm missing something obvious).
>
> Occasionally the number of slurmctld threads starts to rise rapidly until
> it hits the hard coded 256 limit and stays there. The threads are in the
> futex_ state according to ps and logging stops (nothing out of the
> ordinary leaps out in the log before this happens). Naturally slurm
> clients then start failing with timeout messages (which isn't trivial
> since it is causing some not very resilient user pipelines to fail). This
> condition has persisted for several hours during the night without being
> detected. However there is a simple workaround, which is to send a STOP
> signal to slurmctld process, wait a few seconds, then resume it - this
> clears the logjam. Merely attaching a debugger has the same effect!
>
> I feel this must be a clue as to the root cause. I have already tried
> setting CommitDelay in slurmdbd.conf, increasing MessageTimeout, setting
> HealthCheckNodeState=CYCLE and decreasing/increasing
> bf_yield_interval/bf_yield_sleep without any apparent impact (please see
> slurm.conf attached).
>
> Any advice would be gratefully received.
>
> Many thanks -
>
> Stuart
>
>

-- 
---
Mehdi Denou
International HPC support
+336 45 57 66 56
