On Wed, Jun 29, 2016 at 8:47 PM, Jonathon Nelson <jdnel...@dyn.com> wrote:

> I have a slurm cluster on SL 6.7 using slurm 15.08.11 (built in-house but
> with only very minor changes to the spec file).  Every node works fine
> except for one, which also happens to run slurmctld.  This node invariably
> ends up with a hung slurmd (which complains of
> MAX_THREADS==current_threads).
>
> An strace of the hung slurmd:
>
> <lots of pids waiting for futex(0x7e8fc0, ...) >
> [pid 16460] futex(0x7e8fc0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
> ...
> [pid 13957] connect(12, {sa_family=AF_LOCAL, sun_path="/var/run/slurmd/NODENAME_5388890.0"}, 34 <unfinished ...>
> [pid 13808] connect(7, {sa_family=AF_LOCAL, sun_path="/var/run/slurmd/NODENAME_5388890.0"}, 34 <unfinished ...>
> [pid 13793] futex(0x7e8d64, FUTEX_WAIT_PRIVATE, 231, NULL
>
>
> /var/run/slurmd has a "cred_state" file that is 37 *megabytes* in size, a
> similar file /var/run/slurmd/cred_state.old, and a socket
> NODENAME_5388890.0.
>
> However,
> netstat -planetu | grep /var/run/slurmd
> shows nothing listening.
>
> This is the only node that seems to suffer this, and I'm not able to
> identify how it might differ from the others. I have to kill -9 slurmd;
> once restarted, it runs fine for a while, but not for very long: often
> under 20 minutes.
>
> squeue | grep NODENAME returns nothing.
>
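
For anyone skimming the strace above: those threads are blocked in connect() on the per-step AF_LOCAL socket (NODENAME_5388890.0). The sketch below shows roughly what such a call looks like; it is my own illustration, not slurmd source, the helper name step_connect is made up, and only the socket path comes from the strace. On Linux, a blocking connect() to a unix socket fails immediately with ECONNREFUSED if nothing is listening, but can block indefinitely if a listener exists and never accept()s.

/* Illustration only (not slurmd source): what a thread blocked in
 * connect() on the per-step socket is doing.  Only the socket path
 * comes from the strace above; everything else is assumed. */
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int step_connect(const char *path)
{
    struct sockaddr_un addr;
    int fd = socket(AF_LOCAL, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_LOCAL;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    /* Blocking connect(): returns ECONNREFUSED at once if nothing is
     * listening on the path, but blocks if a listener exists and its
     * backlog is full.  The <unfinished ...> connect() calls in the
     * strace are sitting at this point. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

Something like step_connect("/var/run/slurmd/NODENAME_5388890.0") is what those stuck threads appear to be waiting on.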

Perhaps this gdb backtrace is also useful:

#0  0x00007f7de2f3a68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000429cef in _increment_thd_count () at slurmd.c:497
#2  0x000000000042a00b in _handle_connection (fd=259, cli=0x7c10c80) at slurmd.c:555
#3  0x0000000000429b63 in _msg_engine () at slurmd.c:459
#4  0x0000000000429859 in main (argc=1, argv=0x7ffe27f6f8c8) at slurmd.c:370
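
For reference, frame #1 is where slurmd gates new connection handlers on a thread-count limit. The sketch below is my own approximation of that pattern based on the backtrace, not a copy of slurmd.c; the names active_mutex, active_cond and active_threads, and the MAX_THREADS value, are assumed. If every handler slot is held by a thread that never finishes (for example one stuck in connect() as in the strace), this wait never returns and slurmd stops servicing new connections, which would look exactly like the hang described above.

/* Approximation of the thread-count gate in frame #1.  Names and the
 * MAX_THREADS value are assumed; this is not a copy of slurmd.c. */
#include <pthread.h>

#define MAX_THREADS 256                 /* assumed limit */

static pthread_mutex_t active_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  active_cond  = PTHREAD_COND_INITIALIZER;
static int active_threads = 0;

/* Called from _handle_connection (frame #2), which _msg_engine
 * (frame #3) invokes for each accepted connection before handing it
 * to a handler thread.  Waits until a handler slot is free. */
static void _increment_thd_count(void)
{
    pthread_mutex_lock(&active_mutex);
    while (active_threads >= MAX_THREADS)
        pthread_cond_wait(&active_cond, &active_mutex);
    active_threads++;
    pthread_mutex_unlock(&active_mutex);
}

/* Called when a handler thread finishes, freeing a slot and waking
 * the message engine if it is waiting in _increment_thd_count(). */
static void _decrement_thd_count(void)
{
    pthread_mutex_lock(&active_mutex);
    active_threads--;
    pthread_cond_signal(&active_cond);
    pthread_mutex_unlock(&active_mutex);
}

If that is what is happening here, the fd=259 in frame #2 would be consistent with a few hundred connections piled up behind wedged handlers, but that is only a guess on my part.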
