On Wed, Jun 29, 2016 at 8:47 PM, Jonathon Nelson <jdnel...@dyn.com> wrote:
> I have a slurm cluster on SL 6.7 using slurm 15.08.11 (built in-house but
> with only very minor changes to the spec file). Every node works fine
> except for one, which also happens to run slurmctld. This node invariably
> ends up with a hung slurmd (which complains of
> MAX_THREADS==current_threads).
>
> an strace of the hung slurmd:
>
> <lots of pids waiting for futex(0x7e8fc0, ...)>
>
> [pid 16460] futex(0x7e8fc0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
> ...
> [pid 13957] connect(12, {sa_family=AF_LOCAL,
> sun_path="/var/run/slurmd/NODENAME_5388890.0"}, 34 <unfinished ...>
> [pid 13808] connect(7, {sa_family=AF_LOCAL,
> sun_path="/var/run/slurmd/NODENAME_5388890.0"}, 34 <unfinished ...>
> [pid 13793] futex(0x7e8d64, FUTEX_WAIT_PRIVATE, 231, NULL
>
> /var/run/slurmd has a "cred_state" file that is 37 *megabytes* in size, a
> similar file /var/run/slurmd/cred_state.old, and a socket
> NODENAME_5388890.0.
>
> However,
> netstat -planetu | grep /var/run/slurmd
> shows nothing listening.
>
> This is the only node that seems to suffer so, and I'm not able to
> identify how this node might differ from others. I have to kill -9 slurmd
> and then - once restarted - it will run fine for a while, but not for very
> long: often under 20 minutes.
>
> squeue | grep NODENAME returns nothing.

Perhaps this gdb backtrace might be useful:

#0  0x00007f7de2f3a68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000429cef in _increment_thd_count () at slurmd.c:497
#2  0x000000000042a00b in _handle_connection (fd=259, cli=0x7c10c80) at slurmd.c:555
#3  0x0000000000429b63 in _msg_engine () at slurmd.c:459
#4  0x0000000000429859 in main (argc=1, argv=0x7ffe27f6f8c8) at slurmd.c:370
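For anyone following along, frames #1-#3 are slurmd's message engine: _handle_connection() calls _increment_thd_count(), which sits in pthread_cond_wait() until the number of active service threads drops back under MAX_THREADS. The strace above suggests the service threads themselves are stuck in connect() on the job-step socket, so the count never drops and the accept loop blocks for good. Below is a rough sketch of that gating pattern as I read it from the backtrace; it is my own reconstruction, not the actual slurmd source, and the MAX_THREADS value, variable names and timings are made up for illustration.

/*
 * Sketch of the thread-accounting gate implied by the backtrace:
 * the accept loop calls increment_thd_count() before dispatching a
 * connection, and blocks on a condition variable once MAX_THREADS
 * service threads are outstanding.  If those threads never finish
 * (e.g. stuck in connect() to a dead local socket), the slot count
 * never drops and the accept loop hangs forever.
 * Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_THREADS 4   /* slurmd's real limit is larger; small here for demo */

static pthread_mutex_t active_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  active_cond  = PTHREAD_COND_INITIALIZER;
static int active_threads = 0;

/* Frame #1 of the backtrace: wait until a slot frees up, then take it. */
static void increment_thd_count(void)
{
    pthread_mutex_lock(&active_mutex);
    while (active_threads >= MAX_THREADS)
        pthread_cond_wait(&active_cond, &active_mutex);
    active_threads++;
    pthread_mutex_unlock(&active_mutex);
}

/* Run by each service thread when it finishes, waking the accept loop. */
static void decrement_thd_count(void)
{
    pthread_mutex_lock(&active_mutex);
    active_threads--;
    pthread_cond_signal(&active_cond);
    pthread_mutex_unlock(&active_mutex);
}

static void *handle_connection(void *arg)
{
    (void)arg;
    sleep(1);   /* stand-in for servicing the RPC; if this were a connect()
                   that never returns, the slot is never released */
    decrement_thd_count();
    return NULL;
}

int main(void)
{
    /* Stand-in for _msg_engine(): accept "connections" forever. */
    for (int i = 0; ; i++) {
        pthread_t tid;
        increment_thd_count();   /* hangs here once every slot has leaked */
        pthread_create(&tid, NULL, handle_connection, NULL);
        pthread_detach(tid);
        printf("dispatched connection %d\n", i);
    }
    return 0;
}

If that reading is right, the real question is why the threads talking to /var/run/slurmd/NODENAME_5388890.0 never return, given that netstat shows nothing listening on that socket any more.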