I have a Slurm cluster on Scientific Linux 6.7 running Slurm 15.08.11 (built in-house, but with only very minor changes to the spec file). Every node works fine except one, which also happens to run slurmctld. That node invariably ends up with a hung slurmd, whose log complains that the current thread count has hit MAX_THREADS.
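I can gather more data if that would help. For example, a crude way to watch the thread count climb toward the ceiling would be something like this (a sketch only; the one-second interval is arbitrary, and nlwp is just the kernel's thread count for the process):

    # Sample slurmd's thread count once a second; slurmd starts
    # refusing work once this reaches its compiled-in MAX_THREADS.
    while true; do
        ps -o nlwp= -p "$(pidof slurmd)"
        sleep 1
    done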
An strace of the hung slurmd shows lots of PIDs waiting on the same futex, plus a couple of connect() calls stuck on a local socket:

    [pid 16460] futex(0x7e8fc0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    ...
    [pid 13957] connect(12, {sa_family=AF_LOCAL, sun_path="/var/run/slurmd/NODENAME_5388890.0"}, 34 <unfinished ...>
    [pid 13808] connect(7, {sa_family=AF_LOCAL, sun_path="/var/run/slurmd/NODENAME_5388890.0"}, 34 <unfinished ...>
    [pid 13793] futex(0x7e8d64, FUTEX_WAIT_PRIVATE, 231, NULL

/var/run/slurmd contains a "cred_state" file that is 37 *megabytes* in size, a similar file /var/run/slurmd/cred_state.old, and a socket named NODENAME_5388890.0. However, netstat -planetu | grep /var/run/slurmd shows nothing listening. This is the only node that suffers this way, and I have not been able to identify how it differs from the others. I have to kill -9 slurmd; once restarted, it runs fine for a while, but not for long: often under 20 minutes. Meanwhile, squeue | grep NODENAME returns nothing, so no jobs appear to be running there.
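One thing I am not sure about: netstat's -t/-u flags restrict the output to TCP and UDP, so a Unix-domain socket like the one above would never appear in that listing regardless of its state. A check that does cover Unix sockets would be something like the following (a sketch; either command should do, and the grep pattern is just the path from above):

    # List Unix-domain sockets with their owning processes (net-tools)...
    netstat -planx | grep /var/run/slurmd
    # ...or the same via ss
    ss -xlp | grep /var/run/slurmd

Any thoughts?

--
Jon Nelson
Dyn / Principal Software Engineer
p. +1 (603) 263-8029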