I have a Slurm cluster on SL 6.7 running Slurm 15.08.11 (built in-house,
with only very minor changes to the spec file).  Every node works fine
except for one, which also happens to run slurmctld.  That node invariably
ends up with a hung slurmd, which complains that
MAX_THREADS == current_threads.
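
(If per-thread backtraces would help, I can grab them with something like

gdb -p $(pidof slurmd) -batch -ex 'thread apply all bt' > slurmd-bt.txt

-- locating the process via pidof is an assumption on my part; happy to
attach the output.)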

An strace of the hung slurmd:

<lots of pids waiting for futex(0x7e8fc0, ...) >
[pid 16460] futex(0x7e8fc0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
...
[pid 13957] connect(12, {sa_family=AF_LOCAL,
sun_path="/var/run/slurmd/NODENAME_5388890.0"}, 34 <unfinished ...>
[pid 13808] connect(7, {sa_family=AF_LOCAL,
sun_path="/var/run/slurmd/NODENAME_5388890.0"}, 34 <unfinished ...>
[pid 13793] futex(0x7e8d64, FUTEX_WAIT_PRIVATE, 231, NULL


/var/run/slurmd has a "cred_state" file that is 37 *megabytes* in size, a
similar file /var/run/slurmd/cred_state.old, and a socket
NODENAME_5388890.0.
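
Sizes above are from a plain listing, which can be compared against any
healthy node (OTHERNODE is a placeholder):

ls -lh /var/run/slurmd/
ssh OTHERNODE ls -lh /var/run/slurmd/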

However,
netstat -xlpn | grep /var/run/slurmd
shows nothing listening on that socket. (Note the -x rather than -t/-u:
these are unix-domain sockets, so they never appear in a TCP/UDP listing.)
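
As a cross-check for unix sockets, I can also run

lsof -U | grep /var/run/slurmd

and post that output if it would help.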

This is the only node that suffers this way, and I haven't been able to
identify how it differs from the others. I have to kill -9 slurmd; once
restarted, it runs fine for a while, but not for long: often under 20
minutes.
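
For the record, the recovery sequence is just the following (init-script
name assumed from the stock spec file):

kill -9 $(pidof slurmd)
service slurm start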

squeue | grep NODENAME returns nothing.
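
(I can also pull the controller's view of the node with

scontrol show node NODENAME

if the State/Reason fields there would help.)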

Any thoughts?


-- 
Jon Nelson
Dyn / Principal Software Engineer
p. +1 (603) 263-8029
