Every now and again, I find a node has been kicked (set to state down with
an asterisk) and the node has a 'slurmd' that is unresponsive and one or
more 'slurmstepd' processes that have been running for a *long* time.

Usually, I see messages like this:

slurmstepd[25221]: error: Unable to establish controller machine

Which is weird, because the controller (and its backup) are both listed by
name, with the 'Addr' config variable set.

gdb hasn't been useful, but strace has.
I see this:

[pid 24248] strlen("(null)/cgroup.procs" <unfinished ...>

and can't help but think it's related.
I tracked that down to the right source file, which makes multiple unsafe
uses of snprintf (I have a patch in progress for this).
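
To illustrate the kind of failure the strace line suggests, here is a minimal
sketch of a checked path build. It is not the actual Slurm source or the patch
mentioned above: the function name build_procs_path and the example directory
are hypothetical. The assumption is that a NULL cgroup directory reached a "%s"
conversion, which glibc renders as "(null)", and that the return value of
snprintf was not checked.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper (not Slurm's code): write "<cgroup_dir>/cgroup.procs"
 * into buf. Returns 0 on success, -1 if the directory is missing or the
 * result would not fit in buf.
 */
static int build_procs_path(char *buf, size_t len, const char *cgroup_dir)
{
    int n;

    if (buf == NULL || len == 0)
        return -1;
    buf[0] = '\0';

    /* Refuse a NULL or empty directory instead of formatting "(null)". */
    if (cgroup_dir == NULL || cgroup_dir[0] == '\0')
        return -1;

    n = snprintf(buf, len, "%s/cgroup.procs", cgroup_dir);

    /* n < 0 is an output error; n >= len means the path was truncated. */
    if (n < 0 || (size_t)n >= len) {
        buf[0] = '\0';
        return -1;
    }
    return 0;
}
```

With a check like this, a caller sees an error it can log and act on, rather
than going on to open a path that begins with "(null)".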

If I kill *any* of the misbehaving slurmstepd processes, slurmd picks right
up, registers itself, and processes jobs.

Has anybody encountered anything like this or have any ideas?

I should note that this is Slurm 14.11.8, built for Scientific Linux 6.6.


-- 
Jon Nelson
Dyn / Senior Software Engineer
p. +1 (603) 263-8029
