I've noticed the following behavior several times, immediately following a
yum (rpm) upgrade of slurm.  NOTE: I'm not changing *versions* with these
upgrades; they are updated builds of the same version of slurm.

Nodes with active jobs go off the rails and never come back. slurmd must be
killed with kill -9. Upon restart of slurmd, things return to normal.

The logs show the following pattern:

slurmd[$PID]: launch task $TASKID.0 request from $UID.$GID@$HOST (port $PORT)
slurmstepd[$PID]: done with job
....

I'm using srun (as root!) to actually perform the upgrade. Quite literally:

srun --no-allocate -w $SOME_NODES -- yum upgrade -y

I *think* I see this job here:

slurmd[$PID]: launch task $CRAZY_HIGH_NUMBER.$SOME_OTHER_LARGE_NUMBER request from 0.0@$HOST (port $PORT)

However, it's followed by:
slurmd[$PID]: task rank unavailable due to invalid job credential, step completion RPC impossible

and then several more iterations.
Eventually, I see this:

slurmd[$PID]: active_threads == MAX_THREADS(256)
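
A minimal sketch of a check for the pattern above, assuming the slurmd
messages end up in a plain-text log file with one message per line. The
function name scan_slurmd_log and the log path argument are hypothetical,
not part of slurm:

```shell
#!/bin/sh
# Hypothetical helper: summarize the failure pattern in a slurmd log file.
# Assumption: the log is plain text, one message per line.
scan_slurmd_log() {
    log_file="$1"
    # Count lines reporting a credential failure. grep -c prints 0 (and
    # exits non-zero) when nothing matches, hence the "|| true".
    cred_errors=$(grep -c 'invalid job credential' "$log_file" || true)
    # -F matches the string literally; -q only sets the exit status.
    if grep -qF 'active_threads == MAX_THREADS' "$log_file"; then
        saturated=yes
    else
        saturated=no
    fi
    echo "credential_errors=$cred_errors threads_saturated=$saturated"
}
```

Calling scan_slurmd_log with the path to a node's slurmd log prints a
one-line summary, which could be used to spot affected nodes before the
controller marks them down.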


Following this, the slurmd spins (strace shows it doing lots of stuff), but
the controller and slurmd are no longer communicating and the node is
placed in "*down" state. Normal attempts to kill slurmd fail, and one must
resort to using kill -9.

What might be going on here?





-- 
Jon Nelson
Dyn / Senior Software Engineer
p. +1 (603) 263-8029
