I've noticed the following behavior several times, immediately following a yum (rpm) upgrade of slurm. NOTE: I'm not changing *versions* with these upgrades; they are updated builds of the same version of slurm.

Nodes with active jobs go off the rails and never come back. slurmd must be killed with -9. Upon restart of slurmd, things return to normal.

The logs show the following pattern:

    slurmd[$PID]: launch task $TASKID.0 request from $UID.$GID@$HOST (port $PORT)
    slurmstepd[$PID]: done with job ....

I'm using srun (as root!) to actually perform the upgrade. Quite literally:

    srun --no-allocate -w $SOME_NODES -- yum upgrade -y

I *think* I see this job here:

    slurmd[$PID]: launch task $CRAZY_HIGH_NUMBER.$SOME_OTHER_LARGE_NUMBER request from 0.0@$HOST (port $PORT)

However, it's followed by:

    slurmd[$PID]: task rank unavailable due to invalid job credential, step completion RPC impossible

and then several more iterations. Eventually, I see this:

    slurmd[$PID]: active_threads == MAX_THREADS(256)

Following this, the slurmd spins (strace shows it doing lots of stuff), but the controller and slurmd are no longer communicating and the node is placed in "*down" state. Normal attempts to kill slurmd fail, and one must resort to using kill -9.

What might be going on here?

--
Jon Nelson
Dyn / Senior Software Engineer
p. +1 (603) 263-8029
