Wow! Using Slurm to update the software on the cluster? And I'll
guess that you frequently ski Tuckerman's Ravine? :-)
First, there is the possibility that Slurm is entirely innocent
here, and that some other package's update procedure is wiping out
things like context files (especially if they are in /tmp, /var/tmp,
/var/run) -- they shouldn't be doing that, but who knows.
I think you'll find that simply killing and restarting slurmd while
a job is running on a node will usually allow the job to survive,
but I don't think I would count on that.
If you want to dig a bit further into this, you'll need to provide:
  * Operating system and version
  * Slurm version
  * Who is responsible for building your Slurm RPMs
  * A sanitized copy of your slurm.conf file (with node names and
    IP addresses appropriately obscured)
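If it helps, here is a rough sketch of gathering those items on one
node. It assumes an RPM package named "slurm", a config file at
/etc/slurm/slurm.conf, and slurmd on the PATH. The function names
sanitize_conf and collect_info are made up for illustration, and the
list of keys it redacts is certainly not exhaustive, so read the
output before posting it.

```shell
#!/bin/sh
# Sketch only: the paths, package name, and key list below are assumptions.

# Read slurm.conf text on stdin; obscure IPv4 addresses and the values of
# a few host-identifying keys. Matching is case-sensitive, although
# slurm.conf keys are not, so differently-cased keys would slip through.
sanitize_conf() {
    sed -E \
        -e 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/X.X.X.X/g' \
        -e 's/(^|[[:space:]])(NodeName|NodeAddr|NodeHostname|ControlMachine|ControlAddr|Nodes)=[^[:space:]]+/\1\2=REDACTED/g'
}

# Print OS, Slurm version, package origin, and a sanitized slurm.conf.
# Optional argument: path to slurm.conf (the default is an assumed location).
collect_info() {
    conf=${1:-/etc/slurm/slurm.conf}
    echo "== OS =="
    cat /etc/os-release 2>/dev/null || uname -a
    echo "== Slurm version =="
    slurmd -V
    echo "== Package origin =="
    rpm -qi slurm | grep -E '^(Version|Release|Vendor|Packager|Build Host)'
    echo "== slurm.conf (sanitized) =="
    sanitize_conf < "$conf"
}
```

Sourcing the file and running "collect_info > slurm-report.txt" would
produce something you could attach to a reply.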
Speaking as a "friend of Slurm," and not as the official voice of
anything, I suspect that you won't find much interest here in
tracking down what might be happening if the problem is a side
effect of the way that Slurm has been packaged. Let me suggest,
instead, that you look into using the excellent pdsh package at
https://code.google.com/p/pdsh/ for your system upgrades, and let
Slurm keep doing what it does best.
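
To make that concrete, here is a minimal sketch of what a pdsh
equivalent of the srun command might look like. The node list and
fanout are placeholders, upgrade_nodes is a hypothetical helper, and
it only prints the command unless RUN=1 is set. It assumes pdsh and
dshbak are installed on the admin host and that ssh (or whichever
rcmd module you use) works to the nodes.

```shell
#!/bin/sh
# Sketch only: node names and fanout are placeholders, not from the
# original setup.

# upgrade_nodes NODELIST [FANOUT]
# Dry run by default: print the pdsh command. With RUN=1, execute it and
# pipe the output through dshbak -c to coalesce identical per-node output.
upgrade_nodes() {
    nodes=$1
    fanout=${2:-8}   # pdsh -f: number of nodes contacted in parallel
    if [ "${RUN:-0}" = 1 ]; then
        pdsh -f "$fanout" -w "$nodes" 'yum upgrade -y' 2>&1 | dshbak -c
    else
        echo "pdsh -f $fanout -w $nodes 'yum upgrade -y' | dshbak -c"
    fi
}
```

For example, "upgrade_nodes 'node[01-04]' 4" prints the command for
four nodes, and "RUN=1 upgrade_nodes 'node[01-04]' 4" runs it. Because
pdsh reaches the nodes without going through slurmd, a package update
that restarts slurmd no longer pulls the rug out from under the
command doing the upgrade.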
Best regards
Andy
On 04/08/2015 10:29 AM, Jonathon Nelson wrote:
very strange behavior with slurmd post upgrade
I've noticed the following behavior several
times, immediately following a yum (rpm)
upgrade of slurm. NOTE: I'm not changing
*versions* with these upgrades, these are
updated builds of the same version of slurm.
Nodes with active jobs go off the rails and
never come back. slurmd must be killed with -9.
Upon restart of slurmd, things return to normal.
The logs show the following pattern:
slurmd[$PID]: launch task $TASKID.0 request from $UID.$GID@$HOST (port $PORT)
slurmstepd[$PID]: done with job
....
I'm using srun (as root!) to actually perform the
upgrade. Quite literally:
srun --no-allocate -w $SOME_NODES -- yum upgrade -y
I *think* I see this job here:
slurmd[$PID]: launch task $CRAZY_HIGH_NUMBER.$SOME_OTHER_LARGE_NUMBER request from 0.0@$HOST (port $PORT)
However, it's followed by:
slurmd[$PID]: task rank unavailable due to invalid job credential, step completion RPC impossible
and then several more iterations.
Eventually, I see this:
slurmd[$PID]: active_threads == MAX_THREADS(256)
Following this, the slurmd spins (strace shows it doing lots
of stuff), but the controller and slurmd are no longer
communicating and the node is placed in "*down" state. Normal
attempts to kill slurmd fail, and one must resort to using
kill -9.
What might be going on here?
--
Jon Nelson
Dyn / Senior Software Engineer
p. +1 (603) 263-8029