Wow! Using Slurm to update the software on the cluster? And I'll
 guess that you frequently ski Tuckerman's Ravine? :-)
 
 First, there is the possibility that Slurm is entirely innocent
 here, and that some other package's update procedure is wiping out
 things like context files (especially if they are in /tmp, /var/tmp,
 /var/run) -- they shouldn't be doing that, but who knows.
 
 I think you'll find that simply killing and restarting slurmd while
 a job is running on a node will usually allow the job to survive,
 but I don't think I would count on that.
 
 If you want to dig a bit further into this, you'll need to provide:
 
   * Operating system and version
   * Slurm version
   * Who is responsible for building your Slurm RPMs
   * A sanitized copy of your slurm.conf file (with node names and
     IP addresses appropriately obscured)

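 For what it's worth, here is a rough sketch of how that information might be gathered. It assumes an RPM-based system, a package named "slurm", and node names that share a simple prefix such as "node"; the function names collect_info and sanitize_conf are made up for illustration, and the sed patterns will almost certainly need adjusting for your site.

```shell
# Print OS release, Slurm version, and RPM build details.
# Assumes an RPM-based system and a package named "slurm".
collect_info() {
    cat /etc/os-release 2>/dev/null || cat /etc/redhat-release
    slurmd -V
    rpm -qi slurm
}

# Read a slurm.conf on stdin and obscure IPv4 addresses and node names.
# $1 is a hypothetical node-name prefix (defaults to "node").
sanitize_conf() {
    sed -E \
        -e 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/X.X.X.X/g' \
        -e "s/${1:-node}[0-9]+/hostNN/g"
}
```

 Something like `collect_info` followed by `sanitize_conf node < /etc/slurm/slurm.conf` (the path is a guess; use wherever your slurm.conf lives) would produce the output to post. Do read the result before sending it, since a simple pattern like this will miss bracketed ranges and unusual host names.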
 Speaking as a "friend of Slurm," and not as the official voice of
 anything, I suspect that you won't find much interest here in
 tracking down what might be happening if the problem is a side
 effect of the way that Slurm has been packaged. Let me suggest,
 instead, that you look into using the excellent pdsh package at
 https://code.google.com/p/pdsh/ for your system upgrades, and let
 Slurm keep doing what it does best.
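
 As a minimal sketch of what that might look like, assuming pdsh is installed on the admin host and can reach the nodes as root: the node list "node[01-04]" and the DRY_RUN variable below are placeholders of mine, not anything from your setup. With DRY_RUN left at its default the script only prints the command it would run.

```shell
# Placeholder hostlist; replace with your own nodes.
NODES=${NODES:-"node[01-04]"}
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Same upgrade as the srun invocation, but outside of Slurm.
run pdsh -w "$NODES" yum upgrade -y
```

 Setting DRY_RUN=0 performs the upgrade for real. The point is that slurmd is then no longer the parent of the process that replaces slurmd.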
 
 Best regards
   Andy
 
 On 04/08/2015 10:29 AM, Jonathon Nelson wrote:

    very strange behavior with slurmd post upgrade

    I've noticed the following behavior several times, immediately
    following a yum (rpm) upgrade of slurm.  NOTE: I'm not changing
    *versions* with these upgrades, these are updated builds of the
    same version of slurm.

    Nodes with active jobs go off the rails and never come back.
    slurmd must be killed with -9. Upon restart of slurmd, things
    return to normal.

    The logs show the following pattern:
        slurmd[$PID]: launch task $TASKID.0 request from $UID.$GID@$HOST (port $PORT)

        slurmstepd[$PID]: done with job
        ....
    I'm using srun (as root!) to actually perform the upgrade. Quite
    literally:

        srun --no-allocate -w $SOME_NODES -- yum upgrade -y

    I *think* I see this job here:

        slurmd[$PID]: launch task $CRAZY_HIGH_NUMBER.$SOME_OTHER_LARGE_NUMBER request from 0.0@$HOST (port $PORT)
    However, it's followed by:

        slurmd[$PID]: task rank unavailable due to invalid job credential, step completion RPC impossible

    and then several more iterations. Eventually, I see this:

        slurmd[$PID]: active_threads == MAX_THREADS(256)

    Following this, the slurmd spins (strace shows it doing lots of
    stuff), but the controller and slurmd are no longer communicating
    and the node is placed in "*down" state. Normal attempts to kill
    slurmd fail, and one must resort to using kill -9.

    What might be going on here?
    --
    Jon Nelson
    Dyn / Senior Software Engineer
    p. +1 (603) 263-8029
