I upgraded SLURM from 15.08 to 16.05 without draining the nodes and without losing any jobs. This was my procedure:
I increased the timeouts in slurm.conf:

  SlurmctldTimeout=3600
  SlurmdTimeout=3600

I did a mysqldump of the slurm database and copied the slurm state
directory (just in case), and I increased innodb_buffer_pool_size in
my.cnf to 128M. Then I followed the instructions on the Slurm upgrade
page:

1. Shut down the slurmdbd daemon
2. Upgrade the slurmdbd daemon
3. Restart the slurmdbd daemon
4. Shut down the slurmctld daemon(s)
5. Shut down the slurmd daemons on the compute nodes
6. Upgrade the slurmctld and slurmd daemons
7. Restart the slurmd daemons on the compute nodes
8. Restart the slurmctld daemon(s)

For me it worked.

Cheers,
Barbara

On 08/17/2016 03:38 PM, Ole Holm Nielsen wrote:
>
>
> On 08/03/2016 03:04 AM, Christopher Samuel wrote:
>> So you always go in the order of upgrading:
>>
>> * slurmdbd
>> * slurmctld
>>   [recompile all plugins, MPI stacks, etc. that link against Slurm]
>> * slurmd
>>
>> We use a health check script that defines the version of Slurm that
>> is considered production, so we can just bump that number first, wait
>> for all the compute nodes to be marked as drained, and then, as nodes
>> become idle, we can start restarting slurmd, knowing that we will
>> never get a job that spans both old and new slurmd's.
>
> Obviously, upgrading slurmd's which are running jobs is quite tricky!
> I have some questions:
>
> 1. Can't you replace the health check with a global scontrol like this?
>    scontrol update NodeName=<nodelist> State=drain Reason="Upgrading slurmd"
>
> 2. Do you really have to wait for *all* nodes to become drained before
>    starting to upgrade? This could take weeks!
>
> 3. Is it OK to upgrade subsets of nodes after they become drained?
>
> 4. I assume that upgraded nodes can be returned to the IDLE state by:
>    scontrol update NodeName=<nodelist> State=resume
>
> Could you possibly elaborate on the steps which you described?
>
> FYI, I'm trying to capture this advice in my Wiki:
> https://wiki.fysik.dtu.dk/niflheim/SLURM#upgrading-on-centos-7
>
> Thanks,
> Ole
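For the archive, the procedure at the top of this thread could be sketched as a shell script. This is an editor's sketch, not Barbara's exact commands: the systemd unit names, the yum-based upgrade method, the pdsh fan-out, the database name slurm_acct_db, and the backup/state paths are all assumptions that vary by site. It defaults to a dry run that only prints each command.

```shell
#!/bin/sh
# Dry run by default: every command is prefixed with "echo" so it is
# printed rather than executed. Set RUN= (empty) to really run the steps.
RUN=${RUN:-echo}

upgrade_slurm() {
    # Raise the timeouts first (edit slurm.conf by hand before starting):
    #   SlurmctldTimeout=3600
    #   SlurmdTimeout=3600

    # Back up the accounting DB and the slurmctld state directory
    # (database name and paths are assumptions)
    $RUN sh -c 'mysqldump slurm_acct_db > /root/slurm_acct_db.sql'
    $RUN cp -a /var/spool/slurmctld /root/slurmctld-state.bak

    # Steps 1-3: slurmdbd first -- stop, upgrade, restart
    $RUN systemctl stop slurmdbd
    $RUN yum -y update slurm-slurmdbd      # or your site's upgrade method
    $RUN systemctl start slurmdbd

    # Steps 4-5: stop the controller, then slurmd on all compute nodes
    $RUN systemctl stop slurmctld
    $RUN pdsh -a systemctl stop slurmd     # pdsh assumed for node fan-out

    # Step 6: upgrade the slurmctld/slurmd packages everywhere
    $RUN yum -y update 'slurm*'
    $RUN pdsh -a yum -y update 'slurm*'

    # Steps 7-8: restart slurmd on the nodes, then the controller
    $RUN pdsh -a systemctl start slurmd
    $RUN systemctl start slurmctld
}

upgrade_slurm
```

Running it as-is prints the command sequence so you can review it against your own site's layout before executing anything.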