Hi Ole,

Ole Holm Nielsen <[email protected]> writes:

> On 06/20/2017 04:32 PM, Loris Bennett wrote:
>> We do our upgrades while full production is up and running.  We just stop
>> the Slurm daemons, dump the database and copy the statesave directory
>> just in case.  We then do the update, and finally restart the Slurm
>> daemons.  We only lost jobs once during an upgrade back around 2.2.6 or
>> so, but that was due to a rather brittle configuration provided by our
>> vendor (the statesave path contained the Slurm version), rather than
>> Slurm itself and was before we had acquired any Slurm expertise
>> ourselves.
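For reference, the procedure quoted above might be sketched roughly as follows.  This is only an illustration: the service names, database name, statesave path and package-update command are assumptions and will differ per site (and the statesave path should be taken from your own `scontrol show config`).

```shell
# Hedged sketch of the upgrade procedure described above -- all paths,
# service names and commands are site-specific assumptions.

# 1. Stop the Slurm daemons (slurmctld before slurmdbd is also common;
#    check the Slurm upgrade documentation for your versions).
systemctl stop slurmctld slurmdbd

# 2. Dump the accounting database, just in case
#    ("slurm_acct_db" is the default name, but verify yours).
mysqldump slurm_acct_db > /root/slurm_acct_db.sql

# 3. Copy the statesave directory (path from StateSaveLocation).
cp -a /var/spool/slurmctld /root/statesave.backup

# 4. Update the Slurm packages (your site's method may differ).
yum update 'slurm*'

# 5. Restart, slurmdbd first, then slurmctld.
systemctl start slurmdbd slurmctld
```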
>
> 1. When you refer to "daemons", do you mean slurmctld, slurmdbd as well as
> slurmd on all compute nodes?  AFAIK, the recommended procedure is to upgrade
> and restart in this order: 1) slurmdbd, 2) slurmctld, 3) slurmd on the nodes.

We don't stop slurmd on the nodes.  The nodes only get the new Slurm
version on their next reboot.  The documentation mentions the possibility
of this kind of rolling upgrade and we haven't had any problems with it.
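During such a rolling upgrade the cluster runs with mixed slurmd versions for a while.  If I remember correctly, sinfo can show the version running on each node, which is handy for tracking which nodes have rebooted onto the new release (the `%v` format specifier is what I believe prints the version; check your sinfo man page):

```shell
# List each node with the Slurm version its slurmd reports
# ("%v" prints the version, per my reading of the sinfo man page).
sinfo --Node --format="%N %v"
```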

> 2. When you mention statesave, I suppose this is what you refer to:
> # scontrol show config | grep -i statesave
> StateSaveLocation       = /var/spool/slurmctld

Yes, that's right.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
