Yeah, we keep a test cluster environment around for exactly that purpose:
to vet Slurm upgrades before we roll them out on the production cluster.
Thus far no problems. However, paranoia is usually a good thing for
cases like this.
-Paul Edmon-
On 06/26/2017 07:30 AM, Ole Holm Nielsen wrote:
On 06/26/2017 01:24 PM, Loris Bennett wrote:
We have also upgraded in place continuously, from 2.2.4 up to our current
16.05.10, without any problems. As I mentioned previously, it can be
handy to make a copy of the statesave directory, once the daemons have
been stopped.
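A minimal sketch of such a statesave copy, assuming the daemons have already been stopped and that the directory passed in is whatever StateSaveLocation points to in slurm.conf. The function name, the archive naming scheme and the paths in the usage note are illustrative only and are not part of Slurm.

```python
import tarfile
import time
from pathlib import Path


def backup_statesave(statesave_dir, dest_dir):
    """Archive statesave_dir into a timestamped .tar.gz under dest_dir.

    Intended to be run only after the Slurm daemons have been stopped,
    so the state files are not changing while they are copied.
    Returns the path of the archive that was written.
    """
    src = Path(statesave_dir)
    if not src.is_dir():
        raise NotADirectoryError(f"not a directory: {src}")

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"statesave-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        # Store everything under the directory's own name so the archive
        # unpacks into a single top-level folder.
        tar.add(src, arcname=src.name)
    return archive
```

It could be called as backup_statesave("/var/spool/slurmctld", "/root/slurm-backup"), where both paths are hypothetical; the real statesave path should be taken from the site's own slurm.conf.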
However, if you want to know how long the upgrade might take, then yours
is a good approach. What is your use case here? Do you want to inform
the users about the length of the outage with regard to job submission?
I want to be 99.9% sure that the upgrade (my first one) will actually
work. I also want to know roughly how long the slurmdbd will be down,
so that the cluster doesn't kill all jobs due to timeouts. Better to
be safe than sorry.
I don't expect to inform the users, since the operation is expected to
run smoothly without trouble for user jobs.
Thanks,
Ole
Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database
We did it in place, and it worked as it says on the tin. It was less painful
than I expected. TBH, your procedures are admirable, but you shouldn't
worry - it's a relatively smooth process.
cheers
L.
------
"Mission Statement: To provide hope and inspiration for collective
action, to build collective power, to achieve collective
transformation, rooted in grief and rage but pointed towards vision
and dreams."
- Patrisse Cullors, Black Lives Matter founder
On 26 June 2017 at 20:04, Ole Holm Nielsen wrote:
We're planning to upgrade Slurm 16.05 to 17.02 soon. The most
critical step seems to me to be the upgrade of the slurmdbd
database, which may also take tens of minutes.
I thought it would be a good idea to test the slurmdbd database upgrade
locally on a drained compute node in order to verify both
correctness and the time required.
I've developed the dry run upgrade procedure documented in the
Wiki page
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
Question 1: Would people who have real-world Slurm upgrade
experience kindly offer comments on this procedure?
My testing was actually successful, and the database conversion
took less than 5 minutes in our case.
A crucial step is starting the slurmdbd manually after the
upgrade. But how can we be sure that the database conversion has
been 100% completed?
Question 2: Can anyone confirm that the output "slurmdbd: debug2:
Everything rolled up" indeed signifies that conversion is complete?
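One way to measure the conversion time in a dry run is to start the daemon in the foreground and record how long it takes for a given message to appear. The sketch below does only that. It assumes, as Question 2 asks but this thread does not confirm, that "Everything rolled up" marks the end of the conversion. The function name and the marker argument are illustrative.

```python
import subprocess
import time


def time_until_marker(cmd, marker):
    """Run cmd, echo its output, and return seconds until marker is seen.

    Returns None if the process exits without printing the marker.
    The process is terminated as soon as the marker shows up.
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        # Merge both streams so the message is caught whichever
        # stream the daemon writes it to.
        stderr=subprocess.STDOUT,
        text=True,
    )
    try:
        for line in proc.stdout:
            print(line, end="")
            if marker in line:
                return time.monotonic() - start
        return None
    finally:
        # Stop the foreground process if it is still running.
        if proc.poll() is None:
            proc.terminate()
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
        proc.stdout.close()
```

On the test node this might be called as time_until_marker(["slurmdbd", "-D", "-vvvv"], "Everything rolled up"), run against a restored copy of the database. Whether the marker really means the conversion is complete is still the open question above, so the returned time is only as reliable as that assumption.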
Thanks,
Ole