Yeah, we keep a test cluster around for that purpose, to vet Slurm upgrades before we roll them out on the production cluster.

Thus far no problems. However, paranoia is usually a good thing for cases like this.

-Paul Edmon-


On 06/26/2017 07:30 AM, Ole Holm Nielsen wrote:

On 06/26/2017 01:24 PM, Loris Bennett wrote:
We have also upgraded in place continuously from 2.2.4 to currently
16.05.10 without any problems.  As I mentioned previously, it can be
handy to make a copy of the statesave directory, once the daemons have
been stopped.
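
A minimal Python sketch of such a copy, assuming the daemons are already stopped. The function name and the timestamped naming scheme are illustrative only and not part of Slurm; the directory to copy would be whatever StateSaveLocation is set to in slurm.conf.

```python
import shutil
import time
from pathlib import Path


def backup_state_dir(state_dir, backup_root):
    """Copy state_dir into a new timestamped directory under backup_root.

    Hypothetical helper: run it only after the Slurm daemons have been
    stopped, so that the copied files are not changing underneath it.
    """
    source = Path(state_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"not a directory: {source}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree raises FileExistsError if target already exists, so an
    # earlier backup is never silently overwritten.
    shutil.copytree(source, target, symlinks=True)
    return target
```

The returned path can be logged, so that the copy is easy to find if the state has to be restored after a failed upgrade.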

However, if you want to know how long the upgrade might take, then yours
is a good approach.  What is your use case here?  Do you want to inform
the users about the length of the outage with regard to job submission?

I want to be 99.9% sure that upgrading (my first one) will actually work. I also want to know roughly how long the slurmdbd will be down so that the cluster doesn't kill all jobs due to timeouts. Better to be safe than sorry.

I don't expect to inform the users, since the operation is expected to run smoothly without any trouble for user jobs.

Thanks,
Ole

Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database

We did it in place, worked as noted on the tin. It was less painful
than I expected. TBH, your procedures are admirable, but you shouldn't
worry - it's a relatively smooth process.

cheers
L.

------
"Mission Statement: To provide hope and inspiration for collective action, to build collective power, to achieve collective transformation, rooted in grief and rage but pointed towards vision and dreams."

- Patrisse Cullors, Black Lives Matter founder

On 26 June 2017 at 20:04, Ole Holm Nielsen <[email protected]> wrote:

We're planning to upgrade Slurm 16.05 to 17.02 soon. The most critical step seems to me to be the upgrade of the slurmdbd database, which may also take tens of minutes.

I thought it would be a good idea to test the slurmdbd database upgrade locally on a drained compute node in order to verify both correctness and the time required.
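
To measure the time required, each step of a dry run can be wrapped in a small timer. This is only a sketch in Python: the function name is hypothetical, and the command passed in is whatever the test procedure actually runs.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    completed = subprocess.run(argv, check=False)
    elapsed = time.monotonic() - start
    return completed.returncode, elapsed


if __name__ == "__main__":
    # Placeholder command; substitute the real dry-run step here.
    code, seconds = time_command([sys.executable, "-c", "pass"])
    print(f"exit code {code}, took {seconds:.1f} s")
```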

I've developed the dry run upgrade procedure documented in the Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

Question 1: Would people who have real-world Slurm upgrade experience kindly offer comments on this procedure?

My testing was actually successful, and the database conversion took less than 5 minutes in our case.

A crucial step is starting the slurmdbd manually after the upgrade. But how can we be sure that the database conversion has been 100% completed?

Question 2: Can anyone confirm that the output "slurmdbd: debug2: Everything rolled up" indeed signifies that conversion is complete?
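
For what it's worth, looking for that message in captured slurmdbd output can be scripted. The Python sketch below only reports where the string first appears; whether that line really means the conversion has finished is exactly the open question above, and the helper name is made up for illustration.

```python
MARKER = "Everything rolled up"


def find_marker(lines, marker=MARKER):
    """Return the 1-based number of the first line containing marker.

    Returns None if no line contains it. This only detects the message;
    it does not prove that the database conversion is complete.
    """
    for number, line in enumerate(lines, start=1):
        if marker in line:
            return number
    return None


if __name__ == "__main__":
    captured = [
        "placeholder line, not real slurmdbd output",
        "slurmdbd: debug2: Everything rolled up",
    ]
    print(find_marker(captured))
```

For a real run, the list would be replaced by an open file object holding the saved slurmdbd output.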

  Thanks,
  Ole



