Hi Nick,

We do our upgrades while full production is up and running. We just
stop the Slurm daemons, dump the database, and copy the statesave
directory just in case. We then do the update and finally restart the
Slurm daemons.

We only lost jobs once during an upgrade, back around 2.2.6 or so, but
that was due to a rather brittle configuration provided by our vendor
(the statesave path contained the Slurm version) rather than to Slurm
itself, and it was before we had acquired any Slurm expertise ourselves.
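
For what it's worth, the sequence above could be sketched roughly as follows. This is only an illustration, not our actual script: the helper name upgrade_plan, the systemd unit names, the example paths, and the database name slurm_acct_db (the usual default) are all assumptions, and mysqldump is assumed to pick up its credentials from the client configuration. The sketch only builds and prints the commands; it does not run anything.

```python
import shlex


def upgrade_plan(state_save_dir, backup_dir, db_name="slurm_acct_db"):
    """Return the backup-and-upgrade steps as a list of shell command strings.

    All names and paths are illustrative assumptions; adjust them to your site.
    """
    q = shlex.quote
    dump_file = f"{backup_dir}/{db_name}.sql"
    state_copy = f"{backup_dir}/statesave"
    return [
        # 1. Stop the Slurm daemons on the controller / database host.
        "systemctl stop slurmctld",
        "systemctl stop slurmdbd",
        # 2. Dump the accounting database, just in case.
        f"mysqldump --single-transaction {q(db_name)} > {q(dump_file)}",
        # 3. Copy the statesave directory, preserving ownership and modes.
        f"cp -a {q(state_save_dir)} {q(state_copy)}",
        # 4. Placeholder for the site-specific update step.
        "# install the new Slurm packages here (site-specific)",
        # 5. Restart the daemons, slurmdbd first so that the database is
        #    converted before slurmctld connects to it.
        "systemctl start slurmdbd",
        "systemctl start slurmctld",
    ]


if __name__ == "__main__":
    # Example paths are hypothetical.
    for command in upgrade_plan("/var/spool/slurmctld", "/root/slurm-backup"):
        print(command)
```

The slurmd daemons on the compute nodes would be stopped and restarted in the same way; they are left out here to keep the sketch short.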

Paul: How do you pause the jobs? SIGSTOP all the user processes on the
cluster?

Cheers,

Loris

Paul Edmon <[email protected]> writes:

> If you follow the guide on the Slurm website you shouldn't have many
> problems. We've made it standard practice here to set all partitions to
> DOWN and suspend all the jobs when we do upgrades. This has led to far
> greater stability, so we haven't lost any jobs in an upgrade. The only
> weirdness we have seen is if jobs exit while the DB upgrade is going.
> Sometimes it can leave residual jobs in the DB that weren't properly
> closed out. This is why we pause all the jobs, as it means we don't end
> up with jobs exiting before the DB is back. In 16.05+ you have the
>
>     sacctmgr show runawayjobs
>
> feature, which can clean up all those orphan jobs, so it's not as much
> of a concern anymore.
>
> Beyond that we follow the guide at the bottom of this page:
>
> https://slurm.schedmd.com/quickstart_admin.html
>
> I haven't tried going two major versions at once though. The docs
> indicate that it should work fine. We generally try to keep pace with
> current stable.
>
> Given that you only have 100,000 jobs, your upgrade should probably go
> fairly quickly. I could imagine around 10-15 minutes. Our DB has several
> million jobs and it takes about 30 min to an hour, depending on what
> operations are being done.
>
> -Paul Edmon-
>
> On 06/20/2017 09:37 AM, Nicholas McCollum wrote:
>
> I'm about to update 15.08 to the latest SLURM in August and would
> appreciate any notes you have on the process.
>
> I'm especially interested in maintaining the DB as well as associations.
> I'd also like to keep the pending job list if possible.
>
> I've only got around 100,000 jobs in the DB so far, since January.
>
> Thanks
>
> Nick McCollum
> Alabama Supercomputer Authority
>
> On Jun 20, 2017 8:07 AM, Paul Edmon <[email protected]> wrote:
>
> Yeah, that sounds about right. Changes between major versions can take
> quite a bit of time. In the past I've seen upgrades take 2-3 hours for
> the DB.
>
> As for ways to speed it up: putting the DB on newer hardware, if you
> haven't already, helps quite a bit (how much gain you get depends on the
> architecture; we went from AMD Abu Dhabi to Intel Broadwell and saw a
> factor of 3-4 speed improvement). Upgrading to the latest version of
> MariaDB, if you are on an old version of MySQL, can get you about
> 30-40%.
>
> Doing all of these whittled our DB upgrade times for major upgrades down
> to about 30 min or so.
>
> Beyond that I imagine some more specific DB optimization tricks could be
> done, but I'm not a DB admin so I won't venture to say.
>
> -Paul Edmon-
>
> On 06/20/2017 08:42 AM, Tim Fora wrote:
> > Hi,
> >
> > Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to
> > start. Logs show most of the time was spent on this step and other
> > table changes:
> >
> > adding column admin_comment after account in table
> >
> > Does this sound right? Any ideas to help speed things up?
> >
> > Thanks,
> > Tim

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
