We actually use:

scontrol suspend

I will note, though, that with the suspend behavior Slurm has, it is best to do this while the partitions are set to DOWN so new jobs don't try to schedule; otherwise you will end up with oversubscribed nodes. Failing that, SIGSTOP is also a good route to go.
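
For what it's worth, a rough sketch of that sequence (the partition name is
just a placeholder, and you would repeat the partition commands for each
partition):

# close the partition so nothing new gets scheduled
scontrol update PartitionName=batch State=DOWN

# suspend everything that is currently running
squeue -h -t R -o %i | xargs -r -n1 scontrol suspend

# ... do the upgrade ...

# resume the suspended jobs and reopen the partition
squeue -h -t S -o %i | xargs -r -n1 scontrol resume
scontrol update PartitionName=batch State=UP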

-Paul Edmon-


On 06/20/2017 10:32 AM, Loris Bennett wrote:
Hi Nick,

We do our upgrades while full production is up and running.  We just stop
the Slurm daemons, dump the database and copy the statesave directory
just in case.  We then do the update, and finally restart the Slurm
daemons.  We only lost jobs once during an upgrade back around 2.2.6 or
so, but that was due to a rather brittle configuration provided by our
vendor (the statesave path contained the Slurm version), rather than to
Slurm itself, and it was before we had acquired any Slurm expertise
ourselves.
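
In rough outline it is something like the following (the service names,
database name, and statesave path are just examples and will depend on
your installation):

# stop the daemons
systemctl stop slurmctld slurmdbd

# dump the accounting database and copy the statesave directory, just in case
mysqldump slurm_acct_db > slurm_acct_db-backup.sql
cp -a /var/spool/slurm/statesave /var/spool/slurm/statesave.bak

# install the new Slurm packages, then start slurmdbd first so it can
# finish any schema conversion before slurmctld comes back
systemctl start slurmdbd
systemctl start slurmctld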

Paul: How do you pause the jobs?  SIGSTOP all the user processes on the
cluster?

Cheers,

Loris


Paul Edmon <ped...@cfa.harvard.edu> writes:

If you follow the guide on the Slurm website you shouldn't have many problems. 
We've made it standard practice here to set all partitions to DOWN and suspend 
all the jobs when we do upgrades. This has led to
far greater stability, so we haven't lost any jobs in an upgrade. The only
weirdness we have seen is if jobs exit while the DB upgrade is running; sometimes
that can leave residual jobs in the DB that were never properly closed
out. This is why we pause all the jobs: it ensures we don't end up
with jobs exiting before the DB is back. In 16.05+ you have:

sacctmgr show runawayjobs

That command can clean up all those orphan jobs, so it's not as much of a concern
anymore.

Beyond that we follow the guide at the bottom of this page:

https://slurm.schedmd.com/quickstart_admin.html

I haven't tried going two major versions at once though. The docs indicate that 
it should work fine. We generally try to keep pace with current stable.

Given that you only have 100,000 jobs, your upgrade should probably go fairly
quickly; I would guess around 10-15 minutes. Our DB has several million jobs,
and it takes about 30 minutes to an hour depending on what operations are
being done.

-Paul Edmon-

On 06/20/2017 09:37 AM, Nicholas McCollum wrote:

  I'm about to update 15.08 to the latest SLURM in August and would appreciate 
any notes you have on the process.

  I'm especially interested in maintaining the DB as well as associations. I'd 
also like to keep the pending job list if possible.

  I've only got around 100,000 jobs in the DB so far, since January.

  Thanks

  Nick McCollum
  Alabama Supercomputer Authority

  On Jun 20, 2017 8:07 AM, Paul Edmon <ped...@cfa.harvard.edu> wrote:

  Yeah, that sounds about right. Changes between major versions can take
  quite a bit of time. In the past I've seen upgrades take 2-3 hours for
  the DB.

  As for ways to speed it up: putting the DB on newer hardware, if you
  haven't already, helps quite a bit (how much you gain depends on the
  architecture; we went from AMD Abu Dhabi to Intel Broadwell and saw a
  factor of 3-4 speed improvement). Upgrading to the latest version of
  MariaDB, if you are on an old version of MySQL, can get you about
  30-40%.

  Doing all of these whittled our DB upgrade times for major upgrades to
  about 30 min or so.

  Beyond that I imagine some more specific DB optimization tricks could be
  done, but I'm not a DB admin so I won't venture to say.
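
  One tweak that does come up, and which I believe is mentioned in the Slurm
  accounting docs, is giving InnoDB a bigger buffer pool and log file in
  my.cnf. The numbers below are only a rough starting point and should be
  sized to your database and RAM:

  [mysqld]
  # let the accounting tables fit in memory where possible
  innodb_buffer_pool_size = 4G
  # a larger redo log helps the big table conversions during an upgrade
  innodb_log_file_size = 64M
  # avoid lock timeouts during long-running conversions
  innodb_lock_wait_timeout = 900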

  -Paul Edmon-

  On 06/20/2017 08:42 AM, Tim Fora wrote:
  > Hi,
  >
  > Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to
  > start. Logs show most of the time was spent on this step and other table
  > changes:
  >
  > adding column admin_comment after account in table
  >
  > Does this sound right? Any ideas to help speed things up?
  >
  > Thanks,
  > Tim
  >
  >

