Because we want to maximize usage, we have opted to simply cancel all running jobs on the day of the outage. We send a notification to all users that this will happen. We have been doing this for years and haven't really seen any complaints. At the start of the outage we set all partitions to DOWN, then cancel all the running jobs. Pending jobs are left in place, and users can still submit work during the outage; when we reopen, everything gets going again.

So there is a third option, though you have to accept that jobs will be cancelled to pull it off.
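
For illustration, here is a minimal sketch of the sequence described above: mark partitions DOWN, cancel running jobs, and later bring the partitions back UP. It is not our actual script. The function names and the DRY_RUN switch are hypothetical helpers; only scontrol update PartitionName=... State=... and scancel --state=RUNNING are standard Slurm commands. By default it only prints what it would run.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and DRY_RUN are illustrative assumptions.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Set each named partition to DOWN: jobs can still be submitted and
# stay pending, but nothing new is started from the partition.
close_partitions() {
    local p
    for p in "$@"; do
        run scontrol update PartitionName="$p" State=DOWN
    done
}

# Cancel every job that is currently running (pending jobs are untouched).
cancel_running_jobs() {
    run scancel --state=RUNNING
}

# After maintenance, set each named partition back to UP so that
# pending jobs can start again.
reopen_partitions() {
    local p
    for p in "$@"; do
        run scontrol update PartitionName="$p" State=UP
    done
}
```

With the file sourced as root on a Slurm admin host, one possible way to use it is close_partitions $(sinfo -h -o "%R" | sort -u) followed by cancel_running_jobs at the start of the outage, and reopen_partitions with the same list afterwards. Set DRY_RUN=0 only once the printed commands look right.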

-Paul Edmon-

On 8/6/2020 1:13 PM, Jason Simms wrote:
Hello all,

Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into a DRAIN state.

I'm not sure it really matters either way, but is there any preference one way or the other? Any gotchas I should be aware of?

Warmest regards,
Jason

--
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632
