On 27/09/14 08:30, John Brunelle wrote:

> This caused a bit of trouble for us when we patched some head nodes
> before compute nodes.

We did some testing to confirm that:

A) If you update a login node before the compute nodes, jobs will fail
as John describes.

B) If you update a compute node while there are jobs still queued under
the previous bash, they will fail when they run there (they also cannot
find modules, even though one of our prologues sets BASH_ENV to force
the environment variables to be set).
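
For context, the BASH_ENV mechanism mentioned above works because
non-interactive bash shells source whatever file BASH_ENV points to. A
minimal sketch of the relevant fragment of a Slurm task prologue (the
path to the modules init script is an example from our setup, not
universal):

```shell
# Illustrative fragment of a Slurm TaskProlog script. A TaskProlog
# exports variables into the job environment by printing
# "export NAME=value" lines to stdout.
# BASH_ENV is sourced by non-interactive bash, so batch scripts
# should pick up the environment modules setup from it.
echo "export BASH_ENV=/usr/local/Modules/init/bash"
```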


Our procedure for (hopefully safely) upgrading our x86-64 clusters was:

0) Note that our slurmctld runs on the cluster management node, which is
separate from the login nodes and not accessible to users.

1) Kick all the users off the login nodes, update bash, and reboot them
(ours come back with nologin enabled to stop users from getting back on
before we're ready).

2) Set all partitions down to stop new jobs starting.
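
A sketch of that step with scontrol ("main" is an example partition
name, not necessarily yours):

```shell
# Mark the partition DOWN so the scheduler starts no new jobs in it;
# running jobs are unaffected.
scontrol update PartitionName=main State=DOWN
```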

3) Move all compute nodes to an "old" partition.
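
Roughly, assuming the partition and node names below (yours will
differ), this can be done at runtime with scontrol:

```shell
# Create the holding partition and move the node range into it,
# emptying the default partition.
scontrol create PartitionName=old Nodes=node[001-100]
scontrol update PartitionName=main Nodes=""
```

Note that runtime scontrol changes like these don't survive a slurmctld
restart unless the same change is mirrored in slurm.conf.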

4) Move all queued (pending) jobs to the "old" partition.
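
One way to script that (partition name again an example):

```shell
# List pending job IDs without headers, then reassign each one
# to the "old" partition.
for jobid in $(squeue -t PENDING -h -o %i); do
    scontrol update JobId=$jobid Partition=old
done
```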

5) Update bash on any idle nodes and move them back to our "main"
(default) partition.
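
A rough sketch of finding and patching the idle nodes; pdsh and the
package manager invocation are site-specific examples, not a
recommendation:

```shell
# Collect idle node names from the "old" partition as a comma list,
# then update bash on them in parallel.
IDLE=$(sinfo -p old -t idle -h -o %n | paste -sd,)
pdsh -w "$IDLE" 'yum -y update bash'
# ...then move the patched nodes back into the default partition,
# e.g. with further "scontrol update PartitionName=... Nodes=..." calls.
```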

6) Set an AllowGroups on the "old" partition so users can't submit jobs
to it by accident.
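
For example ("sysadmin" is a placeholder group name):

```shell
# Restrict the holding partition to an admin group so ordinary
# users can't accidentally submit jobs to un-patched nodes.
scontrol update PartitionName=old AllowGroups=sysadmin
```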

7) Let users back onto the login nodes.

8) Set partitions back to "up" to start jobs going again.
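
Mirroring step 2 (same illustrative partition names):

```shell
# Resume scheduling in both partitions.
scontrol update PartitionName=main State=UP
scontrol update PartitionName=old State=UP
```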


Hope this helps, folks.

cheers!
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
