On 27/09/14 08:30, John Brunelle wrote:
> This caused a bit of trouble for us when we patched some head nodes
> before compute nodes.
We did some testing to confirm that:

A) If you update a login node before the compute nodes, jobs will fail
   as John describes.

B) If you update a compute node while jobs are queued under the
   previous bash, they will fail when they run there (they also cannot
   find modules, even though a prologue of ours sets BASH_ENV to force
   the environment variables to be set).

Our way to (hopefully safely) upgrade our x86-64 clusters was:

0) Note that our slurmctld runs on the cluster management node, which
   is separate from the login nodes and not accessible to users.

1) Kick all the users off the login nodes, update bash, and reboot
   them (ours come back with nologin enabled to stop users getting
   back on before we're ready).

2) Set all partitions down to stop new jobs starting.

3) Move all compute nodes to an "old" partition.

4) Move all queued (pending) jobs to the "old" partition.

5) Update bash on any idle nodes and move them back to our "main"
   (default) partition.

6) Set AllowGroups on the "old" partition so users can't submit jobs
   to it by accident.

7) Let users back onto the login nodes.

8) Set partitions back to "up" to start jobs going again.

Hope this helps folks..

cheers!
Chris

-- 
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected]      Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci
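P.S. For anyone wanting to script it, the partition shuffle in steps
2-6 and 8 above looks roughly like this with scontrol. It's only a
sketch, not our exact procedure: the partition names ("main", "old"),
the node range, and the admin group are made up for the example, and
it's written as a dry run (RUN=echo) so you can eyeball the commands
before letting any of them loose on a real cluster.

```shell
#!/bin/sh
# Sketch of steps 2-6 and 8 using scontrol.  All names below are
# illustrative placeholders.  RUN=echo makes this a dry run that only
# prints each command; set RUN to empty to really execute them.

upgrade_shuffle() {
    # 2) Set all partitions down so no new jobs start.
    $RUN scontrol update PartitionName=main State=DOWN
    $RUN scontrol update PartitionName=old State=DOWN

    # 3) Move the compute nodes into the "old" partition (a node may
    #    be listed in more than one partition; trim "main"'s Nodes=
    #    list to match as nodes are updated).
    $RUN scontrol update PartitionName=old Nodes="node[001-100]"

    # 4) Move every queued (pending) job to the "old" partition.
    #    squeue prints one pending job ID per line; on a box without
    #    Slurm this simply yields an empty list.
    for jobid in $(squeue --noheader --states=PENDING --format=%i 2>/dev/null); do
        $RUN scontrol update JobId="$jobid" Partition=old
    done

    # 5) After updating bash on idle nodes, hand them back to "main".
    $RUN scontrol update PartitionName=main Nodes="node[001-050]"

    # 6) Restrict "old" so users can't submit to it by accident.
    $RUN scontrol update PartitionName=old AllowGroups=sysadmin

    # 8) Once the login nodes are reopened, start jobs going again.
    $RUN scontrol update PartitionName=main State=UP
}

RUN=echo upgrade_shuffle   # dry run: prints the commands
```

Step 7 (letting users back onto the login nodes) is deliberately left
out, since that's nologin/site-specific rather than Slurm.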
