2014-09-29 11:10 GMT+02:00 Alan Orth <[email protected]>:
>
> Wow, well spotted. I came here to see if anyone had reported this same
> issue with environment modules, as I noticed several of my jobs failing
> on our cluster this morning. It turns out I'm probably the only one who
> had failed jobs, as I have a long-running tmux session open on the head
> node, and therefore a shell still running the old bash. ;)
>
> Other users wouldn't have noticed because we updated all of our
> infrastructure in one go using ansible[0]
^^^^^^^^^ +1 :)
> last Friday.
>
> In any case, glad to be in good company. Cheers!
>
> Alan
>
> [0] http://mjanja.co.ke/2014/09/update-hosts-via-ansible-to-mitigate-bash-shellshock-vulnerability/
>
> On 09/29/2014 08:27 AM, Christopher Samuel wrote:
> > On 27/09/14 08:30, John Brunelle wrote:
> >
> >> This caused a bit of trouble for us when we patched some head nodes
> >> before compute nodes.
> > We did some testing to confirm that:
> >
> > A) If you update a login node before the compute nodes, jobs will fail as
> > John describes.
> >
> > B) If you update a compute node while jobs submitted under the previous
> > bash are still queued, those jobs will fail when they run there (they also
> > cannot find modules, even though a prologue of ours sets BASH_ENV to force
> > the env vars to get set).
> >
> >
> > Our way to (hopefully safely) upgrade our x86-64 clusters was:
> >
> > 0) Note that our slurmctld runs on the cluster management node which is
> > separate to the login nodes and not accessible to users.
> >
> > 1) Kick all the users off the login nodes, update bash, reboot them
> > (ours come back with nologin enabled to stop users getting back on
> > before we're ready).
> >
> > 2) Set all partitions down to stop new jobs starting
> >
> > 3) Move all compute nodes to an "old" partition
> >
> > 4) Move all queued (pending) jobs to the "old" partition
> >
> > 5) Update bash on any idle nodes and move them back to our "main"
> > (default) partition
> >
> > 6) Set an AllowGroups on the "old" partition so users can't submit jobs
> > to it by accident.
> >
> > 7) Let users back onto the login nodes.
> >
> > 8) Set partitions back to "up" to start jobs going again.
> >
> >
> > Hope this helps, folks.
> >
> > cheers!
> > Chris
>
> --
> Alan Orth
> [email protected]
> http://alaninkenya.org
> http://mjanja.co.ke
> "I have always wished for my computer to be as easy to use as my
> telephone; my wish has come true because I can no longer figure out how to
> use my telephone." -Bjarne Stroustrup, inventor of C++
> GPG public key ID: 0x8cb0d0acb5cd81ec209c6cdfbd1a0e09c2f836c0
>
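
The scheduler side of Chris's procedure (steps 2 to 8 above) could be scripted
roughly as follows. This is only a sketch, not something taken from his site.
It wraps the relevant scontrol calls in small helper functions. The function
names and the DRY_RUN switch are hypothetical, and it assumes that the "old"
partition already exists and that setting Nodes= on a partition replaces that
partition's node list. By default it only prints the commands it would run.

```shell
#!/bin/bash
# Sketch of steps 2-8 of the upgrade procedure. All helper names are
# hypothetical. Commands are printed unless DRY_RUN=0 is set.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 2 and 8: set a partition DOWN or UP.
# usage: set_partition_state <partition> <UP|DOWN>
set_partition_state() {
    run scontrol update PartitionName="$1" State="$2"
}

# Steps 3 and 5: set the node list of a partition.
# Assumption: this replaces the partition's current node list, so the
# caller passes the complete list the partition should end up with.
# usage: set_partition_nodes <partition> <nodelist>
set_partition_nodes() {
    run scontrol update PartitionName="$1" Nodes="$2"
}

# Step 4: move jobs to a partition. Job IDs are read from stdin, one per line.
# usage: squeue -h -t PENDING -o %i | move_jobs <partition>
move_jobs() {
    while read -r jobid || [ -n "$jobid" ]; do
        if [ -n "$jobid" ]; then
            run scontrol update JobId="$jobid" Partition="$1"
        fi
    done
}

# Step 6: restrict which group may submit to a partition.
# usage: restrict_partition <partition> <group>
restrict_partition() {
    run scontrol update PartitionName="$1" AllowGroups="$2"
}
```

A run might then look like this (the node names and the group are made up):

    set_partition_state main DOWN                 # step 2
    set_partition_nodes old 'node[001-100]'       # step 3
    squeue -h -t PENDING -o %i | move_jobs old    # step 4
    set_partition_nodes main 'node[001-040]'      # step 5, after updating bash there
    set_partition_nodes old 'node[041-100]'       # step 5, nodes still to be updated
    restrict_partition old admins                 # step 6
    set_partition_state main UP                   # step 8

Kicking users off the login nodes and letting them back on (steps 1 and 7) is
site-specific and is left out. Check the printed commands first, then rerun
with DRY_RUN=0 to apply them.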