On Sat, Mar 3, 2012 at 4:24 PM, Malcolm Cowe <[email protected]> wrote:

>  We've been running a "shared-nothing" SGE deployment for the last 3 years
> with no real issues. Upgrading software can be a pain on a large
> population, but it's manageable. We also set up an active-passive HA
> failover cluster for the SGE queue master, since the SGE master/shadow
> configuration won't work if there is no shared SGE_ROOT.
>
> The HA cluster has 2 nodes with a shared disk for the SGE software and
> configuration. We use Heartbeat (don't ask -- it's too complicated) but
> would recommend a modern cluster platform such as Pacemaker (on Linux). The
> resource group has a floating IP address in addition to the shared disk and
> the qmaster service. We set "$SGE_ROOT/$SGE_CELL/common/act_qmaster" to the
> host name of the floating IP on all hosts. We also set the host_aliases to
> map the floating IP to each of the cluster nodes, but I don't think that's
> actually necessary.
>
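
For anyone following along, the act_qmaster step Malcolm describes might
look roughly like this in Python. It is only a sketch: the function name
and the host name in the usage note are made up for illustration, and in a
shared-nothing layout it would have to run on every host, since each host
has its own copy of $SGE_ROOT.

```python
from pathlib import Path


def set_act_qmaster(sge_root: str, sge_cell: str, qmaster_host: str) -> Path:
    """Write qmaster_host into <sge_root>/<sge_cell>/common/act_qmaster.

    qmaster_host is assumed to be the host name that resolves to the
    floating IP of the HA resource group.
    """
    path = Path(sge_root) / sge_cell / "common" / "act_qmaster"
    # The file holds a single line: the host name of the active qmaster.
    path.write_text(qmaster_host + "\n")
    return path
```

Usage would be something like
set_act_qmaster(os.environ["SGE_ROOT"], os.environ["SGE_CELL"], "sge-master.example.com")
where sge-master.example.com is a placeholder for the floating host name.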

In the end, I moved only the spool directories to local disk. On each
execution host, I copied the existing spool directory to /gridengine/spool
(a local directory). Then, in the SGE_ROOT tree, I created a symbolic link
named $SGE_ROOT/$SGE_CELL/spool that points to /gridengine/spool.
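
The two steps above could be sketched in Python along these lines. This is
an illustration under assumptions, not the exact commands I ran: the
function names are made up, the original directory is kept under a ".orig"
name rather than deleted, and I am assuming the execution daemons are
stopped while the spool is being copied.

```python
import shutil
from pathlib import Path


def copy_spool_to_local(shared_spool: Path, local_spool: Path) -> None:
    """Run on each execution host: copy the existing spool to local disk."""
    local_spool.parent.mkdir(parents=True, exist_ok=True)
    # symlinks=True copies links as links instead of following them.
    shutil.copytree(shared_spool, local_spool, symlinks=True,
                    dirs_exist_ok=True)


def link_spool(shared_spool: Path, local_spool: Path) -> None:
    """Run once on the SGE_ROOT tree: replace the spool dir with a symlink."""
    if shared_spool.is_symlink():
        return  # already converted, nothing to do
    # Keep the old directory as a backup (an assumption of this sketch).
    backup = shared_spool.with_name(shared_spool.name + ".orig")
    shared_spool.rename(backup)
    # The link target is an absolute local path, so every host that mounts
    # the tree resolves it to its own local spool directory.
    shared_spool.symlink_to(local_spool, target_is_directory=True)
```

With the paths from this thread, shared_spool would be
Path(os.environ["SGE_ROOT"]) / os.environ["SGE_CELL"] / "spool" and
local_spool would be Path("/gridengine/spool").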

This solved the problems that I was seeing.

Our network is small enough that I have not worried about failover for any
service except DNS. I did set up failover for dhcpd at one point, but I
removed it again, since failover in ISC dhcpd seems to cause more problems
than it solves.

Simon

>
> Malcolm.
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
