On Sat, Mar 3, 2012 at 4:24 PM, Malcolm Cowe <[email protected]> wrote:
> We've been running a "shared-nothing" SGE deployment for the last 3 years
> with no real issues. Upgrading software can be a pain on a large
> population, but it's manageable. We also set up an active-passive HA
> failover cluster for the SGE queue master, since the SGE master/shadow
> configuration won't work if there is no shared SGE_ROOT.
>
> The HA cluster has 2 nodes with a shared disk for the SGE software and
> configuration. We use Heartbeat (don't ask -- it's too complicated) but
> would recommend a modern cluster platform such as Pacemaker (on Linux). The
> resource group has a floating IP address in addition to the shared disk and
> the qmaster service. We set "$SGE_ROOT/$SGE_CELL/common/act_qmaster" to the
> host name of the floating IP on all hosts. We also set the host_aliases to
> map the floating IP to each of the cluster nodes, but I don't think that's
> actually necessary.
>

In the end, I moved only the spool directories to local directories. On
each execution host, I copied the existing spool directory to
/gridengine/spool (a local directory). Then, in the SGE_ROOT tree, I
replaced $SGE_ROOT/$SGE_CELL/spool with a symbolic link pointing to
/gridengine/spool.

This solved the problems that I was seeing.

We have a small enough network that I have not worried about failover for
any services except DNS. (I did have failover for dhcpd, but I removed it,
since failover in ISC dhcpd seems to cause more problems than it solves.)

Simon

>
> Malcolm.
>
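For anyone who wants to script the spool move described above, here is a minimal Python sketch of the two steps. The function names and the ".orig" backup suffix are my own inventions for illustration, not anything SGE provides, and the sketch assumes the daemons writing to the spool are stopped while it runs. It also assumes the copy step is run on every execution host before the link step is run once on the shared tree, because after the link exists the shared path no longer leads to the old data.

```python
import os
import shutil


def copy_spool_to_local(shared_spool, local_spool):
    """Run on each execution host: copy the shared spool tree to local disk.

    shutil.copytree creates local_spool (and any missing parents) and
    raises FileExistsError if local_spool already exists.
    """
    shutil.copytree(shared_spool, local_spool, symlinks=True)


def replace_spool_with_symlink(shared_spool, local_spool):
    """Run once on the shared tree: swap the spool directory for a symlink.

    The old directory is kept under a hypothetical ".orig" suffix so the
    change can be rolled back by hand.
    """
    if os.path.islink(shared_spool):
        return  # already a symlink, nothing to do
    os.rename(shared_spool, shared_spool + ".orig")
    os.symlink(local_spool, shared_spool)


if __name__ == "__main__":
    # Paths from the message above; the SGE_ROOT and SGE_CELL fallbacks
    # are placeholders only.
    sge_root = os.environ.get("SGE_ROOT", "/opt/sge")
    sge_cell = os.environ.get("SGE_CELL", "default")
    shared = os.path.join(sge_root, sge_cell, "spool")
    copy_spool_to_local(shared, "/gridengine/spool")
    replace_spool_with_symlink(shared, "/gridengine/spool")
```

Because the link target is an absolute local path, each host that follows the link through the shared tree ends up in its own /gridengine/spool.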
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
