On 04.03.2012 at 01:33, Simon Matthews wrote:

> On Sat, Mar 3, 2012 at 4:24 PM, Malcolm Cowe <[email protected]> wrote:
> We've been running a "shared-nothing" SGE deployment for the last 3 years
> with no real issues. Upgrading software can be a pain on a large population,
> but it's manageable. We also set up an active-passive HA failover cluster for
> the SGE queue master, since the SGE master/shadow configuration won't work if
> there is no shared SGE_ROOT.

Yes, what needs to be shared across the cluster is $SGE_ROOT/default/common (or your cell's name in place of "default"), so that every host sees the changed name in act_qmaster. The spool directory also needs to be shared between the qmaster machines.

> The HA cluster has 2 nodes with a shared disk for the SGE software and
> configuration. We use Heartbeat (don't ask -- it's too complicated) but would
> recommend a modern cluster platform such as Pacemaker (on Linux). The
> resource group has a floating IP address in addition to the shared disk and
> the qmaster service. We set "$SGE_ROOT/$SGE_CELL/common/act_qmaster" to the
> host name of the floating IP on all hosts. We also set the host_aliases to
> map the floating IP to each of the cluster nodes, but I don't think that's
> actually necessary.
>
> In the end, I moved only the spool directories to local directories. On each
> execution host, I copied the existing spool directory to
> /gridengine/spool (a local directory) and then on the SGE_ROOT tree, I
> created a symbolic link to /gridengine/spool called
> $SGE_ROOT/$SGE_CELL/spool

NB: If anyone finds this thread and wants to implement this too, it can also be done as follows:

- shut down the execds
- create /gridengine/spool/ (or /var/spool/sge) on all nodes
- in `qconf -mconf`, change the location of the spool directory (first entry) to the above
- start the execds; each node's directory will be created automatically in /gridengine/spool/

-- Reuti

> This solved the problems that I was seeing.
>
> We have a small enough network that I have not worried about failover for any
> services except DNS (except for dhcpd, but then I removed the dhcpd failover
> since failover in ISC dhcpd seems to cause more problems than it solves).
>
> Simon
>
> Malcolm.
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
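
For anyone who wants to review Reuti's four steps before touching a cluster, they could be written down as a dry-run plan along these lines. This is only a sketch: it builds a list of (where to run, command) pairs and executes nothing. The host names node01/node02 and the function name spool_move_plan are made up for illustration. It assumes that the "first entry" in `qconf -mconf` is the execd_spool_dir parameter, that `qconf -ke` is run from an admin host, and that the execd startup script sits at $SGE_ROOT/$SGE_CELL/common/sgeexecd as in a standard install; check all three against your own setup.

```python
import shlex


def spool_move_plan(hosts, spool_dir="/gridengine/spool"):
    """Return (where, command) pairs for the four steps, in order.

    Dry run only: nothing is executed. Host names are hypothetical.
    """
    quoted_dir = shlex.quote(spool_dir)
    plan = []

    # Step 1: shut down the execution daemons.
    # Assumption: qconf is run from an admin host.
    for host in hosts:
        plan.append(("admin host", f"qconf -ke {shlex.quote(host)}"))

    # Step 2: create the local spool directory on every node.
    for host in hosts:
        plan.append((host, f"mkdir -p {quoted_dir}"))

    # Step 3: qconf -mconf opens an editor; the change itself is manual.
    # Assumption: the entry to change is execd_spool_dir.
    plan.append(
        ("admin host", f"qconf -mconf  # set: execd_spool_dir {spool_dir}")
    )

    # Step 4: start the execds again; each node then creates its own
    # subdirectory under spool_dir.
    # Assumption: standard location of the startup script.
    for host in hosts:
        plan.append((host, "$SGE_ROOT/$SGE_CELL/common/sgeexecd start"))

    return plan


if __name__ == "__main__":
    for where, command in spool_move_plan(["node01", "node02"]):
        print(f"[{where}] {command}")
```

Passing "/var/spool/sge" as spool_dir gives the same plan for the alternative location mentioned above. Keeping the plan as plain data rather than running it directly makes it easy to read through, or paste into whatever remote-execution tool the site already uses.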
