On 04.03.2012 at 01:33, Simon Matthews wrote:

> On Sat, Mar 3, 2012 at 4:24 PM, Malcolm Cowe <[email protected]> wrote:
> We've been running a "shared-nothing" SGE deployment for the last 3 years 
> with no real issues. Upgrading software can be a pain on a large population, 
> but it's manageable. We also set up an active-passive HA failover cluster for 
> the SGE queue master, since the SGE master/shadow configuration won't work if 
> there is no shared SGE_ROOT.

Yes, what needs to be shared across the cluster is $SGE_ROOT/default/common 
(or your cell's name instead of "default"), so that all hosts see the changed 
name in act_qmaster. The spool directory must also be shared between the 
qmaster machines.


> The HA cluster has 2 nodes with a shared disk for the SGE software and 
> configuration. We use Heartbeat (don't ask -- it's too complicated) but would 
> recommend a modern cluster platform such as Pacemaker (on Linux). The 
> resource group has a floating IP address in addition to the shared disk and 
> the qmaster service. We set "$SGE_ROOT/$SGE_CELL/common/act_qmaster" to the 
> host name of the floating IP on all hosts. We also set the host_aliases to 
> map the floating IP to each of the cluster nodes, but I don't think that's 
> actually necessary.
> 
> In the end, I moved only the spool directories to local directories. On each 
> execution host, I copied the existing spool directory to 
> /gridengine/spool (a local directory) and then on the SGE_ROOT tree, I 
> created a symbolic link to /gridengine/spool called
> $SGE_ROOT/$SGE_CELL/spool
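
A minimal sketch of the copy-and-symlink approach quoted above. The function
names and the ".bak" backup are hypothetical and not from the thread; it
assumes the execds are stopped while the directories are moved.

```shell
#!/bin/sh
# Run on every execution host: copy the existing spool tree to local disk.
copy_spool_local() {
    shared_spool=$1     # e.g. "$SGE_ROOT/$SGE_CELL/spool"
    local_spool=$2      # e.g. /gridengine/spool
    mkdir -p "$local_spool" || return 1
    # Copy the contents, preserving modes and timestamps.
    cp -pR "$shared_spool"/. "$local_spool"/
}

# Run once on the SGE_ROOT tree, after every host has its local copy:
# replace the old spool directory with a symbolic link to the local path.
link_spool() {
    shared_spool=$1
    local_spool=$2
    # Nothing to do if the link is already in place.
    [ -L "$shared_spool" ] && return 0
    # Keep the original directory as a backup instead of deleting it.
    mv "$shared_spool" "$shared_spool.bak" || return 1
    ln -s "$local_spool" "$shared_spool"
}
```

Usage would be along the lines of
copy_spool_local "$SGE_ROOT/$SGE_CELL/spool" /gridengine/spool on each node,
followed by a single link_spool call with the same two arguments.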

NB: If anyone finds this thread and wants to implement this too, it can also 
be done this way:

- shut down the execds
- create /gridengine/spool/ (or /var/spool/sge) on all nodes
- in `qconf -mconf`, change the location of the spool directory 
(execd_spool_dir, the first entry) to the above
- start execds, the node's directory will be created automatically in 
/gridengine/spool/
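
The steps above could be scripted roughly as follows. This is only a sketch
under assumptions that are not from this thread: password-less ssh to each
node, the stock startup script at $SGE_ROOT/$SGE_CELL/common/sgeexecd, and
idle nodes. The names run, move_execd_spool and DRY_RUN are made up for the
example. With DRY_RUN=1 the commands are printed instead of executed.

```shell
#!/bin/sh
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: move_execd_spool <local spool dir> <host>...
move_execd_spool() {
    spool_dir=$1
    shift
    # Assumed location of the execd startup script (same path on all nodes).
    sgeexecd="$SGE_ROOT/$SGE_CELL/common/sgeexecd"

    # 1. Shut down the execds (qconf -ke leaves running jobs alone).
    for host in "$@"; do
        run qconf -ke "$host"
    done
    # 2. Create the local spool directory on every node.
    for host in "$@"; do
        run ssh "$host" "mkdir -p $spool_dir"
    done
    # 3. Opens an editor on the global configuration; change the spool
    #    directory entry to $spool_dir by hand.
    run qconf -mconf
    # 4. Start the execds; each one creates its own subdirectory.
    for host in "$@"; do
        run ssh "$host" "$sgeexecd start"
    done
}
```

For a first pass, something like
DRY_RUN=1 move_execd_spool /gridengine/spool node1 node2
shows what would be executed without touching the cluster.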

-- Reuti


> This solved the problems that I was seeing. 
> 
> We have a small enough network that I have not worried about failover for any 
> services except DNS (except for dhcpd, but then I removed the dhcpd failover 
> since failover in ISC dhcpd seems to cause more problems than it solves). 
> 
> Simon
> 
> Malcolm.
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

