Hello,

I have a long message. I am not sure it is really appropriate for the Pacemaker
group, but you have a lot of experience... Happy to be told it is not
appropriate.

We have two data centers about 13 km apart, connected by multiple dark fibres.
Latency is about 0.2-0.3 ms; I forget whether that is round trip or one way. I
am not convinced that an extra fibre via an independent provider, to make the
connection between the two sites redundant, would be financially possible. Plus
there is a view that we should not rely on such things. We would like to move
to more "modern" tech that has clustering built in, runs on commodity hardware,
and at some point a 3rd data center.

Initially I was thinking of just moving floating IP addresses between the two
sites and running synchronous DB syncs (it's a SAP installation, but that is
not that relevant, I think). But this approach does not work anymore. Then I
realised that as a company we should not rely on such things, as said above.

So instead we would sync the DB synchronously within a site and asynchronously
between sites.

This is a bit more complex, as I need an odd number of nodes at each site and
must ensure there is no single point of failure at either site.
As an optimisation, it would be OK for one site to be a mini site without a
single point of failure, and to just switch back to the main site after a
failover/takeover to the mini site, if you know what I mean.
(The main site would then be 5 nodes in a Pacemaker cluster and the mini site
3 nodes in a cluster.)

I do not worry about scaling out; we can just add nodes 2 at a time at both
sites.

Failover within a site would eventually be automated using Pacemaker, and
planned takeovers between sites we could do by manually telling Pacemaker on
both sides what to do. Obviously we would test all this out, get it certified,
etc.

I hope I can use the term failover for unplanned and takeover for planned. 🙂
Our initial goal is to reduce planned downtime to zero (we do not have that now
for upgrades, patching, etc.) and to move to RPO 0 and minimal RTO.

As we do not have really redundant networks, being dependent on a quorum device
is not so good, because if the quorum device is lost the whole cluster goes
down. And as I understand it, you can only have one quorum device. So that's a
SPOF. So instead I have an odd number of nodes in the Pacemaker cluster in each
datacenter. For me that's OK, and somehow I think better than quorum devices.
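
To sanity-check the node counts, here is a minimal sketch of the majority
arithmetic as I understand it. It assumes the plain rule (quorum means more
than half of the votes), one vote per node, no quorum device and no special
quorum options. The function names are made up for illustration.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def tolerated_failures(total_votes: int) -> int:
    """How many nodes can be lost while the rest still hold a majority."""
    return total_votes - quorum_threshold(total_votes)


if __name__ == "__main__":
    # 3 = mini site, 5 = main site; 4 shows why an even count adds nothing.
    for nodes in (3, 4, 5):
        print(
            f"{nodes} nodes: quorum at {quorum_threshold(nodes)} votes, "
            f"tolerates {tolerated_failures(nodes)} lost node(s)"
        )
```

Under these assumptions a 4-node cluster tolerates no more lost nodes than a
3-node one, which is why I stick to odd numbers at each site.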

We use VMware (sigh...) and NetApp.

In terms of fencing, we are trying to fence using industry standards, e.g. not
going to the VMware management console but using more standard protocols, e.g.
via shared storage. I think I can make a good case for self-fencing using a
watchdog, as I understand this is the minimum that SBD needs. I found that
statement on a page on the old ClusterLabs website; I have not looked at the
new one.

So what are my questions?

  1. Am I right that the quorum device is a single point of failure? Just out
     of interest.
  2. If we ever want to somehow automate or semi-automate using Booth between
     data centers, is this a good idea? I looked a bit for documentation on
     Booth; I should look harder. But from gut feel, is Booth a possibility?
     Is there any alternative?
  3. Is watchdog-only fencing using SBD the absolute minimum?
  4. Would you recommend, in addition to the watchdog, doing resource fencing,
     i.e. taking the storage away or virtually pulling the ethernet cable (not
     sure how that works though)? Or just node fencing in addition to the
     watchdog via some defined way?
  5. Does using shared storage in SBD for poison pills really give me
     anything? I can't justify to myself that it does. Does it give anything
     else except poison pills?
  6. Have I forgotten a topic? 😉

Sorry for typos and grammar mistakes, it is late over here.

regards
Angelo