There is no black magic going on with weights or hierarchies. Whatever you do, 
you can't escape the curse of the intersection: any two quorums must share at 
least one server, so if each of two data centers could form a quorum on its 
own, two disjoint quorums could make conflicting decisions. I'm sorry for 
stating the obvious, but that's the key reason why masking one data center 
going down out of two is so hard.

In general, using majorities makes it simpler to reason about the problem, in 
particular because the failure scenarios are uniform. With weights you can 
play some tricks, but they do not necessarily give you a great advantage. For 
example, say I have 6 servers, split them between two data centers, and assign 
a weight of 2 to one of them. I can then form a quorum in a single data center 
despite having an equal number of servers in each. I could alternatively have 
5 voting members and one observer (5V+1O) and get similar behavior. One 
difference is that while the server with weight 2 is up, I can tolerate up to 
3 nodes crashing; if the server with weight 2 is down, I can tolerate only one 
additional crash. With the 5V+1O configuration, I can always tolerate two 
crashed servers.
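
For concreteness, here is a rough zoo.cfg sketch of the weighted variant (ids, 
hostnames and ports are made up; servers 1-3 are in DC A, 4-6 in DC B, and 
server 1 carries weight 2). As far as I recall, weights are handled by the 
hierarchical quorum verifier, so I'm listing all six voters in a single group; 
treat this as a sketch rather than a tested configuration:

    server.1=dcA-1:2888:3888
    server.2=dcA-2:2888:3888
    server.3=dcA-3:2888:3888
    server.4=dcB-1:2888:3888
    server.5=dcB-2:2888:3888
    server.6=dcB-3:2888:3888
    group.1=1:2:3:4:5:6
    weight.1=2
    weight.2=1
    weight.3=1
    weight.4=1
    weight.5=1
    weight.6=1

The total weight is 7, so a quorum needs weight 4, which DC A can reach on its 
own (2+1+1) while DC B cannot.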

The hierarchical stuff starts making sense for me with more complex scenarios. 
For example, say I have three data centers and I put three servers in each. 
Using a hierarchy, I can tolerate one data center going down plus one crash in 
each of the remaining data centers, so I can make progress with 4 nodes up, 
even though my total is 9. This works because we pick a majority of votes from 
a majority of groups. Note that it is not any 4 servers, though. If one data 
center is down, then I can't crash two nodes in one of the remaining data 
centers. Also, because we only require 4 votes to form a quorum, we require 
fewer votes to commit requests, and possibly fewer cross-DC votes. This last 
observation matters mostly in failure scenarios, because with a plain majority 
quorum one also needs only two cross-DC votes in the absence of crashes.
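
A sketch of the corresponding hierarchical configuration, using the group.x 
syntax from the admin guide Michael linked below (ids and hostnames are 
placeholders; servers 1-3, 4-6 and 7-9 map to the three data centers):

    server.1=dc1-a:2888:3888
    server.2=dc1-b:2888:3888
    server.3=dc1-c:2888:3888
    server.4=dc2-a:2888:3888
    server.5=dc2-b:2888:3888
    server.6=dc2-c:2888:3888
    server.7=dc3-a:2888:3888
    server.8=dc3-b:2888:3888
    server.9=dc3-c:2888:3888
    group.1=1:2:3
    group.2=4:5:6
    group.3=7:8:9

A quorum then needs majorities from 2 of the 3 groups, which is exactly the 
2+2=4 votes described above.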

The scenario that has traditionally been a problem, and that's really not new, 
is the active-passive one, which we can't really make work transparently. The 
workaround is to manually reconfigure servers, knowing that there could be 
some data loss. Even with dynamic reconfiguration in 3.5, we are still stuck 
when the active DC goes down, because we need a quorum of the old 
configuration for the reconfiguration to succeed.
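
To make the workaround concrete, here is a rough sketch, assuming the passive 
DC hosts servers 4-6 (ids and hostnames are made up): on each surviving 
server, rewrite the membership in zoo.cfg so that only the survivors are 
listed, and restart them. As Alex points out below, this has to be done very 
carefully to avoid losing committed state:

    # zoo.cfg on each surviving passive-DC server after the active DC is lost
    server.4=passive-1:2888:3888
    server.5=passive-2:2888:3888
    server.6=passive-3:2888:3888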

It is possible that the hierarchical approach makes some sense for some 
scenarios because of the following argument. It is not always desirable, but 
say that you want to replicate synchronously across data centers. If I use the 
5V + 1O configuration with 3V in the active DC and 2V in the passive DC, I 
can't guarantee that committed requests will be persisted in at least one node 
in the passive DC. In fact, to make that guarantee, the quorum would need a 
whole DC's worth of votes + 1. With the hierarchical approach, we can have two 
groups, which forces every commit to be replicated in both groups, yet I can 
still tolerate crashes in both groups and guarantee that updates are 
synchronously replicated. If the active DC goes down, we can reconfigure 
manually from two groups to one group, and since the replication is 
synchronous, the passive DC will have all commits.
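
Roughly, that two-group layout could look like the following, assuming three 
voters per data center so that each group still tolerates one crash (ids and 
hostnames are made up):

    server.1=active-1:2888:3888
    server.2=active-2:2888:3888
    server.3=active-3:2888:3888
    server.4=passive-1:2888:3888
    server.5=passive-2:2888:3888
    server.6=passive-3:2888:3888
    group.1=1:2:3
    group.2=4:5:6

With only two groups, a majority of groups means both of them, so every commit 
carries at least two acks from each DC.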

Hopefully this analysis is correct and makes some sense. Cross-DC replication 
is a fascinating topic!

-Flavio


> On 03 Jun 2016, at 23:54, Camille Fournier <cami...@apache.org> wrote:
> 
> You don't need weights to run this cluster successfully across 2
> datacenters, unless you want to run with 4 live read/write nodes which
> isn't really a recommended setup (we advise odd numbers because running
> with even numbers doesn't generally buy you anything).
> 
> I would probably run 3 voting members, 1 observer if you want to run 4
> nodes. In that setup you can lose any one voting node, and of course the
> observer, and be fine. If you lose 2 voting nodes, whether in the same DC
> or x-DC, you will not be able to continue. But votes only need to be acked
> by any 2 servers to be committed.
> 
> In the case of weights and 4 servers, you will either need to ack both of
> the servers in the weighted datacenter or the 2 in the unweighted DC and
> one in the weighted DC.
> 
> I've actually yet to see the killer app for using hierarchy and weights
> although I'd be interested in hearing about it if someone has an example.
> It's not clear that there's a huge value here unless the observer is
> significantly less effective than a full r/w quorum member which would be
> surprising.
> 
> C
> 
> 
> On Fri, Jun 3, 2016 at 6:33 PM, Dan Benediktson <
> dbenedikt...@twitter.com.invalid> wrote:
> 
>> Weights will at least let you do better: if you weight it, you can make it
>> so that datacenter A will survive even if datacenter B goes down, but not
>> the other way around. While not ideal, it's probably better than the
>> non-weighted alternative. (2, 2, 1, 1) weights might work fairly well - as
>> long as any three machines are up, or both machines in the preferred
>> datacenter, quorum can be achieved.
>> 
>> On Fri, Jun 3, 2016 at 3:23 PM, Camille Fournier <cami...@apache.org>
>> wrote:
>> 
>>> You can't solve this with weights.
>>> On Jun 3, 2016 6:03 PM, "Michael Han" <h...@cloudera.com> wrote:
>>> 
>>>> ZK supports more than just majority quorum rule, there are also
>> weights /
>>>> hierarchy of groups based quorum [1]. So probably one can assign more
>>>> weights to one out of two data center which can form a weight based
>>> quorum
>>>> even if another DC is failing?
>>>> 
>>>> Another idea is to instead of forming a single ZK ensemble across DCs,
>>>> forming multiple ZK ensembles across DCs with one ensemble per DC. This
>>>> solution might be applicable for heavy read / light write workload
>> while
>>>> providing certain degree of fault tolerance. Some relevant discussions
>>> [2].
>>>> 
>>>> [1] https://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html (
>>>> weight.x=nnnnn, group.x=nnnnn[:nnnnn]).
>>>> [2]
>>>> 
>>>> 
>>> 
>> https://www.quora.com/Has-anyone-deployed-a-ZooKeeper-ensemble-across-data-centers
>>>> 
>>>> On Fri, Jun 3, 2016 at 2:16 PM, Alexander Shraer <shra...@gmail.com>
>>>> wrote:
>>>> 
>>>>>> Is there any settings to override the quorum rule? Would you know
>> the
>>>>> rationale behind it?
>>>>> 
>>>>> The rule comes from a theoretical impossibility saying that you must
>>>>> have n > 2f replicas
>>>>> to tolerate f failures, for any algorithm trying to solve consensus
>>> while
>>>>> being able to handle
>>>>> periods of asynchrony (unbounded message delays, processing times,
>>> etc).
>>>>> The earliest proof is probably here: paper
>>>>> <
>>> https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf
>>>>> .
>>>>> ZooKeeper is assuming this model, so the bound applies
>>>>> to it.
>>>>> 
>>>>> The intuition is what's called a 'partition argument'. Essentially if
>>>> only
>>>>> 2f replicas were sufficient, you
>>>>> could arbitrarily divide them into 2 sets of f replicas, and create a
>>>>> situation where each set of f
>>>>> must go on independently without coordinating with the other set
>> (split
>>>>> brain), when the links between the two sets are slow (i.e., a network
>>>>> partition),
>>>>> simply because the other set could also be down (the algorithm
>>> tolerates
>>>> f
>>>>> failures) and it can't distinguish the two situations.
>>>>> When n > 2f this can be avoided since one of the sets will have
>>> majority
>>>>> while the other set won't.
>>>>> 
>>>>> The key here is that the links between the two data centers can
>>>> arbitrarily
>>>>> delay messages, so an automatic
>>>>> 'fail-over' where one data center decides that the other one is down
>> is
>>>>> usually considered unsafe. If in your system
>>>>> you have a reliable way to know that the other data center is really
>> in
>>>>> fact down (this is a synchrony assumption), you could do as Camille
>>>>> suggested and
>>>>> reconfigure the system to only include the remaining data center.
>> This
>>>>> would still be very tricky to do since this reconfiguration
>>>>> would have to involve manually changing configuration files and
>>> rebooting
>>>>> servers, while somehow making sure that you're
>>>>> not losing committed state. So not recommended.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Jun 3, 2016 at 11:30 PM, Camille Fournier <
>> cami...@apache.org>
>>>>> wrote:
>>>>> 
>>>>>> 2 servers is the same as 1 server wrt fault tolerance, so yes, you
>>> are
>>>>>> correct. If they want fault tolerance, they have to run 3 (or
>> more).
>>>>>> 
>>>>>> On Fri, Jun 3, 2016 at 4:25 PM, Shawn Heisey <apa...@elyograg.org>
>>>>> wrote:
>>>>>> 
>>>>>>> On 6/3/2016 1:44 PM, Nomar Morado wrote:
>>>>>>>> Is there any settings to override the quorum rule? Would you
>> know
>>>> the
>>>>>>>> rationale behind it? Ideally, you will want to operate the
>>>>> application
>>>>>>>> even if at least one data center is up.
>>>>>>> 
>>>>>>> I do not know if the quorum rule can be overridden, or whether
>> your
>>>>>>> application can tell the difference between a loss of quorum and
>>>>>>> zookeeper going down entirely.  I really don't know anything
>> about
>>>>>>> zookeeper client code or zookeeper internals.
>>>>>>> 
>>>>>>> From what I understand, majority quorum is the only way to be
>>>>>>> *completely* sure that cluster software like SolrCloud or your
>>>>>>> application can handle write operations with confidence that they
>>> are
>>>>>>> applied correctly.  If you lose quorum, which will happen if only
>>> one
>>>>> DC
>>>>>>> is operational, then your application should go read-only.  This
>> is
>>>>> what
>>>>>>> SolrCloud does.
>>>>>>> 
>>>>>>> I am a committer on the Apache Solr project, and Solr uses
>>> zookeeper
>>>>>>> when it is running in SolrCloud mode.  The cloud code is handled
>> by
>>>>>>> other people -- I don't know much about it.
>>>>>>> 
>>>>>>> I joined this list because I wanted to have the ZK devs include a
>>>>>>> clarification in zookeeper documentation -- oddly enough, related
>>> to
>>>>> the
>>>>>>> very thing we are discussing.  I wanted to be sure that the
>>>>>>> documentation explicitly mentioned that three servers are required
>>>> for a
>>>>>>> fault-tolerant setup.  Some SolrCloud users don't want to accept
>>> this
>>>>> as
>>>>>>> a fact, and believe that two servers should be enough.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Shawn
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Cheers
>>>> Michael.
>>>> 
>>> 
>> 
