You can't solve this with weights.

On Jun 3, 2016 6:03 PM, "Michael Han" <[email protected]> wrote:
> ZK supports more than just the majority quorum rule; there are also
> weight / hierarchy-of-groups based quorums [1]. So probably one can
> assign more weight to one of the two data centers, which could then form
> a weight-based quorum even if the other DC is failing?
>
> Another idea is, instead of forming a single ZK ensemble across DCs, to
> form multiple ZK ensembles with one ensemble per DC. This solution might
> be applicable for a heavy-read / light-write workload while providing a
> certain degree of fault tolerance. Some relevant discussion is in [2].
>
> [1] https://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html
> (weight.x=nnnnn, group.x=nnnnn[:nnnnn]).
> [2] https://www.quora.com/Has-anyone-deployed-a-ZooKeeper-ensemble-across-data-centers
>
> On Fri, Jun 3, 2016 at 2:16 PM, Alexander Shraer <[email protected]> wrote:
>
> > > Is there any settings to override the quorum rule? Would you know the
> > > rationale behind it?
> >
> > The rule comes from a theoretical impossibility saying that you must
> > have n > 2f replicas to tolerate f failures, for any algorithm trying
> > to solve consensus while being able to handle periods of asynchrony
> > (unbounded message delays, processing times, etc.). The earliest proof
> > is probably in this paper:
> > https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf
> > ZooKeeper assumes this model, so the bound applies to it.
> >
> > The intuition is what's called a 'partition argument'. Essentially, if
> > only 2f replicas were sufficient, you could arbitrarily divide them
> > into 2 sets of f replicas and create a situation where each set of f
> > must go on independently without coordinating with the other set
> > (split brain) when the links between the two sets are slow (i.e., a
> > network partition), simply because the other set could also be down
> > (the algorithm tolerates f failures) and it can't distinguish the two
> > situations. When n > 2f this can be avoided, since one of the sets
> > will have a majority while the other set won't.
> >
> > The key here is that the links between the two data centers can
> > arbitrarily delay messages, so an automatic 'fail-over' where one data
> > center decides that the other one is down is usually considered
> > unsafe. If in your system you have a reliable way to know that the
> > other data center is really in fact down (this is a synchrony
> > assumption), you could do as Camille suggested and reconfigure the
> > system to only include the remaining data center. This would still be
> > very tricky to do, since this reconfiguration would have to involve
> > manually changing configuration files and rebooting servers, while
> > somehow making sure that you're not losing committed state. So not
> > recommended.
> >
> > On Fri, Jun 3, 2016 at 11:30 PM, Camille Fournier <[email protected]> wrote:
> >
> > > 2 servers is the same as 1 server wrt fault tolerance, so yes, you
> > > are correct. If they want fault tolerance, they have to run 3 (or
> > > more).
> > >
> > > On Fri, Jun 3, 2016 at 4:25 PM, Shawn Heisey <[email protected]> wrote:
> > >
> > > > On 6/3/2016 1:44 PM, Nomar Morado wrote:
> > > > > Is there any settings to override the quorum rule? Would you
> > > > > know the rationale behind it? Ideally, you will want to operate
> > > > > the application even if at least one data center is up.
> > > >
> > > > I do not know if the quorum rule can be overridden, or whether
> > > > your application can tell the difference between a loss of quorum
> > > > and zookeeper going down entirely. I really don't know anything
> > > > about zookeeper client code or zookeeper internals.
> > > >
> > > > From what I understand, majority quorum is the only way to be
> > > > *completely* sure that cluster software like SolrCloud or your
> > > > application can handle write operations with confidence that they
> > > > are applied correctly. If you lose quorum, which will happen if
> > > > only one DC is operational, then your application should go
> > > > read-only. This is what SolrCloud does.
> > > >
> > > > I am a committer on the Apache Solr project, and Solr uses
> > > > zookeeper when it is running in SolrCloud mode. The cloud code is
> > > > handled by other people -- I don't know much about it.
> > > >
> > > > I joined this list because I wanted to have the ZK devs include a
> > > > clarification in zookeeper documentation -- oddly enough, related
> > > > to the very thing we are discussing. I wanted to be sure that the
> > > > documentation explicitly mentioned that three servers are required
> > > > for a fault-tolerant setup. Some SolrCloud users don't want to
> > > > accept this as a fact, and believe that two servers should be
> > > > enough.
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
>
> --
> Cheers
> Michael.
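For concreteness, the hierarchical quorum feature Michael points to in [1]
is configured with group.x / weight.x entries in zoo.cfg. Below is a minimal
sketch for a hypothetical two-DC, five-server layout (server IDs and host
names are made up for illustration); as the thread concludes, skewing the
weights toward one DC only decides which DC can survive alone -- it does not
remove the need for a (weighted) majority.

    # zoo.cfg sketch (illustrative only -- host names and IDs are made up).
    # Servers 1-3 live in DC1, servers 4-5 in DC2.
    server.1=dc1-zk1:2888:3888
    server.2=dc1-zk2:2888:3888
    server.3=dc1-zk3:2888:3888
    server.4=dc2-zk1:2888:3888
    server.5=dc2-zk2:2888:3888

    # A single voting group holding all five servers; a quorum then needs a
    # majority of the group's total weight.
    group.1=1:2:3:4:5

    # Skew the weight toward DC1: total weight is 8, so a quorum needs > 4.
    # DC1 alone has weight 6 and can keep serving if DC2 is lost; DC2 alone
    # has weight 2 and cannot -- exactly the asymmetry discussed above.
    weight.1=2
    weight.2=2
    weight.3=2
    weight.4=1
    weight.5=1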

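The majority arithmetic behind Alexander's partition argument can also be
sketched in a few lines. This is illustrative code only, not ZooKeeper
internals, and assumes plain, unweighted majority quorums.

    // QuorumMath.java -- illustrative only: plain majority-quorum
    // arithmetic. Shows why an even split across two DCs cannot tolerate
    // losing either side, while an uneven split protects only one side.
    public class QuorumMath {

        // A plain majority quorum needs strictly more than half the ensemble.
        static boolean hasQuorum(int reachableServers, int ensembleSize) {
            return reachableServers > ensembleSize / 2;
        }

        public static void main(String[] args) {
            // 6 servers split 3 + 3: neither half alone has a quorum.
            System.out.println(hasQuorum(3, 6)); // false -- 3 is not > 3
            // 5 servers split 3 + 2: the 3-server DC survives losing the other...
            System.out.println(hasQuorum(3, 5)); // true  -- 3 > 2
            // ...but the 2-server DC cannot survive losing the 3-server DC.
            System.out.println(hasQuorum(2, 5)); // false -- 2 is not > 2
        }
    }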