Weights will at least let you do better: if you weight it, you can make it so that datacenter A will survive even if datacenter B goes down, but not the other way around. While not ideal, it's probably better than the non-weighted alternative. (2, 2, 1, 1) weights might work fairly well - as long as any three machines are up, or both machines in the preferred datacenter, quorum can be achieved.
On Fri, Jun 3, 2016 at 3:23 PM, Camille Fournier <cami...@apache.org> wrote: > You can't solve this with weights. > On Jun 3, 2016 6:03 PM, "Michael Han" <h...@cloudera.com> wrote: > > > ZK supports more than just majority quorum rule, there are also weights / > > hierarchy of groups based quorum [1]. So probably one can assign more > > weights to one out of two data center which can form a weight based > quorum > > even if another DC is failing? > > > > Another idea is to instead of forming a single ZK ensemble across DCs, > > forming multiple ZK ensembles across DCs with one ensemble per DC. This > > solution might be applicable for heavy read / light write workload while > > providing certain degree of fault tolerance. Some relevant discussions > [2]. > > > > [1] https://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html ( > > weight.x=nnnnn, group.x=nnnnn[:nnnnn]). > > [2] > > > > > https://www.quora.com/Has-anyone-deployed-a-ZooKeeper-ensemble-across-data-centers > > > > On Fri, Jun 3, 2016 at 2:16 PM, Alexander Shraer <shra...@gmail.com> > > wrote: > > > > > > Is there any settings to override the quorum rule? Would you know the > > > rationale behind it? > > > > > > The rule comes from a theoretical impossibility saying that you must > > have n > > > > 2f replicas > > > to tolerate f failures, for any algorithm trying to solve consensus > while > > > being able to handle > > > periods of asynchrony (unbounded message delays, processing times, > etc). > > > The earliest proof is probably here: paper > > > < > https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf > > >. > > > ZooKeeper is assuming this model, so the bound applies > > > to it. > > > > > > The intuition is what's called a 'partition argument'. Essentially if > > only > > > 2f replicas were sufficient, you > > > could arbitrarily divide them into 2 sets of f replicas, and create a > > > situation where each set of f > > > must go on independently without coordinating with the other set (split > > > brain), when the links between the two sets are slow (i.e., a network > > > partition), > > > simply because the other set could also be down (the algorithm > tolerates > > f > > > failures) and it can't distinguish the two situations. > > > When n > 2f this can be avoided since one of the sets will have > majority > > > while the other set won't. > > > > > > The key here is that the links between the two data centers can > > arbitrarily > > > delay messages, so an automatic > > > 'fail-over' where one data center decides that the other one is down is > > > usually considered unsafe. If in your system > > > you have a reliable way to know that the other data center is really in > > > fact down (this is a synchrony assumption), you could do as Camille > > > suggested and > > > reconfigure the system to only include the remaining data center. This > > > would still be very tricky to do since this reconfiguration > > > would have to involve manually changing configuration files and > rebooting > > > servers, while somehow making sure that you're > > > not loosing committed state. So not recommended. > > > > > > > > > > > > On Fri, Jun 3, 2016 at 11:30 PM, Camille Fournier <cami...@apache.org> > > > wrote: > > > > > > > 2 servers is the same as 1 server wrt fault tolerance, so yes, you > are > > > > correct. If they want fault tolerance, they have to run 3 (or more). > > > > > > > > On Fri, Jun 3, 2016 at 4:25 PM, Shawn Heisey <apa...@elyograg.org> > > > wrote: > > > > > > > > > On 6/3/2016 1:44 PM, Nomar Morado wrote: > > > > > > Is there any settings to override the quorum rule? Would you know > > the > > > > > > rationale behind it? Ideally, you will want to operate the > > > application > > > > > > even if at least one data center is up. > > > > > > > > > > I do not know if the quorum rule can be overridden, or whether your > > > > > application can tell the difference between a loss of quorum and > > > > > zookeeper going down entirely. I really don't know anything about > > > > > zookeeper client code or zookeeper internals. > > > > > > > > > > From what I understand, majority quorum is the only way to be > > > > > *completely* sure that cluster software like SolrCloud or your > > > > > application can handle write operations with confidence that they > are > > > > > applied correctly. If you lose quorum, which will happen if only > one > > > DC > > > > > is operational, then your application should go read-only. This is > > > what > > > > > SolrCloud does. > > > > > > > > > > I am a committer on the Apache Solr project, and Solr uses > zookeeper > > > > > when it is running in SolrCloud mode. The cloud code is handled by > > > > > other people -- I don't know much about it. > > > > > > > > > > I joined this list because I wanted to have the ZK devs include a > > > > > clarification in zookeeper documentation -- oddly enough, related > to > > > the > > > > > very thing we are discussing. I wanted to be sure that the > > > > > documentation explicitly mentioned that three serversare required > > for a > > > > > fault-tolerant setup. Some SolrCloud users don't want to accept > this > > > as > > > > > a fact, and believe that two servers should be enough. > > > > > > > > > > Thanks, > > > > > Shawn > > > > > > > > > > > > > > > > > > > > > > > > > -- > > Cheers > > Michael. > > >