Re: zookeeper deployment strategy for multi data centers

Dan Benediktson Fri, 03 Jun 2016 15:35:07 -0700

Weights will at least let you do better: if you weight it, you can make it
so that datacenter A will survive even if datacenter B goes down, but not
the other way around. While not ideal, it's probably better than the
non-weighted alternative. (2, 2, 1, 1) weights might work fairly well - as
long as any three machines are up, or both machines in the preferred
datacenter, quorum can be achieved.


On Fri, Jun 3, 2016 at 3:23 PM, Camille Fournier <cami...@apache.org> wrote:

> You can't solve this with weights.
> On Jun 3, 2016 6:03 PM, "Michael Han" <h...@cloudera.com> wrote:
>
> > ZK supports more than just majority quorum rule, there are also weights /
> > hierarchy of groups based quorum [1]. So probably one can assign more
> > weights to one out of two data center which can form a weight based
> quorum
> > even if another DC is failing?
> >
> > Another idea is to instead of forming a single ZK ensemble across DCs,
> > forming multiple ZK ensembles across DCs with one ensemble per DC. This
> > solution might be applicable for heavy read / light write workload while
> > providing certain degree of fault tolerance. Some relevant discussions
> [2].
> >
> > [1] https://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html (
> > weight.x=nnnnn, group.x=nnnnn[:nnnnn]).
> > [2]
> >
> >
> https://www.quora.com/Has-anyone-deployed-a-ZooKeeper-ensemble-across-data-centers
> >
> > On Fri, Jun 3, 2016 at 2:16 PM, Alexander Shraer <shra...@gmail.com>
> > wrote:
> >
> > > > Is there any settings to override the quorum rule? Would you know the
> > > rationale behind it?
> > >
> > > The rule comes from a theoretical impossibility saying that you must
> > have n
> > > > 2f replicas
> > > to tolerate f failures, for any algorithm trying to solve consensus
> while
> > > being able to handle
> > > periods of asynchrony (unbounded message delays, processing times,
> etc).
> > > The earliest proof is probably here: paper
> > > <
> https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf
> > >.
> > > ZooKeeper is assuming this model, so the bound applies
> > > to it.
> > >
> > > The intuition is what's called a 'partition argument'. Essentially if
> > only
> > > 2f replicas were sufficient, you
> > > could arbitrarily divide them into 2 sets of f replicas, and create a
> > > situation where each set of f
> > > must go on independently without coordinating with the other set (split
> > > brain), when the links between the two sets are slow (i.e., a network
> > > partition),
> > > simply because the other set could also be down (the algorithm
> tolerates
> > f
> > > failures) and it can't distinguish the two situations.
> > > When n > 2f this can be avoided since one of the sets will have
> majority
> > > while the other set won't.
> > >
> > > The key here is that the links between the two data centers can
> > arbitrarily
> > > delay messages, so an automatic
> > > 'fail-over' where one data center decides that the other one is down is
> > > usually considered unsafe. If in your system
> > > you have a reliable way to know that the other data center is really in
> > > fact down (this is a synchrony assumption), you could do as Camille
> > > suggested and
> > > reconfigure the system to only include the remaining data center. This
> > > would still be very tricky to do since this reconfiguration
> > > would have to involve manually changing configuration files and
> rebooting
> > > servers, while somehow making sure that you're
> > > not loosing committed state. So not recommended.
> > >
> > >
> > >
> > > On Fri, Jun 3, 2016 at 11:30 PM, Camille Fournier <cami...@apache.org>
> > > wrote:
> > >
> > > > 2 servers is the same as 1 server wrt fault tolerance, so yes, you
> are
> > > > correct. If they want fault tolerance, they have to run 3 (or more).
> > > >
> > > > On Fri, Jun 3, 2016 at 4:25 PM, Shawn Heisey <apa...@elyograg.org>
> > > wrote:
> > > >
> > > > > On 6/3/2016 1:44 PM, Nomar Morado wrote:
> > > > > > Is there any settings to override the quorum rule? Would you know
> > the
> > > > > > rationale behind it? Ideally, you will want to operate the
> > > application
> > > > > > even if at least one data center is up.
> > > > >
> > > > > I do not know if the quorum rule can be overridden, or whether your
> > > > > application can tell the difference between a loss of quorum and
> > > > > zookeeper going down entirely.  I really don't know anything about
> > > > > zookeeper client code or zookeeper internals.
> > > > >
> > > > > From what I understand, majority quorum is the only way to be
> > > > > *completely* sure that cluster software like SolrCloud or your
> > > > > application can handle write operations with confidence that they
> are
> > > > > applied correctly.  If you lose quorum, which will happen if only
> one
> > > DC
> > > > > is operational, then your application should go read-only.  This is
> > > what
> > > > > SolrCloud does.
> > > > >
> > > > > I am a committer on the Apache Solr project, and Solr uses
> zookeeper
> > > > > when it is running in SolrCloud mode.  The cloud code is handled by
> > > > > other people -- I don't know much about it.
> > > > >
> > > > > I joined this list because I wanted to have the ZK devs include a
> > > > > clarification in zookeeper documentation -- oddly enough, related
> to
> > > the
> > > > > very thing we are discussing.  I wanted to be sure that the
> > > > > documentation explicitly mentioned that three serversare required
> > for a
> > > > > fault-tolerant setup.  Some SolrCloud users don't want to accept
> this
> > > as
> > > > > a fact, and believe that two servers should be enough.
> > > > >
> > > > > Thanks,
> > > > > Shawn
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Cheers
> > Michael.
> >
>

Re: zookeeper deployment strategy for multi data centers

Reply via email to