Actually it's still not clear to me how you would enforce the 2x+1. In Zookeeper we can guarantee liveness (progress) only when x+1 servers are connected and up; safety (correctness), however, is always guaranteed, even if 2 out of 3 servers are temporarily down. Your design needs the 2x+1 for safety, which I think is problematic unless you can accurately detect failures (synchrony) and failures are permanent.

Alex

On Mar 15, 2012, at 3:54 PM, Alexander Shraer <[email protected]> wrote:

> I think the concern is that the old VM can recover and try to
> reconnect. Theoretically you could even go back and forth between the
> new and old VM. For example, suppose that you have servers A, B and C
> in the cluster, and A is the leader. C is slow and "replaced" with C',
> then update U is acked by A and C', then A fails. In this situation
> you cannot have additional failures. But with the automatic
> replacement thing it can (theoretically) happen that C' becomes a
> little slow, C connects to B and is chosen as leader, and the
> committed update U is lost forever. This is perhaps unlikely but
> possible...
>
> Alex
>
> On Thu, Mar 15, 2012 at 1:35 PM, <[email protected]> wrote:
>> I agree with your points about any kind of VM having hard-to-predict
>> runtime behaviour, and that participants of the zookeeper ensemble
>> running on a VM could fail to send keep-alives for an uncertain
>> amount of time. But I don't yet understand how that would break the
>> approach I was mentioning: just trying to re-resolve the InetAddress
>> after an IO exception should in that case still lead to the same
>> original IP address (and eventually to that node rejoining the
>> ensemble).
>> Only if the host name the old node was using were re-assigned to
>> another instance would this re-resolving step point to a new IP (and
>> hence cause the old server to be replaced).
>>
>> Did I understand your objection correctly?
>>
>> ________________________________________
>> From: ext Ted Dunning [[email protected]]
>> Sent: Thursday, 15 March 2012 19:50
>> To: [email protected]
>> Cc: [email protected]
>> Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>>
>> Alexander's comment still applies.
>>
>> VMs can function or go away completely, but they can also malfunction
>> in more subtle ways such that they just go VEEEERRRRY slowly. You
>> have to account for that failure mode.
>> These failures can even be transient.
>>
>> This would probably break your approach.
>>
>> On 3/15/12, [email protected] <[email protected]> wrote:
>>> Oh sorry, there is a slight misunderstanding. By VM I did not mean the
>>> Java VM but the Linux VM that contains the zookeeper node. We get
>>> notified if that goes away and is repurposed.
>>>
>>> BR
>>> Christian
>>>
>>> Sent from my Nokia Lumia 800
>>> ________________________________
>>> From: ext Alexander Shraer
>>> Sent: 15.03.2012 16:33
>>> To: [email protected]; Ziech Christian (Nokia-LC/Berlin)
>>> Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>>>
>>> Yes, by replacing x at a time from 2x+1 you have quorum intersection.
>>>
>>> I have one more question: zookeeper itself doesn't assume perfect
>>> failure detection, which your scheme requires. What if the VM didn't
>>> actually fail but is just slow and then tries to reconnect?
>>>
>>> On Thu, Mar 15, 2012 at 2:50 AM, Christian Ziech
>>> <[email protected]> wrote:
>>>> I don't think that we could be running into a split brain problem in
>>>> our use case.
>>>> Let me try to describe the scenario we are worried about (assuming an
>>>> ensemble of 5 nodes A, B, C, D, E):
>>>> - The ensemble is up and running and in sync
>>>> - Node A with the host name "zookeeperA.whatever-domain.priv" goes
>>>>   down because the VM has gone away
>>>> - That removal of the VM is detected and a new VM is spawned with the
>>>>   same host name "zookeeperA.whatever-domain.priv" - let's call that
>>>>   node A'
>>>> - The zookeeper on node A' wants to join the cluster - right now this
>>>>   gets rejected by the others since A' has a different IP address than
>>>>   A (and the old one is "cached" in the InetSocketAddress of the
>>>>   QuorumPeer instance)
>>>>
>>>> We could ensure that at any given time there is at most one node
>>>> with host name "zookeeperA.whatever-domain.priv" known by the ensemble
>>>> and that once a node is replaced, it would not come back. Also we
>>>> could make sure that our ensemble is big enough to compensate for a
>>>> replacement of up to x nodes at a time (setting it to x*2 + 1 nodes).
>>>>
>>>> So unless I have misjudged our problem, it should be (due to the
>>>> restrictions) simpler than the problem to be solved by ZOOKEEPER-107.
>>>> My intention is basically, by solving this smaller, discrete problem,
>>>> to avoid having to wait until ZOOKEEPER-107 makes it into a release
>>>> (the assumption being that a smaller fix might even have a chance of
>>>> making it into the 3.4.x branch).
>>>>
>>>> On 15.03.2012 07:46, ext Alexander Shraer wrote:
>>>>>
>>>>> Hi Christian,
>>>>>
>>>>> ZK-107 would indeed allow you to add/remove servers and change their
>>>>> addresses.
>>>>>
>>>>>> We could ensure that we always have a more or less fixed quorum of
>>>>>> zookeeper servers with a fixed set of host names.
>>>>>
>>>>> You should probably also ensure that a majority of the old ensemble
>>>>> intersects with a majority of the new one.
>>>>> Otherwise you have to run a reconfiguration protocol similar to
>>>>> ZK-107. For example, if you have 3 servers A, B and C, and now you're
>>>>> adding D and E that replace B and C, how would this work? It is
>>>>> probable that D and E don't have the latest state (as you mention)
>>>>> and A is down or doesn't have the latest state either (a minority
>>>>> might not have the latest state). Also, how do you prevent split
>>>>> brain in this case, meaning B and C thinking that they are still
>>>>> operational? Perhaps I'm missing something but I suspect that the
>>>>> change you propose won't be enough...
>>>>>
>>>>> Best Regards,
>>>>> Alex
>>>>>
>>>>>
>>>>> On Wed, Mar 14, 2012 at 10:01 AM, Christian Ziech
>>>>> <[email protected]> wrote:
>>>>>
>>>>>     Just a small addition: In my opinion the patch could really boil
>>>>>     down to adding a
>>>>>
>>>>>     quorumServer.electionAddr = new
>>>>>         InetSocketAddress(electionAddr.getHostName(),
>>>>>         electionAddr.getPort());
>>>>>
>>>>>     in the catch(IOException e) clause of the connectOne() method of
>>>>>     the QuorumCnxManager. In addition one should perhaps make the
>>>>>     electionAddr field in the QuorumPeer.QuorumServer class volatile
>>>>>     to prevent races.
>>>>>
>>>>>     I haven't fully checked this change for implications yet, but
>>>>>     a quick test on some machines at least showed it would solve our
>>>>>     use case. What do the more expert users / maintainers think - is
>>>>>     it even worthwhile to go that route?
>>>>>
>>>>>     On 14.03.2012 17:04, ext Christian Ziech wrote:
>>>>>
>>>>>         Let me describe our upcoming use case in a few words: We are
>>>>>         planning to use zookeeper in a cloud where nodes typically
>>>>>         come and go unpredictably. We could ensure that we always
>>>>>         have a more or less fixed quorum of zookeeper servers with a
>>>>>         fixed set of host names. However the IPs associated with the
>>>>>         host names would change every time a new server comes up.
>>>>>         I browsed the code a little and it seems right now that the
>>>>>         only problem is that the zookeeper server is remembering the
>>>>>         resolved InetSocketAddress in its QuorumPeer hash map.
>>>>>
>>>>>         I saw that ZOOKEEPER-107 would possibly also solve that
>>>>>         problem, but in a more generic way than actually needed
>>>>>         (perhaps here I underestimate the impact of joining as a
>>>>>         server with an empty data directory to replace a server that
>>>>>         previously had one).
>>>>>
>>>>>         Given that - from looking at ZOOKEEPER-107 - it seems that it
>>>>>         will still take some time for the proposed fix to make it
>>>>>         into a release, would it make sense to invest time into a
>>>>>         smaller fix just for this "replacing a dropped server
>>>>>         without rolling restarts" use case? Would there be a chance
>>>>>         that a fix for this makes it into the 3.4.x branch?
>>>>>
>>>>>         Are there perhaps other ways to get this use case supported
>>>>>         without the need for doing rolling restarts whenever we need
>>>>>         to replace one of the zookeeper servers?
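
For readers following the proposal in the oldest message of the thread, here is a minimal standalone sketch of what "re-resolving" an address means in plain Java. It is not the actual QuorumCnxManager patch: the class name ReResolveSketch and the helper reResolve are made up for illustration, and it uses getHostString() instead of the getHostName() call from the mail so that no reverse lookup is triggered. The only fact it relies on is that the InetSocketAddress(String, int) constructor attempts a name resolution when it is called, whereas an existing instance keeps whatever address it was created with.

```java
import java.net.InetSocketAddress;

public class ReResolveSketch {

    /**
     * Hypothetical helper: builds a new InetSocketAddress from the host
     * name and port of an existing one. The constructor attempts a name
     * resolution when it runs, so the result can carry a different IP
     * than the old, cached instance.
     */
    static InetSocketAddress reResolve(InetSocketAddress cached) {
        // getHostString() returns the stored host name (or literal IP)
        // without performing a reverse lookup.
        return new InetSocketAddress(cached.getHostString(), cached.getPort());
    }

    public static void main(String[] args) {
        // Stand-in for an election address that was stored at startup.
        // "localhost" and port 3888 are placeholder values.
        InetSocketAddress cached =
                InetSocketAddress.createUnresolved("localhost", 3888);

        InetSocketAddress fresh = reResolve(cached);

        System.out.println("cached: " + cached);
        System.out.println("fresh:  " + fresh
                + " (unresolved: " + fresh.isUnresolved() + ")");
    }
}
```

Note that the JVM keeps its own cache of name lookups (controlled by the networkaddress.cache.ttl security property), so a fresh InetSocketAddress may still report the old IP for some time after the DNS record has changed. As the rest of the thread points out, re-resolving only addresses the connection problem; it does nothing about a slow old server coming back.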

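
The quorum-intersection concern raised in the newer replies (servers A, B, C, with a slow C "replaced" by C') can be checked by brute force. The sketch below is only an illustration under a stated assumption: it treats the old membership {A, B, C} and the new membership {A, B, C'} as two views that can both still form majorities, which is exactly the case where the replaced server was slow rather than permanently gone. The class and method names are invented for this example and have nothing to do with ZooKeeper's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class QuorumOverlapSketch {

    /** All subsets of the view that contain a strict majority of its members. */
    static List<Set<String>> majorities(List<String> view) {
        List<Set<String>> result = new ArrayList<>();
        int n = view.size();
        // Each bit mask selects one subset of the view.
        for (int mask = 0; mask < (1 << n); mask++) {
            if (Integer.bitCount(mask) * 2 > n) {
                Set<String> quorum = new TreeSet<>();
                for (int i = 0; i < n; i++) {
                    if ((mask & (1 << i)) != 0) {
                        quorum.add(view.get(i));
                    }
                }
                result.add(quorum);
            }
        }
        return result;
    }

    /**
     * Returns one (old quorum, new quorum) pair with no common member,
     * or an empty list if every old majority intersects every new one.
     */
    static List<Set<String>> findDisjointQuorums(List<String> oldView,
                                                 List<String> newView) {
        for (Set<String> q1 : majorities(oldView)) {
            for (Set<String> q2 : majorities(newView)) {
                if (Collections.disjoint(q1, q2)) {
                    return Arrays.asList(q1, q2);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        List<String> oldView = Arrays.asList("A", "B", "C");
        List<String> newView = Arrays.asList("A", "B", "C'");

        // Within a single view, any two majorities share a member.
        System.out.println("old vs old: " + findDisjointQuorums(oldView, oldView));

        // Across the two views, a disjoint pair exists if C can come back.
        System.out.println("old vs new: " + findDisjointQuorums(oldView, newView));
    }
}
```

With a single fixed view the search comes back empty, which is the usual majority-quorum guarantee. Across the old and new views it finds a pair of majorities with no member in common, which mirrors the lost-update scenario described in the thread: an update acknowledged by one pair is invisible to the other. If C is truly gone for good, no old-view majority containing C can ever form, which is why the scheme depends on accurate failure detection.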