Re: Zookeeper on short lived VMs and ZOOKEEPER-107

Alexander Shraer Thu, 15 Mar 2012 20:44:20 -0700

Actually its still not clear to me how you would enforce the 2x+1. In Zookeeper 
we can guarantee liveness (progress) only when x+1 are connected and up, 
however safety (correctness) is always guaranteed, even if 2 out of 3 servers 
are temporarily down. Your design needs the 2x+1 for safety, which I think is 
problematic unless you can accurately detect failures (synchrony) and failures 
are permanent.


Alex


On Mar 15, 2012, at 3:54 PM, Alexander Shraer <[email protected]> wrote:

> I think the concern is that the old VM can recover and try to
> reconnect. Theoretically you could even go back and forth between new
> and old VM. For example, suppose that you have servers
> A, B and C in the cluster, A is the leader. C is slow and "replaced"
> with C', then update U is acked by A and C', then A fails. In this
> situation you cannot have additional failures. But with the
> automatic replacement thing it can (theoretically) happen that C'
> becomes a little slow, C connects to B and is chosen as leader, and
> the committed update U is lost forever. This is perhaps unlikely but
> possible...
> 
> Alex
> 
> On Thu, Mar 15, 2012 at 1:35 PM,  <[email protected]> wrote:
>> I agree with your points about any kind of VMs having a hard to predict 
>> runtime behaviour and that participants of the zookeeper ensemble running on 
>> a VM could fail to send keep-alives for an uncertain amount of time. But I 
>> don't yet understand how that would break the approach I was mentioning: 
>> Just trying to re-resolve the InetAddress after an IO exception should in 
>> that case still lead to the same original IP address (and eventually to that 
>> node rejoining the ensemble).
>> Only if that host name (the old node was using) would be re-assigned to 
>> another instance this step of re-resolving would point to a new IP (and 
>> hence cause the old server to be replaced).
>> 
>> Did I understand your objection correctly?
>> 
>> ________________________________________
>> Von: ext Ted Dunning [[email protected]]
>> Gesendet: Donnerstag, 15. März 2012 19:50
>> Bis: [email protected]
>> Cc: [email protected]
>> Betreff: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>> 
>> Alexander's comment still applies.
>> 
>> VM's can function or go away completely, but they can also malfunction
>> in more subtle ways such that they just go VEEEERRRRY slowly.  You
>> have to account for that failure mode.  These failures can even be
>> transient.
>> 
>> This would probably break your approach.
>> 
>> On 3/15/12, [email protected] <[email protected]> wrote:
>>> Oh sorry there is a slight misunderstanding. With VM I did not mean the java
>>> vm but the Linux vm that contains the zookeeper node. We get notified if
>>> that goes away and is repurposed.
>>> 
>>> BR
>>>  Christian
>>> 
>>> Gesendet von meinem Nokia Lumia 800
>>> ________________________________
>>> Von: ext Alexander Shraer
>>> Gesendet: 15.03.2012 16:33
>>> An: [email protected]; Ziech Christian (Nokia-LC/Berlin)
>>> Betreff: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>>> 
>>> yes, by replacing x at a time from 2x+1 you have quorum intersection.
>>> 
>>> i have one more question - zookeeper itself doesn't assume perfect
>>> failure detection, which your scheme requires. what if the VM didn't
>>> actually fail but just slow and then tries to reconnect ?
>>> 
>>> On Thu, Mar 15, 2012 at 2:50 AM, Christian Ziech
>>> <[email protected]> wrote:
>>>> I don't think that we could be running into a split brain problem in our
>>>> use
>>>> case.
>>>> Let me try to describe the scenario we are worried about (assuming an
>>>> ensemble of 5 nodes A,B,C,D,E):
>>>> - The ensemble is up and running and in sync
>>>> - Node A with the host name "zookeeperA.whatever-domain.priv" goes down
>>>> because the VM has gone away
>>>> - That removal of the VM is detected and a new VM is spawned with the same
>>>> host name "zookeeperA.whatever-domain.priv" - let's call that node A'
>>>> - Node A' zookeeper wants to join the cluster - right now this gets
>>>> rejected
>>>> by the others since A' has a different IP address than A (and the old one
>>>> is
>>>> "cached" in the InetSocketAddress of the QuorumPeer instance)
>>>> 
>>>> We could ensure that at any given time there is only at most one node with
>>>> host name "zookeeperA.whatever-domain.priv" known by the ensemble and that
>>>> once one node is replaced, it would not come back. Also we could make sure
>>>> that our ensemble is big enough to compensate for a replacement of more
>>>> than
>>>> x nodes at a time (setting it to x*2 + 1 nodes).
>>>> 
>>>> So if I did not misestimate our problem it should be (due to the
>>>> restrictions) simpler than the problem to be solved by zookeeper-107. My
>>>> intention is basically by solving this smaller discrete problem to not
>>>> need
>>>> to wait for that zookeeper-107 makes it into a release (the assumption is
>>>> that a smaller fix has a possibly a chance to make it into the 3.4.x
>>>> branch
>>>> even).
>>>> 
>>>> Am 15.03.2012 07:46, schrieb ext Alexander Shraer:
>>>>> 
>>>>> Hi Christian,
>>>>> 
>>>>> ZK-107 would indeed allow you to add/remove servers and change their
>>>>> addresses.
>>>>> 
>>>>>> We could ensure that we always have a more or less fixed quorum of
>>>>>> zookeeper servers with a fixed set of host names.
>>>>> 
>>>>> You should probably also ensure that a majority of the old ensemble
>>>>> intersects with a majority of the new one.
>>>>> Otherwise you have to run a reconfiguration protocol similarly to ZK-107.
>>>>> For example, if you have 3 servers A B and C, and now you're adding D and
>>>>> E
>>>>> that replace B and C, how would this work ?  it is probable that D and E
>>>>> don't have the latest state (as you mention) and A is down or doesn't
>>>>> have
>>>>> the latest state too (a minority might not have the latest state). Also,
>>>>> how
>>>>> do you prevent split brain in this case ? meaning B and C thinking that
>>>>> they
>>>>> are still operational ? perhaps I'm missing something but I suspect that
>>>>> the
>>>>> change you propose won't be enough...
>>>>> 
>>>>> Best Regards,
>>>>> Alex
>>>>> 
>>>>> 
>>>>> On Wed, Mar 14, 2012 at 10:01 AM, Christian Ziech
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>> 
>>>>>   Just a small addition: In my opinion the patch could really boil
>>>>>   down to add a
>>>>> 
>>>>>     quorumServer.electionAddr = new
>>>>>     InetSocketAddress(electionAddr.getHostName(),
>>>>>   electionAddr.getPort());
>>>>> 
>>>>>   in the catch(IOException e) clause of the connectOne() method of
>>>>>   the QuorumCnxManager. In addition on should perhaps make the
>>>>>   electionAddr field in the QuorumPeer.QuorumServer class volatile
>>>>>   to prevent races.
>>>>> 
>>>>>   I haven't checked this change yet fully for implications but doing
>>>>>   a quick test on some machines at least showed it would solve our
>>>>>   use case. What do the more expert users / maintainers think - is
>>>>>   it even worthwhile to go that route?
>>>>> 
>>>>>   Am 14.03.2012 17:04, schrieb ext Christian Ziech:
>>>>> 
>>>>>       LEt me describe our upcoming use case in a few words: We are
>>>>>       planning to use zookeeper in a cloud were typically nodes come
>>>>>       and go unpredictably. We could ensure that we always have a
>>>>>       more or less fixed quorum of zookeeper servers with a fixed
>>>>>       set of host names. However the IPs associated with the host
>>>>>       names would change every time a new server comes up. I browsed
>>>>>       the code a little and it seems right now that the only problem
>>>>>       is that the zookeeper server is remembering the resolved
>>>>>       InetSocketAddress in its QuorumPeer hash map.
>>>>> 
>>>>>       I saw that possibly ZOOKEEPER-107 would also solve that
>>>>>       problem but possibly in a more generic way than actually
>>>>>       needed (perhaps here I underestimate the impact of joining as
>>>>>       a server with an empty data directory to replace a server that
>>>>>       previously had one).
>>>>> 
>>>>>       Given that - from looking at ZOOKEEPER-107 - it seems that it
>>>>>       will still take some time for the proposed fix to make it into
>>>>>       a release, would it make sense to invest time into a smaller
>>>>>       work fix just for this "replacing a dropped server without
>>>>>       rolling restarts" use case? Would there be a chance that a fix
>>>>>       for this makes it into the 3.4.x branch?
>>>>> 
>>>>>       Are there perhaps other ways to get this use case supported
>>>>>       without the need for doing rolling restarts whenever we need
>>>>>       to replace one of the zookeeper servers?
>>>>> 
>>>> 
>>>

Re: Zookeeper on short lived VMs and ZOOKEEPER-107

Reply via email to