Re: Zookeeper on short lived VMs and ZOOKEEPER-107

Christian Ziech Thu, 15 Mar 2012 02:52:07 -0700

I don't think that we could be running into a split brain problem in our use case. Let me try to describe the scenario we are worried about (assuming an ensemble of 5 nodes A, B, C, D, E):

- The ensemble is up and running and in sync

- Node A with the host name "zookeeperA.whatever-domain.priv" goes down because the VM has gone away

- That removal of the VM is detected and a new VM is spawned with the same host name "zookeeperA.whatever-domain.priv" - let's call that node A'

- Node A' zookeeper wants to join the cluster - right now this gets rejected by the others since A' has a different IP address than A (and the old one is "cached" in the InetSocketAddress of the QuorumPeer instance)
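To illustrate the "cached" address in the last step: a java.net.InetSocketAddress built from a host name resolves that name once, in its constructor, and keeps the resulting IP for its lifetime. The following is only a minimal sketch of that behaviour, not ZooKeeper code. The class name CachedAddressDemo and the helper reResolve are made up for illustration, "localhost" stands in for the real host name so that it runs anywhere, and 3888 is just an assumed election port.

```java
import java.net.InetSocketAddress;

public class CachedAddressDemo {

    /** Hypothetical helper: builds a new address from the same host name and port. */
    static InetSocketAddress reResolve(InetSocketAddress stale) {
        // getHostString() (Java 7+) returns the host name the address was
        // created with and never triggers a reverse lookup.
        return new InetSocketAddress(stale.getHostString(), stale.getPort());
    }

    public static void main(String[] args) {
        // The host name is resolved here, once. If DNS later points the name
        // at another IP, this object still carries the old one.
        InetSocketAddress cached = new InetSocketAddress("localhost", 3888);
        System.out.println("cached address: " + cached.getAddress());

        // Only a newly constructed object performs a new lookup.
        InetSocketAddress fresh = reResolve(cached);
        System.out.println("fresh address:  " + fresh.getAddress());
        System.out.println("same object:    " + (fresh == cached));
    }
}
```

Note that the JVM keeps its own DNS cache as well (see the networkaddress.cache.ttl security property), so even a fresh lookup may return the old IP for some time.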

We could ensure that at any given time there is at most one node with host name "zookeeperA.whatever-domain.priv" known by the ensemble and that once a node is replaced, it would not come back. Also we could make sure that our ensemble is big enough to compensate for a replacement of up to x nodes at a time (setting it to x*2 + 1 nodes).
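The sizing rule can be checked with a few lines of arithmetic. This is a small sketch with made-up names (QuorumSizing and its methods), assuming a plain majority quorum: an ensemble of n servers needs n/2 + 1 of them (integer division) to agree.

```java
public class QuorumSizing {

    /** Smallest ensemble that keeps a majority while x servers are missing. */
    static int ensembleSizeFor(int x) {
        return 2 * x + 1;
    }

    /** Number of servers that form a strict majority of n. */
    static int majority(int n) {
        return n / 2 + 1;
    }

    /** Number of servers that may be missing while a majority remains. */
    static int tolerated(int n) {
        return n - majority(n);
    }

    public static void main(String[] args) {
        for (int x = 1; x <= 3; x++) {
            int n = ensembleSizeFor(x);
            System.out.println("x=" + x + " -> ensemble=" + n
                    + ", majority=" + majority(n)
                    + ", tolerated=" + tolerated(n));
        }
    }
}
```

For the 5-node ensemble in the scenario above this gives a majority of 3, so two nodes can be in the process of being replaced at the same time.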

So if I did not misjudge our problem, it should be (due to these restrictions) simpler than the problem to be solved by ZOOKEEPER-107. My intention is basically to solve this smaller, discrete problem so that we do not need to wait for ZOOKEEPER-107 to make it into a release (the assumption being that a smaller fix might even have a chance to make it into the 3.4.x branch).


On 15.03.2012 07:46, ext Alexander Shraer wrote:

Hi Christian,

ZK-107 would indeed allow you to add/remove servers and change their addresses.

> We could ensure that we always have a more or less fixed quorum of zookeeper servers with a fixed set of host names.

You should probably also ensure that a majority of the old ensemble intersects with a majority of the new one. Otherwise you have to run a reconfiguration protocol similar to ZK-107. For example, if you have 3 servers A, B and C, and now you're adding D and E that replace B and C, how would this work? It is probable that D and E don't have the latest state (as you mention) and A is down or doesn't have the latest state either (a minority might not have the latest state). Also, how do you prevent split brain in this case, meaning B and C thinking that they are still operational? Perhaps I'm missing something, but I suspect that the change you propose won't be enough...
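One way to read the intersection requirement is: no majority of the old ensemble may be disjoint from a majority of the new one. The sketch below checks that by counting; it is only an illustration under that reading, and MajorityOverlap and its methods are hypothetical names. A majority of one ensemble can avoid the shared servers only as far as it can be filled from servers the other ensemble does not have; two disjoint majorities exist when the shared servers they are forced to use fit side by side.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MajorityOverlap {

    static int majority(int n) {
        return n / 2 + 1;
    }

    /** True if some majority of oldEnsemble shares no server with some majority of newEnsemble. */
    static boolean disjointMajoritiesPossible(Set<String> oldEnsemble, Set<String> newEnsemble) {
        Set<String> shared = new HashSet<>(oldEnsemble);
        shared.retainAll(newEnsemble);
        int s = shared.size();

        // Shared servers each majority is forced to include, after first
        // taking every server that belongs to its own ensemble only.
        int needOld = Math.max(0, majority(oldEnsemble.size()) - (oldEnsemble.size() - s));
        int needNew = Math.max(0, majority(newEnsemble.size()) - (newEnsemble.size() - s));

        // Disjoint majorities exist if both can pick different shared servers.
        return needOld + needNew <= s;
    }

    public static void main(String[] args) {
        Set<String> oldEnsemble = new HashSet<>(Arrays.asList("A", "B", "C"));
        Set<String> replaced = new HashSet<>(Arrays.asList("A", "D", "E"));
        // {B, C} and {D, E} are both majorities and share no server.
        System.out.println(disjointMajoritiesPossible(oldEnsemble, replaced));
        // An unchanged membership can never produce disjoint majorities.
        System.out.println(disjointMajoritiesPossible(oldEnsemble, oldEnsemble));
    }
}
```

The check treats servers purely by name, so it says nothing about whether a replacement that reuses a name also has the latest state.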


Best Regards,
Alex

On Wed, Mar 14, 2012 at 10:01 AM, Christian Ziech wrote:


    Just a small addition: In my opinion the patch could really boil
    down to adding a

      quorumServer.electionAddr = new InetSocketAddress(
          electionAddr.getHostName(), electionAddr.getPort());

    in the catch(IOException e) clause of the connectOne() method of
    the QuorumCnxManager. In addition one should perhaps make the
    electionAddr field in the QuorumPeer.QuorumServer class volatile
    to prevent races.
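A self-contained sketch of that idea, outside of ZooKeeper, might look as follows. It is not the actual QuorumCnxManager code: the class ReResolvingPeer, the Connector interface and the accessor are hypothetical stand-ins, and the connection attempt is abstracted so that the example runs without a network.

```java
import java.io.IOException;
import java.net.InetSocketAddress;

public class ReResolvingPeer {

    /** Hypothetical stand-in for opening the election connection. */
    interface Connector {
        void connect(InetSocketAddress addr) throws IOException;
    }

    // volatile: a replacement address written by one thread becomes
    // visible to other threads that read the field afterwards.
    private volatile InetSocketAddress electionAddr;

    ReResolvingPeer(String host, int port) {
        this.electionAddr = new InetSocketAddress(host, port);
    }

    InetSocketAddress electionAddr() {
        return electionAddr;
    }

    /** Tries to connect once; on failure, re-resolves the host name for the next attempt. */
    boolean connectOne(Connector connector) {
        InetSocketAddress addr = electionAddr; // read the volatile field once
        try {
            connector.connect(addr);
            return true;
        } catch (IOException e) {
            // Constructing a new InetSocketAddress resolves the host name again.
            electionAddr = new InetSocketAddress(addr.getHostName(), addr.getPort());
            return false;
        }
    }

    public static void main(String[] args) {
        ReResolvingPeer peer = new ReResolvingPeer("localhost", 3888);
        InetSocketAddress before = peer.electionAddr();
        boolean connected = peer.connectOne(a -> {
            throw new IOException("simulated connection failure");
        });
        System.out.println("connected: " + connected);
        System.out.println("address replaced: " + (peer.electionAddr() != before));
    }
}
```

The failed attempt itself is not retried here; the new address only takes effect on the next call, which matches placing the assignment in the catch clause.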

    I haven't yet fully checked this change for implications, but a
    quick test on some machines at least showed it would solve our
    use case. What do the more expert users / maintainers think - is
    it even worthwhile to go that route?

    On 14.03.2012 17:04, ext Christian Ziech wrote:

        Let me describe our upcoming use case in a few words: We are
        planning to use zookeeper in a cloud where typically nodes come
        and go unpredictably. We could ensure that we always have a
        more or less fixed quorum of zookeeper servers with a fixed
        set of host names. However the IPs associated with the host
        names would change every time a new server comes up. I browsed
        the code a little and it seems right now that the only problem
        is that the zookeeper server is remembering the resolved
        InetSocketAddress in its QuorumPeer hash map.

        I saw that possibly ZOOKEEPER-107 would also solve that
        problem but possibly in a more generic way than actually
        needed (perhaps here I underestimate the impact of joining as
        a server with an empty data directory to replace a server that
        previously had one).

        Given that - from looking at ZOOKEEPER-107 - it seems that it
        will still take some time for the proposed fix to make it into
        a release, would it make sense to invest time into a smaller
        fix just for this "replacing a dropped server without
        rolling restarts" use case? Would there be a chance that a fix
        for this makes it into the 3.4.x branch?

        Are there perhaps other ways to get this use case supported
        without the need for doing rolling restarts whenever we need
        to replace one of the zookeeper servers?
