Re: Zookeeper on short lived VMs and ZOOKEEPER-107

Christian Ziech Fri, 16 Mar 2012 02:58:31 -0700

Under normal circumstances the ability to detect failures correctly
should be given. The scenario I'm aware of is that one zookeeper system
would be taken down for some reason and then possibly just rebooted or even
started from scratch elsewhere. In both cases however the new host would
have the old DNS name but most likely a different IP. But of course that
only applies to us and possibly not to all of the users.

When thinking about the scenario you described I understood where the
problem lies. However wouldn't the same problem also be relevant the way
zookeeper is implemented right now? Let me try to explain why (possibly
I'm wrong here since I may miss some points on how zookeeper servers
work internally - corrections are very welcome):
- Same scenario as you described - nodes A with host name a, B with host
name b and C with host name c
- Also same as in your scenario, C is due to some human error falsely
detected as down. Hence C' is brought up and is assigned the same DNS
name as C
- Now rolling restarts are performed to bring in C'
- A resolves c correctly to the new IP and connects to C' but B still
resolves the host name c to the original address of C and hence does not
connect to C' (I think some DNS slowness is also required for your approach
in order for the host name c to be resolved to the original IP of C)
- Now the rest of your scenario happens: Update U is applied, C' gets
slow, C recovers and A fails.

Of course this approach also requires some DNS craziness but if I did
not make a mistake in my thoughts it should still be possible.

PS: Wouldn't your scenario also invalidate the solution of the hbase
guys using Amazon's elastic IPs to solve the same problem (see
https://issues.apache.org/jira/browse/HBASE-2327)?

PS2: If the approach I had in mind is not valid, do you guys already
have a plan for when 3.5.0 would be released or could you guys be
supported in some way so that zookeeper-107 makes it sooner into a release?


On 16.03.2012 04:43, ext Alexander Shraer wrote:

Actually it's still not clear to me how you would enforce the 2x+1. In Zookeeper
we can guarantee liveness (progress) only when x+1 are connected and up, 
however safety (correctness) is always guaranteed, even if 2 out of 3 servers 
are temporarily down. Your design needs the 2x+1 for safety, which I think is 
problematic unless you can accurately detect failures (synchrony) and failures 
are permanent.
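
As a small illustration of the 2x+1 arithmetic above: with majority quorums, an ensemble of 2x+1 servers can make progress only while at least x+1 of them are up. The sketch below is not ZooKeeper code; the class and method names are made up for this note, and it models only the liveness condition, not safety.

```java
/** Hypothetical helper, not part of ZooKeeper: majority-quorum arithmetic. */
final class QuorumMath {
    /** Smallest number of servers that forms a majority of the ensemble. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** Liveness condition only: true if enough servers are up to make progress. */
    static boolean canMakeProgress(int serversUp, int ensembleSize) {
        return serversUp >= majority(ensembleSize);
    }

    public static void main(String[] args) {
        // Ensemble of 2x+1 = 3 servers (x = 1): progress needs x+1 = 2 of them.
        System.out.println(majority(3));            // prints 2
        // With 2 of 3 servers down there is no progress, but no data is lost.
        System.out.println(canMakeProgress(1, 3));  // prints false
    }
}
```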

Alex


On Mar 15, 2012, at 3:54 PM, Alexander Shraer<[email protected]>  wrote:

I think the concern is that the old VM can recover and try to
reconnect. Theoretically you could even go back and forth between new
and old VM. For example, suppose that you have servers
A, B and C in the cluster, A is the leader. C is slow and "replaced"
with C', then update U is acked by A and C', then A fails. In this
situation you cannot have additional failures. But with the
automatic replacement thing it can (theoretically) happen that C'
becomes a little slow, C connects to B and is chosen as leader, and
the committed update U is lost forever. This is perhaps unlikely but
possible...
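
The scenario above can be sketched as a set computation. This is only an illustration under the assumptions of the example (servers A, B, C and the replacement C'); the class and method names are hypothetical and this is not how ZooKeeper tracks quorums. It shows that the quorum that acknowledged U and the quorum that later elects a leader share no server once both C and C' are counted as "server C".

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of the lost-update scenario; not ZooKeeper code. */
final class LostUpdateSketch {
    /** Returns the servers that are members of both quorums. */
    static Set<String> intersection(Set<String> q1, Set<String> q2) {
        Set<String> common = new HashSet<>(q1);
        common.retainAll(q2);
        return common;
    }

    public static void main(String[] args) {
        // Update U is acknowledged by A and the replacement C'.
        Set<String> ackedU = Set.of("A", "C'");
        // Later A fails, C' is slow, and the old C reconnects to B.
        Set<String> electing = Set.of("B", "C");
        // No common member: nobody in the electing quorum has seen U.
        System.out.println(intersection(ackedU, electing).isEmpty()); // prints true

        // Without a replacement, any two majorities of {A, B, C} overlap.
        System.out.println(intersection(Set.of("A", "C"), Set.of("B", "C"))); // prints [C]
    }
}
```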

Alex

On Thu, Mar 15, 2012 at 1:35 PM,<[email protected]>  wrote:

I agree with your points about any kind of VMs having a hard to predict runtime 
behaviour and that participants of the zookeeper ensemble running on a VM could 
fail to send keep-alives for an uncertain amount of time. But I don't yet 
understand how that would break the approach I was mentioning: Just trying to 
re-resolve the InetAddress after an IO exception should in that case still lead 
to the same original IP address (and eventually to that node rejoining the 
ensemble).
Only if the host name the old node was using were re-assigned to another
instance would this re-resolving step point to a new IP (and hence cause the
old server to be replaced).

Did I understand your objection correctly?

________________________________________
From: ext Ted Dunning [[email protected]]
Sent: Thursday, 15 March 2012 19:50
To: [email protected]
Cc: [email protected]
Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107

Alexander's comment still applies.

VMs can function or go away completely, but they can also malfunction
in more subtle ways such that they just go VEEEERRRRY slowly.  You
have to account for that failure mode.  These failures can even be
transient.

This would probably break your approach.

On 3/15/12, [email protected]<[email protected]>  wrote:

Oh sorry there is a slight misunderstanding. With VM I did not mean the java
vm but the Linux vm that contains the zookeeper node. We get notified if
that goes away and is repurposed.

BR
  Christian

Sent from my Nokia Lumia 800
________________________________
From: ext Alexander Shraer
Sent: 15.03.2012 16:33
To: [email protected]; Ziech Christian (Nokia-LC/Berlin)
Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107

yes, by replacing x at a time from 2x+1 you have quorum intersection.

i have one more question - zookeeper itself doesn't assume perfect
failure detection, which your scheme requires. what if the VM didn't
actually fail but is just slow and then tries to reconnect?

On Thu, Mar 15, 2012 at 2:50 AM, Christian Ziech
<[email protected]>  wrote:

I don't think that we could be running into a split brain problem in our
use case.
Let me try to describe the scenario we are worried about (assuming an
ensemble of 5 nodes A,B,C,D,E):
- The ensemble is up and running and in sync
- Node A with the host name "zookeeperA.whatever-domain.priv" goes down
because the VM has gone away
- That removal of the VM is detected and a new VM is spawned with the same
host name "zookeeperA.whatever-domain.priv" - let's call that node A'
- Node A' zookeeper wants to join the cluster - right now this gets
rejected by the others since A' has a different IP address than A (and the
old one is "cached" in the InetSocketAddress of the QuorumPeer instance)

We could ensure that at any given time there is at most one node with
host name "zookeeperA.whatever-domain.priv" known by the ensemble and that
once one node is replaced, it would not come back. Also we could make sure
that our ensemble is big enough to compensate for a replacement of up to
x nodes at a time (setting it to x*2 + 1 nodes).

So if I did not misjudge our problem it should be (due to the
restrictions) simpler than the problem to be solved by zookeeper-107. My
intention is basically, by solving this smaller discrete problem, to not
need to wait until zookeeper-107 makes it into a release (the assumption
is that a smaller fix possibly has a chance to make it even into the 3.4.x
branch).

On 15.03.2012 07:46, ext Alexander Shraer wrote:

Hi Christian,

ZK-107 would indeed allow you to add/remove servers and change their
addresses.

We could ensure that we always have a more or less fixed quorum of
zookeeper servers with a fixed set of host names.

You should probably also ensure that a majority of the old ensemble
intersects with a majority of the new one.
Otherwise you have to run a reconfiguration protocol similarly to ZK-107.
For example, if you have 3 servers A, B and C, and now you're adding D and
E that replace B and C, how would this work? It is probable that D and E
don't have the latest state (as you mention) and A is down or doesn't
have the latest state either (a minority might not have the latest state).
Also, how do you prevent split brain in this case, meaning B and C thinking
that they are still operational? Perhaps I'm missing something but I
suspect that the change you propose won't be enough...

Best Regards,
Alex


On Wed, Mar 14, 2012 at 10:01 AM, Christian Ziech
<[email protected]>  wrote:

   Just a small addition: In my opinion the patch could really boil
   down to adding a

     quorumServer.electionAddr = new InetSocketAddress(
         electionAddr.getHostName(), electionAddr.getPort());

   in the catch(IOException e) clause of the connectOne() method of
   the QuorumCnxManager. In addition one should perhaps make the
   electionAddr field in the QuorumPeer.QuorumServer class volatile
   to prevent races.
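
The idea described above can be tried in isolation with a minimal sketch. The class below is a hypothetical stand-in for the per-peer record, not the actual QuorumPeer.QuorumServer or QuorumCnxManager code, and the host name and port are example values. It uses getHostString() (Java 7+) instead of getHostName() so that no reverse lookup can be triggered. Note that the new lookup may still be answered from the JVM's own DNS cache, so the effect also depends on the JVM's address cache settings.

```java
import java.net.InetSocketAddress;

/** Hypothetical stand-in for a per-peer record; not the actual ZooKeeper class. */
final class PeerAddressSketch {
    // volatile so that a replaced address is visible to other threads.
    private volatile InetSocketAddress electionAddr;

    PeerAddressSketch(String host, int port) {
        // The constructor resolves the host name once and keeps the result.
        this.electionAddr = new InetSocketAddress(host, port);
    }

    InetSocketAddress electionAddr() {
        return electionAddr;
    }

    /** Intended to be called after a failed connect: look the host name up again. */
    void reResolve() {
        InetSocketAddress old = electionAddr;
        // A new InetSocketAddress built from the name triggers a new lookup.
        electionAddr = new InetSocketAddress(old.getHostString(), old.getPort());
    }

    public static void main(String[] args) {
        PeerAddressSketch peer = new PeerAddressSketch("localhost", 3888);
        InetSocketAddress before = peer.electionAddr();
        peer.reResolve();
        // The stored object was replaced, even if it resolves to the same IP.
        System.out.println(before != peer.electionAddr());
    }
}
```

In a real connection manager the call to reResolve() would sit in the exception handler of the connect attempt, as proposed in the message above.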

   I haven't checked this change yet fully for implications but doing
   a quick test on some machines at least showed it would solve our
   use case. What do the more expert users / maintainers think - is
   it even worthwhile to go that route?

   On 14.03.2012 17:04, ext Christian Ziech wrote:

       Let me describe our upcoming use case in a few words: We are
       planning to use zookeeper in a cloud where typically nodes come
       and go unpredictably. We could ensure that we always have a
       more or less fixed quorum of zookeeper servers with a fixed
       set of host names. However the IPs associated with the host
       names would change every time a new server comes up. I browsed
       the code a little and it seems right now that the only problem
       is that the zookeeper server is remembering the resolved
       InetSocketAddress in its QuorumPeer hash map.

       I saw that possibly ZOOKEEPER-107 would also solve that
       problem but possibly in a more generic way than actually
       needed (perhaps here I underestimate the impact of joining as
       a server with an empty data directory to replace a server that
       previously had one).

       Given that - from looking at ZOOKEEPER-107 - it seems that it
       will still take some time for the proposed fix to make it into
       a release, would it make sense to invest time into a smaller
       work fix just for this "replacing a dropped server without
       rolling restarts" use case? Would there be a chance that a fix
       for this makes it into the 3.4.x branch?

       Are there perhaps other ways to get this use case supported
       without the need for doing rolling restarts whenever we need
       to replace one of the zookeeper servers?
