Benjamin, thanks a lot, for your response and the great product you guys designed and implemented.
> these assumptions are manifest in ... the session timeouts of the clients. Does this mean that session_expired event may be triggered all by zk-client-library itself ?(by something like a built-in client-local timer, without notification from zk server? ) (I am digging into the source code, but in case of misunderstanding of the code, I need your confirmation please) > HBase region servers went into gc for many minutes and then woke up still > thinking they are the leader Could this happen if I just follow(correctly and without a client-local timer or external fencing resources) the recipe for distributed clock? On Tue, Jan 15, 2013 at 1:27 PM, Benjamin Reed <[email protected]> wrote: > sorry to jump in the middle, but i thought i'd point out a couple of things. > > at the heart of ZK is Zab, which is an atomic broadcast protocol (it > actually has stronger guarantees than just atomic broadcast: it also > guarantees primary order). updates go through this protocol which > gives us sequential consistency for writes. > > failure detection uses timeouts, as most failure detectors do, so we > have some assumptions on bounds of message delays and drifts of > clocks. in the end, these assumptions are manifest in the sync and > initial timeouts of the server and the session timeouts of the > clients. > > as long as the assumptions are true, things will stay consistent, if > the assumptions fail, such as when HBase region servers went into gc > for many minutes and then woke up still thinking they are the leader, > bad things can happen. the fix may be to use more conservative > assumptions or to use a fencing scheme with external resources. > > if the assumptions are violated by the zookeeper cluster, it will > manifest as a liveness problem rather than a safety issue. (in theory > at least, we do have bugs occasionally :) > > ben > > On Mon, Jan 14, 2013 at 7:45 PM, Hulunbier <[email protected]> wrote: >> Hi Jordan, >> >>> Why would client 1s connection be unstable but client 2s not? In any normal >>> usage the ZK clients are going to be on the same network. Or, are you >>> thinking cross-data-center usage? In my opinion, ZooKeeper is not suited to >>> cross data center usage. >> >> er... the word "unstable" I used is misleading; A full functional(or >> stable?) tcp connection is supposed to be encountered with some >> network congestion, and should / can handle this situation well, but >> might be with some delay of delivering the segments; High volume of >> traffic in LAN may lead to the above situation, and it is not rare, I >> think. >> >> Even if there was no such congestion, there is always a time lag, >> between zk sends session-timeout message and client receives the >> message; >> Without any assumption, we can not ensure that , the client could be >> ware of that it no longer has the lock - before other clients got the >> node_not_exist notification and successful executed getChildren and >> thought it(one of the others) having the lock. >> >> I think in practice, we could (or have to) accept this assumption : >> "the server’s clock advance no faster than a known constant factor >> faster than the client’s". >> >> But the assumption itself is not enough for the correctness of lock >> protocol; because the client can only passively waiting for the >> session_time_out message, so the client may need a timer to explicitly >> check time elapsed. >> >> But the recipe claims clearly that: "at any snapshot in time no two >> clients think they hold the same lock", and "There is no polling or >> timeouts." >> >> >>> In any event, as others have pointed out, Zookeeper is _not_ a >>> transactional system. >> >>> It is an eventually consistent system that will give you a reasonable >>> degree of distributed coordination semantics. >> >> I should admit that I do not know whether ZK is eventually consistent >> , transactional or not. (BTW, there is a recipe for 2pc, and some guys >> claim that *Zab* is Sequential Consistent); >> >> Does these properties of ZK implies there is assumptions of clock drift? >> >>>There are edge cases as you describe but they are in the level of noise. >> >> You might be right, but for me, edge cases is what I am worrying about >> (please do not get me wrong, I mean, different applications have >> different requirements / constraints). >> >>> >>> -Jordan >>> >>> On Jan 14, 2013, at 5:52 PM, Hulunbier <[email protected]> wrote: >>> >>>> Hi Vitalii, >>>> >>>> Thanks a lot, got your idea. >>>> >>>> Suppose we are measuring the time of events outsides the system(zk & >>>> clients) . >>>> >>>> And we have no client side time tracking routine. >>>> >>>> And t_i < t_k if i < k >>>> >>>> t_0 : >>>> >>>> client1 has created lock/node1, client2 has created lock/node2; >>>> client1 thinks itself holding the lock; client2 does not, and watching >>>> lock/node1. >>>> >>>> t_1 : >>>> >>>> ZK thinks client1's session is timeout(let's say, client1 is actually >>>> failed to send heart-beat message on time, due to a long pause of jvm >>>> gc). >>>> >>>> ZK deletes lock/node1, >>>> sends timeout message to client1, >>>> sends "node_not_exist" message to client2 (or send this message before >>>> the deletion, but it does not matter in our case) >>>> >>>> but for some reason, link between zk and client1 becomes very unstable, >>>> high packet loss, large amount of packet retransmission, >>>> which leads to a significant packet transmission delay(between client1 >>>> and zk only), but the tcp connection is NOT broken. >>>> >>>> t_2: >>>> >>>> client2 got the "node_not_exist" event, and issues the getChildren Cmd >>>> >>>> t_3: >>>> >>>> client2 found the only node lock/node2, and thinks itself holding the >>>> lock, and begins acting like a lock owner. >>>> >>>> (at the same time, client1 is also thinking itself holding the lock) >>>> >>>> t_4: >>>> >>>> session_timeout message not reach client1 yet, >>>> >>>> client1's jvm gc completed, doing something as the lock-owner. >>>> >>>> t_5: >>>> >>>> network becomes stable, finally, the session_timeout message sent from >>>> zk reached client1; >>>> >>>> client1 thinks itself no longer holding the lock, but it is too late, >>>> it has done something really bad between t_4 and t_5. >>>> >>>> -------------------------- >>>> >>>> Sorry for the grammar, I am not a native English speaker. >>>> >>>> >>>> On Mon, Jan 14, 2013 at 11:38 PM, Vitalii Tymchyshyn <[email protected]> >>>> wrote: >>>>> There are two events: disconnected and session expired. The ephemeral >>>>> nodes >>>>> are removed after the second one. The client receives both. So to >>>>> implement "at most one lock holder" scheme, client owning lock must think >>>>> it've lost lock ownership since it've received disconnected event. So, >>>>> there is period of time between disconnect and session expired when noone >>>>> should have the lock. It's "safety" time to accomodate for time shifts, >>>>> network latencies, lock ownership recheck interval (in case when client >>>>> can't stop using resource immediatelly and simply checks regulary if it >>>>> still holds the lock). >>>>> >>>>> >>>>> >>>>> 2013/1/14 Hulunbier <[email protected]> >>>>> >>>>>> Hi Vitalii, >>>>>> >>>>>>> I don't see why clock must be in sync. >>>>>> >>>>>> I don't see any reason to precisely sync the clocks either (but if we >>>>>> could ... that would be wonderful.). >>>>>> >>>>>> By *some constrains of clock drift*, I mean : >>>>>> >>>>>> "Every node has a clock, and all clocks increase at the same rate" >>>>>> or >>>>>> "the server’s clock advance no faster than a known constant factor >>>>>> faster than the client’s.". >>>>>> >>>>>> >>>>>>> Also note the difference between disconnected and session >>>>>>> expired events. This time difference is when client knows "something's >>>>>>> wrong", but another client did not get a lock yet. >>>>>> >>>>>> sorry, but I failed to get your idea well; would you please give me >>>>>> some further explanation? >>>>>> >>>>>> >>>>>> On Mon, Jan 14, 2013 at 6:37 PM, Vitalii Tymchyshyn <[email protected]> >>>>>> wrote: >>>>>>> I don't see why clock must be in sync. They are counting time periods >>>>>>> (timeouts). Also note the difference between disconnected and session >>>>>>> expired events. This time difference is when client knows "something's >>>>>>> wrong", but another client did not get a lock yet. You will have >>>>>>> problems >>>>>>> if client can't react (and release resources) between this two events. >>>>>>> >>>>>>> Best regards, Vitalii Tymchyshyn >>>>>>> >>>>>>> >>>>>>> 2013/1/13 Hulunbier <[email protected]> >>>>>>> >>>>>>>> Thanks Jordan, >>>>>>>> >>>>>>>>> Assuming the clocks are in sync between all participants… >>>>>>>> >>>>>>>> imho, perfect clock synchronization in a distributed system is very >>>>>>>> hard (if it can be). >>>>>>>> >>>>>>>>> Someone with better understanding of ZK internals can correct me, but >>>>>>>> this is my understanding. >>>>>>>> >>>>>>>> I think I might have missed some very important and subtile(or >>>>>>>> obvious?) points of the recipe / ZK protocol. >>>>>>>> >>>>>>>> I just can not believe that, there could be such type of a flaw in the >>>>>>>> lock-recipe, for so long time, without anybody has pointed it out. >>>>>>>> >>>>>>>> On Sun, Jan 13, 2013 at 9:31 AM, Jordan Zimmerman >>>>>>>> <[email protected]> wrote: >>>>>>>>> On Jan 12, 2013, at 2:30 AM, Hulunbier <[email protected]> wrote: >>>>>>>>> >>>>>>>>>> Suppose the network link betweens client1 and server is at very low >>>>>>>>>> quality (high packet loss rate?) but still fully functional. >>>>>>>>>> >>>>>>>>>> Client1 may be happily sending heart-beat-messages to server without >>>>>>>>>> notice anything; but ZK server could be unable to receive >>>>>>>>>> heart-beat-messages from client1 for a long period of time , which >>>>>>>>>> leads ZK server to timeout client1's session, and delete the >>>>>> ephemeral >>>>>>>>>> node >>>>>>>>> >>>>>>>>> I believe the heartbeats go both ways. Thus, if the client doesn't >>>>>> hear >>>>>>>> from the server it will post a Disconnected event. >>>>>>>>> >>>>>>>>>> But I still feels that, no matter how well a ZK application behaves, >>>>>>>>>> if we use ephemeral node in the lock-recipe; we can not guarantee "at >>>>>>>>>> any snapshot in time no two clients think they hold the same lock", >>>>>>>>>> which is the fundamental requirement/constraint for a lock. >>>>>>>>> >>>>>>>>> Assuming the clocks are in sync between all participants… The server >>>>>> and >>>>>>>> the client that holds the lock should determine that there is a >>>>>>>> disconnection at nearly the same time. I imagine that there is a >>>>>>>> certain >>>>>>>> amount of time (a few milliseconds) overlap here. But, the next client >>>>>>>> wouldn't get the notification immediately anyway. Further, when the >>>>>>>> next >>>>>>>> client gets the notification, it still needs to execute a getChildren() >>>>>>>> command, process the results, etc. before it can determine that it has >>>>>> the >>>>>>>> lock. That two clients would think they have the lock at the same time >>>>>> is a >>>>>>>> vanishingly small possibility. Even if it did happen it would only be >>>>>> for a >>>>>>>> few milliseconds at most. >>>>>>>>> >>>>>>>>> Someone with better understanding of ZK internals can correct me, but >>>>>>>> this is my understanding. >>>>>>>>> >>>>>>>>> -Jordan >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Best regards, >>>>>>> Vitalii Tymchyshyn >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Best regards, >>>>> Vitalii Tymchyshyn >>>
