Benjamin,

Thanks a lot for your response, and for the great product you guys
designed and implemented.

> these assumptions are manifest in ... the session timeouts of the clients.

Does this mean that a session_expired event may be triggered by the
zk client library all by itself (by something like a built-in
client-local timer, without any notification from the zk server)?

(I am digging into the source code, but in case I have misunderstood
the code, I would like your confirmation please.)
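While digging, I put together a toy model of what such a client-local check could look like, just to make my question concrete. This is purely my own sketch; none of these names come from the ZooKeeper client library:

```python
import time

class LocalSessionTimer:
    """Hypothetical sketch of a client-local session check: the client
    declares its own session dead once it has gone session_timeout
    without hearing from the server, even if the server's Expired
    notification never arrives. Illustrative only, not the ZK API."""

    def __init__(self, session_timeout, clock=time.monotonic):
        self.session_timeout = session_timeout
        self.clock = clock
        self.last_heartbeat = clock()

    def on_heartbeat(self):
        # Called whenever any response or ping reply arrives from the server.
        self.last_heartbeat = self.clock()

    def expired(self):
        # True once the client has been out of contact for longer than
        # the session timeout, regardless of server-side notifications.
        return self.clock() - self.last_heartbeat > self.session_timeout
```

The point of the sketch is that such a check needs only the client's local clock, which is why I am asking whether the real client library contains something like it.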

> HBase region servers went into gc for many minutes and then woke up still 
> thinking they are the leader

Could this happen even if I correctly follow the distributed-lock
recipe (without a client-local timer or external fencing resources)?
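To make sure I understand the alternative Ben mentions, here is my rough sketch of a fencing scheme with an external resource: the resource itself remembers the highest fencing token it has seen (a monotonically increasing number, for example the czxid of the lock node) and rejects writes carrying an older one. The class and method names are my own illustration, not anything from ZooKeeper:

```python
class FencedResource:
    """Hypothetical sketch of external fencing. The protected resource
    tracks the highest fencing token presented so far and refuses
    writes from holders of a stale token."""

    def __init__(self):
        self.highest_token = -1
        self.writes = []

    def write(self, token, data):
        if token < self.highest_token:
            # A client that paused (e.g. in a long GC) and woke up with
            # a stale token is fenced off here, even though it still
            # "thinks" it holds the lock.
            raise PermissionError("stale fencing token")
        self.highest_token = token
        self.writes.append(data)
```

If I read Ben's remark correctly, this kind of check on the resource side is what keeps a woken-up stale leader from doing damage, independently of any timing assumptions.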




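For concreteness, here is how I currently understand the mitigation suggested in this thread: treating the Disconnected event, and not only Expired, as loss of lock ownership, so that the gap between the two events serves as safety time. The sketch below is hypothetical code of my own; only the state names echo ZooKeeper's connection states:

```python
class LockHandle:
    """Illustrative sketch (not the real ZooKeeper client API): the
    lock holder pessimistically drops ownership on Disconnected,
    long before the server-side session actually expires."""

    def __init__(self):
        self.owns_lock = False

    def on_acquired(self):
        self.owns_lock = True

    def on_state_change(self, state):
        # Treat both Disconnected and Expired as loss of ownership;
        # the gap between them is the "safety" time discussed below.
        if state in ("Disconnected", "Expired"):
            self.owns_lock = False
```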

On Tue, Jan 15, 2013 at 1:27 PM, Benjamin Reed <[email protected]> wrote:
> Sorry to jump in the middle, but I thought I'd point out a couple of things.
>
> At the heart of ZK is Zab, an atomic broadcast protocol (it actually
> has stronger guarantees than plain atomic broadcast: it also
> guarantees primary order). Updates go through this protocol, which
> gives us sequential consistency for writes.
>
> Failure detection uses timeouts, as most failure detectors do, so we
> make some assumptions about bounds on message delays and clock drift.
> In the end, these assumptions are manifest in the sync and initial
> timeouts of the server and the session timeouts of the clients.
>
> As long as the assumptions hold, things will stay consistent. If the
> assumptions fail, such as when HBase region servers went into GC for
> many minutes and then woke up still thinking they were the leader,
> bad things can happen. The fix may be to use more conservative
> assumptions or to use a fencing scheme with external resources.
>
> If the assumptions are violated by the ZooKeeper cluster itself, it
> will manifest as a liveness problem rather than a safety issue (in
> theory at least; we do have bugs occasionally :)
>
> ben
>
> On Mon, Jan 14, 2013 at 7:45 PM, Hulunbier <[email protected]> wrote:
>> Hi Jordan,
>>
>>> Why would client 1's connection be unstable but client 2's not? In any
>>> normal usage the ZK clients are going to be on the same network. Or,
>>> are you thinking of cross-data-center usage? In my opinion, ZooKeeper
>>> is not suited to cross-data-center usage.
>>
>> er... the word "unstable" I used is misleading. A fully functional (or
>> stable?) TCP connection is expected to encounter some network
>> congestion, and should/can handle that situation well, though perhaps
>> with some delay in delivering the segments. A high volume of traffic
>> on a LAN can lead to the above situation, and it is not rare, I think.
>>
>> Even if there were no such congestion, there is always a time lag
>> between when ZK sends the session-timeout message and when the client
>> receives it.
>> Without any timing assumption, we cannot ensure that the client becomes
>> aware that it no longer holds the lock before the other clients get the
>> node_not_exist notification, successfully execute getChildren, and
>> conclude that one of them holds the lock.
>>
>> I think in practice, we could (or have to) accept this assumption :
>> "the server’s clock advance no faster than a known constant factor
>> faster than the client’s".
>>
>> But the assumption itself is not enough for the correctness of the lock
>> protocol, because the client can only passively wait for the
>> session_time_out message; the client may need a timer to explicitly
>> check the elapsed time.
>>
>> But the recipe claims clearly that:  "at any snapshot in time no two
>> clients think they hold the same lock", and "There is no polling or
>> timeouts."
>>
>>
>>> In any event, as others have pointed out, Zookeeper is _not_ a 
>>> transactional system.
>>
>>> It is an eventually consistent system that will give you a reasonable 
>>> degree of distributed coordination semantics.
>>
>> I should admit that I do not know whether ZK is eventually consistent,
>> transactional, or neither. (BTW, there is a recipe for 2PC, and some
>> people claim that *Zab* is sequentially consistent.)
>>
>> Do these properties of ZK imply assumptions about clock drift?
>>
>>> There are edge cases as you describe but they are in the level of noise.
>>
>> You might be right, but for me, edge cases are what I am worried about
>> (please do not get me wrong; I mean that different applications have
>> different requirements / constraints).
>>
>>>
>>> -Jordan
>>>
>>> On Jan 14, 2013, at 5:52 PM, Hulunbier <[email protected]> wrote:
>>>
>>>> Hi Vitalii,
>>>>
>>>> Thanks a lot, got your idea.
>>>>
>>>> Suppose we are measuring the time of events from outside the system
>>>> (zk & clients).
>>>>
>>>> And we have no client-side time-tracking routine.
>>>>
>>>> And t_i < t_k if  i < k
>>>>
>>>> t_0 :
>>>>
>>>> client1 has created lock/node1, client2 has created lock/node2;
>>>> client1 thinks itself holding the lock; client2 does not, and watching
>>>> lock/node1.
>>>>
>>>> t_1 :
>>>>
>>>> ZK thinks client1's session has timed out (let's say client1 actually
>>>> failed to send its heartbeat message on time, due to a long JVM GC
>>>> pause).
>>>>
>>>> ZK deletes lock/node1,
>>>> sends timeout message to client1,
>>>> sends a "node_not_exist" message to client2 (or sends this message
>>>> before the deletion, but that does not matter in our case)
>>>>
>>>> but for some reason the link between zk and client1 becomes very
>>>> unstable: high packet loss and heavy packet retransmission lead to a
>>>> significant packet transmission delay (between client1 and zk only),
>>>> but the TCP connection is NOT broken.
>>>>
>>>> t_2:
>>>>
>>>> client2 gets the "node_not_exist" event and issues the getChildren command
>>>>
>>>> t_3:
>>>>
>>>> client2 finds that the only node is lock/node2, thinks itself the
>>>> lock holder, and begins acting like the lock owner.
>>>>
>>>> (at the same time, client1 also thinks itself the lock holder)
>>>>
>>>> t_4:
>>>>
>>>> the session_timeout message has not reached client1 yet;
>>>>
>>>> client1's JVM GC completes, and it does something as the lock owner.
>>>>
>>>> t_5:
>>>>
>>>> the network becomes stable, and the session_timeout message sent from
>>>> zk finally reaches client1;
>>>>
>>>> client1 realizes it no longer holds the lock, but it is too late:
>>>> it has done something really bad between t_4 and t_5.
>>>>
>>>> --------------------------
>>>>
>>>> Sorry for the grammar, I am not a native English speaker.
>>>>
>>>>
>>>> On Mon, Jan 14, 2013 at 11:38 PM, Vitalii Tymchyshyn <[email protected]> 
>>>> wrote:
>>>>> There are two events: disconnected and session expired. The ephemeral
>>>>> nodes are removed after the second one. The client receives both. So,
>>>>> to implement an "at most one lock holder" scheme, the client owning
>>>>> the lock must consider its ownership lost as soon as it receives the
>>>>> disconnected event. So there is a period of time between disconnected
>>>>> and session expired when no one should hold the lock. It is "safety"
>>>>> time to accommodate time shifts, network latencies, and the
>>>>> lock-ownership recheck interval (in case the client cannot stop using
>>>>> the resource immediately and simply checks regularly whether it still
>>>>> holds the lock).
>>>>>
>>>>>
>>>>>
>>>>> 2013/1/14 Hulunbier <[email protected]>
>>>>>
>>>>>> Hi Vitalii,
>>>>>>
>>>>>>> I don't see why clock must be in sync.
>>>>>>
>>>>>> I don't see any reason to precisely sync the clocks either (but if we
>>>>>> could ... that would be wonderful.).
>>>>>>
>>>>>> By *some constrains of clock drift*, I mean :
>>>>>>
>>>>>> "Every node has a clock, and all clocks increase at the same rate"
>>>>>> or
>>>>>> "the server’s clock advance no faster than a known constant factor
>>>>>> faster than the client’s.".
>>>>>>
>>>>>>
>>>>>>> Also note the difference between disconnected and session
>>>>>>> expired events. This time difference is when client knows "something's
>>>>>>> wrong", but another client did not get a lock yet.
>>>>>>
>>>>>> sorry, but I did not fully follow your point; could you please give
>>>>>> me some further explanation?
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 14, 2013 at 6:37 PM, Vitalii Tymchyshyn <[email protected]>
>>>>>> wrote:
>>>>>>> I don't see why the clocks must be in sync. They are counting time
>>>>>>> periods (timeouts). Also note the difference between the disconnected
>>>>>>> and session expired events. This time difference is when the client
>>>>>>> knows "something's wrong" but another client has not got the lock yet.
>>>>>>> You will have problems if the client can't react (and release
>>>>>>> resources) between these two events.
>>>>>>>
>>>>>>> Best regards, Vitalii Tymchyshyn
>>>>>>>
>>>>>>>
>>>>>>> 2013/1/13 Hulunbier <[email protected]>
>>>>>>>
>>>>>>>> Thanks Jordan,
>>>>>>>>
>>>>>>>>> Assuming the clocks are in sync between all participants…
>>>>>>>>
>>>>>>>> imho, perfect clock synchronization in a distributed system is very
>>>>>>>> hard (if it is possible at all).
>>>>>>>>
>>>>>>>>> Someone with a better understanding of ZK internals can correct me,
>>>>>>>>> but this is my understanding.
>>>>>>>>
>>>>>>>> I think I might have missed some very important and subtle (or
>>>>>>>> obvious?) points of the recipe / ZK protocol.
>>>>>>>>
>>>>>>>> I just cannot believe that such a flaw could have existed in the
>>>>>>>> lock recipe for so long without anybody pointing it out.
>>>>>>>>
>>>>>>>> On Sun, Jan 13, 2013 at 9:31 AM, Jordan Zimmerman
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> On Jan 12, 2013, at 2:30 AM, Hulunbier <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Suppose the network link between client1 and the server is of very
>>>>>>>>>> low quality (high packet loss rate?) but still fully functional.
>>>>>>>>>>
>>>>>>>>>> Client1 may be happily sending heartbeat messages to the server
>>>>>>>>>> without noticing anything; but the ZK server could be unable to
>>>>>>>>>> receive heartbeat messages from client1 for a long period of time,
>>>>>>>>>> which leads the ZK server to time out client1's session and delete
>>>>>>>>>> the ephemeral node
>>>>>>>>>
>>>>>>>>> I believe the heartbeats go both ways. Thus, if the client doesn't
>>>>>>>>> hear from the server it will post a Disconnected event.
>>>>>>>>>
>>>>>>>>>> But I still feel that, no matter how well a ZK application behaves,
>>>>>>>>>> if we use an ephemeral node in the lock recipe, we cannot guarantee
>>>>>>>>>> "at any snapshot in time no two clients think they hold the same
>>>>>>>>>> lock", which is the fundamental requirement/constraint for a lock.
>>>>>>>>>
>>>>>>>>> Assuming the clocks are in sync between all participants… The server
>>>>>>>>> and the client that holds the lock should determine that there is a
>>>>>>>>> disconnection at nearly the same time. I imagine that there is a
>>>>>>>>> certain amount of time (a few milliseconds) of overlap here. But the
>>>>>>>>> next client wouldn't get the notification immediately anyway. Further,
>>>>>>>>> when the next client gets the notification, it still needs to execute
>>>>>>>>> a getChildren() command, process the results, etc. before it can
>>>>>>>>> determine that it has the lock. That two clients would think they have
>>>>>>>>> the lock at the same time is a vanishingly small possibility. Even if
>>>>>>>>> it did happen it would only be for a few milliseconds at most.
>>>>>>>>>
>>>>>>>>> Someone with a better understanding of ZK internals can correct me,
>>>>>>>>> but this is my understanding.
>>>>>>>>>
>>>>>>>>> -Jordan
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Vitalii Tymchyshyn
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Vitalii Tymchyshyn
>>>
