Yes, Ben, would you give some more details as to why it doesn't work in a
cluster? I think I am seeing it work OK in cluster mode as well with some basic
tests. There are probably other major problems with this, but I would appreciate
any direction you could give as to what might go wrong here.

Thanks,
C

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
Sent: Wednesday, September 08, 2010 4:51 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

To get it to work in a cluster, what would be necessary?

A new message to the leader to describe connection loss?

On Wed, Sep 8, 2010 at 1:03 PM, Benjamin Reed <br...@yahoo-inc.com> wrote:

> unfortunately, that only works on the standalone server.
>
> ben
>
> On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:
>
>> This would be the ideal solution to this problem I think.
>> Poking around the (3.3) code to figure out how hard it would be to
>> implement, I figure one way to do it would be to modify the session timeout
>> to the min session timeout and touch the connection before calling close
>> when you get certain exceptions in NIOServerCnxn.doIO. I did this (removing
>> the code in touch session that returns if the tickTime is greater than the
>> expire time) and it worked (in the standalone server anyway). Interesting
>> solution, or total hack that will not work beyond the most basic test case?
>>
>> C
>>
>> (forgive lack of actual code in this email)
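
A minimal sketch of the idea described above, written in Python for brevity
rather than the server's Java. It is a toy model, not ZooKeeper code:
ToySessionTracker, touch, on_io_error and the numeric timeouts are hypothetical
names and values. It only illustrates letting a touch move a session's expiry
earlier (to "now plus the minimum session timeout") when the connection fails,
instead of leaving the full negotiated timeout in place.

```python
class ToySessionTracker:
    """Toy session table: session id -> absolute expiry time in seconds."""

    def __init__(self, min_session_timeout):
        self.min_session_timeout = min_session_timeout
        self.expiry = {}

    def touch(self, session_id, now, timeout):
        # Always reset the expiry, even if that moves it earlier.
        # (The change described above removes a guard that would
        # otherwise return without updating in that case.)
        self.expiry[session_id] = now + timeout

    def on_io_error(self, session_id, now):
        # Before closing the connection, shrink the session's remaining
        # lifetime to the minimum session timeout.
        self.touch(session_id, now, self.min_session_timeout)

    def expired(self, now):
        # Sessions whose expiry time has been reached.
        return sorted(s for s, t in self.expiry.items() if t <= now)


tracker = ToySessionTracker(min_session_timeout=2.0)
tracker.touch("s1", now=0.0, timeout=30.0)   # normal heartbeat, expires at 30.0
tracker.on_io_error("s1", now=5.0)           # socket error, expires at 7.0
```

In this toy model the session disappears two seconds after the socket error
rather than at the original 30 second deadline. It says nothing about how the
change behaves in a cluster, which is the open question in this thread.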
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> Sent: Tuesday, September 07, 2010 1:11 PM
>> To: zookeeper-user@hadoop.apache.org
>> Cc: Benjamin Reed
>> Subject: Re: closing session on socket close vs waiting for timeout
>>
>> This really is, just as Ben says, a problem of false positives and false
>> negatives in detecting session expiration.
>>
>> On the other hand, the current algorithm isn't really using all the
>> information available.  The current algorithm is using time since last
>> client-initiated heartbeat.  The new proposal is somewhat worse in that it
>> proposes to use just the boolean "has-TCP-disconnect-happened".
>>
>> Perhaps it would be better to use multiple features in order to decrease
>> both false positives and false negatives.
>>
>> For instance, I could imagine that we use the following features:
>>
>> - time since last client heartbeat or disconnect or reconnect
>>
>> - what was the last event? (a heartbeat or a disconnect or a reconnect)
>>
>> Then the expiration algorithm could use a relatively long time since last
>> heartbeat and a relatively short time since last disconnect to mark a
>> session as disconnected.
>>
>> Wouldn't this avoid expiration during GC and cluster partition and cause
>> expiration quickly after a client disconnect?
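
The two-feature rule proposed above can be sketched as follows. This is only an
illustration under stated assumptions: SessionState, should_expire and the
30 second / 2 second thresholds are hypothetical and do not come from the
ZooKeeper code base.

```python
from dataclasses import dataclass

# Hypothetical event kinds, matching the "last event" feature above.
HEARTBEAT = "heartbeat"
DISCONNECT = "disconnect"
RECONNECT = "reconnect"


@dataclass
class SessionState:
    last_event: str         # kind of the most recent event
    last_event_time: float  # when it happened, in seconds


def should_expire(state, now, heartbeat_timeout=30.0, disconnect_timeout=2.0):
    """Return True if the session should be expired at time `now`."""
    elapsed = now - state.last_event_time
    if state.last_event == DISCONNECT:
        # The socket is known to be closed: allow only a short grace period.
        return elapsed > disconnect_timeout
    # Last event was a heartbeat or a reconnect: tolerate long silences
    # such as a GC pause or a brief partition.
    return elapsed > heartbeat_timeout
```

With these assumed thresholds, a session whose last event was a disconnect
expires after 2 seconds of silence, while one whose last event was a heartbeat
or reconnect survives up to 30 seconds.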
>>
>>
>> On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt<ph...@apache.org>  wrote:
>>
>>
>>
>>> That's a good point, however with suitable documentation, warnings and
>>> such it seems like a reasonable feature to provide for those users who
>>> require it. Used in moderation it seems fine to me. Perhaps we also make
>>> it configurable at the server level for those administrators/ops who
>>> don't want to deal with it (disable the feature entirely, or only enable
>>> on particular servers, etc...).
>>>
>>> Patrick
>>>
>>> On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed<br...@yahoo-inc.com>
>>>  wrote:
>>>
>>>
>>>
>>>> if this mechanism were used very often, we would get a huge number of
>>>> session expirations when a server fails. you are trading fast error
>>>> detection for the ability to tolerate temporary network and server
>>>> outages.
>>>>
>>>> to be honest this seems like something that in theory sounds like it
>>>> will work in practice, but once deployed we start getting session
>>>> expirations for cases that we really do not want or expect.
>>>>
>>>> ben
>>>>
>>>>
>>>> On 09/01/2010 12:47 PM, Patrick Hunt wrote:
>>>>
>>>>
>>>>
>>>>> Ben, in this case the session would be tied directly to the connection,
>>>>> we'd explicitly deny session re-establishment for this session type (so
>>>>> 4 would fail). Would that address your concern, others?
>>>>>
>>>>> Patrick
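
A rough model of such a connection-bound session type, as a sketch only.
ConnectionBoundSessions and its methods are hypothetical names, and the choice
of PermissionError to signal a refused re-establishment is an assumption made
for this illustration.

```python
class ConnectionBoundSessions:
    """Toy model of a session type that lives and dies with one connection."""

    def __init__(self):
        self.owner = {}  # session id -> id of the connection that created it

    def open(self, session_id, conn_id):
        self.owner[session_id] = conn_id

    def reattach(self, session_id, new_conn_id):
        # Re-establishment is always refused for this session type,
        # so reconnecting to a different server fails.
        raise PermissionError(f"session {session_id} cannot be re-established")

    def on_connection_closed(self, conn_id):
        # Closing the connection closes every session bound to it.
        closed = [s for s, c in self.owner.items() if c == conn_id]
        for s in closed:
            del self.owner[s]
        return closed


sessions = ConnectionBoundSessions()
sessions.open("s1", "conn-A")
try:
    sessions.reattach("s1", "conn-B")
    reattached = True
except PermissionError:
    reattached = False
closed = sessions.on_connection_closed("conn-A")
```

Because the session can never move to another connection, a close or reset of
its one connection is an unambiguous signal in this model.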
>>>>>
>>>>> On 09/01/2010 10:03 AM, Benjamin Reed wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> i'm a bit skeptical that this is going to work out properly. a server
>>>>>> may receive a socket reset even though the client is still alive:
>>>>>>
>>>>>> 1) client sends a request to a server
>>>>>> 2) client is partitioned from the server
>>>>>> 3) server starts trying to send response
>>>>>> 4) client reconnects to a different server
>>>>>> 5) partition heals
>>>>>> 6) server gets a reset from client
>>>>>>
>>>>>> at step 6 i don't think you want to delete the ephemeral nodes.
>>>>>>
>>>>>> ben
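
The six steps above can be replayed against a toy model to show the false
positive. Everything here is hypothetical (ToyEnsemble, the server names, the
/lock/owner path), and the model deliberately uses the naive policy "any socket
reset closes the session".

```python
class ToyEnsemble:
    """Toy model: one shared session table across all servers."""

    def __init__(self):
        self.session_server = {}  # session id -> server the client is attached to
        self.ephemerals = {}      # session id -> set of ephemeral node paths

    def connect(self, session_id, server):
        self.session_server[session_id] = server
        self.ephemerals.setdefault(session_id, set())

    def create_ephemeral(self, session_id, path):
        self.ephemerals[session_id].add(path)

    def on_socket_reset(self, session_id, server):
        # Naive policy: any reset closes the session and removes its
        # ephemeral nodes, regardless of which server observed the reset.
        self.session_server.pop(session_id, None)
        self.ephemerals.pop(session_id, None)


ensemble = ToyEnsemble()
ensemble.connect("s1", "serverA")
ensemble.create_ephemeral("s1", "/lock/owner")
# steps 1-3: request sent, client partitioned, serverA tries to respond
ensemble.connect("s1", "serverB")          # step 4: client reconnects elsewhere
# step 5: partition heals
ensemble.on_socket_reset("s1", "serverA")  # step 6: serverA gets the reset
session_lost = "s1" not in ensemble.ephemerals
```

In this model the client is alive and attached to serverB, yet its ephemeral
nodes are gone after step 6.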
>>>>>>
>>>>>> On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Yes that's right. Which network issues can cause the socket to close
>>>>>>> without the initiating process closing the socket? In my limited
>>>>>>> experience in this area network issues were more prone to leave dead
>>>>>>> sockets open rather than vice versa so I don't know what to look out
>>>>>>> for.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Camille
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Dave Wright [mailto:wrig...@gmail.com]
>>>>>>> Sent: Tuesday, August 31, 2010 1:14 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: Re: closing session on socket close vs waiting for timeout
>>>>>>>
>>>>>>> I think he's saying that if the socket closes because of a crash
>>>>>>> (i.e. not a normal zookeeper close request) then the session stays
>>>>>>> alive until the session timeout, which is of course true since ZK
>>>>>>> allows reconnection and resumption of the session in case of
>>>>>>> disconnect due to network issues.
>>>>>>>
>>>>>>> -Dave Wright
>>>>>>>
>>>>>>> On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning<ted.dunn...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> That doesn't sound right to me.
>>>>>>>>
>>>>>>>> Is there a Zookeeper expert in the house?
>>>>>>>>
>>>>>>>> On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]<
>>>>>>>> camille.fourn...@gs.com>   wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> I foolishly did not investigate the ZK code closely enough and it
>>>>>>>>> seems that closing the socket still waits for the session timeout
>>>>>>>>> to remove the session.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
