I'm reopening this question for the group. I have attached some sample code 
(against the 3.3 branch) to a JIRA issue. It seems to do what I proposed, 
namely lower the session timeout when an error causes the socket to close. 
https://issues.apache.org/jira/browse/ZOOKEEPER-922

I am very interested in any feedback about what might fail here. I have this 
running in a dev ensemble and it seems to work, but I haven't done any 
extensive testing or considered the effects on observers, etc. Even if the 
community doesn't want the change in ZK because of false positives, I may 
need to use it internally and could use any insights the experts have on 
unintended side effects.

Thanks,
Camille

-----Original Message-----
From: Benjamin Reed [mailto:br...@yahoo-inc.com] 
Sent: Friday, September 10, 2010 4:11 PM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

  ah dang, i should have said "generate a close request for the session 
and push that through the system."

ben

On 09/10/2010 01:01 PM, Benjamin Reed wrote:
>    the problem is that followers don't track session timeouts. they track
> when they last heard from the sessions that are connected to them and
> they periodically propagate this information to the leader. the leader
> is the one that expires the session. your technique only works when the
> client is connected to the leader.
>
> one thing you can do is generate a close request for the socket and push
> that through the system. that will cause it to get propagated through
> the followers and processed at the leader. it would also allow you to
> get your functionality without touching the processing pipeline.
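
To make the distinction above concrete, here is a rough toy model in Python
(not ZooKeeper code; every class and method name is made up for illustration).
It assumes the leader owns the only authoritative session table, followers
only record when they last heard from a client and periodically pass that on,
and a socket error on a follower is turned into a close request for the
session that is forwarded to the leader.

```python
# Toy model only; class and method names are hypothetical, not ZooKeeper APIs.

class Leader:
    """Owns the authoritative session table and is the only node that expires."""

    def __init__(self):
        self.sessions = {}  # session_id -> [timeout, last_heard]

    def create(self, session_id, timeout, now):
        self.sessions[session_id] = [timeout, now]

    def touch(self, session_id, now):
        if session_id in self.sessions:
            self.sessions[session_id][1] = now

    def close(self, session_id):
        # A close request removes the session right away.
        self.sessions.pop(session_id, None)

    def expire(self, now):
        # Remove and return every session whose timeout has elapsed.
        dead = [s for s, (timeout, heard) in self.sessions.items()
                if now - heard > timeout]
        for s in dead:
            del self.sessions[s]
        return dead


class Follower:
    """Only records when it last heard from its clients."""

    def __init__(self, leader):
        self.leader = leader
        self.last_heard = {}  # session_id -> time

    def heard_from(self, session_id, now):
        self.last_heard[session_id] = now

    def sync(self):
        # Periodic propagation of liveness information to the leader.
        for session_id, when in self.last_heard.items():
            self.leader.touch(session_id, when)
        self.last_heard.clear()

    def socket_error(self, session_id):
        # Forward a close request instead of editing a local timeout.
        self.last_heard.pop(session_id, None)
        self.leader.close(session_id)


leader = Leader()
follower = Follower(leader)
leader.create("s1", timeout=30, now=0)
follower.heard_from("s1", now=5)
follower.sync()
still_alive = leader.expire(now=10) == [] and "s1" in leader.sessions
follower.socket_error("s1")
closed = "s1" not in leader.sessions
```

In this model a follower has no timeout of its own to lower, so the only way
it can end a session early is by sending something to the leader.
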
>
> the thing that worries me about this functionality in general is that
> network anomalies can cause a whole raft of sessions to get expired in
> this way. for example, you have 3 servers with load spread well; there
> is a networking glitch that causes clients to abandon a server; suddenly
> 1/3 of your clients will get expired sessions.
>
> ben
>
> On 09/10/2010 12:17 PM, Fournier, Camille F. [Tech] wrote:
>> Ben, could you explain a bit more why you think this won't work? I'm trying 
>> to decide if I should put in the work to take the POC I wrote and complete 
>> it, but I don't really want to waste my time if there's a fundamental reason 
>> it's a bad idea.
>>
>> Thanks,
>> Camille
>>
>> -----Original Message-----
>> From: Benjamin Reed [mailto:br...@yahoo-inc.com]
>> Sent: Wednesday, September 08, 2010 4:03 PM
>> To: zookeeper-u...@hadoop.apache.org
>> Subject: Re: closing session on socket close vs waiting for timeout
>>
>> unfortunately, that only works on the standalone server.
>>
>> ben
>>
>> On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:
>>> This would be the ideal solution to this problem I think.
>>> Poking around the (3.3) code to figure out how hard it would be to 
>>> implement, I figure one way to do it would be to modify the session timeout 
>>> to the min session timeout and touch the connection before calling close 
>>> when you get certain exceptions in NIOServerCnxn.doIO. I did this (removing 
>>> the code in touch session that returns if the tickTime is greater than the 
>>> expire time) and it worked (in the standalone server anyway). Interesting 
>>> solution, or total hack that will not work beyond the most basic test case?
>>>
>>> C
>>>
>>> (forgive lack of actual code in this email)
>>>
>>> -----Original Message-----
>>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>>> Sent: Tuesday, September 07, 2010 1:11 PM
>>> To: zookeeper-u...@hadoop.apache.org
>>> Cc: Benjamin Reed
>>> Subject: Re: closing session on socket close vs waiting for timeout
>>>
>>> This really is, just as Ben says, a problem of false positives and false
>>> negatives in detecting session expiration.
>>>
>>> On the other hand, the current algorithm isn't really using all the
>>> information available.  The current algorithm is
>>> using time since last client initiated heartbeat.  The new proposal is
>>> somewhat worse in that it proposes to use
>>> just the boolean "has-TCP-disconnect-happened".
>>>
>>> Perhaps it would be better to use multiple features in order to decrease
>>> both false positives and false negatives.
>>>
>>> For instance, I could imagine that we use the following features:
>>>
>>> - time since last client heartbeat or disconnect or reconnect
>>>
>>> - what was the last event? (a heartbeat or a disconnect or a reconnect)
>>>
>>> Then the expiration algorithm could use a relatively long time since last
>>> heartbeat and a relatively short time since last disconnect to mark a
>>> session as disconnected.
>>>
>>> Wouldn't this avoid expiration during GC and cluster partition and cause
>>> expiration quickly after a client disconnect?
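
A minimal sketch of the two-feature rule described above, in Python. The
function name, the field names and both timeout values are assumptions made
up for illustration, and treating a reconnect like a heartbeat is also an
assumption; nothing here is ZooKeeper code.

```python
from dataclasses import dataclass

# Hypothetical event labels for the "what was the last event?" feature.
HEARTBEAT, DISCONNECT, RECONNECT = "heartbeat", "disconnect", "reconnect"


@dataclass
class SessionState:
    last_event: str         # one of the three labels above
    last_event_time: float  # seconds, on any monotonic clock


def should_expire(state, now, heartbeat_timeout=30.0, disconnect_timeout=5.0):
    """Long timeout after a heartbeat, short timeout after a disconnect.

    Both default values are illustrative assumptions.
    """
    idle = now - state.last_event_time
    if state.last_event == DISCONNECT:
        return idle > disconnect_timeout
    # Assumption: a reconnect is treated like a heartbeat.
    return idle > heartbeat_timeout


# A client stuck in a long GC pause: last event was a heartbeat 20s ago.
gc_pause = SessionState(HEARTBEAT, 0.0)
# A crashed client: the server saw the disconnect 6s ago.
crashed = SessionState(DISCONNECT, 0.0)
# A client that disconnected and then reconnected elsewhere 6s ago.
reconnected = SessionState(RECONNECT, 4.0)

gc_expired = should_expire(gc_pause, now=20.0)
crash_expired = should_expire(crashed, now=6.0)
reconnect_expired = should_expire(reconnected, now=10.0)
```

With these made-up numbers the paused client survives, the crashed client
expires after a few seconds, and a client that reconnects in time goes back
to the long timeout.
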
>>>
>>>
>>> On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt<ph...@apache.org>    wrote:
>>>
>>>
>>>> That's a good point, however with suitable documentation, warnings and such
>>>> it seems like a reasonable feature to provide for those users who require
>>>> it. Used in moderation it seems fine to me. Perhaps we also make it
>>>> configurable at the server level for those administrators/ops who don't
>>>> want to deal with it (disable the feature entirely, or only enable on
>>>> particular servers, etc...).
>>>>
>>>> Patrick
>>>>
>>>> On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed<br...@yahoo-inc.com> wrote:
>>>>
>>>>
>>>>> if this mechanism were used very often, we would get a huge number of
>>>>> session expirations when a server fails. you are trading fast error
>>>>> detection for the ability to tolerate temporary network and server
>>>>> outages.
>>>>>
>>>>> to be honest this seems like something that in theory sounds like it will
>>>>> work in practice, but once deployed we start getting session expirations
>>>>> for cases that we really do not want or expect.
>>>>>
>>>>> ben
>>>>>
>>>>>
>>>>> On 09/01/2010 12:47 PM, Patrick Hunt wrote:
>>>>>
>>>>>
>>>>>> Ben, in this case the session would be tied directly to the connection,
>>>>>> we'd explicitly deny session re-establishment for this session type (so
>>>>>> 4 would fail). Would that address your concern, others?
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> On 09/01/2010 10:03 AM, Benjamin Reed wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> i'm a bit skeptical that this is going to work out properly. a server
>>>>>>> may receive a socket reset even though the client is still alive:
>>>>>>>
>>>>>>> 1) client sends a request to a server
>>>>>>> 2) client is partitioned from the server
>>>>>>> 3) server starts trying to send response
>>>>>>> 4) client reconnects to a different server
>>>>>>> 5) partition heals
>>>>>>> 6) server gets a reset from client
>>>>>>>
>>>>>>> at step 6 i don't think you want to delete the ephemeral nodes.
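
The six steps above can be walked through with a small toy model in Python.
This is an illustration only: the dictionaries, the function names, the
server labels "A" and "B" and the node path are all hypothetical. It assumes
a naive rule where any socket reset expires the session.

```python
# Hypothetical toy model of the six steps; not ZooKeeper code.
ephemerals = {"s1": ["/locks/worker-1"]}  # session -> its ephemeral nodes
connected_to = {}                         # session -> server now serving it


def connect(session, server):
    connected_to[session] = server


def on_reset_naive(session, server):
    # Naive rule: any socket reset expires the session, whichever server sees it.
    return ephemerals.pop(session, [])


connect("s1", "A")  # 1) client sends a request to server A
                    # 2-3) partition; A keeps trying to send the response
connect("s1", "B")  # 4) client reconnects to server B
                    # 5) partition heals
client_is_alive = connected_to["s1"] == "B"
deleted = on_reset_naive("s1", "A")  # 6) A receives the reset from the client
```

Here `deleted` holds the session's ephemeral node even though the client is
alive and connected to B, which is the outcome step 6 warns about.
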
>>>>>>>
>>>>>>> ben
>>>>>>>
>>>>>>> On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Yes that's right. Which network issues can cause the socket to close
>>>>>>>> without the initiating process closing the socket? In my limited
>>>>>>>> experience in this area network issues were more prone to leave dead
>>>>>>>> sockets open rather than vice versa so I don't know what to look out
>>>>>>>> for.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Camille
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Dave Wright [mailto:wrig...@gmail.com]
>>>>>>>> Sent: Tuesday, August 31, 2010 1:14 PM
>>>>>>>> To: zookeeper-u...@hadoop.apache.org
>>>>>>>> Subject: Re: closing session on socket close vs waiting for timeout
>>>>>>>>
>>>>>>>> I think he's saying that if the socket closes because of a crash (i.e.
>>>>>>>> not a normal zookeeper close request) then the session stays alive
>>>>>>>> until the session timeout, which is of course true since ZK allows
>>>>>>>> reconnection and resumption of the session in case of disconnect due
>>>>>>>> to network issues.
>>>>>>>>
>>>>>>>> -Dave Wright
>>>>>>>>
>>>>>>>> On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning<ted.dunn...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> That doesn't sound right to me.
>>>>>>>>>
>>>>>>>>> Is there a Zookeeper expert in the house?
>>>>>>>>>
>>>>>>>>> On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]<
>>>>>>>>> camille.fourn...@gs.com>     wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I foolishly did not investigate the ZK code closely enough and it
>>>>>>>>>> seems that closing the socket still waits for the session timeout
>>>>>>>>>> to remove the session.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
