Yes, Ben, would you give some more details as to why it doesn't work in a cluster? I think I am seeing it work ok in cluster mode as well with some basic tests. There are probably other major problems with this but I would appreciate any direction you could give as to what might go wrong here.
Thanks,
C

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Wednesday, September 08, 2010 4:51 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

To get it to work in a cluster, what would be necessary? A new message to
the leader to describe connection loss?

On Wed, Sep 8, 2010 at 1:03 PM, Benjamin Reed <br...@yahoo-inc.com> wrote:

> unfortunately, that only works on the standalone server.
>
> ben
>
> On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:
>
>> This would be the ideal solution to this problem I think.
>> Poking around the (3.3) code to figure out how hard it would be to
>> implement, I figure one way to do it would be to modify the session timeout
>> to the min session timeout and touch the connection before calling close
>> when you get certain exceptions in NIOServerCnxn.doIO. I did this (removing
>> the code in touch session that returns if the tickTime is greater than the
>> expire time) and it worked (in the standalone server anyway). Interesting
>> solution, or total hack that will not work beyond most basic test case?
>>
>> C
>>
>> (forgive lack of actual code in this email)
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> Sent: Tuesday, September 07, 2010 1:11 PM
>> To: zookeeper-user@hadoop.apache.org
>> Cc: Benjamin Reed
>> Subject: Re: closing session on socket close vs waiting for timeout
>>
>> This really is, just as Ben says a problem of false positives and false
>> negatives in detecting session expiration.
>>
>> On the other hand, the current algorithm isn't really using all the
>> information available. The current algorithm is using time since last
>> client initiated heartbeat. The new proposal is somewhat worse in that it
>> proposes to use just the boolean "has-TCP-disconnect-happened".
>>
>> Perhaps it would be better to use multiple features in order to decrease
>> both false positives and false negatives.
>>
>> For instance, I could imagine that we use the following features:
>>
>> - time since last client heartbeat or disconnect or reconnect
>>
>> - what was the last event? (a heartbeat or a disconnect or a reconnect)
>>
>> Then the expiration algorithm could use a relatively long time since last
>> heartbeat and a relatively short time since last disconnect to mark a
>> session as disconnected.
>>
>> Wouldn't this avoid expiration during GC and cluster partition and cause
>> expiration quickly after a client disconnect?
>>
>> On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt<ph...@apache.org> wrote:
>>
>>> That's a good point, however with suitable documentation, warnings and
>>> such it seems like a reasonable feature to provide for those users who
>>> require it. Used in moderation it seems fine to me. Perhaps we also make
>>> it configurable at the server level for those administrators/ops who
>>> don't want to deal with it (disable the feature entirely, or only enable
>>> on particular servers, etc...).
>>>
>>> Patrick
>>>
>>> On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed<br...@yahoo-inc.com> wrote:
>>>
>>>> if this mechanism were used very often, we would get a huge number of
>>>> session expirations when a server fails. you are trading fast error
>>>> detection for the ability to tolerate temporary network and server
>>>> outages.
>>>>
>>>> to be honest this seems like something that in theory sounds like it
>>>> will work in practice, but once deployed we start getting session
>>>> expirations for cases that we really do not want or expect.
>>>>
>>>> ben
>>>>
>>>> On 09/01/2010 12:47 PM, Patrick Hunt wrote:
>>>>
>>>>> Ben, in this case the session would be tied directly to the connection,
>>>>> we'd explicitly deny session re-establishment for this session type (so
>>>>> 4 would fail). Would that address your concern, others?
>>>>>
>>>>> Patrick
>>>>>
>>>>> On 09/01/2010 10:03 AM, Benjamin Reed wrote:
>>>>>
>>>>>> i'm a bit skeptical that this is going to work out properly. a server
>>>>>> may receive a socket reset even though the client is still alive:
>>>>>>
>>>>>> 1) client sends a request to a server
>>>>>> 2) client is partitioned from the server
>>>>>> 3) server starts trying to send response
>>>>>> 4) client reconnects to a different server
>>>>>> 5) partition heals
>>>>>> 6) server gets a reset from client
>>>>>>
>>>>>> at step 6 i don't think you want to delete the ephemeral nodes.
>>>>>>
>>>>>> ben
>>>>>>
>>>>>> On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:
>>>>>>
>>>>>>> Yes that's right. Which network issues can cause the socket to close
>>>>>>> without the initiating process closing the socket? In my limited
>>>>>>> experience in this area network issues were more prone to leave dead
>>>>>>> sockets open rather than vice versa so I don't know what to look out
>>>>>>> for.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Camille
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Dave Wright [mailto:wrig...@gmail.com]
>>>>>>> Sent: Tuesday, August 31, 2010 1:14 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: Re: closing session on socket close vs waiting for timeout
>>>>>>>
>>>>>>> I think he's saying that if the socket closes because of a crash
>>>>>>> (i.e. not a normal zookeeper close request) then the session stays
>>>>>>> alive until the session timeout, which is of course true since ZK
>>>>>>> allows reconnection and resumption of the session in case of
>>>>>>> disconnect due to network issues.
>>>>>>>
>>>>>>> -Dave Wright
>>>>>>>
>>>>>>> On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning<ted.dunn...@gmail.com> wrote:
>>>>>>>
>>>>>>>> That doesn't sound right to me.
>>>>>>>>
>>>>>>>> Is there a Zookeeper expert in the house?
>>>>>>>>
>>>>>>>> On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]<camille.fourn...@gs.com> wrote:
>>>>>>>>
>>>>>>>>> I foolishly did not investigate the ZK code closely enough and it
>>>>>>>>> seems that closing the socket still waits for the session timeout
>>>>>>>>> to remove the session.
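
A rough sketch of the two-feature expiration rule Ted describes in the quoted thread (time since the last event, plus what kind of event it was) might look like the following. This is only an illustration under stated assumptions: the class name `SessionExpirySketch`, the `LastEvent` enum, and both timeout parameters are hypothetical and are not part of ZooKeeper's session tracking code. It models a single server's view only, so it does not address Ben's point about a client that has already reconnected to a different server.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical illustration only; this is NOT ZooKeeper's session tracker.
 * A session expires after a long silence following a heartbeat or reconnect,
 * but after only a short silence following an observed disconnect.
 */
final class SessionExpirySketch {

    /** The kind of event most recently seen for a session (assumed names). */
    enum LastEvent { HEARTBEAT, DISCONNECT, RECONNECT }

    /** Per-session state: the two features proposed in the thread. */
    private static final class State {
        LastEvent lastEvent;
        long lastEventMillis;
    }

    private final long heartbeatTimeoutMillis;   // relatively long
    private final long disconnectTimeoutMillis;  // relatively short
    private final Map<Long, State> sessions = new HashMap<>();

    SessionExpirySketch(long heartbeatTimeoutMillis, long disconnectTimeoutMillis) {
        this.heartbeatTimeoutMillis = heartbeatTimeoutMillis;
        this.disconnectTimeoutMillis = disconnectTimeoutMillis;
    }

    /** Records the latest event for a session, replacing the previous one. */
    void record(long sessionId, LastEvent event, long nowMillis) {
        State s = sessions.computeIfAbsent(sessionId, id -> new State());
        s.lastEvent = event;
        s.lastEventMillis = nowMillis;
    }

    /** Returns true if the session should be treated as expired at nowMillis. */
    boolean isExpired(long sessionId, long nowMillis) {
        State s = sessions.get(sessionId);
        if (s == null) {
            return true; // unknown sessions are treated as expired in this sketch
        }
        // Pick the timeout based on the last event seen for this session.
        long limit = (s.lastEvent == LastEvent.DISCONNECT)
                ? disconnectTimeoutMillis
                : heartbeatTimeoutMillis;
        return nowMillis - s.lastEventMillis > limit;
    }
}
```

With, say, a 30 second heartbeat timeout and a 2 second disconnect timeout, a client stalled in GC would keep its session for up to 30 seconds of silence, while a session whose socket was seen to close would expire 2 seconds later unless a reconnect is recorded first.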