I'll change the output level and take a look at the server logs the next time I start seeing the error. So far, I don't recall seeing any timeouts on the client side before the session expiration errors.
Either way I've modified my code to create a new client instance if
there is a fatal exception during leader election, which should help
recovery from the session timeout.

Env is Ubuntu 8.10, JRE 1.6.0_11 x64. Three local QuorumPeerMain instances.

I'll reply again when I have more info. Thanks again guys for your help.

-Tom

2009/2/12 Patrick Hunt <[email protected]>:
> Tom, you might try changing the log4j default log level to DEBUG for the
> rootlogger and appender if you have not already done so (servers and clients
> both). You'll get more information to aid debugging if it does occur again.
> http://hadoop.apache.org/zookeeper/docs/r3.0.1/zookeeperAdmin.html#sc_logging
>
> Also, are you seeing timeouts on the client, or just session expiration on
> the server?
>
> The stat command, detailed here, may also be of use to you:
> http://hadoop.apache.org/zookeeper/docs/r3.0.1/zookeeperAdmin.html#sc_zkCommands
>
> Knowing more about your env, OS & java version in particular, would also help
> us help you narrow things down. :-)
>
> Patrick
>
> Tom Nichols wrote:
>>
>> On Thu, Feb 12, 2009 at 4:11 PM, Benjamin Reed <[email protected]> wrote:
>>>
>>> idleness is not a problem. the client library sends heartbeats to keep the
>>> session alive. the client library will also handle reconnects automatically
>>> if a server dies.
>>
>> That's odd then that I'm seeing this problem. I have a local, 3-node
>> zookeeper quorum, and I have 3 instances of the client also running on
>> the same box. The session expiry doesn't seem to be in response to
>> any severe load on the machine or anything like that. I'll keep an
>> eye on it and see if I can't reproduce the behavior in a distributed
>> environment.
>>
>> I've realized a relatively easy way to deal with this problem -- I can
>> let my thread throw a fatal unchecked exception and then use a
>> ThreadGroup implementation that catches the exception. This in turn
>> spawns a new client thread and adds it back to the same threadGroup.
>>
>> Thanks again guys.
>> -Tom
>>
>>
>>> since session expiration really is a rare catastrophic event. (or at least
>>> it should be.) it is probably easiest to deal with it by starting with a
>>> fresh instance if your session expires.
>>>
>>> ben
>>> ________________________________________
>>> From: Tom Nichols [[email protected]]
>>> Sent: Thursday, February 12, 2009 11:53 AM
>>> To: [email protected]
>>> Subject: Re: Dealing with session expired
>>>
>>> I'm using a timeout of 5000ms. Now let me ask this: Suppose all of
>>> my clients are waiting on some external event -- not ZooKeeper -- so
>>> they are all idle and are not touching ZK nodes, nor are they calling
>>> exists, getChildren, etc etc. Can that idleness cause session expiry?
>>>
>>> I'm running a local quorum of 3 nodes. That is, I have an Ant script
>>> that kicks off 3 <java> tasks in parallel to run ConsumerPeerMain,
>>> each with its own config file.
>>>
>>> Regarding handling of the failure, I suspect I will just have to
>>> reinitialize by creating a new instance of my client(s) that
>>> themselves will have a new ZK instance. I'm using Spring to wire
>>> everything together, which is why it's particularly difficult to
>>> simply re-create a new ZK instance and pass it to the classes using it
>>> (those classes have no knowledge of each other). But I _can_ just
>>> pull a freshly-created (prototype) instance from the Spring
>>> application context, which is where a new ZK client will be wired in.
>>>
>>> The only ramification there is I have to throw the KeeperException as
>>> a fatal exception rather than letting that client try to re-elect. Or
>>> maybe add in some logic to say "if I can't re-elect, _then_ throw an
>>> exception and consider it fatal."
>>>
>>> Thanks guys.
>>>
>>> -Tom
>>>
>>>
>>> On Thu, Feb 12, 2009 at 2:39 PM, Patrick Hunt <[email protected]> wrote:
>>>>
>>>> Regardless of frequency Tom's code still has to handle this situation.
>>>>
>>>> I would suggest that the "two classes" Tom is referring to in his mail, the
>>>> ones that use ZK client object, should either be able to "reinitialize"
>>>> with a new zk session, or they themselves should be discarded and new
>>>> instances created using the new session (not sure what makes more sense
>>>> for his archi...)
>>>>
>>>> Regardless of whether we reuse the session object or create a new one I
>>>> believe the code using the session needs to "reinitialize" in some way --
>>>> there's been a dramatic break from the cluster.
>>>>
>>>> As I mentioned, you can decrease the likelihood of expiration by increasing
>>>> the timeout - but the downside is that you are less sensitive to clients
>>>> dying (because their ephemeral nodes don't get deleted till close/expire
>>>> and if you are doing something like leader election among your clients it
>>>> will take longer for the followers to be notified).
>>>>
>>>> Patrick
>>>>
>>>> Mahadev Konar wrote:
>>>>>
>>>>> Hi Tom,
>>>>> The session expired event means that the server expired the client
>>>>> and that means the watches and ephemerals will go away for that node.
>>>>>
>>>>> How are you running your zookeeper quorum? Session expiry event should be
>>>>> really rare event. If you have a quorum of servers it should rarely
>>>>> happen.
>>>>>
>>>>> mahadev
>>>>>
>>>>>
>>>>> On 2/12/09 11:17 AM, "Tom Nichols" <[email protected]> wrote:
>>>>>
>>>>>> So if a session expires, my ephemeral nodes and watches have already
>>>>>> disappeared? I suppose creating a new ZK instance with the old
>>>>>> session ID would not do me any good in that case. Correct?
>>>>>>
>>>>>> Thanks.
>>>>>> -Tom
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 12, 2009 at 2:12 PM, Mahadev Konar <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Tom,
>>>>>>> We prefer to discard the zookeeper instance if a session expires.
>>>>>>> Maintaining a one to one relationship between a client handle and a
>>>>>>> session makes it much simpler for users to understand the existence and
>>>>>>> disappearance of ephemeral nodes and watches created by a zookeeper
>>>>>>> client.
>>>>>>>
>>>>>>> thanks
>>>>>>> mahadev
>>>>>>>
>>>>>>>
>>>>>>> On 2/12/09 10:58 AM, "Tom Nichols" <[email protected]> wrote:
>>>>>>>
>>>>>>>> I've come across the situation where a ZK instance will have an
>>>>>>>> expired connection and therefore all operations fail. Now AFAIK the
>>>>>>>> only way to recover is to create a new ZK instance with the old
>>>>>>>> session ID, correct?
>>>>>>>>
>>>>>>>> Now, my problem is, the ZK instance may be shared -- not between
>>>>>>>> threads -- but maybe two classes in the same thread synchronize on
>>>>>>>> different nodes by using different watchers. So it makes sense that
>>>>>>>> one ZK client instance can handle this. Except that even if I detect
>>>>>>>> the session expiration by catching the KeeperException, if I want to
>>>>>>>> "resume" the session, I have to create a new ZK instance and pass it
>>>>>>>> to any classes who were previously sharing the same instance. Does
>>>>>>>> this make sense so far?
>>>>>>>>
>>>>>>>> Anyway, bottom line is, it would be nice if a ZK instance could itself
>>>>>>>> recover a session rather than discarding that instance and creating a
>>>>>>>> new one.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>>
>>>>>>>> -Tom
>
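
For reference, the ThreadGroup idea from the thread (let the client thread die with a fatal unchecked exception, catch it in the group, and spawn a replacement thread in the same group) could look roughly like the sketch below. It uses only java.lang and java.util.concurrent and does not touch the ZooKeeper API. The names RestartingThreadGroup, FatalSessionException, clientFactory and maxRestarts are hypothetical and do not come from the thread. The assumption is that the factory returns a brand-new client each time, which in real use would be wired to its own new ZooKeeper handle.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: a ThreadGroup that starts a replacement client thread
 * when a client thread dies with a fatal "session expired" style exception.
 */
class RestartingThreadGroup extends ThreadGroup {

    /** Hypothetical unchecked exception a client throws when it cannot recover. */
    static class FatalSessionException extends RuntimeException {
        FatalSessionException(String message) {
            super(message);
        }
    }

    // Assumption: each call returns a fresh client; in real use it would also
    // create a fresh ZooKeeper handle rather than reuse the expired one.
    private final Supplier<Runnable> clientFactory;
    private final int maxRestarts;
    private final AtomicInteger restarts = new AtomicInteger();

    RestartingThreadGroup(String name, Supplier<Runnable> clientFactory, int maxRestarts) {
        super(name);
        this.clientFactory = clientFactory;
        this.maxRestarts = maxRestarts;
    }

    /** Starts a new client thread that belongs to this group. */
    Thread startClient() {
        Thread t = new Thread(this, clientFactory.get(), getName() + "-client");
        t.start();
        return t;
    }

    /** Number of replacement threads started so far. */
    int restartCount() {
        return restarts.get();
    }

    /**
     * Called by the JVM when a thread in this group dies with an uncaught
     * exception and has no handler of its own.
     */
    @Override
    public synchronized void uncaughtException(Thread t, Throwable e) {
        if (e instanceof FatalSessionException && restarts.get() < maxRestarts) {
            restarts.incrementAndGet();
            startClient(); // replacement joins the same group
        } else {
            // Anything else, or too many restarts: fall back to default handling.
            super.uncaughtException(t, e);
        }
    }
}
```

Usage would be to build the group once with a factory and a restart cap, call startClient(), and have the client code throw FatalSessionException when it catches a session-expired KeeperException. The cap is there so that a client that fails immediately on every start does not respawn forever.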
