Hello everybody, I've been working on a situation with ZooKeeper leader elections, and there seems to be an odd rule about session timeouts currently in place in the Java client.
The task I'm trying to accomplish is the following: I have multiple clients running the same body of code. For consistency in the "everything works" case, only one machine should actually be doing work at a time, but when that machine fails for any reason, I need another machine to pick up the work as quickly as possible. So far, so good: do a leader election in ZooKeeper (sketched below) and we're set. However, these processes are long-running, and they do not themselves need to interact with ZooKeeper in any way.

What I've been seeing is that if ZooKeeper itself goes down, or if the leader becomes partitioned from the ZK cluster for longer than the session timeout, the ZooKeeper client seems to retry the connection indefinitely in the background, but the application code never receives a session expired event. From reading the mailing list archives, I found this thread <http://www.mail-archive.com/zookeeper-user@hadoop.apache.org/msg01274.html>, which indicates that a SessionExpired event does not get fired until the ZooKeeper client reconnects to the cluster. For my use case this is not ideal: once the leader has been elected, it may run for months (if all goes well) without needing to contact ZooKeeper again. So I could end up in an inconsistent state where the process is running on two separate clients, because the original leader has been ousted but doesn't know it.

This isn't terribly difficult to work around: I can create a background thread that pings ZooKeeper every N milliseconds and, if the connection has been lost for longer than the session timeout, fires a SessionExpired event back to the application so it can kill itself (a rough sketch of what I mean is appended below). But it made me wonder why this particular choice was made. Judging from the log output of a client, ClientCnxn basically falls into an infinite loop of trying to reconnect to ZK until it succeeds, at which point (or some point soon after, perhaps the next time somebody tries to use that client instance?) the session expiration is dealt with.

It seems to me, though, that all the information is already there: once the session timeout has been exceeded without connecting to ZK, we know that instance is shot, so why wait until we've reconnected to fire the session expiration? Why not fire it right away and then give up trying? Is there a performance or consistency reason why that wouldn't work?

Thanks for the help,
Scott
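
For concreteness, here is roughly the election setup I'm describing. It's just the usual ephemeral-sequential recipe; the paths and class names below are placeholders, not anything from my actual code:

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
    // Assumes the persistent znode /election already exists.
    public static boolean tryBecomeLeader(ZooKeeper zk)
            throws KeeperException, InterruptedException {
        // Each candidate creates an ephemeral sequential znode under /election...
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // ...and whoever holds the lowest sequence number is the leader.
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        return me.endsWith(children.get(0));
    }
}

(The losers would normally watch their predecessor so they can take over, but that part isn't relevant to the question.)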
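
And here is a minimal sketch of the watchdog workaround I mentioned. The class name, the callback, and the use of exists() as a ping are just illustrative; nothing like this exists in the client itself:

import org.apache.zookeeper.ZooKeeper;

// Hypothetical watchdog: polls the server with a cheap exists() call and
// assumes the session is dead once we have gone longer than the session
// timeout without a successful round trip.
public class SessionWatchdog implements Runnable {
    private final ZooKeeper zk;
    private final long sessionTimeoutMs;
    private final long pollIntervalMs;
    private final Runnable onPresumedExpired; // e.g. step down / kill the worker

    public SessionWatchdog(ZooKeeper zk, long sessionTimeoutMs,
                           long pollIntervalMs, Runnable onPresumedExpired) {
        this.zk = zk;
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.pollIntervalMs = pollIntervalMs;
        this.onPresumedExpired = onPresumedExpired;
    }

    @Override
    public void run() {
        long lastContact = System.currentTimeMillis();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                zk.exists("/", false);       // any cheap synchronous call will do
                lastContact = System.currentTimeMillis();
            } catch (Exception e) {
                // ConnectionLoss etc.: keep the old timestamp and keep polling
            }
            if (System.currentTimeMillis() - lastContact > sessionTimeoutMs) {
                onPresumedExpired.run();     // assume we've been ousted
                return;
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
    }
}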