I've been experimenting with the WriteLock implementation to deal with
server failure; I've found that it's maybe too simplistic to create a
reconnecting ZooKeeper proxy; instead I'm just making it easy to retry
operations (or arbitrary ZK code blocks) using a helper class
(currently called ProtocolSupport, but I'm open to suggestions for a
better class name for a base class for higher level protocol

Using the WriteLock as an example, it seems you often want the retry
logic to include a number of calls to ZooKeeper (e.g. check if a
znode exists and, if it doesn't, try to create it; retrying the whole
thing when ZK exceptions like connection loss occur).
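
A minimal sketch of that kind of retry helper might look like the
following. It is not the ProtocolSupport class from the patch:
RetrySupport, retryOperation and the stand-in ConnectionLossException
are hypothetical names, used so the example runs without the ZooKeeper
jar (in real code the exception would be the ZooKeeper one).

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for ZooKeeper's connection-loss exception,
// defined here only so the sketch is self-contained.
class ConnectionLossException extends Exception {}

// Hypothetical helper (an assumption, not the patch's ProtocolSupport).
class RetrySupport {
    private final int maxAttempts;
    private final long delayMillis;

    RetrySupport(int maxAttempts, long delayMillis) {
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    // Runs the whole block again whenever it fails with a connection
    // loss, so a multi-call sequence (exists, then create) is retried
    // as one unit. Any other exception propagates immediately.
    <T> T retryOperation(Callable<T> operation) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (ConnectionLossException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last allowed attempt
                }
                // simple linear back-off before trying again
                Thread.sleep(delayMillis * attempt);
            }
        }
    }
}
```

The block passed to retryOperation would hold the exists/create
sequence, so it needs to be safe to run more than once.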

I'll submit the patch soon to ZOOKEEPER-78 including this...

One thing I have found is I've managed to get a
SessionExpiredException in my test case (not sure why though; I
thought ZooKeeper automatically kept sending keep alive pings?). I
just wondered what a client should do if that happens; I didn't see
any easy way to effectively disconnect and reconnect a ZooKeeper
client in this case.

I'm assuming that the SessionExpiredException is always gonna be
possible; so I've patched ZooKeeper to allow clients to handle a
SessionExpiredException and force a reconnection (to get a new

So I've created a small patch to add a reconnect() method to ZooKeeper
which just closes and recreates the cnxn object...

(I also added a toString() method for easier debugging when running
test cases with multiple clients in the same jvm).

There's maybe a less drastic way to force the re-connection of a
ZooKeeper client; but I figured trashing and recreating the cnxn
object at least is lowest risk and a simple patch :) and the code
should only be executed rarely so performance isn't such an issue.
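
The close-and-recreate idea can be sketched roughly as below. This is
an illustration only, not the actual patch or ZooKeeper's internals:
Connection stands in for the cnxn object, and ReconnectableClient and
its factory argument are hypothetical names.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the client's internal connection (cnxn).
class Connection {
    private boolean open = true;

    void close() { open = false; }

    boolean isOpen() { return open; }
}

// Hypothetical client wrapper (an assumption, not the ZooKeeper class).
class ReconnectableClient {
    private final Supplier<Connection> factory;
    private Connection cnxn;

    ReconnectableClient(Supplier<Connection> factory) {
        this.factory = factory;
        this.cnxn = factory.get();
    }

    // Throws away the old connection and builds a fresh one, which is
    // how a new session would be obtained after a session expiry.
    synchronized void reconnect() {
        cnxn.close();
        cnxn = factory.get();
    }

    synchronized Connection connection() {
        return cnxn;
    }

    // Handy when several clients run in the same JVM during tests.
    @Override
    public synchronized String toString() {
        return "ReconnectableClient[open=" + cnxn.isOpen() + "]";
    }
}
```

A caller that catches a session expiry would call reconnect() and then
re-run its operation against the new connection.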


2008/7/18 James Strachan <[EMAIL PROTECTED]>:
> <background>
> I work on the ActiveMQ project which implements the JMS API - which is
> a kinda complex thing but it involves a number of objects
> (Connections, Sessions, Producers, Consumers). In some JMS providers
> it's the end user's responsibility to deal with detecting a connection
> failure (from any other kind of error) and then automatically
> recreating all the dependent objects.
> We added support for auto-reconnection which greatly simplifies the
> developer's life; it lets the JMS client automatically deal with any
> socket failures, reconnecting to a broker for you and re-establishing
> all of those in-flight operations (subscriptions, in progress sends
> and so forth).
> http://activemq.apache.org/how-can-i-support-auto-reconnection.html
> Having seen the value of wrapping up the auto-reconnection within a
> proxy; am thinking it's also got merits on ZK
> </background>
> As we start creating protocols/recipes that implement higher order
> features like locks, leader elections and so forth we could probably
> do with some kinda auto-reconnecting facade to ZooKeeper just to
> simplify the implementation code of protocols/recipes. It's a kinda
> complex area though and I'm sure different protocols will want
> different things; but even for something so simple as a lock - I can
> see the value in an auto-reconnecting proxy.
> e.g. there's already 5 different method calls in the current WriteLock
> implementation which all really need a custom try/catch around them to
> detect loss of the connection which then should be wrapped in a
> reconnect-retry logic.
> What to do about watches is interesting; though for now the current
> behaviour seems fine (fire them all forcing a re-watch), though we
> could in the future re-enable watches in the new server
> connection as an option.
> All I'm thinking about for now is a kinda ReconnectingZooKeeper which
> looks like a ZooKeeper object but which internally catches dead
> connections and then internally tries to reconnect to one of the ZK
> servers under the covers - retrying the current read/write operation
> until the ReconnectPolicy says to fail. e.g. some folks might wanna
> retry connecting forever; others for a certain amount of time or
> certain number of attempts etc.
> So something like...
> public class ReconnectingZooKeeper extends ZooKeeper {
>  ...
>  // for each method that reads/writes synchronously
>  public Stat exists(String path) {...
>     boolean retry = true;
>     for (int count = 0; retry; count++ ) {
>       try {
>          // really do the method call!
>          return super.exists(path);
>       } catch (ConnectionClosedException e) {
>          // let any watches or listeners respond to connection
>          // loss first before we retry
>          fireAnyWatchesAndStuff();
>          if (!shouldRetry(count)) {
>             throw e;
>          }
>       }
>     }
>  }
> }
> Any watches should fire when a connection is lost - and all writes
> should be replicated to the new server we connect to right? So I'm
> thinking, if we had a ReconnectingZooKeeper implementation, we could
> use it with the current WriteLock implementation so that the protocol
> could survive ZK server loss & reconnection while still working.
> e.g. on connection loss the leader/lock owner needs to lose the lock
> until it gets it back just in case; but other than that I think it
> should work.
> Am sure there's some gremlins somewhere in automatically reconnecting;
> though provided the watch mechanism works, clients will be able to do
> the right thing I think.
> Thoughts?
