I work on the ActiveMQ project, which implements the JMS API. JMS is
a kinda complex thing, but it involves a number of objects
(Connections, Sessions, Producers, Consumers). In some JMS providers
it's the end user's responsibility to tell a connection failure
apart from any other kind of error and then recreate all the
dependent objects.

We added support for auto-reconnection, which greatly simplifies the
developer's life; it lets the JMS client automatically deal with any
socket failures, reconnecting to a broker for you and re-establishing
all of the in-flight operations (subscriptions, in-progress sends
and so forth).

Having seen the value of wrapping up the auto-reconnection within a
proxy, I'm thinking it also has merit for ZK.

As we start creating protocols/recipes that implement higher-order
features like locks, leader elections and so forth, we could probably
do with some kinda auto-reconnecting facade to ZooKeeper just to
simplify the implementation code of protocols/recipes. It's a kinda
complex area though, and I'm sure different protocols will want
different things; but even for something as simple as a lock I can
see the value in an auto-reconnecting proxy.

e.g. there are already 5 different method calls in the current WriteLock
implementation, each of which really needs a custom try/catch around it
to detect loss of the connection, and that in turn should be wrapped in
reconnect-retry logic.
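
To make the repeated try/catch concrete, here is a minimal sketch of a
shared retry helper that each of those calls could go through. It
deliberately does not use the ZooKeeper client: ConnectionLostException
and Retry.withRetry are hypothetical names invented for illustration, and
a real version would catch whatever exception the client actually raises
on connection loss and would reconnect before retrying.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for whatever exception signals a lost connection.
class ConnectionLostException extends Exception {
    ConnectionLostException(String message) { super(message); }
}

// Hypothetical helper; not part of ZooKeeper.
final class Retry {
    private Retry() {}

    // Runs the operation, retrying only on connection loss,
    // for at most maxAttempts attempts in total.
    static <T> T withRetry(Callable<T> operation, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConnectionLostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (ConnectionLostException e) {
                // A real implementation would reconnect here before retrying.
                last = e;
            }
        }
        // Every attempt failed with connection loss: give up and report the last one.
        throw last;
    }
}
```

With something like this, each of the lock's calls becomes a one-liner
such as Retry.withRetry(() -> doTheCall(), 3), where doTheCall is again
just a placeholder, and any other kind of exception still propagates
immediately.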

What to do about watches is interesting. For now the current
behaviour seems fine (fire them all, forcing a re-watch), though in
the future we could re-enable watches on the new server connection
as an option.

All I'm thinking about for now is a kinda ReconnectingZooKeeper which
looks like a ZooKeeper object but which internally catches dead
connections and then internally tries to reconnect to one of the ZK
servers under the covers - retrying the current read/write operation
until the ReconnectPolicy says to fail. e.g. some folks might wanna
retry connecting forever; others for a certain amount of time or
certain number of attempts etc.
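
As a rough illustration of the three policies just mentioned (forever,
a time limit, an attempt limit), a ReconnectPolicy might look something
like the sketch below. Only the name ReconnectPolicy comes from this
mail; the two-argument shouldRetry signature and the factory method
names are assumptions made up for the example.

```java
// Hypothetical policy interface; the method signature is an assumption.
interface ReconnectPolicy {
    // attempt: number of failed attempts so far
    // elapsedMillis: time since the first failure, in milliseconds
    boolean shouldRetry(int attempt, long elapsedMillis);

    // Keep retrying no matter how long it takes.
    static ReconnectPolicy forever() {
        return (attempt, elapsedMillis) -> true;
    }

    // Give up once 'limit' attempts have failed.
    static ReconnectPolicy maxAttempts(int limit) {
        return (attempt, elapsedMillis) -> attempt < limit;
    }

    // Give up once 'limitMillis' milliseconds have passed since the first failure.
    static ReconnectPolicy maxDuration(long limitMillis) {
        return (attempt, elapsedMillis) -> elapsedMillis < limitMillis;
    }
}
```

Keeping the policy as a single-method interface means callers can also
pass their own lambda if none of the stock policies fits.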

So something like...

public class ReconnectingZooKeeper extends ZooKeeper {
  // for each method that reads/writes synchronously
  public Stat exists(String path) {...
     boolean retry = true;
     for (int count = 0; retry; count++ ) {
       try {

          // really do the method call!
          return super.exists(path);

       } catch (ConnectionClosedException e) {

          // let any watches or listeners respond to connection
          // loss first before we retry

          if (!shouldRetry(count)) {
             throw e;
          }
       }
     }
  }
}

Any watches should fire when a connection is lost, and all writes
should be replicated to the new server we connect to, right? So I'm
thinking, if we had a ReconnectingZooKeeper implementation, we could
use it with the current WriteLock implementation so that the protocol
could survive ZK server loss & reconnection while still working.

e.g. on connection loss the leader/lock owner needs to lose the lock
until it gets it back, just in case; but other than that I think it
should work.
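
A tiny sketch of that "lose the lock until we get it back" rule, purely
for illustration: LockOwnership and its method names are hypothetical and
not part of ZooKeeper or the current WriteLock. The assumption is that
something (a watch or connection listener) calls onConnectionLost when
the connection drops, and the lock protocol calls onLockAcquired once it
has re-confirmed ownership.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical holder for "do I currently own the lock?".
class LockOwnership {
    private final AtomicBoolean owner = new AtomicBoolean(false);

    // Call once the lock protocol confirms this client holds the lock.
    void onLockAcquired() {
        owner.set(true);
    }

    // Call when the connection drops: assume the lock is gone
    // until the protocol confirms it again.
    void onConnectionLost() {
        owner.set(false);
    }

    boolean isOwner() {
        return owner.get();
    }
}
```

Code guarded by the lock would check isOwner() before doing leader-only
work, so a dropped connection makes it back off until the lock is
confirmed again.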

I'm sure there are some gremlins somewhere in automatically reconnecting;
though provided the watch mechanism works, clients will be able to do
the right thing, I think.



