I've been experimenting with the WriteLock implementation to deal with server failure, and I've found that creating a reconnecting ZooKeeper proxy is maybe too simplistic. Instead I'm just making it easy to retry operations (or arbitrary ZK code blocks) using a helper class. It's currently called ProtocolSupport, but I'm open to suggestions for a better name for a base class for higher-level protocol implementations.
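To make the idea concrete, here is a minimal sketch of such a retry helper. It is not the actual ProtocolSupport code from the patch: the names RetryingOperations and ConnectionLostException are made up for illustration, and the exception is a stand-in for whatever ZooKeeper throws on connection loss, so that the sketch compiles without the ZooKeeper jar.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for a ZooKeeper connection-loss exception, so this
// sketch compiles on its own. Real code would catch the ZooKeeper exception.
class ConnectionLostException extends Exception {
    ConnectionLostException(String message) {
        super(message);
    }
}

// Hypothetical helper (illustrative name and API, not the actual patch):
// runs a block of code and retries it when the connection is lost.
class RetryingOperations {
    private final int maxAttempts;
    private final long retryDelayMillis;

    RetryingOperations(int maxAttempts, long retryDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
        this.retryDelayMillis = retryDelayMillis;
    }

    // Runs the whole operation again on connection loss, up to maxAttempts
    // times, sleeping a little longer before each new attempt. Any other
    // exception is passed straight through to the caller.
    <T> T retry(Callable<T> operation) throws Exception {
        ConnectionLostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (ConnectionLostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(retryDelayMillis * attempt);
                }
            }
        }
        // Only reached when every attempt failed with a connection loss.
        throw last;
    }
}
```

The block passed to retry() can contain several ZooKeeper calls, so the whole sequence is re-run on failure rather than just the call that happened to fail. That means the block has to be safe to run more than once.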
Using the WriteLock as an example, it seems you often want the retry logic to cover a number of calls to ZooKeeper (e.g. check if a znode exists and, if it doesn't, try to create it, retrying the whole thing when ZK exceptions like connection loss occur). I'll submit the patch to ZOOKEEPER-78 soon, including this...

https://issues.apache.org/jira/browse/ZOOKEEPER-78

One thing I have found is that I've managed to get a SessionExpiredException in my test case (not sure why though; I thought ZooKeeper automatically kept sending keep-alive pings?). I just wondered what a client should do if that happens; I didn't see any easy way to disconnect and reconnect a ZooKeeper client in this case.

I'm assuming that a SessionExpiredException is always going to be possible, so I've created a small patch that adds a reconnect() method to ZooKeeper, which just closes and recreates the cnxn object. That lets clients handle a SessionExpiredException by forcing a reconnection (to get a new session)...

https://issues.apache.org/jira/browse/ZOOKEEPER-84

(I also added a toString() method for easier debugging when running test cases with multiple clients in the same JVM.)

There's maybe a less drastic way to force the reconnection of a ZooKeeper client, but I figured trashing and recreating the cnxn object is at least the lowest-risk option and a simple patch :) and the code should only be executed rarely, so performance isn't much of an issue.

Thoughts?

2008/7/18 James Strachan <[EMAIL PROTECTED]>:
> <background>
> I work on the ActiveMQ project, which implements the JMS API. It's
> a kinda complex thing that involves a number of objects
> (Connections, Sessions, Producers, Consumers). In some JMS providers
> it's the end user's responsibility to detect a connection failure
> (as opposed to any other kind of error) and then recreate all the
> dependent objects.
>
> We added support for auto-reconnection, which greatly simplifies the
> developer's life; it lets the JMS client automatically deal with any
> socket failures, reconnecting to a broker for you and re-establishing
> all of the in-flight operations (subscriptions, in-progress sends
> and so forth).
> http://activemq.apache.org/how-can-i-support-auto-reconnection.html
>
> Having seen the value of wrapping up the auto-reconnection within a
> proxy, I'm thinking it also has merit for ZK.
> </background>
>
>
> As we start creating protocols/recipes that implement higher-order
> features like locks, leader elections and so forth, we could probably
> do with some kinda auto-reconnecting facade to ZooKeeper just to
> simplify the implementation code of protocols/recipes. It's a kinda
> complex area though, and I'm sure different protocols will want
> different things; but even for something as simple as a lock, I can
> see the value in an auto-reconnecting proxy.
>
> e.g. there are already 5 different method calls in the current WriteLock
> implementation which all really need a custom try/catch around them to
> detect loss of the connection, which then should be wrapped in
> reconnect-retry logic.
>
> What to do about watches is interesting; for now the current
> behaviour seems fine (fire them all, forcing a re-watch), though in
> the future we could re-enable watches on the new server
> connection as an option.
>
> All I'm thinking about for now is a kinda ReconnectingZooKeeper which
> looks like a ZooKeeper object but which internally catches dead
> connections and then tries to reconnect to one of the ZK
> servers under the covers, retrying the current read/write operation
> until the ReconnectPolicy says to fail. e.g. some folks might wanna
> retry connecting forever; others for a certain amount of time or a
> certain number of attempts etc.
>
> So something like...
>
> public class ReconnectingZooKeeper extends ZooKeeper {
>     ...
>     // for each method that reads/writes synchronously
>     public Stat exists(String path) {...
>         for (int count = 0; ; count++) {
>             try {
>                 // really do the method call!
>                 return super.exists(path);
>             } catch (ConnectionClosedException e) {
>                 // let any watches or listeners respond to connection
>                 // loss first before we retry
>                 fireAnyWatchesAndStuff();
>
>                 // give up once the retry policy says so
>                 if (!shouldRetry(count)) {
>                     throw e;
>                 }
>             }
>         }
>     }
> }
>
>
> Any watches should fire when a connection is lost, and all writes
> should be replicated to the new server we connect to, right? So I'm
> thinking, if we had a ReconnectingZooKeeper implementation, we could
> use it with the current WriteLock implementation so that the protocol
> could survive ZK server loss & reconnection while still working.
>
> e.g. on connection loss the leader/lock owner needs to lose the lock
> until it gets it back, just in case; but other than that I think it
> should work.
>
> I'm sure there are some gremlins somewhere in automatically reconnecting;
> though provided the watch mechanism works, clients will be able to do
> the right thing, I think.
>
> Thoughts?
>
> --
> James
> -------
> http://macstrac.blogspot.com/
>
> Open Source Integration
> http://open.iona.com
>

--
James
-------
http://macstrac.blogspot.com/

Open Source Integration
http://open.iona.com