I've been experimenting with the WriteLock implementation to deal with
server failure; I've found that its maybe too simplistic creating a
reconnecting ZooKeeper proxy; instead I'm just making it easy to retry
operations (or arbitrary ZK code blocks) using a helper class
(currently called ProtocolSupport but am open to suggestions for a
better class name for a base class for higher level protocol
implementations).
Using the WriteLock as an example; it seems you often want the retry
logic to include a number of calls to ZooKeeper; (e.g. check if a
znode exists, if it doesn't try to create it - retrying the whole
thing when ZK exceptions like connection loss occur etc).
I'll submit the patch soon to ZOOKEEPER-78 including this...
https://issues.apache.org/jira/browse/ZOOKEEPER-78
One thing I have found is I've managed to get a
SessionExpiredException in my test case (not sure why though; I
thought ZooKeeper automatically kept sending keep alive pings?). I
just wondered what a client should do if that happens; I didn't see
any easy way to effectively disconnect and reconnect a ZooKeeper
client in this case.
I'm assuming that the SessionExpiredException is always gonna be
possible; so I've patched ZooKeeper to allow clients to handle a
SessionExpiredException and force a reconnection (to get a new
session).
So I've created a small patch to add a reconnect() method to ZooKeeper
which just closes and recreates the cnxn object...
https://issues.apache.org/jira/browse/ZOOKEEPER-84
(I also added a toString() method for easier debugging when running
test cases with multiple clients in the same jvm).
There's maybe a less drastic way to force the re-connection of a
ZooKeeper client; but I figured trashing and recreating the cnxn
object at least is lowest risk and a simple patch :) and the code
should only be executed rarely so performance isn't such an issue.
Thoughts?
2008/7/18 James Strachan [EMAIL PROTECTED]:
background
I work on the ActiveMQ project which implements the JMS API - which is
a kinda complex thing but it involves a number of objects
(Connections, Sessions, Producers, Consumers). In some JMS providers
its the end users responsibility to deal with detecting a connection
failure (from any other kind of error) and then automatically
recreating all the dependent objects.
We added support for auto-reconnection which greatly simplifies the
developers life; it lets the JMS client automatically deal with any
socket failures, reconnecting to a broker for you and re-establishing
all of those in-flight operations (subscriptions, in progress sends
and so forth).
http://activemq.apache.org/how-can-i-support-auto-reconnection.html
Having seen the value of wrapping up the auto-reconnection within a
proxy; am thinking its also got merits on ZK
/background
As we start creating protocols/recipes that implement higher order
features like locks, leader elections and so forth we could probably
do with some kinda auto-reconnecting facade to ZooKeeper just to
simplify the implementation code of protocols/recipes. Its a kinda
complex area though and I'm sure different protocols will want
different things; but even for something so simple as a lock - I can
see the value in an auto-reconnecting proxy.
e.g. there's already 5 different method calls in the current WriteLock
implementation which all really need a custom try/catch around them to
detect loss of the connection which then should be wrapped in a
reconnect-retry logic.
What to do about watches is interesting; though for now the current
behaviour seems fine (fire them all forcing a re-watch) though we
could though in the future re-enable watches in the new server
connection as an option.
All I'm thinking about for now is a kinda ReconnectingZooKeeper which
looks like a ZooKeeper object but which internally catches dead
connections and then internally tries to reconnect to one of the ZK
servers under the covers - retrying the current read/write operation
until the ReconnectPolicy says to fail. e.g. some folks might wanna
retry connecting forever; others for a certain amount of time or
certain number of attempts etc.
So something like...
public class ReconnectingZooKeeper extends ZooKeeper {
...
// for each method that reads/writes synchronously
public Stat exists(String path) {...
boolean retry = true;
for (int count = 0; retry; count++ ) {
try {
// really do the method call!
return super.exists(path);
} catch (ConnectionClosedException e) {
// lets let any watches or listeners respond to connection
loss first before we retry
fireAnyWatchesAndStuff();
if (!shouldRetry(count)) {
throw e;
}
}
}
Any watches should fire when a connection is lost - and all writes
should be replicated to the new server we connect to right? So I'm
thinking, if we had a ReconnectingZooKeeper implementation, we could
use it