This is a great FAQ topic!

There are two kinds of connection problems:

1) Disconnections: the callback reports that we have disconnected: 
KeeperStateDisconnected. This state is usually due to a server failure or a 
transient communication error, and will hopefully be followed by a 
reconnected callback. The basic idea is that while disconnected from ZooKeeper 
the process does not have a clear idea of the changes that are happening, so 
it should be conservative and assume the worst.

2) Expired session: this callback says that there was a problem, usually a 
network outage, that prevented the client from keeping its session alive, so 
the session timed out. This state is not recoverable. It is game over: a new 
ZooKeeper object needs to be created, and the state stored in ZooKeeper needs 
to be re-queried and set up again.

Here is the best practice for handling these two states:

1) For disconnections, the server should suspend operations that relied on 
information in ZooKeeper. For example, a leader should suspend operations that 
assume it is a leader. Operations resume once the connection is reestablished.

2) For expired sessions, the server should relinquish any rights it received 
from ZooKeeper and rerun the ZooKeeper initialization operations. For example, 
a leader will need to give up leadership, create a new ZooKeeper object and 
rerun the leader election protocol. Restarting the application is a very easy 
way to do this.
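
To make the two cases concrete, here is a rough sketch in plain Java. It does 
not use the ZooKeeper client API: ConnectionEvent, LeaderGuard and the 
reinitialize hook are hypothetical names made up for this illustration. In a 
real application the events would come from the watcher callback, and the hook 
would create the new ZooKeeper object and rerun the election.

```java
/** Hypothetical events, standing in for the states the watcher callback reports. */
enum ConnectionEvent { CONNECTED, DISCONNECTED, EXPIRED }

/** Tracks whether this process may currently act on its leadership. */
final class LeaderGuard {
    // Hypothetical hook: create a new ZooKeeper object and rerun the election.
    private final Runnable reinitialize;
    private boolean connected = false;
    private boolean leader = false;

    LeaderGuard(Runnable reinitialize) {
        this.reinitialize = reinitialize;
    }

    /** Called by the election code (not shown) when this process wins. */
    synchronized void becameLeader() {
        leader = true;
    }

    synchronized void onEvent(ConnectionEvent event) {
        switch (event) {
            case CONNECTED:
                // Connection (re)established: suspended operations may resume.
                connected = true;
                break;
            case DISCONNECTED:
                // Assume the worst: keep the role, but stop acting on it.
                connected = false;
                break;
            case EXPIRED:
                // Not recoverable: give up leadership and start over.
                connected = false;
                leader = false;
                reinitialize.run();
                break;
        }
    }

    /** Leader-only operations should check this before doing any work. */
    synchronized boolean mayActAsLeader() {
        return leader && connected;
    }
}
```

The point of keeping "leader" and "connected" separate is that a disconnect 
only suspends the role, while an expiration drops it until the election has 
been rerun.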

Of course there are always exceptions to these practices. For example, given a 
leader that is established with ZooKeeper and behaves conservatively by 
suspending operations on disconnects, even if a process is disconnected from 
ZooKeeper it could still send requests to the leader process. (A partial 
network partition may cause one process to not be able to connect to ZooKeeper 
and still be able to connect to another process that can connect to ZooKeeper.) 
Personally, I would still write my applications to behave conservatively in 
these situations, since these kinds of partial partitions are difficult to 
test.

ben




----- Original Message ----
From: Anthony Urso <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; zookeeper-user@hadoop.apache.org
Sent: Thursday, July 3, 2008 7:17:32 PM
Subject: [Zookeeper-user] Recipes for dealing with disconnection and connection 
expiration

Anyone have examples of the right way to deal with ZooKeeper
disconnection or connection expiration?

Currently I am exiting and starting fresh, but hopefully there is a
more efficient pattern.

Cheers,
Anthony
