Hi, We have added code to our application to reconnect and re-establish watches when we receive a Disconnected event. I am running tests on a heavily loaded system where the zookeeper server and clients are all impacted. On this test system we regularly experience session timeouts and appropriately react to reconnect and set up our watches. There is an uncommon case that I am having trouble puzzling out. When running one of our tests in a loop about 1% of the time we hit a case where on the client side we think the session has expired but on the server side it has been renewed. We will then fail to be able to create an ephemeral node because it already exists and does not ever get cleaned up (since the previous session is still valid). I'm trying to figure out if we are misusing the API or if we have encountered a bug. I'm happy to provide more details. One thing I am wondering is if it is inappropriate to create a new session within the event thread of another session which has received the disconnected event.
Thanks, Martin Serrano The logs are filtered a little just to keep this email a reasonable size, but everything relevant to the session in question is here. Associated client application logs (the reconnecting message is ours where we re-establish watches, etc using a new session): 2011-09-12 18:49:55,729 INFO [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Session level event, session 1325fd35663000d: WatchedEvent state:Disconnected type:None path:null 2011-09-12 18:49:55,729 INFO [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Reconnecting, current session 1325fd35663000d 2011-09-12 18:49:55,729 INFO [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Connecting to zookeeper on localhost:16015 2011-09-12 18:50:08,823 INFO [main_master_10.2.1.100_16040] - obj[76a9b9c] Reestablishing agent watches 2011-09-12 18:50:09,120 WARN [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - obj[1e5b04ae] Failed to write status node multinode-monitor/master. Saw previous value null, retrying org.apache.zookeeper.KeeperException$NodeExistsException - KeeperErrorCode = NodeExists for multinode-monitor/master Associated client zookeeper logs: 2011-09-12 18:48:15,995 INFO ClientCnxn [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Session establishment complete on server localhost/127.0.0.1:16015, sessionid = 0x1325fd35663000d, negotiated timeout = 40000 2011-09-12 18:49:31,245 INFO ClientCnxn [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Client session timed out, have not heard from server in 26672ms for sessionid 0x1325fd35663000d, closing socket connection and attempting reconnect 2011-09-12 18:49:55,729 INFO ZooKeeper [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Initiating client connection, connectString=localhost:16015 sessionTimeout=600000 watcher=com.attivio.global.AieZooKeeper@4f536eec<mailto:watcher=com.attivio.global.AieZooKeeper@4f536eec> 2011-09-12 18:50:08,057 INFO ClientCnxn [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Session establishment complete on server localhost/127.0.0.1:16015, sessionid = 0x1325fd35663000d, negotiated timeout = 40000 Associated server zookeeper logs: 2011-09-12 18:48:16,010 INFO NIOServerCnxn [SyncThread:0] - Established session 0x1325fd35663000d with negotiated timeout 40000 for client /127.0.0.1:4257 2011-09-12 18:49:54,120 WARN NIOServerCnxn [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Exception causing close of session 0x1325fd35663000d due to java.io.IOException: An established connection was aborted by the software in your host machine 2011-09-12 18:49:54,120 INFO NIOServerCnxn [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Closed socket connection for client /127.0.0.1:4257 which had sessionid 0x1325fd35663000d 2011-09-12 18:50:08,057 INFO NIOServerCnxn [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Client attempting to renew session 0x1325fd35663000d at /127.0.0.1:4566 2011-09-12 18:50:08,057 INFO NIOServerCnxn [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Established session 0x1325fd35663000d with negotiated timeout 40000 for client /127.0.0.1:4566
