Thanks for this, Michael; a bit of clarification/advice inline below:

Michael Bauland wrote:

>> This is very different from most uses of ZK. Normally ZooKeeper is used on
>> a private network with almost no packet loss. Your high rate of loss may be
>> tickling a protocol bug that other people just never see.

> I'm not sure, but maybe my input could help, too. As I mentioned
> earlier, I also run the three zookeeper servers not in a local
> environment but across two sites in different countries (soon, in
> production, it'll be three sites).
> I'm accessing the zookeeper ensemble continuously with about 5 clients
> for testing purposes. These clients are running on the same machine as
> one of the three zookeeper servers. I have attached logs of two of the
> zookeeper servers (one leader and one follower).
> My clients have a way to recover from connection loss by trying ten
> times, then they wait for 5 seconds, close the zookeeper connection and
> open a new one and try up to ten times again; then again waiting 10
> secs, closing, reconnecting, trying, etc., up to 50 seconds, then they fail.

The ZK client lib handles connection loss automatically; I'm not sure what you mean by "trying ten times" to recover from connection loss.

Basically the client lib will notify your watcher when the session is disconnected from the cluster. Any "in process" operations will get the "connection loss" exception. Once this happens you want to wait: the ZK client lib will attempt to connect to the cluster again (one of the servers in the connect string), and once it does it will notify you again via the watcher (SyncConnected event).

If the session reconnects to the cluster within the session timeout, all watches are restored and nothing is necessary on your part. If you exceed the timeout, the cluster will have expired your session and you will be notified of this immediately upon reconnecting to the cluster.

See item 3 in the FAQ; there's some detail there.

So really in your code you want to:

1) if you get disconnected, just wait till you get reconnected; our library handles all that for you.
2) if you get expired, then you must create a new session.
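
For example, here's a minimal sketch of that pattern (class name, connect string and timeout are made up for illustration; only the Expired case needs real handling):

    import java.io.IOException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionWatcher implements Watcher {
        private static final String CONNECT = "serverA:2181,serverB:2181,serverC:2181";
        private volatile ZooKeeper zk;

        public void connect() throws IOException {
            zk = new ZooKeeper(CONNECT, 30000, this);
        }

        public void process(WatchedEvent event) {
            switch (event.getState()) {
            case SyncConnected:
                // (Re)connected within the session timeout: the lib restored
                // our watches, nothing to do.
                break;
            case Disconnected:
                // Transient: the lib is already retrying the servers in the
                // connect string. Just wait for SyncConnected or Expired.
                break;
            case Expired:
                // The one case that needs action: the old session is gone,
                // create a brand new one.
                try {
                    connect();
                } catch (IOException e) {
                    // back off and retry as appropriate
                }
                break;
            default:
                break;
            }
        }
    }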

I seem to remember you may have had some issues with local vs remote though... so it may not be as clear-cut. But in general our client lib handles all of this for you.

> Sometimes all of the clients fail at about the same time. When I look at
> the zookeeper logs I see that they report that the client connection
> limit has been reached (which is set to 10, since I didn't set the value
> at all), although this shouldn't happen since I have just 5 clients
> running and none of them opens two connections at the same time, since
> they're just single-threaded.

These clients are all on the same IP, right? Note that the connection limit (maxClientCnxns) is enforced per client IP address.
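
If you ever do need more concurrent connections from a single IP, the limit is configurable in the server's zoo.cfg; for example (value made up):

    # max concurrent connections from a single client IP (0 = unlimited)
    maxClientCnxns=20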

Are you sure you are closing the sessions in all cases (the old ones)? It could also be the case that you:

1) create a session, it gets disconnected, so you close it, and
2) create a new session (and so on)

The session created in 1) (even if you closed it) may still be considered alive by the server until a) it expires, or b) the session close in 1) eventually makes its way back to the server.
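
To illustrate, here's a hypothetical sketch of that close/recreate cycle (connect string made up, 5s timeout as your logs suggest):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ReconnectChurn {
        static final String CONNECT = "A.B.C.D:2181,A.B.C.E:2181"; // hypothetical
        static final Watcher NOOP = new Watcher() {
            public void process(WatchedEvent event) { }
        };

        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper(CONNECT, 5000, NOOP);
            // ... suppose the session gets disconnected here ...
            zk.close(); // tears down the local handle immediately, but while
                        // disconnected the close request never reaches the server
            zk = new ZooKeeper(CONNECT, 5000, NOOP);
            // From the server's point of view the first session can still be
            // alive until its timeout expires, so rapid close/recreate cycles
            // from a single IP can count against maxClientCnxns.
            zk.close();
        }
    }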

> To me it seems that there are some connections which actually shouldn't
> be there (anymore).

The easiest way to see this is to use the "dump" command on the leader. It will tell you which sessions the cluster thinks are still active. Run this command while running your test and see what it reports...
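
For example, from a shell (assuming the leader is listening on the standard client port):

    echo dump | nc <leader-host> 2181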

> For the logs, note that the clients are also running on the server which
> is at that time the follower, and I replaced my IPs by letters.

Take a look at this session 0x226f49bb5d78ea8 for example:

2010-03-08 15:27:20,525 - INFO [NIOServerCxn.Factory:2181:nioserverc...@639] - Creating new session 0x226f49bb5d78ea8
2010-03-08 15:27:22,253 - INFO [CommitProcessor:0:nioserverc...@992] - Finished init of 0x226f49bb5d78ea8 valid:true
2010-03-08 15:27:22,254 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x226f49bb5d78ea8 due to Read error
2010-03-08 15:27:22,254 - INFO [NIOServerCxn.Factory:2181:nioserverc...@857] - closing session:0x226f49bb5d78ea8 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/A.B.C.E:2181 remote=/A.B.C.D:49258]
2010-03-08 15:27:28,000 - INFO [SessionTracker:sessiontrackeri...@133] - Expiring session 0x226f49bb5d78ea8
2010-03-08 15:27:28,001 - INFO [SessionTracker:zookeeperser...@326] - Expiring session 0x226f49bb5d78ea8
2010-03-08 15:27:28,001 - INFO [ProcessThread:-1:preprequestproces...@384] - Processed session termination request for id: 0x226f49bb5d78ea8

Looks like it's using a 5 second timeout. I can see that it's connecting to the server, then the connection is failing (or the client is closing? I don't have the log for that) after just a fraction of a second. Then the server is expiring the session 5 seconds later. This means that the server never saw a close from the client...

Here's an example where the client called close:

2010-03-08 15:27:19,053 - INFO [ProcessThread:-1:preprequestproces...@384] - Processed session termination request for id: 0x226f49bb5d78ea6
2010-03-08 15:27:20,099 - INFO [CommitProcessor:0:nioserverc...@857] - closing session:0x226f49bb5d78ea6 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/A.B.C.E:2181 remote=/A.B.C.D:49249]

Check out time index 15:27:20,605 in the leader logs. Looks to me like network connectivity problems... notice that a bit later the clients are renewing the sessions successfully.

As I mentioned, try using the "dump" command (see above) and see if that sheds any light on the maxClientCnxns issue.

You could also look at both the client and server logs and correlate the sessions in your client against information in the server (i.e. which sessions are active at which times).
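
On the client side, something like this (hypothetical host and timeout) prints the session id in the same hex form the server logs use, so you can grep for it:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.Watcher.Event.KeeperState;
    import org.apache.zookeeper.ZooKeeper;

    public class LogSessionId {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("A.B.C.D:2181", 5000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await(); // session id is 0 until the handshake completes
            // e.g. "session 0x226f49bb5d78ea8", matching the server log format
            System.out.println("session 0x" + Long.toHexString(zk.getSessionId()));
            zk.close();
        }
    }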

Hope this helps, regards,

