Our server-to-client error reporting has always been weak. It's a PITA to debug in production because, a lot of the time, when a client gets bounced it's not clear from the client side why; you end up having to search the server log (for example, when maxClientCount is exceeded). It would be great to fix this, especially if the server could give the client some insight into why (an error code/message, perhaps). Doing it in a backward-compatible way might be tough, though...
Patrick

On Thu, Aug 4, 2011 at 2:45 PM, Ted Dunning <[email protected]> wrote:
> This is used normally to guarantee in-order data views. If you get
> disconnected from one host in an advanced state and then connect to an out
> of date slave, ZK automatically disconnects you to avoid letting you see
> time go backwards. Your situation is different of course.
>
> On Thu, Aug 4, 2011 at 7:05 PM, Fournier, Camille F. <
> [email protected]> wrote:
>
>> Right now the server just detects that the zxid is wrong, and calls close
>> on the client. The client logs:
>> 15:01:47,593 - INFO
>> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1159] - Unable to
>> read additional data from server sessionid 0x131962b00540000, likely server
>> has closed socket, closing socket connection and attempting reconnect
>> (branch 3.3.3)
>>
>> I will poke around and see if I can figure out a nicer way to indicate this
>> condition. The expired state is perfectly fine for me in my use case.
>>
>> C
>>
>> -----Original Message-----
>> From: Patrick Hunt [mailto:[email protected]]
>> Sent: Thursday, August 04, 2011 1:51 PM
>> To: [email protected]
>> Subject: Re: devops/admin/client question: What do you do when you
>> rollback?
>>
>> On Thu, Aug 4, 2011 at 10:29 AM, Fournier, Camille F.
>> <[email protected]> wrote:
>> > We had an issue here the other day where the ZK servers were running
>> > poorly, and in an effort to get them healthy again we ended up rolling back
>> > the cluster state. While this was, in retrospect, not the right solution to
>> > the problem we were facing, it brought up another problem. Namely, that many
>> > of our clients couldn't reconnect with their sessions because their zxid was
>> > too high (expected), but that the error they got when trying to do that
>> > reconnection was just a vanilla disconnected error. The result was that most
>> > of our clients had to be bounced.
>>
>> Hi Camille, there's a long standing jira on this:
>> https://issues.apache.org/jira/browse/ZOOKEEPER-523
>>
>> > Aside from trying hard to avoid ever rolling back the cluster state, does
>> > anyone have a way they deal with this situation if it occurs? Should we
>> > consider enhancing the error message to the client so we could track the
>> > fact that we were ahead of the quorum zxid and react sensibly? Alternately,
>> > since we were sending a sessionId along with the zxid, perhaps it would be
>> > nice to check to see if the sessionId exists before checking the zxid, which
>> > would send an expired state signal which my client code could handle
>> > cleanly.
>>
>> It seems reasonable that if the client connects to all servers in the
>> ensemble (that it knows about) and sees that it's ahead of each one,
>> it should consider the session expired (we could add a new state, but
>> seems like just treating as expired with a good log message would be
>> better from b/w compat standpoint).
>>
>> I can't recall, does the client have sufficient information to make
>> this determination, or is the server just disconnecting?
>>
>> Patrick
>>
>
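
For illustration, here is a minimal sketch of the idea in the quoted message: treat the session as expired once the client has tried every server it knows about and found itself ahead of each one. This is not ZooKeeper client code. The class name, the method names, and the assumption that the client can learn each server's latest zxid are all hypothetical; as noted in the thread, in 3.3.3 the server simply closes the socket, so the client may not have that information today.

```python
from dataclasses import dataclass, field


@dataclass
class AheadOfEnsembleTracker:
    """Hypothetical helper; not part of any ZooKeeper client API."""

    known_servers: frozenset  # servers from the client's connect string
    last_seen_zxid: int       # highest zxid this client has observed
    behind_servers: set = field(default_factory=set)

    def record_server_zxid(self, server: str, server_zxid: int) -> None:
        # ASSUMPTION: the client can learn the server's latest zxid.
        # The thread leaves open whether that information is available.
        if server not in self.known_servers:
            return
        if server_zxid < self.last_seen_zxid:
            # This server is behind the client, so it would reject the session.
            self.behind_servers.add(server)
        else:
            # The server has caught up; reconnecting to it is fine again.
            self.behind_servers.discard(server)

    def should_treat_as_expired(self) -> bool:
        # Expire only when every known server is behind the client.
        return bool(self.known_servers) and self.behind_servers == set(self.known_servers)
```

A client loop would call record_server_zxid after each failed reconnect attempt and, once should_treat_as_expired returns True, log a clear message and surface the existing expired state rather than a new one, which keeps the change backward compatible for applications that already handle expiry.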
