I'm thinking of hacking it through the ConnectResponse session timeout (similar to the way we detect a rejected session). I wrote up a prototype that worked OK this way. We might be able to extend this hack to other things, using that field as an encoded error msg. Thoughts?
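
A rough sketch of the "timeout field as encoded error msg" idea, for illustration only. It assumes the client already treats a non-positive negotiated timeout as a rejected session, so older clients would keep seeing a plain expiry. `TimeoutErrorCodec`, the `Reason` values, and their numeric codes are all hypothetical; none of them exist in ZooKeeper.

```java
import java.util.Optional;

// Hypothetical helper: packs a rejection reason into the session timeout field.
final class TimeoutErrorCodec {

    /** Hypothetical reasons a server might refuse a session (codes are made up). */
    enum Reason {
        SESSION_EXPIRED(0),
        CLIENT_ZXID_AHEAD(1),
        TOO_MANY_CONNECTIONS(2);

        final int code;

        Reason(int code) {
            this.code = code;
        }
    }

    /** Server side: a reason is sent as a non-positive value in the timeout field. */
    static int encode(Reason reason) {
        return -reason.code;
    }

    /**
     * Client side: a positive value is a real negotiated timeout (empty result);
     * anything else is decoded as a rejection reason.
     */
    static Optional<Reason> decode(int timeoutField) {
        if (timeoutField > 0) {
            return Optional.empty();
        }
        for (Reason r : Reason.values()) {
            if (r.code == -timeoutField) {
                return Optional.of(r);
            }
        }
        // Unknown code, e.g. from a newer server: fall back to plain expiry.
        return Optional.of(Reason.SESSION_EXPIRED);
    }
}
```

A client that understands the encoding could log the decoded reason before moving to the expired state; a client that does not would only see a timeout that is not positive and behave as it does today.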
C

On Aug 4, 2011 6:10 PM, "Patrick Hunt" <[email protected]> wrote:
> Our error reporting server->client has always been weak. It's a PITA
> to debug in production because a lot of times when the client gets
> bounced it's not clear from the client side why (you end up having to
> search the server log - for example when maxClientCount is exceeded).
> It would be great to fix this, esp if the server could provide insight
> to the client about why (an error code/message perhaps). Doing it in a
> b/w compatible way might be tough though...
>
> Patrick
>
> On Thu, Aug 4, 2011 at 2:45 PM, Ted Dunning <[email protected]> wrote:
>> This is used normally to guarantee in-order data views. If you get
>> disconnected from one host in an advanced state and then connect to an out
>> of date slave, ZK automatically disconnects you to avoid letting you see
>> time go backwards. Your situation is different of course.
>>
>> On Thu, Aug 4, 2011 at 7:05 PM, Fournier, Camille F. <
>> [email protected]> wrote:
>>
>>> Right now the server just detects that the zxid is wrong, and calls close
>>> on the client. The client logs:
>>> 15:01:47,593 - INFO
>>> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1159] - Unable to
>>> read additional data from server sessionid 0x131962b00540000, likely server
>>> has closed socket, closing socket connection and attempting reconnect
>>> (branch 3.3.3)
>>>
>>> I will poke around and see if I can figure out a nicer way to indicate this
>>> condition. The expired state is perfectly fine for me in my use case.
>>>
>>> C
>>>
>>> -----Original Message-----
>>> From: Patrick Hunt [mailto:[email protected]]
>>> Sent: Thursday, August 04, 2011 1:51 PM
>>> To: [email protected]
>>> Subject: Re: devops/admin/client question: What do you do when you
>>> rollback?
>>>
>>> On Thu, Aug 4, 2011 at 10:29 AM, Fournier, Camille F.
>>> <[email protected]> wrote:
>>> > We had an issue here the other day where the ZK servers were running
>>> > poorly, and in an effort to get them healthy again we ended up rolling back
>>> > the cluster state. While this was, in retrospect, not the right solution to
>>> > the problem we were facing, it brought up another problem. Namely, that many
>>> > of our clients couldn't reconnect with their sessions because their zxid was
>>> > too high (expected), but that the error they got when trying to do that
>>> > reconnection was just a vanilla disconnected error. The result was that most
>>> > of our clients had to be bounced.
>>>
>>> Hi Camille, there's a long standing jira on this:
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-523
>>>
>>> > Aside from trying hard to avoid ever rolling back the cluster state, does
>>> > anyone have a way they deal with this situation if it occurs? Should we
>>> > consider enhancing the error message to the client so we could track the
>>> > fact that we were ahead of the quorum zxid and react sensibly? Alternately,
>>> > since we were sending a sessionId along with the zxid, perhaps it would be
>>> > nice to check to see if the sessionId exists before checking the zxid, which
>>> > would send an expired state signal which my client code could handle
>>> > cleanly.
>>>
>>> It seems reasonable that if the client connects to all servers in the
>>> ensemble (that it knows about) and sees that it's ahead of each one,
>>> it should consider the session expired (we could add a new state, but
>>> seems like just treating as expired with a good log message would be
>>> better from b/w compat standpoint).
>>>
>>> I can't recall, does the client have sufficient information to make
>>> this determination, or is the server just disconnecting?
>>>
>>> Patrick
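
The suggestion at the bottom of the thread (treat the session as expired once the client has seen that it is ahead of every server it knows about) might look roughly like the check below. This is only a sketch: `ZxidAheadCheck` and `aheadOfAll` are hypothetical names, and it assumes the client can somehow learn each server's last zxid, which per the thread it may not be able to do today because the server simply closes the connection.

```java
import java.util.Collection;

// Hypothetical client-side check; not part of the ZooKeeper client.
final class ZxidAheadCheck {

    /**
     * Returns true only if at least one server zxid is known and the client's
     * last seen zxid is strictly greater than every one of them.
     */
    static boolean aheadOfAll(long clientZxid, Collection<Long> serverZxids) {
        if (serverZxids.isEmpty()) {
            // No information yet: do not declare the session expired.
            return false;
        }
        for (long serverZxid : serverZxids) {
            if (clientZxid <= serverZxid) {
                // At least one server has caught up; reconnecting may still work.
                return false;
            }
        }
        return true;
    }
}
```

When the check returns true the client could log a clear message and move the session to the expired state, which keeps the existing state machine unchanged for b/w compatibility.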
