Hi Cameron,

Did the client receive a session expired event? Sessions don't expire while the cluster has no quorum, so my guess is that the session was revalidated when the cluster re-formed a quorum.
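To make that guess concrete, here is a toy model of the behaviour I have in mind. It is a sketch under assumptions, not ZooKeeper code: `ToyEnsemble` and all of its methods are made-up names, time is a plain counter, client heartbeats are not modelled (the client is assumed silent throughout), and the "restart every session timer when a quorum re-forms" step is my understanding of what happens rather than something verified against your version.

```python
class ToyEnsemble:
    """Toy model (hypothetical): sessions can only be expired while a quorum exists."""

    def __init__(self, size):
        self.size = size      # number of servers in the ensemble
        self.alive = size     # servers currently up
        self.now = 0          # simulated clock, in seconds
        self.sessions = {}    # session_id -> (timeout, deadline)

    def has_quorum(self):
        # A quorum is a strict majority of the ensemble.
        return self.alive > self.size // 2

    def create_session(self, session_id, timeout):
        self.sessions[session_id] = (timeout, self.now + timeout)

    def set_alive(self, alive):
        had_quorum = self.has_quorum()
        self.alive = alive
        if not had_quorum and self.has_quorum():
            # Assumption: when a quorum re-forms, every known session gets a
            # fresh timer starting from the current time.
            self.sessions = {
                sid: (timeout, self.now + timeout)
                for sid, (timeout, _) in self.sessions.items()
            }

    def advance(self, seconds):
        """Move the clock forward and return the sessions expired at the new time."""
        self.now += seconds
        if not self.has_quorum():
            return []  # without a quorum, nothing is in a position to expire sessions
        expired = [sid for sid, (_, deadline) in self.sessions.items()
                   if deadline <= self.now]
        for sid in expired:
            del self.sessions[sid]
        return expired


ens = ToyEnsemble(3)
ens.create_session("client-1", timeout=30)
ens.set_alive(1)            # two servers die, quorum is lost
print(ens.advance(120))     # well past the timeout, but no quorum
ens.set_alive(2)            # one server returns, quorum re-forms
print(ens.advance(10))      # timer was restarted, so still alive
print(ens.advance(30))      # a full timeout after the quorum re-formed
```

In this model the session outlives the outage and only goes away one full timeout after the quorum comes back. It would not explain a session lingering for 20+ minutes, which is why it matters whether the client ever saw the expired event.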

On Thu, May 8, 2014 at 3:31 AM, Cameron McKenzie <[email protected]> wrote:

> Sorry, bashed send prematurely!
>
> Guys,
> I've noticed a weird problem with ephemeral nodes not being cleaned up if
> the session they are tied to times out while ZooKeeper does not have a
> quorum. The situation is basically as follows:
>
> 3 node cluster
> -Client connects to cluster and creates an ephemeral node
> -Two nodes die, so quorum is lost
> -Some time passes (longer than the session timeout negotiated for the
> client that created the ephemeral node)
> -One (or both) of the dead nodes come back and a quorum is reformed.
> -The ephemeral node tied to the session which should have timed out still
> exists and never seems to get cleaned up.
> -If I telnet in on port 2181 and 'dump', then I can see that ZK seems to
> think that the session is still active and associated with the ephemeral
> node in question.
> -It seems to stay in this state for some extended period of time (20+
> minutes). Interestingly, when I happened to fire up zkCli.sh I could see
> that the node was still there, but after I exited, the node seemed to
> disappear shortly afterwards. So, I wonder if the session established by
> zkCli.sh ending somehow triggered the cleanup of this rogue ephemeral node?
>
> Has anyone experienced this issue before? I understand that it's a bit of an
> edge case, but I'm running across it quite frequently when testing changing
> the size of the ZK cluster.
>
> I've thought of a few workarounds for the issue, but I'd like to know if
> it's a known issue.
>
> Any help appreciated!
> cheers
>
> On Thu, May 8, 2014 at 8:15 PM, Cameron McKenzie
> <[email protected]> wrote:
>
>> Guys,
>> I've noticed a weird problem with ephemeral nodes not being cleaned up if
>> the session they are tied to times out while ZooKeeper does not have a
>> quorum. The situation is basically as follows:
>>
>> 3 node cluster
>> -Client connects to cluster and creates an ephemeral node
>> -Two nodes die, so quorum is lost
>> -Some time passes (longer than the session timeout negotiated for the
>> client that created the ephemeral node)
>> -One (or both) of the dead nodes come back and a quorum is reformed.
>> -The ephemeral node tied to the session which should have timed out still
>> exists
