RE: ZooKeeper 3.4.6

All,

        I'm trying to troubleshoot a problem and could use some guidance from 
the experts on ZK administration. I have a cluster of applications that share 
work and that create ephemeral nodes in ZK to represent that work. The intent 
is that, if one application fails, its ephemeral nodes are deleted and the 
other apps can pick up the work that the failed instance is no longer 
completing.

        Yesterday evening, one application instance suffered from some severe 
memory pressure and had to run multiple stop-the-world GC cycles. The pauses 
appear to have triggered a SessionExpiredException in 
org.apache.zookeeper.ClientCnxn$SendThread.run (I correlated multiple "Pause 
Full" statements in the GC logs with the ZK session timeout in the application 
logs). After the timeout, the connection was re-established in under 1,000ms, 
but the ephemeral nodes remained in ZooKeeper, leaving them as orphans. We've 
seen this behavior before and have had to delete the nodes manually using the 
zkCli.sh utility.

        To troubleshoot this issue, I'm trying to correlate the ephemeral owner 
that is listed on a node when you run the 'get' command with the ID of an 
active session. Basically, I'm trying to understand whether ZK thinks there is 
still an active session associated with the ephemeral node. That might explain 
why the ZK server didn't seem to recognize the session timeout that the client 
detected (and that triggered a new connection), and why the ephemeral nodes 
were not deleted, as they should have been, when the connection dropped.

        I've tried the various four-letter commands on the server to see if 
any of them output anything that looks like the ephemeral owner ID, without 
success. Any suggestions/guidance would be greatly appreciated. Note, right 
now, upgrading is not an option, but I'm certainly open to that if there are 
known issues with ephemeral nodes in 3.4 that are addressed in newer versions.
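
        For concreteness, the correlation described above could be sketched 
roughly as follows. This is only an illustration under assumptions, not a 
verified procedure: it assumes the server's four-letter command output shows 
session IDs as 0x-prefixed hex, either alone on a line (optionally followed by 
a colon) or as a sid=0x... field, and that the ephemeral owner from 'get' is 
written in the same hex form. The helper names, the host, and the example 
owner value are hypothetical; the actual output format should be checked 
against a real server before relying on the pattern.

```python
import re
import socket

# Assumed formats: a session ID alone on a line ("0x...", optional colon),
# or a "sid=0x..." field. Other hex fields (e.g. zxids) are ignored.
SESSION_ID = re.compile(r"(?:^[ \t]*|sid=)(0x[0-9a-fA-F]+)", re.MULTILINE)


def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a four-letter command (e.g. 'dump' or 'cons') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def session_ids(server_output):
    """Return the set of session IDs (as integers) found in the output."""
    return {int(tok, 16) for tok in SESSION_ID.findall(server_output)}


def owner_is_known(ephemeral_owner, server_output):
    """True if the ephemeral owner (hex string from 'get') appears as a session."""
    return int(ephemeral_owner, 16) in session_ids(server_output)
```

        A hypothetical call would be 
owner_is_known("0x14abc", four_letter_word("localhost", 2181, "dump")), where 
the host and the owner value are placeholders. Comparing the values as 
integers rather than as strings avoids mismatches caused by letter case or 
leading zeros.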

Regards,
Paul
