Re: Zookeeper session expiration

Shawn Heisey Mon, 04 Dec 2017 11:49:42 -0800

On 12/4/2017 8:22 AM, Anthony Shaya wrote:

My question is about how session expiration works. I noticed that the clocks 
on many of the client machines were off (by anywhere from 1 minute to 20 
minutes; this was resolved after discovery, though I haven't verified that 
completely yet). Can this directly affect session expiration within the 
ZooKeeper cluster?


   *   I read the following in https://wiki.apache.org/hadoop/ZooKeeper/FAQ , 
"Expirations happens when the cluster does not hear from the client within the 
specified session timeout period (i.e. no heartbeat).". So if the times were 
wrong across the machines, is it possible that one of the clients effectively 
sent a heartbeat in the past (not sure about this tbh) and the cluster then 
expired the session?

I make these comments without any knowledge of what ZK code actually does. I am a member of this list because I'm a representative of the Apache Solr project, which uses the ZK client in order to maintain a cluster.

IMHO, any software which makes actual decisions based on the timestamps in messages from another system is badly designed. I would hope that the ZK designers know this, and always make any decisions related to time using the clock in the local system only.

If ZK's designers did the right thing, then a session timeout would indicate that quite literally no heartbeats were received in X seconds, as measured by the local clock, and the local clock ONLY ... NOT from timestamp information received from another system.
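
To make the "local clock only" idea concrete, here is a minimal sketch in Java. It is NOT ZooKeeper's actual code; the class and method names (LocalClockSessionTracker, heartbeatAt, isExpiredAt) are made up for illustration. The point is that the receiver stamps each heartbeat with its own monotonic clock (System.nanoTime()) when the heartbeat arrives, so a wrong wall clock on the sender cannot influence the decision.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not how ZooKeeper is implemented. */
final class LocalClockSessionTracker {
    private final long timeoutNanos;
    // Session id -> local monotonic time at which the last heartbeat ARRIVED.
    private final Map<String, Long> lastHeard = new ConcurrentHashMap<>();

    LocalClockSessionTracker(long timeoutMillis) {
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    /** Record a heartbeat using this machine's monotonic clock. */
    public void heartbeat(String sessionId) {
        heartbeatAt(sessionId, System.nanoTime());
    }

    /** Same as heartbeat(), with the local time passed in (handy for testing). */
    public void heartbeatAt(String sessionId, long localNanos) {
        lastHeard.put(sessionId, localNanos);
    }

    /** True if nothing was heard from the session within the timeout. */
    public boolean isExpired(String sessionId) {
        return isExpiredAt(sessionId, System.nanoTime());
    }

    public boolean isExpiredAt(String sessionId, long localNanos) {
        Long last = lastHeard.get(sessionId);
        // Unknown sessions are treated as expired. Subtract before comparing,
        // as System.nanoTime() values are only meaningful as differences.
        return last == null || localNanos - last > timeoutNanos;
    }
}
```

Note that no timestamp from the client appears anywhere in this sketch. With a design like this, clock skew between machines would not matter; only the time between heartbeat arrivals, as seen by the receiver, would.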

Although such a lack of communication could be caused by any number of things, including network hardware failure, one of the most common reasons I have seen for problems like this is extreme Java garbage collection pauses in the client software.

Situations where the heap is a little bit too small can cause a Java program to basically be doing garbage collection constantly, so it doesn't have much time to do anything else, like send heartbeats to ZK servers.

Situations where the heap is HUGE and garbage collection is not well tuned can lead to pauses of a minute or longer while Java does a massive full GC.
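
One rough way to see whether a JVM is stalling like this is to have a thread sleep for a short interval and measure how much longer than requested it actually slept. The sketch below does that; the class name PauseDetector, the 20 ms interval, and the 1000 ms threshold are all arbitrary choices of mine for illustration. It detects any stall (GC, swapping, CPU starvation), not GC specifically.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stall detector; names and thresholds are illustrative. */
final class PauseDetector {
    /** Extra delay beyond the requested sleep, in milliseconds (never negative). */
    static long overshootMillis(long requestedMillis, long startNanos, long endNanos) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0, elapsedMillis - requestedMillis);
    }

    /** Sleeps once and returns how much longer than requested the thread was stalled. */
    static long measureOnce(long sleepMillis) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(sleepMillis);
        return overshootMillis(sleepMillis, start, System.nanoTime());
    }

    public static void main(String[] args) throws InterruptedException {
        long sleepMillis = 20;       // assumed sampling interval
        long thresholdMillis = 1000; // assumed "worth reporting" stall length
        // Bounded loop so the demo terminates; a real monitor would run as a daemon thread.
        for (int i = 0; i < 50; i++) {
            long stall = measureOnce(sleepMillis);
            if (stall >= thresholdMillis) {
                System.out.println("JVM stalled for about " + stall + " ms");
            }
        }
    }
}
```

If the reported stalls approach or exceed the session timeout, the client could not have sent heartbeats during that window, no matter what the clocks on either side say.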

   *   I don't have the zookeeper node log for the above time to see what was 
going on in zookeeper when the cluster determined the session expired.

   *   Is there any additional logging I can turn on to troubleshoot zk session 
expiration issues?

Hopefully your ZK clients also have logging. Failing that, you could turn on GC logging for the software with the ZK client (assuming it's a Java client) and find a program or website that can examine the log and give you statistics or a graph of GC pauses.
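
If you can run code inside the client JVM, a quick first check before setting up GC logging is to ask the JVM's standard management beans how many collections have happened and how much time they have taken. This is only a sketch; the class name GcStats is mine. The numbers are cumulative totals per collector, not individual pause lengths, and for concurrent collectors the reported time may include work that did not stop the application, so the GC log remains the better source for actual pauses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper that summarizes cumulative GC activity of the current JVM. */
final class GcStats {
    /** Total collection time in ms across all collectors (undefined values are skipped). */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Total number of collections across all collectors (undefined values are skipped). */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("total: collections=%d, timeMs=%d%n", totalGcCount(), totalGcMillis());
    }
}
```

Sampling these totals periodically and comparing consecutive readings gives a rough idea of whether a large share of wall-clock time is going to garbage collection.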

If there is a problem in software using the client and whatever logging is available doesn't help you figure out what's wrong, you're generally going to need to talk to whoever wrote that software for help troubleshooting it.


Thanks,
Shawn
