On 25 June 2015 at 07:28, Round, Mark <[email protected]> wrote:

> I have a 5-node Zookeeper 3.4.6 cluster across 3 data centres (2
> zookeepers in each “main” DC, and a 5th in a 3rd DC for quorum). I see that
> the two nodes in one DC have regular “issues” where they get kicked out of
> the cluster and the ZooKeeperServer process stops for a few minutes until
> the node rejoins. I’d like to know a couple of things, if someone could
> please point me in the direction of the relevant docs I’d greatly
> appreciate it.
>
> 1.) Is it expected behaviour that when a node is kicked from the cluster,
> it will not be allowed to re-join for a period? From the logs below I can
> see that re-establishing a valid cluster took around 15 minutes.
>

I don't think so.

> 2.) It appears that the leader closes connections to the affected followers
> after a “transaction timeout” occurs. Where would I find out what this
> timeout is? Is this the same thing as a session timeout (e.g. the default
> of 20 * tickTime)?
>

https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L496

> 3.) Where can I find the definition of the different fields in the
> election log messages (i.e. what are “n.round”, “n.zxid”, “n.state” and so
> on)?

Not sure if there's a better source than the source:

https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L687

-rgs
