Re: Tracking down possible network partition

Raúl Gutiérrez Segalés Fri, 26 Jun 2015 18:12:06 -0700

On 25 June 2015 at 07:28, Round, Mark <[email protected]> wrote:

> I have a 5-node Zookeeper 3.4.6 cluster across 3 data centres (2
> zookeepers in each “main” DC, and a 5th in a 3rd DC for quorum). I see that
> the two nodes in one DC have regular “issues” where they get kicked out of
> the cluster and the ZooKeeperServer process stops for a few minutes until
> the node rejoins. I’d like to know a couple of things, if someone could
> please point me in the direction of the relevant docs I’d greatly
> appreciate it.
>
> 1.) Is it expected behaviour that when a node is kicked from the cluster,
> it will not be allowed to re-join for a period ? From the logs below I can
> see that re-establishing a valid cluster took around 15 minutes.
>


I don't think so.

> 2.) It appears that the leader closes connections to the affected followers
> after a “transaction timeout” occurs. Where would I find out what this
> timeout is ? Is this the same thing as a session timeout (e.g. The default
> of 20 * tickTime) ?
>

https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L496
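
As far as I can tell from the 3.4.x code, it is not the session timeout. The
leader closes the connection when a follower has left a proposal
unacknowledged for longer than tickTime * syncLimit, while client session
timeouts are bounded separately (2 * tickTime to 20 * tickTime by default).
A minimal sketch of the arithmetic, assuming that reading is right; the class
and method names are made up for illustration and the numbers are the usual
sample zoo.cfg values, not values from your cluster:

```java
// Hypothetical helper, not part of ZooKeeper: it only restates how the
// timeouts are derived from zoo.cfg settings, as I understand them.
final class ZkTimeouts {
    // Window the leader allows a follower to ack a proposal (assumption:
    // tickTime * syncLimit, per my reading of LearnerHandler in 3.4.x).
    static int transactionTimeoutMs(int tickTimeMs, int syncLimit) {
        return tickTimeMs * syncLimit;
    }

    // Window a follower gets to connect and sync with the leader initially.
    static int initTimeoutMs(int tickTimeMs, int initLimit) {
        return tickTimeMs * initLimit;
    }

    // Default lower bound on a negotiated client session timeout.
    static int minSessionTimeoutMs(int tickTimeMs) {
        return 2 * tickTimeMs;
    }

    // Default upper bound on a negotiated client session timeout.
    static int maxSessionTimeoutMs(int tickTimeMs) {
        return 20 * tickTimeMs;
    }

    public static void main(String[] args) {
        // Example values only (tickTime=2000, initLimit=10, syncLimit=5).
        int tickTime = 2000, initLimit = 10, syncLimit = 5;
        System.out.println("transaction timeout ms: " + transactionTimeoutMs(tickTime, syncLimit));
        System.out.println("init timeout ms:        " + initTimeoutMs(tickTime, initLimit));
        System.out.println("session timeout ms:     " + minSessionTimeoutMs(tickTime)
                + " to " + maxSessionTimeoutMs(tickTime));
    }
}
```

If that holds, a cross-DC link that stalls for longer than tickTime *
syncLimit would be enough to get the followers dropped, well before any
client session expires.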


> 3.) Where can I find the definition of the different fields in the
> election log messages (I.e. What are “n.round”, “n.zxid”, “n.state” and so
> on) ?


Not sure if there's a better source than the source:
https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L687
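
My reading of the fields, which you should check against that file: n.leader
is the server id being proposed as leader, n.zxid is the last zxid that
proposed leader has seen, n.round is the sender's election round (its logical
clock), n.state is the sender's state (LOOKING, FOLLOWING, LEADING or
OBSERVING), n.sid is the sender's server id, and n.peerEpoch is the proposed
leader's epoch. A received vote replaces the current one if it has a higher
epoch, then a higher zxid, then a higher server id. A rough paraphrase of that
comparison follows; the class and method names are mine, and the quorum-weight
check the real code performs is left out:

```java
// Hypothetical stand-alone paraphrase of the vote ordering used during
// leader election; not the actual ZooKeeper class.
final class VoteOrder {
    // Returns true if the "new" vote should replace the "current" one.
    // Order of precedence: peer epoch, then zxid, then server id.
    static boolean supersedes(long newId, long newZxid, long newEpoch,
                              long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    public static void main(String[] args) {
        // Same epoch and zxid, so the higher server id decides.
        System.out.println(supersedes(3, 0x200L, 1, 1, 0x200L, 1));
        // Lower zxid in the same epoch loses, whatever the server id.
        System.out.println(supersedes(5, 0x100L, 1, 1, 0x200L, 1));
    }
}
```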



-rgs
