RE: Tracking down possible network partition

Akihiro Suda Thu, 24 Sep 2015 17:08:54 -0700

Hi,

This JIRA ticket seems related:
https://issues.apache.org/jira/browse/ZOOKEEPER-2246


The patch suggested in the ticket might be helpful for you.


Regards,
Akihiro Suda

-----Original Message-----
From: Bob Sheehan [mailto:[email protected]] 
Sent: Friday, September 25, 2015 4:32 AM
To: [email protected]
Subject: Tracking down possible network partition

We have a similar issue:

3-node ZK cluster in DC1 (e.g. Las Vegas), quorum of 2. Each node is on a
VMware ESXi host in the same rack.

2 observer ZK nodes in DC2 (e.g. Germany). Each node is on a VMware ESXi
host in the same rack.

CentOS 6
ZK version: Cloudera CDH 3.4.5.


  *   Leader election in DC1 appears to take a long time, ~15 minutes. At
some point the TCP connection to one of the three nodes is lost. It
eventually recovers.


  *   Apparently, during leader election the connection to the observers is
lost for ~15 minutes, then it recovers. That leaves a 15-minute window in
which both observers (DC2) cannot communicate with the ZK cluster (DC1). Our
DC2 clients communicate with the observers using the Apache Curator library,
so our API fails during that window because it needs ZK data.

We ran netstat on the TCP ports and are seeing non-zero Send-Q sizes.
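
For anyone who wants to repeat the netstat check, below is a rough Python
sketch that filters Linux `netstat -tn` output for connections with a non-zero
Send-Q. It is an illustration only, not something taken from this thread: the
port set assumes the ZooKeeper defaults (2181 client, 2888 quorum, 3888
election), which may differ from the actual configuration, and the helper names
are made up.

```python
import subprocess

# Assumption: ZooKeeper default ports (client, quorum, leader election).
# Adjust to match clientPort and the server.N lines in zoo.cfg.
ZK_PORTS = {2181, 2888, 3888}


def port_of(addr):
    """Return the port of an 'address:port' string, or None if absent."""
    _, _, port = addr.rpartition(":")
    return int(port) if port.isdigit() else None


def stuck_connections(netstat_output, ports=ZK_PORTS):
    """Return (local, remote, send_q, state) for TCP rows with Send-Q > 0."""
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Linux `netstat -tn` columns:
        # Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if not fields[2].isdigit():
            continue
        send_q = int(fields[2])
        local, remote, state = fields[3], fields[4], fields[5]
        on_zk_port = port_of(local) in ports or port_of(remote) in ports
        if send_q > 0 and on_zk_port:
            rows.append((local, remote, send_q, state))
    return rows


def run_netstat():
    """Run `netstat -tn` and return its output as text (not called here)."""
    return subprocess.run(
        ["netstat", "-tn"], capture_output=True, text=True, check=True
    ).stdout
```

Calling `stuck_connections(run_netstat())` on each node during the outage
window would list the peers whose outbound data is queued but not yet
acknowledged.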


Is there any known fix/patch for this? Suggestions welcome.

Thanks,

Bob
