I should note that we are using version 2.9.1.  I believe we rely on Curator to
handle the Lost and Suspended cases; it looks like we’d expect calls to
leaderLatchListener.isLeader and leaderLatchListener.notLeader.  We’ve never
seen long GCs with this app, but I’ll start logging GC activity.
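
For reference, here is a rough sketch of the wiring I’m describing (a sketch
only, not our actual code; the latch path and the becomeActive/becomeStandby
methods are placeholders):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;

// Sketch: a LeaderLatchListener so that isLeader/notLeader drive the
// active/standby transitions.  becomeActive/becomeStandby are placeholders.
class LatchWiring {
    static LeaderLatch startLatch(CuratorFramework client) throws Exception {
        LeaderLatch latch = new LeaderLatch(client, "/myapp/leader-latch");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                becomeActive();    // expected when this instance becomes 'active'
            }

            @Override
            public void notLeader() {
                becomeStandby();   // expected when leadership is lost
            }
        });
        latch.start();
        return latch;
    }

    static void becomeActive()  { /* app-specific: switch to 'active' */ }
    static void becomeStandby() { /* app-specific: switch to 'standby' */ }
}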

Thanks,
Steve

From: Jordan Zimmerman [mailto:[email protected]]
Sent: Wednesday, August 17, 2016 11:23 AM
To: [email protected]
Subject: Re: Leader Latch question

* How do you handle CONNECTION_SUSPENDED and CONNECTION_LOST?
* Was there possibly a very long gc? See 
https://cwiki.apache.org/confluence/display/CURATOR/TN10

-Jordan

On Aug 17, 2016, at 1:07 PM, Steve Boyle 
<[email protected]<mailto:[email protected]>> wrote:

I appreciate your response.  Any thoughts on how the issue may have occurred in 
production?  Or thoughts on how to reproduce that scenario?

In the production case, there were two instances of the app, both configured
with the same list of 5 ZooKeeper servers.

Thanks,
Steve

From: Jordan Zimmerman [mailto:[email protected]]
Sent: Wednesday, August 17, 2016 11:03 AM
To: [email protected]
Subject: Re: Leader Latch question

Manual removal of the latch node isn’t supported. It would require the latch to 
add a watch on its own node and that has performance/runtime overhead. The 
recommended behavior is to watch for connection loss/suspended events and exit 
your latch when that happens.
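
For example, something along these lines (a rough sketch, not a drop-in
implementation; relinquishLeadership() is a placeholder for whatever your app
does to step down):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Sketch: treat SUSPENDED or LOST as "leadership can no longer be assumed".
class ConnectionWatcher {
    static void watch(CuratorFramework client) {
        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
                    relinquishLeadership();   // placeholder: step down / close the latch
                }
            }
        });
    }

    static void relinquishLeadership() { /* app-specific step-down logic */ }
}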

-Jordan

On Aug 17, 2016, at 12:43 PM, Steve Boyle 
<[email protected]<mailto:[email protected]>> wrote:

I’m using the Leader Latch recipe.  I can successfully bring up two instances
of my app and have one become ‘active’ and one become ‘standby’.  Almost
everything works as expected.  We had an issue in production: when adding a
ZooKeeper server to our existing quorum, both instances of the app became ‘active’.
Unfortunately, the log files rolled over before we could check for exceptions.  
I’ve been trying to reproduce this issue in a test environment.  In my test 
environment, I have two instances of my app configured to use a single
ZooKeeper server; this server is part of a 5-node quorum and is not currently the
leader.  I can trigger both instances of the app to become ‘active’ if I use 
zkCli and manually delete the latch path from the single ZooKeeper server to which my
apps are connected.  When I manually delete the latch path, I can see via debug 
logging that the instance that was previously ‘standby’ gets a notification 
from zookeeper “Got WatchedEvent state:SyncConnected type:NodeDeleted”.  
However, the instance that had already been active gets no notification at all. 
 Is it expected that manually removing the latch path would only generate 
notifications to some instances of my app?
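
For context, each instance sets up the latch roughly as follows (a sketch of
the setup described above; the connection string, latch path and participant
id are placeholders):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch: each app instance connects to a single ZooKeeper server (one member
// of the 5-node quorum) and joins the same latch path.
public class LatchSetup {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-node-3:2181",                      // placeholder: single server from the quorum
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderLatch latch = new LeaderLatch(client, "/myapp/leader-latch", "instance-1");
        latch.start();

        Thread.sleep(5000);                            // sketch only: give the election time to settle
        System.out.println(latch.hasLeadership() ? "active" : "standby");
    }
}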

Thanks,
Steve Boyle
