I apologize - I was thinking of a different recipe. LeaderLatch does handle 
partitions internally. Maybe it's a GC pause?

> On Aug 17, 2016, at 3:14 PM, Steve Boyle <[email protected]> wrote:
> 
> I should note that we are using version 2.9.1.  I believe we rely on Curator 
> to handle the Lost and Suspended cases; it looks like we’d expect calls to 
> leaderLatchListener.isLeader and leaderLatchListener.notLeader.  We’ve never 
> seen long GCs with this app, but I’ll start logging that.
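> 
> For reference, a minimal sketch of that wiring with Curator's LeaderLatch and 
> LeaderLatchListener (the connect string and latch path below are placeholders, 
> not our real configuration):
> 
>     import org.apache.curator.framework.CuratorFramework;
>     import org.apache.curator.framework.CuratorFrameworkFactory;
>     import org.apache.curator.framework.recipes.leader.LeaderLatch;
>     import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
>     import org.apache.curator.retry.ExponentialBackoffRetry;
> 
>     public class LatchSetup {
>         public static void main(String[] args) throws Exception {
>             // Placeholder connect string.
>             CuratorFramework client = CuratorFrameworkFactory.newClient(
>                     "zk1:2181,zk2:2181,zk3:2181",
>                     new ExponentialBackoffRetry(1000, 3));
>             client.start();
> 
>             // Placeholder latch path.
>             LeaderLatch latch = new LeaderLatch(client, "/myapp/leader-latch");
>             latch.addListener(new LeaderLatchListener() {
>                 @Override
>                 public void isLeader() {
>                     // Called when this instance wins the latch:
>                     // transition the app to 'active'.
>                 }
> 
>                 @Override
>                 public void notLeader() {
>                     // Called when leadership is lost:
>                     // transition the app back to 'standby'.
>                 }
>             });
>             latch.start();
> 
>             Thread.currentThread().join();   // keep the process alive
>         }
>     }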
>  
> Thanks,
> Steve
> From: Jordan Zimmerman [mailto:[email protected]] 
> Sent: Wednesday, August 17, 2016 11:23 AM
> To: [email protected]
> Subject: Re: Leader Latch question
>  
> * How do you handle CONNECTION_SUSPENDED and CONNECTION_LOST? 
> * Was there possibly a very long GC? See 
> https://cwiki.apache.org/confluence/display/CURATOR/TN10
>  
> -Jordan
>  
> On Aug 17, 2016, at 1:07 PM, Steve Boyle <[email protected]> wrote:
>  
> I appreciate your response.  Any thoughts on how the issue may have occurred 
> in production?  Or thoughts on how to reproduce that scenario?
>  
> In the production case, there were two instances of the app – both configured 
> for a list of 5 zookeepers.
>  
> Thanks,
> Steve
>  
> From: Jordan Zimmerman [mailto:[email protected]] 
> Sent: Wednesday, August 17, 2016 11:03 AM
> To: [email protected] <mailto:[email protected]>
> Subject: Re: Leader Latch question
>  
> Manual removal of the latch node isn’t supported. It would require the latch 
> to add a watch on its own node and that has performance/runtime overhead. The 
> recommended behavior is to watch for connection loss/suspended events and 
> exit your latch when that happens. 
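> 
> A rough sketch of that pattern, assuming a ConnectionStateListener registered 
> on the CuratorFramework (the step-down method is a placeholder for whatever 
> your app does to hand off):
> 
>     import org.apache.curator.framework.CuratorFramework;
>     import org.apache.curator.framework.state.ConnectionState;
>     import org.apache.curator.framework.state.ConnectionStateListener;
> 
>     public class LatchConnectionGuard implements ConnectionStateListener {
>         @Override
>         public void stateChanged(CuratorFramework client, ConnectionState newState) {
>             if (newState == ConnectionState.SUSPENDED
>                     || newState == ConnectionState.LOST) {
>                 // Assume leadership is gone the moment the connection is in
>                 // doubt; stop leader-only work until isLeader() fires again.
>                 stepDown();   // placeholder for the app's own hand-off logic
>             }
>         }
> 
>         private void stepDown() {
>             // transition the app back to 'standby'
>         }
>     }
> 
>     // Registered with:
>     // client.getConnectionStateListenable().addListener(new LatchConnectionGuard());
> 
> (Some apps close() and recreate the latch at that point instead of just 
> stepping down.)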
>  
> -Jordan
>  
> On Aug 17, 2016, at 12:43 PM, Steve Boyle <[email protected]> wrote:
>  
> I’m using the Leader Latch recipe.  I can successfully bring up two instances 
> of my app and have one become ‘active’ and one become ‘standby’.  Almost 
> everything works as expected.  We had an issue in production: when adding a 
> zookeeper to our existing quorum, both instances of the app became ‘active’.  
> Unfortunately, the log files rolled over before we could check for exceptions.
> 
> I’ve been trying to reproduce this issue in a test environment.  There, I have 
> two instances of my app configured to use a single zookeeper – this zookeeper 
> is part of a 5-node quorum and is not currently the leader.  I can trigger 
> both instances of the app to become ‘active’ if I use zkCli and manually 
> delete the latch path from the single zookeeper to which my apps are 
> connected.  When I manually delete the latch path, I can see via debug logging 
> that the previously ‘standby’ instance gets a notification from zookeeper: 
> “Got WatchedEvent state:SyncConnected type:NodeDeleted”.  However, the 
> instance that was already active gets no notification at all.  Is it expected 
> that manually removing the latch path would only generate notifications to 
> some instances of my app?
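> 
> For what it’s worth, the same deletion can be driven from a test instead of 
> zkCli; the path below is a placeholder for our latch path, and client is a 
> started CuratorFramework:
> 
>     // deletes the latch parent node and its child lock nodes
>     client.delete().deletingChildrenIfNeeded().forPath("/myapp/leader-latch");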
>  
> Thanks,
> Steve Boyle
