Re: Switching from State suspended, to lost, to suspended

Robert Hodges Thu, 14 Nov 2013 11:26:43 -0800

Hi, 

I have been looking at the same problem as Henrik.  Just to be clear, the 
problem is the following:  a process wants to make state updates that are only 
safe to do while it has the leader role.


If this is correctly stated, there are three cases that are interesting.  

a.) Ensuring within the process that you have the leadership role when you 
start. 
b.) Ensuring that the process does not give up leadership while such updates 
are proceeding. 
c.) Handling the case where the process loses leadership during the operation, 
leading to a late update 

I was planning on handling cases a and b using a shared lock within each 
process that can become leader.  To perform updates, threads need to acquire 
the shared lock, which is only granted if the process has leadership to begin 
with.  To give up leadership you need to acquire the lock exclusively, which 
means the leader callback must wait for the shared locks to be released before 
returning to Curator. 
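
A minimal sketch of the shared/exclusive lock idea, using only the JDK's 
ReentrantReadWriteLock.  The class and method names (LeadershipGate, 
tryBeginUpdate, and so on) are hypothetical and not part of Curator; the 
wiring into the leader callback is assumed, not shown.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical helper, NOT a Curator class: gates state updates on leadership.
final class LeadershipGate {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private boolean leader = false; // guarded by lock

    // Assumed to be called from the leader callback once leadership is granted.
    void becomeLeader() {
        lock.writeLock().lock();
        try {
            leader = true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Case a: the shared lock is only granted while this process is leader.
    // On true, the same thread must later call endUpdate().
    boolean tryBeginUpdate() {
        lock.readLock().lock();
        if (!leader) {
            lock.readLock().unlock();
            return false;
        }
        return true;
    }

    // Releases the shared lock taken by a successful tryBeginUpdate().
    void endUpdate() {
        lock.readLock().unlock();
    }

    // Case b: taking the lock exclusively blocks until every in-flight
    // update has released its shared lock, so the leader callback cannot
    // return to Curator while updates are still running.
    void resignLeadership() {
        lock.writeLock().lock();
        try {
            leader = false;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

An updater would wrap its work in tryBeginUpdate() / endUpdate() with a 
try/finally, and the leader callback would call resignLeadership() just before 
returning.  This only covers cases a and b; it does nothing for case c.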

Case c is the hard one.  One option is to put a callback on the lock so that 
clients holding it will receive an interrupt.  However, there's still a race 
condition hiding under there as Arie points out, so this is only a partial 
solution--in fact it's really identical to checking the flags as described 
below.  
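
To make the flag check concrete, here is a rough sketch of a revocation flag 
that, unlike the thread interrupt flag, is not cleared by unrelated code.  
RevocationFlag and its methods are hypothetical names, not Curator API, and it 
is assumed that some connection-state callback calls revoke() on SUSPENDED or 
LOST.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper, NOT a Curator class.
final class RevocationFlag {
    private final AtomicBoolean revoked = new AtomicBoolean(false);

    // Assumed to be called from a connection-state callback.
    void revoke() {
        revoked.set(true);
    }

    // Reading the flag does not reset it, unlike Thread.interrupted().
    boolean isRevoked() {
        return revoked.get();
    }

    // Runs commit if leadership has not been revoked, otherwise rollback.
    // Returns true if commit was run.
    boolean commitIfStillLeader(Runnable commit, Runnable rollback) {
        if (revoked.get()) {
            rollback.run();
            return false;
        }
        // Race window: leadership can still be lost between the check above
        // and the moment the commit completes. The flag narrows the window
        // but does not close it.
        commit.run();
        return true;
    }
}
```

The comment inside commitIfStillLeader() marks the remaining race: the check 
and the commit are not atomic, which is why this is only a partial solution.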

This could be largely cured if Curator had semantics such that it would not try 
to select a new leader before ensuring that the old leader had actually 
processed the interrupt and properly exited.  

What are the Curator leader selection semantics in this case?  If Curator does 
not do something like what I described, it's almost trivially easy to get 
overlapping leaders. 

Cheers, Robert Hodges

p.s., If there's interest in the lock approach I would be happy to prepare a 
patch so it can be added to Curator.

On Nov 14, 2013, at 8:11 AM PST, Arie Zilberstein wrote:

> Henrik,
> 
> You should be able to transactionally test for leadership and update a state 
> variable in ZooKeeper.
> This is something that I requested a few weeks ago in a thread named 
> "Atomically setting a node's data while having leadership", and I hope will 
> be implemented. Personally I think it is a must-have capability.
> 
> In your scenario, however, since you must update a database, there is a race 
> condition that cannot be readily resolved (without some kind of distributed 
> transactions). You can test for leadership and then update the DB, but there 
> is no guarantee that the leadership is still yours by the end of your DB 
> update call.
> 
> Thanks,
> Arie 
> 
> 
> On Wed, Nov 13, 2013 at 4:02 PM, Henrik Nordvik <[email protected]> wrote:
> I've upgraded to curator 2.3.0.
> LeaderSelector still uses thread interrupting for signaling to the thread 
> running takeLeadership() to stop, right?
> Inside my takeLeadership I do some database operations, and before committing 
> I'm checking if I was interrupted, and roll back if I was.
> However, some code in between clears the interrupt flag (i.e. logback does 
> this), so I'm committing even though I lost/suspended the connection.
> 
> I need some other criteria to decide if I can commit or not. hasLeadership 
> only checks a local flag, which is always true inside takeLeadership().
> Do I have another flag I can check?
> 
> 
> --
> Henrik Nordvik
> 
> 
> On Tue, Nov 5, 2013 at 5:21 PM, Jordan Zimmerman <[email protected]> 
> wrote:
> This sounds like a variation of 
> https://issues.apache.org/jira/browse/CURATOR-54 - The next release of 
> Curator (later this week) provides a more robust way of canceling leadership 
> that doesn’t require thread interruption.
> 
> -Jordan
> 
> On Nov 5, 2013, at 1:47 AM, Henrik Nordvik <[email protected]> wrote:
> 
>> Hi,
>> 
>> I'm getting some strange behaviour when stopping zookeeper in one 
>> environment that I can't reproduce locally.
>> The result is that the leader selector "quits" even though it is set as 
>> auto-requeue. (I think that happens because the retry loop inside 
>> LeaderSelector checks the interrupt-flag, which is set again even when I 
>> cleared it).
>> 
>> I think it boils down to getting
>> 
>> 2013-11-04 18:22:32,501 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager      - State change: LOST
>> 2013-11-04 18:22:32,501 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - Interrupting thread Thread[LeaderSelector-0,5,main]
>> 2013-11-04 18:22:32,503 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager      - State change: SUSPENDED
>> 2013-11-04 18:22:32,504 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - Interrupting thread Thread[LeaderSelector-0,5,main]
>> 
>> ... then I handle the interrupt in the leader thread.
>> 
>> Then I get this:
>> 2013-11-04 18:22:36,465 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager      - State change: LOST
>> 2013-11-04 18:22:36,465 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager      - State change: SUSPENDED
>> 2013-11-04 18:22:36,465 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - StateChanged: LOST
>> 2013-11-04 18:22:36,465 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - Interrupting thread Thread[LeaderSelector-0,5,main]
>> 2013-11-04 18:22:36,466 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - StateChanged: SUSPENDED
>> 2013-11-04 18:22:36,466 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - Interrupting thread Thread[LeaderSelector-0,5,main]
>> 
>> 
>> Full log is here: https://gist.github.com/zerd/7316258
>> 
>> The code follows the old leader selector example pretty well:
>> 
>>     @Override
>>     public void takeLeadership(CuratorFramework curatorFramework) throws Exception {
>>         ourThread = Thread.currentThread();
>>         logger.debug(format("(%s) Got leadership", ourThread));
>>         try {
>>             waitForAndPerformWork();
>>         } catch (InterruptedException e) {
>>             logger.debug(format("(%s) Interrupted ", ourThread), e);
>>         } finally {
>>             logger.debug(format("(%s) No longer leader", ourThread));
>>         }
>>     }
>> 
>>     @Override
>>     public void stateChanged(CuratorFramework curatorFramework, ConnectionState newState) {
>>         logger.debug("StateChanged: " + newState);
>> 
>>         if ((newState == ConnectionState.LOST) || (newState == ConnectionState.SUSPENDED)) {
>>             if (ourThread != null) {
>>                 logger.debug("Interrupting thread " + ourThread);
>>                 ourThread.interrupt();
>>             } else {
>>                 logger.debug("Thread is null");
>>             }
>>         }
>>     }
>> 
>> Is it supposed to go back and forth from lost to suspended?
>> My goal is to get it to resume trying to get the leadership when zookeeper 
>> comes back. Do I have to requeue it manually when this happens?
>> Would upgrading to latest curator with CancelLeadershipException fix this?
>> 
>> Thank you very much for your time.
>> 
>> --
>> Henrik Nordvik
> 
> 
>
