Re: leader election, scheduled tasks, losing leadership

Jordan Zimmerman Sat, 08 Dec 2012 21:19:17 -0800

If your ConnectionStateListener gets SUSPENDED or LOST, you've lost your 
connection to ZooKeeper. You therefore cannot use that same ZooKeeper 
connection to manage a node that records whether the process is running. 
Only one VM at a time will be running the process, and that process can watch 
for SUSPENDED/LOST and wind down the task.
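
A minimal sketch of that wind-down pattern, under stated assumptions. To keep 
it runnable without Curator on the classpath, ConnState is a hypothetical 
stand-in for Curator's ConnectionState enum, and LeaderTask, its 
stateChanged(ConnState) method and run(...) are names invented for 
illustration. In real code the flag would be flipped from a 
ConnectionStateListener registered via 
client.getConnectionStateListenable().addListener(...). Note that this only 
stops the local task once the notification arrives; it does not by itself 
rule out a new leader starting elsewhere before that happens.

```java
import java.util.function.IntConsumer;

// Hypothetical stand-in for Curator's ConnectionState enum, so the sketch
// runs without Curator. The constant names mirror Curator's.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical wrapper around the leader-only task.
class LeaderTask {
    // Flipped to false by the listener callback. Volatile because the
    // callback normally arrives on a different thread than the worker.
    private volatile boolean leadershipTrusted = true;

    // In real code, call this from ConnectionStateListener.stateChanged(...).
    void stateChanged(ConnState newState) {
        if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
            leadershipTrusted = false;
        }
        // Assumption: RECONNECTED does not restore trust. Leadership has to
        // be won again through a new election before a new task starts.
    }

    // Runs up to maxSteps units of work, re-checking the flag before each
    // one. Returns the number of steps completed in this call.
    int run(int maxSteps, IntConsumer step) {
        int completed = 0;
        while (completed < maxSteps && leadershipTrusted) {
            step.accept(completed);
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) {
        LeaderTask task = new LeaderTask();
        // Simulate the connection being suspended while step 2 is running.
        int done = task.run(10, i -> {
            if (i == 2) task.stateChanged(ConnState.SUSPENDED);
        });
        System.out.println("steps completed before wind-down: " + done);
    }
}
```

The task is split into small steps so the flag is checked often. In the 
simulated run the suspension arrives during the third step, so the loop stops 
after three of the ten steps.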


> You can't assume that the notification is received locally before another
> leader election finishes elsewhere
Which notification? The ConnectionStateListener is an abstraction on 
ZooKeeper's watcher mechanism. It's only significant for the VM that is the 
leader. Non-leaders don't need to be concerned.

-JZ

On Dec 8, 2012, at 9:12 PM, Henry Robinson <[email protected]> wrote:

> You can't assume that the notification is received locally before another
> leader election finishes elsewhere (particularly if you are running slowly
> for some reason!), so it's not sufficient to guarantee that the process
> that is running locally has finished before someone else starts another.
> 
> It's usually best - if possible - to restructure the system so that
> processes are idempotent to work around these kinds of problem, in
> conjunction with using the kind of primitives that Curator provides.
> 
> Henry
> 
> On 8 December 2012 21:04, Jordan Zimmerman <[email protected]> wrote:
> 
>> This is why you need a ConnectionStateListener. You'll get a notice that
>> the connection has been suspended and you should assume all locks/leaders
>> are invalid.
>> 
>> -JZ
>> 
>> On Dec 8, 2012, at 9:02 PM, Henry Robinson <[email protected]> wrote:
>> 
>>> What about a network disconnection? Presumably leadership is revoked when
>>> the leader appears to have failed, which can be for more reasons than a VM
>>> crash (VM running slow, network event, GC pause etc).
>>> 
>>> Henry
>>> 
>>> On 8 December 2012 21:00, Jordan Zimmerman <[email protected]> wrote:
>>> 
>>>> The leader latch lock is the equivalent of task in progress. I assume the
>>>> task is running in the same VM as the leader lock. The only reason the VM
>>>> would lose leadership is if it crashes in which case the process would die
>>>> anyway.
>>>> 
>>>> -JZ
>>>> 
>>>> On Dec 8, 2012, at 8:56 PM, Eric Pederson <[email protected]> wrote:
>>>> 
>>>>> If I recall correctly it was Henry Robinson that gave me the advice to have
>>>>> a "task in progress" check.
>>>>> 
>>>>> 
>>>>> -- Eric
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sat, Dec 8, 2012 at 11:54 PM, Eric Pederson <[email protected]> wrote:
>>>>> 
>>>>>> I am using Curator LeaderLatch :)
>>>>>> 
>>>>>> 
>>>>>> -- Eric
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sat, Dec 8, 2012 at 11:52 PM, Jordan Zimmerman <[email protected]> wrote:
>>>>>> 
>>>>>>> You might check your leader implementation. Writing a correct leader
>>>>>>> recipe is actually quite challenging due to edge cases. Have a look at
>>>>>>> Curator (disclosure: I wrote it) for an example.
>>>>>>> 
>>>>>>> -JZ
>>>>>>> 
>>>>>>> On Dec 8, 2012, at 8:49 PM, Eric Pederson <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Actually I had the same thought and didn't consider having to do this until
>>>>>>>> I talked about my project at a Zookeeper User Group a month or so ago and I
>>>>>>>> was given this advice.
>>>>>>>> 
>>>>>>>> I know that I do see leadership being lost/transferred when one of the ZK
>>>>>>>> servers is restarted (not the whole ensemble). And it seems like I've
>>>>>>>> seen it happen even when the ensemble stays totally stable (though I am not
>>>>>>>> 100% sure as it's been a while since I have worked on this particular
>>>>>>>> application).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -- Eric
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Dec 8, 2012 at 11:25 PM, Jordan Zimmerman <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Why would it lose leadership? The only reason I can think of is if the ZK
>>>>>>>>> cluster goes down. In normal use, the ZK cluster won't go down (I assume
>>>>>>>>> you're running 3 or 5 instances).
>>>>>>>>> 
>>>>>>>>> -JZ
>>>>>>>>> 
>>>>>>>>> On Dec 8, 2012, at 8:17 PM, Eric Pederson <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> During the time the task is running a cluster member could lose its
>>>>>>>>>> leadership.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Henry Robinson
>>> Software Engineer
>>> Cloudera
>>> 415-994-6679
>> 
>> 
> 
> 
> -- 
> Henry Robinson
> Software Engineer
> Cloudera
> 415-994-6679
