Thanks, yes I'll see if I can get a test going to reproduce. It's definitely a 
race case and probably rarely comes up, though it would be theoretically 
possible to always lose the race and never get the leadership.

@Jordan: No I didn't encounter it happening, I noticed it while reading the 
code. Since the methods were renamed in that commit, I bet it was just and 
accident when changing a few things at once.

Jeff

> On Jun 7, 2017, at 5:42 PM, Cameron McKenzie <[email protected]> wrote:
> 
> hey Jeff,
> Your assessment looks pretty spot on to me. I guess it's never come up before 
> as it's an edge case and for most use cases I imagine that the consequences 
> are minimal (the order of leadership just changes and it would take a bit 
> longer for the leadership to occur).
> 
> If you'd like to create a PR, preferably with a unit test reproducing the 
> issue if possible, I'd be happy to review it.
> cheers
> Cam
> 
> 
> 
>> On Thu, Jun 8, 2017 at 10:15 AM, Jeff Clites <[email protected]> wrote:
>> At the bottom of the LeaderLatch.checkLeadership() method (in Curator 
>> 2.12.0), the following is called to set a watch on the waiter one spot ahead 
>> of us, as expected:
>> 
>>  String watchPath = sortedChildren.get(ourIndex - 1);
>>  ...
>>  
>> client.getData().usingWatcher(watcher).inBackground(callback).forPath(ZKPaths.makePath(latchPath,
>>  watchPath));
>> 
>> When the watch fires and has event type NodeDeleted, then 
>> LeaderLatch.getChildren() is called to determine if we are now the leader, 
>> again as expected.
>> 
>> But I don't understand the following. The code for the callback (called in 
>> response to the getData() completing) has this:
>> 
>>    if ( event.getResultCode() == KeeperException.Code.NONODE.intValue() )
>>    {
>>        // previous node is gone - reset
>>        reset();
>>    }
>> 
>> It seems that this should be a call to getChildren(), rather than to 
>> reset(), since this means that the waiter ahead of us has disappeared (the 
>> same situation we were watch-ing for, but it happened before we could even 
>> set a watch), whereas reset() would be the right reaction if our path had 
>> disappeared. It looks to me like this code "thinks" it's a watch about our 
>> node, rather than the one-ahead-of-use node. I think this would cause us to 
>> delete and re-create our node, effectively moving us to the back of the wait 
>> queue, rather than making us the leader now.
>> 
>> Is this a bug, or am I misinterpreting something?
>> 
>> Looking at the history of this code, it looks like it did in fact used to 
>> check for leadership at this point, and got changed when the code was 
>> updated to use background calls (in commit sha ff4ec29f).
>> 
>> Thanks in advance,
>> 
>> Jeff Clites
> 

Reply via email to