Thanks, yes I'll see if I can get a test going to reproduce. It's definitely a race case and probably rarely comes up, though it would be theoretically possible to always lose the race and never get the leadership.
@Jordan: No I didn't encounter it happening, I noticed it while reading the code. Since the methods were renamed in that commit, I bet it was just and accident when changing a few things at once. Jeff > On Jun 7, 2017, at 5:42 PM, Cameron McKenzie <[email protected]> wrote: > > hey Jeff, > Your assessment looks pretty spot on to me. I guess it's never come up before > as it's an edge case and for most use cases I imagine that the consequences > are minimal (the order of leadership just changes and it would take a bit > longer for the leadership to occur). > > If you'd like to create a PR, preferably with a unit test reproducing the > issue if possible, I'd be happy to review it. > cheers > Cam > > > >> On Thu, Jun 8, 2017 at 10:15 AM, Jeff Clites <[email protected]> wrote: >> At the bottom of the LeaderLatch.checkLeadership() method (in Curator >> 2.12.0), the following is called to set a watch on the waiter one spot ahead >> of us, as expected: >> >> String watchPath = sortedChildren.get(ourIndex - 1); >> ... >> >> client.getData().usingWatcher(watcher).inBackground(callback).forPath(ZKPaths.makePath(latchPath, >> watchPath)); >> >> When the watch fires and has event type NodeDeleted, then >> LeaderLatch.getChildren() is called to determine if we are now the leader, >> again as expected. >> >> But I don't understand the following. The code for the callback (called in >> response to the getData() completing) has this: >> >> if ( event.getResultCode() == KeeperException.Code.NONODE.intValue() ) >> { >> // previous node is gone - reset >> reset(); >> } >> >> It seems that this should be a call to getChildren(), rather than to >> reset(), since this means that the waiter ahead of us has disappeared (the >> same situation we were watch-ing for, but it happened before we could even >> set a watch), whereas reset() would be the right reaction if our path had >> disappeared. It looks to me like this code "thinks" it's a watch about our >> node, rather than the one-ahead-of-use node. I think this would cause us to >> delete and re-create our node, effectively moving us to the back of the wait >> queue, rather than making us the leader now. >> >> Is this a bug, or am I misinterpreting something? >> >> Looking at the history of this code, it looks like it did in fact used to >> check for leadership at this point, and got changed when the code was >> updated to use background calls (in commit sha ff4ec29f). >> >> Thanks in advance, >> >> Jeff Clites >
