Yes, I think this is a hole. As I've thought more about it I think the method you described using the lock node in the transaction is actually the best.
-JZ

> On Sep 29, 2019, at 11:41 PM, Zili Chen <wander4...@gmail.com> wrote:
>
> Hi Jordan,
>
> Here is a possible edge case in the coordination-node approach.
>
> When an instance becomes leader it:
> 1. Gets the version of the coordination ZNode
> 2. Sets the data for that ZNode (the contents don't matter) using the retrieved version number
> 3. If the set succeeds you can be assured you are currently leader (otherwise release leadership and re-contend)
> 4. Saves the new version
>
> Actually, becoming leader and getting the version of the coordination
> znode are NOT atomic. So an edge case is:
>
> 1. instance-1 becomes leader and tries to get the version of the coordination znode.
> 2. instance-2 becomes leader and updates the coordination znode.
> 3. instance-1 gets the newer version and updates the coordination znode again.
>
> Concretely, instance-1 may suffer a session expiry, and since Curator
> retries on session expiry the sequence above is possible. Instance-2 will
> be misled into believing it is not the leader and will give up leadership,
> so the algorithm can proceed, and instance-1 will eventually be notified
> asynchronously that it is not the leader; but before that notification
> arrives, instance-1 may already have performed some operations as a false
> leader.
>
> Curator should ensure, with some synchronization logic, that instance-1
> does not regard itself as the leader. Alternatively it can check against a
> cached leader latch path, because the latch path captured at the moment an
> instance becomes leader is guaranteed to be the correct one. To be clear,
> by "leader latch path" I don't mean the volatile field, but the path
> cached when the instance becomes leader.
>
> Best,
> tison.
>
>
> On Sun, Sep 22, 2019 at 2:43 AM, Zili Chen <wander4...@gmail.com> wrote:
>
>> the Curator recipes delete and recreate their paths
>
> However, as mentioned above, we do a one-shot election (it doesn't reuse
> the Curator recipe), so the latch path we check is always the path from
> the epoch in which the contender became leader.
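The edge case described above can be sketched as a small simulation: ZooKeeper's conditional setData behaves as a compare-and-set on the znode version, and the race arises because reading the version is a separate step from becoming leader. This is an illustrative in-memory model under assumed names (`VersionedNode`), not the Curator or ZooKeeper API.

```python
class VersionedNode:
    """Toy model of a znode with ZooKeeper-style version semantics."""
    def __init__(self):
        self.version = 0

    def get_version(self):
        return self.version

    def set_data(self, expected_version):
        # Mirrors setData(path, data, version): fails on a version mismatch.
        if expected_version != self.version:
            raise RuntimeError("BadVersion")
        self.version += 1
        return self.version

node = VersionedNode()

# Step 2 of the race: instance-2 becomes leader and fences itself in first.
v2 = node.get_version()            # reads version 0
v2_new = node.set_data(v2)         # bumps 0 -> 1

# Step 3: instance-1 (the stale ex-leader) only now performs its read, so
# it observes the *newer* version and its fencing write succeeds anyway.
v1 = node.get_version()            # reads 1
v1_new = node.set_data(v1)         # bumps 1 -> 2: the stale leader "wins"

# instance-2's next guarded operation now fails, misleading it into
# believing it lost leadership even though it is the rightful leader.
try:
    node.set_data(v2_new)
    misled = False
except RuntimeError:
    misled = True
```

The point of the sketch is that the version check alone is sound only if the read of the version is atomic with the leadership change itself.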
> You can check out an implementation of the design here [1]. Even if we
> want to enable re-contending, we can set a guard
>
> (change state -> track latch path)
>
> and check that the state is LEADING && the latch path exists (so we are
> not misled into checking a wrong path).
>
> Checking the version of a coordination znode sounds like another valid
> solution. I'd be glad to see it in a future Curator version, and if there
> is a ticket I can help dig into it a bit :-)
>
> Best,
> tison.
>
> [1] https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java
>
> On Sun, Sep 22, 2019 at 2:31 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>
> The issue is that the leader path doesn't stay constant. Every time there
> is a network partition, etc., the Curator recipes delete and recreate
> their paths. So I'm concerned that client code trying to keep track of the
> leader path would be error-prone (it's one reason those paths aren't
> public - they are volatile internal state).
>
> -Jordan
>
>> On Sep 21, 2019, at 1:26 PM, Zili Chen <wander4...@gmail.com> wrote:
>>
>> Hi Jordan,
>>
>>> I think using the leader path may not work
>>
>> Could you share a situation where this strategy does not work? In our
>> design we do leader contending one-shot, and when performing a
>> transaction we check the existence of the latch path && that we are in
>> state LEADING.
>>
>> Given that the election algorithm works, the state transitions to LEADING
>> when the contender's latch path becomes the smallest sequential znode. So
>> the continued existence of the latch path guards against anybody else
>> having become leader.
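The latch-path guard discussed above amounts to bundling every leader write into a multi whose first op checks that the latch znode still exists, with ZooKeeper's all-or-nothing multi semantics. A minimal in-memory sketch of those semantics (the `MiniZk` class and paths are illustrative, not the real API):

```python
class MiniZk:
    """Toy znode store supporting an all-or-nothing multi-op."""
    def __init__(self):
        self.nodes = {}

    def create(self, path, data=b""):
        self.nodes[path] = data

    def delete(self, path):
        del self.nodes[path]

    def multi(self, ops):
        # Validate every op before applying any, as a ZooKeeper multi does.
        for op in ops:
            if op[0] == "check" and op[1] not in self.nodes:
                raise RuntimeError("NoNode: " + op[1])
        for op in ops:
            if op[0] == "create":
                self.create(op[1], op[2])

def guarded_write(zk, latch_path, path, data):
    # The guard: the write commits only if the latch znode still exists.
    zk.multi([("check", latch_path), ("create", path, data)])

zk = MiniZk()
zk.create("/latch/lock-0000000001")          # contender-1's latch znode

# Succeeds while contender-1 still holds the latch.
guarded_write(zk, "/latch/lock-0000000001", "/jobs/txn-1", b"payload")

zk.delete("/latch/lock-0000000001")          # leadership revoked
try:
    # The whole multi is rejected, so the stale write never commits.
    guarded_write(zk, "/latch/lock-0000000001", "/jobs/txn-2", b"payload")
    rejected = False
except RuntimeError:
    rejected = True
```

As Jordan notes, the fragile part is not the multi itself but knowing which latch path to check, since the recipes recreate their paths after partitions.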
>>
>> On Sun, Sep 22, 2019 at 12:58 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>>
>> Yeah, Ted - I think this is basically the same thing. We should all try
>> to poke holes in this.
>>
>> -JZ
>>
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>
>>> I would suggest that using an epoch number stored in ZK might be
>>> helpful. Every operation that the master takes could be made conditional
>>> on the epoch number using a multi-transaction.
>>>
>>> Unfortunately, as you say, you have to make the update of the epoch
>>> atomic with becoming leader.
>>>
>>> The natural way to do this is to have an update of an epoch file be part
>>> of the leader election, but that probably isn't possible using Curator.
>>> The way I would tend to do it would be to have a persistent file that is
>>> updated atomically as part of leader election. The version of that
>>> persistent file could then be used as the epoch number. All updates to
>>> files that are gated on the epoch number would only proceed if no other
>>> master has been elected, at least if you use the sync option.
>>>
>>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen <wander4...@gmail.com> wrote:
>>>
>>> Hi ZooKeepers,
>>>
>>> Recently there has been an ongoing refactor [1] in the Flink community,
>>> aimed at overcoming several inconsistent-state issues on ZK that we have
>>> met. I'm here to share our design for leader election and leader
>>> operations. A leader operation is an operation that should be committed
>>> only if the contender is the leader. I'm also CC'ing the Curator mailing
>>> list because this mail contains the reason why we cannot JUST use
>>> Curator.
>>>
>>> The rule we want to keep is:
>>>
>>> **Writes on ZK must be committed only if the contender is the leader**
>>>
>>> We represent each contender by an individual ZK client.
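Ted's epoch idea above can be sketched the same way: election atomically bumps a persistent epoch znode, its resulting version becomes the leader's epoch, and every master write is a multi gated on that version. The sketch assumes the hard part (atomicity of `elect()`) is solved, which is exactly what Ted says Curator doesn't provide; `EpochStore` is an illustrative in-memory model, not a real API.

```python
class EpochStore:
    """Toy model: the version of a persistent epoch znode is the epoch."""
    def __init__(self):
        self.epoch_version = 0     # version of the persistent epoch znode
        self.data = {}

    def elect(self):
        # Assumed atomic: becoming leader bumps the epoch znode, and the
        # new version serves as this leader's epoch number.
        self.epoch_version += 1
        return self.epoch_version

    def gated_write(self, epoch, path, value):
        # Models a multi: [check(/epoch, version=epoch), setData(path, value)]
        if epoch != self.epoch_version:
            raise RuntimeError("BadVersion: stale epoch")
        self.data[path] = value

store = EpochStore()
epoch_1 = store.elect()                     # master-1 elected, epoch 1
store.gated_write(epoch_1, "/state", "a")   # commits: epoch still current

epoch_2 = store.elect()                     # master-2 elected, epoch 2
try:
    store.gated_write(epoch_1, "/state", "b")   # master-1's stale write
    stale_rejected = False
except RuntimeError:
    stale_rejected = True
```

With the atomic bump in place, a deposed master's writes are fenced out by the version check alone; without it, the Sep 29 edge case reappears.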
>>> At the moment we use Curator for leader election, so the algorithm is
>>> the same as the optimized version on this page [2].
>>>
>>> The problem is that this algorithm only takes care of leader election
>>> and is indifferent to subsequent operations. Consider the scenario
>>> below:
>>>
>>> 1. contender-1 becomes the leader
>>> 2. contender-1 proposes a create txn-1
>>> 3. the sender thread is suspended by a full GC
>>> 4. contender-1 loses leadership and contender-2 becomes the leader
>>> 5. contender-1 recovers from the full GC; before it reacts to the
>>>    revoke-leadership event, txn-1 is retried and sent to ZK.
>>>
>>> Without any other guard, the txn will succeed on ZK, and thus
>>> contender-1 commits a write even though it is no longer the leader. This
>>> issue is also documented in this note [3].
>>>
>>> To overcome this issue, instead of just saying that we're unfortunate,
>>> we drafted two possible solutions.
>>>
>>> The first is documented here [4]. Briefly, when the contender becomes
>>> the leader, we memorize the latch path at that moment. Each subsequent
>>> operation is done in a transaction that first checks the existence of
>>> the latch path. Leadership only switches once the latch is gone, and all
>>> operations fail once the latch is gone.
>>>
>>> The second is still rough. Basically it relies on the session-expiry
>>> mechanism in ZK. We would adopt the unoptimized version of the recipe
>>> [2], given that in our scenario there are only a few contenders at a
>>> time. We create the /leader node as an ephemeral znode holding the
>>> leader information, and when the session expires we consider leadership
>>> revoked and terminate the contender. Asynchronous write operations
>>> should not succeed, because they will all fail on session expiry.
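The second (session-based) solution above can be sketched as well: /leader is an ephemeral znode tied to the contender's session, so once that session expires, every pending write issued on it is rejected rather than retried. This is a simplified in-memory model of the intended semantics only (real ZooKeeper reports expiry asynchronously, which is exactly why the contender must terminate instead of retrying); `Session` and `SessionZk` are illustrative names.

```python
class Session:
    """Toy stand-in for a ZooKeeper client session."""
    def __init__(self):
        self.expired = False

class SessionZk:
    """Toy store tracking which session owns each ephemeral znode."""
    def __init__(self):
        self.ephemerals = {}   # path -> owning session

    def create_ephemeral(self, session, path):
        if session.expired:
            raise RuntimeError("SessionExpired")
        if path in self.ephemerals and not self.ephemerals[path].expired:
            raise RuntimeError("NodeExists")
        self.ephemerals[path] = session

    def write(self, session, path):
        # Any operation on an expired session fails; the contender must
        # terminate rather than retry with a fresh session.
        if session.expired:
            raise RuntimeError("SessionExpired")
        return path

zk = SessionZk()
s1 = Session()
zk.create_ephemeral(s1, "/leader")    # contender-1 is leader

s1.expired = True                     # leadership revoked with the session
try:
    zk.write(s1, "/jobs/txn-1")       # the delayed async write cannot commit
    failed = False
except RuntimeError:
    failed = True

s2 = Session()
zk.create_ephemeral(s2, "/leader")    # contender-2 takes over
```

The safety argument hinges on never carrying operations across sessions, which is the behavior Curator's automatic re-creation of the client breaks.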
>>>
>>> We cannot adopt solution 1 with Curator because it doesn't expose the
>>> latch path (that was added recently, but not in the version we use); we
>>> cannot adopt solution 2 with Curator because, although we have to retry
>>> on connection loss, we don't want to retry on session expiry - Curator
>>> always creates a new client on session expiry and retries the operation.
>>>
>>> I'd like to learn from the ZooKeeper community: 1. is there any
>>> potential risk if we eventually adopt option 1 or option 2? 2. is there
>>> any other solution we can adopt?
>>>
>>> Best,
>>> tison.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10333
>>> [2] https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection
>>> [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10
>>> [4] https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing