Hi Jordan,

Here is a possible edge case of the coordination-znode approach.
- When an instance becomes leader it:
  - Gets the version of the coordination ZNode
  - Sets the data for that ZNode (the contents don't matter) using the retrieved version number
  - If the set succeeds you can be assured you are currently leader (otherwise release leadership and re-contend)
  - Saves the new version

Actually, becoming leader and getting the version of the coordination znode are NOT atomic. So an edge case is:

1. instance-1 becomes leader and tries to get the version of the coordination znode.
2. instance-2 becomes leader and updates the coordination znode.
3. instance-1 gets the newer version and updates the coordination znode again.

Generally speaking, instance-1 suffers a session expiration here, but since Curator retries on session expiration the case above is possible. instance-2 will then be misled into believing it is not the leader and give up leadership, so the algorithm can still proceed, and instance-1 will be asynchronously notified that it is not the leader; however, before that notification arrives, instance-1 may already have performed some operations.

Curator should ensure, with some synchronization logic, that instance-1 does not regard itself as the leader. Or it could simply use a cached leader latch path for the check, because the latch path at the moment it becomes leader is synchronized to be the exact one. To be clear, by leader latch path I don't mean the volatile field, but the path cached at the moment the instance becomes leader.
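A rough sketch of this cached-latch-path guard, using Curator's transaction API (the class, method, and path names here are only illustrative):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.api.transaction.CuratorOp;

    class GuardedWrites {
        private final CuratorFramework client;
        // Cached at the moment this instance becomes leader, NOT read back
        // from the recipe's volatile internal field later on.
        private final String latchPathWhenElected;

        GuardedWrites(CuratorFramework client, String latchPathWhenElected) {
            this.client = client;
            this.latchPathWhenElected = latchPathWhenElected;
        }

        // Bundle every leader write with a check on our own latch znode;
        // if another instance has taken over, the latch is gone and the
        // whole transaction fails instead of committing a stale write.
        void guardedCreate(String path, byte[] data) throws Exception {
            CuratorOp check = client.transactionOp().check().forPath(latchPathWhenElected);
            CuratorOp create = client.transactionOp().create().forPath(path, data);
            client.transaction().forOperations(check, create);
        }
    }

If another instance has become leader in the meantime, our latch znode is gone, the check fails, and the create is never committed, no matter when the retried request reaches ZooKeeper.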
Best,
tison.

On Sun, Sep 22, 2019 at 2:43 AM Zili Chen <wander4...@gmail.com> wrote:

> > the Curator recipes delete and recreate their paths
>
> However, as mentioned above, we do a one-shot election (it doesn't reuse the
> Curator recipe), so the latch path we check is always the path from the epoch
> in which the contender became leader. You can check out an implementation of
> the design here[1]. Even if we want to enable re-contending, we can set a guard
>
> (change state -> track latch path)
>
> and check that the state is LEADING && the path exists (so we are not misled
> into checking a wrong path).
>
> Checking the version of a coordination znode sounds like another valid
> solution. I'd be glad to see it in a future Curator version, and if there is a
> ticket I can help dig into it a bit :-)
>
> Best,
> tison.
>
> [1]
> https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java
>
>
> On Sun, Sep 22, 2019 at 2:31 AM Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>
>> The issue is that the leader path doesn't stay constant. Every time there
>> is a network partition, etc. the Curator recipes delete and recreate their
>> paths. So, I'm concerned that client code trying to keep track of the
>> leader path would be error prone (it's one reason that they aren't public -
>> it's volatile internal state).
>>
>> -Jordan
>>
>> On Sep 21, 2019, at 1:26 PM, Zili Chen <wander4...@gmail.com> wrote:
>>
>> Hi Jordan,
>>
>> > I think using the leader path may not work
>>
>> Could you share a situation where this strategy does not work? For this
>> design we do the leader contending one-shot, and when performing a
>> transaction we check the existence of the latch path && that we are in
>> state LEADING.
>>
>> Given that the election algorithm works, the state transitions to LEADING
>> once its latch path has become the smallest sequential znode. So the
>> existence of the latch path guards against anybody else having become
>> leader.
>>
>>
>> On Sun, Sep 22, 2019 at 12:58 AM Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>>
>>> Yeah, Ted - I think this is basically the same thing. We should all try
>>> to poke holes in this.
>>>
>>> -JZ
>>>
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>
>>> I would suggest that using an epoch number stored in ZK might be
>>> helpful. Every operation that the master takes could be made conditional
>>> on the epoch number using a multi-transaction.
>>>
>>> Unfortunately, as you say, you have to have the update of the epoch be
>>> atomic with becoming leader.
>>>
>>> The natural way to do this is to have the update of an epoch file be part
>>> of the leader election, but that probably isn't possible using Curator.
>>> The way I would tend to do it would be to have a persistent file that is
>>> updated atomically as part of leader election. The version of that
>>> persistent file could then be used as the epoch number. All updates to
>>> files that are gated on the epoch number would only proceed if no other
>>> master has been elected, at least if you use the sync option.
>>>
>>>
>>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen <wander4...@gmail.com> wrote:
>>>
>>>> Hi ZooKeepers,
>>>>
>>>> Recently there has been an ongoing refactor[1] in the Flink community
>>>> aimed at overcoming several inconsistent-state issues on ZK that we have
>>>> met. I come here to share our design of leader election and leader
>>>> operation. A leader operation is an operation that should be committed
>>>> only if the contender is the leader. Also CC'ing the Curator mailing
>>>> list because this also contains the reason why we cannot JUST use
>>>> Curator.
>>>>
>>>> The rule we want to keep is
>>>>
>>>> **Writes on ZK must be committed only if the contender is the leader**
>>>>
>>>> We represent a contender by an individual ZK client. At the moment we
>>>> use Curator for leader election, so the algorithm is the same as the
>>>> optimized version on this page[2].
>>>>
>>>> The problem is that this algorithm only takes care of leader election
>>>> but is indifferent to subsequent operations. Consider the scenario below:
>>>>
>>>> 1. contender-1 becomes the leader
>>>> 2. contender-1 proposes a create txn-1
>>>> 3. the sender thread is suspended by a full GC
>>>> 4. contender-1 loses leadership and contender-2 becomes the leader
>>>> 5. contender-1 recovers from the full GC; before it reacts to the
>>>> revoke-leadership event, txn-1 is retried and sent to ZK.
>>>>
>>>> Without another guard the txn will succeed on ZK, and thus contender-1
>>>> commits a write operation even though it is no longer the leader. This
>>>> issue is also documented in this note[3].
>>>>
>>>> To overcome this issue, instead of just saying that we're unfortunate,
>>>> we drafted two possible solutions.
>>>>
>>>> The first is documented here[4]. Briefly, when the contender becomes the
>>>> leader, we memorize the latch path at that moment. For subsequent
>>>> operations, we first check the existence of that latch path in a
>>>> transaction. Leadership is only switched if the latch is gone, and all
>>>> operations will fail if the latch is gone.
>>>>
>>>> The second is still rough. Basically it relies on the session-expiration
>>>> mechanism in ZK. We would adopt the unoptimized version of the
>>>> recipe[2], given that in our scenario there are only a few contenders at
>>>> the same time. We create the /leader node as an ephemeral znode carrying
>>>> the leader information, and when the session expires we consider
>>>> leadership revoked and terminate the contender. Asynchronous write
>>>> operations should not succeed because they will all fail on session
>>>> expiration.
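(A minimal sketch of this second approach, assuming the raw ZooKeeper client; the /leader path is the one described above, and the termination hook is simplified for illustration:)

    import java.io.IOException;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class EphemeralLeader implements Watcher {
        private final ZooKeeper zk;

        EphemeralLeader(String connectString) throws IOException {
            // The default watcher receives session events for this client.
            this.zk = new ZooKeeper(connectString, 30000, this);
        }

        // Called once this contender has won the (unoptimized) election.
        void publishLeadership(byte[] leaderInfo) throws Exception {
            // Ephemeral: the znode is tied to this session and disappears
            // when the session expires.
            zk.create("/leader", leaderInfo, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getState() == Event.KeeperState.Expired) {
                // Session expired: treat leadership as revoked and terminate
                // the contender instead of letting a retry policy recreate
                // the client and resend pending operations.
                terminateContender();
            }
        }

        private void terminateContender() {
            System.exit(1); // simplified for the sketch
        }
    }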
>>>> We cannot adopt 1 using Curator because it doesn't expose the latch
>>>> path (it was added recently, but not in the version we use); we cannot
>>>> adopt 2 using Curator because although we have to retry on connection
>>>> loss, we don't want to retry on session expiration: Curator always
>>>> creates a new client on session expiration and retries the operation.
>>>>
>>>> I'd like to learn from the ZooKeeper community: 1. is there any
>>>> potential risk if we eventually adopt option 1 or option 2? 2. is
>>>> there any other solution we can adopt?
>>>>
>>>> Best,
>>>> tison.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10333
>>>> [2] https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection
>>>> [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10
>>>> [4] https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing
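For reference, a rough sketch of the epoch/version guard Ted suggests above (the coordination-znode check at the top of this thread is a variant of it), using the raw ZooKeeper API; the /election/epoch path and class name are made up for the example:

    import java.util.Arrays;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class EpochGuard {
        private final ZooKeeper zk;
        private int epoch; // version of the persistent epoch znode, captured on election

        EpochGuard(ZooKeeper zk) {
            this.zk = zk;
        }

        // "Claim" the epoch after winning the election: conditionally bump the
        // persistent epoch znode (assumed to already exist) using the version we
        // just read, and remember the resulting version as our epoch. As discussed
        // at the top of this thread, this read-then-write is exactly the step that
        // must be synchronized with becoming leader, otherwise two instances can
        // race here.
        void claimEpoch() throws Exception {
            Stat current = zk.exists("/election/epoch", false);
            Stat updated = zk.setData("/election/epoch", new byte[0], current.getVersion());
            this.epoch = updated.getVersion();
        }

        // Every leader write is bundled with a check on the epoch znode's version,
        // so it can only commit as long as no newer leader has claimed the epoch.
        void guardedSetData(String path, byte[] data) throws Exception {
            zk.multi(Arrays.asList(
                    Op.check("/election/epoch", epoch),
                    Op.setData(path, data, -1)));
        }
    }

If a newer leader has claimed the epoch, Op.check fails and the server rejects the whole multi, regardless of when a retried request arrives.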