Yes, I think this is a hole. As I've thought more about it I think the method you described using the lock node in the transaction is actually the best.
-JZ

> On Sep 29, 2019, at 11:41 PM, Zili Chen <wander4...@gmail.com> wrote:
>
> Hi Jordan,
>
> Here is a possible edge case in the coordination-node approach.
>
> When an instance becomes leader it:
> 1. Gets the version of the coordination ZNode
> 2. Sets the data for that ZNode (the contents don't matter) using the retrieved version number
> 3. If the set succeeds you can be assured you are currently leader (otherwise release leadership and re-contend)
> 4. Saves the new version
>
> Actually, becoming leader and getting the version of the coordination
> znode are NOT atomic. So an edge case is:
>
> 1. instance-1 becomes leader and tries to get the version of the coordination znode.
> 2. instance-2 becomes leader and updates the coordination znode.
> 3. instance-1 gets the newer version and updates the coordination znode again.
>
> Concretely, instance-1 may suffer a session expiry, and since Curator
> retries on session expiry the sequence above is possible. Instance-2 will
> be misled into believing it is not the leader and will give up leadership,
> so the algorithm can proceed, and instance-1 will eventually be notified
> asynchronously that it is not the leader; but before that notification
> arrives, instance-1 may already have performed some operations as a false
> leader.
>
> Curator should ensure, with some synchronization logic, that instance-1
> does not regard itself as the leader. Alternatively it can check against a
> cached leader latch path, because the latch path captured at the moment an
> instance becomes leader is guaranteed to be the correct one. To be clear,
> by "leader latch path" I don't mean the volatile field, but the path
> cached when the instance becomes leader.
>
> Best,
> tison.
>
>
> On Sun, Sep 22, 2019 at 2:43 AM, Zili Chen <wander4...@gmail.com> wrote:
>
>> the Curator recipes delete and recreate their paths
>
> However, as mentioned above, we do a one-shot election (it doesn't reuse
> the Curator recipe), so the latch path we check is always the path from
> the epoch in which the contender became leader.
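The edge case described above can be sketched as a small simulation: ZooKeeper's conditional setData behaves as a compare-and-set on the znode version, and the race arises because reading the version is a separate step from becoming leader. This is an illustrative in-memory model under assumed names (`VersionedNode`), not the Curator or ZooKeeper API.

```python
class VersionedNode:
    """Toy model of a znode with ZooKeeper-style version semantics."""
    def __init__(self):
        self.version = 0

    def get_version(self):
        return self.version

    def set_data(self, expected_version):
        # Mirrors setData(path, data, version): fails on a version mismatch.
        if expected_version != self.version:
            raise RuntimeError("BadVersion")
        self.version += 1
        return self.version

node = VersionedNode()

# Step 2 of the race: instance-2 becomes leader and fences itself in first.
v2 = node.get_version()            # reads version 0
v2_new = node.set_data(v2)         # bumps 0 -> 1

# Step 3: instance-1 (the stale ex-leader) only now performs its read, so
# it observes the *newer* version and its fencing write succeeds anyway.
v1 = node.get_version()            # reads 1
v1_new = node.set_data(v1)         # bumps 1 -> 2: the stale leader "wins"

# instance-2's next guarded operation now fails, misleading it into
# believing it lost leadership even though it is the rightful leader.
try:
    node.set_data(v2_new)
    misled = False
except RuntimeError:
    misled = True
```

The point of the sketch is that the version check alone is sound only if the read of the version is atomic with the leadership change itself.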
> You can check out an implementation of the design here [1]. Even if we
> want to enable re-contending, we can set a guard
>
> (change state -> track latch path)
>
> and check that the state is LEADING && the latch path exists (so we are
> not misled into checking a wrong path).
>
> Checking the version of a coordination znode sounds like another valid
> solution. I'd be glad to see it in a future Curator version, and if there
> is a ticket I can help dig into it a bit :-)
>
> Best,
> tison.
>
> [1] https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java
>
> On Sun, Sep 22, 2019 at 2:31 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>
> The issue is that the leader path doesn't stay constant. Every time there
> is a network partition, etc., the Curator recipes delete and recreate
> their paths. So I'm concerned that client code trying to keep track of the
> leader path would be error-prone (it's one reason those paths aren't
> public - they are volatile internal state).
>
> -Jordan
>
>> On Sep 21, 2019, at 1:26 PM, Zili Chen <wander4...@gmail.com> wrote:
>>
>> Hi Jordan,
>>
>>> I think using the leader path may not work
>>
>> Could you share a situation where this strategy does not work? In our
>> design we do leader contending one-shot, and when performing a
>> transaction we check the existence of the latch path && that we are in
>> state LEADING.
>>
>> Given that the election algorithm works, the state transitions to LEADING
>> when the contender's latch path becomes the smallest sequential znode. So
>> the continued existence of the latch path guards against anybody else
>> having become leader.
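The latch-path guard discussed above amounts to bundling every leader write into a multi whose first op checks that the latch znode still exists, with ZooKeeper's all-or-nothing multi semantics. A minimal in-memory sketch of those semantics (the `MiniZk` class and paths are illustrative, not the real API):

```python
class MiniZk:
    """Toy znode store supporting an all-or-nothing multi-op."""
    def __init__(self):
        self.nodes = {}

    def create(self, path, data=b""):
        self.nodes[path] = data

    def delete(self, path):
        del self.nodes[path]

    def multi(self, ops):
        # Validate every op before applying any, as a ZooKeeper multi does.
        for op in ops:
            if op[0] == "check" and op[1] not in self.nodes:
                raise RuntimeError("NoNode: " + op[1])
        for op in ops:
            if op[0] == "create":
                self.create(op[1], op[2])

def guarded_write(zk, latch_path, path, data):
    # The guard: the write commits only if the latch znode still exists.
    zk.multi([("check", latch_path), ("create", path, data)])

zk = MiniZk()
zk.create("/latch/lock-0000000001")          # contender-1's latch znode

# Succeeds while contender-1 still holds the latch.
guarded_write(zk, "/latch/lock-0000000001", "/jobs/txn-1", b"payload")

zk.delete("/latch/lock-0000000001")          # leadership revoked
try:
    # The whole multi is rejected, so the stale write never commits.
    guarded_write(zk, "/latch/lock-0000000001", "/jobs/txn-2", b"payload")
    rejected = False
except RuntimeError:
    rejected = True
```

As Jordan notes, the fragile part is not the multi itself but knowing which latch path to check, since the recipes recreate their paths after partitions.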
>>
>> On Sun, Sep 22, 2019 at 12:58 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>>
>> Yeah, Ted - I think this is basically the same thing. We should all try
>> to poke holes in this.
>>
>> -JZ
>>
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>
>>> I would suggest that using an epoch number stored in ZK might be
>>> helpful. Every operation that the master takes could be made conditional
>>> on the epoch number using a multi-transaction.
>>>
>>> Unfortunately, as you say, you have to make the update of the epoch
>>> atomic with becoming leader.
>>>
>>> The natural way to do this is to have an update of an epoch file be part
>>> of the leader election, but that probably isn't possible using Curator.
>>> The way I would tend to do it would be to have a persistent file that is
>>> updated atomically as part of leader election. The version of that
>>> persistent file could then be used as the epoch number. All updates to
>>> files that are gated on the epoch number would only proceed if no other
>>> master has been elected, at least if you use the sync option.
>>>
>>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen <wander4...@gmail.com> wrote:
>>>
>>> Hi ZooKeepers,
>>>
>>> Recently there has been an ongoing refactor [1] in the Flink community,
>>> aimed at overcoming several inconsistent-state issues on ZK that we have
>>> met. I'm here to share our design for leader election and leader
>>> operations. A leader operation is an operation that should be committed
>>> only if the contender is the leader. I'm also CC'ing the Curator mailing
>>> list because this mail contains the reason why we cannot JUST use
>>> Curator.
>>>
>>> The rule we want to keep is:
>>>
>>> **Writes on ZK must be committed only if the contender is the leader**
>>>
>>> We represent each contender by an individual ZK client.
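Ted's epoch idea above can be sketched the same way: election atomically bumps a persistent epoch znode, its resulting version becomes the leader's epoch, and every master write is a multi gated on that version. The sketch assumes the hard part (atomicity of `elect()`) is solved, which is exactly what Ted says Curator doesn't provide; `EpochStore` is an illustrative in-memory model, not a real API.

```python
class EpochStore:
    """Toy model: the version of a persistent epoch znode is the epoch."""
    def __init__(self):
        self.epoch_version = 0     # version of the persistent epoch znode
        self.data = {}

    def elect(self):
        # Assumed atomic: becoming leader bumps the epoch znode, and the
        # new version serves as this leader's epoch number.
        self.epoch_version += 1
        return self.epoch_version

    def gated_write(self, epoch, path, value):
        # Models a multi: [check(/epoch, version=epoch), setData(path, value)]
        if epoch != self.epoch_version:
            raise RuntimeError("BadVersion: stale epoch")
        self.data[path] = value

store = EpochStore()
epoch_1 = store.elect()                     # master-1 elected, epoch 1
store.gated_write(epoch_1, "/state", "a")   # commits: epoch still current

epoch_2 = store.elect()                     # master-2 elected, epoch 2
try:
    store.gated_write(epoch_1, "/state", "b")   # master-1's stale write
    stale_rejected = False
except RuntimeError:
    stale_rejected = True
```

With the atomic bump in place, a deposed master's writes are fenced out by the version check alone; without it, the Sep 29 edge case reappears.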
>>> At the moment we use Curator for leader election, so the algorithm is
>>> the same as the optimized version on this page [2].
>>>
>>> The problem is that this algorithm only takes care of leader election
>>> and is indifferent to subsequent operations. Consider the scenario
>>> below:
>>>
>>> 1. contender-1 becomes the leader
>>> 2. contender-1 proposes a create txn-1
>>> 3. the sender thread is suspended by a full GC
>>> 4. contender-1 loses leadership and contender-2 becomes the leader
>>> 5. contender-1 recovers from the full GC; before it reacts to the
>>>    revoke-leadership event, txn-1 is retried and sent to ZK.
>>>
>>> Without any other guard, the txn will succeed on ZK, and thus
>>> contender-1 commits a write even though it is no longer the leader. This
>>> issue is also documented in this note [3].
>>>
>>> To overcome this issue, instead of just saying that we're unfortunate,
>>> we drafted two possible solutions.
>>>
>>> The first is documented here [4]. Briefly, when the contender becomes
>>> the leader, we memorize the latch path at that moment. Each subsequent
>>> operation is done in a transaction that first checks the existence of
>>> the latch path. Leadership only switches once the latch is gone, and all
>>> operations fail once the latch is gone.
>>>
>>> The second is still rough. Basically it relies on the session-expiry
>>> mechanism in ZK. We would adopt the unoptimized version of the recipe
>>> [2], given that in our scenario there are only a few contenders at a
>>> time. We create the /leader node as an ephemeral znode holding the
>>> leader information, and when the session expires we consider leadership
>>> revoked and terminate the contender. Asynchronous write operations
>>> should not succeed, because they will all fail on session expiry.
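The second (session-based) solution above can be sketched as well: /leader is an ephemeral znode tied to the contender's session, so once that session expires, every pending write issued on it is rejected rather than retried. This is a simplified in-memory model of the intended semantics only (real ZooKeeper reports expiry asynchronously, which is exactly why the contender must terminate instead of retrying); `Session` and `SessionZk` are illustrative names.

```python
class Session:
    """Toy stand-in for a ZooKeeper client session."""
    def __init__(self):
        self.expired = False

class SessionZk:
    """Toy store tracking which session owns each ephemeral znode."""
    def __init__(self):
        self.ephemerals = {}   # path -> owning session

    def create_ephemeral(self, session, path):
        if session.expired:
            raise RuntimeError("SessionExpired")
        if path in self.ephemerals and not self.ephemerals[path].expired:
            raise RuntimeError("NodeExists")
        self.ephemerals[path] = session

    def write(self, session, path):
        # Any operation on an expired session fails; the contender must
        # terminate rather than retry with a fresh session.
        if session.expired:
            raise RuntimeError("SessionExpired")
        return path

zk = SessionZk()
s1 = Session()
zk.create_ephemeral(s1, "/leader")    # contender-1 is leader

s1.expired = True                     # leadership revoked with the session
try:
    zk.write(s1, "/jobs/txn-1")       # the delayed async write cannot commit
    failed = False
except RuntimeError:
    failed = True

s2 = Session()
zk.create_ephemeral(s2, "/leader")    # contender-2 takes over
```

The safety argument hinges on never carrying operations across sessions, which is the behavior Curator's automatic re-creation of the client breaks.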
>>>
>>> We cannot adopt solution 1 with Curator because it doesn't expose the
>>> latch path (that was added recently, but not in the version we use); we
>>> cannot adopt solution 2 with Curator because, although we have to retry
>>> on connection loss, we don't want to retry on session expiry - Curator
>>> always creates a new client on session expiry and retries the operation.
>>>
>>> I'd like to learn from the ZooKeeper community: 1. is there any
>>> potential risk if we eventually adopt option 1 or option 2? 2. is there
>>> any other solution we can adopt?
>>>
>>> Best,
>>> tison.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10333
>>> [2] https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection
>>> [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10
>>> [4] https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing