Re: Leader election and leader operation based on zookeeper

Jordan Zimmerman Sat, 21 Sep 2019 08:49:34 -0700

Here's a first pass Curator recipe to handle the transaction part (note you add 
a check(path, version) on the coordination node, not a setData).


        https://gist.github.com/Randgalt/1a19dcd215e202936e5b92c121fc73de 
<https://gist.github.com/Randgalt/1a19dcd215e202936e5b92c121fc73de>

When you obtain leadership you call setLeader(). Then every time you need to 
coordinate you include the operation returned by transactionOp().

I need to do more edge case testing. But, assuming this works I'll open a 
ticket to add this as a new Curator recipe.

-JZ

> On Sep 21, 2019, at 10:05 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> 
> wrote:
> 
> I took a quick look at "Rethink High-Availability Stores" and I think using 
> the leader path may not work. I think the best solution will be something 
> akin to combining a leader election with a common ZNode versioning scheme. 
> i.e.
> 
> Create a single ZNode to be used for coordination
> Elect a leader in the normal manner
> When an instance becomes leader it:
> Gets the version of the coordination ZNode
> Sets the data for that ZNode (the contents don't matter) using the retrieved 
> version number
> If the set succeeds you can be assured you are currently leader (otherwise 
> release leadership and re-contend)
> Save the new version
> Every time you need to perform a transaction, include a setData of the 
> coordination ZNode with the saved version number
> As long as this succeeds you should be guaranteed of consistency
> If it ever fails with an invalid version number, release leadership and 
> re-contend
> 
> I haven't completely thought this through so others should try to poke holes 
> in it. When I get a chance, I'll try to write a test to prove this.
> 
> -Jordan
> 
>> On Sep 20, 2019, at 3:30 AM, Zili Chen <wander4...@gmail.com 
>> <mailto:wander4...@gmail.com>> wrote:
>> 
>> Hi ZooKeepers,
>> 
>> Recently there is an ongoing refactor[1] in Flink community aimed at
>> overcoming several inconsistent state issues on ZK we have met. I come
>> here to share our design of leader election and leader operation. For
>> leader operation, it is operation that should be committed only if the
>> contender is the leader. Also CC Curator mailing list because it also
>> contains the reason why we cannot JUST use Curator.
>> 
>> The rule we want to keep is
>> 
>> **Writes on ZK must be committed only if the contender is the leader**
>> 
>> We represent contender by an individual ZK client. At the moment we use
>> Curator for leader election so the algorithm is the same as the
>> optimized version in this page[2].
>> 
>> The problem is that this algorithm only take care of leader election but
>> is indifferent to subsequent operations. Consider the scenario below:
>> 
>> 1. contender-1 becomes the leader
>> 2. contender-1 proposes a create txn-1
>> 3. sender thread suspended for full gc
>> 4. contender-1 lost leadership and contender-2 becomes the leader
>> 5. contender-1 recovers from full gc, before it reacts to revoke
>> leadership event, txn-1 retried and sent to ZK.
>> 
>> Without other guard txn will success on ZK and thus contender-1 commit
>> a write operation even if it is no longer the leader. This issue is
>> also documented in this note[3].
>> 
>> To overcome this issue instead of just saying that we're unfortunate,
>> we draft two possible solution.
>> 
>> The first is document here[4]. Briefly, when the contender becomes the
>> leader, we memorize the latch path at that moment. And for
>> subsequent operations, we do in a transaction first checking the
>> existence of the latch path. Leadership is only switched if the latch
>> gone, and all operations will fail if the latch gone.
>> 
>> The second is still rough. Basically it relies on session expire
>> mechanism in ZK. We will adopt the unoptimized version in the
>> recipe[2] given that in our scenario there are only few contenders
>> at the same time. Thus we create /leader node as ephemeral znode with
>> leader information and when session expired we think leadership is
>> revoked and terminate the contender. Asynchronous write operations
>> should not succeed because they will all fail on session expire.
>> 
>> We cannot adopt 1 using Curator because it doesn't expose the latch
>> path(which is added recently, but not in the version we use); we
>> cannot adopt 2 using Curator because although we have to retry on
>> connection loss but we don't want to retry on session expire. Curator
>> always creates a new client on session expire and retry the operation.
>> 
>> I'd like to learn from ZooKeeper community that 1. is there any
>> potential risk if we eventually adopt option 1 or option 2? 2. is
>> there any other solution we can adopt?
>> 
>> Best,
>> tison.
>> 
>> [1] https://issues.apache.org/jira/browse/FLINK-10333 
>> <https://issues.apache.org/jira/browse/FLINK-10333>
>> [2] https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection 
>> <https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection>
>> [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10 
>> <https://cwiki.apache.org/confluence/display/CURATOR/TN10>
>> [4] 
>> https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing
>>  
>> <https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing>
>> 
>

Re: Leader election and leader operation based on zookeeper

Reply via email to