I don’t think the situation you describe can happen. Let’s walk through this:

— Time N — 
We have a single, correct leader and 2 lock nodes:
        lock-0000000240
        lock-0000000241

— Time N + D1 — 
The ZooKeeper leader instance is restarted. Shortly thereafter, both Curator 
clients will exit their doWork() loops and mark their nodes for deletion. 
Because the connection has failed, though, the 2 nodes are still there:
        lock-0000000240 (waiting to be deleted)
        lock-0000000241 (waiting to be deleted)

— Time N + D2 — 
The ZooKeeper quorum is repaired and the clients start their doWork() loops 
again. At this point, there can be 2, 3 or 4 nodes depending on timing. 
        lock-0000000240 (waiting to be deleted)
        lock-0000000241 (waiting to be deleted)
        lock-0000000242
        lock-0000000243
Neither of the instances will achieve leadership until the nodes 240/241 are 
deleted.
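
Note that this walkthrough assumes each client relinquishes leadership as soon 
as the connection is suspended or lost, i.e. its stateChanged() handler cancels 
leadership on SUSPENDED/LOST. A minimal sketch of that pattern (the class name 
is just illustrative):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.CancelLeadershipException;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListener;
import org.apache.curator.framework.state.ConnectionState;

// Sketch: give up leadership as soon as the connection to the ensemble is
// suspended or lost, which is the behavior the walkthrough above assumes.
public class RelinquishingListener implements LeaderSelectorListener
{
    @Override
    public void takeLeadership(CuratorFramework client) throws Exception
    {
        // Leadership work goes here; returning (or being interrupted) gives it up.
        while (!Thread.currentThread().isInterrupted())
        {
            Thread.sleep(1000);
        }
    }

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState)
    {
        if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST)
        {
            // CancelLeadershipException makes the selector interrupt takeLeadership()
            throw new CancelLeadershipException();
        }
    }
}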

Of course, something else may be causing you to see 2 leaders. A while back I 
discovered that rolling config changes can do it 
(http://zookeeper-user.578899.n2.nabble.com/Rolling-config-change-considered-harmful-td7578761.html).
 Or there may be something else going on in Curator. 

-Jordan


From: stibi [email protected]
Reply: [email protected]
Date: May 14, 2014 at 11:39:48 AM
To: [email protected]
Subject: Sometimes leader election ends up in two leaders

Hi!

I'm using Curator's Leader Election recipe (2.4.2) and found a very 
hard-to-reproduce issue which could lead to a situation where both clients 
become leader.

Let's say 2 clients are competing for leadership, client #1 is currently the 
leader and zookeeper maintains the following structure under the leaderPath:

/leaderPath
  |- _c_a8524f0b-3bd7-4df3-ae19-cef11159a7a6-lock-0000000240 (client #1)
  |- _c_b5bdc75f-d2c9-4432-9d58-1f7fe699e125-lock-0000000241 (client #2)

The autoRequeue flag is set to true for both clients.
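
For reference, each client is wired up roughly like this (the connection 
string, retry policy and listener are placeholders for what we actually use):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ElectionSetup
{
    // Roughly how each of the two clients is set up; connection string,
    // retry policy and listener stand in for the real application's values.
    static LeaderSelector startSelector(LeaderSelectorListener listener)
    {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderSelector selector = new LeaderSelector(client, "/leaderPath", listener);
        selector.autoRequeue();   // the autoRequeue flag mentioned above
        selector.start();
        return selector;
    }
}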

Let's trigger a leader election by restarting the ZooKeeper leader.

When this happens, both clients will lose the connection to the ZooKeeper 
ensemble and will try to re-acquire the LeaderSelector's mutex. Eventually 
(after the negotiated session timeout) the ephemeral zNodes under /leaderPath 
will be deleted.

The problem occurs when ephemeral zNode deletions interleave with mutex 
acquisition.
  
Client #1 can observe that both zNodes (240 and 241) are already deleted: 
/leaderPath has no children, so it acquires the mutex successfully.

On the other hand, client #2 can observe that both zNodes still exist, so it 
starts to watch zNode #240 (LockInternals.internalLockLoop():315). In a short 
period of time the watcher will be notified about the zNode's deletion, so 
client #2 reenters LockInternals.internalLockLoop().

What is really strange is that the getSortedChildren() call in 
LockInternals:284 can still return zNode #241, so client #2 succeeds in 
acquiring the mutex (LockInternals:287).

The result is two clients, both leader, but /leaderPath contains only one 
zNode, the one for client #1.

Have you encountered similar problems before? Do you have any ideas on how to 
prevent such race conditions? I can think of one possible solution: the leader 
should watch its own zNode under /leaderPath and interrupt leadership when that 
zNode gets deleted.
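
Something along these lines is what I have in mind. It is only a sketch: it 
assumes the leader can find out the path of its own lock node (the ownLockNode 
parameter here is hypothetical) and simply interrupts the thread doing the 
leadership work when that node disappears:

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Sketch of the proposed safeguard: while holding leadership, watch our own
// lock node under /leaderPath and interrupt leadership if it gets deleted.
public class SelfWatchingLeadership
{
    public void doLeadershipWork(CuratorFramework client, String ownLockNode) throws Exception
    {
        final Thread leaderThread = Thread.currentThread();

        Watcher deletionWatcher = new Watcher()
        {
            @Override
            public void process(WatchedEvent event)
            {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted)
                {
                    // Our zNode vanished: we can no longer prove we are the leader.
                    leaderThread.interrupt();
                }
            }
        };

        // Set the watch; if the node is already gone, bail out immediately.
        if (client.checkExists().usingWatcher(deletionWatcher).forPath(ownLockNode) == null)
        {
            return;
        }

        // ... do the actual leadership work here, honoring interruption ...
        while (!Thread.currentThread().isInterrupted())
        {
            Thread.sleep(1000);
        }
    }
}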

Thank you,
Tibor
