> What guarantees that zNode 241 will be deleted prior to the (successful) attempt of client #2 to reacquire the mutex using zNode 241?

Because that’s how the lock works. As long as 241 exists, no other client will consider itself as having the mutex.
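For illustration, a minimal sketch of that rule follows; this is not Curator’s actual code, and the class and method names are made up. A client only treats itself as the mutex holder when its own sequential znode sorts first among the children of the lock path, so while lock-0000000241 exists, a client that created a later node (242, 243, ...) keeps waiting.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LockPredicateSketch
{
    // Children are sorted by their trailing sequence number (the part after "lock-"),
    // roughly what Curator's getSortedChildren() does, because the "_c_<uuid>-" prefix
    // is not ordered. Only the lowest-numbered node holds the mutex.
    static boolean holdsMutex(List<String> children, String ourNodeName)
    {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparing(
            (String name) -> name.substring(name.lastIndexOf("lock-") + "lock-".length())));
        return !sorted.isEmpty() && sorted.get(0).equals(ourNodeName);
    }
}

For example, holdsMutex(children, ourNode) is false for the client that created lock-0000000242 as long as lock-0000000241 is still in the children list it sees.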
> reacquire the mutex using zNode 241?

This is not what happens. The client will try to acquire using a _different_ znode. Are you thinking that 241 is re-used? It’s not.

-JZ

From: stibi [email protected]
Reply: stibi [email protected]
Date: May 22, 2014 at 7:26:57 AM
To: Jordan Zimmerman [email protected], [email protected] [email protected]
Subject: Re: Sometimes leader election ends up in two leaders

Hi!

Thanks for the quick response. About this step:

— Time N + D2 —
The ZooKeeper quorum is repaired and the nodes start a doWork() loop again. At this point, there can be 2, 3 or 4 nodes, depending on timing.

lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)
lock-0000000242
lock-0000000243

Neither of the instances will achieve leadership until the nodes 240/241 are deleted.

What guarantees that zNode 241 will be deleted prior to the (successful) attempt of client #2 to reacquire the mutex using zNode 241? AFAIK node deletion is a background operation, and a retry policy controls how often a deletion attempt will occur (even for guaranteed deletes). Unlucky timing can lead to a situation where the deletion of zNode 241 happens after the mutex acquisition. In this case the mutex is not released by the leader, but since the zNodes are deleted, the other client will also be elected as leader.

Thanks,
Tibor

On Thu, May 15, 2014 at 3:37 AM, Jordan Zimmerman <[email protected]> wrote:

I don’t think the situation you describe can happen. Let’s walk through this:

— Time N —
We have a single, correct leader and 2 nodes:

lock-0000000240
lock-0000000241

— Time N + D1 —
The ZooKeeper leader instance is restarted. Shortly thereafter, both Curator clients will exit their doWork() loops and mark their nodes for deletion. Due to a failed connection, though, there are still the 2 nodes:

lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)

— Time N + D2 —
The ZooKeeper quorum is repaired and the nodes start a doWork() loop again. At this point, there can be 2, 3 or 4 nodes, depending on timing.

lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)
lock-0000000242
lock-0000000243

Neither of the instances will achieve leadership until the nodes 240/241 are deleted.

Of course, there may be something else that’s causing you to see 2 leaders. A while back I discovered that rolling config changes can do it (http://zookeeper-user.578899.n2.nabble.com/Rolling-config-change-considered-harmful-td7578761.html). Or, there’s something else going on in Curator.

-Jordan

From: stibi [email protected]
Reply: [email protected] [email protected]
Date: May 14, 2014 at 11:39:48 AM
To: [email protected] [email protected]
Subject: Sometimes leader election ends up in two leaders

Hi!

I'm using Curator's Leader Election recipe (2.4.2) and found a very hard-to-reproduce issue which could lead to a situation where both clients become leader.

Let's say 2 clients are competing for leadership; client #1 is currently the leader, and ZooKeeper maintains the following structure under the leaderPath:

/leaderPath
|- _c_a8524f0b-3bd7-4df3-ae19-cef11159a7a6-lock-0000000240 (client #1)
|- _c_b5bdc75f-d2c9-4432-9d58-1f7fe699e125-lock-0000000241 (client #2)

The autoRequeue flag is set to true for both clients.

Let's trigger a leader election by restarting the ZooKeeper leader. When this happens, both clients will lose the connection to the ZooKeeper ensemble and will try to re-acquire the LeaderSelector's mutex.
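For context, the setup described in the report corresponds roughly to the sketch below. Only the /leaderPath path, the Curator version (2.4.2), and the autoRequeue flag come from the thread; the connection string, retry policy, and work loop are illustrative, and LeaderSelectorListenerAdapter is used for brevity (it relinquishes leadership when the connection is SUSPENDED or LOST).

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSetupSketch
{
    public static void main(String[] args) throws Exception
    {
        // illustrative connection string and retry policy
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderSelector selector = new LeaderSelector(client, "/leaderPath", new LeaderSelectorListenerAdapter()
        {
            @Override
            public void takeLeadership(CuratorFramework curator) throws Exception
            {
                // leadership lasts until this method returns or throws; the adapter
                // throws CancelLeadershipException on SUSPENDED/LOST connection states
                while (!Thread.currentThread().isInterrupted())
                {
                    Thread.sleep(1000);
                }
            }
        });
        selector.autoRequeue();   // re-enter the election whenever leadership is relinquished
        selector.start();

        Thread.currentThread().join();   // keep the example process alive
    }
}

Each contending selector creates one ephemeral sequential znode under /leaderPath; those are the _c_<uuid>-lock-00000002xx children shown in the listing above.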
Eventually (after the negotiated session timeout) the ephemeral zNodes under /leaderPath will be deleted. The problem occurs when the ephemeral zNode deletions interleave with mutex acquisition.

Client #1 can observe that both zNodes (240 and 241) are already deleted; /leaderPath has no children, so it acquires the mutex successfully. On the other hand, client #2 can observe that both zNodes still exist, so it starts to watch zNode #240 (LockInternals.internalLockLoop():315). In a short period of time the watcher will be notified about the zNode's deletion, so client #2 re-enters LockInternals.internalLockLoop(). What is really strange is that the getSortedChildren() call in LockInternals:284 can still return zNode #241, so it will succeed in acquiring the mutex (LockInternals:287).

The result is two clients, both leader, but /leaderPath contains only one zNode, for client #1.

Did you encounter similar problems before? Do you have any ideas on how to prevent such race conditions? I can think of one solution: the leader should watch its zNode under /leaderPath and interrupt leadership when the zNode gets deleted.

Thank you,
Tibor
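For what it's worth, here is a rough sketch of the mitigation proposed above, i.e. the leader watching its own znode and stepping down when it is deleted. Everything in it is hypothetical: the helper name, how ourLockNode is obtained, and the interruptLeadership callback (which could, for instance, call LeaderSelector.interruptLeadership()).

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

final class SelfNodeWatchSketch
{
    // ourLockNode is the full path of this client's znode under /leaderPath,
    // e.g. "/leaderPath/_c_<uuid>-lock-0000000240"; how it is discovered is out of
    // scope for this sketch.
    static void watchOwnNode(final CuratorFramework client, final String ourLockNode,
                             final Runnable interruptLeadership) throws Exception
    {
        Watcher watcher = new Watcher()
        {
            @Override
            public void process(WatchedEvent event)
            {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted)
                {
                    // our node vanished while we may still believe we are leader: step down
                    interruptLeadership.run();
                }
                // ZooKeeper watches are one-shot; a real implementation would re-register
                // the watch after every event and after reconnects
            }
        };

        if (client.checkExists().usingWatcher(watcher).forPath(ourLockNode) == null)
        {
            // the node is already gone: give up leadership immediately
            interruptLeadership.run();
        }
    }
}

Whether this closes every interleaving of the race described above would still need to be verified; it is only meant to show the shape of the idea.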
