I don’t think the situation you describe can happen. Let’s walk through this:
— Time N —
We have a single, correct leader and 2 nodes:
lock-0000000240
lock-0000000241
— Time N + D1 —
ZooKeeper leader instance is restarted. Shortly thereafter, both Curator
clients will exit their doWork() loops and mark their nodes for deletion.
Because the connection has failed, though, the 2 nodes still exist:
lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)
— Time N + D2 —
The ZooKeeper quorum is repaired and the clients start their doWork() loops
again. At this point, there can be 2, 3 or 4 nodes, depending on timing:
lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)
lock-0000000242
lock-0000000243
Neither of the instances will achieve leadership until the nodes 240/241 are
deleted.
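The reason is that the lock recipe only grants the lock to the client whose
node sorts first. A simplified sketch (not the actual Curator code, just the
node names from above) shows why 242/243 stay blocked while 240/241 exist:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LowestSequenceWins {
    // Simplified sketch, not Curator's code: a client only gets the lock when
    // its own lock node sorts first among the children of /leaderPath.
    static boolean getsTheLock(String ourNode, List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted);   // sequence suffixes sort lexicographically
        return sorted.get(0).equals(ourNode);
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList(
                "lock-0000000240",  // waiting to be deleted
                "lock-0000000241",  // waiting to be deleted
                "lock-0000000242",
                "lock-0000000243");
        System.out.println(getsTheLock("lock-0000000242", children));  // false while 240 exists
    }
}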
Of course, there may be something else that’s causing you to see 2 leaders. A
while back I discovered that rolling config changes can do it
(http://zookeeper-user.578899.n2.nabble.com/Rolling-config-change-considered-harmful-td7578761.html).
Or, it could be something going on inside Curator itself.
-Jordan
From: stibi [email protected]
Reply: [email protected] [email protected]
Date: May 14, 2014 at 11:39:48 AM
To: [email protected] [email protected]
Subject: Sometimes leader election ends up in two leaders
Hi!
I'm using Curator's Leader Election recipe (2.4.2) and found a very
hard-to-reproduce issue which could lead to a situation where both clients
become leader.
Let's say 2 clients are competing for leadership, client #1 is currently the
leader and zookeeper maintains the following structure under the leaderPath:
/leaderPath
|- _c_a8524f0b-3bd7-4df3-ae19-cef11159a7a6-lock-0000000240 (client #1)
|- _c_b5bdc75f-d2c9-4432-9d58-1f7fe699e125-lock-0000000241 (client #2)
The autoRequeue flag is set to true for both clients.
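Roughly, each client is set up like this (the connect string and the listener
body are just placeholders, not our real code):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ElectionSetup {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",              // placeholder connect string
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderSelector selector = new LeaderSelector(client, "/leaderPath",
                new LeaderSelectorListenerAdapter() {
                    @Override
                    public void takeLeadership(CuratorFramework curator) throws Exception {
                        // leadership lasts only while this method blocks
                        Thread.currentThread().join();
                    }
                });
        selector.autoRequeue();  // requeue for the election whenever leadership is lost
        selector.start();
    }
}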
Let's trigger a leader election by restarting the ZooKeeper leader.
When this happens, both clients will lose the connection to the ZooKeeper
ensemble and will try to re-acquire the LeaderSelector's mutex. Eventually
(after the negotiated session timeout) the ephemeral zNodes under /leaderPath
will be deleted.
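As far as I understand, the loss of leadership on disconnect comes from the
listener's stateChanged() handling; this paraphrases what the adapter-style
listeners do (not the exact library source):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.CancelLeadershipException;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListener;
import org.apache.curator.framework.state.ConnectionState;

// Paraphrase of the connection handling, not the exact library source.
public abstract class ConnectionAwareListener implements LeaderSelectorListener {
    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
            // throwing here interrupts takeLeadership(), so the client gives up leadership
            throw new CancelLeadershipException();
        }
    }
}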
The problem occurs when ephemeral zNode deletions interleave with mutex
acquisition.
Client #1 can observe that both zNodes (240 and 241) are already deleted and
that /leaderPath has no children, so it acquires the mutex successfully.
On the other hand, client #2 can observe that both zNodes still exist, so it
starts to watch zNode #240 (LockInternals.internalLockLoop():315). Shortly
afterwards the watcher is notified of that zNode's deletion, so client #2
re-enters LockInternals.internalLockLoop().
What is really strange is that the getSortedChildren() call in
LockInternals:284 can still return zNode #241, so client #2 succeeds in
acquiring the mutex (LockInternals:287).
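To make the race window clearer, here is a rough paraphrase of that loop (a
sketch, not the real LockInternals source; getSortedChildren() and
watchForDeletion() stand in for the real calls):

import java.util.List;

// Rough paraphrase of LockInternals.internalLockLoop(), not the real source;
// getSortedChildren() and watchForDeletion() stand in for the real calls.
abstract class LockLoopSketch {
    abstract List<String> getSortedChildren() throws Exception;    // ~LockInternals:284
    abstract void watchForDeletion(String path) throws Exception;  // blocks until NodeDeleted fires

    boolean internalLockLoop(String ourNodeName) throws Exception {
        boolean haveTheLock = false;
        while (!haveTheLock) {
            List<String> children = getSortedChildren();  // may still return a stale view
            int ourIndex = children.indexOf(ourNodeName);
            if (ourIndex == 0) {
                haveTheLock = true;                        // ~LockInternals:287, mutex acquired
            } else {
                // ~LockInternals:315, watch the node just before ours, then go around again
                // (the real code also handles our own node having vanished)
                watchForDeletion(children.get(ourIndex - 1));
            }
        }
        return haveTheLock;
    }
}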
The result is two clients that are both leader, while /leaderPath contains only
the one zNode for client #1.
Have you encountered similar problems before? Do you have any ideas on how to
prevent such race conditions? One solution I can think of: the leader should
watch its own zNode under /leaderPath and interrupt its leadership when that
zNode gets deleted.
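A rough sketch of what I mean (ourLockPath is assumed to be known here, and a
real version would have to re-register the one-shot watch):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

public class OwnNodeGuard {
    // Sketch of the proposed workaround: once leadership is taken, watch our own
    // lock node and interrupt leadership if it disappears. ourLockPath is assumed
    // to be known; the one-shot watch would need to be re-registered in practice.
    static void guard(final CuratorFramework client, final LeaderSelector selector,
                      final String ourLockPath) throws Exception {
        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    selector.interruptLeadership();  // our node is gone, give up leadership
                }
            }
        };
        // checkExists() sets the watch and also covers the node being gone already
        if (client.checkExists().usingWatcher(watcher).forPath(ourLockPath) == null) {
            selector.interruptLeadership();
        }
    }
}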
Thank you,
Tibor