> What guarantees that zNode 241 will be deleted prior to the (successful) attempt of client #2 to reacquire the mutex using zNode 241?

Because that’s how the lock works. As long as 241 exists, no other client will consider itself as having the mutex.
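For illustration, a minimal sketch of that rule follows; this is not Curator’s actual code, and the class and method names are made up. A client only treats itself as the mutex holder when its own sequential znode sorts first among the children of the lock path, so while lock-0000000241 exists, a client that created a later node (242, 243, ...) keeps waiting.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LockPredicateSketch
{
    // Children are sorted by their trailing sequence number (the part after "lock-"),
    // roughly what Curator's getSortedChildren() does, because the "_c_<uuid>-" prefix
    // is not ordered. Only the lowest-numbered node holds the mutex.
    static boolean holdsMutex(List<String> children, String ourNodeName)
    {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparing(
            (String name) -> name.substring(name.lastIndexOf("lock-") + "lock-".length())));
        return !sorted.isEmpty() && sorted.get(0).equals(ourNodeName);
    }
}

For example, holdsMutex(children, ourNode) is false for the client that created lock-0000000242 as long as lock-0000000241 is still in the children list it sees.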
> reacquire the mutex using zNode 241?

This is not what happens. The client will try to acquire using a _different_ znode. Are you thinking that 241 is re-used? It’s not.

-JZ

From: stibi [email protected]
Reply: stibi [email protected]
Date: May 22, 2014 at 7:26:57 AM
To: Jordan Zimmerman [email protected], [email protected] [email protected]
Subject: Re: Sometimes leader election ends up in two leaders

Hi!

Thanks for the quick response. About this step:

— Time N + D2 —
The ZooKeeper quorum is repaired and the nodes start a doWork() loop again. At this point, there can be 2, 3 or 4 nodes, depending on timing.

lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)
lock-0000000242
lock-0000000243

Neither of the instances will achieve leadership until the nodes 240/241 are deleted.

What guarantees that zNode 241 will be deleted prior to the (successful) attempt of client #2 to reacquire the mutex using zNode 241? AFAIK node deletion is a background operation, and a retry policy controls how often a deletion attempt will occur (even for guaranteed deletes). Unlucky timing can lead to a situation where the deletion of zNode 241 happens after the mutex acquisition. In this case the mutex is not released by the leader, but since the zNodes are deleted, the other client will also be elected as leader.

Thanks,
Tibor

On Thu, May 15, 2014 at 3:37 AM, Jordan Zimmerman <[email protected]> wrote:

I don’t think the situation you describe can happen. Let’s walk through this:

— Time N —
We have a single, correct leader and 2 nodes:

lock-0000000240
lock-0000000241

— Time N + D1 —
The ZooKeeper leader instance is restarted. Shortly thereafter, both Curator clients will exit their doWork() loops and mark their nodes for deletion. Due to a failed connection, though, there are still the 2 nodes:

lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)

— Time N + D2 —
The ZooKeeper quorum is repaired and the nodes start a doWork() loop again. At this point, there can be 2, 3 or 4 nodes, depending on timing.

lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)
lock-0000000242
lock-0000000243

Neither of the instances will achieve leadership until the nodes 240/241 are deleted.

Of course, there may be something else that’s causing you to see 2 leaders. A while back I discovered that rolling config changes can do it (http://zookeeper-user.578899.n2.nabble.com/Rolling-config-change-considered-harmful-td7578761.html). Or, there’s something else going on in Curator.

-Jordan

From: stibi [email protected]
Reply: [email protected] [email protected]
Date: May 14, 2014 at 11:39:48 AM
To: [email protected] [email protected]
Subject: Sometimes leader election ends up in two leaders

Hi!

I'm using Curator's Leader Election recipe (2.4.2) and found a very hard-to-reproduce issue which could lead to a situation where both clients become leader.

Let's say 2 clients are competing for leadership; client #1 is currently the leader, and ZooKeeper maintains the following structure under the leaderPath:

/leaderPath
|- _c_a8524f0b-3bd7-4df3-ae19-cef11159a7a6-lock-0000000240 (client #1)
|- _c_b5bdc75f-d2c9-4432-9d58-1f7fe699e125-lock-0000000241 (client #2)

The autoRequeue flag is set to true for both clients.

Let's trigger a leader election by restarting the ZooKeeper leader. When this happens, both clients will lose the connection to the ZooKeeper ensemble and will try to re-acquire the LeaderSelector's mutex.
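For context, the setup described in the report corresponds roughly to the sketch below. Only the /leaderPath path, the Curator version (2.4.2), and the autoRequeue flag come from the thread; the connection string, retry policy, and work loop are illustrative, and LeaderSelectorListenerAdapter is used for brevity (it relinquishes leadership when the connection is SUSPENDED or LOST).

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSetupSketch
{
    public static void main(String[] args) throws Exception
    {
        // illustrative connection string and retry policy
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderSelector selector = new LeaderSelector(client, "/leaderPath", new LeaderSelectorListenerAdapter()
        {
            @Override
            public void takeLeadership(CuratorFramework curator) throws Exception
            {
                // leadership lasts until this method returns or throws; the adapter
                // throws CancelLeadershipException on SUSPENDED/LOST connection states
                while (!Thread.currentThread().isInterrupted())
                {
                    Thread.sleep(1000);
                }
            }
        });
        selector.autoRequeue();   // re-enter the election whenever leadership is relinquished
        selector.start();

        Thread.currentThread().join();   // keep the example process alive
    }
}

Each contending selector creates one ephemeral sequential znode under /leaderPath; those are the _c_<uuid>-lock-00000002xx children shown in the listing above.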
Eventually (after the negotiated session timeout) the ephemeral zNodes under /leaderPath will be deleted. The problem occurs when the ephemeral zNode deletions interleave with mutex acquisition.

Client #1 can observe that both zNodes (240 and 241) are already deleted; /leaderPath has no children, so it acquires the mutex successfully. On the other hand, client #2 can observe that both zNodes still exist, so it starts to watch zNode #240 (LockInternals.internalLockLoop():315). In a short period of time the watcher will be notified about the zNode's deletion, so client #2 re-enters LockInternals.internalLockLoop(). What is really strange is that the getSortedChildren() call in LockInternals:284 can still return zNode #241, so it will succeed in acquiring the mutex (LockInternals:287).

The result is two clients, both leader, but /leaderPath contains only one zNode, for client #1.

Did you encounter similar problems before? Do you have any ideas on how to prevent such race conditions? I can think of one solution: the leader should watch its zNode under /leaderPath and interrupt leadership when the zNode gets deleted.

Thank you,
Tibor
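For what it's worth, here is a rough sketch of the mitigation proposed above, i.e. the leader watching its own znode and stepping down when it is deleted. Everything in it is hypothetical: the helper name, how ourLockNode is obtained, and the interruptLeadership callback (which could, for instance, call LeaderSelector.interruptLeadership()).

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

final class SelfNodeWatchSketch
{
    // ourLockNode is the full path of this client's znode under /leaderPath,
    // e.g. "/leaderPath/_c_<uuid>-lock-0000000240"; how it is discovered is out of
    // scope for this sketch.
    static void watchOwnNode(final CuratorFramework client, final String ourLockNode,
                             final Runnable interruptLeadership) throws Exception
    {
        Watcher watcher = new Watcher()
        {
            @Override
            public void process(WatchedEvent event)
            {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted)
                {
                    // our node vanished while we may still believe we are leader: step down
                    interruptLeadership.run();
                }
                // ZooKeeper watches are one-shot; a real implementation would re-register
                // the watch after every event and after reconnects
            }
        };

        if (client.checkExists().usingWatcher(watcher).forPath(ourLockNode) == null)
        {
            // the node is already gone: give up leadership immediately
            interruptLeadership.run();
        }
    }
}

Whether this closes every interleaving of the race described above would still need to be verified; it is only meant to show the shape of the idea.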
