Hello,

I am new to using Zookeeper and I have a quick question about the locking
recipe that can be found here:

http://hadoop.apache.org/zookeeper/docs/r3.1.2/recipes.html#sc_recipes_Locks

It appears to me that there is a flaw in this algorithm related to partial
failure, and I am curious to know how to fix it.

The algorithm follows these steps:

 1. Call "create()" with a pathname like "/some/path/to/parent/child-lock-".
 2. Call "getChildren()" on the lock node without the watch flag set.
 3. If the path created in step (1) has the lowest sequence number, you are
the master (skip the next steps).
 4. Otherwise, call "exists()" with the watch flag set on the child with the
next lowest sequence number.
 5. If "exists()" returns false, go to step (2), otherwise wait for a
notification from the path, then go to step (2).

The scenario that seems to be faulty is a partial failure in step (1).
Assume that my client program follows step (1) and calls "create()". Assume
that the call succeeds on the Zookeeper server, but there is a
ConnectionLoss event right as the server sends the response (e.g., a network
partition, some dropped packets, the ZK server goes down, etc). Assume
further that the client immediately reconnects, so the session is not timed
out. At this point there is a child node that was created by my client, but
that my client does not know about (since it never received the response).
Since my client doesn't know about the child, it won't know to watch the
previous child to it, and it also won't know to delete it. That means all
clients using that lock will fail to make progress as soon as the orphaned
child is the lowest sequence number. This state will continue until my
client closes it's session (which may be a while if I have a long lived
session, as I would like to have). Correctness is maintained here, but
live-ness is not.

The only good solution I have found for this problem is to establish a new
session with Zookeeper before acquiring a lock, and to close that session
immediately upon any connection loss in step (1). If everything works, the
session could be re-used, but you'd need to guarantee that the session was
closed if there was a failure during creation of the child node. Are there
other good solutions?

I looked at the sample code that comes with the Zookeeper distribution (I'm
using 3.2.2 right now), and it uses the current session ID as part of the
child node name. Then, if there is a failure during creation, it tries to
look up the child using that session ID. This isn't really helpful in the
environment I'm using, where a single session could be shared by multiple
threads, any of which could request a lock (so I can't uniquely identify a
lock by session ID). I could use thread ID, but then I run the risk of a
thread being reused and getting the wrong lock. In any case, there is also
the risk that a second failure prevents me from looking up the lock after a
connection loss, so I'm right back to an orphaned lock child, as above. I
could, presumably, be careful enough with try/catch logic to prevent even
that case, but it makes for pretty bug-prone code. Also, as a side note,
that code appears to be sorting the child nodes by the session ID first,
then the sequence number, which could cause locks to be ordered incorrectly.

Thanks for any help you can provide!

Charles Gordon

Reply via email to