Hello, I am new to using Zookeeper and I have a quick question about the locking recipe that can be found here:
http://hadoop.apache.org/zookeeper/docs/r3.1.2/recipes.html#sc_recipes_Locks It appears to me that there is a flaw in this algorithm related to partial failure, and I am curious to know how to fix it. The algorithm follows these steps: 1. Call "create()" with a pathname like "/some/path/to/parent/child-lock-". 2. Call "getChildren()" on the lock node without the watch flag set. 3. If the path created in step (1) has the lowest sequence number, you are the master (skip the next steps). 4. Otherwise, call "exists()" with the watch flag set on the child with the next lowest sequence number. 5. If "exists()" returns false, go to step (2), otherwise wait for a notification from the path, then go to step (2). The scenario that seems to be faulty is a partial failure in step (1). Assume that my client program follows step (1) and calls "create()". Assume that the call succeeds on the Zookeeper server, but there is a ConnectionLoss event right as the server sends the response (e.g., a network partition, some dropped packets, the ZK server goes down, etc). Assume further that the client immediately reconnects, so the session is not timed out. At this point there is a child node that was created by my client, but that my client does not know about (since it never received the response). Since my client doesn't know about the child, it won't know to watch the previous child to it, and it also won't know to delete it. That means all clients using that lock will fail to make progress as soon as the orphaned child is the lowest sequence number. This state will continue until my client closes it's session (which may be a while if I have a long lived session, as I would like to have). Correctness is maintained here, but live-ness is not. The only good solution I have found for this problem is to establish a new session with Zookeeper before acquiring a lock, and to close that session immediately upon any connection loss in step (1). If everything works, the session could be re-used, but you'd need to guarantee that the session was closed if there was a failure during creation of the child node. Are there other good solutions? I looked at the sample code that comes with the Zookeeper distribution (I'm using 3.2.2 right now), and it uses the current session ID as part of the child node name. Then, if there is a failure during creation, it tries to look up the child using that session ID. This isn't really helpful in the environment I'm using, where a single session could be shared by multiple threads, any of which could request a lock (so I can't uniquely identify a lock by session ID). I could use thread ID, but then I run the risk of a thread being reused and getting the wrong lock. In any case, there is also the risk that a second failure prevents me from looking up the lock after a connection loss, so I'm right back to an orphaned lock child, as above. I could, presumably, be careful enough with try/catch logic to prevent even that case, but it makes for pretty bug-prone code. Also, as a side note, that code appears to be sorting the child nodes by the session ID first, then the sequence number, which could cause locks to be ordered incorrectly. Thanks for any help you can provide! Charles Gordon