I set up a single ZooKeeper instance using the binaries distributed with
Ubuntu 12.04. I downloaded the 3.3.5 source and compiled the C-based locking
recipe. I built this into a program of mine and ran into a problem, so I have
some questions.
If I wanted to create 1000 locks, would I set up the locks as follows?
/lock/0
/lock/1
...
/lock/999
Is this correct?
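
To make the layout above concrete, here is a minimal sketch of generating one path per lock. The helper name lock_path and its parameters are hypothetical and not part of the recipe. The assumption is that each resulting path would then be passed as the lock's parent path when initializing a mutex with the recipe (zkr_lock_init in zoo_lock.h), which has not been verified here.

```c
#include <stdio.h>

/* Hypothetical helper (not part of the ZooKeeper lock recipe):
 * writes "<root>/<index>" into buf, e.g. "/lock/42".
 * Returns 0 on success, -1 if buf is too small or formatting fails. */
static int lock_path(char *buf, size_t len, const char *root, int index)
{
    int n = snprintf(buf, len, "%s/%d", root, index);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would loop over 0..999, build each path into a small buffer, and check the return value before using the path.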
I was running an example with two clients competing for one lock, both running
on the same machine as the ZooKeeper instance. I found that zkr_lock_lock()
would often fail to acquire the lock, so I put the call in a loop with 1000
retries. That seems to make it work most of the time, but other times there
would still be a failure at zoo_lock.c:301:
// cannot watch my predecessor i am giving up
// we need to be able to watch the predecessor
// since if we do not become a leader the others
// will keep waiting
[301] if (ret != ZOK) {
free_String_vector(vector);
I added a printf to see what ret was, and it was ZNONODE. Looking at the code
above this spot, get_children is called, the results are sorted, and later
zoo_wexists is called. Is it reasonable that the state could change between
these two calls, for example if the predecessor deletes its node after the
listing but before the watch is set? I added a statement so that if the result
is ZNONODE, it does a goto back to where get_children is called, so the
algorithm runs again.
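
The retry idea can also be sketched as a bounded loop rather than an unbounded goto. This is a standalone simulation, not the recipe's code: SKETCH_OK and SKETCH_NONODE are stand-ins for ZOK and ZNONODE, watch_step_fn is a hypothetical callback representing "list children, pick the predecessor, set the watch", and flaky_step simulates a predecessor that disappears a given number of times.

```c
#include <stddef.h>

/* Stand-ins for ZOK and ZNONODE; real code would use zookeeper.h. */
enum { SKETCH_OK = 0, SKETCH_NONODE = -101 };

/* Hypothetical callback: one pass of "list children, find predecessor,
 * set a watch on it". Returns a status code. */
typedef int (*watch_step_fn)(void *ctx);

/* Re-run the whole step while it reports that the predecessor vanished,
 * but give up after max_attempts passes so the caller cannot spin forever. */
static int watch_with_rescan(watch_step_fn step, void *ctx, int max_attempts)
{
    int ret = SKETCH_NONODE;
    for (int i = 0; i < max_attempts && ret == SKETCH_NONODE; i++) {
        ret = step(ctx);
    }
    return ret;
}

/* Simulated step: the predecessor vanishes *ctx more times, after which
 * the watch succeeds. */
static int flaky_step(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return SKETCH_NONODE;
    }
    return SKETCH_OK;
}
```

With a cap on the number of passes, a persistent ZNONODE still reaches the existing failure path instead of looping indefinitely. Whether a cap is appropriate for the real recipe is an open question.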
That change seems to make the code work all the time now, but I'm not sure
whether it is correct. I've included the diff below. Is it expected that
zkr_lock_lock will fail periodically, since it only tries to acquire the lock 4
times?
thanks for any help,
kevin
--- zoo_lock.c.orig 2012-06-15 00:37:53.880508812 -0500
+++ zoo_lock.c 2012-06-15 00:41:41.304518262 -0500
@@ -273,6 +273,7 @@ static int zkr_lock_operation(zkr_lock_m
mutex->id = getName(retbuf);
}
+tryagain:
if (mutex->id != NULL) {
ret = ZCONNECTIONLOSS;
ret = retry_getchildren(zh, path, vector, ts, retry);
@@ -299,7 +300,9 @@ static int zkr_lock_operation(zkr_lock_m
// will keep waiting
if (ret != ZOK) {
free_String_vector(vector);
+ if (ret == ZNONODE) goto tryagain;
LOG_WARN(("unable to watch my predecessor"));
+ printf("zret = %d\n", ret);
ret = zkr_lock_unlock(mutex);
while (ret == 0) {
//we have to give up our leadership