Hi!

My name is Łukasz Osipiuk. I work for one of the major Polish
Internet companies.
In one of our projects we use ZooKeeper intensively as a
distributed locking system. We implemented a slightly modified version
of the locking algorithm from the ZooKeeper recipes page
(http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_recipes_Locks).

Unfortunately, we are experiencing problems with deadlocks. From what I
have seen so far, either we are misusing ZooKeeper in some way or it is
buggy. Our app is written in C++ and we use the zookeeper_mt C library.

The tests below were done using server version 3.1.1 and client library
version 3.2.0, but in production we run both client and server at
3.1.1 and experience the same problems.

I attach the code snippet I wrote to isolate our problems. When I run it
and randomly kill ZooKeeper nodes while it is running, I (from time to
time) get one of the following behaviors:

1. The zoo_create() call returns an error, but the node is still created
   in ZooKeeper. If this happens in the locking protocol, we get a hanging
   lock without an owner, which never disappears. Closing the client's
   ZooKeeper session is needed to remove such a hanging ephemeral node.

2. The application thread just hangs. From what I observed in gdb, it is
   waiting for a synchronous operation to complete (function
   wait_sync_completion).
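
To make case 1 concrete, here is a minimal sketch of how a caller might classify the return code of zoo_create(). It is only an illustration, not what our code does today. The namespace, enum and function names are hypothetical, and the numeric values are assumed to mirror ZOK, ZCONNECTIONLOSS, ZOPERATIONTIMEOUT and ZNODEEXISTS from zookeeper.h; real code should include that header instead of redefining them. The sketch also assumes that on connection loss or timeout the request may already have reached the server, so the outcome is unknown rather than failed.

```cpp
// Hypothetical sketch: classify the result of a synchronous create call.
// The numeric values are ASSUMED to match zookeeper.h; real code should use
// ZOK, ZCONNECTIONLOSS, ZOPERATIONTIMEOUT and ZNODEEXISTS from that header.
namespace sketch {

const int kOk = 0;
const int kConnectionLoss = -4;
const int kOperationTimeout = -7;
const int kNodeExists = -110;

enum CreateOutcome {
  CREATED,          // server confirmed that the node was created
  ALREADY_EXISTS,   // server confirmed that the node was already there
  OUTCOME_UNKNOWN,  // request may or may not have been applied
  FAILED            // any other error; treated here as "nothing created"
};

inline CreateOutcome classifyCreate(int retcode) {
  if (retcode == kOk) {
    return CREATED;
  }
  if (retcode == kNodeExists) {
    return ALREADY_EXISTS;
  }
  if (retcode == kConnectionLoss || retcode == kOperationTimeout) {
    // The client lost contact before it saw a reply, so it cannot tell
    // whether the server applied the create.
    return OUTCOME_UNKNOWN;
  }
  return FAILED;
}

}  // namespace sketch
```

With OUTCOME_UNKNOWN the caller would still have to check (for example with zoo_exists()) whether the node is there before deciding that the lock attempt failed.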

Is there a way to avoid these problems? Are we doing something wrong, or
should we file a bug report?
Is anyone here using ZooKeeper as a distributed locking service with
more success?

Any help is really appreciated.

PS. To compile the code snippet, use:
g++ credel.cc -o credel -pedantic -lzookeeper_mt

-- 
Łukasz Osipiuk
mailto:luk...@osipiuk.net
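// credel.cc: attached test program
#include <string>
#include <zookeeper/zookeeper.h>
#include <stdlib.h>  // exit
#include <assert.h>
#include <stdio.h>   // printf
#include <unistd.h>  // sleep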


int main() {
  static std::string HOSTS="zookeeper-cluster-1.atm:2181,zookeeper-cluster-2.atm:2181,zookeeper-cluster-3.atm:2181";
  static std::string NODE="/credel";
  int retcode;
  struct Stat stat;
  char name[128];
  // create zhandle
  zhandle_t* zhandle = zookeeper_init(HOSTS.c_str(), NULL, 5000, 0, NULL, 0);

  // just wait for connection for sake of simplicity  
  sleep(3);

  for(;;) {

    // cleanup
    retcode = zoo_delete(zhandle, NODE.c_str(), -1);
    printf("initial delete; retcode=%d\n", retcode);
    if (retcode != ZOK && retcode != ZNONODE) {
      continue;
    }

    retcode = zoo_create(zhandle, NODE.c_str(), "", 0, &ZOO_OPEN_ACL_UNSAFE, 0, name, 128);
    if (retcode != ZOK && retcode != ZNODEEXISTS) {
      printf("node creation returned error; retcode=%d\n", retcode);
      
      // check if node exists
      for(;;) 
      {
        retcode = zoo_exists(zhandle, NODE.c_str(), 0, &stat);
        
        if (retcode == ZOK) {
          printf("ERROR create returned error but node exists\n");
          exit(1);
        }
        if (retcode != ZNONODE) {
          printf("ERROR while checking if node exists, retrying; retcode=%d\n", retcode);
          continue;
        }
        assert(retcode == ZNONODE); // ok
        // the failed create really did not create the node; go back to the outer loop
        break;
      }
    }
  }
}
