Hi! My name is Łukasz Osipiuk. I work for one of the major Polish Internet companies. In one of our projects we use ZooKeeper intensively as a distributed locking system. We implemented a slightly modified version of the locking algorithm from the ZooKeeper docs page (http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_recipes_Locks).
Unfortunately, we are experiencing problems with deadlocks. From what I have seen while examining the problem, either we are misusing ZooKeeper in some way or it is buggy.

Our app is written in C++ and uses the zookeeper_mt C library. The tests below were done with server version 3.1.1 and client library version 3.2.0, but in production both client and server are 3.1.1 and we see the same problems there.

I attach the code snippet I wrote to isolate the problem. If I randomly kill ZooKeeper nodes while it is running, I get (from time to time) one of the following behaviors:

1. The zoo_create() call returns an error, but the node is still created in ZooKeeper. If this happens in the locking protocol, we are left with a hanging lock without an owner, which will never disappear. The client's ZooKeeper session has to be closed to remove such a hanging ephemeral node.

2. The application thread just hangs. From what I observed in gdb, it is waiting for a synchronous operation to complete (function wait_sync_completion).

Is there a way to avoid these problems? Are we doing something wrong, or should we file a bug report? Is any of you using ZooKeeper as a distributed locking service with more success?

Help is really appreciated.

PS. To compile the code snippet use:
g++ credel.cc -o credel -pedantic -lzookeeper_mt

--
Łukasz Osipiuk
mailto:luk...@osipiuk.net
#include <string>
#include <zookeeper/zookeeper.h>
#include <stdio.h>   // printf
#include <stdlib.h>  // exit
#include <unistd.h>  // sleep
#include <assert.h>

int main() {
    static std::string HOSTS = "zookeeper-cluster-1.atm:2181,zookeeper-cluster-2.atm:2181,zookeeper-cluster-3.atm:2181";
    static std::string NODE = "/credel";
    int retcode;
    struct Stat stat;
    char name[128];

    // create zhandle
    zhandle_t* zhandle = zookeeper_init(HOSTS.c_str(), NULL, 5000, 0, NULL, 0);
    // just wait for connection for sake of simplicity
    sleep(3);

    for (;;) {
        // cleanup: remove the node left over from the previous iteration
        retcode = zoo_delete(zhandle, NODE.c_str(), -1);
        printf("initial delete; retcode=%d\n", retcode);
        if (retcode != ZOK && retcode != ZNONODE) {
            continue;
        }

        retcode = zoo_create(zhandle, NODE.c_str(), "", 0, &ZOO_OPEN_ACL_UNSAFE, 0, name, 128);
        if (retcode != ZOK && retcode != ZNODEEXISTS) {
            printf("node creation returned error; retcode=%d\n", retcode);
            // check if node exists
            for (;;) {
                retcode = zoo_exists(zhandle, NODE.c_str(), 0, &stat);
                if (retcode == ZOK) {
                    printf("ERROR create returned error but node exists\n");
                    exit(1);
                }
                if (retcode != ZNONODE) {
                    printf("ERROR while checking if node exists, retrying; retcode=%d\n", retcode);
                    continue;
                }
                assert(retcode == ZNONODE); // ok; there is no break, so the loop checks again
            }
        }
    }
}
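
For case 1, one possible check is sketched below. It is only a sketch under assumptions, not something taken from our application: it assumes the lock node name starts with a tag that is unique per client (for example one built from the session id), so that after a failed create the client can list the children of the lock directory and look for its own tag. The helper names outcome_unknown and find_own_node are made up for illustration and are not part of the ZooKeeper C API. The two numeric codes are assumed to match ZCONNECTIONLOSS (-4) and ZOPERATIONTIMEOUT (-7) from zookeeper.h.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of the ZooKeeper C API.
// Returns true for return codes after which the client cannot tell
// whether the server applied the request. The literals are assumed to
// mirror ZCONNECTIONLOSS (-4) and ZOPERATIONTIMEOUT (-7) in zookeeper.h.
bool outcome_unknown(int rc) {
    return rc == -4 || rc == -7;
}

// Hypothetical helper, not part of the ZooKeeper C API.
// Given the child names of the lock directory and the unique tag this
// client put at the start of its node name, returns the matching child
// name, or an empty string if no child starts with the tag.
std::string find_own_node(const std::vector<std::string>& children,
                          const std::string& tag) {
    for (std::vector<std::string>::const_iterator it = children.begin();
         it != children.end(); ++it) {
        // compare() on the first tag.size() characters is a prefix test
        if (it->compare(0, tag.size(), tag) == 0) {
            return *it;
        }
    }
    return "";
}
```

The idea would be to call outcome_unknown() on the zoo_create() return code and, when it is true, fetch the children (for example with zoo_get_children()) and pass their names to find_own_node(). A non-empty result would mean the create did go through despite the error, so the node can be reused or deleted instead of being left hanging.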