Re: c client - problem with failover
Hi Lukasz,
Comments inline:

On 8/28/09 8:24 AM, "Łukasz Osipiuk" wrote:

> Hi!
>
> My name is Łukasz Osipiuk. I am working for one of the major Polish
> Internet companies.
> In one of our projects we use ZooKeeper intensively as a
> distributed locking system. We implemented a slightly modified version
> of the locking algorithm from the ZooKeeper docs page.
> (http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_recipes_Locks)
>
> Unfortunately we are experiencing some problems with deadlocks. As I
> examined the problem, it appears that either we misuse ZooKeeper in
> some way or it is buggy. Our app is written in C++ and we are using
> the zookeeper_mt C library.
>
> The tests below were done using server version 3.1.1 and client library
> version 3.2.0, but in production we have both client and server at
> 3.1.1 and experience the same problems.
>
> I attach the code snippet I wrote to isolate our problems. When I run it
> and, while it is running, randomly kill ZooKeeper nodes, I (from time to
> time) get one of the following behaviors:
>
> 1. The zoo_create() call returns an error but the node is still created
> in ZooKeeper.
> If this happens in the locking protocol, we get a hanging lock
> without an owner which will never disappear. Closing the client's
> ZooKeeper session is needed to remove such a hanging ephemeral node.

This could happen. If you get a CONNECTIONLOSS error on a create, then the
create might or might not have happened. Please take a look at CONNECTIONLOSS
handling on our wiki:

http://wiki.apache.org/hadoop/ZooKeeper

(I can't get to the direct link since the wiki is down.)

Also, take a look at

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperProgrammers.html

for handling CONNECTIONLOSS. We have an open JIRA in which we want to avoid
throwing the CONNECTIONLOSS error, but that will probably be fixed in 3.3:

http://issues.apache.org/jira/browse/ZOOKEEPER-22

>
> 2. The application thread just hangs. From what I observed in gdb, it is
> waiting for synchronous operation completion (function
> wait_sync_completion).
>

Are you accessing the ZooKeeper handle from two different threads? Though the
handle is thread safe, you should make sure that you do not call zoo APIs
after you have called zoo_close() on the handle. We have seen this kind of
hang where one thread was closing the handle while another thread was
calling something like zoo_exists().

> Is there a way to avoid these problems? Are we doing something wrong or
> should we create a bug report?
> Is anyone of you using ZooKeeper as a distributed locking service with
> more success?
>
> Help is really appreciated.
>
> PS. To compile the code snippet use:
> g++ credel.cc -o credel -pedantic -lzookeeper_mt

Hope this helps.

Thanks
mahadev
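To illustrate the advice about not calling zoo APIs after zoo_close(), here is a minimal C++ sketch of one way to enforce it. GuardedHandle, use() and close() are hypothetical names, not part of the ZooKeeper C API. The handle type is a template parameter so that the sketch compiles without the library, and it assumes that serializing all calls on one handle behind a mutex is acceptable.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical wrapper, not part of the ZooKeeper C API.
// Handle stands in for zhandle_t*; the function passed to close()
// stands in for zoo_close().
template <typename Handle>
class GuardedHandle {
public:
    explicit GuardedHandle(Handle handle) : handle_(handle), open_(true) {}

    // Runs op(handle) only while the handle is still open.
    // Returns false, without calling op, once close() has run.
    template <typename Op>
    bool use(Op op) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!open_) {
            return false;
        }
        op(handle_);
        return true;
    }

    // Marks the handle closed and runs close_fn(handle) exactly once.
    // Returns false if the handle was already closed.
    template <typename CloseFn>
    bool close(CloseFn close_fn) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!open_) {
            return false;
        }
        open_ = false;
        close_fn(handle_);
        return true;
    }

private:
    std::mutex mutex_;  // serializes use() and close()
    Handle handle_;
    bool open_;
};
```

With the real library, Handle would be zhandle_t*, the close function would call zoo_close(), and an operation would look like guard.use([&](zhandle_t* zh) { rc = zoo_exists(zh, path, 0, &stat); }). Because a single mutex serializes the calls, close() waits for any call in progress; a reader/writer lock would allow concurrent operations at the cost of a little more code.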
c client - problem with failover
Hi!

My name is Łukasz Osipiuk. I am working for one of the major Polish
Internet companies.
In one of our projects we use ZooKeeper intensively as a distributed locking
system. We implemented a slightly modified version of the locking algorithm
from the ZooKeeper docs page.
(http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_recipes_Locks)

Unfortunately we are experiencing some problems with deadlocks. As I examined
the problem, it appears that either we misuse ZooKeeper in some way or it is
buggy. Our app is written in C++ and we are using the zookeeper_mt C library.

The tests below were done using server version 3.1.1 and client library
version 3.2.0, but in production we have both client and server at 3.1.1 and
experience the same problems.

I attach the code snippet I wrote to isolate our problems. When I run it and,
while it is running, randomly kill ZooKeeper nodes, I (from time to time) get
one of the following behaviors:

1. The zoo_create() call returns an error but the node is still created in
   ZooKeeper. If this happens in the locking protocol, we get a hanging lock
   without an owner which will never disappear. Closing the client's
   ZooKeeper session is needed to remove such a hanging ephemeral node.

2. The application thread just hangs. From what I observed in gdb, it is
   waiting for synchronous operation completion (function
   wait_sync_completion).

Is there a way to avoid these problems? Are we doing something wrong or
should we create a bug report?
Is anyone of you using ZooKeeper as a distributed locking service with more
success?

Help is really appreciated.

PS. To compile the code snippet use:
g++ credel.cc -o credel -pedantic -lzookeeper_mt

--
Łukasz Osipiuk
mailto:luk...@osipiuk.net

// The include targets were stripped in the archived copy; the headers below
// are the ones this code needs. The ZooKeeper header path is an assumption
// and depends on how the C client is installed.
#include <zookeeper/zookeeper.h>
#include <string>
#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <unistd.h>

int main() {
    static std::string HOSTS = "zookeeper-cluster-1.atm:2181,zookeeper-cluster-2.atm:2181,zookeeper-cluster-3.atm:2181";
    static std::string NODE = "/credel";
    int retcode;
    struct Stat stat;
    char name[128];

    // create zhandle
    zhandle_t* zhandle = zookeeper_init(HOSTS.c_str(), NULL, 5000, 0, NULL, 0);

    // just wait for connection for sake of simplicity
    sleep(3);

    for (;;) {
        // cleanup
        retcode = zoo_delete(zhandle, NODE.c_str(), -1);
        printf("initial delete; retcode=%d\n", retcode);
        if (retcode != ZOK && retcode != ZNONODE) {
            continue;
        }

        retcode = zoo_create(zhandle, NODE.c_str(), "", 0, &ZOO_OPEN_ACL_UNSAFE, 0, name, 128);
        if (retcode != ZOK && retcode != ZNODEEXISTS) {
            printf("node creation returned error; retcode=%d\n", retcode);
            // check if node exists
            for (;;) {
                retcode = zoo_exists(zhandle, NODE.c_str(), 0, &stat);
                if (retcode == ZOK) {
                    printf("ERROR create returned error but node exists\n");
                    exit(1);
                }
                if (retcode != ZNONODE) {
                    printf("ERROR while checking if node exists, retrying; retcode=%d\n", retcode);
                    continue;
                }
                assert(retcode == ZNONODE); // ok
            }
        }
    }
}
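The check loop in the program above only detects that a failed create may still have produced a node. For an ephemeral lock node, one possible way to resolve the ambiguity is to compare the node's ephemeralOwner (from the Stat filled in by zoo_exists()) with the client's own session id (zoo_client_id(zh)->client_id). The following is a rough sketch of that decision only, under stated assumptions: classify_create() and CreateOutcome are hypothetical names, the error values are copied from zookeeper.h so that the sketch compiles without the library, and treating an operation timeout like a connection loss is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Error values copied from the ZOO_ERRORS enum in zookeeper.h.
namespace zk {
const int OK = 0;
const int CONNECTIONLOSS = -4;
const int OPERATIONTIMEOUT = -7;
const int NONODE = -101;
const int NODEEXISTS = -110;
}

enum class CreateOutcome {
    Ours,          // the node exists and belongs to our session
    NotVisible,    // no node seen by the follow-up exists call
    OwnedByOther,  // the node exists but another session owns it
    Unknown,       // the follow-up exists call failed too; retry it
    Failed         // create failed with an unambiguous error
};

// Hypothetical helper, not part of the ZooKeeper C API.
// create_rc:       return code of zoo_create()
// exists_rc:       return code of a follow-up zoo_exists()
// ephemeral_owner: Stat.ephemeralOwner reported by zoo_exists()
// session_id:      zoo_client_id(zh)->client_id of this client
CreateOutcome classify_create(int create_rc, int exists_rc,
                              int64_t ephemeral_owner, int64_t session_id) {
    if (create_rc == zk::OK) {
        return CreateOutcome::Ours;
    }
    // Assumption: a timeout is as ambiguous as a connection loss.
    const bool ambiguous = create_rc == zk::CONNECTIONLOSS ||
                           create_rc == zk::OPERATIONTIMEOUT;
    // NODEEXISTS on a retry may mean that our earlier attempt succeeded.
    if (!ambiguous && create_rc != zk::NODEEXISTS) {
        return CreateOutcome::Failed;
    }
    if (exists_rc == zk::NONODE) {
        return CreateOutcome::NotVisible;
    }
    if (exists_rc != zk::OK) {
        return CreateOutcome::Unknown;
    }
    return ephemeral_owner == session_id ? CreateOutcome::Ours
                                         : CreateOutcome::OwnedByOther;
}
```

This only applies to ephemeral nodes, as used in the lock recipe; the test program above creates a persistent node (flags 0), whose ephemeralOwner is 0. A caller would probably retry the exists check on Unknown, and should treat NotVisible with some care, since the sketch does not establish that the create can no longer take effect.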