Hi Lukasz,

Comments inline:

On 8/28/09 8:24 AM, "Łukasz Osipiuk" <luk...@osipiuk.net> wrote:

> Hi!
>
> My name is Łukasz Osipiuk. I am working for one of the major Polish
> Internet companies.
> In one of our projects we are intensively using Zookeeper as a
> distributed locking system. We implemented a slightly modified locking
> algorithm from the zookeeper docs page.
> (http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_recipes_Locks)
>
> Unfortunately we experience some problems with deadlocks. As I
> examined the problem it appears that either we misuse zookeeper in
> some way or it is buggy. Our app is written in C++ and we are using
> the zookeeper_mt C library.
>
> Tests below are done using server version 3.1.1 and client library
> version 3.2.0, but on production we have both client and server in
> 3.1.1 and experience the same problems.
>
> I attach the code snippet I wrote to isolate our problems. As I run it
> and while it is running randomly kill zookeeper nodes I (from time to
> time) get one of the following behaviors:
>
> 1. the zoo_create() call returns an error but the node is still created
> in zookeeper.
> If such a problem happens in the locking protocol we get a hanging lock
> without an owner which will never disappear. Closing the client zookeeper
> session is needed to remove such a hanging ephemeral node.

This could happen. If you get a CONNECTIONLOSS error on a create, the
create might or might not have happened. Please take a look at
CONNECTIONLOSS handling on our wiki:

http://wiki.apache.org/hadoop/ZooKeeper

(I can't get the direct link since the wiki is down.)

Also take a look at

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperProgrammers.html

for handling CONNECTIONLOSS. We have an open JIRA in which we want to
avoid throwing the CONNECTIONLOSS error, but that will probably be fixed
in 3.3:

http://issues.apache.org/jira/browse/ZOOKEEPER-22

>
> 2. application thread just hangs. From what I observed in gdb it is
> waiting for synchronous operation completion (function
> wait_sync_completion)
>

Are you accessing the ZooKeeper handle from two different threads? Though
the handle is thread safe, you should make sure that you do not call any
zoo APIs after you have called zoo_close() on the handle. We have seen
this kind of hang when one thread was closing the handle while another
thread was calling something like zoo_exists().

> Is there a way to avoid these problems? Are we doing something wrong or
> should we create a bug report?
> Is anyone of you using zookeeper as a distributed locking service with
> more success?
>
> Help is really appreciated.
>
> PS. to compile the code snippet use:
> g++ credel.cc -o credel -pedantic -lzookeeper_mt

Hope this helps.

Thanks
mahadev
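
PS. On item 1, here is a rough sketch of one way to cope with a create whose outcome is unknown after CONNECTIONLOSS. It is not the library's API and not necessarily what the wiki recommends. It assumes the lock node name starts with a prefix unique to this client (for example one derived from the session id), so that after a connection loss you can list the parent's children and see whether the node is already there before creating it again. The names create_lock_node, CreateFn and ListFn and the Rc constants are made up for illustration; in a real client the two hooks would wrap zoo_create() and zoo_get_children(), and the return codes would come from zookeeper.h.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-ins for the client library's return codes (assumed values for this
// sketch; real code should use the constants from zookeeper.h).
enum Rc { RC_OK = 0, RC_CONNECTIONLOSS = -4 };

// Hypothetical hooks. CreateFn creates a node whose name begins with `prefix`
// and stores the created name; ListFn lists the children of the lock parent.
using CreateFn = std::function<int(const std::string& prefix, std::string* created)>;
using ListFn = std::function<int(std::vector<std::string>* children)>;

// Create a lock node, treating CONNECTIONLOSS as "outcome unknown": before
// creating again, list the children and look for a node carrying our prefix.
// Returns RC_CONNECTIONLOSS if the outcome is still unknown after
// max_attempts; other error codes from the create hook are passed through.
int create_lock_node(const std::string& prefix, const CreateFn& create,
                     const ListFn& list, int max_attempts, std::string* created)
{
    assert(!prefix.empty() && created != nullptr);
    bool outcome_unknown = false;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (outcome_unknown) {
            std::vector<std::string> children;
            if (list(&children) != RC_OK) {
                continue;  // still unknown; do not create again blindly
            }
            for (const std::string& child : children) {
                if (child.compare(0, prefix.size(), prefix) == 0) {
                    *created = child;  // the earlier create did go through
                    return RC_OK;
                }
            }
            outcome_unknown = false;  // confirmed absent, safe to create
        }
        int rc = create(prefix, created);
        if (rc != RC_CONNECTIONLOSS) {
            return rc;  // success, or an error this sketch does not retry
        }
        outcome_unknown = true;
    }
    return RC_CONNECTIONLOSS;
}
```

The point of checking before retrying is that a blind second create would leave two nodes behind, one of which nobody knows about, which is the ownerless lock described in item 1.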