Comments in line:
On 8/28/09 8:24 AM, "Łukasz Osipiuk" <luk...@osipiuk.net> wrote:
> Hi, my name is Łukasz Osipiuk. I am working for one of the major Polish
> Internet companies.
> In one of our projects we are intensively using Zookeeper as a
> distributed locking system. We implemented a slightly modified version
> of the locking recipe from the zookeeper docs page.
> Unfortunately we experience some problems with deadlocks. As I
> examined the problem it appears that either we misuse zookeeper in
> some way
> or it is buggy. Our app is written in C++ and we are using
> zookeeper_mt C library.
> Tests below are done using server version 3.1.1 and client library
> version 3.2.0, but on production we have both client and server in
> 3.1.1. and experience same problems.
> I attach the code snippet I wrote to isolate our problems. When I run it
> and randomly kill zookeeper nodes while it is running, I (from time to
> time) get one of the following behaviors:
> 1. the zoo_create() call returns an error but the node is still created in zookeeper.
> If such a problem happens in the locking protocol we get a hanging lock
> without an owner which will never disappear. Closing the client zookeeper
> session is needed to remove such a hanging ephemeral node.
This could happen. If you get a CONNECTIONLOSS error on a create then the
create might or might not have happened. Please take a look at
CONNECTIONLOSS handling on our wiki
http://wiki.apache.org/hadoop/ZooKeeper (I can't get to the direct link since
the wiki is down)
Also, take a look at
For handling CONNECTIONLOSS. We have an open JIRA in which we want to avoid
throwing the CONNECTIONLOSS error, but that will probably be fixed in 3.3.
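One way to cope with the "might or might not have happened" case, as a rough sketch only: put a unique id into the lock node's name, and after a CONNECTIONLOSS list the parent to see whether a node with that id already exists before creating again. Everything named here (`CreateFn`, `ListFn`, `create_lock_node`, the `guid` naming scheme) is hypothetical; the two callables stand in for zoo_create() and zoo_get_children() so the logic can be tried without a server, and the two constants mirror ZOK and ZCONNECTIONLOSS from zookeeper.h.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Mirror ZOK and ZCONNECTIONLOSS from zookeeper.h; real code should use the
// header's constants instead of redefining them.
constexpr int kOk = 0;
constexpr int kConnectionLoss = -4;

// Hypothetical stand-ins for zoo_create() (ephemeral + sequence flags) and
// zoo_get_children(), injected so the logic can run without a server.
using CreateFn = std::function<int(const std::string& path_prefix,
                                   std::string& created_path)>;
using ListFn = std::function<int(const std::string& parent,
                                 std::vector<std::string>& children)>;

// Creates a lock node named "<guid>-lock-<sequence>" under `parent`.
// After a CONNECTIONLOSS the outcome of the create is unknown, so the parent
// is listed and searched for our guid before another create is attempted.
int create_lock_node(const std::string& parent, const std::string& guid,
                     const CreateFn& create, const ListFn& list,
                     int max_attempts, std::string& out_path) {
    assert(!guid.empty() && max_attempts > 0);
    const std::string name_prefix = guid + "-lock-";
    bool outcome_unknown = false;

    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (outcome_unknown) {
            std::vector<std::string> children;
            const int rc = list(parent, children);
            if (rc == kConnectionLoss) continue;  // still unknown, check again
            if (rc != kOk) return rc;             // definite failure
            for (const std::string& child : children) {
                if (child.compare(0, name_prefix.size(), name_prefix) == 0) {
                    out_path = parent + "/" + child;
                    return kOk;                   // earlier create did succeed
                }
            }
            outcome_unknown = false;              // earlier create never landed
        }
        const int rc = create(parent + "/" + name_prefix, out_path);
        if (rc != kConnectionLoss) return rc;     // success or definite failure
        outcome_unknown = true;
    }
    return kConnectionLoss;  // out of attempts; the outcome may still be unknown
}
```

A real client would also wait for the session to reconnect between attempts, and would give up on the old node once the session has expired, since ephemeral nodes go away with the session.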
> 2. the application thread just hangs. From what I observed in gdb it is
> waiting for synchronous operation completion (function
Are you accessing the zookeeper handle via 2 different threads? Though the
handle is thread safe, you should make sure that you do not call zoo
APIs after you have called zookeeper_close() on the handle. We have seen this
kind of hanging problem wherein one thread was closing the handle and the
other thread was calling something like zoo_exists().
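If the handle really is shared between threads, one possible way to enforce "no calls after close" is a small guard around it. This is only a sketch: `GuardedHandle` and `kHandleClosed` are made-up names, not part of the ZooKeeper client, and the handle type is a template parameter so the example does not depend on zookeeper.h. Calls are counted while in flight, and close waits for them to finish and rejects anything that arrives afterwards.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical return code for "handle already closed"; not a ZooKeeper value.
constexpr int kHandleClosed = -1000;

// Wraps a shared handle so that no API call can start once close() has begun,
// and close() only runs after all calls already in flight have returned.
template <typename Handle>
class GuardedHandle {
public:
    explicit GuardedHandle(Handle* handle) : handle_(handle) {
        assert(handle != nullptr);
    }

    // Runs fn(handle) unless the handle is closing or closed.
    template <typename Fn>
    int call(Fn fn) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            if (closing_) return kHandleClosed;
            ++in_flight_;
        }
        const int rc = fn(handle_);  // fn is assumed not to throw
        {
            std::lock_guard<std::mutex> lock(mu_);
            --in_flight_;
        }
        idle_.notify_all();
        return rc;
    }

    // Blocks new calls, waits for in-flight ones, then runs close_fn once.
    template <typename CloseFn>
    int close(CloseFn close_fn) {
        std::unique_lock<std::mutex> lock(mu_);
        if (closing_) return kHandleClosed;
        closing_ = true;
        idle_.wait(lock, [this] { return in_flight_ == 0; });
        lock.unlock();
        return close_fn(handle_);
    }

private:
    std::mutex mu_;
    std::condition_variable idle_;
    Handle* handle_;
    int in_flight_ = 0;
    bool closing_ = false;
};
```

With the real client the idea would be to wrap the zhandle_t pointer, route calls such as zoo_exists() through call(), and pass the library's close function to close(). Note that close() waits for calls already in flight, so a synchronous call that never returns would still block it.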
> Is there a way to avoid these problems? Are we doing something wrong or
> should we create a bug report?
> Is any one of you using zookeeper as a distributed locking service with
> more success?
> Help is really appreciated.
> PS. to compile code snippet use:
> g++ credel.cc -o credel -pedantic -lzookeeper_mt
Hope this helps.