Re: c client - problem with failover

Mahadev Konar Fri, 28 Aug 2009 13:24:54 -0700

Hi Lukasz,
  Comments in line:


On 8/28/09 8:24 AM, "Łukasz Osipiuk" <luk...@osipiuk.net> wrote:

> Hi!
> 
> My name is Łukasz Osipiuk. I am working for one of the major Polish
> Internet companies.
> In one of our projects we are intensively using ZooKeeper as a
> distributed locking system. We implemented a slightly modified locking
> algorithm
> from the ZooKeeper docs page.
> 
> (http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_recipes_Locks)
> 
> Unfortunately we experience some problems with deadlocks. As I
> examined the problem it appears that either we misuse zookeeper in
> some way
> or it is buggy. Our app is written in C++ and we are using
> zookeeper_mt C library.
> 
> The tests below were done using server version 3.1.1 and client library
> version 3.2.0, but in production we have both client and server at
> 3.1.1 and experience the same problems.
> 
> I attach the code snippet I wrote to isolate our problems. When I run it
> and randomly kill ZooKeeper nodes while it is running, I (from time to
> time) get one of the following behaviors:
> 
> 1. The zoo_create() call returns an error but the node is still created
>     in ZooKeeper. If this happens in the locking protocol we get a hanging
>     lock without an owner, which will never disappear. Closing the client's
>     ZooKeeper session is needed to remove such a hanging ephemeral node.
This could happen. If you get a CONNECTIONLOSS error on a create, then the
create might or might not have happened. Please take a look at
CONNECTIONLOSS handling on our wiki
http://wiki.apache.org/hadoop/ZooKeeper (I can't get to the direct link since
the wiki is down).

Also, take a look at

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperProgrammers.html

for handling CONNECTIONLOSS. We have an open JIRA in which we want to avoid
throwing the CONNECTIONLOSS error, but that will probably be fixed in 3.3:
http://issues.apache.org/jira/browse/ZOOKEEPER-22
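
To make the "might or might not have happened" case concrete, here is a
rough sketch of one common workaround, not code from the client library:
put a unique tag in the node name, and after a connection loss list the
children of the lock directory to see whether your node is already there
before creating again. The Rc enum, CreateFn, ListFn, find_own_node and
create_once are all hypothetical names for illustration; in a real program
the two callbacks would wrap zoo_create() and zoo_get_children() and map
their return codes. A listing served by a server that is lagging behind
could still miss the node, so treat this as a starting point only.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result codes standing in for the client's OK,
// CONNECTIONLOSS and "any other error" return values.
enum class Rc { Ok, ConnectionLoss, OtherError };

// Hypothetical stand-ins for wrappers around zoo_create() and
// zoo_get_children() on the lock directory.
using CreateFn = std::function<Rc(const std::string& name_prefix,
                                  std::string& created_name)>;
using ListFn = std::function<Rc(std::vector<std::string>& children)>;

// Returns the child whose name starts with "<tag>-", or "" if there is none.
std::string find_own_node(const std::vector<std::string>& children,
                          const std::string& tag) {
    assert(!tag.empty());
    const std::string prefix = tag + "-";
    for (const auto& child : children) {
        if (child.compare(0, prefix.size(), prefix) == 0) return child;
    }
    return "";
}

// Creates a node named "<tag>-<suffix>" without creating it twice when the
// outcome of an earlier attempt is unknown because of a connection loss.
Rc create_once(const std::string& tag, int max_attempts,
               const CreateFn& create, const ListFn& list,
               std::string& out_name) {
    bool outcome_unknown = false;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (outcome_unknown) {
            // Check whether the earlier create went through before retrying.
            std::vector<std::string> children;
            Rc list_rc = list(children);
            if (list_rc == Rc::ConnectionLoss) continue;  // still disconnected
            if (list_rc != Rc::Ok) return list_rc;
            std::string mine = find_own_node(children, tag);
            if (!mine.empty()) {
                out_name = mine;  // the "failed" create actually succeeded
                return Rc::Ok;
            }
            outcome_unknown = false;  // node is absent, safe to create again
        }
        Rc rc = create(tag + "-", out_name);
        if (rc != Rc::ConnectionLoss) return rc;  // success or a hard error
        outcome_unknown = true;
    }
    return Rc::ConnectionLoss;
}
```

The tag has to be unique per client and per lock attempt (for example
something derived from the session id), otherwise two clients could each
mistake the other's node for their own.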
> 
> 2. The application thread just hangs. From what I observed in gdb it is
> waiting for synchronous operation completion (function
> wait_sync_completion).
> 
Are you accessing the ZooKeeper handle from 2 different threads? Although the
handle is thread safe, you should make sure that you do not call zoo_*
APIs after you have called zookeeper_close() on the handle. We have seen this
kind of hanging problem where one thread was closing the handle and the
other thread was calling something like zoo_exists().
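
One way to rule this out is to funnel every use of the handle through a
small guard that refuses calls once close has been requested. The sketch
below is only an illustration: GuardedHandle, use() and close() are
hypothetical names, not part of the C client. It uses a plain mutex, so
calls are serialized and close() waits for a call that is in progress; a
reader/writer lock would allow concurrent calls.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical wrapper: guarantees that no call reaches the handle after
// it has been closed, and that the handle is closed at most once.
template <typename Handle>
class GuardedHandle {
public:
    GuardedHandle(Handle handle, std::function<void(Handle)> closer)
        : handle_(handle), closer_(std::move(closer)) {
        assert(static_cast<bool>(closer_));
    }

    // Runs fn(handle) if the handle is still open and returns true;
    // returns false without calling fn if it has already been closed.
    bool use(const std::function<void(Handle)>& fn) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (closed_) return false;
        fn(handle_);
        return true;
    }

    // Closes the handle; later calls to close() or use() do nothing.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (closed_) return;
        closed_ = true;
        closer_(handle_);
    }

private:
    std::mutex mutex_;
    Handle handle_;
    std::function<void(Handle)> closer_;
    bool closed_ = false;
};
```

With the C client this would be instantiated as GuardedHandle<zhandle_t*>,
with a closer that calls zookeeper_close() and call sites that run
zoo_exists() and friends inside use().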


> Is there a way to avoid these problems? Are we doing something wrong or
> should we create a bug report?
> Is any of you using ZooKeeper as a distributed locking service with
> more success?
> 
> Help is really appreciated.
> 
> PS. to compile code snippet use:
> g++ credel.cc -o credel -pedantic -lzookeeper_mt

Hope this helps.

Thanks
mahadev
