Re: c client - problem with failover

2009-08-28 Thread Mahadev Konar
Hi Lukasz,
  Comments in line:


On 8/28/09 8:24 AM, "Łukasz Osipiuk"  wrote:

> Hi!
> 
> My name is Łukasz Osipiuk. I am working for one of the major Polish
> Internet companies.
> In one of our projects we are intensively using ZooKeeper as a
> distributed locking system. We implemented a slightly modified version
> of the locking algorithm from the ZooKeeper docs page
> (http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_recipes_Locks).
> 
> Unfortunately we experience some problems with deadlocks. As I
> examined the problem, it appears that either we misuse ZooKeeper in
> some way or it is buggy. Our app is written in C++ and we are using
> the zookeeper_mt C library.
> 
> The tests below were done using server version 3.1.1 and client library
> version 3.2.0, but in production we have both client and server at
> 3.1.1 and experience the same problems.
> 
> I attach the code snippet I wrote to isolate our problems. When I run it
> and randomly kill ZooKeeper nodes while it is running, I (from time to
> time) get one of the following behaviors:
> 
> 1. The zoo_create() call returns an error, but the node is still created
> in ZooKeeper. If this happens in the locking protocol, we get a hanging
> lock without an owner which will never disappear. Closing the client's
> ZooKeeper session is needed to remove such a hanging ephemeral node.
This could happen. If you get a CONNECTIONLOSS error on a create, then the
create might or might not have happened. Please take a look at the
CONNECTIONLOSS handling notes on our wiki,
http://wiki.apache.org/hadoop/ZooKeeper (I can't get the direct link since
the wiki is down).

Also, take a look at

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperProgrammers.html

for handling CONNECTIONLOSS. We have an open JIRA in which we want to avoid
throwing the CONNECTIONLOSS error, but that will probably be fixed in 3.3:
http://issues.apache.org/jira/browse/ZOOKEEPER-22
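
As a rough illustration of the CONNECTIONLOSS point, here is a minimal C++ sketch of one way to resolve an ambiguous create on a fixed path: after a connection loss, look at the node and compare its ephemeral owner with your own session id. All names in it (Rc, CreateOutcome, ExistsResult, create_resolving_connection_loss) are hypothetical and not part of the ZooKeeper API. The two callables stand in for zoo_create() and zoo_exists() so the logic compiles without the library, and the numeric codes are assumed to mirror ZOK, ZCONNECTIONLOSS, ZNONODE and ZNODEEXISTS from zookeeper.h.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Stand-ins (assumption) for the C client's return codes.
enum Rc { RC_OK = 0, RC_CONNECTION_LOSS = -4, RC_NO_NODE = -101, RC_NODE_EXISTS = -110 };

enum class CreateOutcome { Created, OwnedByUs, OwnedByOther, Failed };

// Hypothetical result of an exists() call: return code plus the node's ephemeral owner.
struct ExistsResult {
  int rc;
  int64_t ephemeral_owner;
};

// Tries to create the node. When the outcome is unknown (connection loss) or the
// node is already there, asks who owns it instead of assuming the create failed.
CreateOutcome create_resolving_connection_loss(
    const std::function<int()>& try_create,
    const std::function<ExistsResult()>& try_exists,
    int64_t my_session_id,
    int max_attempts) {
  assert(max_attempts > 0);
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    int rc = try_create();
    if (rc == RC_OK) return CreateOutcome::Created;
    if (rc != RC_CONNECTION_LOSS && rc != RC_NODE_EXISTS) return CreateOutcome::Failed;

    ExistsResult st = try_exists();
    if (st.rc == RC_OK) {
      // The node is there: it is ours only if our session owns it.
      return st.ephemeral_owner == my_session_id ? CreateOutcome::OwnedByUs
                                                 : CreateOutcome::OwnedByOther;
    }
    if (st.rc == RC_NO_NODE) continue;  // the create did not go through: retry it
    if (st.rc != RC_CONNECTION_LOSS) return CreateOutcome::Failed;
    // Still disconnected: loop and try again.
  }
  return CreateOutcome::Failed;
}
```

In a real client, try_create would wrap zoo_create() with the ZOO_EPHEMERAL flag, try_exists would wrap zoo_exists() and copy stat.ephemeralOwner, and my_session_id would come from zoo_client_id(zh)->client_id. Sequential lock nodes need a different check (list the children and look for one owned by your session), since the node name is not known in advance.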
> 
> 2. The application thread just hangs. From what I observed in gdb, it is
> waiting for synchronous operation completion (function
> wait_sync_completion).
> 
Are you accessing the zookeeper handle from two different threads? Though the
handle is thread safe, you should make sure that you do not call zoo_*
APIs after you have called zookeeper_close() on the handle. We have seen this
kind of hang where one thread was closing the handle and another thread was
calling something like zoo_exists().
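
One way to rule out that race is to route every use of the handle through a small wrapper that refuses calls once the handle is closed. The following is only a sketch under assumptions: GuardedHandle and with_handle are hypothetical names, not part of the ZooKeeper client, and the wrapper is written as a template so it compiles without the library.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical wrapper: serializes access to a handle and blocks use after close.
template <typename Handle>
class GuardedHandle {
 public:
  GuardedHandle(Handle* handle, std::function<void(Handle*)> closer)
      : handle_(handle), closer_(std::move(closer)) {
    assert(closer_ != nullptr);
  }
  ~GuardedHandle() { close(); }

  GuardedHandle(const GuardedHandle&) = delete;
  GuardedHandle& operator=(const GuardedHandle&) = delete;

  // Runs op(handle) and returns true while the handle is open.
  // After close() it returns false and never calls op.
  bool with_handle(const std::function<void(Handle*)>& op) {
    std::lock_guard<std::mutex> lock(mu_);
    if (handle_ == nullptr) return false;
    op(handle_);
    return true;
  }

  // Closes the handle exactly once; later calls are no-ops.
  void close() {
    std::lock_guard<std::mutex> lock(mu_);
    if (handle_ == nullptr) return;
    closer_(handle_);
    handle_ = nullptr;
  }

 private:
  std::mutex mu_;
  Handle* handle_;
  std::function<void(Handle*)> closer_;
};
```

With the C client this would be used roughly as GuardedHandle<zhandle_t> zk(zh, [](zhandle_t* h) { zookeeper_close(h); }); followed by zk.with_handle([&](zhandle_t* h) { rc = zoo_exists(h, path, 0, &stat); });. The trade-off is that the mutex is held for the whole synchronous call, so calls are serialized and close() waits for any call in flight. Calling close() from inside op would deadlock, since the mutex is not recursive.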


> Is there a way to avoid these problems? Are we doing something wrong, or
> should we create a bug report?
> Is any of you using ZooKeeper as a distributed locking service with
> more success?
> 
> Help is really appreciated.
> 
> PS. to compile code snippet use:
> g++ credel.cc -o credel -pedantic -lzookeeper_mt

Hope this helps.

Thanks
mahadev



c client - problem with failover

2009-08-28 Thread Łukasz Osipiuk
Hi!

My name is Łukasz Osipiuk. I am working for one of the major Polish
Internet companies.
In one of our projects we are intensively using ZooKeeper as a
distributed locking system. We implemented a slightly modified version
of the locking algorithm from the ZooKeeper docs page
(http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_recipes_Locks).

Unfortunately we experience some problems with deadlocks. As I
examined the problem, it appears that either we misuse ZooKeeper in
some way or it is buggy. Our app is written in C++ and we are using
the zookeeper_mt C library.

The tests below were done using server version 3.1.1 and client library
version 3.2.0, but in production we have both client and server at
3.1.1 and experience the same problems.

I attach the code snippet I wrote to isolate our problems. When I run it
and randomly kill ZooKeeper nodes while it is running, I (from time to
time) get one of the following behaviors:

1. The zoo_create() call returns an error, but the node is still created
in ZooKeeper. If this happens in the locking protocol, we get a hanging
lock without an owner which will never disappear. Closing the client's
ZooKeeper session is needed to remove such a hanging ephemeral node.

2. The application thread just hangs. From what I observed in gdb, it is
waiting for synchronous operation completion (function
wait_sync_completion).

Is there a way to avoid these problems? Are we doing something wrong, or
should we create a bug report?
Is any of you using ZooKeeper as a distributed locking service with
more success?

Help is really appreciated.

PS. to compile code snippet use:
g++ credel.cc -o credel -pedantic -lzookeeper_mt

-- 
Łukasz Osipiuk
mailto:luk...@osipiuk.net
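#include <zookeeper/zookeeper.h>  // assumed install path; may be <zookeeper.h> on some setups
#include <string>
#include <cstdio>    // printf
#include <cstdlib>   // exit
#include <cassert>   // assert
#include <unistd.h>  // sleep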


int main() {
  static std::string HOSTS="zookeeper-cluster-1.atm:2181,zookeeper-cluster-2.atm:2181,zookeeper-cluster-3.atm:2181";
  static std::string NODE="/credel";
  int retcode;
  struct Stat stat;
  char name[128];
  // create zhandle
  zhandle_t* zhandle = zookeeper_init(HOSTS.c_str(), NULL, 5000, 0, NULL, 0);

  // just wait for connection for sake of simplicity
  sleep(3);

  for(;;) {

    // cleanup: remove the node left over from the previous round
    retcode = zoo_delete(zhandle, NODE.c_str(), -1);
    printf("initial delete; retcode=%d\n", retcode);
    if (retcode != ZOK && retcode != ZNONODE) {
      continue;
    }

    retcode = zoo_create(zhandle, NODE.c_str(), "", 0, &ZOO_OPEN_ACL_UNSAFE, 0, name, 128);
    if (retcode != ZOK && retcode != ZNODEEXISTS) {
      printf("node creation returned error; retcode=%d\n", retcode);

      // check if node exists even though create reported an error
      for(;;) {
        retcode = zoo_exists(zhandle, NODE.c_str(), 0, &stat);

        if (retcode == ZOK) {
          printf("ERROR create returned error but node exists\n");
          exit(1);
        }
        if (retcode != ZNONODE) {
          printf("ERROR while checking if node exists, retrying; retcode=%d\n", retcode);
          continue;
        }
        assert(retcode == ZNONODE); // ok
        break; // node is absent, so go on to the next delete/create round
      }
    }
  }
}