Hi again Mesos users and devs,

In my prior post I described a program hanging in the Mesos ZooKeeper C++ API and asked whether the API could be enhanced so that it does not wait indefinitely when the underlying ZooKeeper responses do not arrive. At that time I thought that perhaps the underlying ZooKeeper and/or its C binding was not responding up to the Mesos API callers.

That question is still outstanding, but I now see that the hang is potentially in the Mesos implementation over the ZooKeeper C binding. In particular, I have now tried a similar scenario using just the ZooKeeper C binding API: issue a zk aget with its completion from within a watcher, on the CHANGED event produced by a prior aset with its completion. I don't see any indefinite blocking, and both the aget and aset completions are invoked and finish.

Unless I'm not reproducing this properly, what I'm seeing is bad behavior in the Mesos C++ API. Somehow the Mesos C++ ZooKeeper API implementation gets itself into pthread condition waits with nothing to notify and break the waits. This seems to occur with get calls made from a Watcher on CHANGED events.

craig

-------- Original Message --------
From: [email protected]
Apparently from: [email protected]
To: [email protected]
Subject: mesos c++ zookeeper blocks indefinately -- any plans to enhance?
Date: Wed, 4 Mar 2015 10:05:54 -0500

> hi mesos users and devs,
> We've observed that the mesos 0.22.0-rc1 c++ zookeeper code appears to
> allow indefinite waits on responses.
> This leads to application hangs blocked inside mesos zookeeper calls.
> This can happen with a properly running zookeeper presumably able to make all
> responses.
>
> Here's how we hung it, for example.
> We issue a mesos zk set via
>
> int ZooKeeper::set ( const std::string & path,
>                      const std::string & data,
>                      int version
>                    )
>
> then inside a Watcher we process the CHANGED event to issue a mesos zk get on
> the same path via
>
> int ZooKeeper::get ( const std::string & path,
>                      bool watch,
>                      std::string * result,
>                      Stat * stat
>                    )
>
> we end up with two threads in the process, both in pthread_cond_waits
> #0 0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1 0x00007f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0)
>    at ../../../3rdparty/libprocess/src/gate.hpp:82
> #2 0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0,
> pid=...)
>    at ../../../3rdparty/libprocess/src/process.cpp:2476
> #3 0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
>    at ../../../3rdparty/libprocess/src/process.cpp:2958
> #4 0x00007f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...)
>    at ../../../3rdparty/libprocess/src/latch.cpp:49
> #5 0x00007f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040,
> duration=...)
>    at ../../3rdparty/libprocess/include/process/future.hpp:1156
> #6 0x00007f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040)
>    at ../../3rdparty/libprocess/include/process/future.hpp:1167
> #7 0x00007f6664ab1aac in ZooKeeper::set (this=0x803ce0, path="/craig/mo",
> data=
> ...
>
> and
> #0 0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1 0x00007f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0)
>    at ../../../3rdparty/libprocess/src/gate.hpp:82
> #2 0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0,
> pid=...)
>    at ../../../3rdparty/libprocess/src/process.cpp:2476
> #3 0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
>    at ../../../3rdparty/libprocess/src/process.cpp:2958
> #4 0x00007f6664e90558 in process::Latch::await (this=0x7f6638000d00,
> duration=...)
>    at ../../../3rdparty/libprocess/src/latch.cpp:49
> #5 0x00007f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0,
> duration=...)
>    at ../../3rdparty/libprocess/include/process/future.hpp:1156
> #6 0x00007f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0)
>    at ../../3rdparty/libprocess/include/process/future.hpp:1167
> #7 0x00007f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path="/craig/mo",
> watch=false,
> ....
>
> So, really we are asking whether the mesos zk c++ api will be enhanced to not
> block indefinitely when results are beyond a time bound.
>
> cheers
> craig
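
For illustration only, here is a minimal sketch of the kind of time-bounded wait asked about above. It uses only standard C++17 (std::future and std::optional), not the Mesos, libprocess, or ZooKeeper APIs. The helper name get_with_timeout is hypothetical and does not exist in any of those libraries. The idea is that the caller waits up to a bound and then gets control back, instead of sitting in a condition wait that nothing ever signals.

```cpp
#include <chrono>
#include <future>
#include <optional>

// Hypothetical helper (an assumption for illustration, not a Mesos API):
// wait for a pending result, but give up once `timeout` has elapsed
// instead of blocking forever.
template <typename T>
std::optional<T> get_with_timeout(std::future<T>& result,
                                  std::chrono::milliseconds timeout) {
  // wait_for returns as soon as the result is ready, or when the
  // timeout expires, whichever happens first.
  if (result.wait_for(timeout) == std::future_status::ready) {
    return result.get();
  }
  // Not ready within the bound: hand control back so the caller can
  // retry, log, or report an error.
  return std::nullopt;
}
```

A caller would pass the future for a pending operation together with a bound, for example std::chrono::milliseconds(500), and treat an empty optional as "no response within the time bound". Whether the Mesos ZooKeeper wrapper could expose something like this is exactly the open question in this thread.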

