I think this is the ticket: https://issues.apache.org/jira/browse/MESOS-2451. It ended up not being a Mesos bug.
--
Jiang Yan Xu <[email protected]> @xujyan <http://twitter.com/xujyan>

On Mon, Mar 16, 2015 at 11:33 AM, Niklas Nielsen <[email protected]> wrote:

> Hi Craig,
>
> I am sorry you guys have been running into trouble with Zookeeper.
> Have you filed a JIRA ticket where we can track the issues you are seeing?
> That is how we track and schedule (human) resources for bug fixing :)
>
> Thanks!
> Niklas
>
> On 4 March 2015 at 13:18, <[email protected]> wrote:
>
>> hi again mesos users and devs,
>> In the prior post i left with description of hanging program with mesos
>> zookeeper c++ api and wondered about enhancement to not wait indefinitely
>> when underlying zookeeper responses don't occur.
>> At that time i thought perhaps the underlying zookeeper and/or its C
>> binding might not be responding up to the mesos api callers.
>> So, while the question is still outstanding, I now see that potentially
>> the hanging issue is with the mesos implementation over zookeeper c binding.
>> In particular i've now tried a similar scenario just with zookeeper c
>> binding api.
>> That is, do zk aget/complete from within a watcher for events for the
>> CHANGED event from a prior aset/complete.
>> i don't see any blocking indefinitely and both the aget and aset
>> completions are invoked and finish.
>>
>> Unless i'm not reproducing this properly, what i determine is a bad
>> behavior from the mesos c++ api.
>> Somehow the mesos c++ zookeeper api implementation is getting itself into
>> pthread condition waits with nothing to notify and break the waits.
>> this seems to occur with get calls from a Watcher on CHANGED events.
>>
>> craig
>>
>> -------- Original Message --------
>> From: [email protected]
>> Apparently from: [email protected]
>> To: [email protected]
>> Subject: mesos c++ zookeeper blocks indefinately -- any plans to enhance?
>> Date: Wed, 4 Mar 2015 10:05:54 -0500
>>
>> > hi mesos users and devs,
>> > We've observed that the mesos 0.22.0-rc1 c++ zookeeper code appears to allow indefinite waits on responses.
>> > This leads to application hangs blocked inside mesos zookeeper calls.
>> > This can happen with a properly running zookeeper presumably able to make all responses.
>> >
>> > Here's how we hung it, for example.
>> > We issue a mesos zk set via
>> >
>> > int ZooKeeper::set ( const std::string & path,
>> >                      const std::string & data,
>> >                      int version
>> >                    )
>> >
>> > then inside a Watcher we process on CHANGED event to issue a mesos zk get on the same path via
>> >
>> > int ZooKeeper::get ( const std::string & path,
>> >                      bool watch,
>> >                      std::string * result,
>> >                      Stat * stat
>> >                    )
>> >
>> > we end up with two threads in the process both in pthread_cond_waits
>> > #0 0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> > #1 0x00007f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0)
>> >     at ../../../3rdparty/libprocess/src/gate.hpp:82
>> > #2 0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...)
>> >     at ../../../3rdparty/libprocess/src/process.cpp:2476
>> > #3 0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
>> >     at ../../../3rdparty/libprocess/src/process.cpp:2958
>> > #4 0x00007f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...)
>> >     at ../../../3rdparty/libprocess/src/latch.cpp:49
>> > #5 0x00007f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040, duration=...)
>> >     at ../../3rdparty/libprocess/include/process/future.hpp:1156
>> > #6 0x00007f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040)
>> >     at ../../3rdparty/libprocess/include/process/future.hpp:1167
>> > #7 0x00007f6664ab1aac in ZooKeeper::set (this=0x803ce0, path="/craig/mo", data=
>> > ...
>> >
>> > and
>> > #0 0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> > #1 0x00007f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0)
>> >     at ../../../3rdparty/libprocess/src/gate.hpp:82
>> > #2 0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...)
>> >     at ../../../3rdparty/libprocess/src/process.cpp:2476
>> > #3 0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
>> >     at ../../../3rdparty/libprocess/src/process.cpp:2958
>> > #4 0x00007f6664e90558 in process::Latch::await (this=0x7f6638000d00, duration=...)
>> >     at ../../../3rdparty/libprocess/src/latch.cpp:49
>> > #5 0x00007f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0, duration=...)
>> >     at ../../3rdparty/libprocess/include/process/future.hpp:1156
>> > #6 0x00007f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0)
>> >     at ../../3rdparty/libprocess/include/process/future.hpp:1167
>> > #7 0x00007f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path="/craig/mo", watch=false,
>> > ....
>> >
>> > So, really we are asking whether the mesos zk c++ api will be enhanced to not block indefinitely when results are beyond a time bound.
>> >
>> > cheers
>> > craig
>> >
>
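For anyone reading the thread later, here is a minimal sketch of the kind of time bound asked about in the original message. It uses only the C++ standard library and is not the Mesos or libprocess API. `callWithTimeout` is a hypothetical helper made up for illustration, and the callable passed to it stands in for a blocking call such as `ZooKeeper::get`. It only bounds how long the caller waits; it does not cancel the underlying call.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper, not part of Mesos or libprocess.
// Runs `call` on a detached thread and waits at most `timeout` for it.
// Returns true and stores the value in *result if it finished in time.
template <typename F>
bool callWithTimeout(F call, std::chrono::milliseconds timeout, int* result)
{
  std::packaged_task<int()> task(std::move(call));
  std::future<int> future = task.get_future();

  // A future obtained from a packaged_task does not block in its
  // destructor (unlike one from std::async), so an early return
  // really does return to the caller.
  std::thread(std::move(task)).detach();

  if (future.wait_for(timeout) != std::future_status::ready) {
    return false;  // Timed out; the detached call keeps running.
  }

  *result = future.get();
  return true;
}
```

A caller would wrap the blocking call in a lambda and treat a `false` return as "no answer within the bound". Since the detached call keeps running after a timeout, anything it captures has to outlive it.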

