Hi again Mesos users and devs,
In my prior post I described a program hanging inside the Mesos ZooKeeper C++
API, and asked whether the API could be enhanced so that it does not wait
indefinitely when the underlying ZooKeeper responses don't arrive.
At that time I thought perhaps the underlying ZooKeeper and/or its C binding
was failing to deliver responses up to the Mesos API callers.
While that question is still outstanding, I now see that the hang is
potentially in the Mesos implementation over the ZooKeeper C binding.
In particular, I have now tried a similar scenario using only the ZooKeeper C
binding API.
That is, issue a zk aget (with completion) from within a watcher handling the
CHANGED event triggered by a prior aset (with completion).
I don't see any indefinite blocking; both the aget and aset completions are
invoked and finish.

Unless I'm not reproducing this properly, I conclude that this is bad behavior
in the Mesos C++ API.
Somehow the Mesos C++ ZooKeeper API implementation gets itself into
pthread condition waits with nothing left to notify and break the waits.
This seems to occur with get calls made from a Watcher on CHANGED events.
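
As a generic illustration only (an assumption, not a diagnosis of the Mesos
code): one way a get issued from inside a callback can wait with nothing left
to notify it is when the callback runs on the only thread that could complete
the get. The sketch below uses plain C++11; Dispatcher and nestedCallStalls are
hypothetical names, and a bounded wait_for stands in for the blocking call so
that the program terminates.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical single-threaded dispatcher: runs queued tasks one at a time.
class Dispatcher {
public:
  Dispatcher() : worker_([this] { run(); }) {}

  ~Dispatcher() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Queue a task and return a future for its result.
  std::future<int> post(std::function<int()> task) {
    auto p = std::make_shared<std::packaged_task<int()>>(std::move(task));
    std::future<int> f = p->get_future();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push([p] { (*p)(); });
    }
    cv_.notify_one();
    return f;
  }

private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // shutting down and nothing left to run
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool done_ = false;
  std::thread worker_;  // declared last so the other members exist first
};

// Returns true if a nested blocking call made from inside a task timed out,
// i.e. an unbounded wait at that point would never have returned.
bool nestedCallStalls(std::chrono::milliseconds bound) {
  Dispatcher d;
  std::future<int> outer = d.post([&d, bound] {
    // Stand-in for a watcher callback issuing a blocking get().
    std::future<int> inner = d.post([] { return 0; });
    // The inner task cannot start until this task returns, because this
    // task is occupying the only worker thread.
    return inner.wait_for(bound) == std::future_status::timeout ? 1 : 0;
  });
  return outer.get() == 1;
}
```

Calling nestedCallStalls(std::chrono::milliseconds(100)) reports the stall
instead of hanging; replacing wait_for with an unbounded get() at the same
point would block that thread for good.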

craig




-------- Original Message --------
From: [email protected]
Apparently from: [email protected]
To: [email protected]
Subject: mesos c++ zookeeper blocks indefinately -- any plans to enhance?
Date: Wed, 4 Mar 2015 10:05:54 -0500

> Hi Mesos users and devs,
> We've observed that the Mesos 0.22.0-rc1 C++ ZooKeeper code appears to
> allow indefinite waits on responses.
> This leads to application hangs blocked inside Mesos ZooKeeper calls.
> This can happen with a properly running ZooKeeper that is presumably able to
> send all responses.
> 
> Here's how we hung it, for example.
> We issue a Mesos zk set via
> 
> int ZooKeeper::set(const std::string& path,
>                    const std::string& data,
>                    int version)
> 
> Then, inside a Watcher, we handle the CHANGED event by issuing a Mesos zk get
> on the same path via
> 
> int ZooKeeper::get(const std::string& path,
>                    bool watch,
>                    std::string* result,
>                    Stat* stat)
> 
> We end up with two threads in the process, both in pthread_cond_wait:
> #0  0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0)
>     at ../../../3rdparty/libprocess/src/gate.hpp:82
> #2  0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2476
> #3  0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2958
> #4  0x00007f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...)
>     at ../../../3rdparty/libprocess/src/latch.cpp:49
> #5  0x00007f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040, duration=...)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1156
> #6  0x00007f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1167
> #7  0x00007f6664ab1aac in ZooKeeper::set (this=0x803ce0, path="/craig/mo", data=
> ...
> 
> and
> #0  0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0)
>     at ../../../3rdparty/libprocess/src/gate.hpp:82
> #2  0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2476
> #3  0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
>     at ../../../3rdparty/libprocess/src/process.cpp:2958
> #4  0x00007f6664e90558 in process::Latch::await (this=0x7f6638000d00, duration=...)
>     at ../../../3rdparty/libprocess/src/latch.cpp:49
> #5  0x00007f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0, duration=...)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1156
> #6  0x00007f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0)
>     at ../../3rdparty/libprocess/include/process/future.hpp:1167
> #7  0x00007f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path="/craig/mo", watch=false,
> ....
> 
> So, really, we are asking whether the Mesos zk C++ API will be enhanced so
> that it does not block indefinitely when results don't arrive within a time
> bound.
> 
> cheers
> craig
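
To make the request in the original message concrete, here is a minimal sketch
of a time-bounded wait. It uses std::future from standard C++11, not the
libprocess Future that appears in the traces, and getWithin is a hypothetical
helper name rather than anything in the Mesos API.

```cpp
#include <chrono>
#include <future>

// Hypothetical helper: wait at most `bound` for `result`.
// Returns true and stores the value in *out if the result became ready in
// time; returns false and leaves *out untouched otherwise (this includes
// futures whose work is deferred and has not started).
bool getWithin(std::future<int>& result,
               std::chrono::milliseconds bound,
               int* out) {
  if (result.wait_for(bound) != std::future_status::ready) {
    return false;  // caller can report a timeout instead of hanging
  }
  *out = result.get();
  return true;
}
```

A caller would treat a false return as a timeout and decide whether to retry,
fail the operation, or keep waiting, rather than being blocked inside the call
with no way out.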
