Hi,

Sorry for spamming the list :( I will rate-limit myself.

I reviewed the docs for ZMQ_ROUTER_MANDATORY and it's clear now that the router socket may block if the message can be routed but HWM is reached and ZMQ_DONTWAIT is not specified. This is the exact code path my application blocks in.
The problem is that HWM is not reached in my case. zmq::router_t::xsend() checks HWM with zmq::pipe_t::check_write(), which returns false, not because HWM is reached, but because the pipe state is zmq::pipe_t::waiting_for_delimiter.

Summary: I don't think it's reasonable for zmq::router_t::xsend() to return -1 EAGAIN when the corresponding pipe is being terminated. It's obvious that the message can't be sent in the future, so there's no point in retrying.

(For the time being, as a workaround I specify ZMQ_DONTWAIT on the send, and I consider the worker dead with either EHOSTUNREACH or EAGAIN.)

What's your opinion on this?

Regards,
Gyorgy

On Thu, Feb 16, 2017 at 10:44 PM, Gyorgy Szekely <hodito...@gmail.com> wrote:

> Hi,
> I dug a bit deeper, here are my findings:
> - removing the on/off switching for the ZMQ_ROUTER_MANDATORY flag, and
>   enabling it before the router socket bind: makes no difference
> - removing the monitor trigger and heartbeating the workers periodically
>   (2.5 sec) drastically reduces the occurrence rate: the program hangs after
>   3-4 hours, instead of seconds. (In the background a worker
>   connects/disconnects with a 4 second period.)
>
> From this I suspect the issue appears in a small timeframe which is close
> to the monitor event, but otherwise hard to hit.
>
> With GDB I see the following:
> - in zmq::socket_base_t::send() the call to xsend() returns EAGAIN. This
>   should not happen since ZMQ_DONTWAIT is not specified.
> - ZMQ_DONTWAIT is not specified, so the function won't return -1, but
>   block (see trace in prev mail).
>
> - inside zmq::router_t::xsend() the pipe is found in the outpipes map, but
>   the check_write() on it returns false
> - the if(mandatory) check in this block (router.cpp:218) returns with -1,
>   EAGAIN
> - a similar block 10 lines below returns with -1, EHOSTUNREACH
>
> Should both if(mandatory) checks return EHOSTUNREACH? There's also a
> comment in the header for bool mandatory, that it will report EAGAIN, but
> this contradicts the documentation.
>
> Can you help to clarify?
>
> Regards,
> Gyorgy
>
> On Thu, Feb 16, 2017 at 12:22 PM, Gyorgy Szekely <hodito...@gmail.com>
> wrote:
>
>> Hi,
>> Continuing my journey on detecting dead workers, I reduced the design to
>> the minimum, and eliminated the messy file descriptors.
>> I only have:
>> - a router socket, with some number of peers
>> - a monitor socket attached to the router socket
>>
>> When the monitor detects a disconnect on the router socket:
>> - do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
>> - send heartbeat message to every known peer
>> - if EHOSTUNREACH returned: remove the peer
>> - do setsockopt(ZMQ_ROUTER_MANDATORY, 0);
>>
>> What happens: _my app regularly hangs_ in zmq_msg_send(). Roughly 20% of
>> the invocations. The call never returns, I have to kill the application.
>>
>> What am I doing wrong??? According to the RFCs, router sockets should
>> never block.
>> I attached a full stacktrace with info locals and args for each relevant
>> frame (sorry for the machine readable format).
>>
>> Env:
>> libzmq 4.2.1 stable, debug build
>> Ubuntu 16.04 64bit (the same happens with the Ubuntu packaged lib)
>>
>> Regards,
>> Gyorgy
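
For reference, a minimal sketch of the decision logic behind the workaround described above: send with ZMQ_DONTWAIT and treat both EHOSTUNREACH and EAGAIN as "worker is dead". The helper name classify_send() and the peer_status_t enum are hypothetical and not part of libzmq. The helper is kept free of zmq.h on purpose, so it only interprets the return value and errno that a zmq_msg_send() call would produce. It assumes a POSIX errno.h that defines EHOSTUNREACH.

```c
#include <errno.h>

/* Hypothetical result type, not part of libzmq. */
typedef enum {
    PEER_OK,     /* message was queued for the peer */
    PEER_DEAD,   /* drop the peer from the list of known workers */
    SEND_ERROR   /* some other failure, needs separate handling */
} peer_status_t;

/* Hypothetical helper: interpret the result of a zmq_msg_send() call made
 * with ZMQ_DONTWAIT on a ROUTER socket that has ZMQ_ROUTER_MANDATORY set.
 * rc  is the return value of the send call,
 * err is the errno value observed right after it (only used when rc < 0). */
static peer_status_t classify_send(int rc, int err)
{
    if (rc >= 0)
        return PEER_OK;

    /* Unroutable peer: the documented ZMQ_ROUTER_MANDATORY error. */
    if (err == EHOSTUNREACH)
        return PEER_DEAD;

    /* EAGAIN: either HWM is reached or, as observed in this thread, the
     * pipe is being terminated. The workaround treats both as dead, which
     * is an assumption that may drop a slow but live worker. */
    if (err == EAGAIN)
        return PEER_DEAD;

    return SEND_ERROR;
}
```

In the heartbeat loop this would be called roughly as classify_send(zmq_msg_send(&msg, router, ZMQ_DONTWAIT), errno) for each known peer, removing the peer on PEER_DEAD. Note the trade-off: a live worker whose pipe is genuinely at HWM gets dropped too.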
_______________________________________________ zeromq-dev mailing list zeromq-dev@lists.zeromq.org https://lists.zeromq.org/mailman/listinfo/zeromq-dev