On Fri, 2017-02-17 at 10:53 +0100, Gyorgy Szekely wrote:
> Hi,
> Sorry for spamming the list :( I will rate limit myself.
>
> I reviewed the docs for ZMQ_ROUTER_MANDATORY and it's clear now that the
> router socket may block if the message can be routed but HWM is reached and
> ZMQ_DONTWAIT is not specified. This is the exact code path my application
> blocks in.
>
> The problem is that HWM is not reached in my case. zmq::router_t::xsend()
> checks HWM with zmq::pipe_t::check_write(), which returns false, not
> because HWM is reached, but because the pipe state is
> zmq::pipe_t::waiting_for_delimiter.
>
> Summary:
> I don't think it's reasonable for zmq::router_t::xsend() to return -1
> EAGAIN when the corresponding pipe is being terminated. It's obvious that
> the message can't be sent in the future, so there's no point in retrying.
>
> (For the time being, as a workaround I specify ZMQ_DONTWAIT on the send,
> and I consider the worker dead with either EHOSTUNREACH or EAGAIN.)
>
> What's your opinion on this?
>
> Regards,
> Gyorgy
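The workaround in the quoted message can be sketched roughly as below. This is only an illustration, not libzmq code: `PeerState` and `classify_send_result` are hypothetical names, and the function merely interprets the return value and errno that a send with ZMQ_DONTWAIT on a router socket with ZMQ_ROUTER_MANDATORY enabled would hand back.

```cpp
#include <cerrno>

// Hypothetical helper for illustration; not part of libzmq.
enum class PeerState { Alive, Dead, Error };

// rc and err are the return value and errno of a non-blocking send
// (for example zmq_msg_send() with ZMQ_DONTWAIT) addressed to one peer.
inline PeerState classify_send_result(int rc, int err)
{
    if (rc >= 0)
        return PeerState::Alive;   // message was queued for the peer
    if (err == EHOSTUNREACH || err == EAGAIN)
        return PeerState::Dead;    // the workaround treats both as "worker gone"
    return PeerState::Error;       // anything else is reported to the caller
}
```

Note that with ZMQ_DONTWAIT a peer that has merely hit its HWM also yields EAGAIN, so this rule will drop slow but live workers as well.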
Is the pipe terminated when the underlying socket is disconnected? I can't
remember and I'd have to double check, but if that's the case then it could
come back, so EAGAIN would be appropriate, right?

Also, check_write() just returns true/false, and given it's in the hot path
I'd be wary of overloading it to cater for a single corner case.

> On Thu, Feb 16, 2017 at 10:44 PM, Gyorgy Szekely <hodito...@gmail.com>
> wrote:
>
> > Hi,
> > I dug a bit deeper, here are my findings:
> > - removing the on/off switching for the ZMQ_ROUTER_MANDATORY flag, and
> >   enabling it before the router socket bind: makes no difference
> > - removing the monitor trigger and heartbeating the workers periodically
> >   (2.5 sec) drastically reduces the occurrence rate: the program hangs
> >   after 3-4 hours instead of seconds. (In the background a worker
> >   connects/disconnects with a 4 second period.)
> >
> > From this I suspect the issue appears in a small time frame close
> > to the monitor event, but is otherwise hard to hit.
> >
> > With GDB I see the following:
> > - in zmq::socket_base_t::send() the call to xsend() returns EAGAIN. This
> >   should not happen since ZMQ_DONTWAIT is not specified.
> > - ZMQ_DONTWAIT is not specified, so the function won't return -1, but
> >   block (see trace in prev mail).
> >
> > - inside zmq::router_t::xsend() the pipe is found in the outpipes map, but
> >   the check_write() on it returns false
> > - the if(mandatory) check in this block (router.cpp:218) returns with -1,
> >   EAGAIN
> > - a similar block 10 lines below returns with -1, EHOSTUNREACH
> >
> > Should both if(mandatory) checks return EHOSTUNREACH? There's also a
> > comment in the header for bool mandatory, that it will report EAGAIN, but
> > this contradicts the documentation.
> >
> > Can you help to clarify?
> >
> > Regards,
> > Gyorgy
> >
> > It
> >
> > On Thu, Feb 16, 2017 at 12:22 PM, Gyorgy Szekely <hoditohod@gmail.com>
> > wrote:
> >
> > > Hi,
> > > Continuing my journey on detecting dead workers I reduced the design to
> > > the minimum, and eliminated the messy file descriptors.
> > > I only have:
> > > - a router socket, with some number of peers
> > > - a monitor socket attached to the router socket
> > >
> > > When the monitor detects a disconnect on the router socket:
> > > - do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
> > > - send heartbeat message to every known peer
> > > - if EHOSTUNREACH returned: remove the peer
> > > - do setsockopt(ZMQ_ROUTER_MANDATORY, 0);
> > >
> > > What happens: _my app regularly hangs_ in zmq_msg_send(), in roughly
> > > 20% of the invocations. The call never returns, I have to kill the
> > > application.
> > >
> > > What am I doing wrong??? According to the RFCs, router sockets should
> > > never block.
> > > I attached a full stacktrace with info locals and args for each relevant
> > > frame (sorry for the machine readable format).
> > >
> > > Env:
> > > libzmq 4.2.1 stable, debug build
> > > Ubuntu 16.04 64bit (the same happens with ubuntu packaged lib)
> > >
> > > Regards,
> > > Gyorgy
> > >
> >
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev@lists.zeromq.org
> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
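
For reference, the two if(mandatory) branches described in the quoted GDB findings can be modelled as a small decision function. This is a simplified sketch built only from the description in this thread, not the libzmq source: `mandatory_send_errno` is a hypothetical name, and mapping the EHOSTUNREACH branch to "no pipe found for the peer" is an assumption.

```cpp
#include <cerrno>

// Simplified, hypothetical model of the behaviour described in the thread;
// not the actual zmq::router_t::xsend() implementation.
//
// pipe_found    : the peer's pipe exists in the outpipes map
// pipe_writable : result of check_write(); false both when HWM is reached
//                 and when the pipe is in the waiting_for_delimiter state
// mandatory     : ZMQ_ROUTER_MANDATORY is enabled
//
// Returns 0 when no error is reported (the message is sent or dropped),
// otherwise the errno value that accompanies the -1 return.
inline int mandatory_send_errno(bool pipe_found, bool pipe_writable, bool mandatory)
{
    if (pipe_found) {
        if (!pipe_writable && mandatory)
            return EAGAIN;        // first if(mandatory) branch in the findings
        return 0;
    }
    // Assumption: the second branch covers a peer with no pipe at all.
    return mandatory ? EHOSTUNREACH : 0;
}
```

In this model a terminating pipe and a full pipe are indistinguishable to the caller, which is the ambiguity the thread is about: a blocking send retries on EAGAIN even though the peer may never return.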