Hi,
Sorry for spamming the list :( I will rate limit myself.

I reviewed the docs for ZMQ_ROUTER_MANDATORY and it's clear now that the
router socket may block if the message can be routed but HWM is reached and
ZMQ_DONTWAIT is not specified. This is the exact code path my application
blocks in.

The problem is that the HWM is not reached in my case. zmq::router_t::xsend()
checks the HWM with zmq::pipe_t::check_write(), which returns false, not
because the HWM is reached, but because the pipe state is
zmq::pipe_t::waiting_for_delimiter.

Summary:
I don't think it's reasonable for zmq::router_t::xsend() to return -1 with
EAGAIN when the corresponding pipe is being terminated. The message can
never be sent on that pipe, so there's no point in retrying.

(For the time being, as a workaround I specify ZMQ_DONTWAIT on the send,
and I consider the worker dead on either EHOSTUNREACH or EAGAIN.)
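
A rough sketch of how the workaround classifies the send result, in case it
helps the discussion. The names peer_status_t and classify_send_result are
made up for this mail and are not part of libzmq. Only the errno values and
the zmq_msg_send() / zmq_errno() calls mentioned in the comment are real.
Note that treating EAGAIN as "dead" also catches a peer that is genuinely at
its HWM, which is acceptable for my heartbeat use case but may not be for
others.

```c
#include <errno.h>

/* Hypothetical result type for this sketch (not a libzmq type). */
typedef enum { PEER_OK, PEER_DEAD, SEND_ERROR } peer_status_t;

/*
 * Classify the outcome of a non-blocking send on a ROUTER socket that has
 * ZMQ_ROUTER_MANDATORY enabled.
 *
 *   rc  - return value of the send call (-1 on failure)
 *   err - errno value captured right after a failed send
 *
 * Intended use (assumes <zmq.h> and an open router socket):
 *
 *   int rc = zmq_msg_send(&msg, router, ZMQ_DONTWAIT);
 *   peer_status_t st = classify_send_result(rc, rc < 0 ? zmq_errno() : 0);
 */
peer_status_t classify_send_result(int rc, int err)
{
    if (rc >= 0)
        return PEER_OK;

    /* EHOSTUNREACH: the peer is not routable.
     * EAGAIN: HWM reached, or (as observed above) the pipe is being
     * terminated. Both are treated as a dead worker in this workaround. */
    if (err == EHOSTUNREACH || err == EAGAIN)
        return PEER_DEAD;

    /* Anything else (EINTR, ETERM, ...) is left to the caller. */
    return SEND_ERROR;
}
```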

What's your opinion on this?


Regards,
  Gyorgy

On Thu, Feb 16, 2017 at 10:44 PM, Gyorgy Szekely <hodito...@gmail.com>
wrote:

> Hi,
> I dug a bit deeper, here are my findings:
> - removing the on/off switching for the ZMQ_ROUTER_MANDATORY flag, and
> enabling it before the router socket bind: makes no difference
> - removing the monitor trigger and heartbeating the workers periodically
> (2.5 sec) drastically reduces the occurrence rate, the program hangs after
> 3-4 hours, instead of seconds. (in the background a worker
> connects/disconnects with 4 second period time)
>
> From this I suspect the issue appears in a small timeframe which is close
> to the monitor event, but otherwise hard to hit.
>
> With GDB I see the following:
> - in zmq::socket_base_t::send() the call to xsend() returns EAGAIN. This
> should not happen since ZMQ_DONTWAIT is not specified.
> - ZMQ_DONTWAIT is not specified, so the function doesn't return -1, but
> blocks (see trace in prev mail).
>
> - inside zmq::router_t::xsend() the pipe is found in the outpipes map, but
> the check_write() on it returns false
> - the if(mandatory) check in this block (router.cpp:218) returns with -1,
> EAGAIN
> - a similar block 10 lines below returns with -1, EHOSTUNREACH
>
> Should both if(mandatory) checks return EHOSTUNREACH? There's also a
> comment in the header for bool mandatory saying that it will report EAGAIN,
> but this contradicts the documentation.
>
> Can you help to clarify?
>
>
> Regards,
>   Gyorgy
>
>
> On Thu, Feb 16, 2017 at 12:22 PM, Gyorgy Szekely <hodito...@gmail.com>
> wrote:
>
>> Hi,
>> Continuing my journey on detecting dead workers I reduced the design to
>> the minimal, and eliminated the messy file descriptors.
>> I only have:
>> - a router socket, with some number of peers
>> - a monitor socket attached to the router socket
>>
>> When the monitor detects a disconnect on the router socket:
>> - do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
>> - send heartbeat message to every known peer
>> - if EHOSTUNREACH returned: remove the peer
>> - do setsockopt(ZMQ_ROUTER_MANDATORY, 0);
>>
>> What happens: _my app regularly hangs_ in zmq_msg_send(). Roughly 20% of
>> the invocations. The call never returns, I have to kill the application.
>>
>> What am I doing wrong??? According to the RFCs, router sockets should
>> never block.
>> I attached a full stacktrace with info locals and args for each relevant
>> frame (sorry for the machine readable format).
>>
>> Env:
>> libzmq 4.2.1 stable, debug build
>> Ubuntu 16.04 64bit (the same happens with ubuntu packaged lib)
>>
>> Regards,
>>   Gyorgy
>>
>>
>