[zeromq-dev] Router socket reconnection failure
Hi all,

I'm experimenting with a router-router setup and I'm getting a strange issue when peers reconnect.

Basically, I have three nodes, which I'll call D, P1 and P2. The idea is that D has a known TCP endpoint and socket identity. P1 and P2 connect to D, register their TCP endpoint and identity, and then discover each other through D (the directory). At this point, one of them connects to the other and they become peers. Through heartbeating, they can successfully detect connections and disconnections of the other peer. Because the topology is dynamic and volatile, peers explicitly disconnect when they detect that one of their peers is unresponsive.

So far, my prototype implementations of programs for D and P* are working as intended. The issue I'm having is with this sequence:

- P1 and P2 discover each other through D;
- P1 connects to P2 and P2 waits for a connection from P1 (direction is determined by lexicographical ordering of identities, which both peers have prior to connecting);
- peers exchange heartbeats for a while;
- I forcibly crash P2;
- P1 eventually detects that P2 is unresponsive and explicitly disconnects;
- after this happens, I restart P2;
- P1 and P2 discover each other through D again;
- P1 tries to connect to P2 and P2 expects a connection from P1;
- both peers send heartbeats, but neither peer receives the other's messages and it appears the connection is never established.

Also note that after this has happened, context termination hangs despite closing the (only) socket and setting the linger to 1 second.

If I crash P1 instead of P2, the reconnection is successful. Also, if after the error sequence above I crash P1, peers reconnect successfully.

As far as I can tell, the problem seems to be that a sequence of zmq_connect(), zmq_disconnect() and zmq_connect() on the same router socket and with the same endpoint corrupts the router socket.

Has anyone encountered this issue before? I'm using ZMQ 4.1.0 via the PyZMQ bindings.

I may be able to work out a minimalist repro if necessary.

Thanks,
André
___
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
http://lists.zeromq.org/mailman/listinfo/zeromq-dev
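For readers following the thread, the peer-side logic described in this message (connection direction chosen by identity ordering, and heartbeat-based detection of an unresponsive peer) might be sketched as below. This is not André's code. The names `is_connector` and `PeerLiveness` are hypothetical, and the choice that the lexicographically smaller identity is the side that connects is an assumption; the message only says that ordering decides the direction.

```python
import time


def is_connector(my_identity: bytes, peer_identity: bytes) -> bool:
    """Return True if this node should connect and the peer should wait.

    Assumption: the lexicographically smaller identity connects.
    """
    if my_identity == peer_identity:
        raise ValueError("identities must be distinct")
    # bytes objects compare lexicographically in Python.
    return my_identity < peer_identity


class PeerLiveness:
    """Remember when each peer last sent a heartbeat."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated
        self.clock = clock          # injectable so it can be tested
        self.last_seen = {}         # identity -> time of last heartbeat

    def heartbeat(self, identity: bytes) -> None:
        """Record a heartbeat received from a peer."""
        self.last_seen[identity] = self.clock()

    def expired(self) -> list:
        """Return and forget the peers silent for longer than the timeout.

        The caller would explicitly disconnect from each returned peer.
        """
        now = self.clock()
        dead = [i for i, t in self.last_seen.items() if now - t > self.timeout]
        for identity in dead:
            del self.last_seen[identity]
        return sorted(dead)
```

With a rule like this, both peers reach the same decision independently, because each one already knows both identities before connecting.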
Re: [zeromq-dev] Router socket reconnection failure
Hi Andre,

On Tue, Dec 16, 2014, at 07:14 AM, Andre Caron wrote:
> The issue I'm having is with this sequence:
>
> - P1 and P2 discover each other through D;
> - P1 connects to P2 and P2 waits for a connection from P1 (direction is determined by lexicographical ordering of identities, which both peers have prior to connecting);
> - Peers exchange heartbeats for a while;
> - I forcibly crash P2;
> - P1 eventually detects that P2 is unresponsive and explicitly disconnects;
> - after this happens, I restart P2;
> - P1 and P2 discover each other through D again;
> - P1 tries to connect to P2 and P2 expects a connection from P1;
> - both peers send heartbeats, but neither peer receives the other's messages and it appears the connection is never established.
>
> Also note that after this has happened, context termination hangs despite closing the (only) socket and setting the linger to 1 second.
>
> If I crash P1 instead of P2, the reconnection is successful. Also, if after the error sequence above I crash P1, peers reconnect successfully.

This is a known issue, and I reported it earlier this year:

http://lists.zeromq.org/pipermail/zeromq-dev/2014-February/025202.html

I believe the problem is that once a connector queue learns the ID of a remote address, this binding sticks for life. The reason that you can restart P1 and things work is because connectors maintain queues even if there are no connections, but binders don't.

Unfortunately I haven't had time yet to look at a fix.

Justin
Re: [zeromq-dev] Router socket reconnection failure
Hi Justin,

Thanks for the info :-)

Just read that thread, but the case seems slightly different: all my nodes use a persistent identity, which I set immediately after creating the socket and thus before any bind or connect operation. However, I just tried having P2 restart with a new identity and I get the same problem.

I'm really confused by the answers from Laurent near the end of the thread. It seems to me that the whole point of the identity socket option is to send the string to the peer so that it can resume a session across multiple TCP connections and/or process executions. It also seems to me that if it doesn't work in this scenario, the identity would only be useful for debugging.

In addition, nothing I've seen so far explains the fact that this scenario causes zmq_term() to hang forever despite closing all sockets and setting a non-zero linger value, which is clearly a bug.

I tried playing around with my code a bit more. Using ZMQ 4.0.5, I get the error. If I switch to ZMQ 4.1.0, the peers reconnect, but zmq_term() still hangs as soon as P1 reconnects to P2. I don't know what was fixed between those two releases, but something almost fixed the problem!

If it's of any help, setting the ZMQ_ROUTER_HANDOVER option to 1 doesn't prevent zmq_term() from hanging. This option doesn't exist in 4.0 releases, so I can't try it out there.
André
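As an illustration of the socket setup this message describes (identity set right after creating the socket and before any bind or connect, a finite linger, and ZMQ_ROUTER_HANDOVER where the library supports it), a PyZMQ sketch could look like the following. It is not André's program and makes no claim to reproduce the hang. The function name `open_and_close_router`, the loopback address and the random port are assumptions made for the example. In PyZMQ, `Context.term()` is the call that corresponds to zmq_term().

```python
import zmq


def open_and_close_router(identity: bytes, linger_ms: int = 1000) -> str:
    """Configure a ROUTER socket, bind it, then close it and end the context.

    Returns the endpoint that was bound, for inspection.
    """
    ctx = zmq.Context()
    sock = ctx.socket(zmq.ROUTER)

    # Identity goes first, before any bind() or connect().
    sock.setsockopt(zmq.IDENTITY, identity)
    # Finite linger (1 second by default here, as in the thread).
    sock.setsockopt(zmq.LINGER, linger_ms)
    # ZMQ_ROUTER_HANDOVER does not exist in 4.0 releases, so only set it
    # when the linked libzmq is 4.1 or newer.
    if zmq.zmq_version_info() >= (4, 1, 0):
        sock.setsockopt(zmq.ROUTER_HANDOVER, 1)

    # Assumption: loopback with a random free port, to keep this standalone.
    port = sock.bind_to_random_port("tcp://127.0.0.1")
    endpoint = "tcp://127.0.0.1:%d" % port

    # Close the only socket, then terminate the context. term() blocks
    # until all sockets are closed and linger has been honoured.
    sock.close()
    ctx.term()
    return endpoint
```

Calling `open_and_close_router(b"P1")` with no peers attached gives a baseline of the teardown path: if this returns promptly while the full P1/P2 scenario does not, the difference lies in the reconnect sequence rather than in the option setup.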