[zeromq-dev] Router socket reconnection failure

2014-12-16 Thread Andre Caron
Hi all,

I'm experimenting with a router-router setup and I'm getting a strange issue 
when peers reconnect.

Basically, I have three nodes, which I'll call D, P1 and P2.  The idea is that 
D has a known TCP endpoint and socket identity.  P1 and P2 connect to D, 
register their TCP endpoint and identity, and then discover each other through D 
(the directory).  At this point, one of them connects to the other and they 
become peers.  Through heartbeating, they can successfully detect connections 
and disconnections of the other peer.  Because the topology is dynamic and 
volatile, peers explicitly disconnect when they detect that one of their peers 
is unresponsive.
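
For readers following along, the heartbeat-based detection described above could be tracked roughly like this. It is only a sketch, not the prototype's actual code: the `PeerLiveness` class, its method names and the `timeout` parameter are hypothetical, and the clock is injectable purely so the logic can be exercised without waiting.

```python
import time


class PeerLiveness:
    """Remember when each peer was last heard from (hypothetical helper)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence tolerated (assumed policy)
        self.clock = clock        # injectable so tests can control time
        self.last_seen = {}       # peer identity -> time of last heartbeat

    def heard_from(self, identity):
        """Record a heartbeat (or any message) from a peer."""
        self.last_seen[identity] = self.clock()

    def expired(self):
        """Return the peers that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )

    def forget(self, identity):
        """Drop a peer once it has been explicitly disconnected."""
        self.last_seen.pop(identity, None)
```

A node would call `heard_from()` for every message received, poll `expired()` periodically, and disconnect and `forget()` each peer it returns.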

So far, my prototype implementations of programs for D and P* are working as 
intended.

The issue I'm having is with this sequence:
- P1 and P2 discover each other through D;
- P1 connects to P2 and P2 waits for a connection from P1 (direction is 
determined by lexicographical ordering of identities, which both peers have 
prior to connecting);
- Peers exchange heartbeats for a while;
- I forcibly crash P2;
- P1 eventually detects that P2 is unresponsive and explicitly disconnects;
- after this happens, I restart P2;
- P1 and P2 discover each other through D again;
- P1 tries to connect to P2 and P2 expects a connection from P1;
- both peers send heartbeats, but neither peer receives the other's messages 
and it appears the connection is never established.
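
The direction rule used in the sequence above can be sketched as follows. The function name `should_connect` is hypothetical, and the choice that the peer with the lexicographically smaller identity initiates the connection is an assumption inferred from P1 connecting to P2; the real rule could just as well be the reverse.

```python
def should_connect(my_identity: bytes, peer_identity: bytes) -> bool:
    """Decide whether this node initiates the connection to the peer.

    Assumption: the lexicographically smaller identity connects and the
    larger one waits for the incoming connection.
    """
    if my_identity == peer_identity:
        # Two nodes with the same identity cannot agree on a direction.
        raise ValueError("peer identities must be distinct")
    return my_identity < peer_identity
```

Since both peers know both identities before connecting, each can evaluate the rule locally and they reach opposite answers without any negotiation.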

Also note that after this has happened, context termination hangs despite 
closing the (only) socket and setting the linger to 1 second.

If I crash P1 instead of P2, the reconnection is successful.  Also, if after 
the error sequence above I crash P1, peers reconnect successfully.

As far as I can tell, the problem seems to be that a sequence of zmq_connect(), 
zmq_disconnect() and zmq_connect() on the same router socket and with the same 
endpoint corrupts the router socket.
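
For illustration, the suspected sequence can be written down with PyZMQ roughly as below. This is a sketch under assumptions, not the actual prototype and not a confirmed reproduction: the function name, endpoint and identity values are placeholders, and the identity is assumed to be set before the first connect.

```python
import zmq


def connect_disconnect_connect(endpoint, identity, linger_ms=1000):
    """Run connect/disconnect/connect on one ROUTER socket, then shut down."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.ROUTER)
    # Assumption: identity and linger are configured before any connect.
    sock.setsockopt(zmq.IDENTITY, identity)
    sock.setsockopt(zmq.LINGER, linger_ms)
    try:
        sock.connect(endpoint)      # first session with the peer
        sock.disconnect(endpoint)   # peer judged unresponsive
        sock.connect(endpoint)      # peer rediscovered through the directory
    finally:
        sock.close()
        ctx.term()  # the hang described in this message would show up here
    return True
```

Calling it with something like `connect_disconnect_connect("tcp://127.0.0.1:5555", b"P1")` exercises only the socket calls; a real reproduction would also need the peer to crash and restart between the disconnect and the second connect.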

Has anyone encountered this issue before?  I'm using ZMQ 4.1.0 via the PyZMQ 
bindings.

I may be able to work out a minimalist repro if necessary.

Thanks,

André
___
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
http://lists.zeromq.org/mailman/listinfo/zeromq-dev


Re: [zeromq-dev] Router socket reconnection failure

2014-12-16 Thread Justin Karneges
Hi Andre,

On Tue, Dec 16, 2014, at 07:14 AM, Andre Caron wrote:
> The issue I'm having is with this sequence:
> - P1 and P2 discover each other through D;
> - P1 connects to P2 and P2 waits for a connection from P1 (direction is
> determined by lexicographical ordering of identities, which both peers
> have prior to connecting);
> - Peers exchange heartbeats for a while;
> - I forcibly crash P2;
> - P1 eventually detects that P2 is unresponsive and explicitly
> disconnects;
> - after this happens, I restart P2;
> - P1 and P2 discover each other through D again;
> - P1 tries to connect to P2 and P2 expects a connection from P1;
> - both peers send heartbeats, but neither peer receives the other's
> messages and it appears the connection is never established.
>
> Also note that after this has happened, context termination hangs despite
> closing the (only) socket and setting the linger to 1 second.
>
> If I crash P1 instead of P2, the reconnection is successful.  Also, if
> after the error sequence above I crash P1, peers reconnect successfully.

This is a known issue, and I reported it earlier this year:
http://lists.zeromq.org/pipermail/zeromq-dev/2014-February/025202.html

I believe the problem is that once a connector queue learns the ID of a
remote address, this binding sticks for life. The reason that you can
restart P1 and things work is because connectors maintain queues even if
there are no connections, but binders don't.

Unfortunately I haven't had time yet to look at a fix.

Justin


Re: [zeromq-dev] Router socket reconnection failure

2014-12-16 Thread André Caron
Hi Justin,

Thanks for the info :-)

Just read that thread, but the case seems slightly different: all my nodes
use a persistent identity, which I set immediately after creating the
socket and thus before any bind or connect operation.  However, I just
tried having P2 restart with a new identity and I get the same problem.

I'm really confused by the answers from Laurent near the end of the
thread.  It seems to me like the whole point of the identity socket option
is to send the string to the peer so that it can resume a session across
multiple TCP connections and/or process executions.  It also seems to me
that if it doesn't work in this scenario, then the identity is only useful
for debugging.  In addition, nothing I've seen so far
explains the fact that this scenario causes zmq_term() to hang forever
despite closing all sockets and setting a non-zero linger value, which is
clearly a bug.

I tried playing around with my code a bit more.  Using ZMQ 4.0.5, I get the
error. If I switch to ZMQ 4.1.0, the peers reconnect, but zmq_term()
still hangs as soon as P1 reconnects to P2.  I don't know what was fixed
between those two releases, but something almost fixed the problem!

If it's of any help, setting the ZMQ_ROUTER_HANDOVER option to 1 doesn't
prevent zmq_term() from hanging.  This option doesn't exist in 4.0
releases, so I can't try it out there.

André