Thanks Bill, I pulled the latest libzmq and the issue still occurs.
I have tracked it down to the protocol_error handling. When a ZMQ_SUB connects to a ZMQ_REQ, a protocol_error happens (expected) and the session is terminated. The termination does not remove that connection endpoint from the socket. This means subsequent calls to socket->connect on the same endpoint (after the correct service has resumed) are no-ops, because SUB can only have one connection to a single endpoint.

The change below fixes my issue, but I'm not sure if it's correct for other protocol errors. I haven't worked on the sessions/pipes before. I noticed in gdb that the second session has a _pipe but is not fully created.

https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L487

    case i_engine::protocol_error:
        // if (_pending) {
        if (_pending || handshaked_) { // <<< if handshaked we should also terminate pipes.
            if (_pipe)
                _pipe->terminate (false);
            if (_zap_pipe)
                _zap_pipe->terminate (false);
        } else {
            terminate ();
        }

I am happy to create a pull request to discuss whether I am on the right track.

I have test code to reproduce the issue:

    #include "testutil.hpp"
    #include "testutil_unity.hpp"
    #include <iostream>
    #include <stdlib.h>

    SETUP_TEARDOWN_TESTCONTEXT

    char end[] = "tcp://127.0.0.1:55667";

    void test_pubreq ()
    {
        // SUB up and connect to 55667
        void *sub = test_context_socket (ZMQ_SUB);
        TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0));
        TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));

        // REQ is up incorrectly on 55667
        void *req = test_context_socket (ZMQ_REQ);
        TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (req, end));
        msleep (1000);
        TEST_ASSERT_SUCCESS_ERRNO (zmq_unbind (req, end));

        // REQ is down
        // At this point the SUB socket has a protocol_error on 55667
        // (so no reconnect) but the socket thinks it is still connected
        // to 55667
        msleep (1000);

        // PUB correctly comes up on 55667
        void *pub = test_context_socket (ZMQ_PUB);
        TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (pub, end));

        // NOTE: If we force a disconnect here it works.
        // TEST_ASSERT_SUCCESS_ERRNO (zmq_disconnect (sub, end));

        // Connect again fails
        TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
        msleep (100);

        send_string_expect_success (pub, "Hello", 0);
        msleep (100);
        recv_string_expect_success (sub, "Hello", 0);
        msleep (100);

        test_context_socket_close (pub);
        test_context_socket_close (req);
        test_context_socket_close (sub);
    }

    int main (void)
    {
        setup_test_environment ();

        UNITY_BEGIN ();
        RUN_TEST (test_pubreq);
        return UNITY_END ();
    }

On Thu, May 20, 2021 at 4:56 PM Bill Torpey <wallstp...@gmail.com> wrote:

> Sorry, meant to get back to you sooner, but it’s been a crazy week.
>
> You don’t say what version you’re running, but there have been some
> changes in that area not that long ago; check these out and see if they
> help:
>
> https://github.com/zeromq/libzmq/pull/3831
>
> https://github.com/zeromq/libzmq/pull/3960
>
> https://github.com/zeromq/libzmq/pull/4053
>
> Good luck.
>
> Bill
>
>
> On May 20, 2021, at 10:26 AM, James Harvey <jamesdillonhar...@gmail.com>
> wrote:
>
> Hi,
>
> I will try and simplify my previous long email.
>
> If a stream gets into a protocol error state (e.g. tcp SUB connect to REQ)
>
> Should the information (connection is terminated) be passed somehow back
> to the parent socket so if connect() is called again it attempts to connect
> rather than a no-op.
>
> OR
>
> Should we add a protocol error event to socket monitor so the calling
> process can handle it by calling disconnect/connect
>
> Just want some clarification so I work on the correct code.
>
> Thanks
>
> James
>
> On Thu, May 13, 2021 at 4:48 PM James Harvey <jamesdillonhar...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a
>> certain endpoint with no way to track/notify. Yes it's because a SUB
>> connects to a REQ socket but once you start to use zeromq for lots of
>> transient systems in a large company this kind of thing will happen
>> occasionally.
>>
>> The process happens like this:
>>
>> - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
>> - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
>> - ZMQ_PUB goes down
>> - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444
>>   as its ephemeral
>> - ZMQ_SUB has not yet been told to disconnect so it reconnects to the
>>   ZMQ_REQ
>> - protocol error happens and the connection is terminated in the
>>   session/engine
>> - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
>> - ZMQ_SUB gets new instruction to connect()
>> - connect() just returns noop.
>> - The socket_base thinks it still has a valid endpoint and SUB only
>>   connects once to each endpoint.
>> - At this point there are no errors and no data flowing.
>>
>> My question is, should the protocol_error in the session propagate up to
>> remove the endpoint from the socket?
>>
>> If yes I can look at adding that, if no do you have any suggestions?
>>
>> Thanks for your time
>>
>> James
>>
>> Some links to the code:
>>
>> If socket is SUB and the endpoint is present don't connect.
>> https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
>>
>> terminate with no reconnect on protocol_error
>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486
>>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev@lists.zeromq.org
> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>
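
To make the sequence in the thread easier to reason about, here is a toy model of the endpoint bookkeeping. It is only a sketch under stated assumptions: ToySubSocket, protocol_error (), has_live_session () and reconnect_succeeds () are hypothetical names that do not exist in libzmq, and the model only mirrors the steps listed in the thread (one connection per endpoint for SUB, session dropped on protocol error, endpoint optionally left registered). It says nothing about how session_base or socket_base are really structured.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical toy model of the bookkeeping described in the thread.
// None of these names exist in libzmq.
class ToySubSocket
{
  public:
    // Mimics the SUB rule from the thread: at most one connection per
    // endpoint, so a repeated connect () to a known endpoint is a no-op.
    // Returns true when a new session is started, false on a no-op.
    bool connect (const std::string &endpoint)
    {
        if (!_endpoints.insert (endpoint).second)
            return false;
        _live.insert (endpoint);
        return true;
    }

    // An explicit disconnect always forgets the endpoint.
    void disconnect (const std::string &endpoint)
    {
        _endpoints.erase (endpoint);
        _live.erase (endpoint);
    }

    // The session dies on a protocol error. With forget_endpoint == false
    // (the behaviour reported in the thread) the endpoint stays registered;
    // with true (the behaviour being asked about) it is removed as well.
    void protocol_error (const std::string &endpoint, bool forget_endpoint)
    {
        _live.erase (endpoint);
        if (forget_endpoint)
            _endpoints.erase (endpoint);
    }

    bool has_live_session (const std::string &endpoint) const
    {
        return _live.count (endpoint) != 0;
    }

  private:
    std::set<std::string> _endpoints; // endpoints the socket believes it has
    std::set<std::string> _live;      // endpoints with a working session
};

// Replays the scenario and reports whether the final connect () left the
// socket with a live session on the endpoint.
bool reconnect_succeeds (bool forget_endpoint, bool force_disconnect)
{
    const std::string ep = "tcp://1.2.3.4:44444";
    ToySubSocket sub;

    const bool started = sub.connect (ep); // SUB connects, wrong peer answers
    assert (started);
    (void) started;

    sub.protocol_error (ep, forget_endpoint); // session is terminated
    if (force_disconnect)
        sub.disconnect (ep); // workaround noted in the test code
    sub.connect (ep);        // good PUB is back, connect again
    return sub.has_live_session (ep);
}
```

In this model reconnect_succeeds (false, false) is the failing case from the thread, while either forcing a disconnect first or dropping the endpoint on the protocol error lets the second connect () go through. Whether dropping the endpoint is the right behaviour for every protocol error in libzmq is exactly the open question.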