Re: [zeromq-dev] zeromq protocol_error handling

2021-05-22 Thread Bill Torpey
Hi James:

> In general zeromq has a steep learning curve, and trying to work out if the 
> behaviour you think is bad is really an issue or expected is hard.

You’re not kidding — I’ve been through the same thing.  It’s only recently that 
I’ve felt comfortable making even minor changes, and I’ve had some help along 
the way.

> 
>  The maintainers of zmq clearly have a far superior knowledge so it's easy to 
> just let them do all the work. This feels wrong so I want to help.

In my experience, the maintainers (esp. Doron, Luca and Simon) have been great, 
but unlike some other OSS projects, ZeroMQ is a side gig for them, so bear that 
in mind.  

Regards,

Bill



Re: [zeromq-dev] zeromq protocol_error handling

2021-05-22 Thread James Harvey
Hi,

I moved this to an issue as suggested

https://github.com/zeromq/libzmq/issues/4196

Thanks.

James


Re: [zeromq-dev] zeromq protocol_error handling

2021-05-21 Thread James Harvey
Thanks Bill for the advice, I will implement the monitoring to gather more
data. I think I have sufficient information to create an issue now.

In general zeromq has a steep learning curve, and trying to work out if the
behaviour you think is bad is really an issue or expected is hard.

The maintainers of zmq clearly have far superior knowledge, so it's easy
to just let them do all the work. This feels wrong so I want to help.





Re: [zeromq-dev] zeromq protocol_error handling

2021-05-21 Thread Bill Torpey
Hey James:

Going back over your original scenario:

>  - ZMQ_PUB binds on 1.2.3.4:4  (ephemeral)

>  - ZMQ_SUB connects to 1.2.3.4:4  (data flows)

>  - ZMQ_PUB goes down

At this point the SUB should get a disconnect.  It will then start trying to 
reconnect, which it will keep doing “forever” with no further action on your 
part.  (The default for ZMQ_RECONNECT_IVL is 100 milliseconds).

This PR (https://github.com/zeromq/libzmq/pull/3831) explicitly checks for the 
scenario where a previously-connected socket gets ECONNREFUSED when attempting 
to reconnect.  If that condition is detected, the reconnect is aborted AND the 
endpoint address is “forgotten” so subsequent attempts to connect (not 
re-connect) to that endpoint are not silently ignored. 

Note that you have to ask for this behavior, as it's not the default, by 
calling something like "zmq_setsockopt(socket, ZMQ_RECONNECT_STOP, &opt, 
sizeof(opt))", where opt is an int set to ZMQ_RECONNECT_STOP_CONN_REFUSED.
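
To make the opt-in concrete, here is a toy model in plain C of the decision
described above: by default a failed reconnect is simply retried, and only a
socket that has opted in gives up (and forgets the endpoint) when the
connection is refused.  This is not libzmq code; the enum names and values are
made up for illustration, and the real option is set through zmq_setsockopt().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for failure reasons and opt-in bits; these are
   NOT the libzmq definitions. */
enum failure { FAIL_TIMEOUT = 0, FAIL_CONN_REFUSED = 1 };
enum stop_flag { STOP_NONE = 0x0, STOP_ON_CONN_REFUSED = 0x1 };

enum action { KEEP_RECONNECTING, STOP_AND_FORGET_ENDPOINT };

/* Decide what happens after a failed reconnect attempt. */
enum action after_failed_reconnect (int stop_mask, enum failure why)
{
    bool opted_in = (stop_mask & STOP_ON_CONN_REFUSED) != 0;

    /* Only a refused connection on an opted-in socket stops the retries. */
    if (opted_in && why == FAIL_CONN_REFUSED)
        return STOP_AND_FORGET_ENDPOINT;

    /* Default: keep retrying, the endpoint stays registered. */
    return KEEP_RECONNECTING;
}
```

In this model a socket with STOP_NONE keeps retrying even on a refused
connection, which corresponds to the default behavior described above.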

(FWIW, I initially suggested that silently ignoring duplicate connection 
attempts is a bad idea, and would prefer that the connect return an error (like 
EAGAIN), but there was push-back on that as it’s a change in behavior.  I still 
think that’s a better approach).


>  - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:4 
>  as its ephemeral 


It seems unlikely that another process could grab the same ephemeral port 
without an intervening ECONNREFUSED (no code listening at port). 

You really need to implement the socket monitoring code (as I’ve already 
suggested).  Make sure to use zmqBridgeMamaTransportImpl_monitorEvent_v2 as 
that will give you both endpoint addresses.

If that’s too much trouble, you may be able to use zmtpdump 
(https://github.com/zeromq/zmtpdump) or Wireshark to see what is really 
going on.

Last but not least, you are likely better off creating an issue on GitHub for 
this.

Regards,

Bill



Re: [zeromq-dev] zeromq protocol_error handling

2021-05-21 Thread James Harvey
Hi Bill,

I will check/reply to the rest of the points later (I'm in the pub), but that
is the point. The protocol_error stops everything, so there is no more
reconnect from the pub socket. It's effectively a zombie: it's terminated, but
the endpoint is still registered on the socket.

Cheers

James



Re: [zeromq-dev] zeromq protocol_error handling

2021-05-21 Thread Bill Torpey
Hi James:

A couple of questions:

- Is the SUB socket attempting to reconnect?  (Default is yes).

- Are you activating any of the socket options added by recent changes?  IIRC 
none of the new options (e.g., ZMQ_RECONNECT_STOP_CONN_REFUSED) have any 
effect by default; they need to be activated explicitly.

- Are you tracing socket events?  If not, you should give that a try — it will 
tell you what is going on “under the covers”. You can find an example at 
https://github.com/nyfix/OZ/blob/4627b0364be80de4451bf1a80a26c00d0ba9310f/src/transport.c#L1549
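
As a rough sketch of what tracing involves: each event message read from a
zmq_socket_monitor() PAIR socket starts with a 6-byte frame holding a 16-bit
event id and a 32-bit value in host byte order (the second frame is the
endpoint string).  The decoder below assumes that layout.  The EV_* constants
mirror the ZMQ_EVENT_* values in zmq.h and are copied here only so the sketch
compiles without libzmq; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Copies of ZMQ_EVENT_* values from zmq.h (libzmq 4.3), renamed so this
   sketch stands alone. */
#define EV_DISCONNECTED 0x0200
#define EV_HANDSHAKE_FAILED_NO_DETAIL 0x0800
#define EV_HANDSHAKE_FAILED_PROTOCOL 0x2000
#define EV_HANDSHAKE_FAILED_AUTH 0x4000

/* Decode the first frame of a monitor event message.
   Returns 0 on success, -1 if the frame is not exactly 6 bytes long. */
int decode_monitor_frame (const unsigned char *data, size_t len,
                          uint16_t *event, uint32_t *value)
{
    if (len != sizeof *event + sizeof *value)
        return -1;
    /* memcpy avoids unaligned reads from the message buffer. */
    memcpy (event, data, sizeof *event);
    memcpy (value, data + sizeof *event, sizeof *value);
    return 0;
}

/* True for the events that report a failed handshake with the peer. */
int is_handshake_failure (uint16_t event)
{
    return event == EV_HANDSHAKE_FAILED_NO_DETAIL
           || event == EV_HANDSHAKE_FAILED_PROTOCOL
           || event == EV_HANDSHAKE_FAILED_AUTH;
}
```

In a real program the data pointer and length would come from zmq_msg_data()
and zmq_msg_size() on the first frame received from the monitor socket.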

I’ll try to take a look when I have some time, but not sure when that will be …

Regards,

Bill


Re: [zeromq-dev] zeromq protocol_error handling

2021-05-21 Thread James Harvey
Thanks Bill

I pulled the latest libzmq and the issue still occurs.

I have tracked it down to the protocol_error handling. When a ZMQ_SUB
connects to a ZMQ_REQ, a protocol_error happens (expected) and the
session is terminated.

The termination does not remove that connection endpoint from the socket.
This means subsequent calls to socket->connect on the same endpoint (after
the correct service has resumed) are no-ops, because a SUB can only have one
connection to a single endpoint.
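
A toy model may make the failure mode easier to see. The sketch below is
plain C, not libzmq code; the struct and function names are hypothetical, and
it only assumes the behaviour described above: connect() on an
already-registered endpoint is ignored, a protocol error ends the session
without removing the registration, and an explicit disconnect does remove it.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_ENDPOINTS 8

/* Toy SUB socket: remembers endpoint strings and whether each one still
   has a live session behind it. */
struct toy_socket {
    char endpoints[MAX_ENDPOINTS][64];
    bool session_alive[MAX_ENDPOINTS];
    int count;
};

static int find_endpoint (const struct toy_socket *s, const char *ep)
{
    for (int i = 0; i < s->count; i++)
        if (strcmp (s->endpoints[i], ep) == 0)
            return i;
    return -1;
}

/* Returns true if a new session was started, false otherwise
   (endpoint already registered, or no room left in this toy table). */
bool toy_connect (struct toy_socket *s, const char *ep)
{
    if (find_endpoint (s, ep) >= 0)
        return false; /* already registered: silently ignored */
    if (s->count == MAX_ENDPOINTS || strlen (ep) >= sizeof s->endpoints[0])
        return false;
    strcpy (s->endpoints[s->count], ep);
    s->session_alive[s->count] = true;
    s->count++;
    return true;
}

/* A protocol error kills the session but leaves the registration behind. */
void toy_protocol_error (struct toy_socket *s, const char *ep)
{
    int i = find_endpoint (s, ep);
    if (i >= 0)
        s->session_alive[i] = false;
}

/* An explicit disconnect removes the registration. */
void toy_disconnect (struct toy_socket *s, const char *ep)
{
    int i = find_endpoint (s, ep);
    if (i < 0)
        return;
    s->count--;
    if (i != s->count) {
        /* Move the last entry into the freed slot. */
        strcpy (s->endpoints[i], s->endpoints[s->count]);
        s->session_alive[i] = s->session_alive[s->count];
    }
}
```

In this model, connect, protocol error, connect leaves the socket holding a
dead session, while a disconnect before the second connect starts a fresh one.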


The change below fixes my issue, but I'm not sure if it's correct for other
protocol errors. I haven't worked on the sessions/pipes before. I noticed
in gdb that the second session has a _pipe but is not fully created.

https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L487

    case i_engine::protocol_error:
        //if (_pending) {
        if (_pending || handshaked_) {  // <<< if handshaked we should also terminate pipes.
            if (_pipe)
                _pipe->terminate (false);
            if (_zap_pipe)
                _zap_pipe->terminate (false);
        } else {
            terminate ();
        }

I am happy to create a pull request to discuss whether I am on the right track.

I have test code to reproduce it.

#include "testutil.hpp"
#include "testutil_unity.hpp"
#include 
#include 

SETUP_TEARDOWN_TESTCONTEXT

char end[] = "tcp://127.0.0.1:55667";

void test_pubreq ()
{
    // SUB comes up and connects to 55667
    void *sub = test_context_socket (ZMQ_SUB);
    TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0));
    TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));

    // REQ is up incorrectly on 55667
    void *req = test_context_socket (ZMQ_REQ);
    TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (req, end));
    msleep (1000);
    TEST_ASSERT_SUCCESS_ERRNO (zmq_unbind (req, end));
    // REQ is down
    // At this point the SUB socket has a protocol_error on 55667 (so no
    // reconnect) but the socket thinks it is still connected to 55667

    msleep (1000);

    // PUB correctly comes up on 55667
    void *pub = test_context_socket (ZMQ_PUB);
    TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (pub, end));

    // NOTE: If we force a disconnect here it works.
    //TEST_ASSERT_SUCCESS_ERRNO (zmq_disconnect (sub, end));

    // Connect again: the call succeeds but is a no-op, so no data arrives
    TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));

    msleep (100);

    send_string_expect_success (pub, "Hello", 0);

    msleep (100);

    recv_string_expect_success (sub, "Hello", 0);

    msleep (100);

    test_context_socket_close (pub);
    test_context_socket_close (req);
    test_context_socket_close (sub);
}

int main (void)
{
    setup_test_environment ();

    UNITY_BEGIN ();
    RUN_TEST (test_pubreq);
    return UNITY_END ();
}

On Thu, May 20, 2021 at 4:56 PM Bill Torpey  wrote:

> Sorry — meant to get back to you sooner, but it’s been a crazy week.
>
> You don’t say what version you’re running, but there have been some
> changes in that area not that long ago — check these out and see if they
> help:
>
> https://github.com/zeromq/libzmq/pull/3831
>
> https://github.com/zeromq/libzmq/pull/3960
>
> https://github.com/zeromq/libzmq/pull/4053
>
> Good luck.
>
> Bill
>
>
> On May 20, 2021, at 10:26 AM, James Harvey 
> wrote:
>
> Hi,
>
> I will try and simplify my previous long email.
>
> If a stream gets into a protocol error state (e.g. a TCP SUB connecting to
> a REQ):
>
> Should the information (connection is terminated) be passed somehow back
> to the parent socket, so that if connect() is called again it attempts to
> connect rather than being a no-op?
>
> OR
>
> Should we add a protocol error event to the socket monitor so the calling
> process can handle it by calling disconnect/connect?
>
> Just want some clarification so I work on the correct code.
>
> Thanks
>
> James
>
> On Thu, May 13, 2021 at 4:48 PM James Harvey 
> wrote:
>
>> Hi,
>>
>> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a
>> certain endpoint with no way to track/notify.  Yes it's because a SUB
>> connects to a REQ socket but once you start to use zeromq for lots of
>> transient systems in a large company this kind of thing will happen
>> occasionally.
>>
>> The process happens like this:
>>
>>   - ZMQ_PUB binds on 1.2.3.4:4 (ephemeral)
>>   - ZMQ_SUB connects to 1.2.3.4:4 (data flows)
>>   - ZMQ_PUB goes down
>>   - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:4
>>     as its ephemeral port
>>   - ZMQ_SUB has not yet been told to disconnect, so it reconnects to the
>>     ZMQ_REQ
>>   - A protocol error happens and the connection is terminated in the
>>     session/engine
>>   - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:4
>>   - ZMQ_SUB gets a new instruction to connect()
>>   - connect() just returns as a no-op.
>>       - The socket_base thinks it still has a valid endpoint, and a SUB
>>         only connects once to each endpoint.
>>   - At this point there are no errors and no data flowing.
>>
>> My question is, should the protocol_error in the session 

Re: [zeromq-dev] zeromq protocol_error handling

2021-05-20 Thread Bill Torpey
Sorry — meant to get back to you sooner, but it’s been a crazy week.

You don’t say what version you’re running, but there have been some changes in 
that area not that long ago — check these out and see if they help:

https://github.com/zeromq/libzmq/pull/3831

https://github.com/zeromq/libzmq/pull/3960

https://github.com/zeromq/libzmq/pull/4053

Good luck.

Bill


> On May 20, 2021, at 10:26 AM, James Harvey  
> wrote:
> 
> Hi,
> 
> I will try and simplify my previous long email.
> 
> If a stream gets into a protocol error state (e.g. a TCP SUB connecting to
> a REQ):
> 
> Should the information (that the connection is terminated) be passed back to
> the parent socket somehow, so that if connect() is called again it attempts
> to connect rather than being a no-op?
> 
> OR
> 
> Should we add a protocol error event to the socket monitor so the calling
> process can handle it by calling disconnect/connect?
> 
> Just want some clarification so I work on the correct code.
> 
> Thanks
> 
> James
> 
> On Thu, May 13, 2021 at 4:48 PM James Harvey wrote:
> Hi,
> 
> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a certain 
> endpoint with no way to track/notify.  Yes it's because a SUB connects to a 
> REQ socket but once you start to use zeromq for lots of transient systems in 
> a large company this kind of thing will happen occasionally.
> 
> The process happens like this:
> 
>   - ZMQ_PUB binds on 1.2.3.4:4 (ephemeral)
>   - ZMQ_SUB connects to 1.2.3.4:4 (data flows)
>   - ZMQ_PUB goes down
>   - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:4
>     as its ephemeral port
>   - ZMQ_SUB has not yet been told to disconnect, so it reconnects to the
>     ZMQ_REQ
>   - A protocol error happens and the connection is terminated in the
>     session/engine
>   - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:4
>   - ZMQ_SUB gets a new instruction to connect()
>   - connect() just returns as a no-op.
>       - The socket_base thinks it still has a valid endpoint, and a SUB only
>         connects once to each endpoint.
>   - At this point there are no errors and no data flowing.
> 
> My question is, should the protocol_error in the session propagate up to 
> remove the endpoint from the socket?
> 
> If yes, I can look at adding that; if no, do you have any suggestions?
> 
> Thanks for your time
> 
> James
> 
> Some links to the code:
> 
> If the socket is a SUB and the endpoint is already present, don't connect:
> https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
> 
> Terminate with no reconnect on protocol_error:
> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486
> 

___
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
https://lists.zeromq.org/mailman/listinfo/zeromq-dev


Re: [zeromq-dev] zeromq protocol_error handling

2021-05-20 Thread James Harvey
Hi,

I will try and simplify my previous long email.

If a stream gets into a protocol error state (e.g. a TCP SUB connecting to a
REQ):

Should the information (that the connection is terminated) be passed back to
the parent socket somehow, so that if connect() is called again it attempts
to connect rather than being a no-op?

OR

Should we add a protocol error event to the socket monitor so the calling
process can handle it by calling disconnect/connect?

Just want some clarification so I work on the correct code.
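
To make the two options concrete, here is a rough sketch in plain C. It is a toy model under my own assumptions, not libzmq internals: the `model_*` names, the `forget_on_error` switch and the callback hook are all hypothetical and exist only to show the control flow of each option next to the current behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one socket with one endpoint; not libzmq code. */
struct model_socket;
typedef void (*model_error_cb) (struct model_socket *s);

struct model_socket {
    bool endpoint_known;              /* socket still remembers the endpoint */
    bool session_alive;               /* a working session exists */
    bool forget_on_error;             /* option 1: propagate error upwards */
    model_error_cb on_protocol_error; /* option 2: notify the application */
};

/* Returns false when the call is ignored because the endpoint is known. */
bool model_connect (struct model_socket *s)
{
    if (s->endpoint_known)
        return false;
    s->endpoint_known = true;
    s->session_alive = true;
    return true;
}

void model_disconnect (struct model_socket *s)
{
    s->endpoint_known = false;
    s->session_alive = false;
}

/* What the session does when the peer speaks the wrong protocol. */
void model_protocol_error (struct model_socket *s)
{
    s->session_alive = false; /* terminate, no reconnect */
    if (s->forget_on_error)
        s->endpoint_known = false; /* option 1 */
    if (s->on_protocol_error != NULL)
        s->on_protocol_error (s); /* option 2 */
}

/* Option 2 handler: the application decides to drop the endpoint itself. */
static void app_drop_endpoint (struct model_socket *s)
{
    model_disconnect (s);
}

/* Returns 0, or aborts in assert if the model does not behave as described. */
int model_demo (void)
{
    struct model_socket today = {0};
    assert (model_connect (&today));
    model_protocol_error (&today);
    assert (!model_connect (&today)); /* current behaviour: silent no-op */

    struct model_socket opt1 = {.forget_on_error = true};
    assert (model_connect (&opt1));
    model_protocol_error (&opt1);
    assert (model_connect (&opt1)); /* endpoint was forgotten for us */

    struct model_socket opt2 = {.on_protocol_error = app_drop_endpoint};
    assert (model_connect (&opt2));
    model_protocol_error (&opt2);
    assert (model_connect (&opt2)); /* application cleaned up on the event */
    return 0;
}
```

In this model both options end in the same state; the difference is who makes the decision. Option 1 changes what the socket does by default, while option 2 keeps the default and leaves the cleanup to the calling process.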

Thanks

James

On Thu, May 13, 2021 at 4:48 PM James Harvey 
wrote:

> Hi,
>
> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a
> certain endpoint with no way to track/notify.  Yes it's because a SUB
> connects to a REQ socket but once you start to use zeromq for lots of
> transient systems in a large company this kind of thing will happen
> occasionally.
>
> The process happens like this:
>
>   - ZMQ_PUB binds on 1.2.3.4:4 (ephemeral)
>   - ZMQ_SUB connects to 1.2.3.4:4 (data flows)
>   - ZMQ_PUB goes down
>   - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:4
>     as its ephemeral port
>   - ZMQ_SUB has not yet been told to disconnect, so it reconnects to the
>     ZMQ_REQ
>   - A protocol error happens and the connection is terminated in the
>     session/engine
>   - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:4
>   - ZMQ_SUB gets a new instruction to connect()
>   - connect() just returns as a no-op.
>       - The socket_base thinks it still has a valid endpoint, and a SUB only
>         connects once to each endpoint.
>   - At this point there are no errors and no data flowing.
>
> My question is, should the protocol_error in the session propagate up to
> remove the endpoint from the socket?
>
> If yes, I can look at adding that; if no, do you have any suggestions?
>
> Thanks for your time
>
> James
>
> Some links to the code:
>
> If the socket is a SUB and the endpoint is already present, don't connect:
> https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
>
> Terminate with no reconnect on protocol_error:
> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486
>