Hi Emmanuel, This is great news. Version 3.3 will be able to talk to 3.1 peers.
-Pieter On Tue, Mar 26, 2013 at 2:02 PM, Emmanuel TAUREL <[email protected]> wrote: > Hi Pieter, > > I have done more investigation on this problem. > It seems that the problem always happens after a subscriber shutdown. On the > publisher, I see packet containing "" arriving from the killed subscriber > and the next multi-part message sent by ZMQ is wrong. Not always but very > often. > From this remark, I was able to re-produce the problem in a small > development environment. > Upgrading the development environment to ZMQ 3.2.2 solves the problem which > is a good news. > > Another reason to move our production system to ZMQ 3.2.2. > > Thank's for your answer. > > Emmanuel > > > On 25/03/2013 10:30, Pieter Hintjens wrote: >> >> Hi Emmanuel, >> >> There were a number of issues with PUB sockets and multipart, e.g >> commit eb9bc1 from two years ago. But these changes are all in 3.1. >> >> So most likely this issue is still present in the current master. It >> may be related to the HWM, so you could try raising this on the >> production PUB side, and using a lower HWM when you try to reproduce >> the problem. >> >> -Pieter >> >> >> On Mon, Mar 25, 2013 at 9:32 AM, Emmanuel TAUREL <[email protected]> wrote: >>> >>> Hello all, >>> >>> Due to the wire incompatibility between ZMQ 3.1 and 3.2, we are still >>> using 3.1 in our production environment. >>> We are using the PUB/SUB pattern. The subscriber is a GUI given to our >>> users. There are several instances of this GUI running. There are around >>> 150 publishers running on different hosts. Every publisher process >>> publishes a heartbeat message every 9 seconds. ZMQ propagates (PUB/SUB >>> pattern) these heartbeat messages to >>> every running subscribers (GUI). These heartbeat messages are multi-part >>> messages with 3 parts. Our problem is that "from time to time", the first >>> part of the multi-part message is not sent by ZMQ to some registered >>> subscriber! >>> >>> I have recorded all the network packets sent by some of our publishers >>> using the strace command (- e trace=network -o <file>). >>> Here are some lines extracted from this strace generated file where the >>> problem is clear. >>> >>> All heartbeat messages sent by the publisher to its socket 34 around >>> 07:41. I remind you that they are sent every 9 seconds. >>> >>> 4848 07:41:00.873685 send(34, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0) = 89 >>> 4848 07:41:09.875210 send(34, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0) = 89 >>> 4848 07:41:18.873082 send(34, >>> "\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 26, 0) = 26 >>> 4848 07:41:27.875234 send(34, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0) = 89 >>> 4848 07:41:36.874466 send(34, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0) = 89 >>> 4848 07:41:45.873737 send(34, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0) = 89 >>> 4848 07:41:54.873934 send(34, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0) = 89 >>> >>> All heartbeat messages sent by the same publisher to its socket 60 >>> around 07:41 >>> 4848 07:41:00.872882 send(60, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0) = 89 >>> 4848 07:41:09.873237 send(60, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0) = 89 >>> 4848 07:41:18.873612 send(60, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0) = 89 >>> 4848 07:41:27.874226 send(60, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0 <unfinished ...> >>> 4848 07:41:36.873452 send(60, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0 <unfinished ...> >>> 4848 07:41:45.872701 send(60, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0 <unfinished ...> >>> 4848 07:41:54.872940 send(60, >>> >>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >>> 89, 0 <unfinished ...> >>> >>> As you can see, for socket 34, the packet sent at 07:41:18 is not >>> correct. The first message part (the string finishing with .heartbeat) >>> is not transmitted. >>> But for socket 60, at the same date (07:41:18), it is correctly sent. To >>> me, it seems that when ZMQ sends its messages to all the connected >>> subscribers, it >>> from time to time forget to send the first part of the message to one >>> subscriber. >>> This happens rarely. On this specific publisher, it happens only once >>> during a 64 hours recording session! >>> >>> Are you aware of this kind of problem? >>> Is there a chance that this problem will disappear when we will upgrade >>> our production system to 3.2? >>> >>> Due to the complexity of the system set-up (150 publishers running on >>> 150 different hosts) and to the rare ocuurence of the problem, we are >>> not able (yet?) to provide a simple test case which reproduces the >>> problem. >>> >>> Thank's for your answers and help >>> >>> Emmanuel >>> >>> _______________________________________________ >>> zeromq-dev mailing list >>> [email protected] >>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev > > > _______________________________________________ zeromq-dev mailing list [email protected] http://lists.zeromq.org/mailman/listinfo/zeromq-dev
