Hi Pieter, I have done more investigation on this problem. It seems that the problem always happens after a subscriber shutdown. On the publisher, I see packet containing "" arriving from the killed subscriber and the next multi-part message sent by ZMQ is wrong. Not always but very often. From this remark, I was able to re-produce the problem in a small development environment. Upgrading the development environment to ZMQ 3.2.2 solves the problem which is a good news.
Another reason to move our production system to ZMQ 3.2.2. Thank's for your answer. Emmanuel On 25/03/2013 10:30, Pieter Hintjens wrote: > Hi Emmanuel, > > There were a number of issues with PUB sockets and multipart, e.g > commit eb9bc1 from two years ago. But these changes are all in 3.1. > > So most likely this issue is still present in the current master. It > may be related to the HWM, so you could try raising this on the > production PUB side, and using a lower HWM when you try to reproduce > the problem. > > -Pieter > > > On Mon, Mar 25, 2013 at 9:32 AM, Emmanuel TAUREL <[email protected]> wrote: >> Hello all, >> >> Due to the wire incompatibility between ZMQ 3.1 and 3.2, we are still >> using 3.1 in our production environment. >> We are using the PUB/SUB pattern. The subscriber is a GUI given to our >> users. There are several instances of this GUI running. There are around >> 150 publishers running on different hosts. Every publisher process >> publishes a heartbeat message every 9 seconds. ZMQ propagates (PUB/SUB >> pattern) these heartbeat messages to >> every running subscribers (GUI). These heartbeat messages are multi-part >> messages with 3 parts. Our problem is that "from time to time", the first >> part of the multi-part message is not sent by ZMQ to some registered >> subscriber! >> >> I have recorded all the network packets sent by some of our publishers >> using the strace command (- e trace=network -o <file>). >> Here are some lines extracted from this strace generated file where the >> problem is clear. >> >> All heartbeat messages sent by the publisher to its socket 34 around >> 07:41. I remind you that they are sent every 9 seconds. >> >> 4848 07:41:00.873685 send(34, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0) = 89 >> 4848 07:41:09.875210 send(34, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0) = 89 >> 4848 07:41:18.873082 send(34, >> "\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 26, 0) = 26 >> 4848 07:41:27.875234 send(34, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0) = 89 >> 4848 07:41:36.874466 send(34, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0) = 89 >> 4848 07:41:45.873737 send(34, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0) = 89 >> 4848 07:41:54.873934 send(34, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0) = 89 >> >> All heartbeat messages sent by the same publisher to its socket 60 >> around 07:41 >> 4848 07:41:00.872882 send(60, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0) = 89 >> 4848 07:41:09.873237 send(60, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0) = 89 >> 4848 07:41:18.873612 send(60, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0) = 89 >> 4848 07:41:27.874226 send(60, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0 <unfinished ...> >> 4848 07:41:36.873452 send(60, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0 <unfinished ...> >> 4848 07:41:45.872701 send(60, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0 <unfinished ...> >> 4848 07:41:54.872940 send(60, >> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", >> 89, 0 <unfinished ...> >> >> As you can see, for socket 34, the packet sent at 07:41:18 is not >> correct. The first message part (the string finishing with .heartbeat) >> is not transmitted. >> But for socket 60, at the same date (07:41:18), it is correctly sent. To >> me, it seems that when ZMQ sends its messages to all the connected >> subscribers, it >> from time to time forget to send the first part of the message to one >> subscriber. >> This happens rarely. On this specific publisher, it happens only once >> during a 64 hours recording session! >> >> Are you aware of this kind of problem? >> Is there a chance that this problem will disappear when we will upgrade >> our production system to 3.2? >> >> Due to the complexity of the system set-up (150 publishers running on >> 150 different hosts) and to the rare ocuurence of the problem, we are >> not able (yet?) to provide a simple test case which reproduces the problem. >> >> Thank's for your answers and help >> >> Emmanuel >> >> _______________________________________________ >> zeromq-dev mailing list >> [email protected] >> http://lists.zeromq.org/mailman/listinfo/zeromq-dev _______________________________________________ zeromq-dev mailing list [email protected] http://lists.zeromq.org/mailman/listinfo/zeromq-dev
