Hello all, Due to the wire incompatibility between ZMQ 3.1 and 3.2, we are still using 3.1 in our production environment. We are using the PUB/SUB pattern. The subscriber is a GUI given to our users. There are several instances of this GUI running. There are around 150 publishers running on different hosts. Every publisher process publishes a heartbeat message every 9 seconds. ZMQ propagates (PUB/SUB pattern) these heartbeat messages to every running subscribers (GUI). These heartbeat messages are multi-part messages with 3 parts. Our problem is that "from time to time", the first part of the multi-part message is not sent by ZMQ to some registered subscriber!
I have recorded all the network packets sent by some of our publishers using the strace command (- e trace=network -o <file>). Here are some lines extracted from this strace generated file where the problem is clear. All heartbeat messages sent by the publisher to its socket 34 around 07:41. I remind you that they are sent every 9 seconds. 4848 07:41:00.873685 send(34, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0) = 89 4848 07:41:09.875210 send(34, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0) = 89 4848 07:41:18.873082 send(34, "\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 26, 0) = 26 4848 07:41:27.875234 send(34, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0) = 89 4848 07:41:36.874466 send(34, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0) = 89 4848 07:41:45.873737 send(34, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0) = 89 4848 07:41:54.873934 send(34, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0) = 89 All heartbeat messages sent by the same publisher to its socket 60 around 07:41 4848 07:41:00.872882 send(60, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0) = 89 4848 07:41:09.873237 send(60, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0) = 89 4848 07:41:18.873612 send(60, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0) = 89 4848 07:41:27.874226 send(60, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0 <unfinished ...> 4848 07:41:36.873452 send(60, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0 <unfinished ...> 4848 07:41:45.872701 send(60, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0 <unfinished ...> 4848 07:41:54.872940 send(60, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0 <unfinished ...> As you can see, for socket 34, the packet sent at 07:41:18 is not correct. The first message part (the string finishing with .heartbeat) is not transmitted. But for socket 60, at the same date (07:41:18), it is correctly sent. To me, it seems that when ZMQ sends its messages to all the connected subscribers, it from time to time forget to send the first part of the message to one subscriber. This happens rarely. On this specific publisher, it happens only once during a 64 hours recording session! Are you aware of this kind of problem? Is there a chance that this problem will disappear when we will upgrade our production system to 3.2? Due to the complexity of the system set-up (150 publishers running on 150 different hosts) and to the rare ocuurence of the problem, we are not able (yet?) to provide a simple test case which reproduces the problem. Thank's for your answers and help Emmanuel _______________________________________________ zeromq-dev mailing list [email protected] http://lists.zeromq.org/mailman/listinfo/zeromq-dev
