Hi all,

A few more questions after inspecting the ZMQ source code:

- I see that in June 2019 the following PR was merged:
  https://github.com/zeromq/libzmq/pull/3555
  It exposes ZMQ_OUT_BATCH_SIZE. At first glance this seems to be exactly what
  I was looking for, but the default value is already quite high (8192)... in
  my use case coalescing at most 5 or 6 messages would already be enough to
  reach the MTU size. (A sketch of how I understand this option would be set
  is just below this list.)

- The thread publishing on my PUB zmq socket takes roughly 100-500usec to
  generate a new message, so in the worst case generating 5 messages might
  take 2.5msec. I would be OK with paying that latency to improve
  throughput... is there any way to achieve that? What happens if I disable
  the code in ZMQ that sets TCP_NODELAY and replace it with TCP_CORK? Do you
  think that could somehow break my PUB/SUB connections?
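For reference, this is how I understand the option from that PR would be set.
This is only a sketch based on my reading of the PR, with a few assumptions on
my side: that ZMQ_OUT_BATCH_SIZE is a draft-API context option (so both libzmq
and the application need to be built with draft APIs enabled), that its unit
is bytes like the previously hard-coded batch sizes, and that it should be set
before any socket is created. The 16 KiB value is just an arbitrary example.

    #define ZMQ_BUILD_DRAFT_API /* assumption: the option is still draft API */
    #include <zmq.h>
    #include <assert.h>

    int main (void)
    {
        void *ctx = zmq_ctx_new ();
        assert (ctx);

        /* Raise the output batch from the default 8192 bytes to 16 KiB.
           Example value only: the default is already well above one
           1500-byte MTU, so this alone may not change what hits the wire. */
        int rc = zmq_ctx_set (ctx, ZMQ_OUT_BATCH_SIZE, 16 * 1024);
        assert (rc == 0);

        /* Create the PUB socket only after the context option is set. */
        void *pub = zmq_socket (ctx, ZMQ_PUB);
        assert (pub);
        rc = zmq_bind (pub, "tcp://*:5556");
        assert (rc == 0);

        /* ... zmq_send () publishing loop as usual ... */

        zmq_close (pub);
        zmq_ctx_term (ctx);
        return 0;
    }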
And one more consideration:

- I discovered why my tcpdump capture contains larger-than-MTU packets (even
  though they are <1% of the total): capturing traffic on the same server that
  is sending/receiving that traffic is not a good idea, see:
  https://blog.packet-foo.com/2014/05/the-drawbacks-of-local-packet-captures/
  https://packetbomb.com/how-can-the-packet-size-be-greater-than-the-mtu/
  I will try to acquire tcpdumps from the SPAN port of a managed switch,
  although I don't think the results will change much.

Thanks for any hint,
Francesco


On Sat, 27 Mar 2021 at 10:22, Francesco <francesco.monto...@gmail.com> wrote:

> Hi Jim,
> You're right, and I do plan to change the MTU to 9000 for sure.
> However, even now with the MTU at 1500, I see that most packets are very
> far from that limit.
> Attached is a screenshot of the capture:
>
> [image: tcp_capture.png]
>
> Looking at the timestamps, the packets of size 583B and 376B are spaced
> only about 100us apart, and the packets of 376B and 366B about 400us apart.
> In this case I'd be more than happy to pay some extra latency and merge
> all three of these packets together.
>
> After some more digging I found this code in ZMQ:
>
>     // Disable Nagle's algorithm. We are doing data batching on 0MQ level,
>     // so using Nagle wouldn't improve throughput in anyway, but it would
>     // hurt latency.
>     int nodelay = 1;
>     const int rc =
>       setsockopt (s_, IPPROTO_TCP, TCP_NODELAY,
>                   reinterpret_cast<char *> (&nodelay), sizeof (int));
>     assert_success_or_recoverable (s_, rc);
>     if (rc != 0)
>         return rc;
>
> Now my next question is: where does this "data batching on 0MQ level"
> happen? Can I tune it somehow? Can I restore Nagle's algorithm?
> I also saw from
> https://man7.org/linux/man-pages/man7/tcp.7.html
> that there is a TCP_CORK socket option meant to optimize throughput...
> is there any way to set that through ZMQ?
>
> Thanks!!
>
> Francesco
>
>
> On Sat, 27 Mar 2021 at 05:01, Jim Melton <jim@melton.space> wrote:
>
>> Small TCP packets will never achieve maximum throughput. This is
>> independent of ZMQ. Each TCP packet requires a synchronous round-trip.
>>
>> For a 20 Gbps network, you need a larger MTU to achieve close to
>> theoretical bandwidth, and each packet needs to be close to MTU. Jumbo MTU
>> is typically 9000 bytes. The TCP ACK packets will kill your throughput,
>> though.
>> --
>> Jim Melton
>> (303) 829-0447
>> http://blogs.melton.space/pharisee/
>> jim@melton.space
>>
>>
>> On Mar 26, 2021, at 4:17 PM, Francesco <francesco.monto...@gmail.com>
>> wrote:
>>
>> Hi all,
>>
>> I'm using ZMQ in a product that moves a lot of data, using TCP as
>> transport and PUB-SUB as communication pattern. "A lot" here means around
>> 1Gbps. The software is a mono-directional chain of small components, each
>> linked to the previous one with a SUB socket (to receive data) and a PUB
>> socket (to send data to the next stage).
>> I'm debugging an issue with one of these components, which receives
>> 1.1Gbps on its SUB socket and sends out 1.1Gbps on its PUB socket (no
>> wonder the two numbers match, since the component does no aggregation
>> whatsoever).
>>
>> The "problem" is that we are currently using 16 ZMQ background threads to
>> move a total of 2.2Gbps through that software component (note that the
>> physical links can carry up to 20Gbps, so we're far from saturating the
>> link). IIRC the "golden rule" for sizing the number of ZMQ background
>> threads is 1Gbps = 1 thread.
>> As you can see we're very far from this golden rule, and that's what I'm
>> trying to debug.
>>
>> The ZMQ background threads have a CPU usage ranging from 80% to 98%.
>> Using "strace" I see that most of the time in these threads is spent in
>> the "sendto" syscall.
>> So I started digging into the quality of the TX side of the TCP
>> connection, recording a short trace of the traffic going out of the
>> software component.
>>
>> Analyzing the traffic with Wireshark, it turns out that the TCP packets
>> on the PUB connection are pretty small:
>> * 50% of them are 66B long; these are the (incoming) TCP ACK packets
>> * 21% are in the range 160B-320B
>> * 18% are in the range 320B-640B
>> * 5% are in the range 640B-1280B
>> * just 3% reach the 1500B MTU
>> * [a <1% fraction even exceeds the 1500B MTU of the link, which I'm not
>> sure how is possible]
>>
>> My belief is that having fewer packets, all close to the MTU of the link,
>> should greatly improve performance. Would you agree with that?
>> Is there any configuration I can apply on the PUB socket to force the
>> Linux TCP stack to generate fewer but larger TCP segments on the wire?
>>
>> Thanks for any hint,
>>
>> Francesco
>
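P.S.: to make the TCP_CORK idea from my first question above a bit more
concrete, this is roughly what I have in mind at the plain-socket level. It is
only a sketch, Linux-specific and completely outside ZMQ: the hypothetical
send_batch_corked() helper below assumes an already-connected TCP fd, whereas
inside libzmq the equivalent change would have to live next to the TCP_NODELAY
code quoted earlier in this thread.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Hypothetical helper: 'fd' is an already-connected TCP socket. While the
       socket is corked, the kernel holds partial frames (up to a ~200 ms
       ceiling, per tcp(7)), so several small writes can be coalesced into
       MTU-sized segments. */
    static void send_batch_corked (int fd, const char *const *msgs, size_t n)
    {
        int on = 1, off = 0;

        /* Cork the socket: the small writes below stay in the kernel buffer. */
        setsockopt (fd, IPPROTO_TCP, TCP_CORK, &on, sizeof (on));

        for (size_t i = 0; i < n; ++i)
            send (fd, msgs[i], strlen (msgs[i]), 0);

        /* Uncork: whatever is still buffered is flushed now, so the added
           latency is bounded by how long this batch took to produce. */
        setsockopt (fd, IPPROTO_TCP, TCP_CORK, &off, sizeof (off));
    }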
_______________________________________________ zeromq-dev mailing list zeromq-dev@lists.zeromq.org https://lists.zeromq.org/mailman/listinfo/zeromq-dev