Re: [zeromq-dev] Performance results on 100Gbps direct link
On Thu, 2019-10-24 at 16:55 -0400, Brett Viren via zeromq-dev wrote:
> Hi again,
>
> Doron Somech <somdo...@gmail.com> writes:
>
> > You need to create multiple connections to enjoy the multiple io
> > threads.
> >
> > So in the remote/local_thr connect to the same endpoint 100 times
> > and create 10 io threads.
> >
> > You don't need to create multiple sockets, just call connect
> > multiple times with same address.
>
> I keep working on evaluating ZeroMQ against this 100 Gbps network when I
> get a spare moment. You can see some initial results in the attached
> PNG. As-is, things look pretty decent but there are two effects I see
> which I don't fully understand and I think are important to achieving
> something closer to saturation.
>
> 1) As you can see in the PNG, as the message size increases beyond 10 kB
> the 10 I/O threads become less and less active. This activity seems
> correlated with throughput. But why the die-off as we go to higher
> message sizes, and why the resurgence at ~1 MB? Might there be some
> additional tricks to lift the throughput? Are such large messages
> simply not reasonable?
>
> 2) I've instrumented my tester to include a sequence count in each
> message and it uncovers that this multi-thread/multi-connect trick may
> lead to messages arriving at the receiver out of order. Given PUSH is
> round-robin and PULL is fair-queued, I naively didn't expect this. But
> seeing it, I have two guesses: 1) I don't actually know what
> "fair-queued" really means :) and 2) if a mute state is getting hit then
> maybe all bets are off. I do wonder if adding "credit based" transfers
> might solve this ordering. E.g., if N credits (or fewer) are used given N
> connects, might the round-robin/fair-queue ordering stay in lock step?
>
> Any ideas are welcome. Thanks!
>
> -Brett.

The zero-copy receiver uses an 8 KB buffer by default. If a received message is larger, a new buffer is allocated for each message. That's probably why you are seeing a drop just around that size.

A socket option, ZMQ_IN_BATCH_SIZE, has recently been added (TO BE USED WITH CARE) to change that default - maybe try experimenting with it and see if this assumption holds true.

--
Kind regards,
Luca Boccassi

___
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
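The allocation pattern Luca describes can be sketched outside of libzmq. The following is a minimal Python illustration - not libzmq code, and `recv_batched` plus its message sizes are hypothetical - of reusing one batch buffer for small messages versus allocating a fresh buffer per message once the batch size is exceeded:

```python
IN_BATCH_SIZE = 8192  # libzmq's default in-batch buffer size, per this thread

def recv_batched(messages, batch_size=IN_BATCH_SIZE):
    """Copy small messages through one reused buffer; larger messages
    force a fresh allocation each time (the cost Luca points at)."""
    batch_buf = bytearray(batch_size)   # allocated once, reused for every read
    received = []
    for msg in messages:
        if len(msg) <= batch_size:
            batch_buf[: len(msg)] = msg           # reuse path: no new buffer
            received.append(bytes(batch_buf[: len(msg)]))
        else:
            received.append(bytes(msg))           # per-message allocation path
    return received

small = [bytes([i % 256]) * 1024 for i in range(4)]   # fit in the batch buffer
large = [bytes([i % 256]) * 65536 for i in range(4)]  # exceed it every time
assert recv_batched(small) == small
assert recv_batched(large) == large
```

The sketch only mimics the allocation pattern, not the timing; the real cost in libzmq would come from allocator pressure on the hot receive path.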
Re: [zeromq-dev] Performance results on 100Gbps direct link
To explain point two: you can't easily impose an order on messages across multiple connections, even to the same peer. It's a fundamental limit: the only reason a single ZMQ connection can provide in-order delivery is that it leans on TCP to correct the duplicated delivery, out-of-order delivery, random bit flips, lost segments, and other chaos that goes on at the IP level. If all your application needs to worry about is getting many messages to the other end as fast as possible, then by all means open multiple connections - and similarly if you have urgent messages that need to be processed asynchronously and dodge head-of-line blocking in the high-volume channel. But if you need your messages processed in some global order, you're better off using a single connection rather than reinventing half of TCP all over again.

You also can't use credit-based flow control here. PUSH and PULL are unidirectional: you can't send credit from the PULL socket to the PUSH. Replace PUSH with DEALER and PULL with ROUTER and you can, but even then, credit-based flow control is about limiting the number of messages/bytes/other units of work currently in flight in the system. If your goal is solely to saturate a link, it's actually the opposite of what you want.

Now. My experience is at the extreme other end of the spectrum from your use case, but if I were inclined to optimize for Maximum Fast, two lines of investigation occur to me at this point.

First, what is happening at the network level once throughput hits that plateau? I'd take a pcap and open it in Wireshark after the test is over; look at the times payload-carrying packets go out and when the corresponding ACK packets come back. If they bunch up in any way, some tuning of TCP options at the socket-option or possibly kernel-sysctl level may be called for.

Secondly, where is that CPU time actually being spent? Intuitively I expect this to take more effort to bear fruit, which is why I'd save it for after poking at the network, but I'd make a flamegraph[1] (substitute your favorite profiler here) and look for hotspots I might be able to optimize. Past the threshold where CPU usage decreases, things get harder, since that looks like time waiting on locks or hardware starting to dominate. There are ways to interrogate waits into and past the kernel, but I've never had to do it, so I can't tell you how painful it might be.

Good luck, friend.

[1]: https://github.com/brendangregg/FlameGraph

On Thu, Oct 24, 2019, 9:38 PM Brett Viren via zeromq-dev <zeromq-dev@lists.zeromq.org> wrote:
> [...]
>
> 1) As you can see in the PNG, as the message size increases beyond 10 kB
> the 10 I/O threads become less and less active. [...]
>
> 2) I've instrumented my tester to include a sequence count in each
> message and it uncovers that this multi-thread/multi-connect trick may
> lead to messages arriving to the receiver out-of-order. Given PUSH is
> round-robin and PULL is fair queued, I naively didn't expect this. [...]
>
> -Brett.
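The reordering described here is easy to reproduce in the abstract. A minimal Python simulation - the latencies are hypothetical and no ZeroMQ is involved - round-robins sequence-numbered messages over two connections with different delays, then fair-queues them by arrival time at the receiver:

```python
def simulate(n_msgs, latencies, send_gap=0.1):
    """Round-robin sequence numbers over connections with fixed per-connection
    latency; return the sequence numbers in arrival-time order."""
    arrivals = []
    for seq in range(n_msgs):
        conn = seq % len(latencies)                # PUSH round-robins connections
        arrive = seq * send_gap + latencies[conn]  # hypothetical one-way delay
        arrivals.append((arrive, seq))
    return [seq for _, seq in sorted(arrivals)]

# Two connections, one momentarily slower (e.g. a deeper TCP send queue).
order = simulate(6, latencies=[0.05, 1.0])
assert order != list(range(6))          # the receiver observes reordering
assert sorted(order) == list(range(6))  # nothing is lost, only reordered
print(order)  # → [0, 2, 4, 1, 3, 5]
```

Each connection delivers its own messages in order (TCP guarantees that much), but nothing synchronizes the connections against each other, which is exactly the limit described above.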
Re: [zeromq-dev] Performance results on 100Gbps direct link
Doron Somech writes:
> You need to create multiple connections to enjoy the multiple io threads.
>
> So in the remote/local_thr connect to the same endpoint 100 times and
> create 10 io threads.

Fantastic!

10 threads in both remote_thr and local_thr, 10 connect() calls in remote_thr:

[bviren@dune dev]$ ./libzmq/perf/.libs/local_thr tcp://10.0.1.117:5200 131072 100 0 10
using 10 I/O threads
message size: 131072 [B]
message count: 100
mean throughput: 63112 [msg/s]
mean throughput: 66178.185 [Mb/s]

10 threads in both remote_thr and local_thr, 100 connect() calls in remote_thr:

[bviren@dune dev]$ ./libzmq/perf/.libs/local_thr tcp://10.0.1.117:5200 131072 100 0 10
using 10 I/O threads
message size: 131072 [B]
message count: 100
mean throughput: 91633 [msg/s]
mean throughput: 96084.376 [Mb/s]

In the second case, the 10 threads in local_thr are using between 50-100% CPU. remote_thr is much less active, with about half the threads around 50% and half around 10%.

Now to see what this does across the spectrum of message sizes!

Thanks!
-Brett.
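As a sanity check, the two throughput lines in each run are consistent with each other: Mb/s is just msg/s times the message size in bits, divided by 10^6. A quick Python check against the figures reported above:

```python
def mbps(msgs_per_sec, msg_size_bytes):
    """Convert a message rate to link throughput in megabits per second."""
    return msgs_per_sec * msg_size_bytes * 8 / 1e6

# 131072-byte messages, rates as reported by local_thr above.
assert round(mbps(63112, 131072)) == 66178   # reported: 66178.185 Mb/s
assert round(mbps(91633, 131072)) == 96084   # reported: 96084.376 Mb/s
```

The small fractional differences come from local_thr rounding the printed msg/s figure; note this counts payload bits only, so 96 Gb/s of payload is effectively at the 100 Gbps link's limit once TCP/IP and Ethernet overhead are added.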
Re: [zeromq-dev] Performance results on 100Gbps direct link
Make a pull request :)

On Wed, Oct 2, 2019, 20:27 Francesco wrote:
> I will try connecting multiple times... At this point I suggest modifying
> the benchmark utility to just do this trick and update the performance
> graphs in the wiki with new results!
> [...]
Re: [zeromq-dev] Performance results on 100Gbps direct link
On Wed, 2 Oct 2019 at 19:05, Doron Somech wrote:
> You don't need to create multiple sockets, just call connect multiple
> times with same address.

Wow, really?? I wish I had known that - I already changed quite a bit of code to use multiple zmq sockets to make better use of the background zmq threads!

I will try connecting multiple times... At this point I suggest modifying the benchmark utility to just do this trick and updating the performance graphs in the wiki with new results!

Francesco

On Wed, Oct 2, 2019, 19:45 Brett Viren via zeromq-dev <zeromq-dev@lists.zeromq.org> wrote:
> [...]
Re: [zeromq-dev] Performance results on 100Gbps direct link
You need to create multiple connections to enjoy the multiple io threads.

So in the remote/local_thr, connect to the same endpoint 100 times and create 10 io threads.

You don't need to create multiple sockets, just call connect multiple times with the same address.

On Wed, Oct 2, 2019, 19:45 Brett Viren via zeromq-dev <zeromq-dev@lists.zeromq.org> wrote:
> [...]
>
> The zguide suggests to use one I/O thread per GByte/s (faq says "Gbps")
> so I tried the naive thing and hacked the ZMQ remote_thr.cpp and
> local_thr.cpp so each use ten I/O threads. While I see all ten threads
> in "top -H", still only one thread uses any CPU and it remains pegged at
> 100% on the receiver (local_thr) and about 50% on the sender
> (remote_thr).
> [...]
>
> Any suggestions on how to let ZeroMQ get higher throughput at 100 Gbps?
> If so, I'll give them a try.
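The reason raising ZMQ_IO_THREADS alone changes nothing can be sketched with a simplified model (this is an illustration, not libzmq source): each connect() creates a session/engine that the context places on one I/O thread - roughly, whichever eligible thread is least loaded - and an engine never migrates, so a single connection can only ever keep one thread busy:

```python
def assign_connections(n_connects, n_io_threads):
    """Toy model of placing one engine per connect() on the least-loaded
    I/O thread; returns the per-thread engine counts."""
    load = [0] * n_io_threads
    for _ in range(n_connects):
        t = load.index(min(load))  # pick the least-loaded I/O thread
        load[t] += 1               # the engine stays on that thread for good
    return load

# One connection can only ever occupy one of the 10 I/O threads...
assert assign_connections(1, 10) == [1] + [0] * 9
# ...while 100 connects (Doron's suggestion) spread work across all 10.
assert assign_connections(100, 10) == [10] * 10
```

This matches the observation quoted above: ten I/O threads were visible in "top -H", but with a single connection only one of them ever accumulated CPU time.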
Re: [zeromq-dev] Performance results on 100Gbps direct link
Hi Francesco,

I confirm your benchmark using two systems with the same 100 Gbps Mellanox NICs but with an intervening Juniper QFX5200 switch (100 Gbps ports).

To reach ~25 Gbps with the largest message sizes required "jumbo frame" MTU. The default MTU=1500 allows only ~20 Gbps. I also tried two more doublings of zmsg size in the benchmark and these produce no significant increase in throughput. OTOH, pinning the receiver (local_thr) to a CPU gets it up to 33 Gbps.

I note that iperf3 can achieve almost 40 Gbps (20 Gbps with MTU=1500). Multiple simultaneous iperf3 tests can, in aggregate, use 90-100 Gbps.

In both the ZMQ and singular iperf3 tests, it seems that CPU is the bottleneck. For ZeroMQ the receiver's I/O thread is pegged at 100%. With iperf3 it's that of the client/sender. The other ends in both cases are at about 50%.

The zguide suggests using one I/O thread per GByte/s (the FAQ says "Gbps"), so I tried the naive thing and hacked the ZMQ remote_thr.cpp and local_thr.cpp so each uses ten I/O threads. While I see all ten threads in "top -H", still only one thread uses any CPU, and it remains pegged at 100% on the receiver (local_thr) and about 50% on the sender (remote_thr). I think now that I misinterpreted this advice and it's really relevant to the case of handling a very large number of connections.

Any suggestions on how to let ZeroMQ get higher throughput at 100 Gbps? If so, I'll give them a try.

Cheers,
-Brett.

Francesco writes:
> I placed here:
> http://zeromq.org/results:100gbe-tests-v432
> the results I collected using 2 Mellanox ConnectX-5 linked by 100Gbps
> fiber cable.
> [...]
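The jumbo-frame effect above is easy to quantify: the number of TCP segments per message, and therefore the per-packet work the pegged I/O thread's kernel side must do, drops sharply with a larger MTU. A rough Python estimate, assuming 52 bytes of IP+TCP header overhead (typical with TCP timestamps enabled) and ignoring TSO/GRO, which are assumptions for this back-of-the-envelope:

```python
import math

def segments_per_message(msg_bytes, mtu, header_overhead=52):
    """TCP segments needed to carry one message; header_overhead is
    IP + TCP headers (52 bytes with timestamps - an assumption here)."""
    mss = mtu - header_overhead     # payload bytes per segment
    return math.ceil(msg_bytes / mss)

# 131072-byte messages, as used in the perf runs discussed in this thread:
assert segments_per_message(131072, 1500) == 91   # standard frames
assert segments_per_message(131072, 9000) == 15   # jumbo frames
```

Roughly six times fewer packets per message is consistent with jumbo frames lifting the observed ceiling from ~20 to ~25 Gbps while the receiver remains CPU-bound.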
Re: [zeromq-dev] Performance results on 100Gbps direct link
Hi Benjamin,

On Sun, 11 Aug 2019 at 13:05, Benjamin Henrion wrote:
> Do you have a test which can show first that you can saturate the 100gbps
> link?
>
> Like an iperf or a simple wget test?

I don't have a graph just like the one I created with the ZMQ benchmark utilities, but using a custom DPDK app creating packets of size 256-512 B I've been able to saturate the 100 Gbps link using the 16 CPU cores of the first CPU, with hyperthreading off (the server has 2 CPUs/NUMA nodes, each with 16 physical cores that become 32 with HT, for a total of 64).

This is not that far from the official Mellanox reports using DPDK: https://fast.dpdk.org/doc/perf/DPDK_19_05_Mellanox_NIC_performance_report.pdf (see chapter 5), where they declare they can reach the line rate of a 100 Gbps link using 12 cores (well, they're using Xeon Platinum CPUs while my server had Gold CPUs, and moreover the DPDK application they used, l3fwd, is very simple indeed).

And yes: the line rate for 64 B packets at 100 Gbps is an astonishing 148 million packets per second. Of course I don't think we can ever reach that performance using the regular Linux kernel TCP stack (btw, that DPDK example sends frames over IPv4, but I don't know whether over IPv4 it uses UDP, TCP or something else). However, ZMQ currently does about 1 Mpps @ 64 B, which is pretty far from that 148 Mpps... I wonder if that could be improved :)

As you point out, it would be interesting to run a test using iperf or wget, which do use the Linux kernel TCP stack... when I have access to that setup again I can try.

Francesco
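The 148 Mpps figure follows from Ethernet framing overhead: besides its 64 bytes, every frame occupies 8 bytes of preamble and a 12-byte inter-frame gap on the wire. A quick Python check:

```python
def line_rate_pps(link_bps, frame_bytes, preamble=8, ifg=12):
    """Maximum Ethernet frames per second: each frame occupies
    frame + preamble + inter-frame-gap bytes of wire time."""
    wire_bits = (frame_bytes + preamble + ifg) * 8
    return link_bps / wire_bits

# Minimum-size (64 B) frames on a 100 Gbps link:
mpps = line_rate_pps(100e9, 64) / 1e6
assert round(mpps, 2) == 148.81   # the ~148 Mpps quoted above
```

This is the same 14.88 Mpps figure familiar from 10 GbE line-rate tests, scaled by ten.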
Re: [zeromq-dev] Performance results on 100Gbps direct link
On 9 Aug 2019 23:36, "Francesco" wrote:
> Hi all,
>
> I placed here:
> http://zeromq.org/results:100gbe-tests-v432
> the results I collected using 2 Mellanox ConnectX-5 linked by 100Gbps
> fiber cable.
> [...]

Do you have a test which can show first that you can saturate the 100gbps link?

Like an iperf or a simple wget test?

Best,
Re: [zeromq-dev] Performance results on 100Gbps direct link
Great job, thank you!

On Fri, 2019-08-09 at 23:34 +0200, Francesco wrote:
> I placed here:
> http://zeromq.org/results:100gbe-tests-v432
> the results I collected using 2 Mellanox ConnectX-5 linked by 100Gbps
> fiber cable.
> [...]

--
Kind regards,
Luca Boccassi
[zeromq-dev] Performance results on 100Gbps direct link
Hi all,

I placed here:
http://zeromq.org/results:100gbe-tests-v432
the results I collected using 2 Mellanox ConnectX-5 NICs linked by a 100 Gbps fiber cable.

The results are not too different from those at 10 Gbps (http://zeromq.org/results:10gbe-tests-v432)... the differences in TCP throughput are that:
- even using 100 kB-long messages we still cannot saturate the link
- latency is much improved for messages > 10 kB long

Hopefully we will be able to improve performance in the future and improve these benchmarks...

Francesco