Hi Florin,

Thanks once again for looking at this issue. Please see inline:
On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Vijay,
>
> Inline.
>
>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>> Hi Florin,
>>
>> Thanks for the response. Please see inline:
>>
>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>
>>> Hi Vijay,
>>>
>>> Cool experiment. More inline.
>>>
>>>> On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am using iperf3 as a client on an Ubuntu 18.04 Linux machine connected to another server running VPP using 100G NICs. Both servers are Intel Xeon with 24 cores.
>>>
>>> May I ask the frequency for those cores? Also what type of nic are you using?
>>
>> 2700 MHz.
>
> Probably this somewhat limits throughput per single connection compared to my testbed, where the Intel CPU boosts to 4 GHz.

Please see below; I noticed an anomaly.

>> The nic is a Pensando DSC100.
>
> Okay, not sure what to expect there. Since this mostly stresses the rx side, what’s the number of rx descriptors? Typically I test with 256; with more connections and higher throughput you might need more.

This is the default - comments seem to suggest that is 1024. I don't see any rx queue empty errors on the nic, which probably means there are sufficient buffers.

>>>> I am trying to push 100G traffic from the iperf Linux TCP client by starting 10 parallel iperf connections on different port numbers and pinning them to different cores on the sender side.
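For reference, the rx descriptor count can also be set explicitly per device in VPP's startup.conf instead of relying on the driver default. A minimal sketch (the PCI address and queue count are taken from this thread; the num-rx-desc value is purely illustrative):

```
dpdk {
  dev 0000:15:00.0 {
    name eth0
    num-rx-queues 10
    num-rx-desc 2048    # illustrative; the driver default here is 1024
  }
}
```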
On the VPP receiver side I have 10 worker threads and 10 rx-queues in dpdk, and I am running iperf3 using the VCL library as follows:
>>>>
>>>> taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>> taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>> taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>> ...
>>>>
>>>> MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
>>>>
>>>> taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>> taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>> ...
>>>
>>> Could you try first with only 1 iperf server/client pair, just to see where performance is with that?
>>
>> These are the numbers I get:
>>
>> rx-fifo-size 65536: ~8G
>> rx-fifo-size 524288: 22G
>> rx-fifo-size 4000000: 25G
>
> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector rate (and the errors) in this case also?

I noticed that adding "enable-tcp-udp-checksum" back seems to improve performance. Not sure if this is an issue with the dpdk driver for the nic.
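The fifo sizes in the sweep above are VCL settings. A sketch of what the corresponding vcl.conf might look like, assuming the 4MB "sweet spot" value (the keys are standard VCL options; the exact values are from this thread or illustrative):

```
# /etc/vpp/vcl.conf (sketch)
vcl {
  rx-fifo-size 4000000    # per-session receive fifo; the 25G data point above
  tx-fifo-size 4000000
  app-scope-global
  api-socket-name /tmp/vpp-api.sock
}
```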
Anyway, in the "show hardware" flags I see now that tcp and udp checksum offloads are enabled:

root@server:~# vppctl show hardware
              Name                Idx   Link  Hardware
eth0                               1     up   dsc1
  Link speed: 100 Gbps
  Ethernet address 00:ae:cd:03:79:51
  ### UNKNOWN ###
    carrier up full duplex mtu 9000
    flags: admin-up pmd maybe-multiseg rx-ip4-cksum
    Devargs:
    rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
    tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
    pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
    max rx packet len: 9208
    promiscuous: unicast off all-multicast on
    vlan offload: strip off filter off qinq off
    rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter
                       jumbo-frame scatter
    rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
    tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso
                       outer-ipv4-cksum multi-segs mbuf-fast-free outer-udp-cksum
    tx offload active: multi-segs
    rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
    rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
    tx burst function: ionic_xmit_pkts
    rx burst function: ionic_recv_pkts

With this I get better performance per iperf3 connection - about 30.5G. Show run output attached (1connection.txt).

>> rx-fifo-size 8000000: 25G
>
>>>> I see that I am not able to push beyond 50-60G. I tried different sizes for the vcl rx-fifo-size - 64K, 256K and 1M. With 1M fifo size, I see that tcp latency as reported on the client increases, but not a significant improvement in bandwidth. Are there any suggestions to achieve 100G bandwidth? I am using a vpp build from master.
>>>
>>> Depends a lot on how many connections you’re running in parallel. With only one connection, buffer occupancy might go up, so 1-2MB might be better.
>>
>> With the current run I increased this to 8000000.
>
>>> Could you also check how busy vpp is with “clear run”, wait at least 1 second, and then “show run”. That will give you per node/worker vector rates. If they go above 100 vectors/dispatch the workers are busy, so you could increase their number and implicitly the number of sessions. Note however that RSS is not perfect, so you can get more connections on one worker.
>>
>> I am attaching the output of this to the email (10 iperf connections, 4 worker threads).
>
> It’s clearly saturated. Could you also do a “clear error”/“show error” and “clear tcp stats”/“show tcp stats”?
>
> Because this is purely a server/receiver scenario for vpp, and because tcp4-established seems to need a lot of clocks, make sure that iperf runs on the same numa that vpp’s workers and the nic run on. To see the nic’s numa, “show hardware”.
>
> For instance, in my testbed at ~37.5Gbps and 1 connection, tcp4-established needs around 7e2 clocks. In your case it goes as high as 1.2e4, so it doesn’t look like it’s only frequency related.

I now repeated this test with all cores and nic on numa 0. Cores 1-4 are used by VPP and 5-11 by iperf. I get about 63G. I am attaching the vpp statistics for this case (7connection.txt). Looks like in this case nothing is hashing to core 4.

>>>> Pasting below the output of vpp and vcl conf files:
>>>>
>>>> cpu {
>>>>   main-core 0
>>>>   workers 10
>>>
>>> You can pin vpp’s workers to cores with corelist-workers c1,c3-cN to avoid overlap with iperf. You might want to start with 1 worker and work your way up from there. In my testing, 1 worker should be enough to saturate a 40Gbps nic with 1 iperf connection. Maybe you need a couple more to reach 100, but I wouldn’t expect more.
>>
>> I changed this to 4 cores and pinned them as you suggested.
>
> See above wrt how vpp’s workers, iperf and the nic should all be on the same numa. Make sure iperf and vpp’s workers don’t overlap.

Done.

>>>> }
>>>>
>>>> buffers {
>>>>   buffers-per-numa 65536
>>>
>>> Unless you need the buffers for something else, 16k might be enough.
>>>> default data-size 9216
>>>
>>> Hm, no idea about the impact of this on performance. Session layer can build chained buffers, so you can also try with this removed to see if it changes anything.
>>
>> For now, I kept this setting.
>
> If possible, try with 1460 mtu and 2kB buffers, to see if that changes anything.

Sure, I will try this. I am hitting some issues with the link not coming up when I reduce the buffer data-size. It could be a driver issue.

>>>> }
>>>>
>>>> dpdk {
>>>>   dev 0000:15:00.0 {
>>>>     name eth0
>>>>     num-rx-queues 10
>>>
>>> Keep this in sync with the number of workers.
>>
>>>>   }
>>>>   enable-tcp-udp-checksum
>>>
>>> This enables sw checksum. For better performance, you’ll have to remove it. It will be needed however if you want to turn tso on.
>>
>> ok. removed.
>
>>>> }
>>>>
>>>> session {
>>>>   evt_qs_memfd_seg
>>>> }
>>>> socksvr { socket-name /tmp/vpp-api.sock }
>>>>
>>>> tcp {
>>>>   mtu 9216
>>>>   max-rx-fifo 262144
>>>
>>> This is only used to compute the window scale factor. Given that your fifos might be larger, I would remove it. By default the value is 32MB and gives a wnd_scale of 10 (should be okay).
>>
>> When I was testing with the Linux TCP stack on both sides, I was restricting the receive window per socket using net.ipv4.tcp_rmem to get better latency numbers. I want to mimic that with VPP. What is the right way to restrict the rcv_wnd on VPP?
>
> The rcv_wnd is controlled by the rx fifo size. This value will limit the wnd_scale, and the actual fifo size, if larger than 256kB, won’t be correctly advertised. So it would be better to remove this and only control it from the rx fifo.

Sure, so I assume rx-fifo-size in vcl.conf is a per-socket fifo size?

Thanks,
Vijay
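On the max-rx-fifo / window-scale point above, here is a small illustration (my own sketch of the mechanism, not VPP's code) of how a window scale factor can be derived from the largest window one wants to advertise: the scale is the smallest shift that lets the 16-bit TCP window field cover the fifo size. It reproduces the "32MB gives a wnd_scale of 10" figure quoted above:

```python
def wnd_scale(max_fifo_bytes: int) -> int:
    """Smallest TCP window scale (RFC 7323) whose shifted 16-bit
    window can cover max_fifo_bytes; capped at 14 per the RFC."""
    scale = 0
    while (0xFFFF << scale) < max_fifo_bytes and scale < 14:
        scale += 1
    return scale

print(wnd_scale(32 * 1024 * 1024))   # 32MB default -> 10, as quoted above
print(wnd_scale(262144))             # the max-rx-fifo 262144 from the config -> 3
```

This also shows why leaving max-rx-fifo at 262144 while configuring multi-megabyte fifos is a problem: a scale of 3 cannot advertise a 4MB window.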
Attachment: 1connection.txt

root@server:~# vppctl show run
Thread 0 vpp_main (lcore 0)
Time 3.3, 10 sec internal node vector rate 0.00 loops/sec 1123145.49
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
         Name                 State      Calls     Vectors  Suspends  Clocks  Vectors/Call
cnat-scanner-process        any wait         0           0         4  2.03e3      0.00
dpdk-process                any wait         0           0         1  1.21e4      0.00
fib-walk                    any wait         0           0         2  2.98e3      0.00
ikev2-manager-process       any wait         0           0         4  2.10e3      0.00
ip6-mld-process             any wait         0           0         4  1.07e3      0.00
ip6-ra-process              any wait         0           0         4  8.94e2      0.00
session-queue-main          polling     576464           0         0  1.06e2      0.00
session-queue-process       any wait         0           0         3  1.37e3      0.00
unix-cli-local:3            active           1           0         2  3.29e8      0.00
unix-cli-new-session        any wait         0           0         3  1.79e3      0.00
unix-epoll-input            polling     576464           0         0  1.21e4      0.00
wg-timer-manager            any wait         0           0       326  2.62e2      0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 3.3, 10 sec internal node vector rate 1.00 loops/sec 6238348.06
  vector rates in 6.1281e-1, out 0.0000e0, drop 6.1281e-1, punt 0.0000e0
         Name                 State      Calls     Vectors  Suspends  Clocks  Vectors/Call
dpdk-input                  polling   21020380           2         0  1.09e9      0.00
drop                        active           2           2         0  6.07e2      1.00
error-drop                  active           2           2         0  6.19e2      1.00
ethernet-input              active           2           2         0  6.29e2      1.00
llc-input                   active           2           2         0  2.70e2      1.00
session-queue               polling   21020380           0         0  1.51e2      0.00
unix-epoll-input            polling      20508           0         0  4.48e2      0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 3.3, 10 sec internal node vector rate 0.00 loops/sec 6569148.91
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
         Name                 State      Calls     Vectors  Suspends  Clocks  Vectors/Call
dpdk-input                  polling   22052253           0         0  1.01e2      0.00
session-queue               polling   22052253           0         0  1.49e2      0.00
unix-epoll-input            polling      21514           0         0  4.39e2      0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 3.3, 10 sec internal node vector rate 96.69 loops/sec 2954.17
  vector rates in 4.6875e5, out 2.9038e3, drop 0.0000e0, punt 0.0000e0
         Name                 State      Calls     Vectors  Suspends  Clocks  Vectors/Call
dpdk-input                  polling       9477     1520339         0  1.71e2    160.42
dsc1-output                 active        9477        9477         0  6.90e2      1.00
dsc1-tx                     active        9477        9477         0  1.05e3      1.00
ethernet-input              active        9477     1520339         0  1.59e1    160.42
ip4-input-no-checksum       active        9477     1520339         0  2.02e1    160.42
ip4-local                   active        9477     1520339         0  1.82e3    160.42
ip4-lookup                  active        9481     1529816         0  2.52e1    161.36
ip4-rewrite                 active        9477        9477         0  4.15e2      1.00
session-queue               polling       9477        9477         0  1.55e3      1.00
tcp4-established            active        9477     1520339         0  2.56e3    160.42
tcp4-input                  active        9477     1520339         0  7.27e1    160.42
tcp4-output                 active        9477        9477         0  8.20e2      1.00
unix-epoll-input            polling          9           0         0  1.45e3      0.00
---------------
Thread 4 vpp_wk_3 (lcore 4)
Time 3.3, 10 sec internal node vector rate 0.00 loops/sec 6490532.87
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
         Name                 State      Calls     Vectors  Suspends  Clocks  Vectors/Call
dpdk-input                  polling   21727288           0         0  1.01e2      0.00
session-queue               polling   21727288           0         0  1.49e2      0.00
unix-epoll-input            polling      21198           0         0  4.37e2      0.00

root@server:~# vppctl show error
   Count             Node                 Reason
       1   snap-input                unknown oui/snap protocol
       2   llc-input                 unknown llc ssap/dsap
    8822   session-queue             Packets transmitted
 1415395   tcp4-established          Packets pushed into rx fifo
    8822   tcp4-output               Packets sent

root@server:~# vppctl clear tcp stats
root@server:~# vppctl show tcp stats
Thread 0:
Thread 1:
Thread 2:
Thread 3:
Thread 4:
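The single-connection attachment can be cross-checked against the reported ~30.5G: tcp4-established on thread 3 handled 1520339 vectors (packets) in the 3.3 s since counters were cleared, and each full-sized segment carries the 8200-byte MSS configured on the client. A quick sketch of that arithmetic (all numbers copied from the output above; this is my own back-of-the-envelope check, not VPP output):

```python
# Derive goodput from the "show run" counters: vectors == packets for these nodes.
vectors = 1_520_339   # tcp4-established vectors on thread 3 (vpp_wk_2) above
window_s = 3.3        # "Time 3.3" -- seconds since "clear run"
mss = 8200            # TCP MSS configured on the iperf client

pps = vectors / window_s
gbps = pps * mss * 8 / 1e9
print(f"{pps:.0f} pkts/s, ~{gbps:.1f} Gbps")   # ~30.2 Gbps, consistent with the ~30.5G measured
```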
Attachment: 7connection.txt

root@server:~# vppctl show run
Thread 0 vpp_main (lcore 0)
Time 5.2, 10 sec internal node vector rate 0.00 loops/sec 1253686.91
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
         Name                      State      Calls     Vectors  Suspends  Clocks  Vectors/Call
cnat-scanner-process             any wait         0           0         5  3.20e3      0.00
dpdk-process                     any wait         0           0         2  2.12e4      0.00
fib-walk                         any wait         0           0         3  5.00e3      0.00
ikev2-manager-process            any wait         0           0         5  3.19e3      0.00
ip4-full-reassembly-expire-wal   any wait         0           0         1  4.05e3      0.00
ip4-sv-reassembly-expire-walk    any wait         0           0         1  3.80e3      0.00
ip6-full-reassembly-expire-wal   any wait         0           0         1  3.38e3      0.00
ip6-mld-process                  any wait         0           0         5  1.72e3      0.00
ip6-ra-process                   any wait         0           0         5  1.47e3      0.00
ip6-sv-reassembly-expire-walk    any wait         0           0         1  5.27e3      0.00
session-queue-main               polling     922847           0         0  1.07e2      0.00
session-queue-process            any wait         0           0         5  2.19e3      0.00
statseg-collector-process        time wait        0           0         1  3.80e4      0.00
unix-cli-local:11                active           1           0         2  2.57e9      0.00
unix-cli-new-session             any wait         0           0         3  2.38e3      0.00
unix-epoll-input                 polling     922847           0         0  1.21e4      0.00
wg-timer-manager                 any wait         0           0       522  3.50e2      0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 5.2, 10 sec internal node vector rate 140.52 loops/sec 1243.57
  vector rates in 3.2308e5, out 2.5045e3, drop 7.6472e-1, punt 0.0000e0
         Name                 State      Calls     Vectors  Suspends  Clocks  Vectors/Call
dpdk-input                  polling       6550     1676800         0  3.49e2    256.00
drop                        active           4           4         0  2.32e3      1.00
dsc1-output                 active        6550       13100         0  6.32e2      2.00
dsc1-tx                     active        6550       13100         0  9.30e2      2.00
error-drop                  active           4           4         0  2.41e3      1.00
ethernet-input              active        6550     1676800         0  2.02e1    256.00
ip4-input-no-checksum       active        6550     1676796         0  2.06e1    255.99
ip4-local                   active        6550     1676796         0  3.04e3    255.99
ip4-lookup                  active       13100     1689896         0  2.62e1    128.99
ip4-rewrite                 active        6550       13100         0  2.74e2      2.00
llc-input                   active           4           4         0  1.99e3      1.00
session-queue               polling       6550       13100         0  1.58e3      2.00
snap-input                  active           1           1         0  6.12e3      1.00
tcp4-established            active        6550     1676796         0  3.26e3    255.99
tcp4-input                  active        6550     1676796         0  8.89e1    255.99
tcp4-output                 active        6550       13100         0  1.19e3      2.00
unix-epoll-input            polling          7           0         0  2.55e3      0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 5.2, 10 sec internal node vector rate 140.96 loops/sec 1246.79
  vector rates in 3.2448e5, out 3.6639e3, drop 0.0000e0, punt 0.0000e0
         Name                 State      Calls     Vectors  Suspends  Clocks  Vectors/Call
dpdk-input                  polling       6555     1678080         0  3.10e2    256.00
dsc1-output                 active        6555       19165         0  5.81e2      2.92
dsc1-tx                     active        6555       19165         0  7.15e2      2.92
ethernet-input              active        6555     1678080         0  1.86e1    256.00
ip4-input-no-checksum       active        6555     1678080         0  1.97e1    256.00
ip4-local                   active        6555     1678080         0  3.16e3    256.00
ip4-lookup                  active       13110     1697245         0  2.60e1    129.46
ip4-rewrite                 active        6555       19165         0  1.93e2      2.92
session-queue               polling       6555       19165         0  1.29e3      2.92
tcp4-established            active        6555     1678080         0  3.17e3    256.00
tcp4-input                  active        6555     1678080         0  9.17e1    256.00
tcp4-output                 active        6555       19165         0  7.29e2      2.92
unix-epoll-input            polling          6           0         0  3.11e3      0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 5.2, 10 sec internal node vector rate 140.55 loops/sec 1218.51
  vector rates in 3.2258e5, out 2.5006e3, drop 0.0000e0, punt 0.0000e0
         Name                 State      Calls     Vectors  Suspends  Clocks  Vectors/Call
dpdk-input                  polling       6540     1674240         0  3.49e2    256.00
dsc1-output                 active        6540       13080         0  6.63e2      2.00
dsc1-tx                     active        6540       13080         0  1.03e3      2.00
ethernet-input              active        6540     1674240         0  1.97e1    256.00
ip4-input-no-checksum       active        6540     1674240         0  2.04e1    256.00
ip4-local                   active        6540     1674240         0  3.12e3    256.00
ip4-lookup                  active       13080     1687320         0  2.65e1    129.00
ip4-rewrite                 active        6540       13080         0  2.66e2      2.00
session-queue               polling       6540       13080         0  1.61e3      2.00
tcp4-established            active        6540     1674240         0  3.19e3    256.00
tcp4-input                  active        6540     1674240         0  9.08e1    256.00
tcp4-output                 active        6540       13080         0  9.91e2      2.00
unix-epoll-input            polling          6           0         0  2.21e3      0.00
---------------
Thread 4 vpp_wk_3 (lcore 4)
Time 5.2, 10 sec internal node vector rate 0.00 loops/sec 6384609.01
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
         Name                 State      Calls     Vectors  Suspends  Clocks  Vectors/Call
dpdk-input                  polling   34700727           0         0  1.01e2      0.00
session-queue               polling   34700727           0         0  1.50e2      0.00
unix-epoll-input            polling      33854           0         0  5.62e2      0.00

root@server:~# vppctl clear error
root@server:~# vppctl show error
   Count             Node                 Reason
   16046   session-queue             Packets transmitted
 2053885   tcp4-established          Packets pushed into rx fifo
   16046   tcp4-output               Packets sent
       3   llc-input                 unknown llc ssap/dsap
   23632   session-queue             Packets transmitted
 2049856   tcp4-established          Packets pushed into rx fifo
    6944   tcp4-established          OOO packets pushed into rx fifo
   23632   tcp4-output               Packets sent
   15912   session-queue             Packets transmitted
 2036736   tcp4-established          Packets pushed into rx fifo
   15912   tcp4-output               Packets sent

root@server:~# vppctl clear tcp stats
root@server:~# vppctl show tcp stats
Thread 0:
Thread 1:
Thread 2:
Thread 3:
Thread 4:
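The ~63G figure from the email also falls out of the per-worker vector rates above: the three loaded workers each report an input rate around 3.23e5 packets/s, and nothing hashes to the fourth. A sketch of that arithmetic (rates copied from the output above; my own cross-check, not VPP output):

```python
# Per-worker "vector rates in" from the show run output above (pkts/s)
rates_in = [3.2308e5, 3.2448e5, 3.2258e5]   # vpp_wk_0..2; vpp_wk_3 is idle
mss = 8200                                   # TCP MSS on the iperf clients

gbps = sum(rates_in) * mss * 8 / 1e9
print(f"~{gbps:.1f} Gbps")   # ~63.6 Gbps, consistent with the ~63G measured
```

This also makes the saturation point concrete: at ~140 vectors/dispatch each remaining worker has little headroom, so adding connections that hash to a fourth worker is what would push total throughput higher.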
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#17383): https://lists.fd.io/g/vpp-dev/message/17383
Mute This Topic: https://lists.fd.io/mt/76783803/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-