Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On Fri, Aug 23, 2019 at 9:59 AM Ilya Maximets wrote: > > On 23.08.2019 19:08, William Tu wrote: > > On Wed, Aug 21, 2019 at 2:31 AM Eelco Chaudron wrote: > >> > >> > >> > > William, Eelco, which HW NIC are you using? Which kernel driver? > > I’m using the below on the latest bpf-next driver: > > 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit > SFI/SFP+ Network Connection (rev 01) > 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit > SFI/SFP+ Network Connection (rev 01) > >>> > >>> Thanks for the information. > >>> I found one suspicious place inside the ixgbe driver that could break > >>> the completion queue ring and prepared a patch: > >>> > >>> https://patchwork.ozlabs.org/patch/1150244/ > >>> > >>> It'll be good if you can test it. > >> > >> Hi Ilya, I was doing some testing of my own, and also concluded it was > >> in the driver's completion ring. I noticed after sending 512 packets the > >> driver's TX counters kept increasing, which looks related to your fix. > >> > >> Will try it out, and send results to the upstream mailing list… > >> > >> Thanks, > >> > >> Eelco > > > > Hi, > > > > I'm comparing the performance of netdev-afxdp.c on current master and > > the DPDK AF_XDP implementation in the OVS dpdk-latest branch. > > I'm using ixgbe and doing physical port to physical port forwarding, sending > > 64 byte packets, with the OpenFlow rule: > > ovs-ofctl add-flow br0 "in_port=eth2, actions=output:eth3" > > > > In short: > > A. OVS's netdev-afxdp: 6.1 Mpps > > B. OVS-DPDK AF_XDP pmd: 8 Mpps > > So I started to think about how to optimize lib/netdev-afxdp.c. Any comments > > are > > welcome! Below is the analysis: > > One major difference is that the DPDK implementation supports XDP_USE_NEED_WAKEUP, > and it will be in use if you're building the kernel from the latest bpf-next tree. > This allows us to significantly decrease the number of syscalls. 
> According to the perf stats below, the OVS implementation, unlike the dpdk one, wastes ~11% > of its time inside the kernel, and this could be fixed by the need_wakeup feature. Cool, thank you. I will look at how to use XDP_USE_NEED_WAKEUP. > > BTW, there are a lot of pmd threads in case A, but only one in case B. > Was the test setup really equal? Yes, they should be equal. I accidentally set pmd-cpu-mask=0xf0 in case A, so it uses more CPUs, but I always enable only one queue, and pmd-stats-show shows the other pmds are doing nothing. Will fix it next time. Regards, William ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 23.08.2019 19:08, William Tu wrote: > On Wed, Aug 21, 2019 at 2:31 AM Eelco Chaudron wrote: >> >> >> > William, Eelco, which HW NIC are you using? Which kernel driver? I’m using the below on the latest bpf-next driver: 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) >>> >>> Thanks for the information. >>> I found one suspicious place inside the ixgbe driver that could break >>> the completion queue ring and prepared a patch: >>> >>> https://patchwork.ozlabs.org/patch/1150244/ >>> >>> It'll be good if you can test it. >> >> Hi Ilya, I was doing some testing of my own, and also concluded it was >> in the driver's completion ring. I noticed after sending 512 packets the >> driver's TX counters kept increasing, which looks related to your fix. >> >> Will try it out, and send results to the upstream mailing list… >> >> Thanks, >> >> Eelco > > Hi, > > I'm comparing the performance of netdev-afxdp.c on current master and > the DPDK AF_XDP implementation in the OVS dpdk-latest branch. > I'm using ixgbe and doing physical port to physical port forwarding, sending > 64 byte packets, with the OpenFlow rule: > ovs-ofctl add-flow br0 "in_port=eth2, actions=output:eth3" > > In short: > A. OVS's netdev-afxdp: 6.1 Mpps > B. OVS-DPDK AF_XDP pmd: 8 Mpps > So I started to think about how to optimize lib/netdev-afxdp.c. Any comments are > welcome! Below is the analysis: One major difference is that the DPDK implementation supports XDP_USE_NEED_WAKEUP, and it will be in use if you're building the kernel from the latest bpf-next tree. This allows us to significantly decrease the number of syscalls. According to the perf stats below, the OVS implementation, unlike the dpdk one, wastes ~11% of its time inside the kernel, and this could be fixed by the need_wakeup feature. 
BTW, there are a lot of pmd threads in case A, but only one in case B. Was the test setup really equal? Best regards, Ilya Maximets. > > A. OVS netdev-afxdp physical to physical: 6.1 Mpps > # pstree -p 702 > ovs-vswitchd(702)-+-{ct_clean17}(706) > |-{handler4}(712) > |-{ipf_clean2}(707) > |-{pmd6}(790) > |-{pmd7}(791) > |-{pmd8}(792) > |-{pmd9}(793) > |-{revalidator5}(713) > `-{urcu3}(708) > > # ovs-appctl dpif-netdev/pmd-stats-show > pmd thread numa_id 0 core_id 6: > packets received: 92290351 > packet recirculations: 0 > avg. datapath passes per packet: 1.00 > emc hits: 92290319 > smc hits: 0 > megaflow hits: 31 > avg. subtable lookups per megaflow hit: 1.00 > miss with success upcall: 1 > miss with failed upcall: 0 > avg. packets per output batch: 31.88 > idle cycles: 20835727677 (34.86%) --> pretty high!? > processing cycles: 38932097052 (65.14%) > avg cycles per packet: 647.61 (59767824729/92290351) > avg processing cycles per packet: 421.84 (38932097052/92290351) > > # ./perf record -t 790 sleep 10 > 13.80% pmd6 ovs-vswitchd [.] miniflow_extract > 13.58% pmd6 ovs-vswitchd [.] __netdev_afxdp_batch_send > 9.64% pmd6 ovs-vswitchd [.] dp_netdev_input__ > 9.07% pmd6 ovs-vswitchd [.] dp_packet_init__ > 8.91% pmd6 ovs-vswitchd [.] netdev_afxdp_rxq_recv > 7.40% pmd6 ovs-vswitchd [.] miniflow_hash_5tuple > 5.32% pmd6 libc-2.23.so [.] __memcpy_avx_unaligned > 4.60% pmd6 [kernel.vmlinux] [k] do_syscall_64 > 3.72% pmd6 ovs-vswitchd [.] dp_packet_use_afxdp --> maybe optimize? > 2.74% pmd6 libpthread-2.23.so [.] __pthread_enable_asynccancel > 2.43% pmd6 [kernel.vmlinux] [k] fput_many > 2.18% pmd6 libc-2.23.so [.] __memcmp_sse4_1 > 2.06% pmd6 [kernel.vmlinux] [k] entry_SYSCALL_64 > 1.79% pmd6 [kernel.vmlinux] [k] syscall_return_via_sysret > 1.71% pmd6 ovs-vswitchd [.] dp_execute_cb > 1.03% pmd6 ovs-vswitchd [.] non_atomic_ullong_add > 0.86% pmd6 ovs-vswitchd [.] dp_netdev_pmd_flush_output_on_port > > B. 
OVS-DPDK afxdp using dpdk-latest 8Mpps > ovs-vswitchd(19462)-+-{ct_clean3}(19470) > |-{dpdk_watchdog1}(19468) > |-{eal-intr-thread}(19466) > |-{handler16}(19501) > |-{handler17}(19505) > |-{handler18}(19506) > |-{handler19}(19507) > |-{handler20}(19508) > |-{handler22}(19502) > |-{handler24}(19504) > |-{handler26}(19503) > |-{ipf_clean4}(19471) > |-{pmd27}(19536) > |-{revalidator21}(19509) >
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On Wed, Aug 21, 2019 at 2:31 AM Eelco Chaudron wrote: > > > > >>> William, Eelco, which HW NIC are you using? Which kernel driver? > >> > >> I’m using the below on the latest bpf-next driver: > >> > >> 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit > >> SFI/SFP+ Network Connection (rev 01) > >> 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit > >> SFI/SFP+ Network Connection (rev 01) > > > > Thanks for the information. > > I found one suspicious place inside the ixgbe driver that could break > > the completion queue ring and prepared a patch: > > https://patchwork.ozlabs.org/patch/1150244/ > > > > It'll be good if you can test it. > > Hi Ilya, I was doing some testing of my own, and also concluded it was > in the driver's completion ring. I noticed after sending 512 packets the > driver's TX counters kept increasing, which looks related to your fix. > > Will try it out, and send results to the upstream mailing list… > > Thanks, > > Eelco Hi, I'm comparing the performance of netdev-afxdp.c on current master and the DPDK AF_XDP implementation in the OVS dpdk-latest branch. I'm using ixgbe and doing physical port to physical port forwarding, sending 64 byte packets, with the OpenFlow rule: ovs-ofctl add-flow br0 "in_port=eth2, actions=output:eth3" In short: A. OVS's netdev-afxdp: 6.1 Mpps B. OVS-DPDK AF_XDP pmd: 8 Mpps So I started to think about how to optimize lib/netdev-afxdp.c. Any comments are welcome! Below is the analysis: A. OVS netdev-afxdp physical to physical: 6.1 Mpps # pstree -p 702 ovs-vswitchd(702)-+-{ct_clean17}(706) |-{handler4}(712) |-{ipf_clean2}(707) |-{pmd6}(790) |-{pmd7}(791) |-{pmd8}(792) |-{pmd9}(793) |-{revalidator5}(713) `-{urcu3}(708) # ovs-appctl dpif-netdev/pmd-stats-show pmd thread numa_id 0 core_id 6: packets received: 92290351 packet recirculations: 0 avg. datapath passes per packet: 1.00 emc hits: 92290319 smc hits: 0 megaflow hits: 31 avg. 
subtable lookups per megaflow hit: 1.00 miss with success upcall: 1 miss with failed upcall: 0 avg. packets per output batch: 31.88 idle cycles: 20835727677 (34.86%) --> pretty high!? processing cycles: 38932097052 (65.14%) avg cycles per packet: 647.61 (59767824729/92290351) avg processing cycles per packet: 421.84 (38932097052/92290351) # ./perf record -t 790 sleep 10 13.80% pmd6 ovs-vswitchd [.] miniflow_extract 13.58% pmd6 ovs-vswitchd [.] __netdev_afxdp_batch_send 9.64% pmd6 ovs-vswitchd [.] dp_netdev_input__ 9.07% pmd6 ovs-vswitchd [.] dp_packet_init__ 8.91% pmd6 ovs-vswitchd [.] netdev_afxdp_rxq_recv 7.40% pmd6 ovs-vswitchd [.] miniflow_hash_5tuple 5.32% pmd6 libc-2.23.so [.] __memcpy_avx_unaligned 4.60% pmd6 [kernel.vmlinux] [k] do_syscall_64 3.72% pmd6 ovs-vswitchd [.] dp_packet_use_afxdp --> maybe optimize? 2.74% pmd6 libpthread-2.23.so [.] __pthread_enable_asynccancel 2.43% pmd6 [kernel.vmlinux] [k] fput_many 2.18% pmd6 libc-2.23.so [.] __memcmp_sse4_1 2.06% pmd6 [kernel.vmlinux] [k] entry_SYSCALL_64 1.79% pmd6 [kernel.vmlinux] [k] syscall_return_via_sysret 1.71% pmd6 ovs-vswitchd [.] dp_execute_cb 1.03% pmd6 ovs-vswitchd [.] non_atomic_ullong_add 0.86% pmd6 ovs-vswitchd [.] dp_netdev_pmd_flush_output_on_port B. OVS-DPDK afxdp using dpdk-latest: 8 Mpps ovs-vswitchd(19462)-+-{ct_clean3}(19470) |-{dpdk_watchdog1}(19468) |-{eal-intr-thread}(19466) |-{handler16}(19501) |-{handler17}(19505) |-{handler18}(19506) |-{handler19}(19507) |-{handler20}(19508) |-{handler22}(19502) |-{handler24}(19504) |-{handler26}(19503) |-{ipf_clean4}(19471) |-{pmd27}(19536) |-{revalidator21}(19509) |-{revalidator23}(19511) |-{revalidator25}(19510) |-{rte_mp_handle}(19467) `-{urcu2}(19469) # ovs-appctl dpif-netdev/pmd-stats-show pmd thread numa_id 0 core_id 11: packets received: 1813689117 packet recirculations: 0 avg. datapath passes per packet: 1.00 emc hits: 1813689053 smc hits: 0 megaflow hits: 63 avg. 
subtable lookups per megaflow hit: 1.00 miss with success upcall: 1 miss with failed upcall: 0 avg. packets per output batch: 31.85 idle cycles: 13848892341 (2.50%) processing cycles: 541064826249 (97.50%) avg cycles per packet: 305.96 (554913718590/1813689117) avg processing cycles per packet: 298.32 (541064826249/1813689117) # ./perf record -t 19536 sleep 10 24.84% pmd27 ovs-vsw
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 20 Aug 2019, at 17:20, Ilya Maximets wrote: On 20.08.2019 14:19, Eelco Chaudron wrote: On 20 Aug 2019, at 12:10, Ilya Maximets wrote: On 14.08.2019 19:16, William Tu wrote: On Wed, Aug 14, 2019 at 7:58 AM William Tu wrote: On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron wrote: On 8 Aug 2019, at 17:38, Ilya Maximets wrote: I see a rather high number of afxdp_cq_skip, which should to my knowledge never happen? I tried to investigate this previously, but didn't find anything suspicious. So, for my knowledge, this should never happen too. However, I only looked at the code without actually running, because I had no HW available for testing. While investigation and stress-testing virtual ports I found few issues with missing locking inside the kernel, so there is no trust for kernel part of XDP implementation from my side. I'm suspecting that there are some other bugs in kernel/libbpf that only could be reproduced with driver mode. This never happens for virtual ports with SKB mode, so I never saw this coverage counter being non-zero. Did some quick debugging, as something else has come up that needs my attention :) But once I’m in a faulty state and sent a single packet, causing afxdp_complete_tx() to be called, it tells me 2048 descriptors are ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that there might be some ring management bug. Maybe consumer and receiver are equal meaning 0 buffers, but it returns max? I did not look at the kernel code, so this is just a wild guess :) (gdb) p tx_done $3 = 2048 (gdb) p umem->cq $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047, size = 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040, ring = 0x7f08486b8080} Thanks for debugging! xsk_ring_cons__peek() just returns the difference between cached_prod and cached_cons, but these values are too different: 3830466864 - 3578066899 = 252399965 Since this value > requested, it returns requested number (2048). So, the ring is broken. 
At least its 'cached' part is broken. It'll be good to look at *consumer and *producer values to verify the state of the actual ring. I’ll try to find some more time next week to debug further. William, I noticed your email in xdp-newbies where you mention this problem of getting the wrong pointers. Did you ever follow up, or do further troubleshooting on the above? Yes, I posted here https://www.spinics.net/lists/xdp-newbies/msg00956.html "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?" At that time I was thinking about reproducing the problem using the xdpsock sample code from the kernel. But it turned out that my reproduction code was not correct, so it was not able to show the case we hit here in OVS. Then I put more similar code logic from OVS into xdpsock, but the problem does not show up. As a result, I worked around it by marking the addr as "*addr == UINT64_MAX". I will debug again this week once I get my testbed back. Just to refresh my memory: the original issue is that when calling: tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq); xsk_ring_cons__release(&umem->cq, tx_done); I expect there are 'tx_done' elems on the CQ to re-cycle back to the memory pool. However, when I inspect these elems, I find some elems that have 'already' been reported complete the last time I called xsk_ring_cons__peek. In other words, some elems show up on the CQ twice, and this causes overflow of the mempool. Thus, mark the elems on the CQ as UINT64_MAX to indicate that we have already seen this elem. William, Eelco, which HW NIC are you using? Which kernel driver? I’m using the below on the latest bpf-next driver: 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) Thanks for the information. 
I found one suspicious place inside the ixgbe driver that could break the completion queue ring and prepared a patch: https://patchwork.ozlabs.org/patch/1150244/ It'll be good if you can test it. Hi Ilya, I was doing some testing of my own, and also concluded it was in the driver's completion ring. I noticed after sending 512 packets the driver's TX counters kept increasing, which looks related to your fix. Will try it out, and send results to the upstream mailing list… Thanks, Eelco
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 20.08.2019 14:19, Eelco Chaudron wrote: > > > On 20 Aug 2019, at 12:10, Ilya Maximets wrote: > >> On 14.08.2019 19:16, William Tu wrote: >>> On Wed, Aug 14, 2019 at 7:58 AM William Tu wrote: On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron wrote: > > > > On 8 Aug 2019, at 17:38, Ilya Maximets wrote: > > > > I see a rather high number of afxdp_cq_skip, which should to my > knowledge never happen? I tried to investigate this previously, but didn't find anything suspicious. So, for my knowledge, this should never happen too. However, I only looked at the code without actually running, because I had no HW available for testing. While investigation and stress-testing virtual ports I found few issues with missing locking inside the kernel, so there is no trust for kernel part of XDP implementation from my side. I'm suspecting that there are some other bugs in kernel/libbpf that only could be reproduced with driver mode. This never happens for virtual ports with SKB mode, so I never saw this coverage counter being non-zero. >>> >>> Did some quick debugging, as something else has come up that needs my >>> attention :) >>> >>> But once I’m in a faulty state and sent a single packet, causing >>> afxdp_complete_tx() to be called, it tells me 2048 descriptors are >>> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that >>> there might be some ring management bug. Maybe consumer and receiver >>> are equal meaning 0 buffers, but it returns max? I did not look at >>> the kernel code, so this is just a wild guess :) >>> >>> (gdb) p tx_done >>> $3 = 2048 >>> >>> (gdb) p umem->cq >>> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = >>> 2047, size = 2048, producer = 0x7f08486b8000, consumer = >>> 0x7f08486b8040, ring = 0x7f08486b8080} >> >> Thanks for debugging! 
>> >> xsk_ring_cons__peek() just returns the difference between cached_prod >> and cached_cons, but these values are too different: >> >> 3830466864 - 3578066899 = 252399965 >> >> Since this value > requested, it returns the requested number (2048). >> >> So, the ring is broken. At least its 'cached' part is broken. It'll be >> good >> to look at *consumer and *producer values to verify the state of the >> actual ring. >> > > I’ll try to find some more time next week to debug further. > > William, I noticed your email in xdp-newbies where you mention this > problem of getting the wrong pointers. Did you ever follow up, or do > further troubleshooting on the above? Yes, I posted here https://www.spinics.net/lists/xdp-newbies/msg00956.html "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?" At that time I was thinking about reproducing the problem using the xdpsock sample code from the kernel. But it turned out that my reproduction code was not correct, so it was not able to show the case we hit here in OVS. Then I put more similar code logic from OVS into xdpsock, but the problem does not show up. As a result, I worked around it by marking the addr as "*addr == UINT64_MAX". I will debug again this week once I get my testbed back. >>> Just to refresh my memory: the original issue is that >>> when calling: >>> tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq); >>> xsk_ring_cons__release(&umem->cq, tx_done); >>> >>> I expect there are 'tx_done' elems on the CQ to re-cycle back to the memory >>> pool. >>> However, when I inspect these elems, I find some elems that have 'already' been >>> reported complete the last time I called xsk_ring_cons__peek. In other words, some >>> elems show up on the CQ twice, and this causes overflow of the mempool. >>> >>> Thus, mark the elems on the CQ as UINT64_MAX to indicate that we have already >>> seen this elem. >> >> William, Eelco, which HW NIC are you using? Which kernel driver? 
> > I’m using the below on the latest bpf-next driver: > > 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ > Network Connection (rev 01) > 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ > Network Connection (rev 01) Thanks for the information. I found one suspicious place inside the ixgbe driver that could break the completion queue ring and prepared a patch: https://patchwork.ozlabs.org/patch/1150244/ It'll be good if you can test it. Best regards, Ilya Maximets.
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 20 Aug 2019, at 12:10, Ilya Maximets wrote: On 14.08.2019 19:16, William Tu wrote: On Wed, Aug 14, 2019 at 7:58 AM William Tu wrote: On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron wrote: On 8 Aug 2019, at 17:38, Ilya Maximets wrote: I see a rather high number of afxdp_cq_skip, which should to my knowledge never happen? I tried to investigate this previously, but didn't find anything suspicious. So, for my knowledge, this should never happen too. However, I only looked at the code without actually running, because I had no HW available for testing. While investigation and stress-testing virtual ports I found few issues with missing locking inside the kernel, so there is no trust for kernel part of XDP implementation from my side. I'm suspecting that there are some other bugs in kernel/libbpf that only could be reproduced with driver mode. This never happens for virtual ports with SKB mode, so I never saw this coverage counter being non-zero. Did some quick debugging, as something else has come up that needs my attention :) But once I’m in a faulty state and sent a single packet, causing afxdp_complete_tx() to be called, it tells me 2048 descriptors are ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that there might be some ring management bug. Maybe consumer and receiver are equal meaning 0 buffers, but it returns max? I did not look at the kernel code, so this is just a wild guess :) (gdb) p tx_done $3 = 2048 (gdb) p umem->cq $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047, size = 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040, ring = 0x7f08486b8080} Thanks for debugging! xsk_ring_cons__peek() just returns the difference between cached_prod and cached_cons, but these values are too different: 3830466864 - 3578066899 = 252399965 Since this value > requested, it returns requested number (2048). So, the ring is broken. At least broken its 'cached' part. 
It'll be good to look at *consumer and *producer values to verify the state of the actual ring. I’ll try to find some more time next week to debug further. William, I noticed your email in xdp-newbies where you mention this problem of getting the wrong pointers. Did you ever follow up, or do further troubleshooting on the above? Yes, I posted here https://www.spinics.net/lists/xdp-newbies/msg00956.html "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?" At that time I was thinking about reproducing the problem using the xdpsock sample code from the kernel. But it turned out that my reproduction code was not correct, so it was not able to show the case we hit here in OVS. Then I put more similar code logic from OVS into xdpsock, but the problem does not show up. As a result, I worked around it by marking the addr as "*addr == UINT64_MAX". I will debug again this week once I get my testbed back. Just to refresh my memory: the original issue is that when calling: tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq); xsk_ring_cons__release(&umem->cq, tx_done); I expect there are 'tx_done' elems on the CQ to re-cycle back to the memory pool. However, when I inspect these elems, I find some elems that have 'already' been reported complete the last time I called xsk_ring_cons__peek. In other words, some elems show up on the CQ twice, and this causes overflow of the mempool. Thus, mark the elems on the CQ as UINT64_MAX to indicate that we have already seen this elem. William, Eelco, which HW NIC are you using? Which kernel driver? I’m using the below on the latest bpf-next driver: 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) //Eelco
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 14.08.2019 19:16, William Tu wrote: > On Wed, Aug 14, 2019 at 7:58 AM William Tu wrote: >> >> On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron wrote: >>> >>> >>> >>> On 8 Aug 2019, at 17:38, Ilya Maximets wrote: >>> >>> >>> >>> I see a rather high number of afxdp_cq_skip, which should to my >>> knowledge never happen? >> >> I tried to investigate this previously, but didn't find anything >> suspicious. >> So, for my knowledge, this should never happen too. >> However, I only looked at the code without actually running, because >> I had no >> HW available for testing. >> >> While investigation and stress-testing virtual ports I found few >> issues with >> missing locking inside the kernel, so there is no trust for kernel >> part of XDP >> implementation from my side. I'm suspecting that there are some >> other bugs in >> kernel/libbpf that only could be reproduced with driver mode. >> >> This never happens for virtual ports with SKB mode, so I never saw >> this coverage >> counter being non-zero. > > Did some quick debugging, as something else has come up that needs my > attention :) > > But once I’m in a faulty state and sent a single packet, causing > afxdp_complete_tx() to be called, it tells me 2048 descriptors are > ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that > there might be some ring management bug. Maybe consumer and receiver > are equal meaning 0 buffers, but it returns max? I did not look at > the kernel code, so this is just a wild guess :) > > (gdb) p tx_done > $3 = 2048 > > (gdb) p umem->cq > $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = > 2047, size = 2048, producer = 0x7f08486b8000, consumer = > 0x7f08486b8040, ring = 0x7f08486b8080} Thanks for debugging! xsk_ring_cons__peek() just returns the difference between cached_prod and cached_cons, but these values are too different: 3830466864 - 3578066899 = 252399965 Since this value > requested, it returns requested number (2048). So, the ring is broken. 
At least its 'cached' part is broken. It'll be good to look at *consumer and *producer values to verify the state of the actual ring. >>> >>> I’ll try to find some more time next week to debug further. >>> >>> William, I noticed your email in xdp-newbies where you mention this >>> problem of getting the wrong pointers. Did you ever follow up, or do >>> further troubleshooting on the above? >> >> Yes, I posted here >> https://www.spinics.net/lists/xdp-newbies/msg00956.html >> "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?" >> >> At that time I was thinking about reproducing the problem using the >> xdpsock sample code from the kernel. But it turned out that my reproduction >> code was not correct, so it was not able to show the case we hit here in OVS. >> >> Then I put more similar code logic from OVS into xdpsock, but the problem >> does not show up. As a result, I worked around it by marking the addr as >> "*addr == UINT64_MAX". >> >> I will debug again this week once I get my testbed back. >> > Just to refresh my memory: the original issue is that > when calling: > tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq); > xsk_ring_cons__release(&umem->cq, tx_done); > > I expect there are 'tx_done' elems on the CQ to re-cycle back to the memory pool. > However, when I inspect these elems, I find some elems that have 'already' been > reported complete the last time I called xsk_ring_cons__peek. In other words, some > elems show up on the CQ twice, and this causes overflow of the mempool. > > Thus, mark the elems on the CQ as UINT64_MAX to indicate that we have already > seen this elem. William, Eelco, which HW NIC are you using? Which kernel driver? Best regards, Ilya Maximets.
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On Wed, Aug 14, 2019 at 7:58 AM William Tu wrote: > > On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron wrote: > > > > > > > > On 8 Aug 2019, at 17:38, Ilya Maximets wrote: > > > > > > > > I see a rather high number of afxdp_cq_skip, which should to my > > knowledge never happen? > > >>> > > >>> I tried to investigate this previously, but didn't find anything > > >>> suspicious. > > >>> So, for my knowledge, this should never happen too. > > >>> However, I only looked at the code without actually running, because > > >>> I had no > > >>> HW available for testing. > > >>> > > >>> While investigation and stress-testing virtual ports I found few > > >>> issues with > > >>> missing locking inside the kernel, so there is no trust for kernel > > >>> part of XDP > > >>> implementation from my side. I'm suspecting that there are some > > >>> other bugs in > > >>> kernel/libbpf that only could be reproduced with driver mode. > > >>> > > >>> This never happens for virtual ports with SKB mode, so I never saw > > >>> this coverage > > >>> counter being non-zero. > > >> > > >> Did some quick debugging, as something else has come up that needs my > > >> attention :) > > >> > > >> But once I’m in a faulty state and sent a single packet, causing > > >> afxdp_complete_tx() to be called, it tells me 2048 descriptors are > > >> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that > > >> there might be some ring management bug. Maybe consumer and receiver > > >> are equal meaning 0 buffers, but it returns max? I did not look at > > >> the kernel code, so this is just a wild guess :) > > >> > > >> (gdb) p tx_done > > >> $3 = 2048 > > >> > > >> (gdb) p umem->cq > > >> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = > > >> 2047, size = 2048, producer = 0x7f08486b8000, consumer = > > >> 0x7f08486b8040, ring = 0x7f08486b8080} > > > > > > Thanks for debugging! 
> > > > > > xsk_ring_cons__peek() just returns the difference between cached_prod > > > and cached_cons, but these values are too different: > > > > > > 3830466864 - 3578066899 = 252399965 > > > > > > Since this value > requested, it returns the requested number (2048). > > > > > > So, the ring is broken. At least its 'cached' part is broken. It'll be > > > good > > > to look at *consumer and *producer values to verify the state of the > > > actual ring. > > > > > > > I’ll try to find some more time next week to debug further. > > > > William, I noticed your email in xdp-newbies where you mention this > > problem of getting the wrong pointers. Did you ever follow up, or do > > further troubleshooting on the above? > > Yes, I posted here > https://www.spinics.net/lists/xdp-newbies/msg00956.html > "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?" > > At that time I was thinking about reproducing the problem using the > xdpsock sample code from the kernel. But it turned out that my reproduction > code was not correct, so it was not able to show the case we hit here in OVS. > > Then I put more similar code logic from OVS into xdpsock, but the problem > does not show up. As a result, I worked around it by marking the addr as > "*addr == UINT64_MAX". > > I will debug again this week once I get my testbed back. > Just to refresh my memory: the original issue is that when calling: tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq); xsk_ring_cons__release(&umem->cq, tx_done); I expect there are 'tx_done' elems on the CQ to re-cycle back to the memory pool. However, when I inspect these elems, I find some elems that have 'already' been reported complete the last time I called xsk_ring_cons__peek. In other words, some elems show up on the CQ twice, and this causes overflow of the mempool. Thus, mark the elems on the CQ as UINT64_MAX to indicate that we have already seen this elem. William
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron wrote: > > > > On 8 Aug 2019, at 17:38, Ilya Maximets wrote: > > > > I see a rather high number of afxdp_cq_skip, which should to my > knowledge never happen? > >>> > >>> I tried to investigate this previously, but didn't find anything > >>> suspicious. > >>> So, for my knowledge, this should never happen too. > >>> However, I only looked at the code without actually running, because > >>> I had no > >>> HW available for testing. > >>> > >>> While investigation and stress-testing virtual ports I found few > >>> issues with > >>> missing locking inside the kernel, so there is no trust for kernel > >>> part of XDP > >>> implementation from my side. I'm suspecting that there are some > >>> other bugs in > >>> kernel/libbpf that only could be reproduced with driver mode. > >>> > >>> This never happens for virtual ports with SKB mode, so I never saw > >>> this coverage > >>> counter being non-zero. > >> > >> Did some quick debugging, as something else has come up that needs my > >> attention :) > >> > >> But once I’m in a faulty state and sent a single packet, causing > >> afxdp_complete_tx() to be called, it tells me 2048 descriptors are > >> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that > >> there might be some ring management bug. Maybe consumer and receiver > >> are equal meaning 0 buffers, but it returns max? I did not look at > >> the kernel code, so this is just a wild guess :) > >> > >> (gdb) p tx_done > >> $3 = 2048 > >> > >> (gdb) p umem->cq > >> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = > >> 2047, size = 2048, producer = 0x7f08486b8000, consumer = > >> 0x7f08486b8040, ring = 0x7f08486b8080} > > > > Thanks for debugging! > > > > xsk_ring_cons__peek() just returns the difference between cached_prod > > and cached_cons, but these values are too different: > > > > 3830466864 - 3578066899 = 252399965 > > > > Since this value > requested, it returns requested number (2048). 
> > > > So, the ring is broken. At least broken its 'cached' part. It'll be > > good > > to look at *consumer and *producer values to verify the state of the > > actual ring. > > > > I’ll try to find some more time next week to debug further. > > William I noticed your email in xdp-newbies where you mention this > problem of getting the wrong pointers. Did you ever follow up, or did > further trouble shooting on the above? Yes, I posted here https://www.spinics.net/lists/xdp-newbies/msg00956.html "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?" At that time I was thinking about reproducing the problem using the xdpsock sample code from the kernel. But it turned out that my reproduction code was not correct, so it could not show the case we hit here in OVS. Then I ported more of the OVS code logic into xdpsock, but the problem did not show up. As a result, I worked around it by marking the address as consumed ("*addr == UINT64_MAX"). I will debug again this week once I get my testbed back. William ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
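William's workaround above — poisoning completed descriptor addresses with UINT64_MAX so that duplicate completion-queue slots can be detected and skipped — can be modeled in isolation. The sketch below is illustrative, not the actual OVS code: `cq_reclaim` and `ADDR_CONSUMED` are hypothetical names, and the real code operates on the AF_XDP completion ring rather than a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_CONSUMED UINT64_MAX  /* Poison value marking an already-reclaimed slot. */

/* Reclaim addresses from a snapshot of completion-ring slots.
 * Valid addresses are copied to 'out'; slots already poisoned with
 * ADDR_CONSUMED are skipped (what the 'afxdp_cq_skip' counter would
 * count).  Each reclaimed slot is poisoned afterwards so that a buggy
 * duplicate peek of the same slot cannot free a buffer twice.
 * Returns the number of addresses actually reclaimed. */
static size_t
cq_reclaim(uint64_t *slots, size_t n, uint64_t *out, size_t *skipped)
{
    size_t freed = 0;

    *skipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (slots[i] == ADDR_CONSUMED) {
            (*skipped)++;             /* Duplicate slot: ignore it. */
            continue;
        }
        out[freed++] = slots[i];
        slots[i] = ADDR_CONSUMED;     /* Poison so a re-peek is harmless. */
    }
    return freed;
}
```

With this scheme, a broken ring that hands back the same slots twice costs only a bump of the skip counter instead of a double free.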
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 8 Aug 2019, at 17:38, Ilya Maximets wrote: I see a rather high number of afxdp_cq_skip, which should to my knowledge never happen? I tried to investigate this previously, but didn't find anything suspicious. So, for my knowledge, this should never happen too. However, I only looked at the code without actually running, because I had no HW available for testing. While investigation and stress-testing virtual ports I found few issues with missing locking inside the kernel, so there is no trust for kernel part of XDP implementation from my side. I'm suspecting that there are some other bugs in kernel/libbpf that only could be reproduced with driver mode. This never happens for virtual ports with SKB mode, so I never saw this coverage counter being non-zero. Did some quick debugging, as something else has come up that needs my attention :) But once I’m in a faulty state and sent a single packet, causing afxdp_complete_tx() to be called, it tells me 2048 descriptors are ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that there might be some ring management bug. Maybe consumer and receiver are equal meaning 0 buffers, but it returns max? I did not look at the kernel code, so this is just a wild guess :) (gdb) p tx_done $3 = 2048 (gdb) p umem->cq $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047, size = 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040, ring = 0x7f08486b8080} Thanks for debugging! xsk_ring_cons__peek() just returns the difference between cached_prod and cached_cons, but these values are too different: 3830466864 - 3578066899 = 252399965 Since this value > requested, it returns requested number (2048). So, the ring is broken. At least broken its 'cached' part. It'll be good to look at *consumer and *producer values to verify the state of the actual ring. I’ll try to find some more time next week to debug further. 
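For reference, libbpf's xsk_ring_cons__peek() sizes a batch from its cached copies of the ring indices. A simplified model (`cons_nb_avail` is an illustrative name, not the libbpf function) shows why the corrupted cached values from the gdb dump above yield the full requested batch of 2048:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of how the peek path sizes a batch: the number of
 * ready entries is the unsigned, wrap-safe difference between the
 * cached producer and consumer indices, capped at the requested batch
 * size.  Nothing validates the cached values against the ring size, so
 * corrupted cached indices silently yield a full batch. */
static uint32_t
cons_nb_avail(uint32_t cached_prod, uint32_t cached_cons, uint32_t nb)
{
    uint32_t entries = cached_prod - cached_cons;

    return entries > nb ? nb : entries;
}
```

Plugging in the dumped values, 3830466864 - 3578066899 = 252399965 "available" entries, far more than the 2048-slot ring can hold, so a peek for 2048 descriptors returns all 2048.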
William I noticed your email in xdp-newbies where you mention this problem of getting the wrong pointers. Did you ever follow up, or do further troubleshooting on the above? $ ovs-appctl coverage/show | grep xdp afxdp_cq_empty 0.0/sec 339.600/sec 5.6606/sec total: 20378 afxdp_tx_full 0.0/sec 29.967/sec 0.4994/sec total: 1798 afxdp_cq_skip 0.0/sec 61884770.167/sec 1174238.3644/sec total: 4227258112 You mentioned you saw this high number in your v15 change notes, did you do any research on why? Cheers, Eelco
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 08.08.2019 17:53, Eelco Chaudron wrote: > > > On 8 Aug 2019, at 14:09, Ilya Maximets wrote: > >> On 08.08.2019 14:42, Eelco Chaudron wrote: >>> >>> >>> On 19 Jul 2019, at 16:54, Ilya Maximets wrote: >>> On 18.07.2019 23:11, William Tu wrote: > The patch introduces experimental AF_XDP support for OVS netdev. > AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket > type built upon the eBPF and XDP technology. It is aims to have > comparable > performance to DPDK but cooperate better with existing kernel's networking > stack. An AF_XDP socket receives and sends packets from an eBPF/XDP > program > attached to the netdev, by-passing a couple of Linux kernel's subsystems > As a result, AF_XDP socket shows much better performance than AF_PACKET > For more details about AF_XDP, please see linux kernel's > Documentation/networking/af_xdp.rst. Note that by default, this feature is > not compiled in. > > Signed-off-by: William Tu Thanks, William, Eelco and Ben! I fixed couple of things and applied to master! >>> >>> Good to see this got merged into master while on PTO. However, when I got >>> back I decided to test it once more… >>> >>> When testing PVP I got a couple of packets trough, and then it would stall. >>> I thought it might be my kernel, so updated to yesterdays latest, no luck… >>> >>> I did see a bunch of “eno1: send failed due to exhausted memory pool.” >>> messages in the log. Putting back patch v14, made my problems go away… >>> >>> After some debugging, I noticed the problem was with the “continue” case in >>> the afxdp_complete_tx() function. 
>>> Applying the following patch made it work again: >>> >>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c >>> index b7cc0d988..9b335ddf0 100644 >>> --- a/lib/netdev-afxdp.c >>> +++ b/lib/netdev-afxdp.c >>> @@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info) >>> >>> if (tx_to_free == BATCH_SIZE || j == tx_done - 1) { >>> umem_elem_push_n(&umem->mpool, tx_to_free, elems_push); >>> xsk_info->outstanding_tx -= tx_to_free; >>> tx_to_free = 0; >>> } >>> } >>> >>> + if (tx_to_free) { >>> + umem_elem_push_n(&umem->mpool, tx_to_free, elems_push); >>> + xsk_info->outstanding_tx -= tx_to_free; >>> + } >>> + >>> if (tx_done > 0) { >>> xsk_ring_cons__release(&umem->cq, tx_done); >>> } else { >>> COVERAGE_INC(afxdp_cq_empty); >>> } >>> } >> >> Good catch! Will you submit a patch? >> BTW, to reduce the code duplication I'd suggest to remove the 'continue' >> like this: >> >> if (*addr != UINT64_MAX) { >> Do work; >> } else { >> COVERAGE_INC(afxdp_cq_skip); >> } > > Done, patch has been sent out… > >>> >>> >>> Which made me wonder why we do mark elements as being used? To my knowledge >>> (and looking at some of the code and examples), after the >>> xsk_ring_cons__release() function a xsk_ring_cons__peek() should not >>> receive any duplicate slots. >>> >>> I see a rather high number of afxdp_cq_skip, which should to my knowledge >>> never happen? >> >> I tried to investigate this previously, but didn't find anything suspicious. >> So, for my knowledge, this should never happen too. >> However, I only looked at the code without actually running, because I had no >> HW available for testing. >> >> While investigation and stress-testing virtual ports I found few issues with >> missing locking inside the kernel, so there is no trust for kernel part of >> XDP >> implementation from my side. I'm suspecting that there are some other bugs in >> kernel/libbpf that only could be reproduced with driver mode. 
>> >> This never happens for virtual ports with SKB mode, so I never saw this >> coverage >> counter being non-zero. > > Did some quick debugging, as something else has come up that needs my > attention :) > > But once I’m in a faulty state and sent a single packet, causing > afxdp_complete_tx() to be called, it tells me 2048 descriptors are ready, > which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that there might be > some ring management bug. Maybe consumer and receiver are equal meaning 0 > buffers, but it returns max? I did not look at the kernel code, so this is > just a wild guess :) > > (gdb) p tx_done > $3 = 2048 > > (gdb) p umem->cq > $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047, size = > 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040, ring = > 0x7f08486b8080} Thanks for debugging! xsk_ring_cons__peek() just returns the difference between cached_prod and cached_cons, but these values are too different: 3830466864 - 3578066899 = 252399965 Since this value > requested, it returns requested number (2048). So, the ring is broken. At least broken its 'cached' part. It'll be good to look at *consumer and *producer values
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 8 Aug 2019, at 14:09, Ilya Maximets wrote: On 08.08.2019 14:42, Eelco Chaudron wrote: On 19 Jul 2019, at 16:54, Ilya Maximets wrote: On 18.07.2019 23:11, William Tu wrote: The patch introduces experimental AF_XDP support for OVS netdev. AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket type built upon the eBPF and XDP technology. It is aims to have comparable performance to DPDK but cooperate better with existing kernel's networking stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program attached to the netdev, by-passing a couple of Linux kernel's subsystems As a result, AF_XDP socket shows much better performance than AF_PACKET For more details about AF_XDP, please see linux kernel's Documentation/networking/af_xdp.rst. Note that by default, this feature is not compiled in. Signed-off-by: William Tu Thanks, William, Eelco and Ben! I fixed couple of things and applied to master! Good to see this got merged into master while on PTO. However, when I got back I decided to test it once more… When testing PVP I got a couple of packets trough, and then it would stall. I thought it might be my kernel, so updated to yesterdays latest, no luck… I did see a bunch of “eno1: send failed due to exhausted memory pool.” messages in the log. Putting back patch v14, made my problems go away… After some debugging, I noticed the problem was with the “continue” case in the afxdp_complete_tx() function. 
Applying the following patch made it work again: diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c index b7cc0d988..9b335ddf0 100644 --- a/lib/netdev-afxdp.c +++ b/lib/netdev-afxdp.c @@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info) if (tx_to_free == BATCH_SIZE || j == tx_done - 1) { umem_elem_push_n(&umem->mpool, tx_to_free, elems_push); xsk_info->outstanding_tx -= tx_to_free; tx_to_free = 0; } } + if (tx_to_free) { + umem_elem_push_n(&umem->mpool, tx_to_free, elems_push); + xsk_info->outstanding_tx -= tx_to_free; + } + if (tx_done > 0) { xsk_ring_cons__release(&umem->cq, tx_done); } else { COVERAGE_INC(afxdp_cq_empty); } } Good catch! Will you submit a patch? BTW, to reduce the code duplication I'd suggest to remove the 'continue' like this: if (*addr != UINT64_MAX) { Do work; } else { COVERAGE_INC(afxdp_cq_skip); } Done, patch has been sent out… Which made me wonder why we do mark elements as being used? To my knowledge (and looking at some of the code and examples), after the xsk_ring_cons__release() function a xsk_ring_cons__peek() should not receive any duplicate slots. I see a rather high number of afxdp_cq_skip, which should to my knowledge never happen? I tried to investigate this previously, but didn't find anything suspicious. So, for my knowledge, this should never happen too. However, I only looked at the code without actually running, because I had no HW available for testing. While investigation and stress-testing virtual ports I found few issues with missing locking inside the kernel, so there is no trust for kernel part of XDP implementation from my side. I'm suspecting that there are some other bugs in kernel/libbpf that only could be reproduced with driver mode. This never happens for virtual ports with SKB mode, so I never saw this coverage counter being non-zero. 
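The leak that the patch above fixes can be modeled outside OVS. In the original loop, a skipped (UINT64_MAX) descriptor 'continue'd past the end-of-loop flush, so a batch accumulated before a trailing skipped slot was never pushed back to the memory pool. Below is a simplified, self-contained model of the fixed logic; `complete_tx` is an illustrative stand-in for afxdp_complete_tx(), with the pool push reduced to a counter:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE 4

/* Model of the fixed flush logic: valid descriptors are staged and
 * returned to the memory pool in batches of BATCH_SIZE, and a trailing
 * flush returns any remainder.  Before the fix, a skipped (UINT64_MAX)
 * descriptor in the last slot could bypass the end-of-loop flush,
 * leaking the staged batch.  Returns total descriptors pushed back. */
static size_t
complete_tx(const uint64_t *descs, size_t tx_done)
{
    size_t pushed = 0, tx_to_free = 0;

    for (size_t j = 0; j < tx_done; j++) {
        if (descs[j] != UINT64_MAX) {
            tx_to_free++;            /* Stage this buffer for the pool. */
        }                            /* Else: duplicate slot, skip it.  */

        if (tx_to_free == BATCH_SIZE) {
            pushed += tx_to_free;    /* umem_elem_push_n() in OVS.      */
            tx_to_free = 0;
        }
    }
    if (tx_to_free) {                /* The fix: flush the remainder.   */
        pushed += tx_to_free;
    }
    return pushed;
}
```

Every staged descriptor is returned regardless of where skipped slots fall, which is exactly what the in-tree `if (tx_to_free)` hunk guarantees.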
Did some quick debugging, as something else has come up that needs my attention :) But once I’m in a faulty state and sent a single packet, causing afxdp_complete_tx() to be called, it tells me 2048 descriptors are ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that there might be some ring management bug. Maybe consumer and receiver are equal meaning 0 buffers, but it returns max? I did not look at the kernel code, so this is just a wild guess :) (gdb) p tx_done $3 = 2048 (gdb) p umem->cq $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047, size = 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040, ring = 0x7f08486b8080} $ ovs-appctl coverage/show | grep xdp afxdp_cq_empty 0.0/sec 339.600/sec 5.6606/sec total: 20378 afxdp_tx_full 0.0/sec 29.967/sec 0.4994/sec total: 1798 afxdp_cq_skip 0.0/sec 61884770.167/sec 1174238.3644/sec total: 4227258112 You mentioned you saw this high number in your v15 change notes, did you do any research on why? Cheers, Eelco
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 08.08.2019 14:42, Eelco Chaudron wrote: > > > On 19 Jul 2019, at 16:54, Ilya Maximets wrote: > >> On 18.07.2019 23:11, William Tu wrote: >>> The patch introduces experimental AF_XDP support for OVS netdev. >>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket >>> type built upon the eBPF and XDP technology. It is aims to have comparable >>> performance to DPDK but cooperate better with existing kernel's networking >>> stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program >>> attached to the netdev, by-passing a couple of Linux kernel's subsystems >>> As a result, AF_XDP socket shows much better performance than AF_PACKET >>> For more details about AF_XDP, please see linux kernel's >>> Documentation/networking/af_xdp.rst. Note that by default, this feature is >>> not compiled in. >>> >>> Signed-off-by: William Tu >> >> >> Thanks, William, Eelco and Ben! >> >> I fixed couple of things and applied to master! > > Good to see this got merged into master while on PTO. However, when I got > back I decided to test it once more… > > When testing PVP I got a couple of packets trough, and then it would stall. I > thought it might be my kernel, so updated to yesterdays latest, no luck… > > I did see a bunch of “eno1: send failed due to exhausted memory pool.” > messages in the log. Putting back patch v14, made my problems go away… > > After some debugging, I noticed the problem was with the “continue” case in > the afxdp_complete_tx() function. 
> Applying the following patch made it work again: > > diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c > index b7cc0d988..9b335ddf0 100644 > --- a/lib/netdev-afxdp.c > +++ b/lib/netdev-afxdp.c > @@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info) > > if (tx_to_free == BATCH_SIZE || j == tx_done - 1) { > umem_elem_push_n(&umem->mpool, tx_to_free, elems_push); > xsk_info->outstanding_tx -= tx_to_free; > tx_to_free = 0; > } > } > > + if (tx_to_free) { > + umem_elem_push_n(&umem->mpool, tx_to_free, elems_push); > + xsk_info->outstanding_tx -= tx_to_free; > + } > + > if (tx_done > 0) { > xsk_ring_cons__release(&umem->cq, tx_done); > } else { > COVERAGE_INC(afxdp_cq_empty); > } > } Good catch! Will you submit a patch? BTW, to reduce the code duplication I'd suggest to remove the 'continue' like this: if (*addr != UINT64_MAX) { Do work; } else { COVERAGE_INC(afxdp_cq_skip); } > > > Which made me wonder why we do mark elements as being used? To my knowledge > (and looking at some of the code and examples), after the > xsk_ring_cons__release() function a xsk_ring_cons__peek() should not receive > any duplicate slots. > > I see a rather high number of afxdp_cq_skip, which should to my knowledge > never happen? I tried to investigate this previously, but didn't find anything suspicious. So, for my knowledge, this should never happen too. However, I only looked at the code without actually running, because I had no HW available for testing. While investigation and stress-testing virtual ports I found few issues with missing locking inside the kernel, so there is no trust for kernel part of XDP implementation from my side. I'm suspecting that there are some other bugs in kernel/libbpf that only could be reproduced with driver mode. This never happens for virtual ports with SKB mode, so I never saw this coverage counter being non-zero. 
> > $ ovs-appctl coverage/show | grep xdp > afxdp_cq_empty 0.0/sec 339.600/sec 5.6606/sec total: > 20378 > afxdp_tx_full 0.0/sec 29.967/sec 0.4994/sec total: > 1798 > afxdp_cq_skip 0.0/sec 61884770.167/sec 1174238.3644/sec > total: 4227258112 > > > You mentioned you saw this high number in your v15 change notes, did you do > any research on why? > > Cheers, > > Eelco
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 19 Jul 2019, at 16:54, Ilya Maximets wrote: On 18.07.2019 23:11, William Tu wrote: The patch introduces experimental AF_XDP support for OVS netdev. AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket type built upon the eBPF and XDP technology. It is aims to have comparable performance to DPDK but cooperate better with existing kernel's networking stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program attached to the netdev, by-passing a couple of Linux kernel's subsystems As a result, AF_XDP socket shows much better performance than AF_PACKET For more details about AF_XDP, please see linux kernel's Documentation/networking/af_xdp.rst. Note that by default, this feature is not compiled in. Signed-off-by: William Tu Thanks, William, Eelco and Ben! I fixed couple of things and applied to master! Good to see this got merged into master while on PTO. However, when I got back I decided to test it once more… When testing PVP I got a couple of packets trough, and then it would stall. I thought it might be my kernel, so updated to yesterdays latest, no luck… I did see a bunch of “eno1: send failed due to exhausted memory pool.” messages in the log. Putting back patch v14, made my problems go away… After some debugging, I noticed the problem was with the “continue” case in the afxdp_complete_tx() function. 
Applying the following patch made it work again: diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c index b7cc0d988..9b335ddf0 100644 --- a/lib/netdev-afxdp.c +++ b/lib/netdev-afxdp.c @@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info) if (tx_to_free == BATCH_SIZE || j == tx_done - 1) { umem_elem_push_n(&umem->mpool, tx_to_free, elems_push); xsk_info->outstanding_tx -= tx_to_free; tx_to_free = 0; } } + if (tx_to_free) { + umem_elem_push_n(&umem->mpool, tx_to_free, elems_push); + xsk_info->outstanding_tx -= tx_to_free; + } + if (tx_done > 0) { xsk_ring_cons__release(&umem->cq, tx_done); } else { COVERAGE_INC(afxdp_cq_empty); } } Which made me wonder why we do mark elements as being used? To my knowledge (and looking at some of the code and examples), after the xsk_ring_cons__release() function a xsk_ring_cons__peek() should not receive any duplicate slots. I see a rather high number of afxdp_cq_skip, which should to my knowledge never happen? $ ovs-appctl coverage/show | grep xdp afxdp_cq_empty 0.0/sec 339.600/sec 5.6606/sec total: 20378 afxdp_tx_full 0.0/sec 29.967/sec 0.4994/sec total: 1798 afxdp_cq_skip 0.0/sec 61884770.167/sec 1174238.3644/sec total: 4227258112 You mentioned you saw this high number in your v15 change notes, did you do any research on why? Cheers, Eelco
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On Fri, Jul 19, 2019 at 05:54:54PM +0300, Ilya Maximets wrote: > On 18.07.2019 23:11, William Tu wrote: > > The patch introduces experimental AF_XDP support for OVS netdev. > > AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket > > type built upon the eBPF and XDP technology. It is aims to have comparable > > performance to DPDK but cooperate better with existing kernel's networking > > stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program > > attached to the netdev, by-passing a couple of Linux kernel's subsystems > > As a result, AF_XDP socket shows much better performance than AF_PACKET > > For more details about AF_XDP, please see linux kernel's > > Documentation/networking/af_xdp.rst. Note that by default, this feature is > > not compiled in. > > > > Signed-off-by: William Tu > > > Thanks, William, Eelco and Ben! > > I fixed couple of things and applied to master! > > List of changes: > * Dropped config.h from headers. > * Removed double increment of 'util_xalloc' coverage counter in > xmalloc_size_align(). > * Fixed style of a couple of comments. > * Renamed underscored functions from netdev-afxdp-pool.c to be more OVS-style. > Ex.: __umem_elem_pop_n --> umem_elem_pop_n__ > > > Best regards, Ilya Maximets. Thanks for your final review and fixes. William
Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
On 18.07.2019 23:11, William Tu wrote: > The patch introduces experimental AF_XDP support for OVS netdev. > AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket > type built upon the eBPF and XDP technology. It is aims to have comparable > performance to DPDK but cooperate better with existing kernel's networking > stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program > attached to the netdev, by-passing a couple of Linux kernel's subsystems > As a result, AF_XDP socket shows much better performance than AF_PACKET > For more details about AF_XDP, please see linux kernel's > Documentation/networking/af_xdp.rst. Note that by default, this feature is > not compiled in. > > Signed-off-by: William Tu Thanks, William, Eelco and Ben! I fixed couple of things and applied to master! List of changes: * Dropped config.h from headers. * Removed double increment of 'util_xalloc' coverage counter in xmalloc_size_align(). * Fixed style of a couple of comments. * Renamed underscored functions from netdev-afxdp-pool.c to be more OVS-style. Ex.: __umem_elem_pop_n --> umem_elem_pop_n__ Best regards, Ilya Maximets.
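For readers wanting to try the now-merged feature: per the patch's added documentation (Documentation/intro/install/afxdp.rst), AF_XDP support is compiled in with --enable-afxdp and a port is attached by setting the interface type to "afxdp". A minimal sketch, assuming a userspace (netdev) datapath bridge br0 and an AF_XDP-capable NIC named eth2:

```shell
# Build-time: the AF_XDP netdev is experimental and not compiled in by default.
./configure --enable-afxdp
make && make install

# Run-time: AF_XDP ports require the userspace datapath.
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 eth2 -- set interface eth2 type="afxdp"
```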
[ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.
The patch introduces experimental AF_XDP support for OVS netdev. AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket type built upon the eBPF and XDP technology. It aims to have performance comparable to DPDK while cooperating better with the existing kernel networking stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program attached to the netdev, bypassing a couple of the Linux kernel's subsystems. As a result, an AF_XDP socket shows much better performance than AF_PACKET. For more details about AF_XDP, please see the linux kernel's Documentation/networking/af_xdp.rst. Note that by default, this feature is not compiled in. Signed-off-by: William Tu --- v15: * address review feedback from Ilya https://patchwork.ozlabs.org/patch/1125476/ * skip TCP related test cases * reclaim all CONS_NUM_DESC at complete tx * add retries to kick_tx * increase memory pool size * remove redundant xdp flag and bind flag * remove unused rx_dropped var * make tx_dropped counter atomic * refactor dp_packet_init_afxdp using dp_packet_init__ * rebase to ovs master, test with latest bpf-next kernel commit b14a260e33ddb4 Ilya's kernel patches are required: commit 455302d1c9ae ("xdp: fix hang while unregistering device bound to xdp socket") commit 162c820ed896 ("xdp: hold device for umem regardless of zero-copy mode") Possible issues: * still lots of afxdp_cq_skip (ovs-appctl coverage/show) afxdp_cq_skip 44325273.6/sec 34362312.683/sec 572705.2114/sec total: 2106010377 * TODO: 'make check-afxdp' still not all pass IP fragmentation expiry test not fixed yet, need to implement deferred memory free, similar to dpdk_mp_sweep. Currently hit some missing umem descs when reclaiming.
NSH test case still failed (not due to afxdp) v16: * address feedbacks from Ilya * add deferral memory free * add afxdp testsuites files to gitignore v17: * address feedbacks from Ilya and Ben https://patchwork.ozlabs.org/patch/1131547/ * ovs_spin_lock: add pthread_spin_lock checks, fix typo * update NEWS * add pthread_spin_lock check at OVS_CHECK_LINUX_AF_XDP * fix bug in xmalloc_size_align * rename xdpsock to netdev-afxdp-pool * remove struct umem_elem, use void * * fix style and comments * fix afxdp.rst * rebase to OVS master, tested on kernel 5.2.0-rc6 Note: I still leave the last_tsc in pmd_perf_stats the same as v16 v18: * address feedbacks from Ilya and Ben https://patchwork.ozlabs.org/patch/1133416/ * update document about tcp and reconfiguration * fix leak in tx spin locks * refactor __umem_pool alloc and assert * refactor macro and defines used in netdev-afxdp[-pool] * refactor the xpool->array to remove using type casting * add empty netdev_afxdp_rxq_destruct to avoid closing afxdp socket --- Documentation/automake.mk |1 + Documentation/index.rst |1 + Documentation/intro/install/afxdp.rst | 432 ++ Documentation/intro/install/index.rst |1 + NEWS |1 + acinclude.m4 | 35 ++ configure.ac |1 + lib/automake.mk | 10 + lib/dp-packet.c | 23 + lib/dp-packet.h | 18 +- lib/dpif-netdev-perf.h| 24 + lib/netdev-afxdp-pool.c | 167 ++ lib/netdev-afxdp-pool.h | 58 ++ lib/netdev-afxdp.c| 1041 + lib/netdev-afxdp.h| 73 +++ lib/netdev-linux-private.h| 132 + lib/netdev-linux.c| 126 ++-- lib/netdev-provider.h |3 + lib/netdev.c | 11 + lib/util.c| 92 ++- lib/util.h|5 + tests/.gitignore |3 + tests/automake.mk | 16 + tests/system-afxdp-macros.at | 39 ++ tests/system-afxdp-testsuite.at | 26 + tests/system-traffic.at |2 + vswitchd/vswitch.xml | 15 + 27 files changed, 2247 insertions(+), 109 deletions(-) create mode 100644 Documentation/intro/install/afxdp.rst create mode 100644 lib/netdev-afxdp-pool.c create mode 100644 lib/netdev-afxdp-pool.h create mode 100644 lib/netdev-afxdp.c 
create mode 100644 lib/netdev-afxdp.h create mode 100644 lib/netdev-linux-private.h create mode 100644 tests/system-afxdp-macros.at create mode 100644 tests/system-afxdp-testsuite.at diff --git a/Documentation/automake.mk b/Documentation/automake.mk index 8472921746ba..2a3214a3cc7f 100644 --- a/Documentation/automake.mk +++ b/Documentation/automake.mk @@ -10,6 +10,7 @@ DOC_SOURCE = \ Documentation/intro/why-ovs.rst \ Documentation/intro/install/index.rst \ Documentation/intro/install/bash-completion.rst \ + Documentation/intro