Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-23 Thread William Tu
On Fri, Aug 23, 2019 at 9:59 AM Ilya Maximets  wrote:
>
> On 23.08.2019 19:08, William Tu wrote:
> > On Wed, Aug 21, 2019 at 2:31 AM Eelco Chaudron  wrote:
> >>
> >>
> >>
> > William, Eelco, which HW NIC you're using? Which kernel driver?
> 
>  I’m using the below on the latest bpf-next driver:
> 
>  01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
>  SFI/SFP+ Network Connection (rev 01)
>  01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
>  SFI/SFP+ Network Connection (rev 01)
> >>>
> >>> Thanks for information.
> >>> I found one suspicious place inside the ixgbe driver that could break
> >>> the completion queue ring and prepared a patch:
> >>> 
> >>> https://patchwork.ozlabs.org/patch/1150244/
> >>>
> >>> It'll be good if you can test it.
> >>
> >> Hi Ilya, I was doing some testing of my own, and also concluded it was
> >> in the driver's completion ring. I noticed that after sending 512 packets the
> >> driver's TX counters kept increasing, which looks related to your fix.
> >>
> >> Will try it out, and send the results to the upstream mailing list…
> >>
> >> Thanks,
> >>
> >> Eelco
> >
> > Hi,
> >
> > I'm comparing the performance of netdev-afxdp.c on current master and
> > the DPDK's AF_XDP implementation in OVS dpdk-latest branch.
> > I'm using ixgbe and doing physical port to physical port forwarding, sending
> > 64 byte packets, with OpenFlow rule:
> >   ovs-ofctl  add-flow br0  "in_port=eth2, actions=output:eth3"
> >
> > In short:
> > A. OVS's netdev-afxdp: 6.1 Mpps
> > B. OVS-DPDK AF_XDP pmd: 8 Mpps
> > So I started to think about how to optimize lib/netdev-afxdp.c. Any comments
> > are welcome! Below is the analysis:
>
> One major difference is that the DPDK implementation supports XDP_USE_NEED_WAKEUP,
> and it will be in use if you're building the kernel from the latest bpf-next tree.
> This allows significantly decreasing the number of syscalls.
> According to the perf stats below, the OVS implementation, unlike the DPDK one, wastes
> ~11% of its time inside the kernel, and this could be fixed by the need_wakeup feature.

Cool, thank you.
I will look at how to use XDP_USE_NEED_WAKEUP
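For reference, a minimal sketch of how need_wakeup is typically wired up with
libbpf's xsk API -- illustrative only, not the eventual OVS change; 'xsk', 'tx'
and 'fq' below are placeholder names for the socket, TX ring and fill ring:

    /* Sketch: assumes libbpf and a kernel with XDP_USE_NEED_WAKEUP support. */
    struct xsk_socket_config cfg = {
        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        .bind_flags = XDP_USE_NEED_WAKEUP, /* Kernel flags the ring when a kick is needed. */
    };
    /* ... xsk_socket__create(&xsk, ifname, queue_id, umem, &rx, &tx, &cfg) ... */

    /* TX path: only pay for the syscall when the kernel asks for a wakeup. */
    if (xsk_ring_prod__needs_wakeup(&tx)) {
        sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
    }

    /* Fill queue: likewise, poll() only when the kernel requests it. */
    if (xsk_ring_prod__needs_wakeup(&fq)) {
        struct pollfd pfd = { .fd = xsk_socket__fd(xsk), .events = POLLIN };
        poll(&pfd, 1, 0);
    }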

>
> BTW, there are a lot of pmd threads in case A, but only one in case B.
> Was the test setup really equal?

Yes, they should be equal.
In case A I accidentally set pmd-cpu-mask=0xf0,
so it uses more CPUs, but I always enable only one queue and pmd-stats-show
shows the other pmds are doing nothing. Will fix it next time.
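For the record, restricting the datapath to a single PMD core is roughly the
following (the mask value is illustrative, picking core 6 as in the earlier
stats, not the exact command history from this test):

    # Pin the userspace datapath to core 6 only and re-check the PMD layout.
    ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x40
    ovs-appctl dpif-netdev/pmd-rxq-show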

Regards,
William


Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-23 Thread Ilya Maximets
On 23.08.2019 19:08, William Tu wrote:
> On Wed, Aug 21, 2019 at 2:31 AM Eelco Chaudron  wrote:
>>
>>
>>
> William, Eelco, which HW NIC you're using? Which kernel driver?

 I’m using the below on the latest bpf-next driver:

 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
 SFI/SFP+ Network Connection (rev 01)
 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
 SFI/SFP+ Network Connection (rev 01)
>>>
>>> Thanks for information.
>>> I found one suspicious place inside the ixgbe driver that could break
>>> the completion queue ring and prepared a patch:
>>> 
>>> https://patchwork.ozlabs.org/patch/1150244/
>>>
>>> It'll be good if you can test it.
>>
>> Hi Ilya, I was doing some testing of my own, and also concluded it was
>> in the driver's completion ring. I noticed that after sending 512 packets the
>> driver's TX counters kept increasing, which looks related to your fix.
>>
>> Will try it out, and send the results to the upstream mailing list…
>>
>> Thanks,
>>
>> Eelco
> 
> Hi,
> 
> I'm comparing the performance of netdev-afxdp.c on current master and
> the DPDK's AF_XDP implementation in OVS dpdk-latest branch.
> I'm using ixgbe and doing physical port to physical port forwarding, sending
> 64 byte packets, with OpenFlow rule:
>   ovs-ofctl  add-flow br0  "in_port=eth2, actions=output:eth3"
> 
> In short:
> A. OVS's netdev-afxdp: 6.1 Mpps
> B. OVS-DPDK AF_XDP pmd: 8 Mpps
> So I started to think about how to optimize lib/netdev-afxdp.c. Any comments
> are welcome! Below is the analysis:

One major difference is that the DPDK implementation supports XDP_USE_NEED_WAKEUP,
and it will be in use if you're building the kernel from the latest bpf-next tree.
This allows significantly decreasing the number of syscalls.
According to the perf stats below, the OVS implementation, unlike the DPDK one, wastes
~11% of its time inside the kernel, and this could be fixed by the need_wakeup feature.

BTW, there are a lot of pmd threads in case A, but only one in case B.
Was the test setup really equal?

Best regards, Ilya Maximets.

> 
> A. OVS netdev-afxdp Physical to physical 6.1Mpps
> # pstree -p 702
> ovs-vswitchd(702)-+-{ct_clean1}(706)
>                   |-{handler4}(712)
>                   |-{ipf_clean2}(707)
>                   |-{pmd6}(790)
>                   |-{pmd7}(791)
>                   |-{pmd8}(792)
>                   |-{pmd9}(793)
>                   |-{revalidator5}(713)
>                   `-{urcu3}(708)
> 
> # ovs-appctl  dpif-netdev/pmd-stats-show
> pmd thread numa_id 0 core_id 6:
>   packets received: 92290351
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 92290319
>   smc hits: 0
>   megaflow hits: 31
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 0
>   avg. packets per output batch: 31.88
>   idle cycles: 20835727677 (34.86%)   --> pretty high!?
>   processing cycles: 38932097052 (65.14%)
>   avg cycles per packet: 647.61 (59767824729/92290351)
>   avg processing cycles per packet: 421.84 (38932097052/92290351)
> 
> # ./perf record -t 790 sleep 10
>   13.80%  pmd6  ovs-vswitchd        [.] miniflow_extract
>   13.58%  pmd6  ovs-vswitchd        [.] __netdev_afxdp_batch_send
>    9.64%  pmd6  ovs-vswitchd        [.] dp_netdev_input__
>    9.07%  pmd6  ovs-vswitchd        [.] dp_packet_init__
>    8.91%  pmd6  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
>    7.40%  pmd6  ovs-vswitchd        [.] miniflow_hash_5tuple
>    5.32%  pmd6  libc-2.23.so        [.] __memcpy_avx_unaligned
>    4.60%  pmd6  [kernel.vmlinux]    [k] do_syscall_64
>    3.72%  pmd6  ovs-vswitchd        [.] dp_packet_use_afxdp   --> maybe optimize?
>    2.74%  pmd6  libpthread-2.23.so  [.] __pthread_enable_asynccancel
>    2.43%  pmd6  [kernel.vmlinux]    [k] fput_many
>    2.18%  pmd6  libc-2.23.so        [.] __memcmp_sse4_1
>    2.06%  pmd6  [kernel.vmlinux]    [k] entry_SYSCALL_64
>    1.79%  pmd6  [kernel.vmlinux]    [k] syscall_return_via_sysret
>    1.71%  pmd6  ovs-vswitchd        [.] dp_execute_cb
>    1.03%  pmd6  ovs-vswitchd        [.] non_atomic_ullong_add
>    0.86%  pmd6  ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port
> 
> B. OVS-DPDK afxdp using dpdk-latest 8Mpps
> ovs-vswitchd(19462)-+-{ct_clean3}(19470)
> |-{dpdk_watchdog1}(19468)
> |-{eal-intr-thread}(19466)
> |-{handler16}(19501)
> |-{handler17}(19505)
> |-{handler18}(19506)
> |-{handler19}(19507)
> |-{handler20}(19508)
> |-{handler22}(19502)
> |-{handler24}(19504)
> |-{handler26}(19503)
> |-{ipf_clean4}(19471)
> |-{pmd27}(19536)
> |-{revalidator21}(19509)
>  

Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-23 Thread William Tu
On Wed, Aug 21, 2019 at 2:31 AM Eelco Chaudron  wrote:
>
>
>
> >>> William, Eelco, which HW NIC you're using? Which kernel driver?
> >>
> >> I’m using the below on the latest bpf-next driver:
> >>
> >> 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> >> SFI/SFP+ Network Connection (rev 01)
> >> 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> >> SFI/SFP+ Network Connection (rev 01)
> >
> > Thanks for information.
> > I found one suspicious place inside the ixgbe driver that could break
> > the completion queue ring and prepared a patch:
> > https://patchwork.ozlabs.org/patch/1150244/
> >
> > It'll be good if you can test it.
>
> Hi Ilya, I was doing some testing of my own, and also concluded it was
> in the driver's completion ring. I noticed that after sending 512 packets the
> driver's TX counters kept increasing, which looks related to your fix.
>
> Will try it out, and send the results to the upstream mailing list…
>
> Thanks,
>
> Eelco

Hi,

I'm comparing the performance of netdev-afxdp.c on current master and
the DPDK's AF_XDP implementation in OVS dpdk-latest branch.
I'm using ixgbe and doing physical port to physical port forwarding, sending
64 byte packets, with OpenFlow rule:
  ovs-ofctl  add-flow br0  "in_port=eth2, actions=output:eth3"
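(For completeness, the AF_XDP ports in case A are attached along these lines; a
sketch using the interface names above, exact options may differ:)

    ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
    ovs-vsctl add-port br0 eth2 -- set interface eth2 type="afxdp"
    ovs-vsctl add-port br0 eth3 -- set interface eth3 type="afxdp"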

In short:
A. OVS's netdev-afxdp: 6.1 Mpps
B. OVS-DPDK AF_XDP pmd: 8 Mpps
So I started to think about how to optimize lib/netdev-afxdp.c. Any comments
are welcome! Below is the analysis:

A. OVS netdev-afxdp Physical to physical 6.1Mpps
# pstree -p 702
ovs-vswitchd(702)-+-{ct_clean1}(706)
                  |-{handler4}(712)
                  |-{ipf_clean2}(707)
                  |-{pmd6}(790)
                  |-{pmd7}(791)
                  |-{pmd8}(792)
                  |-{pmd9}(793)
                  |-{revalidator5}(713)
                  `-{urcu3}(708)

# ovs-appctl  dpif-netdev/pmd-stats-show
pmd thread numa_id 0 core_id 6:
  packets received: 92290351
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 92290319
  smc hits: 0
  megaflow hits: 31
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 0
  avg. packets per output batch: 31.88
  idle cycles: 20835727677 (34.86%)   --> pretty high!?
  processing cycles: 38932097052 (65.14%)
  avg cycles per packet: 647.61 (59767824729/92290351)
  avg processing cycles per packet: 421.84 (38932097052/92290351)

# ./perf record -t 790 sleep 10
  13.80%  pmd6  ovs-vswitchd        [.] miniflow_extract
  13.58%  pmd6  ovs-vswitchd        [.] __netdev_afxdp_batch_send
   9.64%  pmd6  ovs-vswitchd        [.] dp_netdev_input__
   9.07%  pmd6  ovs-vswitchd        [.] dp_packet_init__
   8.91%  pmd6  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
   7.40%  pmd6  ovs-vswitchd        [.] miniflow_hash_5tuple
   5.32%  pmd6  libc-2.23.so        [.] __memcpy_avx_unaligned
   4.60%  pmd6  [kernel.vmlinux]    [k] do_syscall_64
   3.72%  pmd6  ovs-vswitchd        [.] dp_packet_use_afxdp   --> maybe optimize?
   2.74%  pmd6  libpthread-2.23.so  [.] __pthread_enable_asynccancel
   2.43%  pmd6  [kernel.vmlinux]    [k] fput_many
   2.18%  pmd6  libc-2.23.so        [.] __memcmp_sse4_1
   2.06%  pmd6  [kernel.vmlinux]    [k] entry_SYSCALL_64
   1.79%  pmd6  [kernel.vmlinux]    [k] syscall_return_via_sysret
   1.71%  pmd6  ovs-vswitchd        [.] dp_execute_cb
   1.03%  pmd6  ovs-vswitchd        [.] non_atomic_ullong_add
   0.86%  pmd6  ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port

B. OVS-DPDK afxdp using dpdk-latest 8Mpps
ovs-vswitchd(19462)-+-{ct_clean3}(19470)
|-{dpdk_watchdog1}(19468)
|-{eal-intr-thread}(19466)
|-{handler16}(19501)
|-{handler17}(19505)
|-{handler18}(19506)
|-{handler19}(19507)
|-{handler20}(19508)
|-{handler22}(19502)
|-{handler24}(19504)
|-{handler26}(19503)
|-{ipf_clean4}(19471)
|-{pmd27}(19536)
|-{revalidator21}(19509)
|-{revalidator23}(19511)
|-{revalidator25}(19510)
|-{rte_mp_handle}(19467)
`-{urcu2}(19469)

# ovs-appctl  dpif-netdev/pmd-stats-show
pmd thread numa_id 0 core_id 11:
  packets received: 1813689117
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 1813689053
  smc hits: 0
  megaflow hits: 63
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 0
  avg. packets per output batch: 31.85
  idle cycles: 13848892341 (2.50%)
  processing cycles: 541064826249 (97.50%)
  avg cycles per packet: 305.96 (554913718590/1813689117)
  avg processing cycles per packet: 298.32 (541064826249/1813689117)

#  ./perf record -t 19536 sleep 10
  24.84%  pmd27 

Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-21 Thread Eelco Chaudron



On 20 Aug 2019, at 17:20, Ilya Maximets wrote:


On 20.08.2019 14:19, Eelco Chaudron wrote:



On 20 Aug 2019, at 12:10, Ilya Maximets wrote:


On 14.08.2019 19:16, William Tu wrote:
On Wed, Aug 14, 2019 at 7:58 AM William Tu  
wrote:


On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron 
 wrote:




On 8 Aug 2019, at 17:38, Ilya Maximets wrote:



I see a rather high number of afxdp_cq_skip, which should to my
knowledge never happen?


I tried to investigate this previously, but didn't find anything
suspicious. So, to my knowledge, this should never happen either.
However, I only looked at the code without actually running it,
because I had no HW available for testing.

While investigating and stress-testing virtual ports I found a few
issues with missing locking inside the kernel, so I have no trust in
the kernel part of the XDP implementation. I suspect that there are
some other bugs in kernel/libbpf that can only be reproduced with
driver mode.

This never happens for virtual ports with SKB mode, so I never saw
this coverage counter being non-zero.


Did some quick debugging, as something else has come up that needs my
attention :)

But once I’m in a faulty state and sent a single packet, causing
afxdp_complete_tx() to be called, it tells me 2048 descriptors are
ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
there might be some ring management bug. Maybe consumer and receiver
are equal, meaning 0 buffers, but it returns max? I did not look at
the kernel code, so this is just a wild guess :)

(gdb) p tx_done
$3 = 2048

(gdb) p umem->cq
$4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
2047, size = 2048, producer = 0x7f08486b8000, consumer =
0x7f08486b8040, ring = 0x7f08486b8080}


Thanks for debugging!

xsk_ring_cons__peek() just returns the difference between cached_prod
and cached_cons, but these values are too different:

3830466864 - 3578066899 = 252399965

Since this value > requested, it returns the requested number (2048).

So, the ring is broken. At least its 'cached' part is broken. It'll be
good to look at *consumer and *producer values to verify the state of
the actual ring.


I’ll try to find some more time next week to debug further.

William, I noticed your email in xdp-newbies where you mention this
problem of getting the wrong pointers. Did you ever follow up, or did
further troubleshooting on the above?


Yes, I posted here:
https://www.spinics.net/lists/xdp-newbies/msg00956.html
"Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"

At that time I was thinking about reproducing the problem using the
xdpsock sample code from the kernel. But it turned out that my
reproduction code was not correct, so it was not able to show the case
we hit here in OVS.

Then I put more of the OVS code logic into xdpsock, but the problem
does not show up. As a result, I worked around it by marking the addr
as "*addr == UINT64_MAX".

I will debug again this week once I get my testbed back.


Just to refresh my memory, the original issue is that when calling:

tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
xsk_ring_cons__release(&umem->cq, tx_done);

I expect there are 'tx_done' elems on the CQ to recycle back to the
memory pool. However, when I inspect these elems, I found some elems
that had 'already' been reported complete the last time I called
xsk_ring_cons__peek. In other words, some elems show up on the CQ
twice, and this causes overflow of the mempool.

Thus, mark the elems on the CQ as UINT64_MAX to indicate that we have
already seen this elem.


William, Eelco, which HW NIC are you using? Which kernel driver?


I’m using the below on the latest bpf-next driver:

01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit 
SFI/SFP+ Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit 
SFI/SFP+ Network Connection (rev 01)


Thanks for the information.
I found one suspicious place inside the ixgbe driver that could break
the completion queue ring and prepared a patch:
https://patchwork.ozlabs.org/patch/1150244/

It'll be good if you can test it.


Hi Ilya, I was doing some testing of my own, and also concluded it was
in the driver's completion ring. I noticed that after sending 512 packets the
driver's TX counters kept increasing, which looks related to your fix.


Will try it out, and send the results to the upstream mailing list…

Thanks,

Eelco


Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-20 Thread Ilya Maximets
On 20.08.2019 14:19, Eelco Chaudron wrote:
> 
> 
> On 20 Aug 2019, at 12:10, Ilya Maximets wrote:
> 
>> On 14.08.2019 19:16, William Tu wrote:
>>> On Wed, Aug 14, 2019 at 7:58 AM William Tu  wrote:

 On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron  wrote:
>
>
>
> On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
>
> 
>
> I see a rather high number of afxdp_cq_skip, which should to my
> knowledge never happen?

 I tried to investigate this previously, but didn't find anything
 suspicious.
 So, for my knowledge, this should never happen too.
 However, I only looked at the code without actually running, because
 I had no
 HW available for testing.

 While investigation and stress-testing virtual ports I found few
 issues with
 missing locking inside the kernel, so there is no trust for kernel
 part of XDP
 implementation from my side. I'm suspecting that there are some
 other bugs in
 kernel/libbpf that only could be reproduced with driver mode.

 This never happens for virtual ports with SKB mode, so I never saw
 this coverage
 counter being non-zero.
>>>
>>> Did some quick debugging, as something else has come up that needs my
>>> attention :)
>>>
>>> But once I’m in a faulty state and sent a single packet, causing
>>> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
>>> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
>>> there might be some ring management bug. Maybe consumer and receiver
>>> are equal meaning 0 buffers, but it returns max? I did not look at
>>> the kernel code, so this is just a wild guess :)
>>>
>>> (gdb) p tx_done
>>> $3 = 2048
>>>
>>> (gdb) p umem->cq
>>> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
>>> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
>>> 0x7f08486b8040, ring = 0x7f08486b8080}
>>
>> Thanks for debugging!
>>
>> xsk_ring_cons__peek() just returns the difference between cached_prod
>> and cached_cons, but these values are too different:
>>
>> 3830466864 - 3578066899 = 252399965
>>
>> Since this value > requested, it returns requested number (2048).
>>
>> So, the ring is broken. At least broken its 'cached' part. It'll be
>> good
>> to look at *consumer and *producer values to verify the state of the
>> actual ring.
>>
>
> I’ll try to find some more time next week to debug further.
>
> William I noticed your email in xdp-newbies where you mention this
> problem of getting the wrong pointers. Did you ever follow up, or did
> further trouble shooting on the above?

 Yes, I posted here
 https://www.spinics.net/lists/xdp-newbies/msg00956.html
 "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"

 At that time I was thinking about reproducing the problem using the
 xdpsock sample code from kernel. But turned out that my reproduction
 code is not correct, so not able to show the case we hit here in OVS.

 Then I put more similar code logic from OVS to xdpsock, but the problem
 does not show up. As a result, I worked around it by marking addr as
 "*addr == UINT64_MAX".

 I will debug again this week once I get my testbed back.

>>> Just to refresh my memory. The original issue is that
>>> when calling:
>>> tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
>>> xsk_ring_cons__release(&umem->cq, tx_done);
>>>
>>> I expect there are 'tx_done' elems on the CQ to re-cycle back to memory 
>>> pool.
>>> However, when I inspect these elems, I found some elems that 'already' been
>>> reported complete last time I call xsk_ring_cons__peek. In other word, some
>>> elems show up at CQ twice. And this cause overflow of the mempool.
>>>
>>> Thus, mark the elems on CQ as UINT64_MAX to indicate that we already
>>> seen this elem.
>>
>> William, Eelco, which HW NIC you're using? Which kernel driver?
> 
> I’m using the below on the latest bpf-next driver:
> 
> 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ 
> Network Connection (rev 01)
> 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ 
> Network Connection (rev 01)

Thanks for the information.
I found one suspicious place inside the ixgbe driver that could break
the completion queue ring and prepared a patch:
https://patchwork.ozlabs.org/patch/1150244/

It'll be good if you can test it.

Best regards, Ilya Maximets.


Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-20 Thread Eelco Chaudron



On 20 Aug 2019, at 12:10, Ilya Maximets wrote:


On 14.08.2019 19:16, William Tu wrote:
On Wed, Aug 14, 2019 at 7:58 AM William Tu  
wrote:


On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron  
wrote:




On 8 Aug 2019, at 17:38, Ilya Maximets wrote:




I see a rather high number of afxdp_cq_skip, which should to my
knowledge never happen?


I tried to investigate this previously, but didn't find anything
suspicious.
So, for my knowledge, this should never happen too.
However, I only looked at the code without actually running, 
because

I had no
HW available for testing.

While investigation and stress-testing virtual ports I found few
issues with
missing locking inside the kernel, so there is no trust for 
kernel

part of XDP
implementation from my side. I'm suspecting that there are some
other bugs in
kernel/libbpf that only could be reproduced with driver mode.

This never happens for virtual ports with SKB mode, so I never 
saw

this coverage
counter being non-zero.


Did some quick debugging, as something else has come up that 
needs my

attention :)

But once I’m in a faulty state and sent a single packet, 
causing
afxdp_complete_tx() to be called, it tells me 2048 descriptors 
are

ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
there might be some ring management bug. Maybe consumer and 
receiver
are equal meaning 0 buffers, but it returns max? I did not look 
at

the kernel code, so this is just a wild guess :)

(gdb) p tx_done
$3 = 2048

(gdb) p umem->cq
$4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
2047, size = 2048, producer = 0x7f08486b8000, consumer =
0x7f08486b8040, ring = 0x7f08486b8080}


Thanks for debugging!

xsk_ring_cons__peek() just returns the difference between 
cached_prod

and cached_cons, but these values are too different:

3830466864 - 3578066899 = 252399965

Since this value > requested, it returns requested number (2048).

So, the ring is broken. At least broken its 'cached' part. It'll 
be

good
to look at *consumer and *producer values to verify the state of 
the

actual ring.



I’ll try to find some more time next week to debug further.

William I noticed your email in xdp-newbies where you mention this
problem of getting the wrong pointers. Did you ever follow up, or 
did

further trouble shooting on the above?


Yes, I posted here
https://www.spinics.net/lists/xdp-newbies/msg00956.html
"Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"

At that time I was thinking about reproducing the problem using the
xdpsock sample code from kernel. But turned out that my reproduction
code is not correct, so not able to show the case we hit here in 
OVS.


Then I put more similar code logic from OVS to xdpsock, but the 
problem

does not show up. As a result, I worked around it by marking addr as
"*addr == UINT64_MAX".

I will debug again this week once I get my testbed back.


Just to refresh my memory. The original issue is that
when calling:
tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
xsk_ring_cons__release(&umem->cq, tx_done);

I expect there are 'tx_done' elems on the CQ to re-cycle back to 
memory pool.
However, when I inspect these elems, I found some elems that 
'already' been
reported complete last time I call xsk_ring_cons__peek. In other 
word, some

elems show up at CQ twice. And this cause overflow of the mempool.

Thus, mark the elems on CQ as UINT64_MAX to indicate that we already
seen this elem.


William, Eelco, which HW NIC are you using? Which kernel driver?


I’m using the below on the latest bpf-next driver:

01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit 
SFI/SFP+ Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit 
SFI/SFP+ Network Connection (rev 01)


//Eelco



Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-20 Thread Ilya Maximets
On 14.08.2019 19:16, William Tu wrote:
> On Wed, Aug 14, 2019 at 7:58 AM William Tu  wrote:
>>
>> On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron  wrote:
>>>
>>>
>>>
>>> On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
>>>
>>> 
>>>
>>> I see a rather high number of afxdp_cq_skip, which should to my
>>> knowledge never happen?
>>
>> I tried to investigate this previously, but didn't find anything
>> suspicious.
>> So, for my knowledge, this should never happen too.
>> However, I only looked at the code without actually running, because
>> I had no
>> HW available for testing.
>>
>> While investigation and stress-testing virtual ports I found few
>> issues with
>> missing locking inside the kernel, so there is no trust for kernel
>> part of XDP
>> implementation from my side. I'm suspecting that there are some
>> other bugs in
>> kernel/libbpf that only could be reproduced with driver mode.
>>
>> This never happens for virtual ports with SKB mode, so I never saw
>> this coverage
>> counter being non-zero.
>
> Did some quick debugging, as something else has come up that needs my
> attention :)
>
> But once I’m in a faulty state and sent a single packet, causing
> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
> there might be some ring management bug. Maybe consumer and receiver
> are equal meaning 0 buffers, but it returns max? I did not look at
> the kernel code, so this is just a wild guess :)
>
> (gdb) p tx_done
> $3 = 2048
>
> (gdb) p umem->cq
> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
> 0x7f08486b8040, ring = 0x7f08486b8080}

 Thanks for debugging!

 xsk_ring_cons__peek() just returns the difference between cached_prod
 and cached_cons, but these values are too different:

 3830466864 - 3578066899 = 252399965

 Since this value > requested, it returns requested number (2048).

 So, the ring is broken. At least broken its 'cached' part. It'll be
 good
 to look at *consumer and *producer values to verify the state of the
 actual ring.

>>>
>>> I’ll try to find some more time next week to debug further.
>>>
>>> William I noticed your email in xdp-newbies where you mention this
>>> problem of getting the wrong pointers. Did you ever follow up, or did
>>> further trouble shooting on the above?
>>
>> Yes, I posted here
>> https://www.spinics.net/lists/xdp-newbies/msg00956.html
>> "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"
>>
>> At that time I was thinking about reproducing the problem using the
>> xdpsock sample code from kernel. But turned out that my reproduction
>> code is not correct, so not able to show the case we hit here in OVS.
>>
>> Then I put more similar code logic from OVS to xdpsock, but the problem
>> does not show up. As a result, I worked around it by marking addr as
>> "*addr == UINT64_MAX".
>>
>> I will debug again this week once I get my testbed back.
>>
> Just to refresh my memory. The original issue is that
> when calling:
> tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
> xsk_ring_cons__release(&umem->cq, tx_done);
> 
> I expect there are 'tx_done' elems on the CQ to re-cycle back to memory pool.
> However, when I inspect these elems, I found some elems that 'already' been
> reported complete last time I call xsk_ring_cons__peek. In other word, some
> elems show up at CQ twice. And this cause overflow of the mempool.
> 
> Thus, mark the elems on CQ as UINT64_MAX to indicate that we already
> seen this elem.

William, Eelco, which HW NIC are you using? Which kernel driver?

Best regards, Ilya Maximets.


Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-14 Thread William Tu
On Wed, Aug 14, 2019 at 7:58 AM William Tu  wrote:
>
> On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron  wrote:
> >
> >
> >
> > On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
> >
> > 
> >
> >  I see a rather high number of afxdp_cq_skip, which should to my
> >  knowledge never happen?
> > >>>
> > >>> I tried to investigate this previously, but didn't find anything
> > >>> suspicious.
> > >>> So, for my knowledge, this should never happen too.
> > >>> However, I only looked at the code without actually running, because
> > >>> I had no
> > >>> HW available for testing.
> > >>>
> > >>> While investigation and stress-testing virtual ports I found few
> > >>> issues with
> > >>> missing locking inside the kernel, so there is no trust for kernel
> > >>> part of XDP
> > >>> implementation from my side. I'm suspecting that there are some
> > >>> other bugs in
> > >>> kernel/libbpf that only could be reproduced with driver mode.
> > >>>
> > >>> This never happens for virtual ports with SKB mode, so I never saw
> > >>> this coverage
> > >>> counter being non-zero.
> > >>
> > >> Did some quick debugging, as something else has come up that needs my
> > >> attention :)
> > >>
> > >> But once I’m in a faulty state and sent a single packet, causing
> > >> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
> > >> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
> > >> there might be some ring management bug. Maybe consumer and receiver
> > >> are equal meaning 0 buffers, but it returns max? I did not look at
> > >> the kernel code, so this is just a wild guess :)
> > >>
> > >> (gdb) p tx_done
> > >> $3 = 2048
> > >>
> > >> (gdb) p umem->cq
> > >> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
> > >> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
> > >> 0x7f08486b8040, ring = 0x7f08486b8080}
> > >
> > > Thanks for debugging!
> > >
> > > xsk_ring_cons__peek() just returns the difference between cached_prod
> > > and cached_cons, but these values are too different:
> > >
> > > 3830466864 - 3578066899 = 252399965
> > >
> > > Since this value > requested, it returns requested number (2048).
> > >
> > > So, the ring is broken. At least broken its 'cached' part. It'll be
> > > good
> > > to look at *consumer and *producer values to verify the state of the
> > > actual ring.
> > >
> >
> > I’ll try to find some more time next week to debug further.
> >
> > William I noticed your email in xdp-newbies where you mention this
> > problem of getting the wrong pointers. Did you ever follow up, or did
> > further trouble shooting on the above?
>
> Yes, I posted here
> https://www.spinics.net/lists/xdp-newbies/msg00956.html
> "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"
>
> At that time I was thinking about reproducing the problem using the
> xdpsock sample code from kernel. But turned out that my reproduction
> code is not correct, so not able to show the case we hit here in OVS.
>
> Then I put more similar code logic from OVS to xdpsock, but the problem
> does not show up. As a result, I worked around it by marking addr as
> "*addr == UINT64_MAX".
>
> I will debug again this week once I get my testbed back.
>
Just to refresh my memory. The original issue is that
when calling:
tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
xsk_ring_cons__release(&umem->cq, tx_done);

I expect there are 'tx_done' elems on the CQ to recycle back to the memory pool.
However, when I inspect these elems, I found some elems that had 'already' been
reported complete the last time I called xsk_ring_cons__peek. In other words, some
elems show up on the CQ twice, and this causes overflow of the mempool.

Thus, mark the elems on the CQ as UINT64_MAX to indicate that we have already
seen this elem.
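In code form, the workaround described above looks roughly like this; a
simplified sketch, not the exact OVS source, where recycle_to_mpool() is a
hypothetical stand-in for the umem_elem_push_n() batching in the real code:

    /* Sketch of the CQ recycle loop with the UINT64_MAX "already seen" marker. */
    static void
    complete_tx_sketch(struct xsk_umem_info *umem)
    {
        uint32_t idx_cq = 0;
        unsigned int tx_done, j;

        tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);

        for (j = 0; j < tx_done; j++) {
            uint64_t *addr =
                (uint64_t *) xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);

            if (*addr == UINT64_MAX) {
                /* This descriptor was already recycled on a previous peek. */
                COVERAGE_INC(afxdp_cq_skip);
                continue;
            }
            recycle_to_mpool(umem, *addr);   /* Return the buffer to the mempool. */
            *addr = UINT64_MAX;              /* Mark as seen. */
        }

        if (tx_done > 0) {
            xsk_ring_cons__release(&umem->cq, tx_done);
        }
    }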

William


Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-14 Thread William Tu
On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron  wrote:
>
>
>
> On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
>
> 
>
>  I see a rather high number of afxdp_cq_skip, which should to my
>  knowledge never happen?
> >>>
> >>> I tried to investigate this previously, but didn't find anything
> >>> suspicious.
> >>> So, for my knowledge, this should never happen too.
> >>> However, I only looked at the code without actually running, because
> >>> I had no
> >>> HW available for testing.
> >>>
> >>> While investigation and stress-testing virtual ports I found few
> >>> issues with
> >>> missing locking inside the kernel, so there is no trust for kernel
> >>> part of XDP
> >>> implementation from my side. I'm suspecting that there are some
> >>> other bugs in
> >>> kernel/libbpf that only could be reproduced with driver mode.
> >>>
> >>> This never happens for virtual ports with SKB mode, so I never saw
> >>> this coverage
> >>> counter being non-zero.
> >>
> >> Did some quick debugging, as something else has come up that needs my
> >> attention :)
> >>
> >> But once I’m in a faulty state and sent a single packet, causing
> >> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
> >> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
> >> there might be some ring management bug. Maybe consumer and receiver
> >> are equal meaning 0 buffers, but it returns max? I did not look at
> >> the kernel code, so this is just a wild guess :)
> >>
> >> (gdb) p tx_done
> >> $3 = 2048
> >>
> >> (gdb) p umem->cq
> >> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
> >> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
> >> 0x7f08486b8040, ring = 0x7f08486b8080}
> >
> > Thanks for debugging!
> >
> > xsk_ring_cons__peek() just returns the difference between cached_prod
> > and cached_cons, but these values are too different:
> >
> > 3830466864 - 3578066899 = 252399965
> >
> > Since this value > requested, it returns requested number (2048).
> >
> > So, the ring is broken. At least broken its 'cached' part. It'll be
> > good
> > to look at *consumer and *producer values to verify the state of the
> > actual ring.
> >
>
> I’ll try to find some more time next week to debug further.
>
> William I noticed your email in xdp-newbies where you mention this
> problem of getting the wrong pointers. Did you ever follow up, or did
> further trouble shooting on the above?

Yes, I posted here
https://www.spinics.net/lists/xdp-newbies/msg00956.html
"Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"

At that time I was thinking about reproducing the problem using the
xdpsock sample code from the kernel. But it turned out that my reproduction
code was not correct, so it was not able to show the case we hit here in OVS.

Then I put more of the OVS code logic into xdpsock, but the problem
does not show up. As a result, I worked around it by marking the addr as
"*addr == UINT64_MAX".

I will debug again this week once I get my testbed back.
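For anyone wanting to retry that reproduction, the xdpsock runs boil down to
something like this (illustrative; flags as in samples/bpf/xdpsock_user.c, the
interface and queue are placeholders):

    # Build the kernel samples, then drive TX on the same queue in native (driver) mode.
    cd samples/bpf && make
    ./xdpsock -i eth2 -q 0 -t -N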

William


Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-14 Thread Eelco Chaudron



On 8 Aug 2019, at 17:38, Ilya Maximets wrote:



I see a rather high number of afxdp_cq_skip, which should to my 
knowledge never happen?


I tried to investigate this previously, but didn't find anything 
suspicious.

So, for my knowledge, this should never happen too.
However, I only looked at the code without actually running, because 
I had no

HW available for testing.

While investigation and stress-testing virtual ports I found few 
issues with
missing locking inside the kernel, so there is no trust for kernel 
part of XDP
implementation from my side. I'm suspecting that there are some 
other bugs in

kernel/libbpf that only could be reproduced with driver mode.

This never happens for virtual ports with SKB mode, so I never saw 
this coverage

counter being non-zero.


Did some quick debugging, as something else has come up that needs my 
attention :)


But once I’m in a faulty state and sent a single packet, causing 
afxdp_complete_tx() to be called, it tells me 2048 descriptors are 
ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that 
there might be some ring management bug. Maybe consumer and receiver 
are equal meaning 0 buffers, but it returns max? I did not look at 
the kernel code, so this is just a wild guess :)


(gdb) p tx_done
$3 = 2048

(gdb) p umem->cq
$4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 
2047, size = 2048, producer = 0x7f08486b8000, consumer = 
0x7f08486b8040, ring = 0x7f08486b8080}


Thanks for debugging!

xsk_ring_cons__peek() just returns the difference between cached_prod
and cached_cons, but these values are too different:

3830466864 - 3578066899 = 252399965

Since this value > requested, it returns the requested number (2048).

So, the ring is broken. At least its 'cached' part is broken. It'll be
good to look at *consumer and *producer values to verify the state of
the actual ring.
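To make the arithmetic above concrete, xsk_ring_cons__peek() behaves roughly
like this simplified model (paraphrased from libbpf's xsk.h, not the verbatim
source):

    /* Simplified model of libbpf's peek logic. */
    static inline __u32
    peek_sketch(struct xsk_ring_cons *r, __u32 nb, __u32 *idx)
    {
        __u32 entries = r->cached_prod - r->cached_cons;  /* 3830466864 - 3578066899 */

        if (entries == 0) {
            r->cached_prod = *r->producer;   /* Refresh the cache from the real ring. */
            entries = r->cached_prod - r->cached_cons;
        }
        if (entries > nb) {
            entries = nb;                    /* Caps at the requested 2048. */
        }
        if (entries > 0) {
            *idx = r->cached_cons;
            r->cached_cons += entries;
        }
        return entries;
    }

So if only the cached fields were corrupted, reading *r->producer and
*r->consumer directly (for example in gdb) would still show a sane kernel-side
ring; if those disagree as well, the ring itself is broken.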



I’ll try to find some more time next week to debug further.

William I noticed your email in xdp-newbies where you mention this 
problem of getting the wrong pointers. Did you ever follow up, or did 
further trouble shooting on the above?






$ ovs-appctl coverage/show  | grep xdp
afxdp_cq_empty    0.0/sec       339.600/sec        5.6606/sec   total: 20378
afxdp_tx_full     0.0/sec        29.967/sec        0.4994/sec   total: 1798
afxdp_cq_skip     0.0/sec  61884770.167/sec  1174238.3644/sec   total: 4227258112



You mentioned you saw this high number in your v15 change notes, 
did you do any research on why?


Cheers,

Eelco






Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-08 Thread Ilya Maximets
On 08.08.2019 17:53, Eelco Chaudron wrote:
> 
> 
> On 8 Aug 2019, at 14:09, Ilya Maximets wrote:
> 
>> On 08.08.2019 14:42, Eelco Chaudron wrote:
>>>
>>>
>>> On 19 Jul 2019, at 16:54, Ilya Maximets wrote:
>>>
 On 18.07.2019 23:11, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology.  It is aims to have 
> comparable
> performance to DPDK but cooperate better with existing kernel's networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP 
> program
> attached to the netdev, by-passing a couple of Linux kernel's subsystems
> As a result, AF_XDP socket shows much better performance than AF_PACKET
> For more details about AF_XDP, please see linux kernel's
> Documentation/networking/af_xdp.rst. Note that by default, this feature is
> not compiled in.
>
> Signed-off-by: William Tu 


 Thanks, William, Eelco and Ben!

 I fixed couple of things and applied to master!
>>>
>>> Good to see this got merged into master while on PTO. However, when I got 
>>> back I decided to test it once more…
>>>
>>> When testing PVP I got a couple of packets trough, and then it would stall. 
>>> I thought it might be my kernel, so updated to yesterdays latest, no luck…
>>>
>>> I did see a bunch of “eno1: send failed due to exhausted memory pool.” 
>>> messages in the log. Putting back patch v14, made my problems go away…
>>>
>>> After some debugging, I noticed the problem was with the “continue” case in 
>>> the afxdp_complete_tx() function.
>>> Applying the following patch made it work again:
>>>
>>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>>> index b7cc0d988..9b335ddf0 100644
>>> --- a/lib/netdev-afxdp.c
>>> +++ b/lib/netdev-afxdp.c
>>> @@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info)
>>>
>>>          if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
>>>              umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
>>>              xsk_info->outstanding_tx -= tx_to_free;
>>>              tx_to_free = 0;
>>>          }
>>>      }
>>>
>>> +    if (tx_to_free) {
>>> +        umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
>>> +        xsk_info->outstanding_tx -= tx_to_free;
>>> +    }
>>> +
>>>      if (tx_done > 0) {
>>>          xsk_ring_cons__release(&umem->cq, tx_done);
>>>      } else {
>>>          COVERAGE_INC(afxdp_cq_empty);
>>>      }
>>>  }
>>
>> Good catch! Will you submit a patch?
>> BTW, to reduce the code duplication I'd suggest to remove the 'continue'
>> like this:
>>
>>     if (*addr != UINT64_MAX) {
>>     Do work;
>>     } else {
>>     COVERAGE_INC(afxdp_cq_skip);
>>     }
> 
> Done, patch has been sent out…
> 
>>>
>>>
>>> Which made me wonder why we do mark elements as being used? To my knowledge 
>>> (and looking at some of the code and examples), after the  
>>> xsk_ring_cons__release() function a xsk_ring_cons__peek() should not 
>>> receive any duplicate slots.
>>>
>>> I see a rather high number of afxdp_cq_skip, which should to my knowledge 
>>> never happen?
>>
>> I tried to investigate this previously, but didn't find anything suspicious.
>> So, for my knowledge, this should never happen too.
>> However, I only looked at the code without actually running, because I had no
>> HW available for testing.
>>
>> While investigation and stress-testing virtual ports I found few issues with
>> missing locking inside the kernel, so there is no trust for kernel part of 
>> XDP
>> implementation from my side. I'm suspecting that there are some other bugs in
>> kernel/libbpf that only could be reproduced with driver mode.
>>
>> This never happens for virtual ports with SKB mode, so I never saw this 
>> coverage
>> counter being non-zero.
> 
> Did some quick debugging, as something else has come up that needs my 
> attention :)
> 
> But once I’m in a faulty state and sent a single packet, causing 
> afxdp_complete_tx() to be called, it tells me 2048 descriptors are ready, 
> which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that there might be 
> some ring management bug. Maybe consumer and receiver are equal meaning 0 
> buffers, but it returns max? I did not look at the kernel code, so this is 
> just a wild guess :)
> 
> (gdb) p tx_done
> $3 = 2048
> 
> (gdb) p umem->cq
> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047, size = 
> 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040, ring = 
> 0x7f08486b8080}

Thanks for debugging!

xsk_ring_cons__peek() just returns the difference between cached_prod
and cached_cons, but these values are too different:

3830466864 - 3578066899 = 252399965

Since this value > requested, it returns the requested number (2048).

So, the ring is broken. At least its 'cached' part is broken. It'll be good
to look at *consumer and *producer values to verify the 

Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-08 Thread Eelco Chaudron



On 8 Aug 2019, at 14:09, Ilya Maximets wrote:


On 08.08.2019 14:42, Eelco Chaudron wrote:



On 19 Jul 2019, at 16:54, Ilya Maximets wrote:


On 18.07.2019 23:11, William Tu wrote:

The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux 
socket
type built upon the eBPF and XDP technology.  It is aims to have 
comparable
performance to DPDK but cooperate better with existing kernel's 
networking
stack.  An AF_XDP socket receives and sends packets from an 
eBPF/XDP program
attached to the netdev, by-passing a couple of Linux kernel's 
subsystems
As a result, AF_XDP socket shows much better performance than 
AF_PACKET

For more details about AF_XDP, please see linux kernel's
Documentation/networking/af_xdp.rst. Note that by default, this 
feature is

not compiled in.

Signed-off-by: William Tu 



Thanks, William, Eelco and Ben!

I fixed couple of things and applied to master!


Good to see this got merged into master while on PTO. However, when I 
got back I decided to test it once more…


When testing PVP I got a couple of packets trough, and then it would 
stall. I thought it might be my kernel, so updated to yesterdays 
latest, no luck…


I did see a bunch of “eno1: send failed due to exhausted memory 
pool.” messages in the log. Putting back patch v14, made my 
problems go away…


After some debugging, I noticed the problem was with the 
“continue” case in the afxdp_complete_tx() function.

Applying the following patch made it work again:

diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index b7cc0d988..9b335ddf0 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info)

         if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
             umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
             xsk_info->outstanding_tx -= tx_to_free;
             tx_to_free = 0;
         }
     }

+    if (tx_to_free) {
+        umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
+        xsk_info->outstanding_tx -= tx_to_free;
+    }
+
     if (tx_done > 0) {
         xsk_ring_cons__release(&umem->cq, tx_done);
     } else {
         COVERAGE_INC(afxdp_cq_empty);
     }
 }


Good catch! Will you submit a patch?
BTW, to reduce the code duplication I'd suggest removing the 'continue',
like this:

    if (*addr != UINT64_MAX) {
        Do work;
    } else {
        COVERAGE_INC(afxdp_cq_skip);
    }


Done, patch has been sent out…




Which made me wonder why we do mark elements as being used? To my 
knowledge (and looking at some of the code and examples), after the  
xsk_ring_cons__release() function a xsk_ring_cons__peek() should not 
receive any duplicate slots.


I see a rather high number of afxdp_cq_skip, which should to my 
knowledge never happen?


I tried to investigate this previously, but didn't find anything 
suspicious.

So, for my knowledge, this should never happen too.
However, I only looked at the code without actually running, because I 
had no

HW available for testing.

While investigation and stress-testing virtual ports I found few 
issues with
missing locking inside the kernel, so there is no trust for kernel 
part of XDP
implementation from my side. I'm suspecting that there are some other 
bugs in

kernel/libbpf that only could be reproduced with driver mode.

This never happens for virtual ports with SKB mode, so I never saw 
this coverage

counter being non-zero.


Did some quick debugging, as something else has come up that needs my 
attention :)


But once I’m in a faulty state and sent a single packet, causing 
afxdp_complete_tx() to be called, it tells me 2048 descriptors are 
ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that there 
might be some ring management bug. Maybe consumer and receiver are equal 
meaning 0 buffers, but it returns max? I did not look at the kernel 
code, so this is just a wild guess :)


(gdb) p tx_done
$3 = 2048

(gdb) p umem->cq
$4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047, 
size = 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040, ring 
= 0x7f08486b8080}




$ ovs-appctl coverage/show  | grep xdp
afxdp_cq_empty    0.0/sec       339.600/sec        5.6606/sec   total: 20378
afxdp_tx_full     0.0/sec        29.967/sec        0.4994/sec   total: 1798
afxdp_cq_skip     0.0/sec  61884770.167/sec  1174238.3644/sec   total: 4227258112



You mentioned you saw this high number in your v15 change notes, did 
you do any research on why?


Cheers,

Eelco



Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-08 Thread Ilya Maximets
On 08.08.2019 14:42, Eelco Chaudron wrote:
> 
> 
> On 19 Jul 2019, at 16:54, Ilya Maximets wrote:
> 
>> On 18.07.2019 23:11, William Tu wrote:
>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
>>> type built upon the eBPF and XDP technology.  It is aims to have comparable
>>> performance to DPDK but cooperate better with existing kernel's networking
>>> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
>>> attached to the netdev, by-passing a couple of Linux kernel's subsystems
>>> As a result, AF_XDP socket shows much better performance than AF_PACKET
>>> For more details about AF_XDP, please see linux kernel's
>>> Documentation/networking/af_xdp.rst. Note that by default, this feature is
>>> not compiled in.
>>>
>>> Signed-off-by: William Tu 
>>
>>
>> Thanks, William, Eelco and Ben!
>>
>> I fixed couple of things and applied to master!
> 
> Good to see this got merged into master while on PTO. However, when I got 
> back I decided to test it once more…
> 
> When testing PVP I got a couple of packets trough, and then it would stall. I 
> thought it might be my kernel, so updated to yesterdays latest, no luck…
> 
> I did see a bunch of “eno1: send failed due to exhausted memory pool.” 
> messages in the log. Putting back patch v14, made my problems go away…
> 
> After some debugging, I noticed the problem was with the “continue” case in 
> the afxdp_complete_tx() function.
> Applying the following patch made it work again:
> 
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> index b7cc0d988..9b335ddf0 100644
> --- a/lib/netdev-afxdp.c
> +++ b/lib/netdev-afxdp.c
> @@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info)
> 
>          if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
>              umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
>              xsk_info->outstanding_tx -= tx_to_free;
>              tx_to_free = 0;
>          }
>      }
> 
> +    if (tx_to_free) {
> +        umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
> +        xsk_info->outstanding_tx -= tx_to_free;
> +    }
> +
>      if (tx_done > 0) {
>          xsk_ring_cons__release(&umem->cq, tx_done);
>      } else {
>          COVERAGE_INC(afxdp_cq_empty);
>      }
>  }

Good catch! Will you submit a patch?
BTW, to reduce the code duplication I'd suggest removing the 'continue',
like this:

    if (*addr != UINT64_MAX) {
        Do work;
    } else {
        COVERAGE_INC(afxdp_cq_skip);
    }
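Fleshed out, the loop body with that suggestion applied would look roughly like
the sketch below, reusing the locals of afxdp_complete_tx(); umem_elem_from_addr()
is a hypothetical stand-in for the address-to-element conversion in the real code:

    /* Sketch: recycle loop without the 'continue', plus the trailing flush from
     * Eelco's fix for leftovers when the last descriptor was a skipped one. */
    for (j = 0; j < tx_done; j++) {
        uint64_t *addr =
            (uint64_t *) xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);

        if (*addr != UINT64_MAX) {
            elems_push[tx_to_free++] = umem_elem_from_addr(umem, *addr);
            *addr = UINT64_MAX;   /* Mark as already recycled. */

            if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
                umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
                xsk_info->outstanding_tx -= tx_to_free;
                tx_to_free = 0;
            }
        } else {
            COVERAGE_INC(afxdp_cq_skip);
        }
    }

    if (tx_to_free) {
        umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
        xsk_info->outstanding_tx -= tx_to_free;
    }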

> 
> 
> Which made me wonder why we do mark elements as being used? To my knowledge 
> (and looking at some of the code and examples), after the  
> xsk_ring_cons__release() function a xsk_ring_cons__peek() should not receive 
> any duplicate slots.
> 
> I see a rather high number of afxdp_cq_skip, which should to my knowledge 
> never happen?

I tried to investigate this previously, but didn't find anything suspicious.
So, to my knowledge, this should never happen either.
However, I only looked at the code without actually running it, because I had no
HW available for testing.

While investigating and stress-testing virtual ports I found a few issues with
missing locking inside the kernel, so I have no trust in the kernel part of the XDP
implementation. I suspect that there are some other bugs in
kernel/libbpf that can only be reproduced with driver mode.

This never happens for virtual ports with SKB mode, so I never saw this coverage
counter being non-zero.

> 
> $ ovs-appctl coverage/show  | grep xdp
> afxdp_cq_empty    0.0/sec       339.600/sec        5.6606/sec   total: 20378
> afxdp_tx_full     0.0/sec        29.967/sec        0.4994/sec   total: 1798
> afxdp_cq_skip     0.0/sec  61884770.167/sec  1174238.3644/sec   total: 4227258112
> 
> 
> You mentioned you saw this high number in your v15 change notes, did you do 
> any research on why?
> 
> Cheers,
> 
> Eelco


Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-08-08 Thread Eelco Chaudron



On 19 Jul 2019, at 16:54, Ilya Maximets wrote:


On 18.07.2019 23:11, William Tu wrote:

The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux 
socket
type built upon the eBPF and XDP technology.  It is aims to have 
comparable
performance to DPDK but cooperate better with existing kernel's 
networking
stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP 
program
attached to the netdev, by-passing a couple of Linux kernel's 
subsystems
As a result, AF_XDP socket shows much better performance than 
AF_PACKET

For more details about AF_XDP, please see linux kernel's
Documentation/networking/af_xdp.rst. Note that by default, this 
feature is

not compiled in.

Signed-off-by: William Tu 



Thanks, William, Eelco and Ben!

I fixed couple of things and applied to master!


Good to see this got merged into master while on PTO. However, when I 
got back I decided to test it once more…


When testing PVP I got a couple of packets trough, and then it would 
stall. I thought it might be my kernel, so updated to yesterdays latest, 
no luck…


I did see a bunch of “eno1: send failed due to exhausted memory 
pool.” messages in the log. Putting back patch v14, made my problems 
go away…


After some debugging, I noticed the problem was with the “continue” 
case in the afxdp_complete_tx() function.

Applying the following patch made it work again:

diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index b7cc0d988..9b335ddf0 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info)

         if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
             umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
             xsk_info->outstanding_tx -= tx_to_free;
             tx_to_free = 0;
         }
     }

+    if (tx_to_free) {
+        umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
+        xsk_info->outstanding_tx -= tx_to_free;
+    }
+
     if (tx_done > 0) {
         xsk_ring_cons__release(&umem->cq, tx_done);
     } else {
         COVERAGE_INC(afxdp_cq_empty);
     }
 }


Which made me wonder why we do mark elements as being used? To my 
knowledge (and looking at some of the code and examples), after the  
xsk_ring_cons__release() function a xsk_ring_cons__peek() should not 
receive any duplicate slots.


I see a rather high number of afxdp_cq_skip, which should to my 
knowledge never happen?


$ ovs-appctl coverage/show  | grep xdp
afxdp_cq_empty    0.0/sec       339.600/sec        5.6606/sec   total: 20378
afxdp_tx_full     0.0/sec        29.967/sec        0.4994/sec   total: 1798
afxdp_cq_skip     0.0/sec  61884770.167/sec  1174238.3644/sec   total: 4227258112



You mentioned you saw this high number in your v15 change notes, did you 
do any research on why?


Cheers,

Eelco



Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-07-19 Thread William Tu
On Fri, Jul 19, 2019 at 05:54:54PM +0300, Ilya Maximets wrote:
> On 18.07.2019 23:11, William Tu wrote:
> > The patch introduces experimental AF_XDP support for OVS netdev.
> > AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> > type built upon the eBPF and XDP technology.  It is aims to have comparable
> > performance to DPDK but cooperate better with existing kernel's networking
> > stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> > attached to the netdev, by-passing a couple of Linux kernel's subsystems
> > As a result, AF_XDP socket shows much better performance than AF_PACKET
> > For more details about AF_XDP, please see linux kernel's
> > Documentation/networking/af_xdp.rst. Note that by default, this feature is
> > not compiled in.
> > 
> > Signed-off-by: William Tu 
> 
> 
> Thanks, William, Eelco and Ben!
> 
> I fixed couple of things and applied to master!
> 
> List of changes:
> * Dropped config.h from headers.
> * Removed double increment of 'util_xalloc' coverage counter in 
> xmalloc_size_align().
> * Fixed style of a couple of comments.
> * Renamed underscored functions from netdev-afxdp-pool.c to be more OVS-style.
>   Ex.: __umem_elem_pop_n --> umem_elem_pop_n__
> 
> 
> Best regards, Ilya Maximets.

Thanks for your final review and fixes.
William


Re: [ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-07-19 Thread Ilya Maximets
On 18.07.2019 23:11, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology.  It is aims to have comparable
> performance to DPDK but cooperate better with existing kernel's networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> attached to the netdev, by-passing a couple of Linux kernel's subsystems
> As a result, AF_XDP socket shows much better performance than AF_PACKET
> For more details about AF_XDP, please see linux kernel's
> Documentation/networking/af_xdp.rst. Note that by default, this feature is
> not compiled in.
> 
> Signed-off-by: William Tu 


Thanks, William, Eelco and Ben!

I fixed a couple of things and applied it to master!

List of changes:
* Dropped config.h from headers.
* Removed double increment of 'util_xalloc' coverage counter in 
xmalloc_size_align().
* Fixed style of a couple of comments.
* Renamed underscored functions from netdev-afxdp-pool.c to be more OVS-style.
  Ex.: __umem_elem_pop_n --> umem_elem_pop_n__


Best regards, Ilya Maximets.


[ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

2019-07-18 Thread William Tu
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology.  It aims to have comparable
performance to DPDK but cooperate better with the existing kernel networking
stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
attached to the netdev, by-passing a couple of the Linux kernel's subsystems.
As a result, an AF_XDP socket shows much better performance than AF_PACKET.
For more details about AF_XDP, please see the Linux kernel's
Documentation/networking/af_xdp.rst. Note that by default, this feature is
not compiled in.

Signed-off-by: William Tu 
---
v15:
 * address review feedback from Ilya
   https://patchwork.ozlabs.org/patch/1125476/
 * skip TCP related test cases
 * reclaim all CONS_NUM_DESC at complete tx
 * add retries to kick_tx
 * increase memory pool size
 * remove redundant xdp flag and bind flag
 * remove unused rx_dropped var
 * make tx_dropped counter atomic
 * refactor dp_packet_init_afxdp using dp_packet_init__
 * rebase to ovs master, test with latest bpf-next kernel commit b14a260e33ddb4
   Ilya's kernel patches are required
   commit 455302d1c9ae ("xdp: fix hang while unregistering device bound to xdp socket")
   commit 162c820ed896 ("xdp: hold device for umem regardless of zero-copy mode")
 Possible issues:
 * still lots of afxdp_cq_skip  (ovs-appctl coverage/show)
   afxdp_cq_skip  44325273.6/sec 34362312.683/sec   572705.2114/sec   total: 2106010377
 * TODO:
   'make check-afxdp' still not all pass
   IP fragmentation expiry test not fix yet, need to implement
   deferral memory free, s.t like dpdk_mp_sweep.  Currently hit
   some missing umem descs when reclaiming.
   NSH test case still failed (not due to afxdp)

v16:
  * address feedbacks from Ilya
  * add deferral memory free
  * add afxdp testsuites files to gitignore

v17:
  * address feedbacks from Ilya and Ben
  https://patchwork.ozlabs.org/patch/1131547/
  * ovs_spin_lock: add pthread_spin_lock checks, fix typo
  * update NEWS
  * add pthread_spin_lock check at OVS_CHECK_LINUX_AF_XDP
  * fix bug in xmalloc_size_align
  * rename xdpsock to netdev-afxdp-pool
  * remove struct umem_elem, use void *
  * fix style and comments
  * fix afxdp.rst
  * rebase to OVS master, tested on kernel 5.2.0-rc6
 Note: I still leave the last_tsc in pmd_perf_stats the same as v16

v18:
  * address feedbacks from Ilya and Ben
  https://patchwork.ozlabs.org/patch/1133416/
  * update document about tcp and reconfiguration
  * fix leak in tx spin locks
  * refactor __umem_pool alloc and assert
  * refactor macro and defines used in netdev-afxdp[-pool]
  * refactor the xpool->array to remove using type casting
  * add empty netdev_afxdp_rxq_destruct to avoid closing
afxdp socket
---
 Documentation/automake.mk             |    1 +
 Documentation/index.rst               |    1 +
 Documentation/intro/install/afxdp.rst |  432 ++
 Documentation/intro/install/index.rst |    1 +
 NEWS                                  |    1 +
 acinclude.m4                          |   35 ++
 configure.ac                          |    1 +
 lib/automake.mk                       |   10 +
 lib/dp-packet.c                       |   23 +
 lib/dp-packet.h                       |   18 +-
 lib/dpif-netdev-perf.h                |   24 +
 lib/netdev-afxdp-pool.c               |  167 ++
 lib/netdev-afxdp-pool.h               |   58 ++
 lib/netdev-afxdp.c                    | 1041 +
 lib/netdev-afxdp.h                    |   73 +++
 lib/netdev-linux-private.h            |  132 +
 lib/netdev-linux.c                    |  126 ++--
 lib/netdev-provider.h                 |    3 +
 lib/netdev.c                          |   11 +
 lib/util.c                            |   92 ++-
 lib/util.h                            |    5 +
 tests/.gitignore                      |    3 +
 tests/automake.mk                     |   16 +
 tests/system-afxdp-macros.at          |   39 ++
 tests/system-afxdp-testsuite.at       |   26 +
 tests/system-traffic.at               |    2 +
 vswitchd/vswitch.xml                  |   15 +
 27 files changed, 2247 insertions(+), 109 deletions(-)
 create mode 100644 Documentation/intro/install/afxdp.rst
 create mode 100644 lib/netdev-afxdp-pool.c
 create mode 100644 lib/netdev-afxdp-pool.h
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/netdev-linux-private.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at
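Before the diff itself, a minimal standalone sketch of the AF_XDP socket setup
the commit message describes, using libbpf's xsk API; this is illustrative
background only, not code from the patch, and the interface name and queue id
are placeholders:

    #include <bpf/xsk.h>
    #include <linux/if_link.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NUM_FRAMES 4096
    #define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

    int
    main(void)
    {
        struct xsk_ring_prod fq, tx;
        struct xsk_ring_cons cq, rx;
        struct xsk_umem *umem;
        struct xsk_socket *xsk;
        void *buf;

        /* One contiguous umem area shared between the kernel and userspace. */
        if (posix_memalign(&buf, getpagesize(), NUM_FRAMES * FRAME_SIZE)) {
            return 1;
        }
        if (xsk_umem__create(&umem, buf, NUM_FRAMES * FRAME_SIZE, &fq, &cq, NULL)) {
            return 1;
        }

        struct xsk_socket_config cfg = {
            .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
            .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
            .xdp_flags = XDP_FLAGS_DRV_MODE,  /* Driver (native) mode, as in the tests. */
        };
        if (xsk_socket__create(&xsk, "eth2", 0, umem, &rx, &tx, &cfg)) {
            return 1;
        }

        /* ... fill the fq, then peek rx / reserve tx in the datapath loop ... */

        xsk_socket__delete(xsk);
        xsk_umem__delete(umem);
        free(buf);
        return 0;
    }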

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 8472921746ba..2a3214a3cc7f 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -10,6 +10,7 @@ DOC_SOURCE = \
Documentation/intro/why-ovs.rst \
Documentation/intro/install/index.rst \
Documentation/intro/install/bash-completion.rst \
+