Re: [ovs-dev] [PATCH v4 0/3] Add support for TSO with DPDK

2020-01-17 Thread Loftus, Ciara



> -Original Message-
> From: Flavio Leitner 
> Sent: Thursday 16 January 2020 17:01
> To: d...@openvswitch.org
> Cc: Stokes, Ian ; Loftus, Ciara
> ; Ilya Maximets ;
> yangy...@inspur.com; txfh2007 ; Flavio Leitner
> 
> Subject: [PATCH v4 0/3] Add support for TSO with DPDK
> 
> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> the network stack to delegate the TCP segmentation to the NIC reducing
> the per packet CPU overhead.
> 
> A guest using vhost-user interface with TSO enabled can send TCP packets
> much bigger than the MTU, which saves CPU cycles normally used to break
> the packets down to MTU size and to calculate checksums.
> 
> It also saves CPU cycles used to parse multiple packets/headers during
> the packet processing inside virtual switch.
> 
> If the destination of the packet is another guest in the same host, then
> the same big packet can be sent through a vhost-user interface skipping
> the segmentation completely. However, if the destination is not local,
> the NIC hardware is instructed to do the TCP segmentation and checksum
> calculation.
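
For reference, the hand-off to the NIC happens per mbuf: the transmit path
marks the packet for segmentation and lets the hardware do the rest. A
minimal, hypothetical sketch of what that marking looks like with DPDK's
rte_mbuf.h (pre-20.11 flag names; the helper below is illustrative and not
part of this series):

    #include <rte_mbuf.h>

    /* Illustrative only: mark an mbuf so the NIC performs TCP segmentation
     * and checksum calculation.  'mss' is the payload size of each segment
     * the NIC will emit. */
    static void
    mark_mbuf_for_tso(struct rte_mbuf *m, uint16_t l2_len, uint16_t l3_len,
                      uint16_t l4_len, uint16_t mss)
    {
        m->l2_len = l2_len;      /* e.g. 14 for plain Ethernet */
        m->l3_len = l3_len;      /* IP header length */
        m->l4_len = l4_len;      /* TCP header length */
        m->tso_segsz = mss;
        m->ol_flags |= PKT_TX_TCP_SEG | PKT_TX_TCP_CKSUM;
        m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM;  /* assuming IPv4 */
    }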
> 
> The first 2 patches are not really part of TSO support, but they are
> required to make sure everything works.
> 
> There are good improvements sending to or receiving from veth pairs or
> tap devices as well. See the iperf3 results below:
> 
> [*] veth with ethtool tx off.
> 
> VM sending to:     Default            Enabled          Enabled/Default
>    Local BR        3 Gbits/sec        23 Gbits/sec     7x
>    Net NS (veth)   3 Gbits/sec[*]     22 Gbits/sec     7x
>    VM (same host)  2.5 Gbits/sec      24 Gbits/sec     9x
>    Ext Host        10 Gbits/sec       35 Gbits/sec     3x

I re-ran my tests and observed similar (slightly better, actually) improvements
to those I reported for the v3:
VM -> VM (same host): +5.5x
VM -> Ext Host: +4.1x

Tested-by: Ciara Loftus 

Thanks,
Ciara

>Ext Host (vxlan)   8.8 Gbits/sec (not supported)
> 
>   Using VLAN:
>Local BR 3 Gbits/sec 23 Gbits/sec  7x
>VM (same host) 2.5 Gbits/sec 21 Gbits/sec  8x
>Ext Host   6.4 Gbits/sec 34 Gbits/sec  5x
> 
>   Using IPv6:
>Net NS (veth)  2.7 Gbits/sec[*]  22 Gbits/sec  8x
>VM (same host) 2.6 Gbits/sec 21 Gbits/sec  8x
>Ext Host   8.7 Gbits/sec 34 Gbits/sec  4x
> 
>   Conntrack:
>    No packet changes: 1.41 Gbits/sec   33 Gbits/sec   23x
> 
> VM receiving from:
>Local BR   2.5 Gbits/sec 2.4 Gbits/sec 1x
>Net NS (veth)  2.5 Gbits/sec[*]  9.3 Gbits/sec 3x
>VM (same host) 4.9 Gbits/sec  25 Gbits/sec 5x
>Ext Host   9.7 Gbits/sec 9.4 Gbits/sec 1x
>Ext Host (vxlan)   5.5 Gbits/sec (not supported)
> 
>   Using VLAN:
>Local BR   2.4 Gbits/sec 2.4 Gbits/sec 1x
>VM (same host) 3.8 Gbits/sec  24 Gbits/sec 8x
>Ext Host   9.5 Gbits/sec 9.5 Gbits/sec 1x
> 
>   Using IPv6:
>Net NS (veth)  2.2 Gbits/sec[*]   9 Gbits/sec  4x
>VM (same host) 4.5 Gbits/sec 24 Gbits/sec  5x
>    Ext Host       8.9 Gbits/sec   8.9 Gbits/sec   1x
> 
> Used iperf3 -u to test UDP traffic limited at default 1Mbits/sec
> and noticed no change with the exception of tunneled packets (not
> supported).
> 
> Travis, AppVeyor, and Cirrus-ci passed.
> 
> Flavio Leitner (3):
>   dp-packet: preserve headroom when cloning a pkt batch
>   vhost: Disable multi-segmented buffers
>   netdev-dpdk: Add TCP Segmentation Offload support
> 
>  Documentation/automake.mk  |   1 +
>  Documentation/topics/index.rst |   1 +
>  Documentation/topics/userspace-tso.rst |  98 +++
>  NEWS   |   1 +
>  lib/automake.mk|   2 +
>  lib/conntrack.c|  29 +-
>  lib/dp-packet.h| 192 +++-
>  lib/ipf.c  |  32 +-
>  lib/netdev-dpdk.c  | 355 ---
>  lib/netdev-linux-private.h |   5 +
>  lib/netdev-linux.c | 386 ++---
>  lib/netdev-provider.h  |   9 +
>  lib/netdev.c   |  78 -
>  lib/userspace-tso.c|  48 +++
>  lib/userspace-tso.h|  23 ++
>  vswitchd/bridge.c  |   2 +
>  vswitchd/vswitch.xml   |  17 ++
>  17 files changed, 1154 insertions(+), 125 deletions(-)
>  create mode 100644 Documentation/topics/userspace-tso.rst
>  create mode 100644 lib/userspace-tso.c
>  create mode 100644 lib/userspace-tso.h
> 
> --
> 2.24.1



Re: [ovs-dev] [PATCH] Documentation: add notes for TSO & i40e

2020-01-14 Thread Loftus, Ciara



> -Original Message-
> From: David Marchand 
> Sent: Monday 13 January 2020 15:58
> To: Loftus, Ciara 
> Cc: ovs dev ; Flavio Leitner ;
> Stokes, Ian 
> Subject: Re: [ovs-dev] [PATCH] Documentation: add notes for TSO & i40e
> 
> On Mon, Jan 13, 2020 at 4:49 PM Ciara Loftus  wrote:
> >
> > When using TSO in OVS-DPDK with an i40e device, the following
> > commit is required for DPDK, which fixes an issue on the TSO path:
> http://git.dpdk.org/next/dpdk-next-net/commit/?id=b2a4dc260139409c539fb8e7f1b9d0a5182cfd2b
> 
> Please avoid referencing dpdk-next-net sha1's, as this tree is rebased
> on dpdk master quite frequently.
> This commit will reach the master branch in a few days hopefully
> (20.02-rc1 is expected this week).
> 
> In URLs, prefer https://.

Thanks for the feedback David.

I'll hold off submitting a v2 anyway until Flavio's series on which this patch 
depends is merged, to avoid the 0-day errors. So hopefully by then the fix will 
have reached the master branch and I can include the appropriate sha1, or 
alternatively a link to the patchwork patch.

Thanks,
Ciara
> 
> 
> > Document this as a limitation until a DPDK release with the fix
> > included is supported by OVS.
> >
> > Also, document best known methods for performance tuning when
> > testing TSO with the tool iperf.
> >
> > Signed-off-by: Ciara Loftus 
> > ---
> >  Documentation/topics/dpdk/tso.rst | 27
> +++
> >  1 file changed, 27 insertions(+)
> >
> > diff --git a/Documentation/topics/dpdk/tso.rst
> b/Documentation/topics/dpdk/tso.rst
> > index 189c86480..5c56c458c 100644
> > --- a/Documentation/topics/dpdk/tso.rst
> > +++ b/Documentation/topics/dpdk/tso.rst
> > @@ -94,3 +94,30 @@ datapath must support TSO or packets using that
> feature will be dropped
> >  on ports without TSO support.  That also means guests using vhost-user
> >  in client mode will receive TSO packet regardless of TSO being enabled
> >  or disabled within the guest.
> > +
> > +When the NIC performing the segmentation is using the i40e DPDK PMD, a fix
> > +must be included in the DPDK build, otherwise TSO will not work. The fix
> > +can be found on `the dpdk.org net-next tree`__.
> > +
> > +__ http://git.dpdk.org/next/dpdk-next-net/commit/?id=e975e3720ba16278c6b43f8938b75710573bbd6a
> 
> Ditto.
> 
> 
> > +
> > +This fix is expected to be included in the 19.11.1 release. When OVS
> > +migrates to this DPDK release, this limitation can be removed.
> > +
> > +~~~~~~~~~~~~~~~~~~
> > +Performance Tuning
> > +~~~~~~~~~~~~~~~~~~
> > +
> > +iperf is often used to test TSO performance. Care needs to be taken when
> > +configuring the environment in which the iperf server process is run.
> > +Since the iperf server uses the NIC's kernel driver, IRQs will be
> > +generated. By default with some NICs, e.g. i40e, the IRQs will land on the
> > +same core as the one used by the server process, provided the number of
> > +NIC queues is greater than or equal to that lcore id. This causes
> > +contention between the iperf server process and the IRQs. For optimal
> > +performance, it is suggested to pin the IRQs to their own core. To change
> > +the affinity associated with a given IRQ number, you can 'echo' the
> > +desired coremask to the file /proc/irq//smp_affinity
> > +For more on SMP affinity, refer to the `Linux kernel documentation`__.
> > +
> > +__ https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
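
As an aside, the same affinity change can be scripted from C rather than with
'echo'; a hedged sketch (the IRQ number and coremask are placeholders, not
values from this patch):

    #include <stdio.h>

    /* Write a hex coremask to the IRQ's smp_affinity file so it is serviced
     * only by the cores in 'mask'.  Returns 0 on success. */
    static int
    set_irq_affinity(int irq, unsigned long mask)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof path, "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f) {
            return -1;
        }
        fprintf(f, "%lx\n", mask);
        return fclose(f) ? -1 : 0;
    }
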
> > --
> > 2.17.1
> >
> 
> 
> --
> David Marchand



Re: [ovs-dev] [PATCH] netdev-dpdk: Avoid undefined behavior processing devargs

2020-01-13 Thread Loftus, Ciara
> 
> In "Use of library functions" in the C standard, the following statement
> is written to apply to all library functions:
> 
>   If an argument to a function has an invalid value (such as ... a
>   null pointer ... the behavior is undefined.
> 
> Later, under the "String handling" section's "Comparison functions", no
> exception is listed for strcmp, which means NULL is invalid.  It may
> be possible for smap_get to return NULL.
> 
> Given the above, we must check that new_devargs is not null.  The check
> against NULL for new_devargs later in the function is still valid.
> 
> Fixes: 55e075e65ef9 ("netdev-dpdk: Arbitrary 'dpdk' port naming")
> Signed-off-by: Aaron Conole 

LGTM.

Acked-by: Ciara Loftus 

> ---
> NOTE: This is just code-review caught while looking elsewhere, and I
>   didn't test it.
> 
>  lib/netdev-dpdk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 02120a379..00501f7d5 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1817,7 +1817,7 @@ netdev_dpdk_set_config(struct netdev *netdev,
> const struct smap *args,
> 
>  new_devargs = smap_get(args, "dpdk-devargs");
> 
> -if (dev->devargs && strcmp(new_devargs, dev->devargs)) {
> +if (dev->devargs && new_devargs && strcmp(new_devargs, dev->devargs)) {
>  /* The user requested a new device.  If we return error, the caller
>   * will delete this netdev and try to recreate it. */
>  err = EAGAIN;
> --
> 2.21.0



Re: [ovs-dev] [PATCH v3 0/3] Add support for TSO with DPDK

2020-01-10 Thread Loftus, Ciara



> -Original Message-
> From: Flavio Leitner 
> Sent: Thursday 9 January 2020 14:45
> To: d...@openvswitch.org
> Cc: Stokes, Ian ; Loftus, Ciara
> ; Ilya Maximets ;
> yangy...@inspur.com; Flavio Leitner 
> Subject: [PATCH v3 0/3] Add support for TSO with DPDK
> 
> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> the network stack to delegate the TCP segmentation to the NIC reducing
> the per packet CPU overhead.
> 
> A guest using vhost-user interface with TSO enabled can send TCP packets
> much bigger than the MTU, which saves CPU cycles normally used to break
> the packets down to MTU size and to calculate checksums.
> 
> It also saves CPU cycles used to parse multiple packets/headers during
> the packet processing inside virtual switch.
> 
> If the destination of the packet is another guest in the same host, then
> the same big packet can be sent through a vhost-user interface skipping
> the segmentation completely. However, if the destination is not local,
> the NIC hardware is instructed to do the TCP segmentation and checksum
> calculation.
> 
> The first 2 patches are not really part of TSO support, but they are
> required to make sure everything works.
> 
> There are good improvements sending to or receiving from veth pairs or
> tap devices as well. See the iperf3 results below:
> 
> [*] veth with ethtool tx off.
> 
> VM sending to:     Default             Enabled
>    Local BR        859 Mbits/sec       9.23 Gbits/sec
>    Net NS (veth)   965 Mbits/sec[*]    9.74 Gbits/sec
>    VM (same host)  2.54 Gbits/sec      22.4 Gbits/sec
>    Ext Host        10.3 Gbits/sec      35.0 Gbits/sec

I performed some similar tests. I recorded the following improvements in 
throughput:
VM -> VM (same host): +5.4x
VM -> Ext Host: +3.8x

I tested VM -> Ext Host with both an MT27800 (ConnectX-5) and an XL710
(Fortville) NIC.
The result above was measured for the XL710.

Two things to note when testing with an i40e NIC:
1. The following patch is required for DPDK, which fixes an issue on the TSO 
path:
http://git.dpdk.org/next/dpdk-next-net/commit/?id=b2a4dc260139409c539fb8e7f1b9d0a5182cfd2b
2. For optimal performance, ensure the IRQs of the queue being used by the 
iperf server are pinned to their own core, as opposed to the same core as the 
server process, which appears to be the default behavior.

I intend on submitting a follow up patch which documents the above.

Tested-by: Ciara Loftus 

Thanks,
Ciara

>Ext Host (vxlan)  8.77 Gbits/sec (not supported)
> 
>   Using VLAN:
>Local BR   877 Mbits/sec 9.49 Gbits/sec
>    VM (same host)  2.35 Gbits/sec      23.3 Gbits/sec
>Ext Host  5.84 Gbits/sec 34.6 Gbits/sec
> 
>   Using IPv6:
>Net NS (veth)  937 Mbits/sec[*]  9.32 Gbits/sec
>    VM (same host)  2.53 Gbits/sec      21.1 Gbits/sec
>Ext Host  8.66 Gbits/sec 37.7 Gbits/sec
> 
>   Conntrack:
>    No packet changes: 1.41 Gbits/sec   33.1 Gbits/sec
> 
> VM receiving from:
>Local BR   221 Mbits/sec 220 Mbits/sec
>Net NS (veth)  221 Mbits/sec[*]  5.91 Gbits/sec
>    VM (same host)  4.79 Gbits/sec      22.2 Gbits/sec
>Ext Host  10.6 Gbits/sec 10.7 Gbits/sec
>Ext Host (vxlan)  5.82 Gbits/sec (not supported)
> 
>   Using VLAN:
>Local BR   223 Mbits/sec 219 Mbits/sec
>    VM (same host)  4.21 Gbits/sec      24.1 Gbits/sec
>Ext Host  10.3 Gbits/sec 10.2 Gbits/sec
> 
>   Using IPv6:
>Net NS (veth)  217 Mbits/sec[*]  9.32 Gbits/sec
>    VM (same host)  4.26 Gbits/sec      23.3 Gbits/sec
>Ext Host  9.99 Gbits/sec 10.1 Gbits/sec
> 
> Used iperf3 -u to test UDP traffic limited at default 1Mbits/sec
> and noticed no change with the exception of tunneled packets (not
> supported).
> 
> Travis, AppVeyor, and Cirrus-ci passed.
> 
> Flavio Leitner (3):
>   dp-packet: preserve headroom when cloning a pkt batch
>   vhost: Disable multi-segmented buffers
>   netdev-dpdk: Add TCP Segmentation Offload support
> 
>  Documentation/automake.mk   |   1 +
>  Documentation/topics/dpdk/index.rst |   1 +
>  Documentation/topics/dpdk/tso.rst   |  96 +
>  NEWS|   1 +
>  lib/automake.mk |   2 +
>  lib/conntrack.c |  29 ++-
>  lib/dp-packet.h | 158 +-
>  lib/ipf.c   |  32 +--
>  lib/netdev-dpdk.c   | 318 
>  lib/netdev-linux-private.h  |   4 +
>  lib/netdev-linux.c  | 296 +++---
>  lib/netdev-provider.h   |  10 

Re: [ovs-dev] [PATCH 0/4] Add support for TSO with DPDK

2019-12-09 Thread Loftus, Ciara
> 
> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> the network stack to delegate the TCP segmentation to the NIC reducing
> the per packet CPU overhead.
> 
> A guest using vhost-user interface with TSO enabled can send TCP packets
> much bigger than the MTU, which saves CPU cycles normally used to break
> the packets down to MTU size and to calculate checksums.
> 
> It also saves CPU cycles used to parse multiple packets/headers during
> the packet processing inside virtual switch.
> 
> If the destination of the packet is another guest in the same host, then
> the same big packet can be sent through a vhost-user interface skipping
> the segmentation completely. However, if the destination is not local,
> the NIC hardware is instructed to do the TCP segmentation and checksum
> calculation.
> 
> The first 3 patches are not really part of TSO support, but they are
> required to make sure everything works.
> 
> The TSO is only enabled in the vhost-user ports in client mode which
> means DPDK is required. The other ports (dpdk or linux) can receive
> those packets.
> 
> This patchset is based on branch dpdk-latest (v19.11 required).
> 
> The numbers look good and I noticed no regression so far. However, my
> environment is tuned but not well fine-tuned, so consider those as
> ball park numbers only.
> 
> This is VM-to-external host (Gbits/sec)
>                | No patch  |  TSO off  |  TSO on
>    ---------------------------------------------------
>    TCPv4       |    10     |    10     |    38  (line rate)
>    ---------------------------------------------------
>    TCPv6       |     9     |     9     |    38  (line rate)
>    ---------------------------------------------------
>    V4 Conntrack|     1     |     1     |    33
> 
> 
> This is VM-to-VM in the same host (Gbits/sec)
>                | No patch  |  TSO off  |  TSO on
>    ---------------------------------------------------
>    TCPv4       |    9.4    |    9.4    |    31
>    ---------------------------------------------------
>    TCPv6       |     8     |     8     |    30
>    ---------------------------------------------------

Hi Flavio,

Thanks for the patch.
Similarly, my environment is not tuned but I'm measuring a decent improvement 
too.
(These results include your fixes on 
https://github.com/fleitner/ovs/tree/tso-v2-devel)
VM to external host: +71%
VM to VM: +154%

Note: TSO appears to be broken for some cases on i40e: 
https://bugs.dpdk.org/show_bug.cgi?id=373
So I've applied this patch to DPDK for my tests: 
https://github.com/Mic92/dpdk/commit/2dc9b8dd2c8e2eb71f216b5b9222a4deb57482c9

I've also tested on 82599ES; however, since we hit near line rate with TSO off,
no significant improvement can be measured.

Thanks,
Ciara

> 
> There are good improvements sending to veth pairs or tap devices too.
> 
> Flavio Leitner (4):
>   dp-packet: preserve headroom when cloning a pkt batch
>   vhost: Disable multi-segmented buffers
>   dp-packet: handle new dpdk buffer flags
>   netdev-dpdk: Add TCP Segmentation Offload support
> 
>  Documentation/automake.mk   |   1 +
>  Documentation/topics/dpdk/index.rst |   1 +
>  Documentation/topics/dpdk/tso.rst   |  83 +
>  NEWS|   1 +
>  lib/automake.mk |   2 +
>  lib/dp-packet.c |  50 -
>  lib/dp-packet.h |  38 --
>  lib/netdev-bsd.c|   7 ++
>  lib/netdev-dpdk.c   | 112 +++-
>  lib/netdev-dummy.c  |   7 ++
>  lib/netdev-linux.c  |  71 --
>  lib/tso.c   |  54 ++
>  lib/tso.h   |  23 ++
>  vswitchd/bridge.c   |   2 +
>  vswitchd/vswitch.xml|  14 
>  15 files changed, 430 insertions(+), 36 deletions(-)
>  create mode 100644 Documentation/topics/dpdk/tso.rst
>  create mode 100644 lib/tso.c
>  create mode 100644 lib/tso.h
> 
> --
> 2.23.0



Re: [ovs-dev] ovs-dpdk zero-copy rx/tx support between physical nic and vhost user

2019-11-26 Thread Loftus, Ciara
> 
> Hi Ciara and Ian,
> 
> I'm checking the zero-copy support on both rx/tx side when
> using ovs-dpdk with vhostuser and dpdkport.
> Assuming PVP, IIUC[1,2], the rx (P to V) does not have zero-copy and
> the V to P has zero copy support by enabling dq-zero-copy.
> 
> Am I understanding this correctly?

Hi William,

That's correct. 

Thanks,
Ciara

> 
> Thank you!
> William
> 
> [1]
> commit 10087cba9deec95aaea080c49f2cbe648ebe92c8
> Author: Ciara Loftus 
> Date:   Wed Jan 31 10:44:54 2018 +
> 
> netdev-dpdk: Add support for vHost dequeue zero copy (experimental)
> 
> Zero copy is disabled by default. To enable it, set the 'dq-zero-copy'
> option to 'true' when configuring the Interface:
> 
> [2]
> commit 8b7baf5f3fc6598694b94715ad9786440973dc3f
> Author: Ian Stokes 
> Date:   Fri Oct 19 14:30:15 2018 +0100
> 
> Docs: Remove zero-copy QEMU limitation.
> 
> Remove note regarding zero-copy compatibility with QEMU >= 2.7.


Re: [ovs-dev] [PATCHv5] netdev-afxdp: Add need_wakeup supprt.

2019-10-16 Thread Loftus, Ciara
> 
> The patch adds support for using need_wakeup flag in AF_XDP rings.
> A new option, use_need_wakeup, is added. When this option is used,
> it means that OVS has to explicitly wake up the kernel RX using the poll()
> syscall and wake up TX using the sendto() syscall. This feature improves
> the performance by avoiding unnecessary sendto syscalls for TX.
> For RX, instead of the kernel always busy-spinning on the fill queue, OVS
> wakes up the kernel RX processing when the fill queue is replenished.
> 
> The need_wakeup feature is merged into the Linux kernel bpf-next tree with
> commit 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
> and OVS enables it by default. Running the feature before this version causes
> the xsk bind to fail; please use options:use_need_wakeup=false to disable it.
> If users enable it but run an older version of libbpf, then the need_wakeup
> feature has no effect, and a warning message is logged.
> 
> For virtual interfaces, it's better to set use_need_wakeup=false, since
> the virtual device's AF_XDP xmit is synchronous: the sendto syscall
> enters the kernel and processes the TX packet on the tx queue directly.
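
For context, the pattern described above boils down to checking the ring's
need_wakeup flag before paying for a syscall. A rough sketch with the libbpf
xsk API (xsk_fd, tx and fq are assumed to be the socket fd, TX ring and fill
queue of an already configured AF_XDP socket; this is not the OVS code
itself):

    #include <poll.h>
    #include <sys/socket.h>
    #include <bpf/xsk.h>

    static void
    kick_kernel_if_needed(int xsk_fd, struct xsk_ring_prod *tx,
                          struct xsk_ring_prod *fq)
    {
        /* TX: only issue the sendto() syscall when the kernel asked for it. */
        if (xsk_ring_prod__needs_wakeup(tx)) {
            sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
        }

        /* RX: after replenishing the fill queue, wake up kernel RX processing. */
        if (xsk_ring_prod__needs_wakeup(fq)) {
            struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };
            poll(&pfd, 1, 0);
        }
    }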
> 
> On Intel Xeon E5-2620 v3 2.4GHz system, performance of physical port
> to physical port improves from 6.1Mpps to 7.3Mpps.


Hi William,

Thanks for the patch. 
I'm wondering if/how you've pinned the IRQs for the physical ports in the test 
above.
Are the IRQs pinned to standalone cores or does the PMD thread share the same 
core?
I'd be interested in seeing performance results for both, if you have them.

Thanks,
Ciara

> 
> Suggested-by: Ilya Maximets 
> Signed-off-by: William Tu 
> ---



Re: [ovs-dev] [PATCH v1] Docs: Remove zero-copy QEMU limitation.

2018-10-23 Thread Loftus, Ciara
> 
> Remove note regarding zero-copy compatibility with QEMU >= 2.7.
> 
> When zero-copy was introduced to OVS it was incompatible with QEMU >=
> 2.7. This issue has since been fixed in DPDK with commit
> 803aeecef123 ("vhost: fix dequeue zero copy with virtio1") and
> backported to DPDK LTS branches. Remove the reference to this
> issue in the zero-copy documentation.
> 
> Cc: Ciara Loftus 
> Signed-off-by: Ian Stokes 

Thanks, Ian.

Acked-by: Ciara Loftus 

> ---
>  Documentation/topics/dpdk/vhost-user.rst | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/vhost-user.rst
> b/Documentation/topics/dpdk/vhost-user.rst
> index 176bf1768..6334590af 100644
> --- a/Documentation/topics/dpdk/vhost-user.rst
> +++ b/Documentation/topics/dpdk/vhost-user.rst
> @@ -500,12 +500,6 @@ QEMU versions v2.10 and greater). This value can be
> set like so::
> 
>  Because of this limitation, this feature is considered 'experimental'.
> 
> -The feature currently does not fully work with QEMU >= v2.7 due to a bug in
> -DPDK which will be addressed in an upcoming release. The patch to fix this
> -issue can be found on
> -`Patchwork
> -`__
> -
>  Further information can be found in the
>  `DPDK documentation
>  `__
> --
> 2.13.6



Re: [ovs-dev] [PATCH] NEWS: Re-add vhost zero copy support.

2018-07-10 Thread Loftus, Ciara
> 
> An entry for experimental vhost zero copy support was removed
> incorrectly. Re-add this entry to NEWS.
> 
> Reported-by: Eelco Chaudron 
> Cc: Ciara Loftus 
> Fixes: c3c722d2c7ee ("Documentation: document ovs-dpdk flow offload")
> Signed-off-by: Ian Stokes 

Acked-by: Ciara Loftus 

> ---
>  NEWS | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/NEWS b/NEWS
> index 92e9b92..057e8bf 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -115,6 +115,7 @@ v2.9.0 - 19 Feb 2018
>   * New appctl command 'dpif-netdev/pmd-rxq-rebalance' to rebalance rxq
> to
> pmd assignments.
>   * Add rxq utilization of pmd to appctl 'dpif-netdev/pmd-rxq-show'.
> + * Add support for vHost dequeue zero copy (experimental).
> - Userspace datapath:
>   * Output packet batching support.
> - vswitchd:
> --
> 2.7.5



Re: [ovs-dev] OVS (master) + DPDK(17.11) + multi-queue

2018-06-20 Thread Loftus, Ciara
> 
> Hi,
> 
> 
> On Tue, Jun 19, 2018 at 12:27 AM Ilya Maximets 
> wrote:
> 
> > Hi,
> > According to your log, your NIC has limited size of tx queues:
> >
> >   2018-06-19T04:34:46.106Z|00089|dpdk|ERR|PMD: Unsupported size of TX
> queue
> >(max size: 1024)
> >
> > This means that you have to configure 'n_txq_desc' <= 1024 in order to
> > configure your NIC properly. OVS uses 2048 by default and this causes
> > device configuration failure.
> >
> > Try this:
> >
> > ovs-vsctl set interface eth1 options:n_txq_desc=1024
> >
> > It is also likely that you will have to configure the same value for
> > 'n_rxq_desc'.
> >
> >
> Thank you. It was indeed no. queue descriptors issue. Modifying the config
> to 1024 fixed it.
> 
> I am using 'pdump/pcap' features in dpdk for packet capture with OVS-DPDK
> and currently it doesn't seem to work. I used following link
> 
> https://software.intel.com/en-us/articles/dpdk-pdump-in-open-vswitch-
> with-dpdk
> 
> OVS is linked to DPDK 17.11.2 which has pdump library built. OVS has pdump
> server socket file created
> 
> ls -l /usr/local/var/run/openvswitch/
> total 8
> ...
> srwxr-x--- 1 root root 0 Jun 19 19:26 pdump_server_socket
> srwxr-x--- 1 root root 0 Jun 19 19:26 vhost-user-1
> srwxr-x--- 1 root root 0 Jun 19 19:26 vhost-user-2
> 
> ./x86_64-native-linuxapp-gcc/build/app/pdump/dpdk-pdump -- --pdump
> port=1,queue=*,rx-dev=/tmp/pkts.pcap
> --server-socket-path=/usr/local/var/run/openvswitch
> EAL: Detected 8 lcore(s)
> PANIC in rte_eal_config_attach():
> *Cannot open '/var/run/.rte_config' for rte_mem_config*
> 6: [./x86_64-native-linuxapp-gcc/build/app/pdump/dpdk-
> pdump(_start+0x29)
> [0x4472e9]]
> 5: [/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)
> [0x7ff0ae7ba830]]
> 4: [./x86_64-native-linuxapp-gcc/build/app/pdump/dpdk-
> pdump(main+0x155)
> [0x44aa76]]
> 3:
> [./x86_64-native-linuxapp-gcc/build/app/pdump/dpdk-
> pdump(rte_eal_init+0xc7d)
> [0x49ef0d]]
> 2:
> [./x86_64-native-linuxapp-gcc/build/app/pdump/dpdk-
> pdump(__rte_panic+0xc3)
> [0x43ebb3]]
> 1:
> [./x86_64-native-linuxapp-gcc/build/app/pdump/dpdk-
> pdump(rte_dump_stack+0x2b)
> [0x4a573b]]
> Aborted
> 
> Anything I am missing?

Hi,

I can't reproduce your pdump issue. For me using the same command it works ok.
Does /var/run/.rte_config exist and do you have sufficient permissions to 
access that file as the user executing the pdump app?
http://dpdk.readthedocs.io/en/v17.11/prog_guide/multi_proc_support.html#running-multiple-independent-dpdk-applications
 may also help.

Thanks,
Ciara

> 
> Thanks.
> 
> 
> > Note that OVS manages TX queues itself and it will try to allocate
> > separate TX queue for each PMD thread + 1 for non-PMD threads. So,
> > it's kind of impossible to force it to configure only one TX queue
> > in case HW supports multiqueue.
> >
> > > Hi,
> > >
> > > I am using above configuration on my testbed and when I try to add a
> port
> > > which is bound to igb_uio, I see following errors due to Tx queue
> > > configuration. I just want to use single Tx and Rx queue for my testing.
> > I
> > > looked at Documentation/intro/install/dpdk.rst, i see only "DPDK Physical
> > > Port Rx Queues" and nothing for "Tx Queues". Kindly let me know how
> can I
> > > use single tx/rx queues and if I have to use multiple Tx queues what
> > > configuration changes I need to do?
> > >
> > > ovs logs==
> > > 2018-06-19T04:33:23.720Z|00075|bridge|INFO|ovs-vswitchd (Open
> vSwitch)
> > > 2.9.90
> > > 2018-06-19T04:33:32.735Z|00076|memory|INFO|127688 kB peak resident
> set
> > size
> > > after 10.1 seconds
> > > 2018-06-19T04:33:32.735Z|00077|memory|INFO|handlers:5 ports:1
> > > revalidators:3 rules:5
> > > 2018-06-19T04:33:40.857Z|00078|rconn|INFO|br0<->unix#0: connected
> > > 2018-06-19T04:33:40.858Z|00079|rconn|INFO|br0<->unix#1: connected
> > > 2018-06-19T04:33:40.859Z|00080|rconn|INFO|br0<->unix#2: connected
> > > 2018-06-19T04:34:46.094Z|00081|dpdk|INFO|EAL: PCI device
> :00:06.0 on
> > > NUMA socket 0
> > > 2018-06-19T04:34:46.094Z|00082|dpdk|INFO|EAL:   probe driver:
> 1d0f:ec20
> > > net_ena
> > > 2018-06-19T04:34:46.095Z|00083|dpdk|INFO|PMD: eth_ena_dev_init():
> > > Initializing 0:0:6.0
> > > 2018-06-19T04:34:46.095Z|00084|netdev_dpdk|INFO|Device
> ':00:06.0'
> > > attached to DPDK
> > > 2018-06-19T04:34:46.099Z|00085|dpif_netdev|INFO|PMD thread on
> numa_id: 0,
> > > core id:  0 created.
> > > 2018-06-19T04:34:46.099Z|00086|dpif_netdev|INFO|There are 1 pmd
> threads
> > on
> > > numa node 0
> > > 2018-06-19T04:34:46.105Z|00087|netdev_dpdk|WARN|Rx checksum
> offload is
> > not
> > > supported on port 0
> > > 2018-06-19T04:34:46.105Z|00088|dpdk|INFO|PMD: Set MTU: 1500
> > > 2018-06-19T04:34:46.106Z|00089|dpdk|ERR|PMD: Unsupported size of
> TX queue
> > > (max size: 1024)
> > > 2018-06-19T04:34:46.106Z|00090|netdev_dpdk|INFO|Interface eth1
> unable to
> > > setup txq(0): Invalid argument
> > > 

Re: [ovs-dev] [RFC PATCH] netdev-dpdk: Integrate vHost User PMD

2018-06-01 Thread Loftus, Ciara
> 
> > On Mon, May 21, 2018 at 04:44:13PM +0100, Ciara Loftus wrote:
> > > The vHost PMD brings vHost User port types ('dpdkvhostuser' and
> > > 'dpdkvhostuserclient') under control of DPDK's librte_ether API, like
> > > all other DPDK netdev types ('dpdk' and 'dpdkr'). In doing so, direct
> > > calls to DPDK's librte_vhost library are removed and replaced with
> > > librte_ether API calls, for which most of the infrastructure is
> > > already in place.
> > >
> > > This change has a number of benefits, including:
> > > * Reduced codebase (~200LOC removed)
> > > * More features automatically enabled for vHost ports eg. custom stats
> > >   and additional get_status information.
> > > * OVS can be ignorant to changes in the librte_vhost API between DPDK
> > >   releases potentially making upgrades easier and the OVS codebase less
> > >   susceptible to change.
> > >
> > > The sum of all DPDK port types must not exceed RTE_MAX_ETHPORTS
> which
> > > is set and can be modified in the DPDK configuration. Prior to this
> > > patch this only applied to 'dpdk' and 'dpdkr' ports, but now applies
> > > to all DPDK port types including vHost User.
> > >
> > > Performance (pps) of the different topologies p2p, pvp, pvvp and vv
> > > has been measured to remain within a +/- 5% margin of existing
> > performance.
> >
> > Thanks for putting this together.
> >
> > I think when this idea was discussed at least in my head we would pretty
> > much kill any vhost specific info and use a standard eth API instead.
> > However, it doesn't look like to be case, we still have the mtu and queue
> > issues, special construct/destruct, send, and etc which IMHO defeats the
> > initial goal.
> 
> I agree, I think that would be the ideal situation but it seems we're not
> there yet.
> I wonder if that is something that could be changed and fed back to DPDK? If
> we will always have to have the separate implementations is that reflective
> of OVS requirements or a gap in DPDK implementation of vhost PMD?

Hi Ian & Flavio,

Thank you both for your responses. I agree, right now we are not at the ideal
scenario with this API, which would probably be closer to having the
netdev_dpdk and netdev_dpdk_vhost* classes equivalent. However, 4 functions have
become common (get_carrier, stats, custom_stats, features) and many of the
remainder have some element of commonality through helper functions (send,
receive, status, etc.). The hope would be that going forward we could narrow
the gap through both OVS and DPDK changes. I think it would be difficult to
narrow that gap if we opt for an "all or nothing" approach now.

> 
> >
> > Leaving that aside for a moment, I wonder about imposed limitations if we
> > switch to the eth API too. I mean, things that we can do today because OVS
> is managing vhost that we won't be able to do after the API switch.
> 
> I've been thinking of this situation also. But one concern is by not using the
> vhost PMD will there be features that are unavailable to vhost in ovs?
> 
> Nothing comes to mind for now, and as long as we continue to access DPDK
> vhost library that should be ok. However it's something we should keep an
> eye on in the future (for example, we recently had an example of a DPDK
> function that could not be used with DPDK compiled for shared libs).
> 
> It would be interesting to see where the DPDK community are trending
> towards with vhost development in the future WRT this.

Feature-wise, it appears the development of any new DPDK vHost feature includes 
relevant support for that feature in the PMD. Dequeue zero copy and vhost iommu 
are examples of this. So going forward I don't see any issues there.

In my development and testing of this patch I haven't come across any 
limitations other than that out-of-the-box one is limited to a maximum of 32 
vHost ports as defined by RTE_MAX_ETHPORTS. I would be interested to hear if 
that would be a concern for users.

On the other hand there are many more new things we can do with the API switch 
too, e.g. more information in get_status & custom statistics and hopefully more
going forward. Although I understand preserving existing functionality is
critical.

I understand this is a large patch and it might take some time to review. But I would
definitely welcome any further high level feedback especially around the topics 
above from anybody in the community interested in netdev-dpdk/vHost.

Thanks,
Ciara

> 
> Ian
> 
> >
> > Thanks,
> > fbl
> >


Re: [ovs-dev] [RFC v7 11/13] netdev-dpdk: copy large packet to multi-seg. mbufs

2018-05-28 Thread Loftus, Ciara
> 
> From: Mark Kavanagh 
> 
> Currently, packets are only copied to a single segment in
> the function dpdk_do_tx_copy(). This could be an issue in
> the case of jumbo frames, particularly when multi-segment
> mbufs are involved.
> 
> This patch calculates the number of segments needed by a
> packet and copies the data to each segment.
> 
> Co-authored-by: Michael Qiu 
> 
> Signed-off-by: Mark Kavanagh 
> Signed-off-by: Michael Qiu 
> ---
>  lib/netdev-dpdk.c | 78
> ---
>  1 file changed, 68 insertions(+), 10 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index d3abdde..c6dfe6d 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -2178,6 +2178,71 @@ out:
>  }
>  }
> 
> +static int
> +dpdk_prep_tx_buf(struct dp_packet *packet, struct rte_mbuf **head,
> + struct rte_mempool *mp)
> +{
> +struct rte_mbuf *temp;
> +uint32_t size = dp_packet_size(packet);
> +uint16_t max_data_len, data_len;
> +uint32_t nb_segs = 0;
> +int i;
> +
> +temp = *head = rte_pktmbuf_alloc(mp);
> +if (OVS_UNLIKELY(!temp)) {
> +return 1;
> +}
> +
> +/* All new allocated mbuf's max data len is the same */
> +max_data_len = temp->buf_len - temp->data_off;
> +
> +/* Calculate # of output mbufs. */
> +nb_segs = size / max_data_len;
> +if (size % max_data_len) {
> +nb_segs = nb_segs + 1;
> +}
> +
> +/* Allocate additional mbufs when multiple output mbufs required. */
> +for (i = 1; i < nb_segs; i++) {
> +temp->next = rte_pktmbuf_alloc(mp);
> +if (!temp->next) {
> +rte_pktmbuf_free(*head);
> +*head = NULL;
> +break;
> +}
> +temp = temp->next;
> +}
> +/* We have to do a copy for now */
> +rte_pktmbuf_pkt_len(*head) = size;
> +temp = *head;
> +
> +data_len = size < max_data_len ? size: max_data_len;
> +if (packet->source == DPBUF_DPDK) {
> +*head = &(packet->mbuf);
> +while (temp && head && size > 0) {
> +rte_memcpy(rte_pktmbuf_mtod(temp, void *),
> +dp_packet_data((struct dp_packet *)head), data_len);
> +rte_pktmbuf_data_len(temp) = data_len;
> +*head = (*head)->next;
> +size = size - data_len;
> +data_len =  size < max_data_len ? size: max_data_len;
> +temp = temp->next;
> +}
> +} else {
> +int offset = 0;
> +while (temp && size > 0) {
> +memcpy(rte_pktmbuf_mtod(temp, void *),
> +dp_packet_at(packet, offset, data_len), data_len);
> +rte_pktmbuf_data_len(temp) = data_len;
> +temp = temp->next;
> +size = size - data_len;
> +offset += data_len;
> +data_len = size < max_data_len ? size: max_data_len;
> +}
> +}
> +return 0;
> +}
> +
>  /* Tx function. Transmit packets indefinitely */
>  static void
>  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch
> *batch)
> @@ -2194,6 +2259,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
> struct dp_packet_batch *batch)
>  struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
>  uint32_t cnt = batch_cnt;
>  uint32_t dropped = 0;
> +uint32_t i;
> 
>  if (dev->type != DPDK_DEV_VHOST) {
>  /* Check if QoS has been configured for this netdev. */
> @@ -2204,27 +2270,19 @@ dpdk_do_tx_copy(struct netdev *netdev, int
> qid, struct dp_packet_batch *batch)
> 
>  uint32_t txcnt = 0;
> 
> -for (uint32_t i = 0; i < cnt; i++) {
> +for (i = 0; i < cnt; i++) {
>  struct dp_packet *packet = batch->packets[i];
>  uint32_t size = dp_packet_size(packet);
> -
>  if (OVS_UNLIKELY(size > dev->max_packet_len)) {
>  VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
>   size, dev->max_packet_len);
> -
>  dropped++;
>  continue;
>  }
> -
> -pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
> -if (OVS_UNLIKELY(!pkts[txcnt])) {
> +if (!dpdk_prep_tx_buf(packet, &pkts[txcnt], dev->mp)) {

We should enter here if the return is not zero.
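
That is, the condition would presumably become (untested sketch):

    if (dpdk_prep_tx_buf(packet, &pkts[txcnt], dev->mp)) {
        dropped += cnt - i;
        break;
    }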

>  dropped += cnt - i;
>  break;
>  }
> -
> -/* We have to do a copy for now */
> -memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
> -   dp_packet_data(packet), size);
>  dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
>  dp_packet_copy_mbuf_flags((struct dp_packet *)pkts[txcnt], packet);
> 
> --
> 2.7.4
> 
d...@openvswitch.org

Re: [ovs-dev] [RFC v7 10/13] dp-packet: copy data from multi-seg. DPDK mbuf

2018-05-28 Thread Loftus, Ciara
> 
> From: Michael Qiu 
> 
> When cloning a packet, if the packet source is a DPDK driver,
> multi-segment mbufs must be considered, and each segment's data copied
> one by one.
> 
> Also, a lot of the DPDK mbuf's info is lost during a copy, like packet
> type, ol_flags, etc.  That information is very important for DPDK to do
> packet processing.
> 
> Co-authored-by: Mark Kavanagh 
> 
> Signed-off-by: Michael Qiu 
> Signed-off-by: Mark Kavanagh 
> ---
>  lib/dp-packet.c   | 74
> ++-
>  lib/dp-packet.h   |  3 +++
>  lib/netdev-dpdk.c |  1 +
>  3 files changed, 67 insertions(+), 11 deletions(-)
> 
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 6a57c3c..bce6987 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -48,6 +48,22 @@ dp_packet_use__(struct dp_packet *b, void *base,
> size_t allocated,
>  dp_packet_init__(b, allocated, source);
>  }
> 
> +#ifdef DPDK_NETDEV
> +void
> +dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct
> dp_packet *src) {

Opening brace on newline - coding standards.

> +ovs_assert(dst != NULL && src != NULL);
> +struct rte_mbuf *buf_dst = &(dst->mbuf);
> +struct rte_mbuf buf_src = src->mbuf;
> +
> +buf_dst->nb_segs = buf_src.nb_segs;
> +buf_dst->ol_flags = buf_src.ol_flags;
> +buf_dst->packet_type = buf_src.packet_type;
> +buf_dst->tx_offload = buf_src.tx_offload;
> +}
> +#else
> +#define dp_packet_copy_mbuf_flags(arg1, arg2)
> +#endif
> +
>  /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes 
> of
>   * memory starting at 'base'.  'base' should be the first byte of a region
>   * obtained from malloc().  It will be freed (with free()) if 'b' is resized 
> or
> @@ -158,6 +174,49 @@ dp_packet_clone(const struct dp_packet *buffer)
>  return dp_packet_clone_with_headroom(buffer, 0);
>  }
> 
> +#ifdef DPDK_NETDEV
> +struct dp_packet *
> +dp_packet_clone_with_headroom(const struct dp_packet *buffer,
> +  size_t headroom) {
> +struct dp_packet *new_buffer;
> +uint32_t pkt_len = dp_packet_size(buffer);
> +
> +/* copy multi-seg data */
> +if (buffer->source == DPBUF_DPDK && buffer->mbuf.nb_segs > 1) {
> +uint32_t offset = 0;
> +void *dst = NULL;
> +struct rte_mbuf *tmbuf = CONST_CAST(struct rte_mbuf *,
> +&(buffer->mbuf));
> +
> +new_buffer = dp_packet_new_with_headroom(pkt_len, headroom);
> +dp_packet_set_size(new_buffer, pkt_len + headroom);
> +dst = dp_packet_tail(new_buffer);
> +
> +while (tmbuf) {
> +rte_memcpy((char *)dst + offset,
> +   rte_pktmbuf_mtod(tmbuf, void *), tmbuf->data_len);
> +offset += tmbuf->data_len;
> +tmbuf = tmbuf->next;
> +}
> +} else {
> +new_buffer =
> dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
> +pkt_len, headroom);
> +}
> +
> +/* Copy the following fields into the returned buffer: l2_pad_size,
> + * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
> +memcpy(&new_buffer->l2_pad_size, &buffer->l2_pad_size,
> +sizeof(struct dp_packet) -
> +offsetof(struct dp_packet, l2_pad_size));
> +
> +dp_packet_copy_mbuf_flags(new_buffer, buffer);
> +if (dp_packet_rss_valid(new_buffer)) {
> +new_buffer->mbuf.hash.rss = buffer->mbuf.hash.rss;
> +}
> +
> +return new_buffer;
> +}
> +#else
>  /* Creates and returns a new dp_packet whose data are copied from
> 'buffer'.
>   * The returned dp_packet will additionally have 'headroom' bytes of
>   * headroom. */
> @@ -165,32 +224,25 @@ struct dp_packet *
>  dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t
> headroom)
>  {
>  struct dp_packet *new_buffer;
> +uint32_t pkt_len = dp_packet_size(buffer);
> 
>  new_buffer =
> dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
> - dp_packet_size(buffer),
> - headroom);
> + pkt_len, headroom);
> +

This change of introducing the pkt_len variable doesn't seem necessary.

>  /* Copy the following fields into the returned buffer: l2_pad_size,
>   * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
>  memcpy(&new_buffer->l2_pad_size, &buffer->l2_pad_size,
>  sizeof(struct dp_packet) -
>  offsetof(struct dp_packet, l2_pad_size));
> 
> -#ifdef DPDK_NETDEV
> -new_buffer->mbuf.ol_flags = buffer->mbuf.ol_flags;
> -#else
>  new_buffer->rss_hash_valid = buffer->rss_hash_valid;
> -#endif
> -
>  if (dp_packet_rss_valid(new_buffer)) {
> -#ifdef DPDK_NETDEV
> -new_buffer->mbuf.hash.rss = 

Re: [ovs-dev] [RFC v7 09/13] dp-packet: Handle multi-seg mbufs in resize__().

2018-05-28 Thread Loftus, Ciara
> 
> When enabled with DPDK OvS relies on mbufs allocated by mempools to
> receive and output data on DPDK ports. Until now, each OvS dp_packet has
> only one mbuf associated, which is allocated with the maximum possible
> size, taking the MTU into account. This approach, however, doesn't allow
> us to increase the allocated size in an mbuf, if needed, since an mbuf
> is allocated and initialised upon mempool creation. Thus, in the current
> implementation this is dealt with by calling OVS_NOT_REACHED() and
> terminating OvS.
> 
> To avoid this, and allow the allocated size to be increased, multiple
> mbufs can be linked together, in the multi-segment mbufs approach. Thus,
> dp_packet_resize__() has been modified to handle the DPBUF_DPDK case
> and
> allocate and link new mbufs together as needed.
> 
> Signed-off-by: Tiago Lam 
> ---
>  lib/dp-packet.c | 84
> +++--
>  1 file changed, 82 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 7b603c7..6a57c3c 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -238,8 +238,86 @@ dp_packet_resize__(struct dp_packet *b, size_t
> new_headroom, size_t new_tailroom
> 
>  switch (b->source) {
>  case DPBUF_DPDK:
> -OVS_NOT_REACHED();
> +{
> +#ifdef DPDK_NETDEV

Is it possible for b->source to be DPBUF_DPDK while DPDK_NETDEV is not
defined? If not, the #ifdef can be removed.

> +uint16_t max_data_len, nb_segs;
> +size_t miss_len;
> +size_t head_diff = 0;
> +struct rte_mbuf *mbuf, *fmbuf;
> +struct rte_mempool *mp;
> +
> +if (!netdev_dpdk_is_multi_segment_mbufs_enabled()) {
> +/* XXX: Handle single mbufs case better */
> +return;
> +}
> +
> +/* Calculate missing length in need of allocation */
> +if (new_headroom != dp_packet_headroom(b)) {
> +/* This is a headroom adjustment. Find out how much, if anything 
> */
> +head_diff = new_headroom - dp_packet_headroom(b);
> +
> +if (head_diff <= new_tailroom) {
> +miss_len = 0;
> +} else {
> +miss_len = head_diff - new_tailroom;
> +}
> +} else {
> +/* Otherwise this is a tailroom adjustment */
> +miss_len = new_tailroom - dp_packet_tailroom(b);
> +}
> +
> +mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
> +/* All new allocated mbuf's max data len is the same */
> +max_data_len = mbuf->buf_len - mbuf->data_off;
> +/* Calculate # of needed mbufs to accomodate 'miss_len' */
> +nb_segs = miss_len / max_data_len;
> +if (miss_len % max_data_len) {
> +nb_segs += 1;
> +}
> +
> +/* Proceed with the allocation of new mbufs */
> +mp = mbuf->pool;
> +fmbuf = mbuf;
> +mbuf = rte_pktmbuf_lastseg(mbuf);
> +
> +for (int i = 0; i < nb_segs; i++) {
> +/* This takes care of initialising buf_len, data_len and other
> + * fields properly */
> +struct dp_packet *pkt = dpdk_buf_alloc(mp);
> +mbuf->next = CONST_CAST(struct rte_mbuf *, &pkt->mbuf);
> +if (!mbuf->next) {
> +/* XXX: Flag a WARN and return an error as this will likely
> + * result in undefined behaviour */
> +free_dpdk_buf((struct dp_packet *) fmbuf);
> +fmbuf = NULL;
> +break;
> +}
> 
> +fmbuf->nb_segs += 1;
> +
> +mbuf = mbuf->next;
> +}
> +
> +/* Deal with headroom. If 'head_diff' calculated above is different
> + * than zero (thus, positive) the 'new_headroom' requested is bigger
> + * than the currently available headroom and thus, now that the
> needed
> + * space and mbufs have been allocated, the data needs to be shift
> + * right.
> + * XXX: Doesn't verify or take into account that new_headroom may 
> fall
> + * outside of an mbuf limits */
> +if (head_diff > 0) {
> +new_headroom = dp_packet_headroom(b) + head_diff;
> +
> +dp_packet_shift(b, head_diff);
> +}
> +
> +new_base = dp_packet_base(b);
> +
> +break;
> +#else
> +OVS_NOT_REACHED();
> +#endif
> +}
>  case DPBUF_MALLOC:
>  if (new_headroom == dp_packet_headroom(b)) {
>  new_base = xrealloc(dp_packet_base(b), new_allocated);
> @@ -263,7 +341,9 @@ dp_packet_resize__(struct dp_packet *b, size_t
> new_headroom, size_t new_tailroom
>  OVS_NOT_REACHED();
>  }
> 
> -dp_packet_set_allocated(b, new_allocated);
> +if (b->source != DPBUF_DPDK) {
> +dp_packet_set_allocated(b, new_allocated);
> +}
>  dp_packet_set_base(b, new_base);
> 
>  new_data = (char *) new_base + new_headroom;
> --
> 2.7.4
> 
> 

Re: [ovs-dev] [RFC v7 07/13] dp-packet: Handle multi-seg mubfs in shift() func.

2018-05-28 Thread Loftus, Ciara
> 
> In its current implementation dp_packet_shift() is also unaware of
> multi-seg mbufs (that hold data in memory non-contiguously) and assumes
> that data exists contiguously in memory, memmove'ing data to perform the
> shift.
> 
> To add support for multi-seg mbufs a new set of functions was
> introduced, dp_packet_mbuf_shift() and dp_packet_mbuf_write(). These
> functions are used by dp_packet_shift(), when handling multi-seg mbufs,
> to shift and write data within a chain of mbufs.
> 
> Signed-off-by: Tiago Lam 
> ---
>  lib/dp-packet.c | 102
> +++-
>  1 file changed, 101 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 1b39946..7b603c7 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -294,13 +294,113 @@ dp_packet_prealloc_headroom(struct dp_packet
> *b, size_t size)
>  }
>  }
> 
> +#ifdef DPDK_NETDEV
> +/* XXX: Caller must provide the correct mbuf where the offset is located. It
> +   must also guarantee that 'ofs' is within bounds */
> +static void
> +dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
> + const void *data)
> +{
> +char *dst_addr;
> +uint16_t data_len;
> +int len_copy;
> +while (mbuf) {
> +if (len == 0) {
> +break;
> +}
> +
> +dst_addr = rte_pktmbuf_mtod_offset(mbuf, char *, ofs);
> +data_len = MBUF_BUF_END(mbuf->buf_addr, mbuf->buf_len) -
> dst_addr;
> +
> +len_copy = MIN(len, data_len);
> +/* We don't know if 'data' is the result of a rte_pktmbuf_read call, 
> in
> + * which case we may end up writing to the same region of memory we
> are
> + * reading from and overlapping. Thus, the use of memmove */
> +memmove(dst_addr, data, len_copy);
> +
> +data = ((char *) data) + len_copy;
> +len -= len_copy;
> +ofs = 0;
> +
> +mbuf = mbuf->next;
> +}
> +}
> +
> +static void
> +dp_packet_mbuf_shift_(struct rte_mbuf *dbuf, int16_t dst_ofs,
> +  const struct rte_mbuf *sbuf, uint16_t src_ofs, int len)
> +{
> +char rd[len];
> +
> +const char *wd = rte_pktmbuf_read(sbuf, src_ofs, len, rd);
> +if (!wd) {
> +/* XXX: Log WARN here. */
> +return;
> +}
> +
> +dp_packet_mbuf_write(dbuf, dst_ofs, len, wd);
> +}
> +
> +/* XXX: Delta must fall within the bounds of an mbuf */
> +static void
> +dp_packet_mbuf_shift(struct dp_packet *b, int delta)
> +{
> +struct rte_mbuf *sbuf, *dbuf;
> +uint16_t src_ofs;
> +int16_t dst_ofs;
> +
> +struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
> +dbuf = mbuf;
> +sbuf = mbuf;
> +
> +/* Set the destination address to copy to */
> +if (delta < 0) {
> +dst_ofs = delta;
> +src_ofs = 0;
> +} else {
> +dst_ofs = delta;
> +src_ofs = 0;
> +}

The outcome of the if and else is the same.

> +
> +/* Shift data from src mbuf and offset to dst mbuf and offset */
> +dp_packet_mbuf_shift_(dbuf, dst_ofs, sbuf, src_ofs,
> +  rte_pktmbuf_pkt_len(sbuf));
> +
> +/* Update mbufs' properties, depending if there's one mbuf or more */
> +mbuf->data_off = mbuf->data_off + dst_ofs;
> +
> +struct rte_mbuf *tmbuf = rte_pktmbuf_lastseg(mbuf);
> +if (dbuf != tmbuf) {
> +tmbuf->data_len += delta;
> +mbuf->data_len -= dst_ofs;
> +}
> +}
> +#endif
> +
>  /* Shifts all of the data within the allocated space in 'b' by 'delta' bytes.
>   * For example, a 'delta' of 1 would cause each byte of data to move one
> byte
>   * forward (from address 'p' to 'p+1'), and a 'delta' of -1 would cause each
> - * byte to move one byte backward (from 'p' to 'p-1'). */
> + * byte to move one byte backward (from 'p' to 'p-1').
> + * Note for DPBUF_DPDK(XXX): The shift can only move within the bounds
> of an
> + * mbuf since it assumes the source and destination offsets will be on the
> + * same mbuf. */
>  void
>  dp_packet_shift(struct dp_packet *b, int delta)
>  {
> +#ifdef DPDK_NETDEV
> + if (b->source == DPBUF_DPDK) {
> +ovs_assert(delta > 0 ? delta <= dp_packet_tailroom(b)
> +   : delta < 0 ? -delta <= dp_packet_headroom(b)
> +   : true);
> +
> +
> +if (delta != 0) {

Could remove some code duplication here. The assert & check for delta !=0 is 
common for both DPDK_NETDEV and !DPDK_NETDEV

> +dp_packet_mbuf_shift(b, delta);
> +}
> +
> +return;
> +}
> +#endif
>  ovs_assert(delta > 0 ? delta <= dp_packet_tailroom(b)
> : delta < 0 ? -delta <= dp_packet_headroom(b)
> : true);
> --
> 2.7.4
> 

Re: [ovs-dev] [RFC v7 05/13] dp-packet: Handle multi-seg mbufs in helper funcs.

2018-05-28 Thread Loftus, Ciara
> 
> Most helper functions in dp-packet assume that the data held by a
> dp_packet is contiguous, and perform operations such as pointer
> arithmetic under that assumption. However, with the introduction of
> multi-segment mbufs, where data is non-contiguous, such assumptions are
> no longer possible. Some examples of such helper functions are
> dp_packet_tail(), dp_packet_tailroom(), dp_packet_end(),
> dp_packet_get_allocated() and dp_packet_at().
> 
> Thus, instead of assuming contiguous data in dp_packet, they  now
> iterate over the (non-contiguous) data in mbufs to perform their
> calculations.
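
To make that concrete, locating a byte at a given packet offset in a chained
mbuf, which is the kind of iteration helpers such as dp_packet_at() now need,
could look roughly like this (a hedged sketch with a made-up helper name, not
the patch's code):

    #include <rte_mbuf.h>

    /* Return a pointer to byte 'offset' of the packet held in chained mbuf
     * 'm', or NULL if the packet is shorter than that. */
    static void *
    mbuf_at_offset(const struct rte_mbuf *m, uint32_t offset)
    {
        while (m && offset >= m->data_len) {
            offset -= m->data_len;   /* skip whole segments */
            m = m->next;
        }
        return m ? rte_pktmbuf_mtod_offset(m, char *, offset) : NULL;
    }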
> 
> Co-authored-by: Mark Kavanagh 
> 
> Signed-off-by: Mark Kavanagh 
> Signed-off-by: Tiago Lam 
> ---
>  lib/dp-packet.h | 252
> +---
>  1 file changed, 205 insertions(+), 47 deletions(-)
> 
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index e79fb24..4d4b420 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -80,6 +80,11 @@ struct dp_packet {
>  };
>  };
> 
> +#ifdef DPDK_NETDEV
> +#define MBUF_BUF_END(BUF_ADDR, BUF_LEN) \
> +(char *) (((char *) BUF_ADDR) + BUF_LEN)
> +#endif
> +
>  static inline void *dp_packet_data(const struct dp_packet *);
>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>  static inline void *dp_packet_base(const struct dp_packet *);
> @@ -133,6 +138,10 @@ static inline void *dp_packet_at(const struct
> dp_packet *, size_t offset,
>   size_t size);
>  static inline void *dp_packet_at_assert(const struct dp_packet *,
>  size_t offset, size_t size);
> +#ifdef DPDK_NETDEV
> +static inline void * dp_packet_at_offset(const struct dp_packet *b,
> + size_t offset);
> +#endif
>  static inline void *dp_packet_tail(const struct dp_packet *);
>  static inline void *dp_packet_end(const struct dp_packet *);
> 
> @@ -180,40 +189,6 @@ dp_packet_delete(struct dp_packet *b)
>  }
>  }
> 
> -/* If 'b' contains at least 'offset + size' bytes of data, returns a pointer 
> to
> - * byte 'offset'.  Otherwise, returns a null pointer. */
> -static inline void *
> -dp_packet_at(const struct dp_packet *b, size_t offset, size_t size)
> -{
> -return offset + size <= dp_packet_size(b)
> -   ? (char *) dp_packet_data(b) + offset
> -   : NULL;
> -}
> -
> -/* Returns a pointer to byte 'offset' in 'b', which must contain at least
> - * 'offset + size' bytes of data. */
> -static inline void *
> -dp_packet_at_assert(const struct dp_packet *b, size_t offset, size_t size)
> -{
> -ovs_assert(offset + size <= dp_packet_size(b));
> -return ((char *) dp_packet_data(b)) + offset;
> -}
> -
> -/* Returns a pointer to byte following the last byte of data in use in 'b'. 
> */
> -static inline void *
> -dp_packet_tail(const struct dp_packet *b)
> -{
> -return (char *) dp_packet_data(b) + dp_packet_size(b);
> -}
> -
> -/* Returns a pointer to byte following the last byte allocated for use (but
> - * not necessarily in use) in 'b'. */
> -static inline void *
> -dp_packet_end(const struct dp_packet *b)
> -{
> -return (char *) dp_packet_base(b) + dp_packet_get_allocated(b);
> -}
> -
>  /* Returns the number of bytes of headroom in 'b', that is, the number of
> bytes
>   * of unused space in dp_packet 'b' before the data that is in use.  (Most
>   * commonly, the data in a dp_packet is at its beginning, and thus the
> @@ -224,19 +199,63 @@ dp_packet_headroom(const struct dp_packet *b)
>  return (char *) dp_packet_data(b) - (char *) dp_packet_base(b);
>  }
> 
> +#ifdef DPDK_NETDEV
> +static inline struct rte_mbuf *
> +dp_packet_mbuf_tail(const struct dp_packet *b)
> +{
> +struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
> +if (b->source == DPBUF_DPDK) {
> +
> +/* Find last segment where data ends, meaning the tail of the chained
> +   mbufs is there */
> +struct rte_mbuf *pmbuf;
> +while (buf) {
> +if (MBUF_BUF_END(buf->buf_addr, buf->buf_len) !=
> +rte_pktmbuf_mtod_offset(buf, char *, buf->data_len)) {
> +break;
> +}
> +
> +pmbuf = buf;
> +buf = buf->next;
> +}
> +
> +return (buf == NULL) ? pmbuf : buf;
> +} else {
> +return buf;
> +}
> +}
> +#endif

dp_packet_tail() seems to perform the exact same function. Can these functions 
be consolidated into one?

Am I right to assume that the tail will not necessarily be the last segment? 
Otherwise would rte_pktmbuf_lastseg() 
(http://dpdk.org/doc/api/rte__mbuf_8h.html#a0c2d001cd113362dd123d663210afcbb) 
be useful here?
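
If the data tail were guaranteed to live in the last segment, the loop could
presumably collapse to something like the sketch below (hypothetical, and it
hinges on exactly the question above):

    static inline struct rte_mbuf *
    dp_packet_mbuf_tail(const struct dp_packet *b)
    {
        struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);

        /* For DPDK packets, assume the tail sits in the last segment. */
        return b->source == DPBUF_DPDK ? rte_pktmbuf_lastseg(buf) : buf;
    }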

> +
>  /* Returns the number of bytes that may be appended to the tail end of
>   * dp_packet 'b' before the dp_packet must be reallocated. */
>  static inline size_t
>  dp_packet_tailroom(const 

Re: [ovs-dev] [RFC v7 06/13] dp-packet: Handle multi-seg mbufs in put*() funcs.

2018-05-28 Thread Loftus, Ciara
> 
> The dp_packet_put*() functions - dp_packet_put_uninit(), dp_packet_put()
> and dp_packet_put_zeros() - are, in their current implementation,
> operating on the data buffer of a dp_packet as if it were contiguous,
> which in the case of multi-segment mbufs means they operate on the first
> mbuf in the chain. However, in the case of dp_packet_put_uninit(), for
> example, it is the data length of the last mbuf in the mbuf chain that
> should be adjusted. These functions have thus been modified to support
> multi-segment mbufs.
> 
> Additionally, most of the core logic in dp_packet_put_uninit() was moved
> to a new helper function, dp_packet_put_uninit_(), to abstract the
> implementation details from the API, since in the case of multi-seg
> mbufs a new struct is returned that holds the mbuf and offset that
> constitute the tail. For the single mbuf case a pointer to the byte that
> constitutes the tail is still returned.
> 
> Co-authored-by: Mark Kavanagh 
> 
> Signed-off-by: Mark Kavanagh 
> Signed-off-by: Tiago Lam 
> ---
>  lib/dp-packet.c | 97
> +
>  lib/dp-packet.h |  5 +++
>  2 files changed, 89 insertions(+), 13 deletions(-)
> 
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 782e7c2..1b39946 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -312,27 +312,66 @@ dp_packet_shift(struct dp_packet *b, int delta)
>  }
>  }
> 
> -/* Appends 'size' bytes of data to the tail end of 'b', reallocating and
> - * copying its data if necessary.  Returns a pointer to the first byte of the
> - * new data, which is left uninitialized. */

Is the function rte_pktmbuf_append() 
(http://dpdk.org/doc/api/rte__mbuf_8h.html#ad5b0cd686ad3bcbb83416ca8395a080b) 
of any use here?
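
For what it's worth, rte_pktmbuf_append() bumps data_len of the last segment
and pkt_len of the chain and returns a pointer to the appended area, but it
returns NULL rather than allocating a new segment when the last segment lacks
enough tailroom, so it would only cover the simple case. A hedged sketch of
that simple case (made-up helper, not proposed code for this series):

    #include <string.h>
    #include <rte_mbuf.h>

    /* Append 'size' zeroed bytes to mbuf chain 'm' if the last segment has
     * room; returns the start of the new area, or NULL if it does not fit. */
    static void *
    mbuf_put_zeros(struct rte_mbuf *m, uint16_t size)
    {
        char *tail = rte_pktmbuf_append(m, size);

        if (tail) {
            memset(tail, 0, size);
        }
        return tail;
    }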

> -void *
> -dp_packet_put_uninit(struct dp_packet *b, size_t size)
> +static void *
> +dp_packet_put_uninit_(struct dp_packet *b, size_t size)
>  {
>  void *p;
>  dp_packet_prealloc_tailroom(b, size);
> +#ifdef DPDK_NETDEV
> +p = dp_packet_mbuf_tail(b);
> +/* In the case of multi-segment mbufs, the data length of the last mbuf
> + * should be adjusted by 'size' bytes. The packet length of the entire
> + * mbuf chain (stored in the first mbuf of said chain) is adjusted in
> + * the normal execution path below.
> + */
> +
> +struct rte_mbuf *tmbuf = (struct rte_mbuf *) p;
> +struct mbuf_tail *mbuf_tail = xmalloc(sizeof(*mbuf_tail));
> +size_t pkt_len = size;
> +
> +mbuf_tail->mbuf = tmbuf;
> +mbuf_tail->ofs = tmbuf->data_len;
> +
> +/* Adjust size of intermediate mbufs from current tail to end */
> +while (tmbuf && pkt_len > 0) {
> +tmbuf->data_len = MIN(pkt_len, tmbuf->buf_len - tmbuf->data_off);
> +pkt_len -= tmbuf->data_len;
> +
> +tmbuf = tmbuf->next;
> +}
> +
> +p = (void *) mbuf_tail;
> +#else
>  p = dp_packet_tail(b);
> +#endif
>  dp_packet_set_size(b, dp_packet_size(b) + size);
> +
>  return p;
>  }
> 
> -/* Appends 'size' zeroed bytes to the tail end of 'b'.  Data in 'b' is
> - * reallocated and copied if necessary.  Returns a pointer to the first byte 
> of
> - * the data's location in the dp_packet. */
> +static void *
> +dp_packet_tail_to_addr(struct dp_packet *b OVS_UNUSED, void *p) {
> +#ifdef DPDK_NETDEV
> +struct mbuf_tail *mbuf_tail = (struct mbuf_tail *) p;
> +p = (void *) rte_pktmbuf_mtod_offset(mbuf_tail->mbuf, void *,
> + mbuf_tail->ofs);
> +#endif
> +
> +return p;
> +}
> +
> +/* Appends 'size' bytes of data to the tail end of 'b', reallocating and
> + * copying its data if necessary.  Returns a pointer to the first byte of the
> + * new data, which is left uninitialized. */
>  void *
> -dp_packet_put_zeros(struct dp_packet *b, size_t size)
> +dp_packet_put_uninit(struct dp_packet *b, size_t size)
>  {
> -void *dst = dp_packet_put_uninit(b, size);
> -memset(dst, 0, size);
> +void *tail = dp_packet_put_uninit_(b, size);
> +
> +void *dst = dp_packet_tail_to_addr(b, tail);
> +
> +free(tail);

For the !DPDK_NETDEV case, the code remains the same (as it should) except for
this call to free(). In that case 'tail' is the pointer returned by
dp_packet_tail(), which points into the packet's own buffer rather than to a
separately allocated struct, so freeing it could cause a problem.

> +
>  return dst;
>  }
> 
> @@ -342,8 +381,40 @@ dp_packet_put_zeros(struct dp_packet *b, size_t
> size)
>  void *
>  dp_packet_put(struct dp_packet *b, const void *p, size_t size)
>  {
> -void *dst = dp_packet_put_uninit(b, size);
> -memcpy(dst, p, size);
> +void *tail = dp_packet_put_uninit_(b, size);
> +#ifdef DPDK_NETDEV
> +struct mbuf_tail *mbuf_tail = (struct mbuf_tail *) tail;
> +
> +dp_packet_mbuf_write(mbuf_tail->mbuf, mbuf_tail->ofs, size, p);

This function doesn't exist yet.
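
For what it's worth, I assume it will end up looking something like the below
(hypothetical, untested sketch - presumably the real helper lands in a later
patch of the series; MIN comes from OVS util.h, rte_pktmbuf_mtod_offset from
DPDK):

#ifdef DPDK_NETDEV
/* Hypothetical sketch of dp_packet_mbuf_write(): copy 'len' bytes from 'src'
 * into the chain, starting 'ofs' bytes into 'mbuf' and spilling into the
 * following segments. Assumes each segment's data_len has already been
 * adjusted by the put*() caller. */
static void
dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
                     const void *src)
{
    const char *from = src;

    while (mbuf && len) {
        uint32_t copy = MIN(len, (uint32_t) mbuf->data_len - ofs);

        memcpy(rte_pktmbuf_mtod_offset(mbuf, char *, ofs), from, copy);
        from += copy;
        len -= copy;
        ofs = 0;
        mbuf = mbuf->next;
    }
}
#endif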

> +#else
> +memcpy(tail, p, size);
> +#endif
> +
> +void *dst = dp_packet_tail_to_addr(b, tail);
> +
> +free(tail);
> +
> +return dst;
> +}
> +
> +/* Appends 'size' zeroed 

Re: [ovs-dev] [RFC v7 01/13] netdev-dpdk: fix mbuf sizing

2018-05-28 Thread Loftus, Ciara
> 
> From: Mark Kavanagh 
> 
> There are numerous factors that must be considered when calculating
> the size of an mbuf:
> - the data portion of the mbuf must be sized in accordance with Rx
>   buffer alignment (typically 1024B). So, for example, in order to
>   successfully receive and capture a 1500B packet, mbufs with a
>   data portion of size 2048B must be used.
> - in OvS, the elements that comprise an mbuf are:
>   * the dp packet, which includes a struct rte_mbuf (704B)
>   * RTE_PKTMBUF_HEADROOM (128B)
>   * packet data (aligned to 1k, as previously described)
>   * RTE_PKTMBUF_TAILROOM (typically 0)
> 
> Some PMDs require that the total mbuf size (i.e. the total sum of all
> of the above-listed components' lengths) is cache-aligned. To satisfy
> this requirement, it may be necessary to round up the total mbuf size
> with respect to cacheline size. In doing so, it's possible that the
> dp_packet's data portion is inadvertently increased in size, such that
> it no longer adheres to Rx buffer alignment. Consequently, the
> following property of the mbuf no longer holds true:
> 
> mbuf.data_len == mbuf.buf_len - mbuf.data_off
> 
> This creates a problem in the case of multi-segment mbufs, where that
> assumption is assumed to be true for all but the final segment in an
> mbuf chain. Resolve this issue by adjusting the size of the mbuf's
> private data portion, as opposed to the packet data portion when
> aligning mbuf size to cachelines.
> 
> Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
> Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
> CC: Santosh Shukla 
> Signed-off-by: Mark Kavanagh 
> ---
>  lib/netdev-dpdk.c | 46 ++
>  1 file changed, 30 insertions(+), 16 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 87152a7..356a967 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -82,12 +82,6 @@ static struct vlog_rate_limit rl =
> VLOG_RATE_LIMIT_INIT(5, 20);
>   + (2 * VLAN_HEADER_LEN))
>  #define MTU_TO_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_LEN +
> ETHER_CRC_LEN)
>  #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) +
> ETHER_HDR_MAX_LEN)
> -#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)\
> - - ETHER_HDR_LEN - ETHER_CRC_LEN)
> -#define MBUF_SIZE(mtu)
> ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
> - + sizeof(struct dp_packet) \
> - + RTE_PKTMBUF_HEADROOM),   \
> - RTE_CACHE_LINE_SIZE)
>  #define NETDEV_DPDK_MBUF_ALIGN  1024
>  #define NETDEV_DPDK_MAX_PKT_LEN 9728
> 
> @@ -491,7 +485,7 @@ is_dpdk_class(const struct netdev_class *class)
>   * behaviour, which reduces performance. To prevent this, use a buffer size
>   * that is closest to 'mtu', but which satisfies the aforementioned criteria.
>   */
> -static uint32_t
> +static uint16_t
>  dpdk_buf_size(int mtu)
>  {
>  return ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) +
> RTE_PKTMBUF_HEADROOM),
> @@ -582,7 +576,7 @@ dpdk_mp_do_not_free(struct rte_mempool *mp)
> OVS_REQUIRES(dpdk_mp_mutex)
>   *  - a new mempool was just created;
>   *  - a matching mempool already exists. */
>  static struct rte_mempool *
> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> +dpdk_mp_create(struct netdev_dpdk *dev, uint16_t mbuf_pkt_data_len)
>  {
>  char mp_name[RTE_MEMPOOL_NAMESIZE];
>  const char *netdev_name = netdev_get_name(&dev->up);
> @@ -590,6 +584,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>  uint32_t n_mbufs;
>  uint32_t hash = hash_string(netdev_name, 0);
>  struct rte_mempool *mp = NULL;
> +uint16_t mbuf_size, aligned_mbuf_size, mbuf_priv_data_len;
> 
>  /*
>   * XXX: rough estimation of number of mbufs required for this port:
> @@ -609,12 +604,13 @@ dpdk_mp_create(struct netdev_dpdk *dev, int
> mtu)
>   * longer than RTE_MEMPOOL_NAMESIZE. */
>  int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> "ovs%08x%02d%05d%07u",
> -   hash, socket_id, mtu, n_mbufs);
> +   hash, socket_id, mbuf_pkt_data_len, n_mbufs);

%u for printing the unsigned mbuf_pkt_data_len

>  if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>  VLOG_DBG("snprintf returned %d. "
>   "Failed to generate a mempool name for \"%s\". "
> - "Hash:0x%x, socket_id: %d, mtu:%d, mbufs:%u.",
> - ret, netdev_name, hash, socket_id, mtu, n_mbufs);
> + "Hash:0x%x, socket_id: %d, pkt data room:%d, mbufs:%u.",
> + ret, netdev_name, hash, socket_id, mbuf_pkt_data_len,
> + n_mbufs);

Same here

>  break;
>  

Re: [ovs-dev] [RFC v7 00/13] Support multi-segment mbufs

2018-05-28 Thread Loftus, Ciara
> 
> Overview
> 
> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
> Multi-segment mbufs are typically used when the size of an mbuf is
> insufficient to contain the entirety of a packet's data. Instead, the
> data is split across numerous mbufs, each carrying a portion, or
> 'segment', of the packet data. mbufs are chained via their 'next'
> attribute (an mbuf pointer).

Hi Tiago,

Thanks for the patch. I've reviewed the code and have some feedback. I did not 
test or verify functionality. Some general comments first:
* Compilation warning: "lib/dp-packet.h:243:22: error: 'mbuf' may be used 
uninitialized"
* The dp-packet.* files are scattered with many more '#ifdef DPDK_NETDEV's. 
Where possible I would suggest co-locating the DPDK code and trying to minimise
these branches.

The rest of my comments will be inline in the code.

Thanks,
Ciara

> 
> Use Cases
> =
> i.  Handling oversized (guest-originated) frames, which are marked
> for hardware acceleration/offload (TSO, for example).
> 
> Packets which originate from a non-DPDK source may be marked for
> offload; as such, they may be larger than the permitted ingress
> interface's MTU, and may be stored in an oversized dp-packet. In
> order to transmit such packets over a DPDK port, their contents
> must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
> its current implementation, that function only copies data into
> a single mbuf; if the space available in the mbuf is exhausted,
> but not all packet data has been copied, then it is lost.
> Similarly, when cloning a DPDK mbuf, it must be considered
> whether that mbuf contains multiple segments. Both issues are
> resolved within this patchset.
> 
> ii. Handling jumbo frames.
> 
> While OvS already supports jumbo frames, it does so by increasing
> mbuf size, such that the entirety of a jumbo frame may be handled
> in a single mbuf. This is certainly the preferred, and most
> performant approach (and remains the default). However, it places
> high demands on system memory; multi-segment mbufs may be
> preferable for systems which are memory-constrained.
> 
> Enabling multi-segment mbufs
> 
> Multi-segment and single-segment mbufs are mutually exclusive, and the
> user must decide on which approach to adopt on init. The introduction
> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
> 
> This is a global boolean value, which determines how jumbo frames are
> represented across all DPDK ports. In the absence of a user-supplied
> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
> mbufs must be explicitly enabled / single-segment mbufs remain the
> default.
> 
> Setting the field is identical to setting existing DPDK-specific OVSDB
> fields:
> 
> ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
> ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
> ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
> 
> ---
> 
> v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>  - Address Ilya comments;
> - Fix non-DPDK build;
> - Serialise the access of non pmds to allocation and free of mbufs by
>   using a newly introduced mutex.
>  - Add a new set of tests that integrates with the recently added DPDK
>testsuite. These focus on allocating dp_packets, with a single or
>multiple mbufs, from an instantiated mempool and performing several
>operations on those, verifying if the data at the end matches what's
>expected;
>  - Fix bugs found by the new tests:
> - dp_packet_shift() wasn't taking into account shift lefts;
> - dp_packet_resize__() was misusing and miscalculating the tailrooms
>   and headrooms, ending up calculating the wrong number of segments
>   that needed allocation;
> - An mbuf's end was being miscalculated both in dp_packet_tail,
>   dp_packet_mbuf_tail() and dp_packet_end();
> - dp_packet_set_size() was not updating the number of chained
> segments
>   'nb_segs';
>  - Add support for multi-seg mbufs in dp_packet_clear().
> 
> v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>reuse.");
>  - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>support multi-seg mbufs;
>  - Add support for multi-seg mbufs in dp_packet_l4_size() and
>improve other helper funcs, such as dp_packet_tail() and dp_
>packet_tailroom().
>  - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>put_zeros(), as well as dp_packet_resize__() - allocating new
>mbufs and linking them together;
>  Restructured patchset:
>  - Squash patch 5 into patch 6, since they were both related 

Re: [ovs-dev] [PATCH v1] netdev-dpdk: Handle ENOTSUP for rte_eth_dev_set_mtu.

2018-05-17 Thread Loftus, Ciara
> 
> The function rte_eth_dev_set_mtu is not supported for all DPDK drivers.
> Currently if it is not supported we return an error in
> dpdk_eth_dev_queue_setup. There are two issues with this.
> 
> (i) A device can still function even if rte_eth_dev_set_mtu is not
> supported albeit with the default max rx packet length.
> 
> (ii) When the ENOTSUP is returned it will not be caught in
> port_reconfigure() at the dpif-netdev layer. Port_reconfigure() checks
> if a netdev_reconfigure() function is supported for a given netdev and
> ignores EOPNOTSUPP errors as it assumes errors of this value mean there
> is no reconfiguration function. In this case the reconfiguration function
> is supported for netdev dpdk but a function called as part of the
> reconfigure (rte_eth_dev_set_mtu) may not be supported.
> 
> As this is a corner case, this commit warns a user when
> rte_eth_dev_set_mtu is not supported and informs them of the default
> max rx packet length that will be used instead.
> 
> Signed-off-by: Ian Stokes 
> Co-author: Michal Weglicki 

Tested-by: Ciara Loftus 

Hi Ian,

Thanks for the patch. It needs a rebase.

PCAP vdevs fall under the (i) category. Before this patch I always observed a 
SEGV on pcap rx. Queues were not setup for pcap devices due to the failed 
mtu_set function, and later polling on those non-existent queues caused the 
SEGV. This patch removes this case and reports the error as described instead.

Thanks,
Ciara

> ---
>  lib/netdev-dpdk.c | 17 ++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 50c7619..e6c483d 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -771,6 +771,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
> *dev, int n_rxq, int n_txq)
>  int diag = 0;
>  int i;
>  struct rte_eth_conf conf = port_conf;
> +uint16_t conf_mtu;
> 
>  /* For some NICs (e.g. Niantic), scatter_rx mode needs to be explicitly
>   * enabled. */
> @@ -804,9 +805,19 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
> *dev, int n_rxq, int n_txq)
> 
>  diag = rte_eth_dev_set_mtu(dev->port_id, dev->mtu);
>  if (diag) {
> -VLOG_ERR("Interface %s MTU (%d) setup error: %s",
> -dev->up.name, dev->mtu, rte_strerror(-diag));
> -break;
> +/* A device may not support rte_eth_dev_set_mtu. In this case,
> + * flag a warning to the user and include the device's
> + * configured MTU value that will be used instead. */
> +if (-ENOTSUP == diag) {
> +rte_eth_dev_get_mtu(dev->port_id, &conf_mtu);
> +VLOG_WARN("Interface %s does not support MTU configuration, "
> +  "max packet size supported is %d.",
> +  dev->up.name, conf_mtu);
> +} else {
> +VLOG_ERR("Interface %s MTU (%d) setup error: %s",
> + dev->up.name, dev->mtu, rte_strerror(-diag));
> +break;
> +}
>  }
> 
>  for (i = 0; i < n_txq; i++) {
> --
> 2.7.5
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 1/1] netdev-dpdk: Don't use PMD driver if not configured successfully

2018-05-16 Thread Loftus, Ciara
> 
> When initialization of the DPDK PMD driver fails
> (dpdk_eth_dev_init()), the reconfigure_datapath() function will remove
> the port from dp_netdev, and the port is not used.
> 
> Now when bridge_reconfigure() is called again, no changes to the
> previous failing netdev configuration are detected and therefore the
> ports gets added to dp_netdev and used uninitialized. This is causing
> exceptions...
> 
> The fix has two parts to it. First in netdev-dpdk.c we remember if the
> DPDK port was started or not, and when calling
> netdev_dpdk_reconfigure() we also try re-initialization if the port
> was not already active. The second part of the change is in
> dpif-netdev.c where it makes sure netdev_reconfigure() is called if
> the port needs reconfiguration, as netdev_is_reconf_required() is only
> true until netdev_reconfigure() is called (even if it fails).
> 
> Signed-off-by: Eelco Chaudron 

Tested-by: Ciara Loftus 

Thanks for the patch, Eelco. I tested it by simulating an error in 
dpdk_eth_dev_init causing the dpdk port to be removed from dp_netdev. Before 
this patch, if a vhost port was added after this error, a reconfigure was 
triggered and I observed a SEGV. With this patch applied I no longer see a SEGV.

The patch needs a rebase and one minor style fix below.

Thanks,
Ciara

> ---
>  lib/dpif-netdev.c | 7 ---
>  lib/netdev-dpdk.c | 8 +++-
>  2 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index be31fd092..9b0916f4f 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3327,8 +3327,6 @@ port_reconfigure(struct dp_netdev_port *port)
>  struct netdev *netdev = port->netdev;
>  int i, err;
> 
> -port->need_reconfigure = false;
> -
>  /* Closes the existing 'rxq's. */
>  for (i = 0; i < port->n_rxq; i++) {
>  netdev_rxq_close(port->rxqs[i].rx);
> @@ -3338,7 +3336,7 @@ port_reconfigure(struct dp_netdev_port *port)
>  port->n_rxq = 0;
> 
>  /* Allows 'netdev' to apply the pending configuration changes. */
> -if (netdev_is_reconf_required(netdev)) {
> +if (netdev_is_reconf_required(netdev) || port->need_reconfigure) {
>  err = netdev_reconfigure(netdev);
>  if (err && (err != EOPNOTSUPP)) {
>  VLOG_ERR("Failed to set interface %s new configuration",
> @@ -3371,6 +3369,9 @@ port_reconfigure(struct dp_netdev_port *port)
>  /* Parse affinity list to apply configuration for new queues. */
>  dpif_netdev_port_set_rxq_affinity(port, port->rxq_affinity_list);
> 
> +/* If reconfiguration was successful mark it as such, so we can use it */
> +port->need_reconfigure = false;
> +
>  return 0;
>  }
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index b14140571..2a9128d68 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -356,6 +356,8 @@ struct netdev_dpdk {
> 
>  /* If true, device was attached by rte_eth_dev_attach(). */
>  bool attached;
> +/* If true, rte_eth_dev_start() was successfully called */
> +bool started;
>  struct eth_addr hwaddr;
>  int mtu;
>  int socket_id;
> @@ -816,6 +818,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>   rte_strerror(-diag));
>  return -diag;
>  }
> +dev->started = true;
> 
>  rte_eth_promiscuous_enable(dev->port_id);
>  rte_eth_allmulticast_enable(dev->port_id);
> @@ -1098,6 +1101,7 @@ netdev_dpdk_destruct(struct netdev *netdev)
>  ovs_mutex_lock(&dpdk_mutex);
> 
>  rte_eth_dev_stop(dev->port_id);
> +dev->started = false;
> 
>  if (dev->attached) {
>  rte_eth_dev_close(dev->port_id);
> @@ -3555,13 +3559,15 @@ netdev_dpdk_reconfigure(struct netdev
> *netdev)
>  && dev->mtu == dev->requested_mtu
>  && dev->rxq_size == dev->requested_rxq_size
>  && dev->txq_size == dev->requested_txq_size
> -&& dev->socket_id == dev->requested_socket_id) {
> +&& dev->socket_id == dev->requested_socket_id &&
> +dev->started) {

Move the && to the same line as dev->started to be coherent with the lines 
above.

>  /* Reconfiguration is unnecessary */
> 
>  goto out;
>  }
> 
>  rte_eth_dev_stop(dev->port_id);
> +dev->started = false;
> 
>  err = netdev_dpdk_mempool_configure(dev);
>  if (err && err != EEXIST) {
> --
> 2.14.3
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v9] netdev-dpdk: Add support for vHost dequeue zero copy (experimental)

2018-01-23 Thread Loftus, Ciara
> 
> On 19.01.2018 20:19, Ciara Loftus wrote:
> > Zero copy is disabled by default. To enable it, set the 'dq-zero-copy'
> > option to 'true' when configuring the Interface:
> >
> > ovs-vsctl set Interface dpdkvhostuserclient0
> > options:vhost-server-path=/tmp/dpdkvhostuserclient0
> > options:dq-zero-copy=true
> >
> > When packets from a vHost device with zero copy enabled are destined for
> > a single 'dpdk' port, the number of tx descriptors on that 'dpdk' port
> > must be set to a smaller value. 128 is recommended. This can be achieved
> > like so:
> >
> > ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> >
> > Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
> > to should not exceed 128. Due to this requirement, the feature is
> > considered 'experimental'.
> >
> > Testing of the patch showed a 15% improvement when switching 512B
> > packets between vHost devices on different VMs on the same host when
> > zero copy was enabled on the transmitting device.
> >
> > Signed-off-by: Ciara Loftus 
> > ---
> > v9:
> > * Rebase
> > * Fix docs issue
> > * Move variable assignment inside mutex
> > * Reset dq-zero-copy value if vhost_driver_register fails
> >
> >  Documentation/intro/install/dpdk.rst |  2 +
> >  Documentation/topics/dpdk/vhost-user.rst | 73
> 
> >  NEWS |  1 +
> >  lib/netdev-dpdk.c| 11 -
> >  vswitchd/vswitch.xml | 11 +
> >  5 files changed, 97 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/intro/install/dpdk.rst
> b/Documentation/intro/install/dpdk.rst
> > index 3fecb5c..087eb88 100644
> > --- a/Documentation/intro/install/dpdk.rst
> > +++ b/Documentation/intro/install/dpdk.rst
> > @@ -518,6 +518,8 @@ The above command sets the number of rx queues
> for DPDK physical interface.
> >  The rx queues are assigned to pmd threads on the same NUMA node in a
> >  round-robin fashion.
> >
> > +.. _dpdk-queues-sizes:
> > +
> >  DPDK Physical Port Queue Sizes
> >  ~~~
> >
> > diff --git a/Documentation/topics/dpdk/vhost-user.rst
> b/Documentation/topics/dpdk/vhost-user.rst
> > index 8447e2d..95517a6 100644
> > --- a/Documentation/topics/dpdk/vhost-user.rst
> > +++ b/Documentation/topics/dpdk/vhost-user.rst
> > @@ -458,3 +458,76 @@ Sample XML
> >  
> >
> >  .. _QEMU documentation: http://git.qemu-
> project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-
> user.txt;h=7890d7169;hb=HEAD
> > +
> > +vhost-user Dequeue Zero Copy (experimental)
> > +---
> > +
> > +Normally when dequeuing a packet from a vHost User device, a memcpy
> operation
> > +must be used to copy that packet from guest address space to host
> address
> > +space. This memcpy can be removed by enabling dequeue zero-copy like
> so::
> > +
> > +$ ovs-vsctl add-port br0 dpdkvhostuserclient0 -- set Interface \
> > +dpdkvhostuserclient0 type=dpdkvhostuserclient \
> > +options:vhost-server-path=/tmp/dpdkvhostclient0 \
> > +options:dq-zero-copy=true
> > +
> > +With this feature enabled, a reference (pointer) to the packet is passed to
> > +the host, instead of a copy of the packet. Removing this memcpy can give
> a
> > +performance improvement for some use cases, for example switching
> large packets
> > +between different VMs. However additional packet loss may be
> observed.
> > +
> > +Note that the feature is disabled by default and must be explicitly enabled
> > +by setting the ``dq-zero-copy`` option to ``true`` while specifying the
> > +``vhost-server-path`` option as above. If you wish to split out the
> command
> > +into multiple commands as below, ensure ``dq-zero-copy`` is set before
> > +``vhost-server-path``::
> > +
> > +$ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-
> copy=true
> > +$ ovs-vsctl set Interface dpdkvhostuserclient0 \
> > +options:vhost-server-path=/tmp/dpdkvhostclient0
> > +
> > +The feature is only available to ``dpdkvhostuserclient`` port types.
> > +
> > +A limitation exists whereby if packets from a vHost port with
> > +``dq-zero-copy=true`` are destined for a ``dpdk`` type port, the number of
> tx
> > +descriptors (``n_txq_desc``) for that port must be reduced to a smaller
> number,
> > +128 being the recommended value. This can be achieved by issuing the
> following
> > +command::
> > +
> > +$ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> > +
> > +Note: The sum of the tx descriptors of all ``dpdk`` ports the VM will send
> to
> > +should not exceed 128. For example, in case of a bond over two physical
> ports
> > +in balance-tcp mode, one must divide 128 by the number of links in the
> bond.
> > +
> > +Refer to :ref:`dpdk-queues-sizes` for more information.
> > +
> > +The reason for this limitation is due to how the zero copy functionality is
> > +implemented. The vHost device's 'tx 

Re: [ovs-dev] About : Enable optional dequeue zero copy for vHost User

2018-01-23 Thread Loftus, Ciara
Hi,

For the meantime this feature is proposed as ‘experimental’ for OVS DPDK.
Unless you are transmitting to a NIC, you don’t need to set the n_txq_desc.
My testing has been only with a DPDK driver in the guest. Have you tried that 
option?

Thanks,
Ciara

From: liyang07 [mailto:liyan...@corp.netease.com]
Sent: Wednesday, January 17, 2018 10:41 AM
To: Loftus, Ciara <ciara.lof...@intel.com>
Cc: d...@dpdk.org
Subject: About : Enable optional dequeue zero copy for vHost User


Hi Ciara,

I am testing the "vHost dequeue zero copy" function for VM-to-VM traffic on a
single host, and I have some problems:

1. The networking is OK before running iperf; I can ping successfully from vm1
to vm2. But after running iperf, the networking between vm1 and vm2 is down.
(I think n_txq_desc causes the problem.)

2. I know about the n_txq_desc limitation, but I cannot set n_txq_desc for a
dpdk port while the VMs are on the same host, because there are no dpdk ports
in use in my testing.



Thus, how can I resolve this? Thanks.




___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v8] netdev-dpdk: Add support for vHost dequeue zero copy (experimental)

2018-01-19 Thread Loftus, Ciara
> 
> On 05.01.2018 19:13, Ciara Loftus wrote:
> > Zero copy is disabled by default. To enable it, set the 'dq-zero-copy'
> > option to 'true' when configuring the Interface:
> >
> > ovs-vsctl set Interface dpdkvhostuserclient0
> > options:vhost-server-path=/tmp/dpdkvhostuserclient0
> > options:dq-zero-copy=true
> >
> > When packets from a vHost device with zero copy enabled are destined for
> > a single 'dpdk' port, the number of tx descriptors on that 'dpdk' port
> > must be set to a smaller value. 128 is recommended. This can be achieved
> > like so:
> >
> > ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> >
> > Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
> > to should not exceed 128. Due to this requirement, the feature is
> > considered 'experimental'.
> >
> > Testing of the patch showed a 15% improvement when switching 512B
> > packets between vHost devices on different VMs on the same host when
> > zero copy was enabled on the transmitting device.
> >
> > Signed-off-by: Ciara Loftus 
> > ---
> > v8:
> > * Disallow configurability after vHost device has been registered &
> > update docs accordingly.
> > * Give performance datapoint in commit message.
> >
> >  Documentation/intro/install/dpdk.rst |  2 +
> >  Documentation/topics/dpdk/vhost-user.rst | 72
> 
> >  NEWS |  1 +
> >  lib/netdev-dpdk.c|  9 +++-
> >  vswitchd/vswitch.xml | 11 +
> >  5 files changed, 94 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/intro/install/dpdk.rst
> b/Documentation/intro/install/dpdk.rst
> > index 3fecb5c..087eb88 100644
> > --- a/Documentation/intro/install/dpdk.rst
> > +++ b/Documentation/intro/install/dpdk.rst
> > @@ -518,6 +518,8 @@ The above command sets the number of rx queues
> for DPDK physical interface.
> >  The rx queues are assigned to pmd threads on the same NUMA node in a
> >  round-robin fashion.
> >
> > +.. _dpdk-queues-sizes:
> > +
> >  DPDK Physical Port Queue Sizes
> >  ~~~
> >
> > diff --git a/Documentation/topics/dpdk/vhost-user.rst
> b/Documentation/topics/dpdk/vhost-user.rst
> > index 8447e2d..1a6c6d0 100644
> > --- a/Documentation/topics/dpdk/vhost-user.rst
> > +++ b/Documentation/topics/dpdk/vhost-user.rst
> > @@ -458,3 +458,75 @@ Sample XML
> >  
> >
> >  .. _QEMU documentation: http://git.qemu-
> project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-
> user.txt;h=7890d7169;hb=HEAD
> > +
> > +vhost-user Dequeue Zero Copy (experimental)
> > +---
> > +
> > +Normally when dequeuing a packet from a vHost User device, a memcpy
> operation
> > +must be used to copy that packet from guest address space to host
> address
> > +space. This memcpy can be removed by enabling dequeue zero-copy like
> so::
> > +
> > +$ ovs-vsctl add-port br0 dpdkvhostuserclient0 -- set Interface \
> > +dpdkvhostuserclient0 type=dpdkvhostuserclient \
> > +options:vhost-server-path=/tmp/dpdkvhostclient0 \
> > +options:dq-zero-copy=true
> > +
> > +With this feature enabled, a reference (pointer) to the packet is passed to
> > +the host, instead of a copy of the packet. Removing this memcpy can give
> a
> > +performance improvement for some use cases, for example switching
> large packets
> > +between different VMs. However additional packet loss may be
> observed.
> > +
> > +Note that the feature is disabled by default and must be explicitly enabled
> > +by setting the 'dq-zero-copy' option to 'true' while specifying the
> > +'vhost-server-path' option as above. If you wish to split out the command
> into
> > +multiple commands as below, ensure 'dq-zero-copy' is set before
> > +'vhost-server-path'::
> > +
> > +$ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-
> copy=true
> > +$ ovs-vsctl set Interface dpdkvhostuserclient0 \
> > +options:vhost-server-path=/tmp/dpdkvhostclient0
> > +
> > +The feature is only available to dpdkvhostuserclient port types.
> > +
> > +A limitation exists whereby if packets from a vHost port with dq-zero-
> copy=true
> > +are destined for a 'dpdk' type port, the number of tx descriptors
> (n_txq_desc)
> > +for that port must be reduced to a smaller number, 128 being the
> recommended
> > +value. This can be achieved by issuing the following command::
> > +
> > +$ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> > +
> > +Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send to
> > +should not exceed 128. For example, in case of a bond over two physical
> ports
> > +in balance-tcp mode, one must divide 128 by the number of links in the
> bond.
> > +
> > +Refer to :ref:`` for more information.
> > +
> > +The reason for this limitation is due to how the zero copy functionality is
> > +implemented. The vHost device's 'tx used vring', a virtio structure 

Re: [ovs-dev] [ovs-dev, v5, 2/2] netdev-dpdk: Enable optional dequeue zero copy for vHost User

2017-12-18 Thread Loftus, Ciara
> 
> On 18.12.2017 15:28, Loftus, Ciara wrote:
> >>
> >> Not a full review.
> >
> > Thanks for your feedback.
> >
> >>
> >> General thoughts:
> >>
> >> If following conditions are true:
> >>
> >> 1. We don't need to add new feature to deprecated vhostuser port.
> >
> > Agree.
> >
> >>
> >> 2. We actually don't need to have ability to change ZC config if vhost-
> server-
> >> path
> >>already configured.
> >>
> >>Let me explain this condition:
> >>To change 'dq-zero-copy' if VM already started we need to stop VM,
> >> change the
> >>'dq-zero-copy' and start the VM back. It's not much simpler than stop
> VM,
> >> re-add
> >>port with different value for 'dq-zero-copy' and start VM back. I'm
> assuming
> >> here
> >>that VM restart is much more complex and unlikely operation than port
> >> removing
> >>from OVS and adding back. This will require documenting, but we
> already
> >> have a
> >>bunch of docs about how to modify this config option in current version.
> >
> > Don't fully agree. I can't say whether your assumption is correct.
> > Port-deletion & re-addition has repercussions eg. loss of statistics from an
> interface. Retaining those stats might be important for some.
> > Why not leave it up to the user to choose their preferred method of
> enabling feature ie. reboot the VM or delete & add the port.
> >
> >>
> >> we may drop patch #1 from the series and use just something like
> following
> >> code instead of code changes in this patch (not tested, just for a
> reference):
> >>
> >> ---
> >> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >> index 2325f0e..c4cbf7c 100644
> >> --- a/lib/netdev-dpdk.c
> >> +++ b/lib/netdev-dpdk.c
> >> @@ -379,6 +379,9 @@ struct netdev_dpdk {
> >>  /* True if vHost device is 'up' and has been reconfigured at 
> >> least once
> */
> >>  bool vhost_reconfigured;
> >>  /* 3 pad bytes here. */
> >> +
> >> +/* True if dq-zero-copy feature requested for vhost device. */
> >> +bool vhost_dq_zc;
> >>  );
> >>
> >>  PADDED_MEMBERS(CACHE_LINE_SIZE,
> >> @@ -889,6 +892,7 @@ common_construct(struct netdev *netdev,
> >> dpdk_port_t port_no,
> >>  dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> >>  ovsrcu_index_init(&dev->vid, -1);
> >>  dev->vhost_reconfigured = false;
> >> +dev->vhost_dq_zc = false;
> >>  dev->attached = false;
> >>
> >>  ovsrcu_init(&dev->qos_conf, NULL);
> >> @@ -1384,6 +1388,7 @@ netdev_dpdk_vhost_client_set_config(struct
> >> netdev *netdev,
> >>  path = smap_get(args, "vhost-server-path");
> >>  if (path && strcmp(path, dev->vhost_id)) {
> >>  strcpy(dev->vhost_id, path);
> >> +dev->vhost_dq_zc = smap_get_bool(args, "dq-zero-copy", false);
> >>  netdev_request_reconfigure(netdev);
> >>  }
> >>  }
> >> @@ -3288,6 +3293,11 @@ netdev_dpdk_vhost_client_reconfigure(struct
> >> netdev *netdev)
> >>  if (dpdk_vhost_iommu_enabled()) {
> >>  vhost_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
> >>  }
> >> +
> >> +/* Enable Zero Copy mode if configured. */
> >> +if (dev->vhost_dq_zc) {
> >> +vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> >> +}
> >>  err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
> >>  if (err) {
> >>  VLOG_ERR("vhost-user device setup failure for device %s\n",
> >>
> >> ---
> >>
> >> Thoughts?
> >
> > The above is a lot shorter of course which is nice, but I don't think we
> should sacrifice the ability to enable the feature post-boot for this.
> 
> But you wrote in documentation below:
> 
> +The feature cannot be enabled when the device is active (ie. VM booted). If
> +you wish to enable the feature after the VM has booted, you must
> shutdown
> +the VM and bring it back up.
> 
> What you mean saying "sacrifice the ability to enable the feature post-boot"?
> Am I missed something?

My understanding is that your 

Re: [ovs-dev] [ovs-dev, v5, 2/2] netdev-dpdk: Enable optional dequeue zero copy for vHost User

2017-12-18 Thread Loftus, Ciara
> 
> Not a full review.

Thanks for your feedback.

> 
> General thoughts:
> 
> If following conditions are true:
> 
> 1. We don't need to add new feature to deprecated vhostuser port.

Agree.

> 
> 2. We actually don't need to have ability to change ZC config if vhost-server-
> path
>already configured.
> 
>Let me explain this condition:
>To change 'dq-zero-copy' if VM already started we need to stop VM,
> change the
>'dq-zero-copy' and start the VM back. It's not much simpler than stop VM,
> re-add
>port with different value for 'dq-zero-copy' and start VM back. I'm 
> assuming
> here
>that VM restart is much more complex and unlikely operation than port
> removing
>from OVS and adding back. This will require documenting, but we already
> have a
>bunch of docs about how to modify this config option in current version.

Don't fully agree. I can't say whether your assumption is correct.
Port-deletion & re-addition has repercussions eg. loss of statistics from an 
interface. Retaining those stats might be important for some.
Why not leave it up to the user to choose their preferred method of enabling 
feature ie. reboot the VM or delete & add the port.

> 
> we may drop patch #1 from the series and use just something like following
> code instead of code changes in this patch (not tested, just for a reference):
> 
> ---
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 2325f0e..c4cbf7c 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -379,6 +379,9 @@ struct netdev_dpdk {
>  /* True if vHost device is 'up' and has been reconfigured at least 
> once */
>  bool vhost_reconfigured;
>  /* 3 pad bytes here. */
> +
> +/* True if dq-zero-copy feature requested for vhost device. */
> +bool vhost_dq_zc;
>  );
> 
>  PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -889,6 +892,7 @@ common_construct(struct netdev *netdev,
> dpdk_port_t port_no,
>  dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>  ovsrcu_index_init(&dev->vid, -1);
>  dev->vhost_reconfigured = false;
> +dev->vhost_dq_zc = false;
>  dev->attached = false;
> 
>  ovsrcu_init(&dev->qos_conf, NULL);
> @@ -1384,6 +1388,7 @@ netdev_dpdk_vhost_client_set_config(struct
> netdev *netdev,
>  path = smap_get(args, "vhost-server-path");
>  if (path && strcmp(path, dev->vhost_id)) {
>  strcpy(dev->vhost_id, path);
> +dev->vhost_dq_zc = smap_get_bool(args, "dq-zero-copy", false);
>  netdev_request_reconfigure(netdev);
>  }
>  }
> @@ -3288,6 +3293,11 @@ netdev_dpdk_vhost_client_reconfigure(struct
> netdev *netdev)
>  if (dpdk_vhost_iommu_enabled()) {
>  vhost_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
>  }
> +
> +/* Enable Zero Copy mode if configured. */
> +if (dev->vhost_dq_zc) {
> +vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> +}
>  err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
>  if (err) {
>  VLOG_ERR("vhost-user device setup failure for device %s\n",
> 
> ---
> 
> Thoughts?

The above is a lot shorter of course which is nice, but I don't think we should 
sacrifice the ability to enable the feature post-boot for this.

Thanks,
Ciara

> 
> 
> Best regards, Ilya Maximets.
> 
> P.S. Few comments about patch itself are inline.
> 
> On 08.12.2017 18:26, Ciara Loftus wrote:
> > Enabled per port like so:
> > ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
> >
> > The feature is disabled by default and can only be enabled/disabled when
> > a vHost port is down.
> >
> > When packets from a vHost device with zero copy enabled are destined for
> > a 'dpdk' port, the number of tx descriptors on that 'dpdk' port must be
> > set to a smaller value. 128 is recommended. This can be achieved like
> > so:
> >
> > ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> >
> > Signed-off-by: Ciara Loftus 
> > ---
> > v5:
> > * Rebase
> > * Update docs with warning of potential packet loss
> > * Update docs with info on NIC & Virtio descriptor values and working
> > combinations
> >
> >  Documentation/howto/dpdk.rst | 33 
> >  Documentation/topics/dpdk/vhost-user.rst | 61
> ++
> >  NEWS |  2 +
> >  lib/netdev-dpdk.c| 89
> +++-
> >  vswitchd/vswitch.xml | 11 
> >  5 files changed, 195 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/howto/dpdk.rst
> b/Documentation/howto/dpdk.rst
> > index d123819..3e1b8f8 100644
> > --- a/Documentation/howto/dpdk.rst
> > +++ b/Documentation/howto/dpdk.rst
> > @@ -709,3 +709,36 @@ devices to bridge ``br0``. Once complete, follow
> the below steps:
> > Check traffic on multiple queues::
> >
> > $ cat /proc/interrupts | grep 

Re: [ovs-dev] [PATCH v4 0/2] vHost Dequeue Zero Copy

2017-12-08 Thread Loftus, Ciara
> 
> > > Can you comment on that? Can a user also reduce the problem by
> > > configuring
> > > a) a larger virtio Tx queue size (up to 1K) in Qemu, or
> >
> > Is this possible right now without modifying QEMU src? I think the size is
> hardcoded to 256 at the moment although it may become
> > configurable in the future. If/when it does, we can test and update the
> docs if it does solve the problem. I don’t think we should suggest
> > modifying the QEMU src as a workaround now.
> 
> The possibility to configure the tx queue size has been upstreamed in Qemu
> 2.10:
> 
> commit 9b02e1618cf26aa52cf786f215d757506dda14f8
> Author: Wei Wang 
> Date:   Wed Jun 28 10:37:59 2017 +0800
> 
> virtio-net: enable configurable tx queue size
> 
> This patch enables the virtio-net tx queue size to be configurable
> between 256 (the default queue size) and 1024 by the user when the
> vhost-user backend is used
> 
> So you should be able to test larger tx queue sizes with Qemu 2.10.

That's good news, thanks for sharing the details.
I tested with tx_queue_size=1024 and it didn't resolve the issue completely, 
but allowed for a greater number of txq descriptors for the NIC:
For default QEMU VQ size = 256, max n_txq_desc value = 256
For QEMU VQ size = 1024, max n_txq_desc value = 512

> 
> >
> > > b) a larger mempool for packets in Tx direction inside the guest (driver?)
> >
> > Using the DPDK driver in the guest & generating traffic via testpmd I
> modified the number of descriptors given to the virtio device from
> > 512 (default) to 2048 & 4096 but it didn't resolve the issue unfortunately.
> 
> I re-read the virtio 1.0 spec and it states that the total number of virtio
> descriptors per virtqueue equals the size of the virtqueue. Descriptors just
> point to guest mbufs. The mempool the guest driver uses for mbufs is
> irrelevant. OVS as virtio device needs to return the virtio descriptors to the
> guest driver. That means the virtio queue size sets the limit on the packets 
> in
> flight in OVS and physical NICs.
> 
> I would like to add a statement in the documentation that explains this
> dependency between Qemu Tx queue size and maximum physical NIC Tx
> queue size when using the vhost zero copy feature on a port.

I will put my findings above in the documentation.

> 
> > > > > And what about increased packet drop risk due to shortened tx
> queues?
> > > >
> > > > I guess this could be an issue. If I had some data to back this up I 
> > > > would
> > > include it in the documentation and mention the risk.
> > > > If the risk is unacceptable to the user they may choose to not enable
> the
> > > feature. It's disabled by default so shouldn't introduce an issue for
> > > > the standard case.
> > >
> > > Yes, but it would be good to understand the potential drawback for a
> better
> > > judgement of the trade-off between better raw throughput and higher
> loss
> > > risk.
> >
> > I ran RFC2544 0% packet loss tests for ZC on & off (64B PVP) and observed
> the following:
> >
> > Max rate (Mpps) with 0% loss
> > ZC Off 2599518
> > ZC On  1678758
> >
> > As you suspected, there is a trade-off. I can mention this in the docs.
> 
> That degradation looks severe.
> It would be cool if you could re-run the test with a 1K queue size configured
> in Qemu 2.10 and NIC

I ran a couple of configurations, again 64B RFC2544 PVP:

NIC-TXD   Virtio-TXD   ZC    Mpps
2048      256          off   2.105   # default case
128       256          off   2.162   # checking effect of modifying NIC TXD (positive)
2048      1024         off   2.455   # checking effect of modifying Virtio TXD (positive)
128       256          on    1.587   # default zero copy case
512       1024         on    0.321   # checking effect of modifying NIC & Virtio TXD (negative)

For the default non-zero copy case, it seems increasing the virtio queue size 
in the guest has a positive effect wrt packet loss, but has the opposite effect 
for the zero copy case.
It looks like the zero copy feature may increase the likelihood of packet loss,
which I guess is the trade-off for the increased pps you get with the feature.

Thanks,
Ciara


> 
> Regards,
> Jan
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 0/2] vHost Dequeue Zero Copy

2017-11-28 Thread Loftus, Ciara
> 
> Hi Ciara,
> 
> > Thanks for your feedback. The limitation is only placed on phy port queues
> on the VP (vhost -> phy) path. VV path and PV path are not
> > affected.
> 
> Yes, you are right. VM to VM traffic is copied on transmit to the second VM.
> 
> > > I would much rather put a requirement on tenants that their virtio drivers
> > > need to allocate enough virtio packet buffers if they want their VM to use
> > > zero-copy vhostuser ports. Or is the critical resource  owned and
> managed by
> > > Qemu and we'd need a patch on Qemu to overcome this limit?
> 
> Can you comment on that? Can a user also reduce the problem by
> configuring
> a) a larger virtio Tx queue size (up to 1K) in Qemu, or

Is this possible right now without modifying QEMU src? I think the size is 
hardcoded to 256 at the moment although it may become configurable in the 
future. If/when it does, we can test and update the docs if it does solve the 
problem. I don’t think we should suggest modifying the QEMU src as a workaround 
now.

> b) a larger mempool for packets in Tx direction inside the guest (driver?)

Using the DPDK driver in the guest & generating traffic via testpmd I modified 
the number of descriptors given to the virtio device from 512 (default) to 2048 
& 4096 but it didn't resolve the issue unfortunately.

> 
> > >
> > > And what about increased packet drop risk due to shortened tx queues?
> >
> > I guess this could be an issue. If I had some data to back this up I would
> include it in the documentation and mention the risk.
> > If the risk is unacceptable to the user they may choose to not enable the
> feature. It's disabled by default so shouldn't introduce an issue for
> > the standard case.
> 
> Yes, but it would be good to understand the potential drawback for a better
> judgement of the trade-off between better raw throughput and higher loss
> risk.

I ran RFC2544 0% packet loss tests for ZC on & off (64B PVP) and observed the 
following:

Max rate (Mpps) with 0% loss
ZC Off 2599518
ZC On  1678758

As you suspected, there is a trade-off. I can mention this in the docs.

Thanks,
Ciara

> 
> Regards, Jan
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] netdev-dpdk: Remove uneeded call to rte_eth_dev_count().

2017-11-27 Thread Loftus, Ciara
> 
> The call to rte_eth_dev_count() was added as workaround
> for rte_eth_dev_get_port_by_name() not handling cases
> when there was no DPDK ports. In recent versions of DPDK,
> rte_eth_dev_get_port_by_name() does handle this
> case, so the rte_eth_dev_count() call can be removed.
> 
> CC: Ciara Loftus 
> Signed-off-by: Kevin Traynor 
> ---
>  lib/netdev-dpdk.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index faff842..0436ff0 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1192,6 +1192,5 @@ netdev_dpdk_process_devargs(struct
> netdev_dpdk *dev,
>  dpdk_port_t new_port_id = DPDK_ETH_PORT_ID_INVALID;
> 
> -if (!rte_eth_dev_count()
> -|| rte_eth_dev_get_port_by_name(name, &new_port_id)
> +if (rte_eth_dev_get_port_by_name(name, &new_port_id)
>  || !rte_eth_dev_is_valid_port(new_port_id)) {
>  /* Device not found in DPDK, attempt to attach it */
> @@ -2526,6 +2525,5 @@ netdev_dpdk_detach(struct unixctl_conn *conn,
> int argc OVS_UNUSED,
>  ovs_mutex_lock(&dpdk_mutex);
> 
> -if (!rte_eth_dev_count() || rte_eth_dev_get_port_by_name(argv[1],
> - &port_id)) {
> +if (rte_eth_dev_get_port_by_name(argv[1], &port_id)) {

LGTM. DPDK commit f9ae888b1e19 removes the need for the call to count().

Acked-by: Ciara Loftus 

Thanks,
Ciara

>  response = xasprintf("Device '%s' not found in DPDK", argv[1]);
>  goto error;
> --
> 1.8.3.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [RFC PATCH V2 2/2] netdev-dpdk: add support for vhost IOMMU feature

2017-11-08 Thread Loftus, Ciara
> 
> DPDK v17.11 introduces support for the vHost IOMMU feature.
> This is a security feature, that restricts the vhost memory
> that a virtio device may access.
> 
> This feature also enables the vhost REPLY_ACK protocol, the
> implementation of which is known to work in newer versions of
> QEMU (i.e. v2.10.0), but is buggy in older versions (v2.7.0 -
> v2.9.0, inclusive). As such, the feature is disabled by default
> in (and should remain so, for the aforementioned older QEMU
> versions). Starting with QEMU v2.9.1, vhost-iommu-support can
> safely be enabled, even without having an IOMMU device, with
> no performance penalty.
> 
> This patch adds a new vhost port option, vhost-iommu-support,
> to allow enablement of the vhost IOMMU feature:
> 
> $ ovs-vsctl add-port br0 vhost-client-1 \
> -- set Interface vhost-client-1 type=dpdkvhostuserclient \
>  options:vhost-server-path=$VHOST_USER_SOCKET_PATH   \
>  options:vhost-iommu-support=true
> 
> Note that support for this feature is only implemented for vhost
> user client ports (since vhost user ports are considered deprecated).
> 
> Signed-off-by: Mark Kavanagh 
> ---
>  Documentation/topics/dpdk/vhost-user.rst | 21 +
>  NEWS |  1 +
>  lib/netdev-dpdk.c| 29 ++---
>  vswitchd/vswitch.xml | 10 ++
>  4 files changed, 58 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/vhost-user.rst
> b/Documentation/topics/dpdk/vhost-user.rst
> index fdf6ae1..6d65971 100644
> --- a/Documentation/topics/dpdk/vhost-user.rst
> +++ b/Documentation/topics/dpdk/vhost-user.rst
> @@ -250,6 +250,27 @@ Once the vhost-user-client ports have been added
> to the switch, they must be
>  added to the guest. Like vhost-user ports, there are two ways to do this:
> using
>  QEMU directly, or using libvirt. Only the QEMU case is covered here.
> 
> +vhost-user client IOMMU
> +~~~
> +It is possible to enable IOMMU support for vHost User client ports. This is
> +a feature which restricts the vhost memory that a virtio device can access,
> and
> +as such is useful in deployments in which security is a concern. IOMMU
> mode may
> +be enabled on the command line::
> +
> +$ ovs-vsctl add-port br0 vhost-client-1 \
> +-- set Interface vhost-client-1 type=dpdkvhostuserclient \
> + options:vhost-server-path=$VHOST_USER_SOCKET_PATH   \
> + options:vhost-iommu-support=true
> +
> +.. important::
> +
> +Enabling the IOMMU feature also enables the vhost user reply-ack
> protocol;
> +this is known to work on QEMU v2.10.0, but is buggy on older versions
> +(2.7.0 - 2.9.0, inclusive). Consequently, the IOMMU feature is disabled by
> +default (and should remain so if using the aforementioned versions of
> QEMU).
> +Starting with QEMU v2.9.1, vhost-iommu-support can safely be enabled,
> even
> +without having an IOMMU device, with no performance penalty.
> +
>  Adding vhost-user-client ports to the guest (QEMU)
>  ~~
> 
> diff --git a/NEWS b/NEWS
> index 6acd8bd..de53d2f 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -13,6 +13,7 @@ Post-v2.8.0
>   * Add support for compiling OVS with the latest Linux 4.13 kernel
> - DPDK:
>   * Add support for DPDK v17.11
> + * Add support for vHost IOMMU feature
> 
>  v2.8.0 - 31 Aug 2017
>  
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index ed5bf62..2e9633a 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1424,15 +1424,29 @@ netdev_dpdk_vhost_client_set_config(struct
> netdev *netdev,
>  {
>  struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>  const char *path;
> +bool iommu_enable;
> +bool request_reconfigure = false;
> +uint64_t vhost_driver_flags_prev = dev->vhost_driver_flags;
> 
>  ovs_mutex_lock(&dev->mutex);
>  if (!(dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT)) {
>  path = smap_get(args, "vhost-server-path");
>  if (path && strcmp(path, dev->vhost_id)) {
>  strcpy(dev->vhost_id, path);
> -netdev_request_reconfigure(netdev);
> +request_reconfigure = true;
>  }
>  }
> +
> +iommu_enable = smap_get_bool(args, "vhost-iommu-support", false);
> +if (iommu_enable)
> +dev->vhost_driver_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
> +else
> +dev->vhost_driver_flags &= ~RTE_VHOST_USER_IOMMU_SUPPORT;
> +if (vhost_driver_flags_prev != dev->vhost_driver_flags)
> +request_reconfigure = true;
> +
> +if (request_reconfigure)
> +netdev_request_reconfigure(netdev);
>  ovs_mutex_unlock(&dev->mutex);

Thanks for making the change - LGTM

Acked-by: Ciara Loftus 

> 
>  return 0;
> @@ -3326,9 +3340,18 @@ 

Re: [ovs-dev] [RFC PATCH 2/2] netdev-dpdk: add support for vhost IOMMU feature

2017-11-07 Thread Loftus, Ciara
> 
> DPDK v17.11 introduces support for the vHost IOMMU feature.
> This is a security feature, that restricts the vhost memory
> that a virtio device may access.
> 
> This feature also enables the vhost REPLY_ACK protocol, the
> implementation of which is known to work in newer versions of
> QEMU (i.e. v2.10.0), but is buggy in older versions (v2.7.0 -
> v2.9.0, inclusive). As such, the feature is disabled by default
> in (and should remain so, for the aforementioned older QEMU
> versions).
> 
> This patch adds a new vhost port option, vhost-iommu-support,
> to allow enablement of the vhost IOMMU feature:
> 
> $ ovs-vsctl add-port br0 vhost-client-1 \
> -- set Interface vhost-client-1 type=dpdkvhostuserclient \
>  options:vhost-server-path=$VHOST_USER_SOCKET_PATH   \
>  options:vhost-iommu-support=true
> 
> Note that support for this feature is only implemented for vhost
> user client ports (since vhost user ports are considered deprecated).
> 
> Signed-off-by: Mark Kavanagh 
> ---
>  Documentation/topics/dpdk/vhost-user.rst | 19 +++
>  NEWS |  1 +
>  lib/netdev-dpdk.c| 26 +-
>  vswitchd/vswitch.xml | 10 ++
>  4 files changed, 51 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/vhost-user.rst
> b/Documentation/topics/dpdk/vhost-user.rst
> index fdf6ae1..ba70e45 100644
> --- a/Documentation/topics/dpdk/vhost-user.rst
> +++ b/Documentation/topics/dpdk/vhost-user.rst
> @@ -250,6 +250,25 @@ Once the vhost-user-client ports have been added
> to the switch, they must be
>  added to the guest. Like vhost-user ports, there are two ways to do this:
> using
>  QEMU directly, or using libvirt. Only the QEMU case is covered here.
> 
> +vhost-user client IOMMU
> +~~~
> +It is possible to enable IOMMU support for vHost User client ports. This is
> +a feature which restricts the vhost memory that a virtio device can access,
> and
> +as such is useful in deployments in which security is a concern. IOMMU
> mode may
> +be enabled on the command line::
> +
> +$ ovs-vsctl add-port br0 vhost-client-1 \
> +-- set Interface vhost-client-1 type=dpdkvhostuserclient \
> + options:vhost-server-path=$VHOST_USER_SOCKET_PATH   \
> + options:vhost-iommu-support=true
> +
> +.. important::
> +
> +Enabling the IOMMU feature also enables the vhost user reply-ack
> protocol;
> +this is known to work on QEMU v2.10.0, but is buggy on older versions
> +(2.7.0 - 2.9.0, inclusive). Consequently, the IOMMU feature is disabled by
> +default (and should remain so if using the aforementioned versions of
> QEMU).
> +
>  Adding vhost-user-client ports to the guest (QEMU)
>  ~~
> 
> diff --git a/NEWS b/NEWS
> index 6acd8bd..de53d2f 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -13,6 +13,7 @@ Post-v2.8.0
>   * Add support for compiling OVS with the latest Linux 4.13 kernel
> - DPDK:
>   * Add support for DPDK v17.11
> + * Add support for vHost IOMMU feature
> 
>  v2.8.0 - 31 Aug 2017
>  
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index ed5bf62..d214fac 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1424,15 +1424,22 @@ netdev_dpdk_vhost_client_set_config(struct
> netdev *netdev,
>  {
>  struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>  const char *path;
> +bool iommu_enable;
> 
>  ovs_mutex_lock(&dev->mutex);
>  if (!(dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT)) {
>  path = smap_get(args, "vhost-server-path");
> -if (path && strcmp(path, dev->vhost_id)) {
> +if (path && strcmp(path, dev->vhost_id))
>  strcpy(dev->vhost_id, path);
> -netdev_request_reconfigure(netdev);
> -}
>  }
> +
> +iommu_enable = smap_get_bool(args, "vhost-iommu-support", false);
> +if (iommu_enable)
> +dev->vhost_driver_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
> +else
> +dev->vhost_driver_flags &= ~RTE_VHOST_USER_IOMMU_SUPPORT;
> +
> +netdev_request_reconfigure(netdev);

Here we will always request a reconfigure. Best to avoid unnecessary 
reconfigures.
Suggest only reconfiguring if the iommu_enable has changed.
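
Something along these lines perhaps (untested sketch, just to illustrate the
idea, with a made-up helper name - only request a reconfigure when the
requested setting actually changes the driver flags):

/* Untested sketch: remember the previous flags and only request a
 * reconfigure when the IOMMU bit actually changed. */
static void
vhost_iommu_set_config_sketch(struct netdev_dpdk *dev, const struct smap *args,
                              struct netdev *netdev)
{
    uint64_t prev_flags = dev->vhost_driver_flags;

    if (smap_get_bool(args, "vhost-iommu-support", false)) {
        dev->vhost_driver_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
    } else {
        dev->vhost_driver_flags &= ~RTE_VHOST_USER_IOMMU_SUPPORT;
    }
    if (dev->vhost_driver_flags != prev_flags) {
        netdev_request_reconfigure(netdev);
    }
}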

The interface for setting the feature looks good to me & I think it's right to 
disable by default given the restrictions the feature places on QEMU versions.

Thanks,
Ciara

>  ovs_mutex_unlock(&dev->mutex);
> 
>  return 0;
> @@ -3326,9 +,18 @@ netdev_dpdk_vhost_client_reconfigure(struct
> netdev *netdev)
>   */
>  if (!(dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT)
>  && strlen(dev->vhost_id)) {
> +/* Enable vhost IOMMU, if it was requested.
> + * XXX: the 'flags' variable is required, as not all vhost backend
> + * 

Re: [ovs-dev] [PATCH RFC v2] netdev-dpdk: Allow specification of index for PCI devices

2017-10-19 Thread Loftus, Ciara
> 
> On 10/17/2017 11:48 AM, Ciara Loftus wrote:
> > Some NICs have only one PCI address associated with multiple ports. This
> > patch extends the dpdk-devargs option's format to cater for such
> > devices. Whereas before only one of N ports associated with one PCI
> > address could be added, now N can.
> >
> > This patch allows for the following use of the dpdk-devargs option:
> >
> > ovs-vsctl set Interface myport options:dpdk-devargs=:06:00.0,X
> >
> > Where X is an unsigned integer representing one of multiple ports
> > associated with the PCI address provided.
> >
> > This patch has not been tested.
> >
> > Signed-off-by: Ciara Loftus 
> > ---
> > v2:
> > * Simplify function to find port ID of indexed device
> > * Hotplug compatibility
> > * Detach compatibility
> > * Documentation
> > * Use strtol instead of atoi
> >
> >  Documentation/howto/dpdk.rst |  9 +
> >  Documentation/intro/install/dpdk.rst |  9 +
> >  NEWS |  2 ++
> >  lib/netdev-dpdk.c| 67 ++---
> ---
> >  vswitchd/vswitch.xml | 11 --
> >  5 files changed, 85 insertions(+), 13 deletions(-)
> >
> > diff --git a/Documentation/howto/dpdk.rst
> b/Documentation/howto/dpdk.rst
> > index d123819..9447b71 100644
> > --- a/Documentation/howto/dpdk.rst
> > +++ b/Documentation/howto/dpdk.rst
> > @@ -48,6 +48,15 @@ number of dpdk devices found in the log file::
> >  $ ovs-vsctl add-port br0 dpdk-p1 -- set Interface dpdk-p1 type=dpdk \
> >  options:dpdk-devargs=0000:01:00.1
> >
> > +If your PCI device has multiple ports under the same PCI ID, you can use
> the
> > +following notation to indicate the specific device you wish to add::
> > +
> > +$ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> > +options:dpdk-devargs=0000:01:00.0,0
> > +
> > +The above would attempt to use the first device (0) associated with that
> > +PCI ID. Use ,1 ,2 etc. to access the next ones.
> > +
> >  After the DPDK ports get added to switch, a polling thread continuously
> polls
> >  DPDK devices and consumes 100% of the core, as can be checked from
> ``top`` and
> >  ``ps`` commands::
> > diff --git a/Documentation/intro/install/dpdk.rst
> b/Documentation/intro/install/dpdk.rst
> > index bb69ae5..d0e87f5 100644
> > --- a/Documentation/intro/install/dpdk.rst
> > +++ b/Documentation/intro/install/dpdk.rst
> > @@ -271,6 +271,15 @@ ports. For example, to create a userspace bridge
> named ``br0`` and add two
> >  DPDK devices will not be available for use until a valid dpdk-devargs is
> >  specified.
> >
> > +If your PCI device has multiple ports under the same PCI ID, you can use
> the
> > +following notation to indicate the specific device you wish to add::
> > +
> > +$ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> > +options:dpdk-devargs=0000:01:00.0,0
> > +
> > +The above would attempt to use the first device (0) associated with that
> > +PCI ID. Use ,1 ,2 etc. to access the next ones.
> > +
> >  Refer to ovs-vsctl(8) and :doc:`/howto/dpdk` for more details.
> >
> >  Performance Tuning
> > diff --git a/NEWS b/NEWS
> > index 75f5fa5..ca885a6 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -5,6 +5,8 @@ Post-v2.8.0
> > chassis "hostname" in addition to a chassis "name".
> > - Linux kernel 4.13
> >   * Add support for compiling OVS with the latest Linux 4.13 kernel
> > +   - DPDK:
> > + * Support for adding devices with multiple DPDK ports under one PCI
> ID.
> >
> >  v2.8.0 - 31 Aug 2017
> > - ovs-ofctl:
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index c60f46f..35e15da 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -1214,16 +1214,40 @@
> netdev_dpdk_lookup_by_port_id(dpdk_port_t port_id)
> >  }
> >
> >  static dpdk_port_t
> > +dpdk_get_port_by_name_with_index(char *name, int idx, int base_id)
> > +{
> > +struct rte_eth_dev_info dev_info;
> > +char pci_addr[PCI_PRI_STR_SIZE];
> > +dpdk_port_t offset_port_id = base_id + idx;
> > +
> > +if (rte_eth_dev_is_valid_port(offset_port_id)) {
> > +rte_eth_dev_info_get(offset_port_id, &dev_info);
> > +rte_pci_device_name(&dev_info.pci_dev->addr, pci_addr,
> > +sizeof(pci_addr));
> > +if (!strcmp(pci_addr, name)) {
> > +return offset_port_id;
> > +}
> > +}
> > +
> > +VLOG_INFO("Invalid PCI offset %i for device %s", idx, name);
> > +
> > +return DPDK_ETH_PORT_ID_INVALID;
> > +}
> > +
> > +static dpdk_port_t
> >  netdev_dpdk_process_devargs(struct netdev_dpdk *dev,
> >  const char *devargs, char **errp)
> >  {
> > -/* Get the name up to the first comma. */
> > -char *name = xmemdup0(devargs, strcspn(devargs, ","));
> > +char *devargs_copy = xmemdup0((devargs), strlen(devargs));
> > +char *name, *idx_str;
> > +unsigned idx;
> >  

Re: [ovs-dev] [PATCH RFC] netdev-dpdk: Allow specification of index for PCI devices

2017-10-16 Thread Loftus, Ciara
> 
> Hi Ciara, thanks for working on this patch. A few comments inline.

Thanks for your review Ian.

> 
> > Some NICs have only one PCI address associated with multiple ports. This
> > patch extends the dpdk-devargs option's format to cater for such devices.
> > Whereas before only one of N ports associated with one PCI address could
> > be added, now N can.
> >
> > This patch allows for the following use of the dpdk-devargs option:
> >
> > ovs-vsctl set Interface myport options:dpdk-devargs=0000:06:00.0,X
> >
> > Where X is an unsigned integer representing one of multiple ports
> > associated with the PCI address provided.
> >
> > This patch has not been tested.
> 
> Have you considered the hotplug usecase?
> 
> I think attaching in this case will be ok but OVS provides a mechanism to
> detach via appctl as well using the pci address (see command below.
> 
> $ ovs-appctl netdev-dpdk/detach 0000:01:00.0
> 
> I think this will also have to be modified to take the index of the port into
> account, otherwise we hit the same situation again.
> 
> It's probably something that will need to be tested with this patch going
> forward.

It should be easy to reuse the process_devargs_index function for this case. 
I'll include this in a v2.
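
A rough sketch of what that could look like (untested, assuming the detach
handler receives the user-supplied string in argv[1], and reusing the helper and
constants from this RFC):

    /* Split "PCI[,index]" before looking up the port to detach. */
    char *arg = xmemdup0(argv[1], strlen(argv[1]));
    char *saveptr = NULL;
    char *name = strtok_r(arg, ",", &saveptr);
    char *idx_str = strtok_r(NULL, ",", &saveptr);
    dpdk_port_t port_id = DPDK_ETH_PORT_ID_INVALID;

    if (idx_str) {
        port_id = process_devargs_index(name, atoi(idx_str));
    } else {
        rte_eth_dev_get_port_by_name(name, &port_id);
    }
    /* ... then validate port_id and detach it as today, and free(arg). */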

Thanks,
Ciara

> 
> Ian
> >
> > Signed-off-by: Ciara Loftus 
> > ---
> >  lib/netdev-dpdk.c | 79
> --
> > -
> >  1 file changed, 63 insertions(+), 16 deletions(-)
> >
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index c60f46f..c3562a3
> > 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -1214,30 +1214,77 @@
> netdev_dpdk_lookup_by_port_id(dpdk_port_t port_id)
> > }
> >
> >  static dpdk_port_t
> > +process_devargs_index(char *name, int idx) {
> > +int i = 0;
> > +struct rte_eth_dev_info dev_info;
> > +char pci_addr[PCI_PRI_STR_SIZE];
> > +dpdk_port_t offset_port_id = DPDK_ETH_PORT_ID_INVALID;
> > +
> > +for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
> > +if (!rte_eth_dev_is_valid_port(i)) {
> > +continue;
> > +}
> > +rte_eth_dev_info_get(i, &dev_info);
> > +rte_pci_device_name(&dev_info.pci_dev->addr, pci_addr,
> > +sizeof(pci_addr));
> > +if (!strcmp(pci_addr, name)) {
> > +offset_port_id = i + idx;
> > +if (rte_eth_dev_is_valid_port(offset_port_id)) {
> > +rte_eth_dev_info_get(offset_port_id, &dev_info);
> > +rte_pci_device_name(&dev_info.pci_dev->addr, pci_addr,
> > +sizeof(pci_addr));
> > +if (!strcmp(pci_addr, name)) {
> > +return offset_port_id;
> > +} else {
> > +break;
> > +}
> > +} else {
> > +break;
> > +}
> > +}
> > +}
> > +
> > +VLOG_INFO("Invalid PCI offset %i for device %s", idx, name);
> > +return DPDK_ETH_PORT_ID_INVALID;
> > +}
> > +
> > +static dpdk_port_t
> >  netdev_dpdk_process_devargs(struct netdev_dpdk *dev,
> >  const char *devargs, char **errp)  {
> > -/* Get the name up to the first comma. */
> > -char *name = xmemdup0(devargs, strcspn(devargs, ","));
> > +char *devargs_copy = xmemdup0((devargs), strlen(devargs));
> > +char *name, *idx_str;
> > +unsigned idx;
> >  dpdk_port_t new_port_id = DPDK_ETH_PORT_ID_INVALID;
> >
> > -if (!rte_eth_dev_count()
> > +|| rte_eth_dev_get_port_by_name(name, &new_port_id)
> > -|| !rte_eth_dev_is_valid_port(new_port_id)) {
> > -/* Device not found in DPDK, attempt to attach it */
> > +if (!rte_eth_dev_attach(devargs, &new_port_id)) {
> > -/* Attach successful */
> > -dev->attached = true;
> > -VLOG_INFO("Device '%s' attached to DPDK", devargs);
> > -} else {
> > -/* Attach unsuccessful */
> > -new_port_id = DPDK_ETH_PORT_ID_INVALID;
> > -VLOG_WARN_BUF(errp, "Error attaching device '%s' to DPDK",
> > -  devargs);
> > +name = strtok_r(devargs_copy, ",", &devargs_copy);
> > +idx_str = strtok_r(devargs_copy, ",", &devargs_copy);
> > +
> > +if (idx_str) {
> > +idx = atoi(idx_str);
> > +new_port_id = process_devargs_index(name, idx);
> > +} else {
> > +if (!rte_eth_dev_count()
> > +|| rte_eth_dev_get_port_by_name(name, _port_id)
> > +|| !rte_eth_dev_is_valid_port(new_port_id)) {
> > +/* Device not found in DPDK, attempt to attach it */
> > +if (!rte_eth_dev_attach(devargs, _port_id)) {
> > +/* Attach successful */
> > +dev->attached = true;
> > +VLOG_INFO("Device '%s' attached to DPDK", devargs);
> > +} else {
> > +/* Attach unsuccessful */
> > +

Re: [ovs-dev] [PATCH v2 2/2] netdev-dpdk: Enable optional dequeue zero copy for vHost User

2017-10-16 Thread Loftus, Ciara
> 
> Thanks for the v2 Ciara. Comments inline.

Thanks for your review Ian. Hope to send a v3 soon. Responses inline.

Thanks,
Ciara

> 
> 
> > Enabled per port like so:
> > ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
> >
> > The feature is disabled by default and can only be enabled/disabled when a
> > vHost port is down.
> >
> > When packets from a vHost device with zero copy enabled are destined for
> a
> > 'dpdk' port, the number of tx descriptors on that 'dpdk' port must be set
> > to a smaller value. 128 is recommended. This can be achieved like
> > so:
> >
> > ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> >
> > Signed-off-by: Ciara Loftus 
> > ---
> >  Documentation/howto/dpdk.rst | 29 +++
> >  Documentation/topics/dpdk/vhost-user.rst | 35 +
> >  NEWS |  3 ++
> >  lib/netdev-dpdk.c| 89
> > +++-
> >  vswitchd/vswitch.xml | 11 
> >  5 files changed, 166 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/howto/dpdk.rst
> b/Documentation/howto/dpdk.rst
> > index d123819..d149a8e 100644
> > --- a/Documentation/howto/dpdk.rst
> > +++ b/Documentation/howto/dpdk.rst
> > @@ -709,3 +709,32 @@ devices to bridge ``br0``. Once complete, follow
> the
> > below steps:
> > Check traffic on multiple queues::
> >
> > $ cat /proc/interrupts | grep virtio
> > +
> > +PHY-VM-PHY (vHost Dequeue Zero Copy)
> > +
> 
> Title underline is too short here, will cause an error on compilation.

Ok.

> 
> > +
> > +vHost dequeue zero copy functionality can  be validated using the
> > +PHY-VM-PHY configuration. To begin, follow the steps described in
> > +:ref:`dpdk-phy-phy` to create and initialize the database, start
> > +ovs-vswitchd and add ``dpdk``-type and ``dpdkvhostuser``-type devices
> > +and flows to bridge ``br0``. Once complete, follow the below steps:
> > +
> > +1. Enable dequeue zero copy on the vHost devices.
> > +
> > +   $ ovs-vsctl set Interface dpdkvhostuser0 options:dq-zero-copy=true
> > +   $ ovs-vsctl set Interface dpdkvhostuser1
> > + options:dq-zero-copy=true
> > +
> > +2. Reduce the number of txq descriptors on the phy ports.
> > +
> > +   $ ovs-vsctl set Interface phy0 options:n_txq_desc=128
> > +   $ ovs-vsctl set Interface phy1 options:n_txq_desc=128
> > +
> > +3. Proceed with the test by launching the VM and configuring guest
> > +forwarding, be it via the vHost loopback method or kernel forwarding
> > +method, and sending traffic.
> > +
> > +It is essential that step 1 is performed before booting the VM,
> > +otherwise the feature will not be enabled.
> > +
> > +Check for the log "VHOST_CONFIG: dequeue zero copy is enabled" to
> > +ensure the successful enablement of the feature.
> 
> Minor nit but the info message I saw in the log during testing differed from
> what's above, it was as follows:
> 
> netdev_dpdk|INFO|Zero copy enabled for vHost socket
> /usr/local/var/run/openvswitch/vhost-user0
> 
> Consider changing the message in the docs to reflect this.

There are two logs. The one you mentioned appears when you enable the option in
the OVSDB. The second one, which is in the documentation, is a DPDK log reported
when the feature is actually enabled on the device, and that is the one you need
to look out for. I'll update the documentation to reflect this point, as it
could be confusing.

> 
> > diff --git a/Documentation/topics/dpdk/vhost-user.rst
> > b/Documentation/topics/dpdk/vhost-user.rst
> > index 74ac06e..267ccaf 100644
> > --- a/Documentation/topics/dpdk/vhost-user.rst
> > +++ b/Documentation/topics/dpdk/vhost-user.rst
> > @@ -408,3 +408,38 @@ Sample XML
> >  
> >
> >  .. _QEMU documentation: http://git.qemu-
> > project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-
> > user.txt;h=7890d7169;hb=HEAD
> > +
> > +vhost-user Dequeue Zero Copy
> > +-
> > +
> > +Normally when dequeuing a packet from a vHost User device, a memcpy
> > +operation must be used to copy that packet from guest address space to
> > +host address space. This memcpy can be removed by enabling dequeue
> zero-
> > copy like so:
> > +
> > +$ ovs-vsctl set Interface dpdkvhostuserclient0
> > + options:dq-zero-copy=true
> > +
> > +With this feature enabled, a reference (pointer) to the packet is
> > +passed to the host, instead of a copy of the packet. Removing this
> > +memcpy can give a performance improvement for some use cases, for
> > +example switching large packets between different VMs.
> > +
> > +Note that the feature is disabled by default and must be explicitly
> > +enabled by using the command above.
> > +
> > +The feature cannot be enabled when the device is active (i.e. the VM is
> > +booted). If you wish to enable the feature after the VM has booted, you
> > +must shut down the VM and bring it back up.
> 
> Does it 

Re: [ovs-dev] [PATCH v2 0/2] vHost Dequeue Zero Copy

2017-10-12 Thread Loftus, Ciara
> 
> Hi Ciara,
> 
> These improvements look very good. I would expect even bigger
> improvements for big packets, as long as we don't hit some link bandwidth
> limitations. But at least the vhost-vhost cases should benefit.
> 
> Have you also tested larger packet sizes?

Hi Jan,

Thanks for the feedback. Here are some more datapoints for the VM2VM topology:

256B:   4.69 vs 5.42 Mpps (+~16%)
512B:   4.04 vs 4.90 Mpps (+~21%)
1518B:  2.51 vs 3.05 Mpps (+~22%)

As you guessed, I hit the link bandwidth limit when using a NIC with larger
packet sizes, so I can't show any benefit there.

> 
> I plan to review your patches.

Much appreciated.

Thanks,
Ciara

> 
> Thanks, Jan
> 
> > -Original Message-
> > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Ciara Loftus
> > Sent: Wednesday, 11 October, 2017 16:22
> > To: d...@openvswitch.org
> > Subject: [ovs-dev] [PATCH v2 0/2] vHost Dequeue Zero Copy
> >
> > This patch enables optional dequeue zero copy for vHost ports.
> > This gives a performance increase for some use cases. I'm using
> > the cover letter to report my results.
> >
> > vhost (vm1) -> vhost (vm2)
> > Using testpmd to source (txonly) in vm1 and sink (rxonly) in vm2.
> > 4C1Q 64B packets: 5.05Mpps -> 5.52Mpps = 9.2% improvement
> >
> > vhost (virtio_user backend 1) -> vhost (virtio_user backend 2)
> > Using 2 instances of testpmd, each with a virtio_user backend
> > connected to one of the two vhost ports created in OVS.
> > 2C1Q 1518B packets: 2.59Mpps -> 3.09Mpps = 19.3% improvement
> >
> > vhost -> phy
> > Using testpmd to source (txonly) and sink in the NIC
> > 1C1Q 64B packets: 6.81Mpps -> 7.76Mpps = 13.9% improvement
> >
> > phy -> vhost -> phy
> > No improvement measured
> >
> > This patch is dependent on the series below which fixes issues with
> > mempool management:
> > https://patchwork.ozlabs.org/patch/822590/
> >
> > v2 changes:
> > * Mention feature is disabled by default in the documentation
> > * Add PHY-VM-PHY with vHost dequeue zero copy documentation guide
> > * Line wrap link to DPDK documentation
> > * Rename zc_enabled to dq_zc_enabled for future-proofing
> > * Mention feature is available for both vHost port types in the docs
> > * In practice, rebooting the VM doesn't always enable the feature if
> > enabled post-boot, so update the documentation to suggest a shutdown
> > rather than a reboot. The reason why this doesn't work is probably
> > because the total downtime during reboot isn't enough to allow a vhost
> > device unregister & re-register with the new feature, so when the VM
> > starts again it doesn't pick up the new device as it hasn't been
> > re-registered in time.
> >
> > Ciara Loftus (2):
> >   netdev-dpdk: Helper function for vHost device setup
> >   netdev-dpdk: Enable optional dequeue zero copy for vHost User
> >
> >  Documentation/howto/dpdk.rst |  29 +
> >  Documentation/topics/dpdk/vhost-user.rst |  35 ++
> >  NEWS |   3 +
> >  lib/netdev-dpdk.c| 202 
> > +--
> >  vswitchd/vswitch.xml |  11 ++
> >  5 files changed, 218 insertions(+), 62 deletions(-)
> >
> > --
> > 2.7.5
> >
> > ___
> > dev mailing list
> > d...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 2/6] netdev-dpdk: Fix mempool names to reflect socket id.

2017-10-09 Thread Loftus, Ciara
> 
> Create mempool names by also considering the NUMA socket number,
> so that a name reflects the socket on which the mempool is allocated.
> This change is needed for the NUMA-awareness feature.
> 
> CC: Kevin Traynor 
> CC: Aaron Conole 
> Reported-by: Ciara Loftus 
> Fixes: d555d9bded5f ("netdev-dpdk: Create separate memory pool for each
> port.")
> Signed-off-by: Antonio Fischetti 
> ---
> Mempool names now contain the requested socket id and look like:
> "ovs_4adb057e_1_2030_20512".
> 
> Tested with DPDK 17.05.2 (from dpdk-stable branch).
> NUMA-awareness feature enabled (DPDK/config/common_base).
> 
> Created 1 single dpdkvhostuser port type.
> OvS pmd-cpu-mask=FF3 # enable cores on both numa nodes
> QEMU core mask = 0xFC000 # cores for qemu on numa node 1 only
> 
>  Before launching the VM:
>  
> ovs-appctl dpif-netdev/pmd-rxq-show
> shows core #1 is serving the vhu port.
> 
> pmd thread numa_id 0 core_id 1:
> isolated : false
> port: dpdkvhostuser0queue-id: 0
> 
>  After launching the VM:
>  ---
> the vhu port is now managed by core #27
> pmd thread numa_id 1 core_id 27:
> isolated : false
> port: dpdkvhostuser0queue-id: 0
> 
> and the log shows a new mempool is allocated on NUMA node 1, while
> the previous one is deleted:
> 
> 2017-10-06T14:04:55Z|00105|netdev_dpdk|DBG|Allocated
> "ovs_4adb057e_1_2030_20512" mempool with 20512 mbufs
> 2017-10-06T14:04:55Z|00106|netdev_dpdk|DBG|Releasing
> "ovs_4adb057e_0_2030_20512" mempool
> ---
>  lib/netdev-dpdk.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 80a6ff3..0cf47cb 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -499,8 +499,8 @@ dpdk_mp_name(struct dpdk_mp *dmp)
>  {
>  uint32_t h = hash_string(dmp->if_name, 0);
>  char *mp_name = xcalloc(RTE_MEMPOOL_NAMESIZE, sizeof *mp_name);
> -int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> "ovs_%x_%d_%u",
> -   h, dmp->mtu, dmp->mp_size);
> +int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> "ovs_%x_%d_%d_%u",
> +   h, dmp->socket_id, dmp->mtu, dmp->mp_size);
>  if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>  return NULL;
>  }
> @@ -534,9 +534,10 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu,
> bool *mp_exists)
>  char *mp_name = dpdk_mp_name(dmp);
> 
>  VLOG_DBG("Requesting a mempool of %u mbufs for netdev %s "
> - "with %d Rx and %d Tx queues.",
> + "with %d Rx and %d Tx queues, socket id:%d.",
>   dmp->mp_size, dev->up.name,
> - dev->requested_n_rxq, dev->requested_n_txq);
> + dev->requested_n_rxq, dev->requested_n_txq,
> + dev->requested_socket_id);
> 
>  dmp->mp = rte_pktmbuf_pool_create(mp_name, dmp->mp_size,
>MP_CACHE_SZ,
> --
> 2.4.11

Thanks for this fix Antonio. I've verified that the vHost User NUMA Awareness 
feature works again with this change.

Tested-by: Ciara Loftus 

> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [dpdk-users] adding dpdk ports sharing same pci address to ovs-dpdk bridge

2017-10-06 Thread Loftus, Ciara
> 
> On Thu, Sep 21, 2017 at 1:58 PM, Loftus, Ciara <ciara.lof...@intel.com>
> wrote:
> > 21/09/2017 10:04, Loftus, Ciara:
> > > > 20/09/2017 19:33, Kevin Traynor:
> > > > > On 09/08/2017 10:56 AM, Loftus, Ciara wrote:
> > > > > > It seems the DPDK function rte_eth_dev_get_port_by_name() will
> > > > > > always return the port ID of the first port on your NIC,
> > > > > > when you specify the single PCI address and that's where the
> > > > > > problem is. There doesn't seem to be a way currently to
> > > > > > indicate to the calling application that in fact two
> > > > > > (or more) port IDs are associated with the one PCI address.
> > > >
> > > > We have two ports (with the same PCI address) so we should have
> > > > two different names.
> > > > Where the names passed to rte_eth_dev_get_port_by_name() come
> > from?
> > > > It is the user parameter from options:dpdk-devargs=0002:01:00.0,
> right?
> > >
> > > Yes, we're using the PCI address specified by the user in dpdk-devargs.
> > >
> > > > > > I am cc-ing DPDK users mailing list for hopefully some input.
> > > > > > Are there any plans for the rte_eth_dev_get_port_by_name
> > function
> > > > > > to be compatible with NICs with multiple ports under the same PCI
> > address?
> > > >
> > > > We cannot return two different ports for the same name.
> > > > There are two issues here:
> > > >   - the input should not be the PCI address
> > > >   - the ethdev function should look at ethdev name, not rte_device
> > > > one
> > >
> > > This would require the user having to "guess" the DPDK ethdev name
> > > which is something we'd like to avoid.
> >
> > Yes, but you can provide a way to list the ports with their names
> > and characteristics.
> Ok, I see. Maybe something like this could be considered:
> 
> port A = dpdk-devargs=xx:yy:zz 0
> port B = dpdk-devargs=xx:yy:zz 1
> 
> If we detect a value after the PCI address we iterate through the
> rte_eth_dev_info
> (http://dpdk.org/doc/api/structrte__eth__dev__info.html) for all valid port
> IDs and assign port A to the first ethdev encountered with the provided PCI
> address, and port B to the second, etc.
> 
> If we don't detect a value, then we operate as normal. Thoughts?
> 
> Hi Everyone,
> 
> Anything finalized for sorting out this issue, do you need any more
> information regarding this issue ?

Hi,

I put together a very rough RFC that aims to work around the issue:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-October/339496.html

It hasn't been tested, as I don't have access to the type of card that has one
PCI address for multiple ports.
If anybody does have access to such a device, I welcome you to try the patch,
although I'm not hopeful it will succeed on the first pass.
Looking for feedback on implementation, interface, etc.

Thanks,
Ciara

> 
> Thanks,
> Devendra
> 
> 
> >
> > > We had the same problem using DPDK port IDs and decided not to use
> > > them anymore, and use the PCI instead as it took the guesswork out.
> > > Ethdev names and port IDs can change between tests, unlike the PCI
> > > address which tends to remain constant for a device.
> >
> > We can add a requirement on ethdev names and make sure they remain
> > constant for a given port.
> >
> > > > The idea is that we have only one rte_device object and we instantiate
> > > > two rte_eth_dev ports.
> > > > An ethdev port can be identified with its id (a number) or its unique
> > name.
> > > > Unfortunately, the user cannot guess the port id or the name set by the
> > > > PMD.
> > >
> > > Exactly. Thanks for clarifying what's going on under the hood.
> > >
> > > Ciara
> > >
> > > >
> > > > > Hi Adrien/Nelio,
> > > > >
> > > > > Is this something you can answer? We're wondering how to handle
> this
> > in
> > > > > OVS and whether a temporary or long term solution is needed.
> > > >
> > > > I suggest to rely on ethdev name.
> > > > You will need to show to the user the mapping between the bus
> > information
> > > > (PCI id here) and the device names.
> > > >
> > > > Another alternative is to add a new function returning all ethdev ports
> > > > associated to a given rte_device resource.
> > > > So you would get two ports and you could pick one on the first "add-
> > port",
> > > > and the other one for the second "add-port" command.
> > > > It means the user would be forced to add them in the right order if he
> > > > wants a reproducible result.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3 1/5] netdev-dpdk: fix mempool management with vhu client.

2017-10-06 Thread Loftus, Ciara
> 
> In a PVP test where vhostuser ports are configured as
> clients, OvS crashes when QEMU is launched.
> This patch avoids calling dpdk_mp_put() - and erroneously
> releasing the mempool - when it already exists.

Thanks for investigating this issue and for the patch.
I think the commit message could be made more generic since the freeing of the 
pre-existing mempool could potentially happen for other port types and 
topologies, not just vhostuserclient & PVP.

> 
> CC: Kevin Traynor 
> CC: Aaron Conole 
> Reported-by: Ciara Loftus 
> Fixes: d555d9bded5f ("netdev-dpdk: Create separate memory pool for each
> port.")
> Signed-off-by: Antonio Fischetti 
> ---
> I've tested this patch by
>   - changing at run-time the number of Rx queues:
>   ovs-vsctl set Interface dpdk0 type=dpdk options:n_rxq=4
> 
>   - reducing the MTU of the dpdk ports of 1 byte to force
> the configuration of an existing mempool:
>   ovs-vsctl set Interface dpdk0 mtu_request=1499
> 
> To replicate the bug scenario:
> 
>  PVP test setup
>  --
> CLIENT_SOCK_DIR=/tmp
> SOCK0=dpdkvhostuser0
> SOCK1=dpdkvhostuser1
> 
> 1 PMD
> Add 2 dpdk ports, n_rxq=1
> Add 2 vhu ports both of type dpdkvhostuserclient and specify vhost-server-
> path
>  ovs-vsctl set Interface dpdkvhostuser0 options:vhost-server-
> path="$CLIENT_SOCK_DIR/$SOCK0"
>  ovs-vsctl set Interface dpdkvhostuser1 options:vhost-server-
> path="$CLIENT_SOCK_DIR/$SOCK1"
> 
> Set port-based rules: dpdk0 <--> vhu0 and dpdk1 <--> vhu1
>  add-flow br0 in_port=1,action=output:3
>  add-flow br0 in_port=3,action=output:1
>  add-flow br0 in_port=4,action=output:2
>  add-flow br0 in_port=2,action=output:4

Nit - the steps to reproduce the bug are over-complicated. One only needs 1 
vhostuserclient port (no dpdk ports, no flows), and just boot the VM = crash.

> 
>  Launch QEMU
>  ---
> As OvS vhu ports are acting as clients, we must specify 'server' in the next
> command.
> VM_IMAGE=
> 
>  sudo -E taskset 0x3F00 $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64
> -name us-vhost-vm1 -cpu host -enable-kvm -m 4096M -object memory-
> backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on -
> numa node,memdev=mem -mem-prealloc -smp 4 -drive file=$VM_IMAGE -
> chardev socket,id=char0,path=$CLIENT_SOCK_DIR/$SOCK0,server -netdev
> type=vhost-user,id=mynet1,chardev=char0,vhostforce -device virtio-net-
> pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=off -chardev
> socket,id=char1,path=$CLIENT_SOCK_DIR/$SOCK1,server -netdev
> type=vhost-user,id=mynet2,chardev=char1,vhostforce -device virtio-net-
> pci,mac=00:00:00:00:00:02,netdev=mynet2,mrg_rxbuf=off --nographic
> 
>  Expected behavior
>  -
> With this fix OvS shouldn't crash.
> ---
>  lib/netdev-dpdk.c | 27 ++-
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index c60f46f..80a6ff3 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -508,7 +508,7 @@ dpdk_mp_name(struct dpdk_mp *dmp)
>  }
> 
>  static struct dpdk_mp *
> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool *mp_exists)
>  {
>  struct dpdk_mp *dmp = dpdk_rte_mzalloc(sizeof *dmp);
>  if (!dmp) {
> @@ -530,8 +530,6 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
> NETDEV_MAX_BURST
>  + MIN_NB_MBUF;
> 
> -bool mp_exists = false;
> -
>  do {
>  char *mp_name = dpdk_mp_name(dmp);

Slightly unrelated to this patch, but here is another issue with the d555d9bded5f
("netdev-dpdk: Create separate memory pool for each port.") commit that I came
across when reviewing this fix:
dpdk_mp_name() doesn't reflect which socket the mempool is allocated on, so that
commit also breaks the vHost User NUMA feature. Could you include a fix for
that bug too in this patch / as part of this patchset?

> 
> @@ -559,7 +557,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>  /* As the mempool create returned EEXIST we can expect the
>   * lookup has returned a valid pointer.  If for some reason
>   * that's not the case we keep track of it. */
> -mp_exists = true;
> +*mp_exists = true;
>  } else {
>  VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
>   mp_name, dmp->mp_size);
> @@ -573,7 +571,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>  rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
>  return dmp;
>  }
> -} while (!mp_exists &&
> +} while (!(*mp_exists) &&
>  (rte_errno == ENOMEM && (dmp->mp_size /= 2) >=
> MIN_NB_MBUF));
> 
>  rte_free(dmp);
> @@ -581,12 +579,12 @@ dpdk_mp_create(struct netdev_dpdk *dev, int
> mtu)
>  }
> 
>  static struct dpdk_mp *
> 

Re: [ovs-dev] [dpdk-users] adding dpdk ports sharing same pci address to ovs-dpdk bridge

2017-09-21 Thread Loftus, Ciara
> 21/09/2017 10:04, Loftus, Ciara:
> > > 20/09/2017 19:33, Kevin Traynor:
> > > > On 09/08/2017 10:56 AM, Loftus, Ciara wrote:
> > > > > It seems the DPDK function rte_eth_dev_get_port_by_name() will
> > > > > always return the port ID of the first port on your NIC,
> > > > > when you specify the single PCI address and that's where the
> > > > > problem is. There doesn't seem to be a way currently to
> > > > > indicate to the calling application that in fact two
> > > > > (or more) port IDs are associated with the one PCI address.
> > >
> > > We have two ports (with the same PCI address) so we should have
> > > two different names.
> > > Where the names passed to rte_eth_dev_get_port_by_name() come
> from?
> > > It is the user parameter from options:dpdk-devargs=0002:01:00.0, right?
> >
> > Yes, we're using the PCI address specified by the user in dpdk-devargs.
> >
> > > > > I am cc-ing DPDK users mailing list for hopefully some input.
> > > > > Are there any plans for the rte_eth_dev_get_port_by_name
> function
> > > > > to be compatible with NICs with multiple ports under the same PCI
> address?
> > >
> > > We cannot return two different ports for the same name.
> > > There are two issues here:
> > >   - the input should not be the PCI address
> > >   - the ethdev function should look at ethdev name, not rte_device
> > > one
> >
> > This would require the user having to "guess" the DPDK ethdev name
> > which is something we'd like to avoid.
> 
> Yes, but you can provide a way to list the ports with their names
> and characteristics.

Ok, I see. Maybe something like this could be considered:

port A = dpdk-devargs=xx:yy:zz 0
port B = dpdk-devargs=xx:yy:zz 1

If we detect a value after the PCI address we iterate through the 
rte_eth_dev_info (http://dpdk.org/doc/api/structrte__eth__dev__info.html) for 
all valid port IDs and assign port A to the first ethdev encountered with the 
provided PCI address, and port B to the second, etc.

If we don't detect a value, then we operate as normal. Thoughts?

Thanks,
Ciara

> 
> > We had the same problem using DPDK port IDs and decided not to use
> > them anymore, and use the PCI instead as it took the guesswork out.
> > Ethdev names and port IDs can change between tests, unlike the PCI
> > address which tends to remain constant for a device.
> 
> We can add a requirement on ethdev names and make sure they remain
> constant for a given port.
> 
> > > The idea is that we have only one rte_device object and we instantiate
> > > two rte_eth_dev ports.
> > > An ethdev port can be identified with its id (a number) or its unique
> name.
> > > Unfortunately, the user cannot guess the port id or the name set by the
> > > PMD.
> >
> > Exactly. Thanks for clarifying what's going on under the hood.
> >
> > Ciara
> >
> > >
> > > > Hi Adrien/Nelio,
> > > >
> > > > Is this something you can answer? We're wondering how to handle this
> in
> > > > OVS and whether a temporary or long term solution is needed.
> > >
> > > I suggest to rely on ethdev name.
> > > You will need to show to the user the mapping between the bus
> information
> > > (PCI id here) and the device names.
> > >
> > > Another alternative is to add a new function returning all ethdev ports
> > > associated to a given rte_device resource.
> > > So you would get two ports and you could pick one on the first "add-
> port",
> > > and the other one for the second "add-port" command.
> > > It means the user would be forced to add them in the right order if he
> > > wants a reproducible result.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [dpdk-users] adding dpdk ports sharing same pci address to ovs-dpdk bridge

2017-09-21 Thread Loftus, Ciara
> 20/09/2017 19:33, Kevin Traynor:
> > On 09/08/2017 10:56 AM, Loftus, Ciara wrote:
> > >> Hi,
> > >>
> > >> I have compiled and built ovs-dpdk using DPDK v17.08 and OVS v2.8.0.
> The
> > >> NIC that I am using is Mellanox ConnectX-3 Pro, which is a dual port 10G
> > >> NIC. The problem with this NIC is that it provides only one PCI address
> for
> > >> both the 10G ports.
> > >>
> > >> So when I am trying to add the two DPDK ports to my br0 bridge
> > >>
> > >> # ovs-vsctl --no-wait add-port br0 dpdk0 -- set Interface dpdk0
> type=dpdk
> > >> options:dpdk-devargs=0002:01:00.0
> > >>
> > >> # ovs-vsctl --no-wait add-port br0 dpdk1 -- set Interface dpdk1
> type=dpdk
> > >> options:dpdk-devargs=0002:01:00.0
> > >>
> > >> The port dpdk1 is added successfully and able to transfer data, but
> adding
> > >> dpdk0 to br0 fails:
> [...]
> > >> With OVS v2.6.1 I never had this problem as dpdk-devargs was not
> > >> mandatory
> > >> and just specifying port name was enough to add that port to bridge.
> > >>
> > >> Is there a way to add port both ports to bridge ?
> > >
> > > It seems the DPDK function rte_eth_dev_get_port_by_name() will
> always return the port ID of the first port on your NIC, when you specify the
> single PCI address and that's where the problem is. There doesn't seem to be
> a way currently to indicate to the calling application that in fact two (or 
> more)
> port IDs are associated with the one PCI address.
> 
> We have two ports (with the same PCI address) so we should have
> two different names.
> Where the names passed to rte_eth_dev_get_port_by_name() come from?
> It is the user parameter from options:dpdk-devargs=0002:01:00.0, right?

Yes, we're using the PCI address specified by the user in dpdk-devargs.

> 
> > > I am cc-ing DPDK users mailing list for hopefully some input. Are there 
> > > any
> plans for the rte_eth_dev_get_port_by_name function to be compatible
> with NICs with multiple ports under the same PCI address?
> 
> We cannot return two different ports for the same name.
> There are two issues here:
>   - the input should not be the PCI address
>   - the ethdev function should look at ethdev name, not rte_device
> one

This would require the user to "guess" the DPDK ethdev name, which is
something we'd like to avoid.
We had the same problem using DPDK port IDs and decided not to use them 
anymore, and use the PCI instead as it took the guesswork out.
Ethdev names and port IDs can change between tests, unlike the PCI address 
which tends to remain constant for a device.

> 
> The idea is that we have only one rte_device object and we instantiate
> two rte_eth_dev ports.
> An ethdev port can be identified with its id (a number) or its unique name.
> Unfortunately, the user cannot guess the port id or the name set by the
> PMD.

Exactly. Thanks for clarifying what's going on under the hood.

Ciara

> 
> > Hi Adrien/Nelio,
> >
> > Is this something you can answer? We're wondering how to handle this in
> > OVS and whether a temporary or long term solution is needed.
> 
> I suggest to rely on ethdev name.
> You will need to show to the user the mapping between the bus information
> (PCI id here) and the device names.
> 
> Another alternative is to add a new function returning all ethdev ports
> associated to a given rte_device resource.
> So you would get two ports and you could pick one on the first "add-port",
> and the other one for the second "add-port" command.
> It means the user would be forced to add them in the right order if he
> wants a reproducible result.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] adding dpdk ports sharing same pci address to ovs-dpdk bridge

2017-09-19 Thread Loftus, Ciara
> Thanks for confirming Devendra
> 
> Adding Ciara
> There have been some offline discussions regarding the issue.

The workaround discussed is a patch to enable backwards compatibility with the 
old port IDs. Something like the following:

-- set Interface portX options:dpdk-devargs=dpdkportid0

Looking for input.

Thanks,
Ciara

> 
> 
> From: devendra rawat 
> Date: Monday, September 18, 2017 at 4:27 AM
> To: Kevin Traynor 
> Cc: Darrel Ball , "ovs-dev@openvswitch.org"  d...@openvswitch.org>, "disc...@openvswitch.org"
> 
> Subject: Re: [ovs-dev] adding dpdk ports sharing same pci address to ovs-
> dpdk bridge
> 
> Hi Kevin,
> 
> On Fri, Sep 8, 2017 at 12:24 AM, Kevin Traynor 
> wrote:
> On 09/07/2017 06:47 PM, Darrell Ball wrote:
> > Adding disc...@openvswitch.org
> >
> > The related changes went into 2.7
> >
> >
> >
> > On 9/7/17, 3:51 AM, "ovs-dev-boun...@openvswitch.org on behalf of
> devendra rawat"  devendra.rawat.si...@gmail.com> wrote:
> >
> >     Hi,
> >
> >     I have compiled and built ovs-dpdk using DPDK v17.08 and OVS v2.8.0.
> The
> >     NIC that I am using is Mellanox ConnectX-3 Pro, which is a dual port 10G
> >     NIC. The problem with this NIC is that it provides only one PCI address 
> >for
> >     both the 10G ports.
> >
> >     So when I am trying to add the two DPDK ports to my br0 bridge
> >
> >     # ovs-vsctl --no-wait add-port br0 dpdk0 -- set Interface dpdk0
> type=dpdk
> >     options:dpdk-devargs=0002:01:00.0
> >
> >     # ovs-vsctl --no-wait add-port br0 dpdk1 -- set Interface dpdk1
> type=dpdk
> >     options:dpdk-devargs=0002:01:00.0
> >
> 
> Were you able to confirm those addresses by running ./dpdk-devbind.py -s
> in your /tools dir?
> 
> On running dpdk-devbind.py --status , I can see the ConnectX-3 pro NIC,
> having only one PCI address.
> 
> Network devices using DPDK-compatible driver
> 
> 
> 
> Network devices using kernel driver
> ===
> 0002:01:00.0 'MT27520 Family [ConnectX-3 Pro] 1007'
> if=enP4p1s0d1,enP4p1s0 drv=mlx4_core unused=
> 0006:01:00.0 'I210 Gigabit Network Connection 1533' if=enP6p1s0 drv=igb
> unused= *Active*
> 
> >     The port dpdk1 is added successfully and able to transfer data, but 
> >adding
> >     dpdk0 to br0 fails:
> >
> >     2017-09-06T14:19:20Z|00045|netdev_dpdk|INFO|Port 0:
> e4:1d:2d:4f:78:60
> >     2017-09-06T14:19:20Z|00046|bridge|INFO|bridge br0: added interface
> dpdk1 on
> >     port 1
> >     2017-09-06T14:19:20Z|00047|bridge|INFO|bridge br0: added interface
> br0 on
> >     port 65534
> >     2017-09-06T14:19:20Z|00048|dpif_netlink|WARN|Generic Netlink family
> >     'ovs_datapath' does not exist. The Open vSwitch kernel module is
> probably
> >     not loaded.
> >     2017-09-06T14:19:20Z|00049|netdev_dpdk|WARN|'dpdk0' is trying to
> use device
> >     '0002:01:00.0' which is already in use by 'dpdk1'
> >     2017-09-06T14:19:20Z|00050|netdev|WARN|dpdk0: could not set
> configuration
> >     (Address already in use)
> >     2017-09-06T14:19:20Z|00051|bridge|INFO|bridge br0: using datapath ID
> >     e41d2d4f7860
> >
> >
> >     With OVS v2.6.1 I never had this problem as dpdk-devargs was not
> mandatory
> >     and just specifying port name was enough to add that port to bridge.
> >
> >     Is there a way to add port both ports to bridge ?
> >
> >     Thanks,
> >     Devendra
> >     ___
> >     dev mailing list
> >     d...@openvswitch.org
> >     https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__mail.openvswitch.org_mailman_listinfo_ovs-
> 2Ddev=DwICAg=uilaK90D4TOVoH58JNXRgQ=BVhFA09CGX7JQ5Ih-
> uZnsw=qO7NdgrrorJhievOguQLmsfEFuBcPfz9NfQX7UME1-
> 8=ZKHbYlaTjm8VFj6Rggmcb2gw6s3xW4PxEtUy4YFG1VA=
> >
> >
> > ___
> > dev mailing list
> > d...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] adding dpdk ports sharing same pci address to ovs-dpdk bridge

2017-09-08 Thread Loftus, Ciara
> Hi,
> 
> I have compiled and built ovs-dpdk using DPDK v17.08 and OVS v2.8.0. The
> NIC that I am using is Mellanox ConnectX-3 Pro, which is a dual port 10G
> NIC. The problem with this NIC is that it provides only one PCI address for
> both the 10G ports.
> 
> So when I am trying to add the two DPDK ports to my br0 bridge
> 
> # ovs-vsctl --no-wait add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
> options:dpdk-devargs=0002:01:00.0
> 
> # ovs-vsctl --no-wait add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk
> options:dpdk-devargs=0002:01:00.0
> 
> The port dpdk1 is added successfully and able to transfer data, but adding
> dpdk0 to br0 fails:
> 
> 2017-09-06T14:19:20Z|00045|netdev_dpdk|INFO|Port 0: e4:1d:2d:4f:78:60
> 2017-09-06T14:19:20Z|00046|bridge|INFO|bridge br0: added interface dpdk1
> on
> port 1
> 2017-09-06T14:19:20Z|00047|bridge|INFO|bridge br0: added interface br0
> on
> port 65534
> 2017-09-06T14:19:20Z|00048|dpif_netlink|WARN|Generic Netlink family
> 'ovs_datapath' does not exist. The Open vSwitch kernel module is probably
> not loaded.
> 2017-09-06T14:19:20Z|00049|netdev_dpdk|WARN|'dpdk0' is trying to use
> device
> '0002:01:00.0' which is already in use by 'dpdk1'
> 2017-09-06T14:19:20Z|00050|netdev|WARN|dpdk0: could not set
> configuration
> (Address already in use)
> 2017-09-06T14:19:20Z|00051|bridge|INFO|bridge br0: using datapath ID
> e41d2d4f7860
> 
> 
> With OVS v2.6.1 I never had this problem as dpdk-devargs was not
> mandatory
> and just specifying port name was enough to add that port to bridge.
> 
> Is there a way to add port both ports to bridge ?

It seems the DPDK function rte_eth_dev_get_port_by_name() will always return
the port ID of the first port on your NIC when you specify the single PCI
address, and that's where the problem is. There doesn't currently seem to be a
way to indicate to the calling application that in fact two (or more) port IDs
are associated with the one PCI address.

I am cc-ing DPDK users mailing list for hopefully some input. Are there any 
plans for the rte_eth_dev_get_port_by_name function to be compatible with NICs 
with multiple ports under the same PCI address?

Thanks,
Ciara

> 
> Thanks,
> Devendra
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4] netdev-dpdk: Implement TCP/UDP TX cksum in ovs-dpdk side

2017-09-05 Thread Loftus, Ciara
> 
> Currently, the dpdk-vhost side in ovs doesn't support tcp/udp tx cksum.
> So L4 packets's cksum were calculated in VM side but performance is not
> good.
> Implementing tcp/udp tx cksum in ovs-dpdk side improves throughput in
> VM->phy->phy->VM situation. And it makes virtio-net frontend-driver
> support NETIF_F_SG(feature scatter-gather) as well.
> 
> Signed-off-by: Zhenyu Gao 
> ---
> 
> Here is some performance number:

Hi Zhenyu,

Thanks for the code changes since v3.
I tested a VM to VM case using iperf and observed a performance degradation 
when the tx cksum was offloaded to the host:

checksum in VM
0.0-30.0 sec  10.9 GBytes  3.12 Gbits/sec
0.0-30.0 sec  10.9 GBytes  3.11 Gbits/sec
0.0-30.0 sec  11.0 GBytes  3.16 Gbits/sec

checksum in ovs dpdk
0.0-30.0 sec  7.52 GBytes  2.15 Gbits/sec
0.0-30.0 sec  7.12 GBytes  2.04 Gbits/sec
0.0-30.0 sec  8.17 GBytes  2.34 Gbits/sec

I think for this feature to be enabled we need performance to be roughly the
same or better for all use cases. For now the gap here is too big, I think.

If you wish to reproduce:

1 host, 2 VMs each with 1 vhost port and flows set up to switch packets from 
each vhost port to the other.

VM1:
ifconfig eth1 1.1.1.1/24 up
ethtool -K eth2 tx on/off
iperf -c 1.1.1.2 -i 1 -t 30

VM2:
ifconfig eth1 1.1.1.2/24 up
ethtool -K eth1 tx on/off
iperf -s -i 1

Thanks,
Ciara

> 
> Setup:
> 
>  qperf client
> +-+
> |   VM|
> +-+
>  |
>  |  qperf server
> +--+  ++
> | vswitch+dpdk |  | bare-metal |
> +--+  ++
>||
>||
>   pNic-PhysicalSwitch
> 
> do cksum in ovs-dpdk: Applied this patch and execute 'ethtool -K eth0 tx on'
> in VM side.
>   It offload cksum job to ovs-dpdk side.
> 
> do cksum in VM: Applied this patch and execute 'ethtool -K eth0 tx off' in VM
> side.
> VM calculate cksum for tcp/udp packets.
> 
> We can see huge improvment in TCP throughput if we leverage ovs-dpdk
> cksum.
> 
> [root@localhost ~]# qperf -t 10 -oo msg_size:1:64K:*2  host-qperf-server01
> tcp_bw tcp_lat udp_bw udp_lat
>   do cksum in ovs-dpdk  do cksum in VM without this patch
> tcp_bw:
> bw  =  1.9 MB/sec bw  =  1.92 MB/secbw  =  1.95 MB/sec
> tcp_bw:
> bw  =  3.97 MB/secbw  =  3.99 MB/secbw  =  3.98 MB/sec
> tcp_bw:
> bw  =  7.75 MB/secbw  =  7.79 MB/secbw  =  7.89 MB/sec
> tcp_bw:
> bw  =  14.7 MB/secbw  =  14.7 MB/secbw  =  14.9 MB/sec
> tcp_bw:
> bw  =  27.7 MB/secbw  =  27.4 MB/secbw  =  28 MB/sec
> tcp_bw:
> bw  =  51.1 MB/secbw  =  51.3 MB/secbw  =  51.8 MB/sec
> tcp_bw:
> bw  =  86.2 MB/secbw  =  84.4 MB/secbw  =  87.6 MB/sec
> tcp_bw:
> bw  =  141 MB/sec bw  =  142 MB/secbw  =  141 MB/sec
> tcp_bw:
> bw  =  203 MB/sec bw  =  201 MB/secbw  =  211 MB/sec
> tcp_bw:
> bw  =  267 MB/sec bw  =  250 MB/secbw  =  260 MB/sec
> tcp_bw:
> bw  =  324 MB/sec bw  =  295 MB/secbw  =  302 MB/sec
> tcp_bw:
> bw  =  397 MB/sec bw  =  363 MB/secbw  =  347 MB/sec
> tcp_bw:
> bw  =  765 MB/sec bw  =  510 MB/secbw  =  383 MB/sec
> tcp_bw:
> bw  =  850 MB/sec bw  =  710 MB/secbw  =  417 MB/sec
> tcp_bw:
> bw  =  1.09 GB/secbw  =  860 MB/secbw  =  444 MB/sec
> tcp_bw:
> bw  =  1.17 GB/secbw  =  979 MB/secbw  =  447 MB/sec
> tcp_bw:
> bw  =  1.17 GB/secbw  =  1.07 GB/sec   bw  =  462 MB/sec
> tcp_lat:
> latency  =  29.1 us   latency  =  28.7 uslatency  =  29.1 us
> tcp_lat:
> latency  =  29 us latency  =  28.8 uslatency  =  29 us
> tcp_lat:
> latency  =  29 us latency  =  28.8 uslatency  =  29 us
> tcp_lat:
> latency  =  29 us latency  =  28.9 uslatency  =  29 us
> tcp_lat:
> latency  =  29.2 us   latency  =  28.9 uslatency  =  29.1 us
> tcp_lat:
> latency  =  29.1 us   latency  =  29.1 uslatency  =  29.1 us
> tcp_lat:
> latency  =  29.5 us   latency  =  29.5 uslatency  =  29.5 us
> tcp_lat:
> latency  =  29.8 us   latency  =  29.8 uslatency  =  29.9 us
> tcp_lat:
> latency  =  30.7 us   latency  =  30.7 uslatency  =  30.7 us
> tcp_lat:
> latency  =  47.1 us   latency  =  46.2 uslatency  =  47.1 us
> tcp_lat:
> latency  =  52.1 us   latency  =  52.3 uslatency  =  53.3 us
> tcp_lat:
> latency  =  44 us latency  =  43.8 uslatency  =  43.2 us
> tcp_lat:
> latency  =  50 us latency  =  46.6 uslatency  =  47.8 us
> tcp_lat:
>  

Re: [ovs-dev] [PATCH v3] netdev-dpdk: Implement TCP/UDP TX cksum in ovs-dpdk side

2017-09-01 Thread Loftus, Ciara
Hi Zhenyu,

Thanks for the v3. No feedback yet on the common implementation, so for now 
let's focus on this implementation.

Some high level comments:
- Moving the calculation to vhost rx we've removed the impact on non-vhost 
topology performance.
- The patch needs a rebase.
- checkpatch.py complains about a whitespace error. Please run 
"utilities/checkpatch.py" on any subsequent patches before you post.

Some comments inline too. I didn't get to verify performance with this review.

Thanks,
Ciara

> Currently, the dpdk-vhost side in ovs doesn't support tcp/udp tx cksum.
> So L4 packets's cksum were calculated in VM side but performance is not
> good.
> Implementing tcp/udp tx cksum in ovs-dpdk side improves throughput and

"Improves throughput" is a very general statement. Is the statement true for 
uses cases other than that which you benchmarked?

> makes virtio-net frontend-driver support NETIF_F_SG as well
> 
> Signed-off-by: Zhenyu Gao 
> ---
> 
> Here is some performance number:
> 
> Setup:
> 
>  qperf client
> +-+
> |   VM|
> +-+
>  |
>  |  qperf server
> +--+  ++
> | vswitch+dpdk |  | bare-metal |
> +--+  ++
>||
>||
>   pNic-PhysicalSwitch
> 
> do cksum in ovs-dpdk: Applied this patch and execute 'ethtool -K eth0 tx on'
> in VM side.
>   It offload cksum job to ovs-dpdk side.
> 
> do cksum in VM: Applied this patch and execute 'ethtool -K eth0 tx off' in VM
> side.
> VM calculate cksum for tcp/udp packets.
> 
> We can see huge improvment in TCP throughput if we leverage ovs-dpdk
> cksum.
> 
> [root@localhost ~]# qperf -t 10 -oo msg_size:1:64K:*2  host-qperf-server01
> tcp_bw tcp_lat udp_bw udp_lat
>   do cksum in ovs-dpdk  do cksum in VM without this patch
> tcp_bw:
> bw  =  2.05 MB/secbw  =  1.92 MB/secbw  =  1.95 MB/sec
> tcp_bw:
> bw  =  3.9 MB/sec bw  =  3.99 MB/secbw  =  3.98 MB/sec
> tcp_bw:
> bw  =  8.09 MB/secbw  =  7.82 MB/secbw  =  8.19 MB/sec
> tcp_bw:
> bw  =  14.9 MB/secbw  =  14.8 MB/secbw  =  15.7 MB/sec
> tcp_bw:
> bw  =  27.7 MB/secbw  =  28 MB/sec  bw  =  29.7 MB/sec
> tcp_bw:
> bw  =  51.2 MB/secbw  =  50.9 MB/secbw  =  54.9 MB/sec
> tcp_bw:
> bw  =  86.7 MB/secbw  =  86.8 MB/secbw  =  95.1 MB/sec
> tcp_bw:
> bw  =  149 MB/sec bw  =  160 MB/sec bw  =  149 MB/sec
> tcp_bw:
> bw  =  211 MB/sec bw  =  205 MB/sec bw  =  216 MB/sec
> tcp_bw:
> bw  =  271 MB/sec bw  =  254 MB/sec bw  =  275 MB/sec
> tcp_bw:
> bw  =  326 MB/sec bw  =  303 MB/sec bw  =  321 MB/sec
> tcp_bw:
> bw  =  407 MB/sec bw  =  359 MB/sec bw  =  361 MB/sec
> tcp_bw:
> bw  =  816 MB/sec bw  =  512 MB/sec bw  =  419 MB/sec
> tcp_bw:
> bw  =  840 MB/sec bw  =  756 MB/sec bw  =  457 MB/sec
> tcp_bw:
> bw  =  1.07 GB/secbw  =  880 MB/sec bw  =  480 MB/sec
> tcp_bw:
> bw  =  1.17 GB/secbw  =  1.01 GB/secbw  =  488 MB/sec
> tcp_bw:
> bw  =  1.17 GB/secbw  =  1.11 GB/secbw  =  483 MB/sec
> tcp_lat:
> latency  =  29 us latency  =  29.2 us   latency  =  29.6 us
> tcp_lat:
> latency  =  28.9 us   latency  =  29.3 us   latency  =  29.5 us
> tcp_lat:
> latency  =  29 us latency  =  29.3 us   latency  =  29.6 us
> tcp_lat:
> latency  =  29 us latency  =  29.4 us   latency  =  29.5 us
> tcp_lat:
> latency  =  29 us latency  =  29.2 us   latency  =  29.6 us
> tcp_lat:
> latency  =  29.1 us   latency  =  29.3 us   latency  =  29.7 us
> tcp_lat:
> latency  =  29.4 us   latency  =  29.6 us   latency  =  30 us
> tcp_lat:
> latency  =  29.8 us   latency  =  30.1 us   latency  =  30.2 us
> tcp_lat:
> latency  =  30.9 us   latency  =  30.9 us   latency  =  31 us
> tcp_lat:
> latency  =  46.9 us   latency  =  46.2 us   latency  =  32.2 us
> tcp_lat:
> latency  =  51.5 us   latency  =  52.6 us   latency  =  34.5 us
> tcp_lat:
> latency  =  43.9 us   latency  =  43.8 us   latency  =  43.6 us
> tcp_lat:
>  latency  =  47.6 us  latency  =  48 us latency  =  48.1 us
> tcp_lat:
> latency  =  77.7 us   latency  =  78.8 us   latency  =  78.8 us
> tcp_lat:
> latency  =  82.8 us   latency  =  82.3 us   latency  =  116 us
> tcp_lat:
> latency  =  94.8 us   latency  =  94.2 us   latency  =  134 us
> tcp_lat:
> latency  =  167 uslatency  =  197 uslatency  =  172 us
> udp_bw:
> 

Re: [ovs-dev] [PATCH 2/3] dpif-netdev: Fix a couple of coding style issues.

2017-09-01 Thread Loftus, Ciara
> 
> A couple of trivial fixes for a ternary operator placement
> and pointer declaration.
> 
> Fixes: 655856ef39b9 ("dpif-netdev: Change rxq_scheduling to use rxq
> processing cycles.")
> Fixes: a2ac666d5265 ("dpif-netdev: Change definitions of 'idle' & 'processing'
> cycles")
> Cc: ciara.lof...@intel.com
> Reported-by: Ilya Maximets 
> Signed-off-by: Kevin Traynor 
> ---
>  lib/dpif-netdev.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 55d5656..1db9f10 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3429,6 +3429,6 @@ static int
>  rxq_cycle_sort(const void *a, const void *b)
>  {
> -struct dp_netdev_rxq * qa;
> -struct dp_netdev_rxq * qb;
> +struct dp_netdev_rxq *qa;
> +struct dp_netdev_rxq *qb;
>  uint64_t total_qa, total_qb;
>  unsigned i;
> @@ -3865,7 +3865,8 @@ dpif_netdev_run(struct dpif *dpif)
> port->rxqs[i].rx,
> port->port_no);
> -cycles_count_intermediate(non_pmd, NULL, process_packets 
> ?
> -   PMD_CYCLES_PROCESSING
> - : PMD_CYCLES_IDLE);
> +cycles_count_intermediate(non_pmd, NULL,
> +  process_packets
> +  ? PMD_CYCLES_PROCESSING
> +  : PMD_CYCLES_IDLE);
>  }
>  }
> --
> 1.8.3.1

LGTM. Thanks!

Acked-by: Ciara Loftus 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1] netdev-dpdk: Implement TCP/UDP TX cksum in ovs-dpdk side

2017-08-23 Thread Loftus, Ciara
> 
> Hi Ciara
> 
> You had a general concern below; can we conclude on that before going
> further ?
> 
> Thanks Darrell
> 
> “
> > On another note I have a general concern. I understand similar functionality
> > is present in the DPDK vhost sample app. I wonder if it would be feasible
> for
> > this to be implemented in the DPDK vhost library and leveraged here,
> rather
> > than having two implementations in two separate code bases.

This is something I'd like to see, although I wouldn't block on this patch 
waiting for it.
Maybe we can have the initial implementation as it is (if it proves 
beneficial), then move to a common DPDK API if/when it becomes available.

I've cc'ed DPDK users list hoping for some input. To summarise:
From my understanding, the DPDK vhost sample application calculates TX checksum 
for packets received from vHost ports with invalid/0 checksums:
http://dpdk.org/browse/dpdk/tree/examples/vhost/main.c#n910
The patch being discussed in this thread (also here: 
https://patchwork.ozlabs.org/patch/802070/) it seems does something very 
similar.
I'm wondering about the feasibility of putting this functionality in an rte_vhost
library call, so that we don't have two separate implementations.
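Roughly what I have in mind is something like the below. The prototype and name
are purely hypothetical - no such API exists in rte_vhost today:

    /* Hypothetical helper: fill in the L4 checksum for any packet in 'pkts'
     * that was dequeued from a vhost device with an incomplete checksum,
     * mirroring what examples/vhost/main.c does internally. */
    void rte_vhost_tx_csum_prepare(struct rte_mbuf **pkts, uint16_t count);

    /* Applications (OVS included) could then call this right after
     * rte_vhost_dequeue_burst() instead of carrying private checksum code. */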

Thanks,
Ciara

> >
> > I have some other comments inline.
> >
> > Thanks,
> > Ciara
> “
> 
> 
> 
> From: Gao Zhenyu <sysugaozhe...@gmail.com>
> Date: Wednesday, August 16, 2017 at 6:38 AM
> To: "Loftus, Ciara" <ciara.lof...@intel.com>
> Cc: "b...@ovn.org" <b...@ovn.org>, "Chandran, Sugesh"
> <sugesh.chand...@intel.com>, "ktray...@redhat.com"
> <ktray...@redhat.com>, Darrell Ball <db...@vmware.com>,
> "d...@openvswitch.org" <d...@openvswitch.org>
> Subject: Re: [ovs-dev] [PATCH v1] netdev-dpdk: Implement TCP/UDP TX
> cksum in ovs-dpdk side
> 
> Hi Loftus,
>    I had submitted a new version, please see
> https://patchwork.ozlabs.org/patch/802070/
>    It move the cksum to vhost receive side.
> Thanks
> Zhenyu Gao
> 
> 2017-08-10 12:35 GMT+08:00 Gao Zhenyu <sysugaozhe...@gmail.com>:
> I see, for flows in phy-phy setup, they should not be calculate cksum.
> I will revise my patch to do the cksum for vhost port only. I will send a new
> patch next week.
> 
> Thanks
> Zhenyu Gao
> 
> 2017-08-08 17:53 GMT+08:00 Loftus, Ciara <ciara.lof...@intel.com>:
> >
> > Hi Loftus,
> >
> > Thanks for testing and the comments!
> > Can you show more details about your phy-vm-phy,phy-phy setup and
> > testing steps? Then I can reproduce it to see if I can solve this pps 
> > problem.
> 
> You're welcome. I forgot to mention my tests were with 64B packets.
> 
> For phy-phy the setup is a single host with 2 dpdk physical ports and 1 flow
> rule port1 -> port2.
> See figure 3 here: https://tools.ietf.org/html/draft-ietf-bmwg-vswitch-
> opnfv-04#section-4
> 
> For the phy-vm-phy the setup is a single host with 2 dpdk physical ports and 2
> vhostuser ports with flow rules:
> Dpdk1 -> vhost 1 & vhost2 -> dpdk2
> IP rules are set up in the VM to route packets from vhost1 to vhost 2.
> See figure 4 in the link above.
> 
> >
> > BTW, how about throughput, did you saw improvment?
> 
> By throughput if you mean 0% packet loss, I did not test this.
> 
> Thanks,
> Ciara
> 
> >
> > I would like to implement vhost->vhost part.
> >
> > Thanks
> > Zhenyu Gao
> >
> > 2017-08-04 22:52 GMT+08:00 Loftus, Ciara <ciara.lof...@intel.com>:
> > >
> > > Currently, the dpdk-vhost side in ovs doesn't support tcp/udp tx cksum.
> > > So L4 packets's cksum were calculated in VM side but performance is not
> > > good.
> > > Implementing tcp/udp tx cksum in ovs-dpdk side improves throughput
> and
> > > makes virtio-net frontend-driver support NETIF_F_SG as well
> > >
> > > Signed-off-by: Zhenyu Gao <sysugaozhe...@gmail.com>
> > > ---
> > >
> > > Here is some performance number:
> > >
> > > Setup:
> > >
> > >  qperf client
> > > +-+
> > > |   VM    |
> > > +-+
> > >      |
> > >      |                          qperf server
> > > +--+              ++
> > > | vswitch+dpdk |              | bare-metal |
> > > +--+              ++
> > >        |                            |
> > >        |                            |
> > >       pNic-PhysicalSwitch
> > >
> > > do cksum in ovs-dpdk: Ap

Re: [ovs-dev] [PATCH] netdev-dpdk: include dpdk PCI header directly

2017-08-10 Thread Loftus, Ciara
> 
> On 08/09/2017 10:00 PM, Aaron Conole wrote:
> > As part of a devargs rework in DPDK, the PCI header file was removed, and
> > needs to be directly included.  This isn't required to build with 17.05 or
> > earlier, but will be required should a future update happen.
> >
> > Signed-off-by: Aaron Conole 
> 
> Tried to build with dpdk 17.08
> 
> Tested-By: Timothy Redaelli 

+1 for forwards compatibility

Acked-by: Ciara Loftus 



Re: [ovs-dev] [PATCH v1] netdev-dpdk: Implement TCP/UDP TX cksum in ovs-dpdk side

2017-08-08 Thread Loftus, Ciara
> 
> Hi Loftus,
> 
> Thanks for testing and the comments!
> Can you show more details about your phy-vm-phy,phy-phy setup and
> testing steps? Then I can reproduce it to see if I can solve this pps problem.

You're welcome. I forgot to mention my tests were with 64B packets.

For phy-phy the setup is a single host with 2 dpdk physical ports and 1 flow 
rule port1 -> port2.
See figure 3 here: 
https://tools.ietf.org/html/draft-ietf-bmwg-vswitch-opnfv-04#section-4

For phy-vm-phy, the setup is a single host with 2 dpdk physical ports and 2 
vhostuser ports with flow rules:
dpdk1 -> vhost1 & vhost2 -> dpdk2
IP rules are set up in the VM to route packets from vhost1 to vhost 2.
See figure 4 in the link above.

> 
> BTW, how about throughput, did you see an improvement?

If by throughput you mean 0% packet loss, I did not test this.

Thanks,
Ciara

> 
> I would like to implement vhost->vhost part.
> 
> Thanks
> Zhenyu Gao
> 
> 2017-08-04 22:52 GMT+08:00 Loftus, Ciara <ciara.lof...@intel.com>:
> >
> > Currently, the dpdk-vhost side in ovs doesn't support tcp/udp tx cksum.
> > So L4 packets's cksum were calculated in VM side but performance is not
> > good.
> > Implementing tcp/udp tx cksum in ovs-dpdk side improves throughput and
> > makes virtio-net frontend-driver support NETIF_F_SG as well
> >
> > Signed-off-by: Zhenyu Gao <sysugaozhe...@gmail.com>
> > ---
> >
> > Here is some performance number:
> >
> > Setup:
> >
> >  qperf client
> > +-+
> > |   VM    |
> > +-+
> >      |
> >      |                          qperf server
> > +--+              ++
> > | vswitch+dpdk |              | bare-metal |
> > +--+              ++
> >        |                            |
> >        |                            |
> >       pNic-PhysicalSwitch
> >
> > do cksum in ovs-dpdk: Applied this patch and execute 'ethtool -K eth0 tx
> on'
> > in VM side.
> >                       It offload cksum job to ovs-dpdk side.
> >
> > do cksum in VM: Applied this patch and execute 'ethtool -K eth0 tx off' in
> VM
> > side.
> >                 VM calculate cksum for tcp/udp packets.
> >
> > We can see huge improvment in TCP throughput if we leverage ovs-dpdk
> > cksum.
> Hi Zhenyu,
> 
> Thanks for the patch. I tested some alternative use cases and unfortunately I
> see a degradation for phy-phy and phy-vm-phy topologies.
> Here are my results:
> 
> phy-vm-phy:
> without patch: 0.871Mpps
> with patch (offload=on): 0.877Mpps
> with patch (offload=off): 0.891Mpps
> 
> phy-phy:
> without patch: 13.581Mpps
> with patch: 13.055Mpps
> 
> The half a million pps drop for the second test case is concerning to me but
> not surprising since we're adding extra complexity to netdev_dpdk_send()
> Could this be avoided? Would it make sense to put this functionality
> somewhere else eg. vhost receive?
> 
> On another note I have a general concern. I understand similar functionality
> is present in the DPDK vhost sample app. I wonder if it would be feasible for
> this to be implemented in the DPDK vhost library and leveraged here, rather
> than having two implementations in two separate code bases.
> 
> I have some other comments inline.
> 
> Thanks,
> Ciara
> 
> >
> > [root@localhost ~]# qperf -t 10 -oo msg_size:1:64K:*2  host-qperf-server01
> > tcp_bw tcp_lat udp_bw udp_lat
> >   do cksum in ovs-dpdk          do cksum in VM             without this 
> >patch
> > tcp_bw:
> >     bw  =  2.05 MB/sec        bw  =  1.92 MB/sec        bw  =  1.95 MB/sec
> > tcp_bw:
> >     bw  =  3.9 MB/sec         bw  =  3.99 MB/sec        bw  =  3.98 MB/sec
> > tcp_bw:
> >     bw  =  8.09 MB/sec        bw  =  7.82 MB/sec        bw  =  8.19 MB/sec
> > tcp_bw:
> >     bw  =  14.9 MB/sec        bw  =  14.8 MB/sec        bw  =  15.7 MB/sec
> > tcp_bw:
> >     bw  =  27.7 MB/sec        bw  =  28 MB/sec          bw  =  29.7 MB/sec
> > tcp_bw:
> >     bw  =  51.2 MB/sec        bw  =  50.9 MB/sec        bw  =  54.9 MB/sec
> > tcp_bw:
> >     bw  =  86.7 MB/sec        bw  =  86.8 MB/sec        bw  =  95.1 MB/sec
> > tcp_bw:
> >     bw  =  149 MB/sec         bw  =  160 MB/sec         bw  =  149 MB/sec
> > tcp_bw:
> >     bw  =  211 MB/sec         bw  =  205 MB/sec         bw  =  216 MB/sec
> > tcp_bw:
> >     bw  =  271 MB/sec         bw  =  254 MB/sec         bw  =  275 MB/sec
> > tcp_bw:
> >     bw  =  326 MB/sec         bw  =  303 MB/sec         bw  =  321 MB/se

Re: [ovs-dev] [PATCH v1] netdev-dpdk: Implement TCP/UDP TX cksum in ovs-dpdk side

2017-08-04 Thread Loftus, Ciara
> 
> Currently, the dpdk-vhost side in ovs doesn't support tcp/udp tx cksum.
> So L4 packets' cksums are calculated on the VM side, but performance is not
> good.
> Implementing tcp/udp tx cksum on the ovs-dpdk side improves throughput and
> makes the virtio-net frontend driver support NETIF_F_SG as well.
> 
> Signed-off-by: Zhenyu Gao 
> ---
> 
> Here is some performance number:
> 
> Setup:
> 
>  qperf client
> +-+
> |   VM|
> +-+
>  |
>  |  qperf server
> +--+  ++
> | vswitch+dpdk |  | bare-metal |
> +--+  ++
>||
>||
>   pNic-PhysicalSwitch
> 
> do cksum in ovs-dpdk: Applied this patch and executed 'ethtool -K eth0 tx on'
> on the VM side.
>   It offloads the cksum job to the ovs-dpdk side.
> 
> do cksum in VM: Applied this patch and executed 'ethtool -K eth0 tx off' on
> the VM side.
> The VM calculates the cksum for tcp/udp packets.
> 
> We can see a huge improvement in TCP throughput if we leverage the ovs-dpdk
> cksum.

Hi Zhenyu,

Thanks for the patch. I tested some alternative use cases and unfortunately I 
see a degradation for phy-phy and phy-vm-phy topologies.
Here are my results:

phy-vm-phy:
without patch: 0.871Mpps
with patch (offload=on): 0.877Mpps
with patch (offload=off): 0.891Mpps

phy-phy:
without patch: 13.581Mpps
with patch: 13.055Mpps

The half a million pps drop for the second test case is concerning to me, but 
not surprising since we're adding extra complexity to netdev_dpdk_send().
Could this be avoided? Would it make sense to put this functionality somewhere 
else, e.g. vhost receive?
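To illustrate the 'vhost receive' idea: the fix-up could be applied only to 
packets dequeued from vhost ports that actually requested checksum offload, 
leaving netdev_dpdk_send() and therefore the phy-phy path untouched. A rough 
sketch, under the assumption (to be verified) that the vhost library of that 
era marks such packets via the PKT_TX_*_CKSUM bits in ol_flags, and reusing the 
sw_fill_l4_cksum() helper sketched earlier in this archive:

    #include <rte_mbuf.h>

    /* Fix up checksums on the vhost RX path only, so other port types pay
     * nothing.  Packets that do not request offload are left alone. */
    static void
    vhost_rx_cksum_fixup(struct rte_mbuf **pkts, int cnt)
    {
        for (int i = 0; i < cnt; i++) {
            if (pkts[i]->ol_flags & PKT_TX_L4_MASK) {
                sw_fill_l4_cksum(pkts[i]);   /* Sketched earlier; assumed here. */
            }
        }
    }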

On another note I have a general concern. I understand similar functionality is 
present in the DPDK vhost sample app. I wonder if it would be feasible for this 
to be implemented in the DPDK vhost library and leveraged here, rather than 
having two implementations in two separate code bases.

I have some other comments inline.

Thanks,
Ciara

> 
> [root@localhost ~]# qperf -t 10 -oo msg_size:1:64K:*2  host-qperf-server01
> tcp_bw tcp_lat udp_bw udp_lat
>   do cksum in ovs-dpdk  do cksum in VM without this patch
> tcp_bw:
> bw  =  2.05 MB/secbw  =  1.92 MB/secbw  =  1.95 MB/sec
> tcp_bw:
> bw  =  3.9 MB/sec bw  =  3.99 MB/secbw  =  3.98 MB/sec
> tcp_bw:
> bw  =  8.09 MB/secbw  =  7.82 MB/secbw  =  8.19 MB/sec
> tcp_bw:
> bw  =  14.9 MB/secbw  =  14.8 MB/secbw  =  15.7 MB/sec
> tcp_bw:
> bw  =  27.7 MB/secbw  =  28 MB/sec  bw  =  29.7 MB/sec
> tcp_bw:
> bw  =  51.2 MB/secbw  =  50.9 MB/secbw  =  54.9 MB/sec
> tcp_bw:
> bw  =  86.7 MB/secbw  =  86.8 MB/secbw  =  95.1 MB/sec
> tcp_bw:
> bw  =  149 MB/sec bw  =  160 MB/sec bw  =  149 MB/sec
> tcp_bw:
> bw  =  211 MB/sec bw  =  205 MB/sec bw  =  216 MB/sec
> tcp_bw:
> bw  =  271 MB/sec bw  =  254 MB/sec bw  =  275 MB/sec
> tcp_bw:
> bw  =  326 MB/sec bw  =  303 MB/sec bw  =  321 MB/sec
> tcp_bw:
> bw  =  407 MB/sec bw  =  359 MB/sec bw  =  361 MB/sec
> tcp_bw:
> bw  =  816 MB/sec bw  =  512 MB/sec bw  =  419 MB/sec
> tcp_bw:
> bw  =  840 MB/sec bw  =  756 MB/sec bw  =  457 MB/sec
> tcp_bw:
> bw  =  1.07 GB/secbw  =  880 MB/sec bw  =  480 MB/sec
> tcp_bw:
> bw  =  1.17 GB/secbw  =  1.01 GB/secbw  =  488 MB/sec
> tcp_bw:
> bw  =  1.17 GB/secbw  =  1.11 GB/secbw  =  483 MB/sec
> tcp_lat:
> latency  =  29 us latency  =  29.2 us   latency  =  29.6 us
> tcp_lat:
> latency  =  28.9 us   latency  =  29.3 us   latency  =  29.5 us
> tcp_lat:
> latency  =  29 us latency  =  29.3 us   latency  =  29.6 us
> tcp_lat:
> latency  =  29 us latency  =  29.4 us   latency  =  29.5 us
> tcp_lat:
> latency  =  29 us latency  =  29.2 us   latency  =  29.6 us
> tcp_lat:
> latency  =  29.1 us   latency  =  29.3 us   latency  =  29.7 us
> tcp_lat:
> latency  =  29.4 us   latency  =  29.6 us   latency  =  30 us
> tcp_lat:
> latency  =  29.8 us   latency  =  30.1 us   latency  =  30.2 us
> tcp_lat:
> latency  =  30.9 us   latency  =  30.9 us   latency  =  31 us
> tcp_lat:
> latency  =  46.9 us   latency  =  46.2 us   latency  =  32.2 us
> tcp_lat:
> latency  =  51.5 us   latency  =  52.6 us   latency  =  34.5 us
> tcp_lat:
> latency  =  43.9 us   latency  =  43.8 us   latency  =  43.6 us
> tcp_lat:
>  latency  =  47.6 us  latency  =  48 us latency  =  48.1 us
> tcp_lat:
> latency  =  77.7 us   latency  =  78.8 us  

[ovs-dev] [RFC PATCH] netdev-dpdk: Add vHost User PMD

2017-05-30 Thread Loftus, Ciara
Apologies for the misformatted subject header!
This cover-letter is in relation to the following patch:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-May/333108.html

Thanks,
Ciara

> -Original Message-
> From: Loftus, Ciara
> Sent: Tuesday, May 30, 2017 2:33 PM
> To: d...@openvswitch.org
> Cc: Loftus, Ciara <ciara.lof...@intel.com>; i.maxim...@samsung.com
> Subject: [RFC PATCH] *** SUBJECT HERE ***
> 
> This patch is to be applied on top of the DPDK 17.05 patch:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-May/333105.html
> 
> Previous:
> https://mail.openvswitch.org/pipermail/ovs-dev/2016-
> November/325492.html
> 
> Ciara Loftus (1):
>   netdev-dpdk: Add vHost User PMD
> 
>  lib/dpdk.c|  10 +
>  lib/dpdk.h|   1 +
>  lib/netdev-dpdk.c | 874 
> --
>  3 files changed, 325 insertions(+), 560 deletions(-)
> 
> --
> 2.4.11



Re: [ovs-dev] [RFC PATCH] netdev-dpdk: Add Tx intermediate queue for vhost ports.

2017-05-26 Thread Loftus, Ciara
> 
> This commit adds the intermediate queue for vHost-user ports. It
> improves the throughput in multiple virtual machines deployments and
> also in cases with VM doing packet forwarding in kernel stack.
> 
> This patch is aligned with intermediate queue implementation for dpdk
> ports that can be found here: https://patchwork.ozlabs.org/patch/723309/

As mentioned in previous reviews, it would be nice to see these two patches 
rebased and posted as a series, along with the combined performance figures.
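For anyone skimming the archive, the intermediate-queue approach common to both 
patches amounts to buffering packets per tx queue and only calling 
rte_eth_tx_burst() once a threshold is reached, plus a periodic drain to flush 
partial batches. A rough sketch of the pattern; the struct, threshold name and 
drop policy here are assumptions based on the commit messages, not the actual 
patch code:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define INTERIM_QUEUE_BURST_THRESHOLD 32   /* Flush once this many queue up. */

    struct intermediate_txq {
        struct rte_mbuf *pkts[INTERIM_QUEUE_BURST_THRESHOLD];
        int count;                             /* Packets currently buffered. */
    };

    /* Buffer one packet; burst the whole queue to the NIC when it fills.
     * A separate, periodically invoked drain flushes partial batches so
     * packets never sit in the queue indefinitely. */
    static void
    txq_queue_or_burst(uint16_t port_id, uint16_t qid,
                       struct intermediate_txq *txq, struct rte_mbuf *pkt)
    {
        txq->pkts[txq->count++] = pkt;

        if (txq->count >= INTERIM_QUEUE_BURST_THRESHOLD) {
            uint16_t sent = rte_eth_tx_burst(port_id, qid, txq->pkts,
                                             txq->count);

            /* Free (i.e. count as dropped) whatever the NIC didn't accept. */
            while (sent < txq->count) {
                rte_pktmbuf_free(txq->pkts[sent++]);
            }
            txq->count = 0;
        }
    }

The point of the batching is to amortise the expensive MMIO doorbell write in 
rte_eth_tx_burst() over more packets, at the cost of a little extra latency 
bounded by the drain interval.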

> 
> Signed-off-by: Bhanuprakash Bodireddy
> 
> Signed-off-by: Antonio Fischetti 
> Co-authored-by: Antonio Fischetti 
> ---



Re: [ovs-dev] [PATCH v3] netdev-dpdk: Implement Tx intermediate queue for dpdk ports.

2017-05-26 Thread Loftus, Ciara
> 
> After packet classification, packets are queued in to batches depending
> on the matching netdev flow. Thereafter each batch is processed to
> execute the related actions. This becomes particularly inefficient if
> there are few packets in each batch as rte_eth_tx_burst() incurs expensive
> MMIO writes.
> 
> This commit adds back intermediate queue implementation. Packets are
> queued and burst when the packet count exceeds threshold. Also drain
> logic is refactored to handle packets hanging in the tx queues. Testing
> shows significant performance gains with this implementation.
> 
> Fixes: b59cc14e032d("netdev-dpdk: Use instant sending instead of
> queueing of packets.")
> Signed-off-by: Bhanuprakash Bodireddy
> >
> Signed-off-by: Antonio Fischetti 
> Co-authored-by: Antonio Fischetti 
> Signed-off-by: Markus Magnusson 
> Co-authored-by: Markus Magnusson 
> ---
> v2->v3
>   * Refactor the code
>   * Use thread local copy 'send_port_cache' instead of 'tx_port' while 
> draining
>   * Invoke dp_netdev_drain_txq_port() to drain the packets from the queue
> as
> part of pmd reconfiguration that gets triggered due to port
> addition/deletion
> or change in pmd-cpu-mask.
>   * Invoke netdev_txq_drain() from xps_get_tx_qid() to drain packets in old
> queue. This is possible in XPS case where the tx queue can change after
> timeout.
>   * Fix another bug in netdev_dpdk_eth_tx_burst() w.r.t 'txq->count'.
> 
> Latency stats:
> Collected the latency stats with PHY2PHY loopback case using 30 IXIA streams
> /UDP packets/uni direction traffic. All the stats are in nanoseconds. Results
> below compare latency results between Master vs patch.
> 
> case 1: Matching IP flow rules for each IXIA stream
> Eg:  For an IXIA stream with src Ip: 2.2.2.1, dst tip: 5.5.5.1
> ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=2.2.2.1,actions=output:2
> 
> For an IXIA stream with src Ip: 4.4.4.1, dst tip: 15.15.15.1
> ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=4.4.4.1,actions=output:2
> 
> Packet64 128 256 512
> BranchMaster  Patch   Master  Patch   Master  Patch   Master  Patch
> case 1  26100   222000  26190   217930  23890   199000  30370   212440 (min
> latency ns)
> 1239100 906910  1168740 691040  575470  574240  724360  734050 (max
> latency ns)
> 1189501   763908  913602  662941  486187  440482  470060  479376 
> (avg
> latency ns)
> 
> 
>1024  1280 1518
>Master   Patch Master Patch  Master Patch
>2832018961026520  220580 23950  200480  (min latency ns)
>701040   67584670  670390 19783490   685930 747040  (max latency ns)
>444033   469297415602 506215 429587 491593  (avg latency ns)
> 
> 
> case 2: ovs-ofctl add-flow br0 in_port=1,action=output:2
> 
> Packet64 128 256 512
> BranchMaster  Patch   Master  Patch   Master  Patch   Master  Patch
> case 2  18800   33970   19980   30350  2261026800   13500   20220
> 506140  596690  363010  363370  544520  541570  549120  77414700
> 459509  473536  254817  256801  287872  287277  290642  301572
> 
> 
>1024  1280 1518
>Master   Patch Master Patch  Master Patch
>2253015850 21350  36020  25970  34300
>549680   131964240 543390 81549210   552060 98207410
>292436   294388285468 305727 295133 300080
> 
> 
> case 3 is same as case 1 with INTERIM_QUEUE_BURST_THRESHOLD=16,
> instead of 32.
> 
> (w) patch
> case 364   128   2565121024   12801518
>  122700   119890   135200  117530  118900 116640  123710(min)
>  972830   808960   574180  696820  36717550   720500  726790(max)
>  783315   674814   463256  439412  467041 463093  471967(avg)
> 
> case 4 is same as case 2 with INTERIM_QUEUE_BURST_THRESHOLD=16,
> instead of 32.
> 
> (w) patch
> case 464   128   2565121024   1280  1518
>  317502614025250   17570   14750  28600 31460(min ns)
>  722690   363200   539760  538320  301845040  12556210  132114800(max
> ns)
>  485710   253497   285589  284095  293189 282834285829(avg ns)
> 
> v1->v2
>   * xps_get_tx_qid() is no more called twice. The last used qid is stored so
> the drain function will flush the right queue also when XPS is enabled.
>   * netdev_txq_drain() is called unconditionally and not just for dpdk ports.
>   * txq_drain() takes the 'tx_lock' for queue in case of dynamic tx queues.
>   * Restored counting of dropped packets.
>   * Changed scheduling of drain function.
>   * Updated comments in netdev-provider.h
>   * 

Re: [ovs-dev] [PATCH v3] dpif-netdev: Change definitions of 'idle' & 'processing' cycles

2017-03-08 Thread Loftus, Ciara
> 
> On 02/21/2017 10:49 AM, Jan Scheurich wrote:
> >> -Original Message-
> >> From: Kevin Traynor [mailto:ktray...@redhat.com]
> >> Sent: Friday, 17 February, 2017 17:38
> >>
> >> If there are multiple queues in a poll list and only one has packets,
> >> the cycles polling the empty queues for packets will be counted in the
> >> processing time - whereas you'd expect them to be in the idle time.
> >>
> >> Is there much of a performance hit if the cycle_count_intermediate() is
> >> moved into the poll_cnt loop? OTOH, how much would the idle queues
> skew
> >> the counters? If it's a small amount it may be better than doing a
> >> cycle_count per q.
> >
> > In our testing an idle iteration over 8 vhostuser and one eth port
> > consumes on average 600 cycles, i.e. ~60 cycles per port. Compared to
> > the typical cost for processing a packet in the range of 1000 cycles,
> > the overhead of polling idle queues is not significant, unless a PMD
> > were to poll hundreds of queues. Also we have seen that an rte_rdtsc()
> > call can cost up to 20 cycles.
> >
> 
> Hi Ciara/Jan,
> 
> Thanks for the additional info. Considering this is just a helpful user
> stat and both options do not significantly skew performance or the stat,
> either way sounds ok to me. We can always change internally in the
> future if we need to.
> 
> Kevin.


Just a note: it might have gotten lost in discussion but there's no outstanding 
rework here.
Both v3 and v4 of this patch still apply cleanly to the HEAD of master so 
either version can be used if we want to push it upstream.

Thanks,
Ciara


Re: [ovs-dev] [PATCH] netdev-dpdk: Add support for DPDK 17.02

2017-02-24 Thread Loftus, Ciara
> 
> >
> > Ciara Loftus  writes:
> >
> > > This commit announces support for DPDK 17.02. Compatibility with DPDK
> > > v16.11 is not broken yet thanks to no code changes being needed for the
> > > upgrade.
> > >
> > > Signed-off-by: Ciara Loftus 
> > > ---
> >
> > Is it possible to make this reflect an either 16.11 or 17.02 support
> > instead of a hard requirement?  17.02 is not a long-term supported
> > version, but IIRC 16.11 is.
> >
> > What do you think?
> 
> Good point. Would like to hear more opinions. This could be a good topic for
> the bi-weekly call next Thursday.

This was discussed on the call and the general consensus was not to proceed 
with the patch for the time being considering 16.11 is LTS and there is no 
immediate dependency on 17.02 for OVS. Still open to discussion though.

Thanks,
Ciara

> 
> Thanks,
> Ciara


Re: [ovs-dev] [PATCH v3] dpif-netdev: Change definitions of 'idle' & 'processing' cycles

2017-02-20 Thread Loftus, Ciara
> 
> On 02/17/2017 10:39 AM, Ciara Loftus wrote:
> > Instead of counting all polling cycles as processing cycles, only count
> > the cycles where packets were received from the polling.
> >
> > Signed-off-by: Georg Schmuecking 
> > Signed-off-by: Ciara Loftus 
> > Co-authored-by: Georg Schmuecking 
> > Acked-by: Kevin Traynor 
> > ---
> > v3:
> > - Updated acks & co-author tags
> > - Removed unnecessary PMD_CYCLES_POLLING counter type
> > - Explain terms 'idle' and 'processing' cycles in
> >   vswitchd/ovs-vswitchd.8.in
> > v2:
> > - Rebase
> >
> >  lib/dpif-netdev.c  | 57 ++
> 
> >  vswitchd/ovs-vswitchd.8.in |  5 +++-
> >  2 files changed, 47 insertions(+), 15 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 30907b7..6bf6809 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -279,8 +279,9 @@ enum dp_stat_type {
> >  };
> >
> >  enum pmd_cycles_counter_type {
> > -PMD_CYCLES_POLLING, /* Cycles spent polling NICs. */
> > -PMD_CYCLES_PROCESSING,  /* Cycles spent processing packets */
> > +PMD_CYCLES_IDLE,/* Cycles spent idle or unsuccessful 
> > polling */
> > +PMD_CYCLES_PROCESSING,  /* Cycles spent successfully polling and
> > + * processing polled packets */
> >  PMD_N_CYCLES
> >  };
> >
> > @@ -755,10 +756,10 @@ pmd_info_show_stats(struct ds *reply,
> >  }
> >
> >  ds_put_format(reply,
> > -  "\tpolling cycles:%"PRIu64" (%.02f%%)\n"
> > +  "\tidle cycles:%"PRIu64" (%.02f%%)\n"
> >"\tprocessing cycles:%"PRIu64" (%.02f%%)\n",
> > -  cycles[PMD_CYCLES_POLLING],
> > -  cycles[PMD_CYCLES_POLLING] / (double)total_cycles * 100,
> > +  cycles[PMD_CYCLES_IDLE],
> > +  cycles[PMD_CYCLES_IDLE] / (double)total_cycles * 100,
> >cycles[PMD_CYCLES_PROCESSING],
> >cycles[PMD_CYCLES_PROCESSING] / (double)total_cycles * 
> > 100);
> >
> > @@ -2953,30 +2954,43 @@ cycles_count_end(struct
> dp_netdev_pmd_thread *pmd,
> >  non_atomic_ullong_add(&pmd->cycles.n[type], interval);
> >  }
> >
> > -static void
> > +/* Calculate the intermediate cycle result and add to the counter 'type' */
> > +static inline void
> > +cycles_count_intermediate(struct dp_netdev_pmd_thread *pmd,
> > +  enum pmd_cycles_counter_type type)
> > +OVS_NO_THREAD_SAFETY_ANALYSIS
> > +{
> > +unsigned long long new_cycles = cycles_counter();
> > +unsigned long long interval = new_cycles - pmd->last_cycles;
> > +pmd->last_cycles = new_cycles;
> > +
> > +non_atomic_ullong_add(&pmd->cycles.n[type], interval);
> > +}
> > +
> > +static int
> >  dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
> > struct netdev_rxq *rx,
> > odp_port_t port_no)
> >  {
> >  struct dp_packet_batch batch;
> >  int error;
> > +int batch_cnt = 0;
> >
> >  dp_packet_batch_init(&batch);
> > -cycles_count_start(pmd);
> >  error = netdev_rxq_recv(rx, &batch);
> > -cycles_count_end(pmd, PMD_CYCLES_POLLING);
> >  if (!error) {
> >  *recirc_depth_get() = 0;
> >
> > -cycles_count_start(pmd);
> > +batch_cnt = batch.count;
> >  dp_netdev_input(pmd, , port_no);
> > -cycles_count_end(pmd, PMD_CYCLES_PROCESSING);
> >  } else if (error != EAGAIN && error != EOPNOTSUPP) {
> >  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
> >
> >  VLOG_ERR_RL(&rl, "error receiving data from %s: %s",
> >  netdev_rxq_get_name(rx), ovs_strerror(error));
> >  }
> > +
> > +return batch_cnt;
> >  }
> >
> >  static struct tx_port *
> > @@ -3438,21 +3452,27 @@ dpif_netdev_run(struct dpif *dpif)
> >  struct dp_netdev *dp = get_dp_netdev(dpif);
> >  struct dp_netdev_pmd_thread *non_pmd;
> >  uint64_t new_tnl_seq;
> > +int process_packets = 0;
> >
> >  ovs_mutex_lock(&dp->port_mutex);
> >  non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID);
> >  if (non_pmd) {
> >  ovs_mutex_lock(&dp->non_pmd_mutex);
> > +cycles_count_start(non_pmd);
> >  HMAP_FOR_EACH (port, node, &dp->ports) {
> >  if (!netdev_is_pmd(port->netdev)) {
> >  int i;
> >
> >  for (i = 0; i < port->n_rxq; i++) {
> > -dp_netdev_process_rxq_port(non_pmd, port->rxqs[i].rx,
> > -   port->port_no);
> > +process_packets +=
> > +dp_netdev_process_rxq_port(non_pmd,
> > +   port->rxqs[i].rx,
> > +   port->port_no);
> > 
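The rest of the diff is truncated above, but the accounting change it introduces 
can be summarised with a small self-contained model: each interval between 
checkpoints is attributed to 'processing' only if the poll that preceded it 
actually returned packets, and to 'idle' otherwise. A sketch of the idea, not 
the patch's exact hunks:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    enum pmd_cycles_counter_type {
        PMD_CYCLES_IDLE,        /* Unsuccessful polling, no packets handled.  */
        PMD_CYCLES_PROCESSING,  /* Successful polling plus packet processing. */
        PMD_N_CYCLES
    };

    struct pmd_cycles {
        uint64_t last_tsc;
        uint64_t n[PMD_N_CYCLES];
    };

    /* Attribute the cycles elapsed since the last checkpoint to 'type'. */
    static void
    cycles_count_intermediate(struct pmd_cycles *c, uint64_t now_tsc,
                              enum pmd_cycles_counter_type type)
    {
        c->n[type] += now_tsc - c->last_tsc;
        c->last_tsc = now_tsc;
    }

    int
    main(void)
    {
        struct pmd_cycles c = { .last_tsc = 0 };

        /* A poll that returned nothing: 100 cycles of idle time.   */
        cycles_count_intermediate(&c, 100, PMD_CYCLES_IDLE);
        /* A poll that returned a batch: 250 cycles of processing.  */
        cycles_count_intermediate(&c, 350, PMD_CYCLES_PROCESSING);

        printf("idle=%" PRIu64 " processing=%" PRIu64 "\n",
               c.n[PMD_CYCLES_IDLE], c.n[PMD_CYCLES_PROCESSING]);
        return 0;
    }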

Re: [ovs-dev] [PATCH] netdev-dpdk: Add support for DPDK 17.02

2017-02-17 Thread Loftus, Ciara
> 
> Ciara Loftus  writes:
> 
> > This commit announces support for DPDK 17.02. Compatibility with DPDK
> > v16.11 is not broken yet thanks to no code changes being needed for the
> > upgrade.
> >
> > Signed-off-by: Ciara Loftus 
> > ---
> 
> Is it possible to make this reflect an either 16.11 or 17.02 support
> instead of a hard requirement?  17.02 is not a long-term supported
> version, but IIRC 16.11 is.
> 
> What do you think?

Good point. Would like to hear more opinions. This could be a good topic for 
the bi-weekly call next Thursday.

Thanks,
Ciara


Re: [ovs-dev] [PATCH v8] dpif-netdev: Conditional EMC insert

2017-02-14 Thread Loftus, Ciara
> 
> On 02/10/2017 10:57 AM, Ciara Loftus wrote:
> > Unconditional insertion of EMC entries results in EMC thrashing at high
> > numbers of parallel flows. When this occurs, the performance of the EMC
> > often falls below that of the dpcls classifier, rendering the EMC
> > practically useless.
> >
> > Instead of unconditionally inserting entries into the EMC when a miss
> > occurs, use a 1% probability of insertion. This ensures that the most
> > frequent flows have the highest chance of creating an entry in the EMC,
> > and the probability of thrashing the EMC is also greatly reduced.
> >
> > The probability of insertion is configurable, via the
> > other_config:emc-insert-prob option. For example the following command
> > increases the insertion probability to 10%.
> >
> > ovs-vsctl set Open_vSwitch . other_config:emc-insert-prob=10
> >
> > Signed-off-by: Ciara Loftus 
> > Signed-off-by: Georg Schmuecking 
> > Co-authored-by: Georg Schmuecking 
> > ---
> > v8:
> > - Floating point precision percentiles
> > - Moved NEWS entry to post-2.7 section and is no longer in the DPDK
> >   specific section.
> > v7:
> > - Remove further code duplication
> > v6:
> > - Refactor the code to remove duplication around calls to
> >   emc_probabilistic_insert()
> > v5:
> > - Use percentiles for emc-insert-prob (0-100%)
> > - Update docs to reflect the option not exclusive to the DPDK datapath.
> > v4:
> > - Added Georg to Authors file
> > - Set emc-insert-prob=1 for 'PMD - stats' unit test
> > - Use read_relaxed on atomic var
> > - Correctly allow for 0 and 100% probababilites
> > - Cache align new element in dp_netdev struct
> > - Revert to default probability if invalid emc-insert-prob set
> > - Allow configurability for non-DPDK case
> > v3:
> > - Use new dpif other_config infrastructure to tidy up how the
> >   emc-insert-prob value is passed to dpif-netdev.
> > v2:
> > - Enable probability configurability via other_config:emc-insert-prob
> >   option.
> >
> >  AUTHORS.rst  |  1 +
> >  Documentation/howto/dpdk.rst | 19 +
> >  NEWS |  2 ++
> >  lib/dpif-netdev.c| 50
> 
> >  lib/smap.c   | 10 +
> >  lib/smap.h   |  1 +
> >  tests/pmd.at |  1 +
> >  vswitchd/vswitch.xml | 13 
> >  8 files changed, 93 insertions(+), 4 deletions(-)
> >
> > diff --git a/AUTHORS.rst b/AUTHORS.rst
> > index b567fcc..3382ad8 100644
> > --- a/AUTHORS.rst
> > +++ b/AUTHORS.rst
> > @@ -382,6 +382,7 @@ Eric Lopez  elo...@nicira.com
> >  Frido Roose fr.ro...@gmail.com
> >  Gaetano Catalli gaetano.cata...@gmail.com
> >  Gavin Remaley   gavin_rema...@selinc.com
> > +Georg Schmuecking   georg.schmueck...@ericsson.com
> >  George Shuklin  ama...@desunote.ru
> >  Gerald Rogers   gerald.rog...@intel.com
> >  Ghanem Bahribahri.gha...@gmail.com
> > diff --git a/Documentation/howto/dpdk.rst
> b/Documentation/howto/dpdk.rst
> > index d1e6e89..9e0dbeb 100644
> > --- a/Documentation/howto/dpdk.rst
> > +++ b/Documentation/howto/dpdk.rst
> > @@ -354,6 +354,25 @@ the `DPDK documentation
> >
> >  Note: Not all DPDK virtual PMD drivers have been tested and verified to
> work.
> >
> > +EMC Insertion Probability
> > +-
> > +By default 1 in every 100 flows are inserted into the Exact Match Cache
> (EMC).
> > +It is possible to change this insertion probability by setting the
> > +``emc-insert-prob`` option::
> > +
> > +$ ovs-vsctl --no-wait set Open_vSwitch . other_config:emc-insert-
> prob=N
> > +
> > +where:
> > +
> > +``N``
> > +  is a positive floating point number between 0 and 100 representing the
> > +  percentage probability of EMC insertion.
> > +
> > +If ``N`` is set to 100, an insertion will be performed for every flow. If 
> > set
> > +to 0, no insertions will be performed and the EMC will effectively be
> disabled.
> > +
> > +For more information on the EMC refer to :doc:`/intro/install/dpdk` .
> > +
> >  .. _dpdk-ovs-in-guest:
> >
> >  OVS with DPDK Inside VMs
> > diff --git a/NEWS b/NEWS
> > index aebd99c..5c9e628 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -3,6 +3,8 @@ Post-v2.7.0
> > - Tunnels:
> >   * Added support to set packet mark for tunnel endpoint using
> > `egress_pkt_mark` OVSDB option.
> > +   - EMC insertion probability is reduced to 1% and is configurable via
> > + the new 'other_config:emc-insert-prob' option.
> >
> >  v2.7.0 - xx xxx 
> >  -
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 0be5db5..c8cb9ee 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -144,6 +144,11 @@ struct netdev_flow_key {
> >  #define 

Re: [ovs-dev] [PATCH 1/1] dpif-netdev: Conditional EMC insert

2017-01-26 Thread Loftus, Ciara
> 
> 2017-01-25 7:52 GMT-08:00 Loftus, Ciara <ciara.lof...@intel.com>:
> >> 2017-01-22 11:45 GMT-08:00 Jan Scheurich <jan.scheur...@web.de>:
> >> >
> >> >> It's not a big deal, since the most important use case we have for
> >> >> dpif-netdev is with dpdk, but I'd still like the code to behave
> >> >> similarly on different platforms.  How about defining a function that
> >> >> uses random_uint32 when compiling without DPDK?
> >> >>
> >> >> For testing it's not that simple, because unit tests can be run with
> >> >> or without DPDK.  It would need to be configurable at runtime.
> >> >> Perhaps making EM_FLOW_INSERT_PROB configurable at runtime
> would
> >> also
> >> >> help people that want to experiment with different values, even
> >> >> though, based on the comments, I guess they wouldn't really see
> much
> >> >> difference.
> >> >>
> >> >> Again, what do you think about simply using counting the packets and
> >> >> inserting only 1 every EM_FLOW_INSERT_PROB?
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Daniele
> >> >
> >> >
> >> > As far as I know Ciara did some quick tests with a counter-based
> >> > implementation and it performed 5% worse for 1K and 4K flows than
> then
> >> > current patch. Perhaps we could find the reason for that and fix it, but 
> >> > I
> >> > also feel uncomfortable with deterministic insertion of every Nth flow.
> This
> >> > could lead to very strange lock-step phenomena with typical artificial
> test
> >> > work loads, which often generate flows round-robin. I would rather use
> a
> >> > random function, as you suggest, or count "cycles" differently when
> >> > compiling without DPDK.
> >>
> >> Ok, using another pseudo random function when compiling without DPDK
> >> sounds
> >> good to me.
> >>
> >
> > Any suggestions for the random function?
> 
> I think we can use random_uint32() from lib/random.h
> 
> >
> >> >
> >> > I agree to making the parameter EM_FLOW_INSERT_PROB configurable
> for
> >> unit
> >> > test or other purposes. Should it be a new option in the OpenvSwitch
> table
> >> > in OVSDB or rather a run-time parameter to be changed with ovs-
> appctl?
> >>
> >> I think a new option in Openvswitch other_config would be appropriate.
> >
> > I like this idea. I've started making these changes. How about something
> like the following?..
> >
> > +   > +  type='{"type": "integer", "minInteger": 0, "maxInteger":
> 4294967295}'>
> > +
> > +  Specifies the probability (1/emc-insert-prob) of a flow being
> > +  inserted into the Exact Match Cache (EMC). Higher values of
> > +  emc-insert-prob will result in less insertions, and lower
> > +  values will result in more insertions. A value of zero will
> > +  result in no insertions and essentially disable the EMC.
> > +
> > +
> > +  Defaults to 100 ie. there is 1/100 chance of EMC insertion.
> 
> Looks good to me, thanks.
> 
> I would also add that this only applies to 'netdev' bridges (userspace) and
> that
> a value of 1 means that every flow is going to be sent to the EMC.

Thanks Daniele. I've posted a v2. Not sure it's the ideal approach so would 
appreciate your feedback if you get a chance.

On a separate note, I'm wondering whether now would be a good time to consider 
allowing the size of the EMC to be configurable, i.e. allow EM_FLOW_HASH_SHIFT 
or a similar value to be modified. Whatever scheme is followed for modifying 
the insert probability could probably be re-used for EMC sizing too. Just an 
idea!
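For context on the sizing idea, the EMC's capacity is currently a compile-time 
constant along the lines below (the names mirror lib/dpif-netdev.c; treat the 
exact values as illustrative):

    /* The EMC is a fixed-size, direct-indexed cache embedded in each PMD
     * thread; its capacity comes from a compile-time shift. */
    #define EM_FLOW_HASH_SHIFT   13
    #define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT)  /* 8192 entries */
    #define EM_FLOW_HASH_MASK    (EM_FLOW_HASH_ENTRIES - 1)  /* index = hash & mask */

Making that runtime-configurable would presumably mean allocating the entry 
array dynamically rather than embedding a fixed-size array in the per-thread 
cache structure, so it is a somewhat bigger change than adding another 
other_config knob.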

Thanks,
Ciara

> 
> >
> > Thanks,
> > Ciara
> >
> >>
> >> Thanks,
> >>
> >> Daniele
> >>
> >> >
> >> > Jan
> >> >


Re: [ovs-dev] [PATCH 0/1] dpif-netdev: Conditional EMC insert

2017-01-20 Thread Loftus, Ciara
> 
> On 01/12/2017 04:49 PM, Ciara Loftus wrote:
> > This patch is part of the OVS-DPDK performance optimizations presented
> > on the OVS fall conference
> > (http://openvswitch.org/support/ovscon2016/8/1400-gray.pdf)
> >
> > The Exact Match Cache does not perform well in use cases with a high
> > numbers of parallel packet flows. When the flow count exceeds 8k, which
> > is the size of the EMC, 'thrashing' occurs. Subsequent packets incur
> > EMC misses and lead to the insertion of new entries, which are likely
> > already overwritten by the time the next packet of a flow arrives.
> >
> > The extra cost of useless EMC insertions and failed EMC lookups degrades
> > the performance of the netdev-dpdk datapath up to 30% compared to the
> > pure performance of the dpcls classifier with EMC disabled. Profiling
> > has shown that the bulk of the degradation is caused by the EMC
> > insertions.
> >
> > To avoid this degradation we apply 'probabilistic' EMC insertions, whereby
> > an EMC miss only results in an EMC insertion with a random probability
> > of P=1/N (N integer). With this model, the insertion overhead of the EMC
> > can be reduced by a factor N, while the EMC speedup is maintained for a
> > small to medium number of flows. Probabilistic insertion can be
> > implemented with minimal run-time cost and naturally favors long-lived
> > flows.
> >
> > Different values for P from 1/100 to 1/4000 were validated and
> benchmarked
> > when creating this patch. Not much variance between different
> probabilities
> > was observed.
> >
> > We chose P=1/100 because it gives almost the same improvement as lower
> > probabilities while reaching the EMC capacity and thus optimal
> performance
> > quicker than the lower probabilities.
> >
> > For P=1/100, the EMC reached full capacity after 843ms when subjecting
> the
> > datapath with 10 long-lived flows at 14.8 Mpps. This is much quicker
> > compared to P=1/500 and P=1/1000, which took 3792ms and 7370ms
> > respectively.
> 
> The performance results are very impressive - it looks like ~50%
> performance improvement after about 10K flows.
> 
> Did you measure any negative effects when the the emc is not full?

Hi Kevin,

When the EMC is not full the impact is negligible.
For example, I tested with 1, 128 and 1024 flows and measured a 0.04% 
degradation and +2% and +1% improvements respectively with the patch.

> 
> What about about only using this type of scheme once the emc is full or
> above a certain threshold?

Given the above results I don't think implementing something like that is 
necessary. The additional logic could be what ends up degrading performance.
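For reference, the probabilistic insertion under discussion reduces to one 
cheap check on the EMC-miss path, along these lines (a sketch built on OVS's 
random_uint32(), as suggested elsewhere in this discussion; the names and the 
1/100 constant are illustrative rather than the final patch):

    #include <stdbool.h>
    #include <stdint.h>

    #include "random.h"          /* OVS lib/random.h: random_uint32(). */

    #define EM_FLOW_INSERT_PROB 100                   /* ~1 insert per 100 misses. */
    #define EM_FLOW_INSERT_MIN  (UINT32_MAX / EM_FLOW_INSERT_PROB)

    /* Decide, on an EMC miss, whether this flow should be inserted.  The
     * common (no-insert) case costs one PRNG call and one compare. */
    static inline bool
    emc_should_insert(void)
    {
        return random_uint32() <= EM_FLOW_INSERT_MIN;
    }

In the patch this kind of check gates the existing emc_insert() call after the 
dpcls lookup, and the v2 additionally makes the probability configurable 
through other_config:emc-insert-prob.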

Thanks,
Ciara

> 
> Kevin.
> 
> >
> >  lib/dpif-netdev.c | 20 ++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> >



Re: [ovs-dev] [PATCH] Documentation: Update DPDK doc after port naming change.

2017-01-19 Thread Loftus, Ciara
> 
> options:dpdk-devargs is always required now.  This commit also changes
> some of the names from 'dpdk0' to various others.
> 
> netdev-dpdk/detach accepts a PCI id instead of a port name.
> 
> CC: Ciara Loftus 
> Fixes: 55e075e65ef9("netdev-dpdk: Arbitrary 'dpdk' port naming")
> Signed-off-by: Daniele Di Proietto 

Patch looks good. Thanks for the fixes!

Acked-by: Ciara Loftus 

> ---
>  Documentation/howto/dpdk.rst| 77 
> -
>  Documentation/howto/userspace-tunneling.rst |  2 +-
>  2 files changed, 43 insertions(+), 36 deletions(-)
> 
> diff --git a/Documentation/howto/dpdk.rst
> b/Documentation/howto/dpdk.rst
> index fbb4b5361..d1e6e899f 100644
> --- a/Documentation/howto/dpdk.rst
> +++ b/Documentation/howto/dpdk.rst
> @@ -44,8 +44,10 @@ ovs-vsctl can also be used to add DPDK devices. OVS
> expects DPDK device names
>  to start with ``dpdk`` and end with a portid. ovs-vswitchd should print the
>  number of dpdk devices found in the log file::
> 
> -$ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
> -$ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk
> +$ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> +options:dpdk-devargs=:01:00.0
> +$ ovs-vsctl add-port br0 dpdk-p1 -- set Interface dpdk-p1 type=dpdk \
> +options:dpdk-devargs=:01:00.1
> 
>  After the DPDK ports get added to switch, a polling thread continuously polls
>  DPDK devices and consumes 100% of the core, as can be checked from
> ``top`` and
> @@ -55,12 +57,12 @@ DPDK devices and consumes 100% of the core, as can
> be checked from ``top`` and
>  $ ps -eLo pid,psr,comm | grep pmd
> 
>  Creating bonds of DPDK interfaces is slightly different to creating bonds of
> -system interfaces. For DPDK, the interface type must be explicitly set. For
> -example::
> +system interfaces. For DPDK, the interface type and devargs must be
> explicitly
> +set. For example::
> 
> -$ ovs-vsctl add-bond br0 dpdkbond dpdk0 dpdk1 \
> --- set Interface dpdk0 type=dpdk \
> --- set Interface dpdk1 type=dpdk
> +$ ovs-vsctl add-bond br0 dpdkbond p0 p1 \
> +-- set Interface p0 type=dpdk options:dpdk-devargs=:01:00.0 \
> +-- set Interface p1 type=dpdk options:dpdk-devargs=:01:00.1
> 
>  To stop ovs-vswitchd & delete bridge, run::
> 
> @@ -98,7 +100,7 @@ where:
> 
>  For example::
> 
> -$ ovs-vsctl set interface dpdk0 options:n_rxq=4 \
> +$ ovs-vsctl set interface dpdk-p0 options:n_rxq=4 \
>  other_config:pmd-rxq-affinity="0:3,1:7,3:8"
> 
>  This will ensure:
> @@ -165,27 +167,27 @@ Flow Control
>  Flow control can be enabled only on DPDK physical ports. To enable flow
> control
>  support at tx side while adding a port, run::
> 
> -$ ovs-vsctl add-port br0 dpdk0 -- \
> -set Interface dpdk0 type=dpdk options:tx-flow-ctrl=true
> +$ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> +options:dpdk-devargs=:01:00.0 options:tx-flow-ctrl=true
> 
>  Similarly, to enable rx flow control, run::
> 
> -$ ovs-vsctl add-port br0 dpdk0 -- \
> -set Interface dpdk0 type=dpdk options:rx-flow-ctrl=true
> +$ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> +options:dpdk-devargs=:01:00.0 options:rx-flow-ctrl=true
> 
>  To enable flow control auto-negotiation, run::
> 
> -$ ovs-vsctl add-port br0 dpdk0 -- \
> -set Interface dpdk0 type=dpdk options:flow-ctrl-autoneg=true
> +$ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> +options:dpdk-devargs=:01:00.0 options:flow-ctrl-autoneg=true
> 
>  To turn ON the tx flow control at run time for an existing port, run::
> 
> -$ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=true
> +$ ovs-vsctl set Interface dpdk-p0 options:tx-flow-ctrl=true
> 
>  The flow control parameters can be turned off by setting ``false`` to the
>  respective parameter. To disable the flow control at tx side, run::
> 
> -$ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false
> +$ ovs-vsctl set Interface dpdk-p0 options:tx-flow-ctrl=false
> 
>  pdump
>  -
> @@ -234,13 +236,12 @@ enable Jumbo Frames support for a DPDK port,
> change the Interface's
>  ``mtu_request`` attribute to a sufficiently large value. For example, to add 
> a
>  DPDK Phy port with MTU of 9000::
> 
> -$ ovs-vsctl add-port br0 dpdk0 \
> -  -- set Interface dpdk0 type=dpdk \
> -  -- set Interface dpdk0 mtu_request=9000`
> +$ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> +  options:dpdk-devargs=:01:00.0 mtu_request=9000
> 
>  Similarly, to change the MTU of an existing port to 6200::
> 
> -$ ovs-vsctl set Interface dpdk0 mtu_request=6200
> +$ ovs-vsctl set Interface dpdk-p0 mtu_request=6200
> 
>  Some additional 

Re: [ovs-dev] [PATCH] netdev: Add 'errp' to set_config().

2017-01-11 Thread Loftus, Ciara
> 
> Since 55e075e65ef9("netdev-dpdk: Arbitrary 'dpdk' port naming"),
> set_config() is used to identify a DPDK device, so it's better to report
> its detailed error message to the user.  Tunnel devices and patch ports
> rely a lot on set_config() as well.
> 
> This commit adds a param to set_config() that can be used to return
> an error message and makes use of that in netdev-dpdk and netdev-vport.
> 
> Before this patch:
> 
> $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
> ovs-vsctl: Error detected while setting up 'dpdk0': dpdk0: could not set
> configuration (Invalid argument).  See ovs-vswitchd log for details.
> ovs-vsctl: The default log directory is "/var/log/openvswitch/".
> 
> $ ovs-vsctl add-port br0 p+ -- set Interface p+ type=patch
> ovs-vsctl: Error detected while setting up 'p+': p+: could not set 
> configuration
> (Invalid argument).  See ovs-vswitchd log for details.
> ovs-vsctl: The default log directory is "/var/log/openvswitch/".
> 
> $ ovs-vsctl add-port br0 gnv0 -- set Interface gnv0 type=geneve
> ovs-vsctl: Error detected while setting up 'gnv0': gnv0: could not set
> configuration (Invalid argument).  See ovs-vswitchd log for details.
> ovs-vsctl: The default log directory is "/var/log/openvswitch/".
> 
> After this patch:
> 
> $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
> ovs-vsctl: Error detected while setting up 'dpdk0': 'dpdk0' is missing
> 'options:dpdk-devargs'. The old 'dpdk' names are not supported.
> See ovs-vswitchd log for details.
> ovs-vsctl: The default log directory is "/var/log/openvswitch/".
> 
> $ ovs-vsctl add-port br0 p+ -- set Interface p+ type=patch
> ovs-vsctl: Error detected while setting up 'p+': p+: patch type requires valid
> 'peer' argument.  See ovs-vswitchd log for details.
> ovs-vsctl: The default log directory is "/var/log/openvswitch/".
> 
> $ ovs-vsctl add-port br0 gnv0 -- set Interface gnv0 type=geneve
> ovs-vsctl: Error detected while setting up 'gnv0': gnv0: geneve type requires
> valid 'remote_ip' argument.  See ovs-vswitchd log for details.
> ovs-vsctl: The default log directory is "/var/log/openvswitch/".
> 
> CC: Ciara Loftus 
> CC: Kevin Traynor 
> Signed-off-by: Daniele Di Proietto 
> ---
>  lib/netdev-dpdk.c | 27 ++
>  lib/netdev-dummy.c|  3 +-
>  lib/netdev-provider.h |  9 --
>  lib/netdev-vport.c| 76 ++
> -
>  lib/netdev.c  | 10 +--
>  5 files changed, 84 insertions(+), 41 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 0f02c4d74..1bcc27a62 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1132,7 +1132,7 @@ netdev_dpdk_lookup_by_port_id(int port_id)
>  }
> 
>  static int
> -netdev_dpdk_process_devargs(const char *devargs)
> +netdev_dpdk_process_devargs(const char *devargs, char **errp)
>  {
>  uint8_t new_port_id = UINT8_MAX;
> 
> @@ -1145,7 +1145,7 @@ netdev_dpdk_process_devargs(const char
> *devargs)
>  VLOG_INFO("Device '%s' attached to DPDK", devargs);
>  } else {
>  /* Attach unsuccessful */
> -VLOG_INFO("Error attaching device '%s' to DPDK", devargs);
> +VLOG_WARN_BUF(errp, "Error attaching device '%s' to DPDK",
> devargs);
>  return -1;
>  }
>  }
> @@ -1184,7 +1184,8 @@ dpdk_process_queue_size(struct netdev *netdev,
> const struct smap *args,
>  }
> 
>  static int
> -netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args)
> +netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
> +   char **errp)
>  {
>  struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>  bool rx_fc_en, tx_fc_en, autoneg;
> @@ -1225,7 +1226,7 @@ netdev_dpdk_set_config(struct netdev *netdev,
> const struct smap *args)
>   * is valid */
>  if (!(dev->devargs && !strcmp(dev->devargs, new_devargs)
> && rte_eth_dev_is_valid_port(dev->port_id))) {
> -int new_port_id = netdev_dpdk_process_devargs(new_devargs);
> +int new_port_id = netdev_dpdk_process_devargs(new_devargs,
> errp);
>  if (!rte_eth_dev_is_valid_port(new_port_id)) {
>  err = EINVAL;
>  } else if (new_port_id == dev->port_id) {
> @@ -1235,10 +1236,10 @@ netdev_dpdk_set_config(struct netdev *netdev,
> const struct smap *args)
>  struct netdev_dpdk *dup_dev;
>  dup_dev = netdev_dpdk_lookup_by_port_id(new_port_id);
>  if (dup_dev) {
> -VLOG_WARN("'%s' is trying to use device '%s' which is "
> -  "already in use by '%s'.",
> -  netdev_get_name(netdev), new_devargs,
> -  netdev_get_name(_dev->up));
> +VLOG_WARN_BUF(errp, "'%s' is trying to use device '%s' "
> +  

Re: [ovs-dev] [PATCH] netdev-dpdk: Assign socket id according to device's numa id

2017-01-11 Thread Loftus, Ciara
> 
> Binbin Xu  writes:
> 
> > After the commit "55e075e65ef9ecbd70e5e0fada2704c3d73724d8
> > netdev-dpdk: Arbitrary 'dpdk' port naming", we could hotplug
> > attach DPDK ports specified via the 'dpdk-devargs' option.
> >
> > But the socket id of DPDK ports can't be assigned correctly,
> > it is always 0. The socket id of DPDK ports should be assigned
> > according to the numa id of the device.
> >
> > Signed-off-by: Binbin Xu 
> > ---
> 
> I haven't reviewed the patch yet, but can you reformat this message with
> a fixes: tag?  That makes backporting swathes of commits easier.
> 
> It would be:
> Fixes: 55e075e65ef9 ("netdev-dpdk: Arbitrary 'dpdk' port naming")
> 
> I've also CC'd some folks involved in testing / using this patch

Agree with Aaron re restructuring the commit message. The patch itself looks 
good to me. Thanks for providing the fix.

Acked-by: Ciara Loftus 

> 
> Thanks,
> Aaron


Re: [ovs-dev] [PATCH v2 2/3] netdev-dpdk: Arbitrary 'dpdk' port naming

2016-12-15 Thread Loftus, Ciara
> 
> Thanks for the new version, I'm still testing this.

Thanks Daniele. I plan to do some more thorough testing myself.

> 
> One comment inline
> 
> 2016-12-14 9:06 GMT-08:00 Ciara Loftus :
> > 'dpdk' ports no longer have naming restrictions. Now, instead of
> > specifying the dpdk port ID as part of the name, the PCI address of the
> > device must be specified via the 'dpdk-devargs' option. eg.
> >
> > ovs-vsctl add-port br0 my-port
> > ovs-vsctl set Interface my-port type=dpdk
> > ovs-vsctl set Interface my-port options:dpdk-devargs=:06:00.3
> >
> > The user must no longer hotplug attach DPDK ports by issuing the
> > specific ovs-appctl netdev-dpdk/attach command. The hotplug is now
> > automatically invoked when a valid PCI address is set in the
> > dpdk-devargs. The format for ovs-appctl netdev-dpdk/detach command
> > has changed in that the user now must specify the relevant PCI address
> > as input instead of the port name.
> >
> > Signed-off-by: Ciara Loftus 
> > Signed-off-by: Daniele Di Proietto 
> > Co-authored-by: Daniele Di Proietto 
> > ---
> > Changelog:
> > * Keep port-detach appctl function - use PCI as input arg
> > * Add requires_mutex to devargs processing functions
> > * use reconfigure infrastructure for devargs changes
> > * process devargs even if valid portid ie. device already configured
> > * report err if dpdk-devargs is not specified
> > * Add Daniele's incremental patch & Sign-off + Co-author tags
> > * Update details of detach command to reflect PCI is needed instead of
> >   port name
> > * Update NEWS to mention that the new naming scheme is not backwards
> >   compatible
> > * Use existing DPDk functions to get port ID from PCI address/devname
> > * Merged process_devargs and process_pdevargs functions
> >
> >  Documentation/intro/install/dpdk-advanced.rst |   5 +-
> >  Documentation/intro/install/dpdk.rst  |  14 ++-
> >  NEWS  |   5 +
> >  lib/netdev-dpdk.c | 169 
> > +-
> >  vswitchd/vswitch.xml  |   8 ++
> >  5 files changed, 136 insertions(+), 65 deletions(-)
> >
> > diff --git a/Documentation/intro/install/dpdk-advanced.rst
> b/Documentation/intro/install/dpdk-advanced.rst
> > index 43acfc6..260ed59 100644
> > --- a/Documentation/intro/install/dpdk-advanced.rst
> > +++ b/Documentation/intro/install/dpdk-advanced.rst
> > @@ -944,9 +944,8 @@ dpdk_nic_bind.py script:
> >
> >  Then it can be attached to OVS:
> >
> > -   $ ovs-appctl netdev-dpdk/attach :01:00.0
> > -
> > -At this point, the user can create a ovs port using the add-port command.
> > +   $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk \
> > +   options:dpdk-devargs=:01:00.0
> >
> >  It is also possible to detach a port from ovs, the user has to remove the
> >  port using the del-port command, then it can be detached using:
> > diff --git a/Documentation/intro/install/dpdk.rst
> b/Documentation/intro/install/dpdk.rst
> > index 7724c8a..40b5b67 100644
> > --- a/Documentation/intro/install/dpdk.rst
> > +++ b/Documentation/intro/install/dpdk.rst
> > @@ -245,12 +245,14 @@ Bridges should be created with a
> ``datapath_type=netdev``::
> >
> >  $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
> >
> > -Now you can add DPDK devices. OVS expects DPDK device names to start
> with
> > -``dpdk`` and end with a portid. ovs-vswitchd should print the number of
> dpdk
> > -devices found in the log file::
> > -
> > -$ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
> > -$ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk
> > +Now you can add dpdk devices. The PCI address of the device needs to be
> > +set using the 'dpdk-devargs' option. DPDK will print the PCI addresses of
> > +eligible devices found during initialisation.
> > +
> > +$ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk \
> > +options:dpdk-devargs=:06:00.0
> > +$ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk \
> > +options:dpdk-devargs=:06:00.1
> >
> >  After the DPDK ports get added to switch, a polling thread continuously
> polls
> >  DPDK devices and consumes 100% of the core, as can be checked from
> 'top' and
> > diff --git a/NEWS b/NEWS
> > index b596cf3..8c7f38e 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -42,6 +42,11 @@ Post-v2.6.0
> > which set the number of rx and tx descriptors to use for the given 
> > port.
> >   * Support for DPDK v16.11.
> >   * Port Hotplug is now supported.
> > + * DPDK physical ports can now have arbitrary names. The PCI address
> of
> > +   the device must be set using the 'dpdk-devargs' option. 
> > Compatibility
> > +   with the old dpdk naming scheme is broken, and as such a
> > +   device will not be available for use 

Re: [ovs-dev] [PATCH] netdev-dpdk: Add vHost User PMD

2016-12-06 Thread Loftus, Ciara
> 
> Thanks for the patch.
> 
> I experience a crash with this patch applied by starting ovs and
> immediately adding a vhostuserclient port.  It's not reproducible 100%
> of the times.
> 
> Program received signal SIGSEGV, Segmentation fault.
> rte_eth_xstats_get (port_id=3 '\003', xstats=xstats@entry=0x2e6eaa0,
> n=n@entry=5160) at
> /home/daniele/dpdk-16.11/lib/librte_ether/rte_ethdev.c:1466
> 1466val = *stats_ptr;
> (gdb) bt
> #0  rte_eth_xstats_get (port_id=3 '\003',
> xstats=xstats@entry=0x2e6eaa0, n=n@entry=5160) at
> /home/daniele/dpdk-16.11/lib/librte_ether/rte_ethdev.c:1466
> #1  0x0077b88c in netdev_dpdk_get_stats
> (netdev=0x7f4f3ffb58c0, stats=0x7fff1239a900) at
> ../lib/netdev-dpdk.c:1871
> #2  0x006af9c3 in netdev_get_stats (netdev=,
> stats=stats@entry=0x7fff1239a900) at ../lib/netdev.c:1338
> #3  0x0060b25f in iface_refresh_stats
> (iface=iface@entry=0x2e6ea20) at ../vswitchd/bridge.c:2365
> #4  0x0060fa59 in iface_refresh_stats (iface=0x2e6ea20) at
> ../vswitchd/bridge.c:2359
> #5  iface_create (port_cfg=0x2e69af0, iface_cfg=0x2e69e40,
> br=0x2e31d20) at ../vswitchd/bridge.c:1829
> #6  bridge_add_ports__ (br=br@entry=0x2e31d20,
> wanted_ports=wanted_ports@entry=0x2e31e00,
> with_requested_port=with_requested_port@entry=true) at
> ../vswitchd/bridge.c:912
> #7  0x006105c2 in bridge_add_ports (wanted_ports=0x2e31e00,
> br=0x2e31d20) at ../vswitchd/bridge.c:923
> #8  bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x2de83b0) at
> ../vswitchd/bridge.c:644
> #9  0x0061372b in bridge_run () at ../vswitchd/bridge.c:2959
> #10 0x0040c48d in main (argc=6, argv=0x7fff1239af48) at
> ../vswitchd/ovs-vswitchd.c:111
> (gdb) info locals
> eth_stats = {ipackets = 0, opackets = 0, ibytes = 0, obytes = 0,
> imissed = 0, ierrors = 0, oerrors = 0, rx_nombuf = 0, q_ipackets = {0
> }, q_opackets = {0 }, q_ibytes =
> {0 }, q_obytes = {0 },
>   q_errors = {0 }}
> dev = 0xd8fbc0 
> count = 
> i = 2
> q = 934
> xcount = 
> val = 5548549585684803438
> stats_ptr = 0x7fff1239c000
> 
> Maybe this is happening because q is 934, while eth_stats only
> includes room for 16 queues.
> 

I'm looking into this at the moment - possibly an issue in the vhost pmd code.

> 
> 
> I suspect netdev_dpdk_vhost_reconfigure() should do much more than
> what it does.  I think that if we change the memory pool we should
> call dpdk_eth_dev_init(), because we need to pass the mempool to the
> PMD via rte_eth_rx_queue_setup.

You're right. Unfortunately this means we'll probably need to call 
rte_eth_dev_stop() in vhost_reconfigure, which will cause a link status change 
and thus trigger the LSC callback, where we try to acquire dev->mutex a second 
time and encounter a deadlock. I'll look into a way around this too!
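To spell the hazard out, the pattern being described looks roughly like the 
following. This is a sketch, not the actual OVS code paths; the callback 
signature follows DPDK 16.11's rte_eth_dev_callback_register(), and the types 
and ovs_mutex_*() calls are as in lib/netdev-dpdk.c.

    /* Reconfigure path: holds the device mutex while stopping the port. */
    static int
    vhost_pmd_reconfigure(struct netdev_dpdk *dev)
    {
        ovs_mutex_lock(&dev->mutex);
        rte_eth_dev_stop(dev->port_id);  /* Link goes down -> LSC event fires. */
        /* ... re-setup queues with the new mempool, rte_eth_dev_start() ... */
        ovs_mutex_unlock(&dev->mutex);
        return 0;
    }

    /* LSC callback registered via rte_eth_dev_callback_register(): if it
     * runs (or is waited on) while the mutex above is still held, the
     * second acquisition of the same non-recursive mutex deadlocks. */
    static void
    lsc_event_callback(uint8_t port_id, enum rte_eth_event_type type,
                       void *param)
    {
        struct netdev_dpdk *dev = param;

        (void) port_id;
        (void) type;
        ovs_mutex_lock(&dev->mutex);     /* Second acquisition: deadlock. */
        /* ... refresh link state, carrier, etc. ... */
        ovs_mutex_unlock(&dev->mutex);
    }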

> 
> 
> 
> Couple more comments inline
> 
> 2016-11-29 3:44 GMT-08:00 Ciara Loftus :
> > The vHost PMD allows vHost User ports to be controlled by the
> > librte_ether API, like physical 'dpdk' ports and IVSHM 'dpdkr' ports.
> > This commit integrates this PMD into OVS and removes direct calls to the
> > librte_vhost DPDK library.
> >
> > This commit requires DPDK v16.11 functionality that isn't available in
> > previous releases, and thus breaks compatibility with such releases.
> >
> > Signed-off-by: Ciara Loftus 
> >
> > ---
> > Change log:
> > * Rebase
> > * Alloc netdev->n_txq instead of max (1024) during netdev_dpdk_init
> > * Don't unregister unregistered callbacks.
> > * Move callback register to avoid need to call unregister in error case
> > * Detach and free pool ID if failure during client_reconfigure.
> > * Unregister callbacks before detach
> > * change vhost_pmd_id to signed int and use -1 value to indicate an
> >   ID from the pool hasn't been alloced for it yet.
> > * free pool ID if rte_eth_dev_attach fails.
> > * check link status on destruct and print warning if the port is live.
> > * only free ID during destruct if it has been allocated.
> > * Use id-pool implementation for allocating vHost PMD IDs
> > * Added DPDK_DEV_VHOST_CLIENT netdev_dpdk "type"
> > * Reintroduced socket-id logic to correctly set client type sid
> > * Removed magic number & used var instead
> > * Remove race condition in netdev_dpdk_init by reordering code.
> > * Detach vHost PMD if netdev_dpdk_init fails
> > * Introduce "out" in client_reconfigure for when txq alloc fails
> > * Remove set_tx_multiq fn for vHost ports
> >
> >  NEWS  |   2 +
> >  lib/dpdk.c|  11 +
> >  lib/dpdk.h|   1 +
> >  lib/netdev-dpdk.c | 956 -
> -
> >  4 files changed, 373 insertions(+), 597 deletions(-)
> >
> > diff --git a/NEWS b/NEWS
> > index 22beeb2..b979d2a 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -39,6 +39,8 @@ Post-v2.6.0
> >   * New option 'n_rxq_desc' and 'n_txq_desc' fields for DPDK interfaces
> > which set the number of 

Re: [ovs-dev] [PATCH] netdev-dpdk: Add support for DPDK 16.11

2016-11-24 Thread Loftus, Ciara
> 
> 
> On 11/24/2016 04:34 PM, Loftus, Ciara wrote:
> >>
> >>
> >> On 11/24/2016 03:59 PM, Loftus, Ciara wrote:
> >>>>
> >>>> On 11/24/2016 03:20 PM, Ciara Loftus wrote:
> >>>>> This commit announces support for DPDK 16.11. Compatibility with
> DPDK
> >>>>> v16.07 is not broken yet thanks to only minor code changes being
> >> needed
> >>>>> for the upgrade.
> >>>>>
> >>>>> Signed-off-by: Ciara Loftus <ciara.lof...@intel.com>
> >>>>> ---
> >>>>>  .travis/linux-build.sh|  2 +-
> >>>>>  INSTALL.DPDK-ADVANCED.rst |  4 ++--
> >>>>>  INSTALL.DPDK.rst  | 18 +-
> >>>>>  NEWS  |  1 +
> >>>>>  lib/netdev-dpdk.c |  3 ++-
> >>>>>  5 files changed, 15 insertions(+), 13 deletions(-)
> >>>> 
> >>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >>>>> index de78ddd..7564ad7 100644
> >>>>> --- a/lib/netdev-dpdk.c
> >>>>> +++ b/lib/netdev-dpdk.c
> >>>>> @@ -2593,7 +2593,8 @@ netdev_dpdk_vhost_class_init(void)
> >>>>>  rte_vhost_driver_callback_register(&virtio_net_device_ops);
> >>>>>  rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4
> >>>>>| 1ULL << VIRTIO_NET_F_HOST_TSO6
> >>>>> -  | 1ULL << VIRTIO_NET_F_CSUM);
> >>>>> +  | 1ULL << VIRTIO_NET_F_CSUM
> >>>>> +  | 1ULL << 
> >>>>> VIRTIO_RING_F_INDIRECT_DESC);
> >>>>>  ovs_thread_create("vhost_thread", start_vhost_loop, NULL);
> >>>>>
> >>>>>  ovsthread_once_done();
> >>>>>
> >>>> Any reason you disable indirect descs?
> >>>> Did you measure performance degradation with it?
> >>>
> >>> I measured a slight decrease ~-6% for the use case I tested.
> >> Ok, could you provide more details about this use-case?
> >> I'm interested to know which cases (don't) benefit from this feature.
> >
> > Single flow 64B packets unidirectional loopback test (dpdk -> vhost,
> > vhost -> dpdk) with testpmd forwarding between vhost ports in the VM.
> > Mergeable buffers disabled.
> Ok, thanks.
> Is testpmd in IO mode?

Correct, default mode.

> 
> Anyway, it is good to see that the DPDK version upgrade has no blockers.
> With indirect disabled, how does performance compare to v16.07?

I'm taking rough measurements, but 16.07 performs slightly better for this
case: roughly +2% with indirect disabled and +8% with indirect enabled.

Thanks,
Ciara

> 
> >
> >>
> >>> Either way, if it is to be enabled it should be enabled in a separate 
> >>> patch.
> >> Well, I'm not sure.
> >> I think this is not a good way to disable features, because once
> >> OVS is packaged, we cannot enable it anymore.
> >
> > My reasoning not to enable it in this patch (other than the degradation) is
> > that the purpose of this patch is the DPDK upgrade. We could have a second
> > patch, "enable indirect descriptors for vHost", for example, as suggested
> > by Ilya previously:
> > https://mail.openvswitch.org/pipermail/ovs-dev/2016-October/324586.html
> Ok.
> >
> >>
> >> It should be the role of the management tool to disable it by default,
> >> by configuring the Qemu params properly if needed (indirect_desc=off).
> >> Doing this, it could still be enabled without rebuilding the package if
> >> it benefits some use-cases.
> >>
> >> I would think this is the same for the other features that are disabled,
> >> as long as they don't break functionality. If other features break
> >> functionality, it should be reported so that it gets fixed.
> >>
> >> Moreover, we are looking at cross-version migration for vhost currently,
> >> and rte_vhost_feature_disable is problematic, because it prevents us from
> >> exporting the virtio features supported by a given DPDK version reliably
> >> at build time.
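
The build-time masking concern quoted above is easy to see in a small
standalone sketch. The feature bit numbers below are the standard virtio
values; the helpers and the "supported" set are illustrative assumptions,
not the rte_vhost API.

    /* Standalone sketch of build-time virtio feature masking.  Only the
     * bit numbers are real virtio definitions; the rest is made up. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define VIRTIO_NET_F_CSUM            0
    #define VIRTIO_NET_F_HOST_TSO4       11
    #define VIRTIO_NET_F_HOST_TSO6       12
    #define VIRTIO_RING_F_INDIRECT_DESC  28

    /* Features this (hypothetical) vhost library build implements. */
    static const uint64_t supported_features =
        (1ULL << VIRTIO_NET_F_CSUM)
        | (1ULL << VIRTIO_NET_F_HOST_TSO4)
        | (1ULL << VIRTIO_NET_F_HOST_TSO6)
        | (1ULL << VIRTIO_RING_F_INDIRECT_DESC);

    /* Features masked out by the application when it was built. */
    static uint64_t disabled_features;

    static void
    feature_disable(uint64_t mask)
    {
        disabled_features |= mask;
    }

    /* What the guest is actually offered during negotiation. */
    static uint64_t
    advertised_features(void)
    {
        return supported_features & ~disabled_features;
    }

    int
    main(void)
    {
        feature_disable(1ULL << VIRTIO_RING_F_INDIRECT_DESC);

        /* The advertised set no longer reflects what the library version
         * could support; the masked bit cannot be recovered at run time. */
        printf("supported:  0x%016" PRIx64 "\n", supported_features);
        printf("advertised: 0x%016" PRIx64 "\n", advertised_features());
        return 0;
    }

Once such a mask is compiled into a packaged binary, the advertised set no
longer tells a management tool (or a migration peer) what the library itself
could support, which is why handling the feature from the QEMU side was
suggested instead.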
> >
> > I see your reasoning too. Would like to hear input from other folks in the
> > community before I respin.
> Yes, sure.
> And cross-version migration is currently being discussed.
> 
> Thanks,
> Maxime
> >
> > Thanks,
> > Ciara
> >
> >>
> >> Cheers,
> >> Maxime


Re: [ovs-dev] [PATCH] netdev-dpdk: Add support for DPDK 16.11

2016-11-24 Thread Loftus, Ciara
> 
> 
> On 11/24/2016 03:59 PM, Loftus, Ciara wrote:
> >>
> >> On 11/24/2016 03:20 PM, Ciara Loftus wrote:
> >>> This commit announces support for DPDK 16.11. Compatibility with DPDK
> >>> v16.07 is not yet broken, thanks to only minor code changes being needed
> >>> for the upgrade.
> >>>
> >>> Signed-off-by: Ciara Loftus <ciara.lof...@intel.com>
> >>> ---
> >>>  .travis/linux-build.sh|  2 +-
> >>>  INSTALL.DPDK-ADVANCED.rst |  4 ++--
> >>>  INSTALL.DPDK.rst  | 18 +-
> >>>  NEWS  |  1 +
> >>>  lib/netdev-dpdk.c |  3 ++-
> >>>  5 files changed, 15 insertions(+), 13 deletions(-)
> >> 
> >>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >>> index de78ddd..7564ad7 100644
> >>> --- a/lib/netdev-dpdk.c
> >>> +++ b/lib/netdev-dpdk.c
> >>> @@ -2593,7 +2593,8 @@ netdev_dpdk_vhost_class_init(void)
> >>>  rte_vhost_driver_callback_register(&virtio_net_device_ops);
> >>>  rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4
> >>>| 1ULL << VIRTIO_NET_F_HOST_TSO6
> >>> -  | 1ULL << VIRTIO_NET_F_CSUM);
> >>> +  | 1ULL << VIRTIO_NET_F_CSUM
> >>> +  | 1ULL << VIRTIO_RING_F_INDIRECT_DESC);
> >>>  ovs_thread_create("vhost_thread", start_vhost_loop, NULL);
> >>>
> >>>  ovsthread_once_done();
> >>>
> >> Any reason you disable indirect descs?
> >> Did you measure performance degradation with it?
> >
> > I measured a slight decrease of ~6% for the use case I tested.
> Ok, could you provide more details about this use-case?
> I'm interested to know which cases (don't) benefit from this feature.

Single flow 64B packets unidirectional loopback test (dpdk -> vhost, vhost -> 
dpdk) with testpmd forwarding between vhost ports in the VM. Mergeable buffers 
disabled.

> 
> > Either way, if it is to be enabled it should be enabled in a separate patch.
> Well, I'm not sure.
> I think this is not a good way to disable features, because once
> OVS is packaged, we cannot enable it anymore.

My reasoning not to enable it in this patch (other than the degradation) is
that the purpose of this patch is the DPDK upgrade. We could have a second
patch "enable indirect descriptors for vHost" for example, as suggested by Ilya 
previously:
https://mail.openvswitch.org/pipermail/ovs-dev/2016-October/324586.html

> 
> It should be the role of the management tool to disable it by default,
> by configuring the Qemu params properly if needed (indirect_desc=off).
> Doing this, it could still be enabled without rebuilding the package if
> it benefits some use-cases.
> 
> I would think this is the same for the other features that are disabled,
> as long as they don't break functionality. If other features break
> functionality, it should be reported so that it gets fixed.
> 
> Moreover, we are looking at cross-version migration for vhost currently,
> and rte_vhost_feature_disable is problematic, because it prevents us from
> exporting the virtio features supported by a given DPDK version reliably
> at build time.

I see your reasoning too. Would like to hear input from other folks in the 
community before I respin.

Thanks,
Ciara

> 
> Cheers,
> Maxime


Re: [ovs-dev] [PATCH] netdev-dpdk: Add support for DPDK 16.11

2016-11-24 Thread Loftus, Ciara
> 
> On 11/24/2016 03:20 PM, Ciara Loftus wrote:
> > This commit announces support for DPDK 16.11. Compatibility with DPDK
> > v16.07 is not yet broken, thanks to only minor code changes being needed
> > for the upgrade.
> >
> > Signed-off-by: Ciara Loftus 
> > ---
> >  .travis/linux-build.sh|  2 +-
> >  INSTALL.DPDK-ADVANCED.rst |  4 ++--
> >  INSTALL.DPDK.rst  | 18 +-
> >  NEWS  |  1 +
> >  lib/netdev-dpdk.c |  3 ++-
> >  5 files changed, 15 insertions(+), 13 deletions(-)
> 
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index de78ddd..7564ad7 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -2593,7 +2593,8 @@ netdev_dpdk_vhost_class_init(void)
> >  rte_vhost_driver_callback_register(&virtio_net_device_ops);
> >  rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4
> >| 1ULL << VIRTIO_NET_F_HOST_TSO6
> > -  | 1ULL << VIRTIO_NET_F_CSUM);
> > +  | 1ULL << VIRTIO_NET_F_CSUM
> > +  | 1ULL << VIRTIO_RING_F_INDIRECT_DESC);
> >  ovs_thread_create("vhost_thread", start_vhost_loop, NULL);
> >
> >  ovsthread_once_done();
> >
> Any reason you disable indirect descs?
> Did you measure performance degradation with it?

I measured a slight decrease of ~6% for the use case I tested.
Either way, if it is to be enabled it should be enabled in a separate patch.

Thanks,
Ciara

> 
> 
> Thanks,
> Maxime