Re: [ovs-dev] Issues configuring OVS-DPDK in openstack queens

2018-10-25 Thread O Mahony, Billy
Hi Manojawa,

Comments below.

BR,
Billy

From: Manojawa Paritala [mailto:manojaw...@biarca.com]
Sent: Thursday, October 25, 2018 10:33 AM
To: O Mahony, Billy 
Cc: ovs-disc...@openvswitch.org; ovs-dev@openvswitch.org; Subba Rao Kodavalla 
; Song, Kee SangX ; Srinivasa Goda 
; Kris Rajana 
Subject: Re: [ovs-dev] Issues configuring OVS-DPDK in openstack queens

Hi Billy,

The issue of br-flat1 getting deleted is now fixed by setting the key-value pair 
"datapath_types = netdev,system" in the [OVS] section of neutron-openvswitch-agent.ini on the 
compute node.

[[BO'M]] Glad to hear you solved it.

Earlier, we used the key-value pair "datapath_type = netdev". Please note 
the new key name we used is "datapath_types" and the value we set is 
"netdev,system".

So, now our updated OVS section in neutron-openvswitch-agent.ini is as below.

[OVS]
bridge_mappings = flat:br-flat1,vxlan:br-vxlan,vlan:br-vlan,vlan1:br0,vlan2:br1
local_ip = 192.168.5.14
datapath_types = netdev,system
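[[BO'M]] As a quick sanity check on the above (the bridge names below are just the ones from 
your mapping, so adjust to your setup), the datapath each bridge actually ended up with can be 
read straight from ovsdb, e.g.:

  ovs-vsctl get Bridge br0 datapath_type
  ovs-vsctl --columns=name,datapath_type list Bridge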


One issue that we see now is that when we add "vlan1:br0,vlan2:br1" to 
bridge_mappings in neutron-openvswitch-agent.ini and restart the 
openvswitch-agent service, the ports in the bridges br0 & br1 show the errors 
below in the "ovs-vsctl show" output. The issue is resolved if I 
change the datapath_type of those bridges to netdev.
[[BO'M]] As both system and netdev (dpdk) type datapaths are in use here, I 
would guess that br1 and br0 are both being set up as system type datapaths 
(you can check in ovsdb). But the system datapath (i.e. openvswitch.ko) is a 
kernel-based datapath and does not support dpdk (userspace) interface types – 
i.e. dpdk, dpdkvhostuser and dpdkvhostuserclient.

Getting OSA/the neutron agent to use specific datapath types for specific bridges 
is beyond the scope of this ML. But let me know if you have any OvS issues.
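[[BO'M]] That said, on the OvS side the datapath type is just a column in the Bridge table, so 
a quick (if crude) experiment – which the agent may well overwrite on its next restart – is to 
set it by hand and see whether the dpdk/vhost-user port errors clear:

  ovs-vsctl set Bridge br0 datapath_type=netdev
  ovs-vsctl set Bridge br1 datapath_type=netdev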

  Bridge "br1"
Controller "tcp:127.0.0.1:6633<http://127.0.0.1:6633>"
is_connected: true
fail_mode: secure
Port "phy-br1"
Interface "phy-br1"
type: patch
options: {peer="int-br1"}
Port "vhost-user-5"
Interface "vhost-user-5"
type: dpdkvhostuser
error: "could not add network device vhost-user-5 to ofproto 
(Invalid argument)"
Port "port1"
Interface "port1"
type: dpdk
options: {dpdk-devargs="0000:b1:00.1"}
error: "Error attaching device '0000:b1:00.1' to DPDK"
Port "vhost-user-6"
Interface "vhost-user-6"
type: dpdkvhostuser
error: "could not add network device vhost-user-6 to ofproto 
(Invalid argument)"
Port "br1"
Interface "br1"
type: internal
Bridge "br0"
Controller "tcp:127.0.0.1:6633<http://127.0.0.1:6633>"
is_connected: true
fail_mode: secure
Port "vhost-user-1"
Interface "vhost-user-1"
type: dpdkvhostuser
error: "could not add network device vhost-user-1 to ofproto 
(Invalid argument)"
Port "br0"
Interface "br0"
type: internal
Port "port0"
Interface "port0"
type: dpdk
options: {dpdk-devargs="0000:af:00.1"}
error: "could not add network device port0 to ofproto (Invalid 
argument)"
Port "phy-br0"
Interface "phy-br0"
type: patch
options: {peer="int-br0"}
Port "vhost-user-3"
Interface "vhost-user-3"
type: dpdkvhostuser
error: "could not add network device vhost-user-3 to ofproto 
(Invalid argument)"
Port "vhost-user-2"
Interface "vhost-user-2"
type: dpdkvhostuser
error: "could not add network device vhost-user-2 to ofproto 
(Invalid argument)"
Port "vhost-user-4"
Interface "vhost-user-4"
type: dpdkvhostuser
error: "could not add network device vhost-user-4 to ofproto 
(Invalid argument)"


Thanks & Regards,
PVMJ

On Wed, Oct 24, 2018 at 7:47 PM Manojawa Paritala 
<manojaw...@biarca.com> wrote:
Hi Billy,

The reason br-vlan was getting deleted in my previous attempt was a 
small misconfiguration in the openstack-agent.ini. I corrected it and redid 
the test. Attaching the new logs with this email. Please ignore the earlier 
logs.

Apologies fo

Re: [ovs-dev] Issues configuring OVS-DPDK in openstack queens

2018-10-24 Thread O Mahony, Billy


From: Manojawa Paritala [mailto:manojaw...@biarca.com]
Sent: Tuesday, October 23, 2018 5:37 PM
To: O Mahony, Billy 
Cc: ovs-disc...@openvswitch.org; ovs-dev@openvswitch.org; Subba Rao Kodavalla 
; Song, Kee SangX ; Srinivasa Goda 
; Kris Rajana 
Subject: Re: [ovs-dev] Issues configuring OVS-DPDK in openstack queens

Hi Billy,

There are no br-flat1 entries in ovsdb.

As suggested I increased the log level to debug and then tried the same 
scenario again. Though the result was the same (br-flat1 getting deleted), I 
observed the below 2 issues (I assume).
[[BO'M]] You can just increase the log level for the bridge module (all the 
extra revalidator debug is making it hard to read the logs):
ovs-appctl vlog/set bridge:file:dbg
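
[[BO'M]] You can confirm the new level took effect with, for example:

  ovs-appctl vlog/list | grep bridge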

[[BO'M]] Also, is there logging from the neutron agent? OvS should not be 
removing br-flat1 of its own accord. Something external is updating ovsdb 
to remove its record. Is OVN in the loop here? Is there an ovn-controller or 
similar process running on the host?
I think this is important. As far as I know the decision to delete the bridge 
will not be made by vswitchd; it will be something external that removes 
the record in the ovsdb Bridge table. If the neutron agent log does not mention that it 
is doing this then maybe check the ovsdb-server log file (it may not exist if logging is 
not configured on the ovsdb-server command line – you'll have to check the ovsdb man 
pages and set it up). ovsdb-tool is showing just the delete transaction, so we 
need to find out where that delete request is coming from.
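
[[BO'M]] For example – the db path and log path below are the usual defaults, so adjust them 
to your deployment – the first command shows the delete transaction itself, and with jsonrpc 
debug enabled on ovsdb-server the log should also show which client connection sent it:

  ovsdb-tool -mmm show-log /etc/openvswitch/conf.db | grep -B2 -A2 br-flat1
  ovs-appctl -t ovsdb-server vlog/set jsonrpc:file:dbg
  grep br-flat1 /var/log/openvswitch/ovsdb-server.log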

Issue-1 :-
1. Everything is up and running. That is all the bridges are displayed in OVS 
and no issues in the logs.

[[BO'M]] Can you add the output from vsctl show at this point so I can see the 
desired post-agent-restart state?
2. I add the below entries in the OVS section of neutron's 
openvswitch-agent.ini file and restart the respective service.

datapath_type=netdev
vhostuser_socket_dir=/var/run/openvswitch

3. As mentioned earlier, bridge br-flat1 is deleted. At this point, I 
observed the below.

3.1 br-int datapath type changed from "system" to "netdev".
3.2 Not sure if it is expected behaviour, but there were "MAC address changed" 
messages only for br-flat1 & br-int.

2018-10-23T14:39:40.253Z|41205|in_band|DBG|br-int: remote MAC address changed 
from 00:00:00:00:00:00 to 00:a0:c9:0e:01:01
2018-10-23T14:39:41.347Z|41374|in_band|DBG|br-flat1: remote MAC address changed 
from 00:00:00:00:00:00 to 00:a0:c9:0e:01:01
2018-10-23T14:39:48.229Z|41852|in_band|DBG|br-int: remote MAC address changed 
from 00:00:00:00:00:00 to 00:a0:c9:0e:01:01
2018-10-23T14:39:55.032Z|42008|in_band|DBG|br-int: remote MAC address changed 
from 00:00:00:00:00:00 to 00:a0:c9:0e:01:01
[[BO'M]] The datapath type change is expected. The MAC address changes I'm not 
sure about.

3.3 The interface states of br-int & eth8 (attached to br-flat1) are down.
[[BO'M]] Can you copy the o/p from vsctl show again at this point.
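[[BO'M]] Something like the below (interface names taken from your description) is also a 
quick way to see the interface states:

  ovs-vsctl --columns=name,admin_state,link_state list Interface br-int eth8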

Attaching the debug logs of ovs-vswitchd.


Issue-2 :-
1. Everything is up and running. That is, all the bridges are displayed in OVS 
and there are no issues in the logs. I have one bridge named br0, which is of netdev type, 
and I have attached a dpdk port and a vhost-user port to it. Everything is fine.


[[BO'M]] When you say everything is fine, br-flat1 is still missing, right? If 
that is the case let's stick with the missing bridge issue for now. (There are 
actually quite a few issues in the vsctl output below that we can deal with after 
we figure out who deleted br-flat1.)

2. Now, in the OVS section of neutron's openvswitch-agent.ini file, I added an 
extra value "vlan1:br0" to the existing "bridge_mappings" key. The new 
key-value pair is as below. I wanted to create a new network and then map 
the bridge br0, so I added this entry.

bridge_mappings = flat:br-flat1,vxlan:br-vxlan,vlan:br-vlan,vlan1:br0

3. Now, when I restart the openvswitch-agent service, I observed that the 
datapath type of br0 changed from netdev to system. In the ovs-vsctl show 
output, I see the below.

   Bridge "br0"
Controller "tcp:127.0.0.1:6633<http://127.0.0.1:6633>"
is_connected: true
fail_mode: secure
Port "vhost-user-1"
Interface "vhost-user-1"
type: dpdkvhostuser
error: "could not add network device vhost-user-1 to ofproto 
(Invalid argument)"
Port "br0"
Interface "br0"
type: internal
Port "port0"
Interface "port0"
type: dpdk
options: {dpdk-devargs="0000:af:00.1"}
error: "could not add network device port0 to ofproto (Invalid 
argument)"
Port "phy-br0"
Interface "phy-br0"
    type: patch
options: {peer="int-br0&qu

Re: [ovs-dev] Issues configuring OVS-DPDK in openstack queens

2018-10-23 Thread O Mahony, Billy
Hi Manojawa,

So is there any remaining br-flat entry in the ovsdb? Does it give any 
clue as to the reason – there may be a free-form ‘status’ or ‘info’ field for that 
purpose.

I can understand the situation where a bridge might get incorrectly configured 
but I can’t understand why it is deleted by something other than the agent.

Maybe it tries to create the bridge, hits some error, and so decides to 
delete it. Are there more detailed log levels available for the agent? You may 
be able to turn on more detailed logging for the bridge module in OvS too.

/Billy.


From: Manojawa Paritala [mailto:manojaw...@biarca.com]
Sent: Tuesday, October 23, 2018 12:16 PM
To: O Mahony, Billy 
Cc: ovs-disc...@openvswitch.org; ovs-dev@openvswitch.org
Subject: Re: [ovs-dev] Issues configuring OVS-DPDK in openstack queens

Hi Billy,

Thank you for your reply.

1. Huge pages are properly set. Based on the dpdk configuration 
dpdk-socket-mem="4096,4096", 8 pages were created under /dev/hugepages.
2. dpdk-p0 is not attached to br-flat1. Actually I defined the bridge as 
br-flat1.
3. Yes, 'ovs-vsctl show' does not show br-flat1. As soon as I add the below 
entries in openvswitch-agent.ini and restart the neutron-openvswitch-agent 
service, br-flat1 is getting deleted. I can see that in the ovs-vswitchd logs 
and also in the output of "ovsdb-tool -mmm show-log"

datapath_type=netdev
vhostuser_socket_dir=/var/run/openvswitch

4. I do not see any errors in the neutron-openvswitch-agent logs, except for 
the below which are displayed after the bridge is deleted.

ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-99a234e3-c943-4234-8c4d-f0fdc594df8f - - - - -] Bridge br-flat1 for 
physical network flat does not exist. Agent terminated!

Thanks & Regards,
PVMJ

On Tue, Oct 23, 2018 at 3:06 PM O Mahony, Billy 
<billy.o.mah...@intel.com> wrote:
Hi,

I don't see any errors relating to the dpdk interfaces. But it is also not 
clear where the user-space drivers are bound and the hugepage memory is set up. 
So double check those two items.
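
For example – with $DPDK_DIR pointing at your dpdk build tree (an assumption on my part, 
adjust as needed) – the following should confirm the hugepage mounts/allocations and which 
driver each NIC is bound to:

  grep -i huge /proc/meminfo
  mount | grep -i huge
  $DPDK_DIR/usertools/dpdk-devbind.py --status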

Is the dpdk-p0 interface being attached to br-flat? Even if there are issues 
with the dpdk port the bridge should not be deleted (at least not automatically 
by OvS).

Can you confirm with 'ovs-vsctl show' that the br-flat is actually not present 
after the agent is restarted. And that the dpdk-p0 is not reporting an error.

What do the neutron-openvswitch-agent logs say?

Also run ovsdb-tool -mmm show-log which might give a clue as to when and how 
br-flat is being modified.

Regards,
Billy

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Manojawa Paritala
> Sent: Monday, October 22, 2018 3:31 PM
> To: ovs-disc...@openvswitch.org; ovs-dev@openvswitch.org
> Subject: [ovs-dev] Issues configuring OVS-DPDK in openstack queens
>
> Hello All,
>
> On a 3 node (one controller + 2 compute), we configured Openstack Queens
> using OSA with OVS. On all the nodes, we defined br-mgmt as linux bridge, br-
> tun as private network and br-flat as external.
> Installation was successful and we could create networks and instances on
> Openstack.
>
> Below are the versions of the OVS packages used on each node.
>
> Controller :- openstack-vswitch - 2.9.0
> Computes :- openstack-vswitch-dpdk - 2.9.0 (as we wanted to configure dpdk on
> the compute hosts)
>
> The openstack-vswitch-dpdk 2.9.0 package that we installed had dpdk version
> 17.11.3. When we tried to enable DPDK it failed with the below error.
>
> dpdk|ERR|DPDK not supported in this copy of Open vSwitch
>
> So, we downloaded the sources for dpdk 17.11.4 and openvswitch 2.9.2, built
> openvswitch with dpdk as suggested in the below official link.
> No issues on Openstack or OVS.
> http://docs.openvswitch.org/en/latest/intro/install/dpdk/
>
> Then, we added the below parameters to OVS and everything looked ok.
> No issues in Openstack or OVS.
>
> $ovs-vsctl get Open_vSwitch . other_config {dpdk-extra="-n 2", 
> dpdk-init="true",
> dpdk-lcore-mask="0x3000", dpdk-socket-mem="4096,4096", pmd-
> cpu-mask="0xf3c", vhost-iommu-support="true"}
>
> Then on the compute node, in openvswitch_agent.ini file - OVS section, I added
> the below (based on the link
> https://docs.openstack.org/neutron/pike/contributor/internals/ovs_vhostuser.h
> tml
> )
> and restarted neutron-openmvswitch-agent service.
>
> datapath_type=netdev
> vhostuser_socket_dir=/var/run/openvswitch
>
> After the a

Re: [ovs-dev] Issues configuring OVS-DPDK in openstack queens

2018-10-23 Thread O Mahony, Billy
Hi,

I don't see any errors relating to the dpdk interfaces. But it is also not 
clear where the user-space drivers are bound and the hugepage memory is set up. 
So double check those two items.

Is the dpdk-p0 interface being attached to br-flat? Even if there are issues 
with the dpdk port the bridge should not be deleted (at least not automatically 
by OvS).

Can you confirm with 'ovs-vsctl show' that the br-flat is actually not present 
after the agent is restarted. And that the dpdk-p0 is not reporting an error.

What do the neutron-openvswitch-agent logs say?

Also run ovsdb-tool -mmm show-log which might give a clue as to when and how 
br-flat is being modified.

Regards,
Billy

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Manojawa Paritala
> Sent: Monday, October 22, 2018 3:31 PM
> To: ovs-disc...@openvswitch.org; ovs-dev@openvswitch.org
> Subject: [ovs-dev] Issues configuring OVS-DPDK in openstack queens
> 
> Hello All,
> 
> On a 3 node (one controller + 2 compute), we configured Openstack Queens
> using OSA with OVS. On all the nodes, we defined br-mgmt as linux bridge, br-
> tun as private network and br-flat as external.
> Installation was successful and we could create networks and instances on
> Openstack.
> 
> Below are the versions of the OVS packages used on each node.
> 
> Controller :- openstack-vswitch - 2.9.0
> Computes :- openstack-vswitch-dpdk - 2.9.0 (as we wanted to configure dpdk on
> the compute hosts)
> 
> The openstack-vswitch-dpdk 2.9.0 package that we installed had dpdk version
> 17.11.3. When we tried to enable DPDK it failed with the below error.
> 
> dpdk|ERR|DPDK not supported in this copy of Open vSwitch
> 
> So, we downloaded the sources for dpdk 17.11.4 and openvswitch 2.9.2, built
> openvswitch with dpdk as suggested in the below official link.
> No issues on Openstack or OVS.
> http://docs.openvswitch.org/en/latest/intro/install/dpdk/
> 
> Then, we added the below parameters to OVS and everything looked ok.
> No issues in Openstack or OVS.
> 
> $ovs-vsctl get Open_vSwitch . other_config {dpdk-extra="-n 2", 
> dpdk-init="true",
> dpdk-lcore-mask="0x3000", dpdk-socket-mem="4096,4096", pmd-
> cpu-mask="0xf3c", vhost-iommu-support="true"}
> 
> Then on the compute node, in openvswitch_agent.ini file - OVS section, I added
> the below (based on the link
> https://docs.openstack.org/neutron/pike/contributor/internals/ovs_vhostuser.h
> tml
> )
> and restarted neutron-openmvswitch-agent service.
> 
> datapath_type=netdev
> vhostuser_socket_dir=/var/run/openvswitch
> 
> After the above change, bridge br-flat is getting deleted from OVS.
> Attached are the logs after I restart the neutron-openmvswitch-agent service 
> on
> the compute now. Not sure what the issue is.
> 
> Can any of you please let me know if we are missing anything?
> 
> Best Regards,
> PVMJ
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v5 2/2] tests: Fix unit test case caused by SMC cache.

2018-07-11 Thread O Mahony, Billy
Acked-by: Billy O'Mahony 

> -Original Message-
> From: Wang, Yipeng1
> Sent: Tuesday, July 10, 2018 11:14 AM
> To: d...@openvswitch.org; jan.scheur...@ericsson.com; O Mahony, Billy
> 
> Cc: Wang, Yipeng1 ; Stokes, Ian
> ; b...@ovn.org
> Subject: [PATCH v5 2/2] tests: Fix unit test case caused by SMC cache.
> 
> Test 1024 PMD - stats reported different stats data during tests because of 
> the
> SMC data. This commit fix the test.
> 
> Signed-off-by: Yipeng Wang 
> ---
>  tests/pmd.at | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/pmd.at b/tests/pmd.at index 60452f5..4cae6c8 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -196,12 +196,13 @@ dummy@ovs-dummy: hit:0 missed:0
>  p0 7/1: (dummy-pmd: configured_rx_queues=4,
> configured_tx_queues=<cleared>, requested_rx_queues=4,
> requested_tx_queues=<cleared>)
>  ])
> 
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed
> SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 8], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed
> +SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 9], [0], [dnl
>  pmd thread numa_id  core_id :
>packets received: 0
>packet recirculations: 0
>avg. datapath passes per packet: 0.00
>emc hits: 0
> +  smc hits: 0
>megaflow hits: 0
>avg. subtable lookups per megaflow hit: 0.00
>miss with success upcall: 0
> @@ -226,12 +227,13 @@ AT_CHECK([cat ovs-vswitchd.log | filter_flow_install
> | strip_xout], [0], [dnl
> recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(src=50:54:00:00:00:77,dst=50
> :54:00:00:01:78),eth_type(0x0800),ipv4(frag=no), actions: 
>  ])
> 
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed
> SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 8], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed
> +SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 9], [0], [dnl
>  pmd thread numa_id  core_id :
>packets received: 20
>packet recirculations: 0
>avg. datapath passes per packet: 1.00
>emc hits: 19
> +  smc hits: 0
>megaflow hits: 0
>avg. subtable lookups per megaflow hit: 0.00
>miss with success upcall: 1
> --
> 2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v5 1/2] dpif-netdev: Add SMC cache after EMC cache

2018-07-11 Thread O Mahony, Billy
Acked-by: Billy O'Mahony 

> -Original Message-
> From: Wang, Yipeng1
> Sent: Tuesday, July 10, 2018 11:14 AM
> To: d...@openvswitch.org; jan.scheur...@ericsson.com; O Mahony, Billy
> 
> Cc: Wang, Yipeng1 ; Stokes, Ian
> ; b...@ovn.org
> Subject: [PATCH v5 1/2] dpif-netdev: Add SMC cache after EMC cache
> 
> This patch adds a signature match cache (SMC) after exact match cache (EMC).
> The difference between SMC and EMC is SMC only stores a signature of a flow
> thus it is much more memory efficient. With same memory space, EMC can
> store 8k flows while SMC can store 1M flows. It is generally beneficial to 
> turn on
> SMC but turn off EMC when traffic flow count is much larger than EMC size.
> 
> SMC cache will map a signature to an dp_netdev_flow index in flow_table. Thus,
> we add two new APIs in cmap for lookup key by index and lookup index by key.
> 
> For now, SMC is an experimental feature that it is turned off by default. One 
> can
> turn it on using ovsdb options.
> 
> Signed-off-by: Yipeng Wang 
> Co-authored-by: Jan Scheurich 
> Signed-off-by: Jan Scheurich 
> ---
>  Documentation/topics/dpdk/bridge.rst |  15 ++
>  NEWS |   2 +
>  lib/cmap.c   |  74 
>  lib/cmap.h   |  11 ++
>  lib/dpif-netdev-perf.h   |   1 +
>  lib/dpif-netdev.c| 329 
> +++
>  tests/pmd.at |   1 +
>  vswitchd/vswitch.xml |  13 ++
>  8 files changed, 409 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/bridge.rst
> b/Documentation/topics/dpdk/bridge.rst
> index 63f8a62..df74c02 100644
> --- a/Documentation/topics/dpdk/bridge.rst
> +++ b/Documentation/topics/dpdk/bridge.rst
> @@ -102,3 +102,18 @@ For certain traffic profiles with many parallel flows, 
> it's
> recommended to set  ``N`` to '0' to achieve higher forwarding performance.
> 
>  For more information on the EMC refer to :doc:`/intro/install/dpdk` .
> +
> +
> +SMC cache (experimental)
> +------------------------
> +
> +SMC cache or signature match cache is a new cache level after EMC cache.
> +The difference between SMC and EMC is SMC only stores a signature of a
> +flow thus it is much more memory efficient. With same memory space, EMC
> +can store 8k flows while SMC can store 1M flows. When traffic flow
> +count is much larger than EMC size, it is generally beneficial to turn
> +off EMC and turn on SMC. It is currently turned off by default and an
> experimental feature.
> +
> +To turn on SMC::
> +
> +$ ovs-vsctl --no-wait set Open_vSwitch .
> + other_config:smc-enable=true
> diff --git a/NEWS b/NEWS
> index 92e9b92..f30a1e0 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -44,6 +44,8 @@ Post-v2.9.0
>   ovs-appctl dpif-netdev/pmd-perf-show
>   * Supervision of PMD performance metrics and logging of suspicious
> iterations
> + * Add signature match cache (SMC) as experimental feature. When turned
> on,
> +   it improves throughput when traffic has many more flows than EMC size.
> - ERSPAN:
>   * Implemented ERSPAN protocol (draft-foschiano-erspan-00.txt) for
> both kernel datapath and userspace datapath.
> diff --git a/lib/cmap.c b/lib/cmap.c
> index 07719a8..cb9cd32 100644
> --- a/lib/cmap.c
> +++ b/lib/cmap.c
> @@ -373,6 +373,80 @@ cmap_find(const struct cmap *cmap, uint32_t hash)
> hash);
>  }
> 
> +/* Find a node by the index of the entry of cmap. Index N means the
> +N/CMAP_K
> + * bucket and N%CMAP_K entry in that bucket.
> + * Notice that it is not protected by the optimistic lock (versioning)
> +because
> + * it does not compare the hashes. Currently it is only used by the
> +datapath
> + * SMC cache.
> + *
> + * Return node for the entry of index or NULL if the index beyond
> +boundary */ const struct cmap_node * cmap_find_by_index(const struct
> +cmap *cmap, uint32_t index) {
> +const struct cmap_impl *impl = cmap_get_impl(cmap);
> +
> +uint32_t b = index / CMAP_K;
> +uint32_t e = index % CMAP_K;
> +
> +if (b > impl->mask) {
> +return NULL;
> +}
> +
> +const struct cmap_bucket *bucket = &impl->buckets[b];
> +
> +return cmap_node_next(&bucket->nodes[e]);
> +}
> +
> +/* Find the index of certain hash value. Currently only used by the
> +datapath
> + * SMC cache.
> + *
> + * Return the index of the entry if found, or UINT32_MAX if not found.
> +The
> + * function assumes entry index cannot be larger than UINT32_MAX. */
> +uint32_t cmap_find_index(const struct cmap *cmap, uint32_t hash) {

Re: [ovs-dev] [PATCH v4 1/2] dpif-netdev: Add SMC cache after EMC cache

2018-07-05 Thread O Mahony, Billy
Hi Yipeng,

Some further comments below. Mainly to do with readability and understanding of 
the changes.

Regards,
Billy.

> -Original Message-
> From: Wang, Yipeng1
> Sent: Friday, June 29, 2018 6:53 PM
> To: d...@openvswitch.org
> Cc: Wang, Yipeng1 ; jan.scheur...@ericsson.com;
> Stokes, Ian ; O Mahony, Billy
> ; Loftus, Ciara 
> Subject: [PATCH v4 1/2] dpif-netdev: Add SMC cache after EMC cache
> 
> This patch adds a signature match cache (SMC) after exact match cache (EMC).
> The difference between SMC and EMC is SMC only stores a signature of a flow
> thus it is much more memory efficient. With same memory space, EMC can
> store 8k flows while SMC can store 1M flows. It is generally beneficial to 
> turn on
> SMC but turn off EMC when traffic flow count is much larger than EMC size.
> 
> SMC cache will map a signature to an netdev_flow index in flow_table. Thus, we
> add two new APIs in cmap for lookup key by index and lookup index by key.
> 
> For now, SMC is an experimental feature that it is turned off by default. One 
> can
> turn it on using ovsdb options.
> 
> Signed-off-by: Yipeng Wang 
> ---
>  Documentation/topics/dpdk/bridge.rst |  15 ++
>  NEWS |   2 +
>  lib/cmap.c   |  73 +
>  lib/cmap.h   |   5 +
>  lib/dpif-netdev-perf.h   |   1 +
>  lib/dpif-netdev.c| 310 
> ++-
>  tests/pmd.at |   1 +
>  vswitchd/vswitch.xml |  13 ++
>  8 files changed, 383 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/bridge.rst
> b/Documentation/topics/dpdk/bridge.rst
> index 63f8a62..df74c02 100644
> --- a/Documentation/topics/dpdk/bridge.rst
> +++ b/Documentation/topics/dpdk/bridge.rst
> @@ -102,3 +102,18 @@ For certain traffic profiles with many parallel flows, 
> it's
> recommended to set  ``N`` to '0' to achieve higher forwarding performance.
> 
>  For more information on the EMC refer to :doc:`/intro/install/dpdk` .
> +
> +
> +SMC cache (experimental)
> +------------------------
> +
> +SMC cache or signature match cache is a new cache level after EMC cache.
> +The difference between SMC and EMC is SMC only stores a signature of a
> +flow thus it is much more memory efficient. With same memory space, EMC
> +can store 8k flows while SMC can store 1M flows. When traffic flow
> +count is much larger than EMC size, it is generally beneficial to turn
> +off EMC and turn on SMC. It is currently turned off by default and an
> experimental feature.
> +
> +To turn on SMC::
> +
> +$ ovs-vsctl --no-wait set Open_vSwitch .
> + other_config:smc-enable=true
> diff --git a/NEWS b/NEWS
> index cd15a33..26d6ef1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -40,6 +40,8 @@ Post-v2.9.0
>   ovs-appctl dpif-netdev/pmd-perf-show
>   * Supervision of PMD performance metrics and logging of suspicious
> iterations
> + * Add signature match cache (SMC) as experimental feature. When turned
> on,
> +   it improves throughput when traffic has many more flows than EMC size.
> - ERSPAN:
>   * Implemented ERSPAN protocol (draft-foschiano-erspan-00.txt) for
> both kernel datapath and userspace datapath.
> diff --git a/lib/cmap.c b/lib/cmap.c
> index 07719a8..db1c806 100644
> --- a/lib/cmap.c
> +++ b/lib/cmap.c
> @@ -373,6 +373,79 @@ cmap_find(const struct cmap *cmap, uint32_t hash)
> hash);
>  }
> 
> +/* Find a node by the index of the entry of cmap. For example, index 7
> +means
> + * the second bucket and the third item.
> + * Notice that it is not protected by the optimistic lock (versioning)
> +because
> + * of performance reasons. Currently it is only used by the datapath DFC 
> cache.
> + *
> + * Return node for the entry of index or NULL if the index beyond
> +boundary */ const struct cmap_node * cmap_find_by_index(const struct
> +cmap *cmap, uint16_t index) {
> +const struct cmap_impl *impl = cmap_get_impl(cmap);
> +
> +uint32_t b = index / CMAP_K;
> +uint32_t e = index % CMAP_K;
> +
> +if (b > impl->mask) {
> +return NULL;
> +}
> +
> +const struct cmap_bucket *bucket = &impl->buckets[b];
> +
> +return cmap_node_next(&bucket->nodes[e]);
> +}
> +
> +/* Find the index of certain hash value. Currently only used by the
> +datapath
> + * DFC cache.
> + *
> + * Return the index of the entry if found, or UINT32_MAX if not found
[[BO'M]] An intro to the concept of index would be useful here, especially as it 
does not currently exist in cmap. Something like: "The 'index' o

Re: [ovs-dev] [PATCH v4 1/2] dpif-netdev: Add SMC cache after EMC cache

2018-07-04 Thread O Mahony, Billy
Hi,

I've checked the latest patch and the performance results I get are similar to 
the ones given in the previous patches. Also, enabling/disabling the DFC on the 
fly works as expected.

The main query I have regards the slow sweep for SMC

[[BO'M]] The slow sweep removes EMC entries that are no longer valid because 
the associated dpcls rule has been changed or has expired. Is there a mechanism 
to remove SMC entries associated with changed/expired dpcls rules?

Further comments, queries inline. Review not complete yet, I'll try to finish 
off tomorrow.

Regards,
Billy.


> -Original Message-
> From: Wang, Yipeng1
> Sent: Friday, June 29, 2018 6:53 PM
> To: d...@openvswitch.org
> Cc: Wang, Yipeng1 ; jan.scheur...@ericsson.com;
> Stokes, Ian ; O Mahony, Billy
> ; Loftus, Ciara 
> Subject: [PATCH v4 1/2] dpif-netdev: Add SMC cache after EMC cache
> 
> This patch adds a signature match cache (SMC) after exact match cache (EMC).
> The difference between SMC and EMC is SMC only stores a signature of a flow
> thus it is much more memory efficient. With same memory space, EMC can
> store 8k flows while SMC can store 1M flows. It is generally beneficial to 
> turn on
> SMC but turn off EMC when traffic flow count is much larger than EMC size.
> 
> SMC cache will map a signature to an netdev_flow index in flow_table. Thus, we
> add two new APIs in cmap for lookup key by index and lookup index by key.
> 
> For now, SMC is an experimental feature that it is turned off by default. One 
> can
> turn it on using ovsdb options.
> 
> Signed-off-by: Yipeng Wang 
> ---
>  Documentation/topics/dpdk/bridge.rst |  15 ++
>  NEWS |   2 +
>  lib/cmap.c   |  73 +
>  lib/cmap.h   |   5 +
>  lib/dpif-netdev-perf.h   |   1 +
>  lib/dpif-netdev.c| 310 
> ++-
>  tests/pmd.at |   1 +
>  vswitchd/vswitch.xml |  13 ++
>  8 files changed, 383 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/bridge.rst
> b/Documentation/topics/dpdk/bridge.rst
> index 63f8a62..df74c02 100644
> --- a/Documentation/topics/dpdk/bridge.rst
> +++ b/Documentation/topics/dpdk/bridge.rst
> @@ -102,3 +102,18 @@ For certain traffic profiles with many parallel flows, 
> it's
> recommended to set  ``N`` to '0' to achieve higher forwarding performance.
> 
>  For more information on the EMC refer to :doc:`/intro/install/dpdk` .
> +
> +
> +SMC cache (experimental)
> +------------------------
> +
> +SMC cache or signature match cache is a new cache level after EMC cache.
> +The difference between SMC and EMC is SMC only stores a signature of a
> +flow thus it is much more memory efficient. With same memory space, EMC
> +can store 8k flows while SMC can store 1M flows. When traffic flow
> +count is much larger than EMC size, it is generally beneficial to turn
> +off EMC and turn on SMC. It is currently turned off by default and an
> experimental feature.
> +
> +To turn on SMC::
> +
> +$ ovs-vsctl --no-wait set Open_vSwitch .
> + other_config:smc-enable=true
> diff --git a/NEWS b/NEWS
> index cd15a33..26d6ef1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -40,6 +40,8 @@ Post-v2.9.0
>   ovs-appctl dpif-netdev/pmd-perf-show
>   * Supervision of PMD performance metrics and logging of suspicious
> iterations
> + * Add signature match cache (SMC) as experimental feature. When turned
> on,
> +   it improves throughput when traffic has many more flows than EMC size.
> - ERSPAN:
>   * Implemented ERSPAN protocol (draft-foschiano-erspan-00.txt) for
> both kernel datapath and userspace datapath.
> diff --git a/lib/cmap.c b/lib/cmap.c
> index 07719a8..db1c806 100644
> --- a/lib/cmap.c
> +++ b/lib/cmap.c
> @@ -373,6 +373,79 @@ cmap_find(const struct cmap *cmap, uint32_t hash)
> hash);
>  }
> 
> +/* Find a node by the index of the entry of cmap. For example, index 7
> +means
> + * the second bucket and the third item.
[[BO'M]] Is this assuming 4 for the bucket size? Maybe explicitly add where 
the bucket size is coming from – CMAP_K? If so, is that value not 5 for a 64-bit 
system?

> + * Notice that it is not protected by the optimistic lock (versioning)
> +because
> + * of performance reasons. Currently it is only used by the datapath DFC 
> cache.
> + *
> + * Return node for the entry of index or NULL if the index beyond
> +boundary */ const struct cmap_node * cmap_find_by_index(const struct
> +cmap *cmap, uint16_t index) {
> +const struct cmap_impl *impl = cmap_get_impl(cmap);
> +
> +uint3

Re: [ovs-dev] [PATCH v3 0/6] dpif-netdev: Combine CD and DFC patch for datapath refactor

2018-06-22 Thread O Mahony, Billy
I have replicated some of the test scenarios described below and can confirm the 
performance improvements.

I hope to get some time to review the code itself in the next week. 

Regards,
Billy.

> -Original Message-
> From: Wang, Yipeng1
> Sent: Tuesday, May 15, 2018 5:13 PM
> To: d...@openvswitch.org
> Cc: b...@ovn.org; jan.scheur...@ericsson.com; u9012...@gmail.com; Stokes,
> Ian ; O Mahony, Billy ;
> Wang, Yipeng1 ; Gobriel, Sameh
> ; Tai, Charlie 
> Subject: [PATCH v3 0/6] dpif-netdev: Combine CD and DFC patch for datapath
> refactor
> 
> This patch set is the V3 implementation to combine the CD and DFC design.
> Both patches intend to refactor datapath to avoid costly sequential subtable
> search.
> 
> CD and DFC patch sets:
> CD: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-October/340305.html
> 
> DFC: [PATCH] dpif-netdev: Refactor datapath flow cache
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341066.html
> 
> 1. The first commit is a rebase of Jan Scheurich's patch of [PATCH] 
> dpif-netdev:
> Refactor datapath flow cache with a couple of bug fixes. The patch include EMC
> improvements together with the new DFC structure.
> 
> 2. The second commit is to incorporate CD's way-associative design into DFC to
> improve the hit rate.
> 
> 3. The third commit is to change the distributor to cache an index of 
> flow_table
> entry to improve memory efficiency.
> 
> 4. The fourth commit is to split DFC into EMC and SMC for better organization.
> Also the lookup function is rewritten to do batching processing.
> 
> 5. The fifth commit is to automatically turn off DFC/CD when there is a very
> large number of megaflows.
> 
> 6. The sixth commit modifies a unit test to avoid failure.
> 
> We did a phy-2-phy test to evaluate the performance improvement with this
> patch set. The traffic pattern we use is based on Billy's original TREX 
> script:
> https://mail.openvswitch.org/pipermail/ovs-dev/2018-March/345032.html
> 
> We augment the script to generate power law distribution of flows to have
> different bandwidth and to access different subtables.
> 
> For example, there are n flows each has bandwidth of w, while n/4 flows each
> has bandwidth of 2w, while n/9 flows each has bandwidth of 3w, and so on
> (Power Law distribution, y = Cx^-2). For subtable, the second most accessed
> subtable has
> 1/2 accesses of the first most accessed subtable, the third most accessed
> subtable has 1/3 accesses of the first most accessed subtable and so on 
> (Zipf's
> law).
> 
> The CD/DFC size is 1 million entries. The speedup results are listed below:
> 
> #flow      #subtable    speedup
> 1000       1            1.015523746
> 1000       5            1.032199838
> 1000       10           1.050814738
> 1000       20           1.081794454
> 10000      1            1.201704118
> 10000      5            1.31634144
> 10000      10           1.402493331
> 10000      20           1.531133279
> 100000     1            1.11088487
> 100000     5            1.458748559
> 100000     10           1.683044348
> 100000     20           2.034441401
> 1000000    1            1.004339563
> 1000000    5            1.256745291
> 1000000    10           1.444329892
> 1000000    20           1.666275853
> 
> Both flow traffic and subtable accesses are skewed. The table shows the total
> number.
> The most performance improvement happens when flow can totally hit DFC/CD
> thus bypass the megaflow cache, and when there are multiple subtables.
> When all flows hit EMC or flow count is larger than CD/DFC size, the
> performance improvement reduces.
> 
> v2->v3:
> 1. Add the 5th commit: it is to automatically turn off DFC/CD when the number
> of megaflow is larger than 2^16 since we use 16bits in the distributor to 
> index
> megaflows.
> 2. Add the 6th commit: since the pmd stats now print out the DFC/CD statistics
> one of the unit test has mismatch output. This commit fixed this issue.
> 3. In first commit, the char key[248] array is changed to uint64_t key[31]
> because of the OSX compilation warning that char array is 1 byte alligned 
> while
> 8-byte alignment is required during type conversion.
> 
> 
> v1->v2:
> 1. Add comment and follow code style for cmap code (Ben's comment) 2. Fix a
> bug in the first commit that fails multiple unit tests. Since DFC is
>per PMD not per port, the port mask should be included in rule.
> 3. Added commit 4. This commit separates DFC to be EMC cache and SMC
> (signature
>match cache) for easier optimization and readability.
> 4. In commit 4, DFC lookup is refactored to do batching lookup.
> 5. Rebase and other min

Re: [ovs-dev] [PATCH v2] OVS-DPDK: Change "dpdk-socket-mem" default value.

2018-05-09 Thread O Mahony, Billy
Thanks, Aaron. Will do.

> -Original Message-
> From: Aaron Conole [mailto:acon...@redhat.com]
> Sent: Tuesday, May 8, 2018 8:35 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>
> Cc: d...@openvswitch.org; Rybka, MarcinX <marcinx.ry...@intel.com>
> Subject: Re: [ovs-dev] [PATCH v2] OVS-DPDK: Change "dpdk-socket-mem"
> default value.
> 
> Billy O'Mahony <billy.o.mah...@intel.com> writes:
> 
> > From: Marcin Rybka <marcinx.ry...@intel.com>
> >
> > When "dpdk-socket-mem" and "dpdk-alloc-mem" are not specified,
> > "dpdk-socket-mem" will be set to allocate 1024MB on each NUMA node.
> > This change will prevent OVS from failing when NIC is attached on NUMA
> > node 1 and higher. Patch contains documentation update.
> >
> > Signed-off-by: Marcin Rybka <marcinx.ry...@intel.com>
> > Co-authored-by:: Billy O'Mahony <billy.o.mah...@intel.com>
> ^^ Needs refactoring.  Should be:
> 
> Co-authored-by: Billy O'Mahony <billy.o.mah...@intel.com>
> Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> 
> > ---
> 
> Thanks for the change, Billy!
> 
> You can ignore my previous email (on v1).
> 
> >  Documentation/intro/install/dpdk.rst |  3 ++-
> >  lib/dpdk.c   | 29 -
> >  vswitchd/vswitch.xml |  7 ---
> >  3 files changed, 34 insertions(+), 5 deletions(-)
> >
> > diff --git a/Documentation/intro/install/dpdk.rst
> > b/Documentation/intro/install/dpdk.rst
> > index fea4890..b68438d 100644
> > --- a/Documentation/intro/install/dpdk.rst
> > +++ b/Documentation/intro/install/dpdk.rst
> > @@ -228,7 +228,8 @@ listed below. Defaults will be provided for all values
> not explicitly set.
> >
> >  ``dpdk-socket-mem``
> >Comma separated list of memory to pre-allocate from hugepages on
> > specific
> > -  sockets.
> > +  sockets. If not specified, 1024 MB will be set for each numa node
> > + by  default.
> >
> >  ``dpdk-hugepage-dir``
> >Directory where hugetlbfs is mounted diff --git a/lib/dpdk.c
> > b/lib/dpdk.c index 00dd974..733c67d 100644
> > --- a/lib/dpdk.c
> > +++ b/lib/dpdk.c
> > @@ -35,6 +35,7 @@
> >  #include "netdev-dpdk.h"
> >  #include "openvswitch/dynamic-string.h"
> >  #include "openvswitch/vlog.h"
> > +#include "ovs-numa.h"
> >  #include "smap.h"
> >
> >  VLOG_DEFINE_THIS_MODULE(dpdk);
> > @@ -163,6 +164,29 @@ construct_dpdk_options(const struct smap
> *ovs_other_config,
> >  return ret;
> >  }
> >
> > +static char *
> > +construct_dpdk_socket_mem(void)
> > +{
> > +int numa = 0;
> > +const char *def_value = "1024";
> > +int numa_nodes = ovs_numa_get_n_numas();
> > +
> > +if (numa_nodes == 0 || numa_nodes == OVS_NUMA_UNSPEC) {
> 
> Not sure why the first leg of the || branch is here.  It can probably be 
> removed
> (or maybe treated as a bigger error?)
> 
> > +numa_nodes = 1;
> > +}
> > +
> > +/* Allocate enough memory for digits, comma-sep and terminator. */
> > +char *dpdk_socket_mem = xzalloc(numa_nodes * (strlen(def_value) +
> > + 1));
> > +
> > +strcat(dpdk_socket_mem, def_value);
> > +for (numa = 1; numa < numa_nodes; ++numa) {
> > +strcat(dpdk_socket_mem, ",");
> > +strcat(dpdk_socket_mem, def_value);
> > +}
> > +
> > +return dpdk_socket_mem;
> > +}
> > +
> >  #define MAX_DPDK_EXCL_OPTS 10
> >
> >  static int
> > @@ -170,6 +194,7 @@ construct_dpdk_mutex_options(const struct smap
> *ovs_other_config,
> >   char ***argv, const int initial_size,
> >   char **extra_args, const size_t
> > extra_argc)  {
> > +char *default_dpdk_socket_mem = construct_dpdk_socket_mem();
> >  struct dpdk_exclusive_options_map {
> >  const char *category;
> >  const char *ovs_dpdk_options[MAX_DPDK_EXCL_OPTS];
> > @@ -180,7 +205,7 @@ construct_dpdk_mutex_options(const struct smap
> *ovs_other_config,
> >  {"memory type",
> >   {"dpdk-alloc-mem", "dpdk-socket-mem", NULL,},
> >   {"-m", "--socket-mem",NULL,},
> > - "1024,0", 1
> > + default_dpdk_socket_mem, 1
> >  },
> >  };
> >

Re: [ovs-dev] [PATCH v12 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-04-20 Thread O Mahony, Billy
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Thursday, April 19, 2018 6:41 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>; Jan
> Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v12 3/3] dpif-netdev: Detection and logging of suspicious PMD
> iterations
> 
> This patch enhances dpif-netdev-perf to detect iterations with suspicious
> statistics according to the following criteria:
> 
> - iteration lasts longer than US_THR microseconds (default 250).
>   This can be used to capture events where a PMD is blocked or
>   interrupted for such a period of time that there is a risk for
>   dropped packets on any of its Rx queues.
> 
> - max vhost qlen exceeds a threshold Q_THR (default 128). This can
>   be used to infer virtio queue overruns and dropped packets inside
>   a VM, which are not visible in OVS otherwise.
> 
> Such suspicious iterations can be logged together with their iteration 
> statistics
> to be able to correlate them to packet drop or other events outside OVS.
> 
> A new command is introduced to enable/disable logging at run-time and to
> adjust the above thresholds for suspicious iterations:
> 
> ovs-appctl dpif-netdev/pmd-perf-log-set on | off
> [-b before] [-a after] [-e|-ne] [-us usec] [-q qlen]
> 
> Turn logging on or off at run-time (on|off).
> 
> -b before:  The number of iterations before the suspicious iteration to
> be logged (default 5).
> -a after:   The number of iterations after the suspicious iteration to
> be logged (default 5).
> -e: Extend logging interval if another suspicious iteration is
> detected before logging occurs.
> -ne:Do not extend logging interval (default).
> -q qlen:Suspicious vhost queue fill level threshold. Increase this
> to 512 if the Qemu supports 1024 virtio queue length.
> (default 128).
> -us usec:   change the duration threshold for a suspicious iteration
> (default 250 us).
> 
> Note: Logging of suspicious iterations itself consumes a considerable amount 
> of
> processing cycles of a PMD which may be visible in the iteration history. In 
> the
> worst case this can lead OVS to detect another suspicious iteration caused by
> logging.
> 
> If more than 100 iterations around a suspicious iteration have been logged 
> once,
> OVS falls back to the safe default values (-b 5/-a 5/-ne) to avoid that 
> logging
> itself causes continuos further logging.
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
> ---
>  NEWS|   2 +
>  lib/dpif-netdev-perf.c  | 223
> 
>  lib/dpif-netdev-perf.h  |  21 +
>  lib/dpif-netdev-unixctl.man |  59 
>  lib/dpif-netdev.c   |   5 +
>  5 files changed, 310 insertions(+)
> 
> diff --git a/NEWS b/NEWS
> index a665c7f..7259492 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -27,6 +27,8 @@ Post-v2.9.0
>   * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single
> PMD
>   * Detailed PMD performance metrics available with new command
>   ovs-appctl dpif-netdev/pmd-perf-show
> + * Supervision of PMD performance metrics and logging of suspicious
> +   iterations
> 
>  v2.9.0 - 19 Feb 2018
>  
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index
> caa0e27..47ce2c2 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -25,6 +25,24 @@
> 
>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
> 
> +#define ITER_US_THRESHOLD 250   /* Warning threshold for iteration duration
> +   in microseconds. */
> +#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */
> +#define LOG_IT_BEFORE 5 /* Number of iterations to log before
> +   suspicious iteration. */
> +#define LOG_IT_AFTER 5  /* Number of iterations to log after
> +   suspicious iteration. */
> +
> +bool log_enabled = false;
> +bool log_extend = false;
> +static uint32_t log_it_before = LOG_IT_BEFORE; static uint32_t
> +log_it_after = LOG_IT_AFTER; static uint32_t log_us_thr =
> +ITER_US_THRESHOLD; uint32_t log_q_thr = VHOST_QUEUE_FULL; uint64_t
> +iter_cycle_threshold;
> +
> +static struct vlog_rate_limit latency_rl = VLOG_RATE_LIMIT_INIT(600,
> +600);
> 

Re: [ovs-dev] [PATCH v12 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-04-20 Thread O Mahony, Billy
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Thursday, April 19, 2018 6:41 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>; Jan
> Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v12 2/3] dpif-netdev: Detailed performance stats for PMDs
> 
> This patch instruments the dpif-netdev datapath to record detailed
> statistics of what is happening in every iteration of a PMD thread.
> 
> The collection of detailed statistics can be controlled by a new
> Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> By default it is disabled. The run-time overhead, when enabled, is
> in the order of 1%.
> 
> The covered metrics per iteration are:
>   - cycles
>   - packets
>   - (rx) batches
>   - packets/batch
>   - max. vhostuser qlen
>   - upcalls
>   - cycles spent in upcalls
> 
> This raw recorded data is used threefold:
> 
> 1. In histograms for each of the following metrics:
>- cycles/iteration (log.)
>- packets/iteration (log.)
>- cycles/packet
>- packets/batch
>- max. vhostuser qlen (log.)
>- upcalls
>- cycles/upcall (log)
>The histograms bins are divided linear or logarithmic.
> 
> 2. A cyclic history of the above statistics for 999 iterations
> 
> 3. A cyclic history of the cummulative/average values per millisecond
>wall clock for the last 1000 milliseconds:
>- number of iterations
>- avg. cycles/iteration
>- packets (Kpps)
>- avg. packets/batch
>- avg. max vhost qlen
>- upcalls
>- avg. cycles/upcall
> 
> The gathered performance metrics can be printed at any time with the
> new CLI command
> 
> ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
> [-pmd core] [dp]
> 
> The options are
> 
> -nh:Suppress the histograms
> -it iter_len:   Display the last iter_len iteration stats
> -ms ms_len: Display the last ms_len millisecond stats
> -pmd core:  Display only the specified PMD
> 
> The performance statistics are reset with the existing
> dpif-netdev/pmd-stats-clear command.
> 
> The output always contains the following global PMD statistics,
> similar to the pmd-stats-show command:
> 
> Time: 15:24:55.270
> Measurement duration: 1.008 s
> 
> pmd thread numa_id 0 core_id 1:
> 
>   Cycles:2419034712  (2.40 GHz)
>   Iterations:572817  (1.76 us/it)
>   - idle:486808  (15.9 % cycles)
>   - busy: 86009  (84.1 % cycles)
>   Rx packets:   2399607  (2381 Kpps, 848 cycles/pkt)
>   Datapath passes:  3599415  (1.50 passes/pkt)
>   - EMC hits:336472  ( 9.3 %)
>   - Megaflow hits:  3262943  (90.7 %, 1.00 subtbl lookups/hit)
>   - Upcalls:  0  ( 0.0 %, 0.0 us/upcall)
>   - Lost upcalls: 0  ( 0.0 %)
>   Tx packets:   2399607  (2381 Kpps)
>   Tx batches:171400  (14.00 pkts/batch)
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
> ---
>  NEWS|   4 +
>  lib/automake.mk |   1 +
>  lib/dpif-netdev-perf.c  | 462
> +++-
>  lib/dpif-netdev-perf.h  | 197 ---
>  lib/dpif-netdev-unixctl.man | 157 +++
>  lib/dpif-netdev.c   | 187 --
>  manpages.mk |   2 +
>  vswitchd/ovs-vswitchd.8.in  |  27 +--
>  vswitchd/vswitch.xml|  12 ++
>  9 files changed, 985 insertions(+), 64 deletions(-)
>  create mode 100644 lib/dpif-netdev-unixctl.man
> 
> diff --git a/NEWS b/NEWS
> index cd4ffbb..a665c7f 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -23,6 +23,10 @@ Post-v2.9.0
> other IPv4/IPv6-based protocols whenever a reject ACL rule is hit.
>   * ACL match conditions can now match on Port_Groups as well as address
> sets that are automatically generated by Port_Groups.
> +   - Userspace datapath:
> + * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single
> PMD
> + * Detailed PMD performance metrics available with new command
> + ovs-appctl dpif-netdev/pmd-perf-show
> 
>  v2.9.0 - 19 Feb 2018
>  
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 915a33b..3276aaa 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -491,6 +491,7 @@ MAN_FRAGMENTS += \
>   lib/dpctl.man

Re: [ovs-dev] [PATCH v12 1/3] netdev: Add optional qfill output parameter to rxq_recv()

2018-04-20 Thread O Mahony, Billy
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Thursday, April 19, 2018 6:41 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>; Jan
> Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v12 1/3] netdev: Add optional qfill output parameter to
> rxq_recv()
> 
> If the caller provides a non-NULL qfill pointer and the netdev 
> implemementation
> supports reading the rx queue fill level, the rxq_recv() function returns the
> remaining number of packets in the rx queue after reception of the packet 
> burst
> to the caller. If the implementation does not support this, it returns 
> -ENOTSUP
> instead. Reading the remaining queue fill level should not substantilly slow 
> down
> the recv() operation.
> 
> A first implementation is provided for ethernet and vhostuser DPDK ports in
> netdev-dpdk.c.
> 
> This output parameter will be used in the upcoming commit for PMD
> performance metrics to supervise the rx queue fill level for DPDK vhostuser
> ports.
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
> ---
>  lib/dpif-netdev.c |  2 +-
>  lib/netdev-bsd.c  |  8 +++-
>  lib/netdev-dpdk.c | 41 -
>  lib/netdev-dummy.c|  8 +++-
>  lib/netdev-linux.c|  7 ++-
>  lib/netdev-provider.h |  8 +++-
>  lib/netdev.c  |  5 +++--
>  lib/netdev.h  |  3 ++-
>  8 files changed, 69 insertions(+), 13 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index be31fd0..7ce3943 
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3277,7 +3277,7 @@ dp_netdev_process_rxq_port(struct
> dp_netdev_pmd_thread *pmd,
>  pmd->ctx.last_rxq = rxq;
>  dp_packet_batch_init(&batch);
> 
> -error = netdev_rxq_recv(rxq->rx, &batch);
> +error = netdev_rxq_recv(rxq->rx, &batch, NULL);
>  if (!error) {
>  /* At least one packet received. */
>  *recirc_depth_get() = 0;
> diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index 05974c1..b70f327 100644
> --- a/lib/netdev-bsd.c
> +++ b/lib/netdev-bsd.c
> @@ -618,7 +618,8 @@ netdev_rxq_bsd_recv_tap(struct netdev_rxq_bsd *rxq,
> struct dp_packet *buffer)  }
> 
>  static int
> -netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch)
> +netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> +int *qfill)
>  {
>  struct netdev_rxq_bsd *rxq = netdev_rxq_bsd_cast(rxq_);
>  struct netdev *netdev = rxq->up.netdev; @@ -643,6 +644,11 @@
> netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch)
>  batch->packets[0] = packet;
>  batch->count = 1;
>  }
> +
> +if (qfill) {
> +*qfill = -ENOTSUP;
> +}
> +
>  return retval;
>  }
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index ee39cbe..a4fc382
> 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1812,13 +1812,13 @@ netdev_dpdk_vhost_update_rx_counters(struct
> netdev_stats *stats,
>   */
>  static int
>  netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> -   struct dp_packet_batch *batch)
> +   struct dp_packet_batch *batch, int *qfill)
>  {
>  struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
>  struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
>  uint16_t nb_rx = 0;
>  uint16_t dropped = 0;
> -int qid = rxq->queue_id;
> +int qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_TXQ;
>  int vid = netdev_dpdk_get_vid(dev);
> 
>  if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured @@ -1826,14
> +1826,23 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>  return EAGAIN;
>  }
> 
> -nb_rx = rte_vhost_dequeue_burst(vid, qid * VIRTIO_QNUM + VIRTIO_TXQ,
> -dev->mp,
> +nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
>  (struct rte_mbuf **) batch->packets,
>  NETDEV_MAX_BURST);
>  if (!nb_rx) {
>  return EAGAIN;
>  }
> 
> +if (qfill) {
> +if (nb_rx == NETDEV_MAX_BURST) {
> +/* The DPDK API returns a uint32_t which often has invalid bits 
> in
> + * the upper 16-bits. Need to restrict the val

Re: [ovs-dev] [PATCH] netdev-dpdk: fix MAC address in port addr example

2018-04-12 Thread O Mahony, Billy
Hi Marcelo,

Apologies. It wasn't clear that you had actual hands-on experience of the 
issue.

Regards,
Billy. 

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of O Mahony, Billy
> Sent: Wednesday, April 11, 2018 10:32 AM
> To: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>;
> d...@openvswitch.org
> Cc: slav...@mellanox.com
> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: fix MAC address in port addr
> example
> 
> Hi Marcelo,
> 
> I haven't used the specific cards referred to in the surrounding documentation
> but I don't think the 'mac' address format is a typo.
> 
> The notation is specific to some vendor NICs that have several Ethernet 
> devices
> sharing a single PCI bus:device.function address. In that case the PCI address
> alone cannot distinguish the Ethernet device to be uses ofr the dpdk port.
> 
> "Some NICs (i.e. Mellanox ConnectX-3) have only one PCI address associated
> with multiple ports. Using a PCI device like above won't work. Instead, below
> usage is suggested::"
> 
> Regards,
> Billy.
> 
> > -Original Message-
> > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> > boun...@openvswitch.org] On Behalf Of Marcelo Ricardo Leitner
> > Sent: Monday, April 9, 2018 6:21 PM
> > To: d...@openvswitch.org
> > Cc: marcelo.leit...@gmail.com; slav...@mellanox.com
> > Subject: [ovs-dev] [PATCH] netdev-dpdk: fix MAC address in port addr
> > example
> >
> > The MAC address is always 6-bytes long, never 7. The extra :01 and :02
> > doesn't belong in there as it doesn't mean selecting one port or another.
> >
> > Instead, use an incrementing MAC address, which is what usually
> > happens on such cards.
> >
> > See-also: http://www.dpdk.org/ml/archives/dev/2018-April/094976.html
> > Fixes: 5e7588186839 ("netdev-dpdk: fix port addition for ports sharing
> > same PCI
> > id")
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>
> > ---
> >  Documentation/howto/dpdk.rst | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/howto/dpdk.rst
> > b/Documentation/howto/dpdk.rst index
> >
> 79b626c76d0dd45381bd75ab867b7815ca941208..69e692f40d500cf65d59d1979
> > e07afa6f99cf903 100644
> > --- a/Documentation/howto/dpdk.rst
> > +++ b/Documentation/howto/dpdk.rst
> > @@ -53,9 +53,9 @@ with multiple ports. Using a PCI device like above
> > won't work. Instead, below  usage is suggested::
> >
> >  $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> > -options:dpdk-devargs="class=eth,mac=00:11:22:33:44:55:01"
> > +options:dpdk-devargs="class=eth,mac=00:11:22:33:44:55"
> >  $ ovs-vsctl add-port br0 dpdk-p1 -- set Interface dpdk-p1 type=dpdk \
> > -options:dpdk-devargs="class=eth,mac=00:11:22:33:44:55:02"
> > +options:dpdk-devargs="class=eth,mac=00:11:22:33:44:56"
> >
> >  Note: such syntax won't support hotplug. The hotplug is supposed to
> > work with future DPDK release, v18.05.
> > --
> > 2.14.3
> >
> > ___
> > dev mailing list
> > d...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] netdev-dpdk: fix MAC address in port addr example

2018-04-11 Thread O Mahony, Billy
Hi Marcelo,

I haven't used the specific cards referred to in the surrounding documentation 
but I don't think the 'mac' address format is a typo.

The notation is specific to some vendor NICs that have several Ethernet devices 
sharing a single PCI bus:device.function address. In that case the PCI address 
alone cannot distinguish the Ethernet device to be used for the dpdk port.

"Some NICs (i.e. Mellanox ConnectX-3) have only one PCI address associated
with multiple ports. Using a PCI device like above won't work. Instead, below
usage is suggested::"
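
For example, a minimal sketch of the two forms (the PCI address and MAC below
are placeholders only, not taken from any real setup):

    # Ordinary NIC - one Ethernet device per PCI address:
    $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
        options:dpdk-devargs=0000:01:00.0

    # NIC sharing one PCI address across several ports - select the
    # Ethernet device by its MAC address instead:
    $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
        options:dpdk-devargs="class=eth,mac=00:11:22:33:44:55"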

Regards,
Billy. 

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Marcelo Ricardo Leitner
> Sent: Monday, April 9, 2018 6:21 PM
> To: d...@openvswitch.org
> Cc: marcelo.leit...@gmail.com; slav...@mellanox.com
> Subject: [ovs-dev] [PATCH] netdev-dpdk: fix MAC address in port addr example
> 
> The MAC address is always 6-bytes long, never 7. The extra :01 and :02 doesn't
> belong in there as it doesn't mean selecting one port or another.
> 
> Instead, use an incrementing MAC address, which is what usually happens on
> such cards.
> 
> See-also: http://www.dpdk.org/ml/archives/dev/2018-April/094976.html
> Fixes: 5e7588186839 ("netdev-dpdk: fix port addition for ports sharing same 
> PCI
> id")
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  Documentation/howto/dpdk.rst | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
> index
> 79b626c76d0dd45381bd75ab867b7815ca941208..69e692f40d500cf65d59d1979
> e07afa6f99cf903 100644
> --- a/Documentation/howto/dpdk.rst
> +++ b/Documentation/howto/dpdk.rst
> @@ -53,9 +53,9 @@ with multiple ports. Using a PCI device like above won't
> work. Instead, below  usage is suggested::
> 
>  $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> -options:dpdk-devargs="class=eth,mac=00:11:22:33:44:55:01"
> +options:dpdk-devargs="class=eth,mac=00:11:22:33:44:55"
>  $ ovs-vsctl add-port br0 dpdk-p1 -- set Interface dpdk-p1 type=dpdk \
> -options:dpdk-devargs="class=eth,mac=00:11:22:33:44:55:02"
> +options:dpdk-devargs="class=eth,mac=00:11:22:33:44:56"
> 
>  Note: such syntax won't support hotplug. The hotplug is supposed to work with
> future DPDK release, v18.05.
> --
> 2.14.3
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [RFC v2 1/2] ingress scheduling: schema and docs

2018-04-05 Thread O Mahony, Billy


> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Wednesday, April 4, 2018 6:00 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>
> Cc: d...@openvswitch.org; i.maxim...@samsung.com
> Subject: Re: [ovs-dev] [RFC v2 1/2] ingress scheduling: schema and docs
> 
> On Wed, Apr 04, 2018 at 08:17:26AM +, O Mahony, Billy wrote:
> >
> >
> > > -Original Message-
> > > From: Ben Pfaff [mailto:b...@ovn.org]
> > > Sent: Tuesday, April 3, 2018 6:54 PM
> > > To: O Mahony, Billy <billy.o.mah...@intel.com>
> > > Cc: d...@openvswitch.org; i.maxim...@samsung.com
> > > Subject: Re: [ovs-dev] [RFC v2 1/2] ingress scheduling: schema and
> > > docs
> > >
> > > On Tue, Apr 03, 2018 at 09:06:18AM +, O Mahony, Billy wrote:
> > > > Hi Ben,
> > > >
> > > > > -Original Message-
> > > > > From: Ben Pfaff [mailto:b...@ovn.org]
> > > > > Sent: Sunday, April 1, 2018 1:27 AM
> > > > > To: O Mahony, Billy <billy.o.mah...@intel.com>
> > > > > Cc: d...@openvswitch.org; i.maxim...@samsung.com
> > > > > Subject: Re: [ovs-dev] [RFC v2 1/2] ingress scheduling: schema
> > > > > and docs
> > > > >
> > > > > On Wed, Mar 28, 2018 at 11:11:57PM +0100, Billy O'Mahony wrote:
> > > > > > Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> > > > > > ---
> > > > > >  Documentation/howto/dpdk.rst | 18 ++
> > > > > >  vswitchd/vswitch.ovsschema   |  9 +++--
> > > > > >  vswitchd/vswitch.xml | 40
> > > > > 
> > > > > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/Documentation/howto/dpdk.rst
> > > > > > b/Documentation/howto/dpdk.rst index 79b626c..fca353a 100644
> > > > > > --- a/Documentation/howto/dpdk.rst
> > > > > > +++ b/Documentation/howto/dpdk.rst
> > > > > > @@ -237,6 +237,24 @@ respective parameter. To disable the flow
> > > > > > control at
> > > > > tx side, run::
> > > > > >
> > > > > >  $ ovs-vsctl set Interface dpdk-p0
> > > > > > options:tx-flow-ctrl=false
> > > > > >
> > > > > > +Ingress Scheduling
> > > > > > +--
> > > > > > +
> > > > > > +The ingress scheduling feature is described in general in
> > > > > > +``ovs-vswitchd.conf.db (5)``.
> > > > > > +
> > > > > > +Ingress scheduling currently only supports setting a priority
> > > > > > +for incoming packets for an entire interface. Priority levels
> > > > > > +0
> > > > > > +(lowest) to 3 (highest) are supported.  The default priority is 0.
> > > > > > +
> > > > > > +Interfaces of type ``dpdk`` and ``dpdkvhostuserclient``
> > > > > > +support ingress scheduling.
> > > > > > +
> > > > > > +To prioritize packets on a particular port:
> > > > > > +
> > > > > > +$ ovs-vsctl set Interface dpdk0 \
> > > > > > +ingress_sched=port_prio=3
> > > > >
> > > > > I'm happy to see experimentation in this area.  But, since it is
> > > > > specified to particular kinds of interfaces, and because it is
> > > > > likely to evolve in the future, I think I would prefer to see it
> > > > > defined in term of the interface-type-specific "options" field.
> > > > > Does that
> > > make sense?
> > > > >
> > > >
> > > > I did have as interface-type-specific configuration originally
> > > > mainly as it kept
> > > changes within areas I was familiar with. But it was pointed out by
> > > Ilya that there was nothing dpdk specific to this feature and that
> > > by making the configuration more general that we could "provide
> > > ingres scheduling for all the port types at once". In particular it
> > > would be useful for those implementing to implement netdev-netmap.
> > > >
> > > > So that was the motivation behind the generalized configuration at 
> > > > least.
> > >
> > > I think that generalized configurat

Re: [ovs-dev] [RFC v2 1/2] ingress scheduling: schema and docs

2018-04-04 Thread O Mahony, Billy


> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Tuesday, April 3, 2018 6:54 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>
> Cc: d...@openvswitch.org; i.maxim...@samsung.com
> Subject: Re: [ovs-dev] [RFC v2 1/2] ingress scheduling: schema and docs
> 
> On Tue, Apr 03, 2018 at 09:06:18AM +, O Mahony, Billy wrote:
> > Hi Ben,
> >
> > > -Original Message-
> > > From: Ben Pfaff [mailto:b...@ovn.org]
> > > Sent: Sunday, April 1, 2018 1:27 AM
> > > To: O Mahony, Billy <billy.o.mah...@intel.com>
> > > Cc: d...@openvswitch.org; i.maxim...@samsung.com
> > > Subject: Re: [ovs-dev] [RFC v2 1/2] ingress scheduling: schema and
> > > docs
> > >
> > > On Wed, Mar 28, 2018 at 11:11:57PM +0100, Billy O'Mahony wrote:
> > > > Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> > > > ---
> > > >  Documentation/howto/dpdk.rst | 18 ++
> > > >  vswitchd/vswitch.ovsschema   |  9 +++--
> > > >  vswitchd/vswitch.xml | 40
> > > 
> > > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/Documentation/howto/dpdk.rst
> > > > b/Documentation/howto/dpdk.rst index 79b626c..fca353a 100644
> > > > --- a/Documentation/howto/dpdk.rst
> > > > +++ b/Documentation/howto/dpdk.rst
> > > > @@ -237,6 +237,24 @@ respective parameter. To disable the flow
> > > > control at
> > > tx side, run::
> > > >
> > > >  $ ovs-vsctl set Interface dpdk-p0 options:tx-flow-ctrl=false
> > > >
> > > > +Ingress Scheduling
> > > > +--
> > > > +
> > > > +The ingress scheduling feature is described in general in
> > > > +``ovs-vswitchd.conf.db (5)``.
> > > > +
> > > > +Ingress scheduling currently only supports setting a priority for
> > > > +incoming packets for an entire interface. Priority levels 0
> > > > +(lowest) to 3 (highest) are supported.  The default priority is 0.
> > > > +
> > > > +Interfaces of type ``dpdk`` and ``dpdkvhostuserclient`` support
> > > > +ingress scheduling.
> > > > +
> > > > +To prioritize packets on a particular port:
> > > > +
> > > > +$ ovs-vsctl set Interface dpdk0 \
> > > > +ingress_sched=port_prio=3
> > >
> > > I'm happy to see experimentation in this area.  But, since it is
> > > specified to particular kinds of interfaces, and because it is
> > > likely to evolve in the future, I think I would prefer to see it
> > > defined in term of the interface-type-specific "options" field.  Does that
> make sense?
> > >
> >
> > I did have as interface-type-specific configuration originally mainly as it 
> > kept
> changes within areas I was familiar with. But it was pointed out by Ilya that
> there was nothing dpdk specific to this feature and that by making the
> configuration more general that we could "provide ingres scheduling for all 
> the
> port types at once". In particular it would be useful for those implementing 
> to
> implement netdev-netmap.
> >
> > So that was the motivation behind the generalized configuration at least.
> 
> I think that generalized configuration can make sense, but I wonder what the
> overall universe of ingress scheduler configuration looks like.  Are there 
> multi-
> vendor examples of how different software or hardware do ingress QoS?
A few vendors have started supporting the DPDK rte_flow API, albeit just for 
classification offload, but it could become a future standard offload API. For 
this RFC, which is just per-port granularity, the configuration only affects 
the datapath.
> 
> I find myself wondering if we should just add an "ingress_qos" column that
> points to a QoS row.  Probably, there would initially only be one ingress QoS
> type, and it might not have any queues to define, but this would at least be 
> quite
> extensible and fairly familiar to users.  (It could also obsolete the
> ingress_policing_rate and ingress_policing_burst
> columns.)
I think that would be a good solution too. Another reason for moving the config 
from Interface options was that the configurations were potentially quite large 
and a separate column would be cleaner. So a separate table would also work 
well. Although for just prioritizing on a per-port basis the options field 
would be perfectly fine.
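
To make that concrete, a purely hypothetical sketch of what it might look like,
modelled on how egress QoS rows are created today (the 'ingress_qos' column and
the 'ingress-prio' type below do not exist, they are only for discussion):

    $ ovs-vsctl set Interface dpdk0 ingress_qos=@iq -- \
        --id=@iq create qos type=ingress-prio other_config:prio=3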
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [RFC v2 1/2] ingress scheduling: schema and docs

2018-04-03 Thread O Mahony, Billy
Hi Ben,

> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Sunday, April 1, 2018 1:27 AM
> To: O Mahony, Billy <billy.o.mah...@intel.com>
> Cc: d...@openvswitch.org; i.maxim...@samsung.com
> Subject: Re: [ovs-dev] [RFC v2 1/2] ingress scheduling: schema and docs
> 
> On Wed, Mar 28, 2018 at 11:11:57PM +0100, Billy O'Mahony wrote:
> > Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> > ---
> >  Documentation/howto/dpdk.rst | 18 ++
> >  vswitchd/vswitch.ovsschema   |  9 +++--
> >  vswitchd/vswitch.xml | 40
> 
> >  3 files changed, 65 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/howto/dpdk.rst
> > b/Documentation/howto/dpdk.rst index 79b626c..fca353a 100644
> > --- a/Documentation/howto/dpdk.rst
> > +++ b/Documentation/howto/dpdk.rst
> > @@ -237,6 +237,24 @@ respective parameter. To disable the flow control at
> tx side, run::
> >
> >  $ ovs-vsctl set Interface dpdk-p0 options:tx-flow-ctrl=false
> >
> > +Ingress Scheduling
> > +--
> > +
> > +The ingress scheduling feature is described in general in
> > +``ovs-vswitchd.conf.db (5)``.
> > +
> > +Ingress scheduling currently only supports setting a priority for
> > +incoming packets for an entire interface. Priority levels 0 (lowest)
> > +to 3 (highest) are supported.  The default priority is 0.
> > +
> > +Interfaces of type ``dpdk`` and ``dpdkvhostuserclient`` support
> > +ingress scheduling.
> > +
> > +To prioritize packets on a particular port:
> > +
> > +$ ovs-vsctl set Interface dpdk0 \
> > +ingress_sched=port_prio=3
> 
> I'm happy to see experimentation in this area.  But, since it is specified to
> particular kinds of interfaces, and because it is likely to evolve in the 
> future, I
> think I would prefer to see it defined in term of the interface-type-specific
> "options" field.  Does that make sense?
> 

I did have it as interface-type-specific configuration originally, mainly as it 
kept changes within areas I was familiar with. But it was pointed out by Ilya 
that there was nothing dpdk-specific to this feature and that by making the 
configuration more general we could "provide ingress scheduling for all the 
port types at once". In particular it would be useful for those implementing 
netdev-netmap.

So that was the motivation behind the generalized configuration at least.
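
For comparison, the two styles of configuration being discussed would look
something like this (neither is in any released OvS; the second form in
particular is just a hypothetical illustration of the 'options' approach):

    # Generic column, as in this RFC:
    $ ovs-vsctl set Interface dpdk0 ingress_sched=port_prio=3

    # Interface-type-specific alternative:
    $ ovs-vsctl set Interface dpdk0 options:ingress_sched=port_prio=3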

I'm still a little way from getting the time to move this to a full v1 - I 
intend to  characterize the performance more first. So more input from the 
community is welcome in the interim. 

Thanks,
Billy.


> Thanks,
> 
> Ben.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] netdev-dpdk: Remove 'error' from non error log.

2018-03-23 Thread O Mahony, Billy
Sounds like a job for VLOG_WARN ? 
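
Or, if the concern is losing the DPDK return code, it could be kept at debug
level - an illustrative sketch only, reusing the variables from the hunk quoted
below:

    /* Operator-facing message, without the 'error' wording. */
    VLOG_INFO("Interface %s unable to setup txq(%d)", dev->up.name, i);
    /* Keep the DPDK return code available for debugging. */
    VLOG_DBG("Interface %s txq(%d) setup returned: %s",
             dev->up.name, i, rte_strerror(-diag));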

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Stokes, Ian
> Sent: Thursday, March 22, 2018 2:52 PM
> To: Kevin Traynor ; d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Remove 'error' from non error log.
> 
> > Presently, if OVS tries to setup more queues than are allowed by a
> > specific NIC, OVS will handle this case by retrying with a lower
> > amount of queues.
> >
> > Rather than reporting initial failed queue setups in the logs as
> > ERROR, they are reported as INFO but contain the word 'error'. Unless
> > a user has detailed knowledge of OVS-DPDK workings, this is confusing.
> >
> > Let's remove 'error' and the DPDK error code from the INFO log.
> >
> > Signed-off-by: Kevin Traynor 
> > ---
> >  lib/netdev-dpdk.c | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
> > af9843a..2032712
> > 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -729,6 +729,6 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev,
> > int n_rxq, int n_txq)
> >dev->socket_id, NULL);
> >  if (diag) {
> > -VLOG_INFO("Interface %s txq(%d) setup error: %s",
> > -  dev->up.name, i, rte_strerror(-diag));
> > +VLOG_INFO("Interface %s unable to setup txq(%d)",
> > +  dev->up.name, i);
> 
> I agree with removing error from the info message but is it worth retaining
> the DPDK error code for debugging somewhere? Maybe in a separate debug log?
> 
> I'm just thinking are there other cases where the error code will help 
> decipher
> why the operation fails (device busy, operation not supported) for tx and rx
> queue setup?
> 
> Ian
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v10 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-03-18 Thread O Mahony, Billy
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Sunday, March 18, 2018 5:55 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>; Jan
> Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v10 3/3] dpif-netdev: Detection and logging of suspicious PMD
> iterations
> 
> This patch enhances dpif-netdev-perf to detect iterations with suspicious
> statistics according to the following criteria:
> 
> - iteration lasts longer than US_THR microseconds (default 250).
>   This can be used to capture events where a PMD is blocked or
>   interrupted for such a period of time that there is a risk for
>   dropped packets on any of its Rx queues.
> 
> - max vhost qlen exceeds a threshold Q_THR (default 128). This can
>   be used to infer virtio queue overruns and dropped packets inside
>   a VM, which are not visible in OVS otherwise.
> 
> Such suspicious iterations can be logged together with their iteration 
> statistics
> to be able to correlate them to packet drop or other events outside OVS.
> 
> A new command is introduced to enable/disable logging at run-time and to
> adjust the above thresholds for suspicious iterations:
> 
> ovs-appctl dpif-netdev/pmd-perf-log-set on | off
> [-b before] [-a after] [-e|-ne] [-us usec] [-q qlen]
> 
> Turn logging on or off at run-time (on|off).
> 
> -b before:  The number of iterations before the suspicious iteration to
> be logged (default 5).
> -a after:   The number of iterations after the suspicious iteration to
> be logged (default 5).
> -e: Extend logging interval if another suspicious iteration is
> detected before logging occurs.
> -ne:Do not extend logging interval (default).
> -q qlen:Suspicious vhost queue fill level threshold. Increase this
> to 512 if the Qemu supports 1024 virtio queue length.
> (default 128).
> -us usec:   change the duration threshold for a suspicious iteration
> (default 250 us).
> 
> Note: Logging of suspicious iterations itself consumes a considerable amount 
> of
> processing cycles of a PMD which may be visible in the iteration history. In 
> the
> worst case this can lead OVS to detect another suspicious iteration caused by
> logging.
> 
> If more than 100 iterations around a suspicious iteration have been logged 
> once,
> OVS falls back to the safe default values (-b 5/-a 5/-ne) to avoid that 
> logging
> itself causes continuos further logging.
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
> ---
>  NEWS|   2 +
>  lib/dpif-netdev-perf.c  | 201
> 
>  lib/dpif-netdev-perf.h  |  42 +
>  lib/dpif-netdev-unixctl.man |  59 +
>  lib/dpif-netdev.c   |   5 ++
>  5 files changed, 309 insertions(+)
> 
> diff --git a/NEWS b/NEWS
> index 8f66fd3..61148b1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -76,6 +76,8 @@ v2.9.0 - 19 Feb 2018
>   * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single
> PMD
>   * Detailed PMD performance metrics available with new command
>   ovs-appctl dpif-netdev/pmd-perf-show
> + * Supervision of PMD performance metrics and logging of suspicious
> +   iterations
> - vswitchd:
>   * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
>   * Configuring a controller, or unconfiguring all controllers, now 
> deletes diff --
> git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index 2b36410..410a209
> 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -25,6 +25,24 @@
> 
>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
> 
> +#define ITER_US_THRESHOLD 250   /* Warning threshold for iteration duration
> +   in microseconds. */
> +#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */
> +#define LOG_IT_BEFORE 5 /* Number of iterations to log before
> +   suspicious iteration. */
> +#define LOG_IT_AFTER 5  /* Number of iterations to log after
> +   suspicious iteration. */
> +
> +bool log_enabled = false;
> +bool log_extend = false;
> +static uint32_t log_it_before = LOG_IT_BEFORE; static uint32_t
> +log_it_after = LOG_IT_AFTER; static uint32_t log_us_thr =
> +ITER_US_THRESHOLD; uint32_t log_q_thr = VHOST_QUEUE

Re: [ovs-dev] [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-03-18 Thread O Mahony, Billy
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Sunday, March 18, 2018 5:55 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>; Jan
> Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs
> 
> This patch instruments the dpif-netdev datapath to record detailed
> statistics of what is happening in every iteration of a PMD thread.
> 
> The collection of detailed statistics can be controlled by a new
> Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> By default it is disabled. The run-time overhead, when enabled, is
> in the order of 1%.
> 
> The covered metrics per iteration are:
>   - cycles
>   - packets
>   - (rx) batches
>   - packets/batch
>   - max. vhostuser qlen
>   - upcalls
>   - cycles spent in upcalls
> 
> This raw recorded data is used threefold:
> 
> 1. In histograms for each of the following metrics:
>- cycles/iteration (log.)
>- packets/iteration (log.)
>- cycles/packet
>- packets/batch
>- max. vhostuser qlen (log.)
>- upcalls
>- cycles/upcall (log)
>The histograms bins are divided linear or logarithmic.
> 
> 2. A cyclic history of the above statistics for 999 iterations
> 
> 3. A cyclic history of the cummulative/average values per millisecond
>wall clock for the last 1000 milliseconds:
>- number of iterations
>- avg. cycles/iteration
>- packets (Kpps)
>- avg. packets/batch
>- avg. max vhost qlen
>- upcalls
>- avg. cycles/upcall
> 
> The gathered performance metrics can be printed at any time with the
> new CLI command
> 
> ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
> [-pmd core] [dp]
> 
> The options are
> 
> -nh:Suppress the histograms
> -it iter_len:   Display the last iter_len iteration stats
> -ms ms_len: Display the last ms_len millisecond stats
> -pmd core:  Display only the specified PMD
> 
> The performance statistics are reset with the existing
> dpif-netdev/pmd-stats-clear command.
> 
> The output always contains the following global PMD statistics,
> similar to the pmd-stats-show command:
> 
> Time: 15:24:55.270
> Measurement duration: 1.008 s
> 
> pmd thread numa_id 0 core_id 1:
> 
>   Cycles:2419034712  (2.40 GHz)
>   Iterations:572817  (1.76 us/it)
>   - idle:486808  (15.9 % cycles)
>   - busy: 86009  (84.1 % cycles)
>   Rx packets:   2399607  (2381 Kpps, 848 cycles/pkt)
>   Datapath passes:  3599415  (1.50 passes/pkt)
>   - EMC hits:336472  ( 9.3 %)
>   - Megaflow hits:  3262943  (90.7 %, 1.00 subtbl lookups/hit)
>   - Upcalls:  0  ( 0.0 %, 0.0 us/upcall)
>   - Lost upcalls: 0  ( 0.0 %)
>   Tx packets:   2399607  (2381 Kpps)
>   Tx batches:171400  (14.00 pkts/batch)
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
> ---
>  NEWS|   3 +
>  lib/automake.mk |   1 +
>  lib/dpif-netdev-perf.c  | 350
> +++-
>  lib/dpif-netdev-perf.h  | 258 ++--
>  lib/dpif-netdev-unixctl.man | 157 
>  lib/dpif-netdev.c   | 183 +--
>  manpages.mk |   2 +
>  vswitchd/ovs-vswitchd.8.in  |  27 +---
>  vswitchd/vswitch.xml|  12 ++
>  9 files changed, 940 insertions(+), 53 deletions(-)
>  create mode 100644 lib/dpif-netdev-unixctl.man
> 
> diff --git a/NEWS b/NEWS
> index 8d0b502..8f66fd3 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -73,6 +73,9 @@ v2.9.0 - 19 Feb 2018
>   * Add support for vHost dequeue zero copy (experimental)
> - Userspace datapath:
>   * Output packet batching support.
> + * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single
> PMD
> + * Detailed PMD performance metrics available with new command
> + ovs-appctl dpif-netdev/pmd-perf-show
> - vswitchd:
>   * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
>   * Configuring a controller, or unconfiguring all controllers, now 
> deletes
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 5c26e0f..7a5632d 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -484,6

Re: [ovs-dev] [PATCH v10 1/3] netdev: Add optional qfill output parameter to rxq_recv()

2018-03-18 Thread O Mahony, Billy
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Sunday, March 18, 2018 5:55 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>; Jan
> Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v10 1/3] netdev: Add optional qfill output parameter to
> rxq_recv()
> 
> If the caller provides a non-NULL qfill pointer and the netdev 
> implemementation
> supports reading the rx queue fill level, the rxq_recv() function returns the
> remaining number of packets in the rx queue after reception of the packet 
> burst
> to the caller. If the implementation does not support this, it returns 
> -ENOTSUP
> instead. Reading the remaining queue fill level should not substantilly slow 
> down
> the recv() operation.
> 
> A first implementation is provided for ethernet and vhostuser DPDK ports in
> netdev-dpdk.c.
> 
> This output parameter will be used in the upcoming commit for PMD
> performance metrics to supervise the rx queue fill level for DPDK vhostuser
> ports.
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
> ---
>  lib/dpif-netdev.c |  2 +-
>  lib/netdev-bsd.c  |  8 +++-
>  lib/netdev-dpdk.c | 25 +++--
>  lib/netdev-dummy.c|  8 +++-
>  lib/netdev-linux.c|  7 ++-
>  lib/netdev-provider.h |  7 ++-
>  lib/netdev.c  |  5 +++--
>  lib/netdev.h  |  3 ++-
>  8 files changed, 55 insertions(+), 10 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index b07fc6b..86d8739 
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3276,7 +3276,7 @@ dp_netdev_process_rxq_port(struct
> dp_netdev_pmd_thread *pmd,
>  pmd->ctx.last_rxq = rxq;
>  dp_packet_batch_init();
> 
> -error = netdev_rxq_recv(rxq->rx, );
> +error = netdev_rxq_recv(rxq->rx, , NULL);
>  if (!error) {
>  /* At least one packet received. */
>  *recirc_depth_get() = 0;
> diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index 05974c1..b70f327 100644
> --- a/lib/netdev-bsd.c
> +++ b/lib/netdev-bsd.c
> @@ -618,7 +618,8 @@ netdev_rxq_bsd_recv_tap(struct netdev_rxq_bsd *rxq,
> struct dp_packet *buffer)  }
> 
>  static int
> -netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch)
> +netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> +int *qfill)
>  {
>  struct netdev_rxq_bsd *rxq = netdev_rxq_bsd_cast(rxq_);
>  struct netdev *netdev = rxq->up.netdev; @@ -643,6 +644,11 @@
> netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch)
>  batch->packets[0] = packet;
>  batch->count = 1;
>  }
> +
> +if (qfill) {
> +*qfill = -ENOTSUP;
> +}
> +
>  return retval;
>  }
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index af9843a..66f2439
> 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1808,7 +1808,7 @@ netdev_dpdk_vhost_update_rx_counters(struct
> netdev_stats *stats,
>   */
>  static int
>  netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> -   struct dp_packet_batch *batch)
> +   struct dp_packet_batch *batch, int *qfill)
>  {
>  struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
>  struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
> @@ -1846,11 +1846,24 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq
> *rxq,
>  batch->count = nb_rx;
>  dp_packet_batch_init_packet_fields(batch);
> 
> +if (qfill) {
> +if (nb_rx == NETDEV_MAX_BURST) {
> +/* The DPDK API returns a uint32_t which often has invalid bits 
> in
> + * the upper 16-bits. Need to restrict the value to uint16_t. */
> +*qfill = rte_vhost_rx_queue_count(netdev_dpdk_get_vid(dev),
> +  qid * VIRTIO_QNUM + VIRTIO_TXQ)
> +& UINT16_MAX;
> +} else {
> +*qfill = 0;
> +}
> +}
> +
>  return 0;
>  }
> 
>  static int
> -netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch)
> +netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch,
> + int *qfill)
>  {
>  struct netdev_rxq_dpdk *rx = netdev_rxq_dpdk_cast(rxq);
>   

Re: [ovs-dev] [PATCH v8 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-03-12 Thread O Mahony, Billy
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>

Contingent on the '+=' being fixed in patch 1/3 of the series.

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Friday, January 26, 2018 12:20 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>; Jan
> Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v8 2/3] dpif-netdev: Detailed performance stats for PMDs
> 
> This patch instruments the dpif-netdev datapath to record detailed
> statistics of what is happening in every iteration of a PMD thread.
> 
> The collection of detailed statistics can be controlled by a new
> Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> By default it is disabled. The run-time overhead, when enabled, is
> in the order of 1%.
> 
> The covered metrics per iteration are:
>   - cycles
>   - packets
>   - (rx) batches
>   - packets/batch
>   - max. vhostuser qlen
>   - upcalls
>   - cycles spent in upcalls
> 
> This raw recorded data is used threefold:
> 
> 1. In histograms for each of the following metrics:
>- cycles/iteration (log.)
>- packets/iteration (log.)
>- cycles/packet
>- packets/batch
>- max. vhostuser qlen (log.)
>- upcalls
>- cycles/upcall (log)
>The histograms bins are divided linear or logarithmic.
> 
> 2. A cyclic history of the above statistics for 999 iterations
> 
> 3. A cyclic history of the cummulative/average values per millisecond
>wall clock for the last 1000 milliseconds:
>- number of iterations
>- avg. cycles/iteration
>- packets (Kpps)
>- avg. packets/batch
>- avg. max vhost qlen
>- upcalls
>- avg. cycles/upcall
> 
> The gathered performance metrics can be printed at any time with the
> new CLI command
> 
> ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
> [-pmd core] [dp]
> 
> The options are
> 
> -nh:Suppress the histograms
> -it iter_len:   Display the last iter_len iteration stats
> -ms ms_len: Display the last ms_len millisecond stats
> -pmd core:  Display only the specified PMD
> 
> The performance statistics are reset with the existing
> dpif-netdev/pmd-stats-clear command.
> 
> The output always contains the following global PMD statistics,
> similar to the pmd-stats-show command:
> 
> Time: 15:24:55.270
> Measurement duration: 1.008 s
> 
> pmd thread numa_id 0 core_id 1:
> 
>   Cycles:2419034712  (2.40 GHz)
>   Iterations:572817  (1.76 us/it)
>   - idle:486808  (15.9 % cycles)
>   - busy: 86009  (84.1 % cycles)
>   Rx packets:   2399607  (2381 Kpps, 848 cycles/pkt)
>   Datapath passes:  3599415  (1.50 passes/pkt)
>   - EMC hits:336472  ( 9.3 %)
>   - Megaflow hits:  3262943  (90.7 %, 1.00 subtbl lookups/hit)
>   - Upcalls:  0  ( 0.0 %, 0.0 us/upcall)
>   - Lost upcalls: 0  ( 0.0 %)
>   Tx packets:   2399607  (2381 Kpps)
>   Tx batches:171400  (14.00 pkts/batch)
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> ---
>  NEWS|   3 +
>  lib/automake.mk |   1 +
>  lib/dpif-netdev-perf.c  | 350
> +++-
>  lib/dpif-netdev-perf.h  | 258 ++--
>  lib/dpif-netdev-unixctl.man | 157 
>  lib/dpif-netdev.c   | 182 +--
>  manpages.mk |   2 +
>  vswitchd/ovs-vswitchd.8.in  |  27 +---
>  vswitchd/vswitch.xml|  12 ++
>  9 files changed, 939 insertions(+), 53 deletions(-)
>  create mode 100644 lib/dpif-netdev-unixctl.man
> 
> diff --git a/NEWS b/NEWS
> index d7d585b..587b036 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -56,6 +56,9 @@ v2.9.0 - xx xxx 
>   * Add rxq utilization of pmd to appctl 'dpif-netdev/pmd-rxq-show'.
> - Userspace datapath:
>   * Output packet batching support.
> + * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single
> PMD
> + * Detailed PMD performance metrics available with new command
> + ovs-appctl dpif-netdev/pmd-perf-show
> - vswitchd:
>   * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
>   * Configuring a controller, or unconfiguring all controllers, now 
> deletes
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 159319f..54375ea 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -468,6 +468,7 @

Re: [ovs-dev] [PATCH v8 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-03-12 Thread O Mahony, Billy
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>

Contingent on the '+=' being fixed in patch 1/3 of the series.

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Friday, January 26, 2018 12:20 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>; Jan
> Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v8 3/3] dpif-netdev: Detection and logging of suspicious PMD
> iterations
> 
> This patch enhances dpif-netdev-perf to detect iterations with suspicious
> statistics according to the following criteria:
> 
> - iteration lasts longer than US_THR microseconds (default 250).
>   This can be used to capture events where a PMD is blocked or
>   interrupted for such a period of time that there is a risk for
>   dropped packets on any of its Rx queues.
> 
> - max vhost qlen exceeds a threshold Q_THR (default 128). This can
>   be used to infer virtio queue overruns and dropped packets inside
>   a VM, which are not visible in OVS otherwise.
> 
> Such suspicious iterations can be logged together with their iteration 
> statistics
> to be able to correlate them to packet drop or other events outside OVS.
> 
> A new command is introduced to enable/disable logging at run-time and to
> adjust the above thresholds for suspicious iterations:
> 
> ovs-appctl dpif-netdev/pmd-perf-log-set on | off
> [-b before] [-a after] [-e|-ne] [-us usec] [-q qlen]
> 
> Turn logging on or off at run-time (on|off).
> 
> -b before:  The number of iterations before the suspicious iteration to
> be logged (default 5).
> -a after:   The number of iterations after the suspicious iteration to
> be logged (default 5).
> -e: Extend logging interval if another suspicious iteration is
> detected before logging occurs.
> -ne:Do not extend logging interval (default).
> -q qlen:Suspicious vhost queue fill level threshold. Increase this
> to 512 if the Qemu supports 1024 virtio queue length.
> (default 128).
> -us usec:   change the duration threshold for a suspicious iteration
> (default 250 us).
> 
> Note: Logging of suspicious iterations itself consumes a considerable amount 
> of
> processing cycles of a PMD which may be visible in the iteration history. In 
> the
> worst case this can lead OVS to detect another suspicious iteration caused by
> logging.
> 
> If more than 100 iterations around a suspicious iteration have been logged 
> once,
> OVS falls back to the safe default values (-b 5/-a 5/-ne) to avoid that 
> logging
> itself causes continuos further logging.
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> ---
>  NEWS|   2 +
>  lib/dpif-netdev-perf.c  | 201
> 
>  lib/dpif-netdev-perf.h  |  42 +
>  lib/dpif-netdev-unixctl.man |  59 +
>  lib/dpif-netdev.c   |   5 ++
>  5 files changed, 309 insertions(+)
> 
> diff --git a/NEWS b/NEWS
> index 587b036..615a630 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -59,6 +59,8 @@ v2.9.0 - xx xxx 
>   * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single
> PMD
>   * Detailed PMD performance metrics available with new command
>   ovs-appctl dpif-netdev/pmd-perf-show
> + * Supervision of PMD performance metrics and logging of suspicious
> +   iterations
> - vswitchd:
>   * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
>   * Configuring a controller, or unconfiguring all controllers, now 
> deletes diff --
> git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index 43f537e..e6afd07
> 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -25,6 +25,24 @@
> 
>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
> 
> +#define ITER_US_THRESHOLD 250   /* Warning threshold for iteration duration
> +   in microseconds. */
> +#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */
> +#define LOG_IT_BEFORE 5 /* Number of iterations to log before
> +   suspicious iteration. */
> +#define LOG_IT_AFTER 5  /* Number of iterations to log after
> +   suspicious iteration. */
> +
> +bool log_enabled = false;
> +bool log_extend = false;
> +static uint32_t log_it_before = LOG_IT_BEFORE; static uint32_t
> +log_it_after = LOG_IT_AFTER; static uint32_t log_us_thr =
> +ITER_US_THRESHOLD; uint32_t log_q_thr = VHOST_QUEUE_FULL; ui

Re: [ovs-dev] [PATCH 0/3] dpif-netdev: Combine CD and DFC patch for datapath refactor

2018-03-08 Thread O Mahony, Billy
Hi All,

I have run some tests using a more realistic distribution of flows - see below
- than what we normally test with.

This is a phy to phy test with just port forwarding but I think that is the 
best way to test the EMC as it avoids noise from other mechanisms e.g. vhost
and dpcls lookup. It uses 64B packets and 1 PMD.

I also ran the tests with the EMC disabled.

 Baseline HoM 951cbaf 6/3/18
||   emc-insert-prob 1/100  | emc disabled
offered   flows ||   rxdemccycles/  |  rxd  emc   cycles
kpps||  kpps   hits   pkt   | kpps hits  pkt
++--+---
14,0008 || 10830   100%   212   | 7360   0%  311
14,000  100 ||  9730   100%   236   | 7370   0%  311
14,0001 ||  654569%   345   | 7370   0%  311
14,000  100 ||  6202 3%   370   | 7370   0%  311

 Combine CD and DFC patch
||  emc-insert-prob 1/100   | emc disabled
offered   flows ||  rxdemc  dfc  cycles | rxd   emc   dfc cycles
kpps|| kpps   hits hits/pkt |  hits  hits   /pkt
++--+---
14,0008 || 10930  100%   0% 210 | 8570   0%  100%268
14,000  100 || 10220  100%   0% 224 | 7800   0%  100%294
14,0001 || 800084%  16% 287 | 6770   0%  100%339
14,000  100 || 5921 7%  65% 387 | 6060   0%   72%378

In these scenarios the patch gives an advantage at lower numbers of flows but
this advantage is reversed at very high flow numbers - presumably as the DFC
itself approaches capacity.

Another interesting scenario to test would be the case of many short-lived
flows - 1M flows and 200k new flows/sec, for example. This real use case was
presented at the last OvS Con:
https://www.slideshare.net/LF_OpenvSwitch/lfovs17ovsdpdk-for-nfv-go-live-feedback.
I hope to implement that test in due course.

Below are some details on how and why the flow distribution was chosen for
these tests.

Regards,
Billy


All caches are designed on the assumption that in the real world access
requests are not uniformly distributed.  By design they are used to improve
performance only in situations where some items are accessed more frequently
than others.  If this assumption does not hold then the use of a cache actually
degrades performance.

Currently we test the EMC with one of two scenarios both of which break the
above assumption:

1) Round Robin 'Distribution':

The TG sends a packet from flow#1 then flow#2 up to flow#N then back to
flow#1 again.

Testing with this case gives results that under-state the benefit of the
EMC to the maximum extent possible.  By sending a packet from every other
flow in between every two packets from any given flow, the chances of the flow
having been evicted in the interim between the two packets are maximized.  If a
tester were to intentionally design a test to understate the benefits of the
EMC it would use a round-robin flow distribution.

2) Uniform Distribution:

The TG randomly selects the next packet to send from all the configured 
flows.

Testing with this case gives results that under-state the benefit of the
EMC.  As each flow is equally likely to occur, then unless the number of
flows is less than the number of EMC entries - and typically there are many
more flows - there is likely no benefit from using the EMC.
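For example, with roughly 8k EMC entries per PMD (the default, if I recall
correctly) and 100k uniformly-distributed flows, any given flow's entry has
almost certainly been evicted before that flow's next packet arrives, so the
EMC adds lookup and insertion cost without providing hits.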

By testing only with these flow distributions that break the fundamental
assumptions justifying the use of an EMC, we are consistently under-stating the
benefits of flow-caching.

A More Realistic Distribution:

A more realistic distribution is almost certainly some kind of power-law or
Pareto distribution.  In this kind of distribution the majority of packets
belong to a small number of flows.  Think of the classic '80/20' rule.  Given a
power-law distribution if we rank all the flows by their packets-per-second we
see that flow with the Nth most packets-per-second is has a rate that is some
fraction of the N-1th flow for all N.  This kind of distribution is what is
seen in the natural world regarding the distribution of ranking of word
occurrence in a language,  the population ranks of cities in various countries,
corporation sizes, income rankings, ranks of number of people watching the same
TV channel, and so on. [https://en.wikipedia.org/wiki/Zipf%27s_law].

For example, using a Zipfian distribution with k=1 as the model for per-flow
pps, the flow distribution would look like this (y-axis is not linear):

1000| *
 500| *  *
 333| *  *  *
PPS  250| *  *  *  *
   ..
  10| *  *  *  *   *
+...-
  1  2  3  4 ...  100
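
To put rough numbers on the sketch above: with 100 flows and the top flow at
1000 pps, flow N carries 1000/N pps, giving about 5,190 pps in total, and the
top 10 flows alone account for roughly 56% of all packets - exactly the kind of
skew that a flow cache is designed to exploit.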
  

Re: [ovs-dev] [RFC 2/2] ingress scheduling: Provide per interface ingress priority

2018-02-20 Thread O Mahony, Billy


> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Tuesday, February 20, 2018 3:10 PM
> To: ovs-dev@openvswitch.org; O Mahony, Billy <billy.o.mah...@intel.com>
> Subject: Re: [ovs-dev] [RFC 2/2] ingress scheduling: Provide per interface 
> ingress
> priority
> 
> Not a full review.
> Two general comments inline.
> 
> > Allow configuration to specify an ingress priority for interfaces.
> > Modify ovs-netdev datapath to act on this configuration so that
> > packets on interfaces with a higher priority will tend be processed
> > ahead of packets on lower priority interfaces.  This protects traffic
> > on higher priority interfaces from loss and latency as PMDs get overloaded.
> >
> > Signed-off-by: Billy O'Mahony 
> > ---
> >  include/openvswitch/ofp-parse.h |  3 ++
> >  lib/dpif-netdev.c   | 47 +-
> >  lib/netdev-bsd.c|  1 +
> >  lib/netdev-dpdk.c   | 64
> +++--
> >  lib/netdev-dummy.c  |  1 +
> >  lib/netdev-linux.c  |  1 +
> >  lib/netdev-provider.h   | 11 ++-
> >  lib/netdev-vport.c  |  1 +
> >  lib/netdev.c| 23 +++
> >  lib/netdev.h|  2 ++
> >  vswitchd/bridge.c   |  2 ++
> >  11 files changed, 140 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/openvswitch/ofp-parse.h
> > b/include/openvswitch/ofp-parse.h index 3fdd468..d77ab8f 100644
> > --- a/include/openvswitch/ofp-parse.h
> > +++ b/include/openvswitch/ofp-parse.h
> > @@ -33,6 +33,9 @@ extern "C" {
> >  struct match;
> >  struct mf_field;
> >  struct ofputil_port_map;
> > +struct tun_table;
> > +struct flow_wildcards;
> > +struct ofputil_port_map;
> >
> >  struct ofp_protocol {
> >  const char *name;
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > d49c986..89d8229 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -42,6 +42,7 @@
> >  #include "dpif.h"
> >  #include "dpif-netdev-perf.h"
> >  #include "dpif-provider.h"
> > +#include "netdev-provider.h"
> >  #include "dummy.h"
> >  #include "fat-rwlock.h"
> >  #include "flow.h"
> > @@ -487,6 +488,7 @@ static void dp_netdev_actions_free(struct
> > dp_netdev_actions *);  struct polled_queue {
> >  struct dp_netdev_rxq *rxq;
> >  odp_port_t port_no;
> > +uint8_t priority;
> >  };
> >
> >  /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
> > @@ -626,6 +628,12 @@ struct dpif_netdev {
> >  uint64_t last_port_seq;
> >  };
> >
> > +static void
> > +dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
> > +   struct dp_netdev_rxq *rxq,
> > +   odp_port_t port_no,
> > +   unsigned int *rxd_cnt,
> > +   unsigned int *txd_cnt);
> >  static int get_port_by_number(struct dp_netdev *dp, odp_port_t port_no,
> >struct dp_netdev_port **portp)
> >  OVS_REQUIRES(dp->port_mutex);
> > @@ -3259,15 +3267,16 @@ dp_netdev_pmd_flush_output_packets(struct
> dp_netdev_pmd_thread *pmd,
> >  return output_cnt;
> >  }
> >
> > -static int
> > +static void
> >  dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
> > struct dp_netdev_rxq *rxq,
> > -   odp_port_t port_no)
> > +   odp_port_t port_no,
> > +   unsigned int *rxd_cnt,
> > +   unsigned int *txd_cnt)
> >  {
> >  struct dp_packet_batch batch;
> >  struct cycle_timer timer;
> >  int error;
> > -int batch_cnt = 0, output_cnt = 0;
> >  uint64_t cycles;
> >
> >  /* Measure duration for polling and processing rx burst. */ @@
> > -3279,17 +3288,17 @@ dp_netdev_process_rxq_port(struct
> dp_netdev_pmd_thread *pmd,
> >  error = netdev_rxq_recv(rxq->rx, );
> >  if (!error) {
> >  /* At least one packet received. */
> > +*rxd_cnt = batch.count;
> >  *recirc_depth_get() = 0;
> >  pmd_thread_ctx_time_update(pmd);
> >
> > -batch_cnt = batch.count;
> >  dp_netdev_input(pmd, , po

Re: [ovs-dev] [RFC 0/2] Ingress Scheduling

2018-02-20 Thread O Mahony, Billy


> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Tuesday, February 20, 2018 2:55 PM
> To: ovs-dev@openvswitch.org; O Mahony, Billy <billy.o.mah...@intel.com>
> Subject: Re: Re: [ovs-dev] [RFC 0/2] Ingress Scheduling
> 
> > This patch set implements the 'preferential read' part of the feature
> > of ingress scheduling described at OvS 2017 Fall Conference
> > https://www.slideshare.net/LF_OpenvSwitch/lfovs17ingress-scheduling-
> 82280320.
> >
> > It allows configuration to specify an ingress priority for and entire
> > interface. This protects traffic on higher priority interfaces from
> > loss and latency as PMDs get overloaded.
> >
> > Results so far are very promising; for a uniform traffic
> > distribution, as total offered load increases, loss starts on the lowest
> > priority port first and the highest priority port last.
> >
> > When using four physical ports with each port forwarded to one of the
> > other ports. The following packets loss is seen. The EMC was bypassed
> > in this case and a small delay loop was added to each packet to
> > simulate more realistic per packet processing cost of 1000cycles approx.
> >
> > Port dpdk_0  dpdk_1  dpdk_2  dpdk_3
> > Traffic
> > Dist.   25% 25% 25% 25%
> > Priority  0   1   2   3
> > n_rxq 8   8   8   8
> >
> > Total
> > Load Kpps   Loss Rate Per Port (Kpps)
> > 2110  0   0   0   0
> > 2120  5   0   0   0
> > 2920676   0   0   0
> > 2930677   5   0   0
> > 3510854 411   0   0
> > 3520860 415   3   0
> > 4390   1083 716 348   0
> > 4400   1088 720 354   1
> >
> >
> > Even in the case where most traffic is on the priority port this
> > remains the
> > case:
> >
> > Port dpdk_0  dpdk_1  dpdk_2  dpdk_3
> > Traffic
> > Dist.   10% 20% 30% 40%
> > Priority  0   1   2   3
> > n_rxq 8   8   8   8
> >
> > Total
> > Load Kpps   Loss Rate Per Port (Kpps)
> >  2400 0   0   0   0
> >  2410 5   0   0   0
> >  2720   225   5   0   0
> >  2880   252 121   9   0
> >  3030   269 176  82   3
> >  4000   369 377 384 392
> >  5000   471 580 691 801
> >
> > The latency characteristics of the traffic on the higher priority
> > ports is also protected.
> >
> > Port dpdk_0  dpdk_1  dpdk_2  dpdk_3
> > Traffic
> > Dist.   10% 20% 30% 40%
> > Priority  0   1   2   3
> > n_rxq 8   8   8   8
> >
> > Totaldpdk0dpdk1dpdk2dpdk3
> > Load Kpps
> >  2400  113  122  120  125
> >  241036117  571  577  560
> >  2720   32324214424 3265 3235
> >  2880   3914043350810075 4600
> >  3030   4125973545017061 7965
> >  4000   414729360701774011106
> >  5000   416801364451823311653
> >
> > Some General setup notes:
> > Fortville. (X710 DA4. firmware-version: 6.01 0x800034af 1.1747.0)
> > Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz One pmd Port fwding port
> > 0<->1, 2 <-> 3 Frame 64B, UDP 221 streams per port.
> > OvS base - 4c80644 http://github.com/istokes/ovs dpdk_merge_2_9. Added
> 600cycles approx pkt processing in order to bring per packet cost to ~1000
> cycles.
> > DPDK v17.11.1
> >
> 
> Hi Billy, thanks for your work on this feature.
> I have one question here. Have you tested heterogeneous configurations?

[[BO'M]] I have performed some tests for virtual devices which show a similar 
effect. VMs with prioritized vhost ports achieve higher TCP transfer rates at 
the cost of bandwidth between the non-prioritized ones. But I have not tested a 
mixture of either physical & virtual ports or different types of NIC. I do plan 
such tests as part of a closer characterization of how TCP transfers are 
affected.

> I mean situations where PMD thread polls different types of ports (NICs).
> It's known that RX operations have different costs for different port types 
> (like
> virtual and physical) and also for different hardware NICs because of 
> different
> implementations of DPDK PMD drivers.

[[BO'M]] That is an interesting point. Any scheme that protects (ie rea

Re: [ovs-dev] [PATCH v8 0/3] dpif-netdev: Detailed PMD performance metrics and supervision

2018-02-15 Thread O Mahony, Billy
Hi Jan,

Everyone is probably reviewed-out :)

I'm happy to ack once the +='s are fixed.

Regards,
Billy. 



> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Tuesday, February 13, 2018 4:04 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>
> Subject: RE: [PATCH v8 0/3] dpif-netdev: Detailed PMD performance metrics and
> supervision
> 
> Gentle reminder to review this series which unfortunately missed the 2.9
> deadline.
> 
> I checked and the patches still apply on today's master.
> So far I have received one comment from Billy
> https://mail.openvswitch.org/pipermail/ovs-dev/2018-January/343808.html
> 
> Thanks, Jan
> 
> > -Original Message-
> > From: Jan Scheurich
> > Sent: Friday, 26 January, 2018 13:20
> > To: d...@openvswitch.org
> > Cc: ktray...@redhat.com; ian.sto...@intel.com; i.maxim...@samsung.com;
> > billy.o.mah...@intel.com; Jan Scheurich <jan.scheur...@ericsson.com>
> > Subject: [PATCH v8 0/3] dpif-netdev: Detailed PMD performance metrics
> > and supervision
> >
> > The run-time performance of PMDs is often difficult to understand and
> > trouble-shoot. The existing PMD statistics counters only provide a
> > coarse grained average picture. At packet rates of several Mpps
> > sporadic drops of packet bursts happen at sub-millisecond time scales
> > and are impossible to capture and analyze with existing tools.
> >
> > This patch collects a large number of important PMD performance
> > metrics per PMD iteration, maintaining histograms and circular
> > histories for iteration metrics and millisecond averages. To capture
> > sporadic drop events, the patch set can be configured to monitor
> > iterations for suspicious metrics and to log the neighborhood of such
> iterations for off-line analysis.
> >
> > The extra cost for the performance metric collection and the
> > supervision has been measured to be in the order of 1% compared to the
> > base commit in a PVP setup with L3 pipeline over VXLAN tunnels. For
> > that reason the metrics collection is disabled by default and can be
> > enabled at run-time through configuration.
> >
> > v7 -> v8:
> > * Rebased on to master (commit 4e99b70df)
> > * Implemented comments from Ilya Maximets and Billy O'Mahony.
> > * Replaced netdev_rxq_length() introduced in v7 by optional out
> >   parameter for the remaining rx queue len in netdev_rxq_recv().
> > * Fixed thread synchronization issues in clearing PMD stats:
> >   - Use mutex to control whether to clear from main thread directly
> > or in PMD at start of next iteration.
> >   - Use mutex to prevent concurrent clearing and printing of metrics.
> > * Added tx packet and batch stats to pmd-perf-show output.
> > * Delay warning for suspicious iteration to the iteration in which
> >   we also log the neighborhood to not pollute the logged iteration
> >   stats with logging costs.
> > * Corrected the exact number of iterations logged before and after a
> >   supicious iteration.
> > * Introduced options -e and -ne in pmd-perf-log-set to control whether
> >   to *extend* the range of logged iterations when additional supicious
> >   iterations are detected before the scheduled end of logging interval
> >   is reached.
> > * Exclude logging cycles from the iteration stats to avoid confusing
> >   ghost peaks.
> > * Performance impact compared to master less than 1% even with
> >   supervision enabled.
> >
> > v5 -> v7:
> > * Rebased on to dpdk_merge (commit e68)
> >   - New base contains earlier refactoring parts of series.
> > * Implemented comments from Ilya Maximets and Billy O'Mahony.
> > * Replaced piggybacking qlen on dp_packet_batch with a new netdev API
> >   netdev_rxq_length().
> > * Thread-safe clearing of pmd counters in pmd_perf_start_iteration().
> > * Fixed bug in reporting datapath stats.
> > * Work-around a bug in DPDK rte_vhost_rx_queue_count() which sometimes
> >   returns bogus in the upper 16 bits of the uint32_t return value.
> >
> > v4 -> v5:
> > * Rebased to master (commit e9de6c0)
> > * Implemented comments from Aaron Conole and Darrel Ball
> >
> > v3 -> v4:
> > * Rebased to master (commit 4d0a31b)
> >   - Reverting changes to struct dp_netdev_pmd_thread.
> > * Make metrics collection configurable.
> > * Several bugfixes.
> >
> > v2 -> v3:
> > * Rebased to OVS master (commit 3728b3b).
>

Re: [ovs-dev] [PATCH v8 1/3] netdev: Add optional qfill output parameter to rxq_recv()

2018-01-26 Thread O Mahony, Billy
LGTM but one thing I don't understand down below...

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Friday, January 26, 2018 12:20 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>;
> Jan Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v8 1/3] netdev: Add optional qfill output parameter to
> rxq_recv()
> 
> If the caller provides a non-NULL qfill pointer and the netdev
> implemementation supports reading the rx queue fill level, the rxq_recv()
> function returns the remaining number of packets in the rx queue after
> reception of the packet burst to the caller. If the implementation does not
> support this, it returns -ENOTSUP instead. Reading the remaining queue fill
> level should not substantilly slow down the recv() operation.
> 
> A first implementation is provided for ethernet and vhostuser DPDK ports in
> netdev-dpdk.c.
> 
> This output parameter will be used in the upcoming commit for PMD
> performance metrics to supervise the rx queue fill level for DPDK vhostuser
> ports.
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> ---
>  lib/dpif-netdev.c |  2 +-
>  lib/netdev-bsd.c  |  8 +++-
>  lib/netdev-dpdk.c | 25 +++--
>  lib/netdev-dummy.c|  8 +++-
>  lib/netdev-linux.c|  7 ++-
>  lib/netdev-provider.h |  7 ++-
>  lib/netdev.c  |  5 +++--
>  lib/netdev.h  |  3 ++-
>  8 files changed, 55 insertions(+), 10 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index ba62128..4a0fcbd
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3276,7 +3276,7 @@ dp_netdev_process_rxq_port(struct
> dp_netdev_pmd_thread *pmd,
>  pmd->ctx.last_rxq = rxq;
>  dp_packet_batch_init();
> 
> -error = netdev_rxq_recv(rxq->rx, );
> +error = netdev_rxq_recv(rxq->rx, , NULL);
>  if (!error) {
>  /* At least one packet received. */
>  *recirc_depth_get() = 0;
> diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index 05974c1..b70f327
> 100644
> --- a/lib/netdev-bsd.c
> +++ b/lib/netdev-bsd.c
> @@ -618,7 +618,8 @@ netdev_rxq_bsd_recv_tap(struct netdev_rxq_bsd
> *rxq, struct dp_packet *buffer)  }
> 
>  static int
> -netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch
> *batch)
> +netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch
> *batch,
> +int *qfill)
>  {
>  struct netdev_rxq_bsd *rxq = netdev_rxq_bsd_cast(rxq_);
>  struct netdev *netdev = rxq->up.netdev; @@ -643,6 +644,11 @@
> netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch
> *batch)
>  batch->packets[0] = packet;
>  batch->count = 1;
>  }
> +
> +if (qfill) {
> +*qfill = -ENOTSUP;
> +}
> +
>  return retval;
>  }
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index ac2e38e..bb7dece
> 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1757,7 +1757,7 @@ netdev_dpdk_vhost_update_rx_counters(struct
> netdev_stats *stats,
>   */
>  static int
>  netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> -   struct dp_packet_batch *batch)
> +   struct dp_packet_batch *batch, int *qfill)
>  {
>  struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
>  struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
> @@ -1795,11 +1795,24 @@ netdev_dpdk_vhost_rxq_recv(struct
> netdev_rxq *rxq,
>  batch->count = nb_rx;
>  dp_packet_batch_init_packet_fields(batch);
> 
> +if (qfill) {
> +if (nb_rx == NETDEV_MAX_BURST) {
> +/* The DPDK API returns a uint32_t which often has invalid bits 
> in
> + * the upper 16-bits. Need to restrict the value to uint16_t. */
> +*qfill += rte_vhost_rx_queue_count(netdev_dpdk_get_vid(dev),
> +   qid * VIRTIO_QNUM + 
> VIRTIO_TXQ)
> +& UINT16_MAX;
[[BO'M]] Why the += operator?
> +} else {
> +*qfill = 0;
> +}
> +}
> +
>  return 0;
>  }
> 
>  static int
> -netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch
> *batch)
> +netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch
> *batch,
> + int *qfill)
>  {
>  struct netdev_rxq_dpdk *rx = netdev_rxq_dpdk_cast(rxq);
>  struct netdev_dpdk *dev = netdev_d

Re: [ovs-dev] [PATCH v7 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-01-19 Thread O Mahony, Billy
A few things I didn't come across until reading 3/3, and that are not related to 
the other atomic/volatile/mutex discussion.

All these are suggestions, so use them only if you feel they offer improvements 
in clarity.

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Tuesday, January 16, 2018 1:51 AM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>;
> Jan Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v7 2/3] dpif-netdev: Detailed performance stats for PMDs
> 
> This patch instruments the dpif-netdev datapath to record detailed
> statistics of what is happening in every iteration of a PMD thread.
> 
> The collection of detailed statistics can be controlled by a new
> Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> By default it is disabled. The run-time overhead, when enabled, is
> in the order of 1%.
> 
> The covered metrics per iteration are:
>   - cycles
>   - packets
>   - (rx) batches
>   - packets/batch
>   - max. vhostuser qlen
>   - upcalls
>   - cycles spent in upcalls
> 
> This raw recorded data is used threefold:
> 
> 1. In histograms for each of the following metrics:
>- cycles/iteration (log.)
>- packets/iteration (log.)
>- cycles/packet
>- packets/batch
>- max. vhostuser qlen (log.)
>- upcalls
>- cycles/upcall (log)
>The histograms bins are divided linear or logarithmic.
> 
> 2. A cyclic history of the above statistics for 999 iterations
> 
> 3. A cyclic history of the cummulative/average values per millisecond
>wall clock for the last 1000 milliseconds:
>- number of iterations
>- avg. cycles/iteration
>- packets (Kpps)
>- avg. packets/batch
>- avg. max vhost qlen
>- upcalls
>- avg. cycles/upcall
> 
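On the "linear or logarithmic" bins in item 1: a minimal sketch of what
logarithmic binning means here (bin count and wall placement are assumptions
for illustration, not the patch's histogram_walls_set_log(); it also assumes a
non-zero lower wall):

    /* Sketch: log-spaced bin walls, so each bin covers a constant ratio. */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BINS 32

    static void
    set_log_walls(uint32_t walls[NUM_BINS], double min, double max)
    {
        for (int i = 0; i < NUM_BINS; i++) {
            walls[i] = (uint32_t) (min * pow(max / min,
                                             (double) i / (NUM_BINS - 1)));
        }
    }

    int
    main(void)
    {
        uint32_t walls[NUM_BINS];

        set_log_walls(walls, 1000, 1e8);   /* e.g. cycles per iteration */
        for (int i = 0; i < NUM_BINS; i++) {
            printf("wall[%d] = %u\n", i, walls[i]);
        }
        return 0;
    }
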
> The gathered performance metrics can be printed at any time with the
> new CLI command
> 
> ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
> [-pmd core | dp]
> 
> The options are
> 
> -nh:Suppress the histograms
> -it iter_len:   Display the last iter_len iteration stats
> -ms ms_len: Display the last ms_len millisecond stats
> -pmd core:  Display only
> 
> The performance statistics are reset with the existing
> dpif-netdev/pmd-stats-clear command.
> 
> The output always contains the following global PMD statistics,
> similar to the pmd-stats-show command:
> 
> Time: 15:24:55.270
> Measurement duration: 1.008 s
> 
> pmd thread numa_id 0 core_id 1:
> 
>   Cycles:2419034712  (2.40 GHz)
>   Iterations:572817  (1.76 us/it)
>   - idle:486808  (15.9 % cycles)
>   - busy: 86009  (84.1 % cycles)
>   Packets:  2399607  (2381 Kpps, 848 cycles/pkt)
>   Datapath passes:  3599415  (1.50 passes/pkt)
>   - EMC hits:336472  ( 9.3 %)
>   - Megaflow hits:  3262943  (90.7 %, 1.00 subtbl lookups/hit)
>   - Upcalls:  0  ( 0.0 %, 0.0 us/upcall)
>   - Lost upcalls: 0  ( 0.0 %)
> 
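As a quick sanity check of the example output (my arithmetic, not part of the
patch): 2419034712 cycles / 572817 iterations ~ 4223 cycles per iteration,
which at 2.40 GHz is ~1.76 us/it; 2399607 packets / 1.008 s ~ 2381 Kpps;
84.1 % of 2419034712 cycles spread over 2399607 packets ~ 848 cycles/pkt,
which suggests the cycles/pkt figure counts busy cycles only; and 3599415
passes / 2399607 packets ~ 1.50 passes per packet.
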
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> ---
>  NEWS|   3 +
>  lib/automake.mk |   1 +
>  lib/dp-packet.h |   1 +
>  lib/dpif-netdev-perf.c  | 333
> +++-
>  lib/dpif-netdev-perf.h  | 239 +--
>  lib/dpif-netdev.c   | 177 +--
>  lib/netdev-dpdk.c   |  13 +-
>  lib/netdev-dpdk.h   |  14 ++
>  lib/netdev-dpif-unixctl.man | 113 +++
>  manpages.mk |   2 +
>  vswitchd/ovs-vswitchd.8.in  |  27 +---
>  vswitchd/vswitch.xml|  12 ++
>  12 files changed, 881 insertions(+), 54 deletions(-)
>  create mode 100644 lib/netdev-dpif-unixctl.man
> 
> diff --git a/NEWS b/NEWS
> index 2c28456..743528e 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -44,6 +44,9 @@ Post-v2.8.0
>if available (for OpenFlow 1.4+).
> - Userspace datapath:
>   * Output packet batching support.
> + * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a
> single PMD
> + * Detailed PMD performance metrics available with new command
> + ovs-appctl dpif-netdev/pmd-perf-show
> - vswitchd:
>   * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
>   * Configuring a controller, or unconfiguring all controllers, now 
> deletes
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 159319f..d07cbe9 100644
> --- a/lib/a

Re: [ovs-dev] [PATCH v7 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-01-19 Thread O Mahony, Billy
Hi All,

I'm going to actually try out the code next. But for now a few comments on the 
code. 

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Tuesday, January 16, 2018 1:51 AM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>;
> i.maxim...@samsung.com; O Mahony, Billy <billy.o.mah...@intel.com>;
> Jan Scheurich <jan.scheur...@ericsson.com>
> Subject: [PATCH v7 3/3] dpif-netdev: Detection and logging of suspicious
> PMD iterations
> 
> This patch enhances dpif-netdev-perf to detect iterations with suspicious
> statistics according to the following criteria:
> 
> - iteration lasts longer than US_THR microseconds (default 250).
>   This can be used to capture events where a PMD is blocked or
>   interrupted for such a period of time that there is a risk for
>   dropped packets on any of its Rx queues.
> 
> - max vhost qlen exceeds a threshold Q_THR (default 128). This can
>   be used to infer virtio queue overruns and dropped packets inside
>   a VM, which are not visible in OVS otherwise.
> 
> Such suspicious iterations can be logged together with their iteration
> statistics to be able to correlate them to packet drop or other events outside
> OVS.
> 
> A new command is introduced to enable/disable logging at run-time and to
> adjust the above thresholds for suspicious iterations:
> 
> ovs-appctl dpif-netdev/pmd-perf-log-set on | off
> [-b before] [-a after] [-us usec] [-q qlen]
> 
> Turn logging on or off at run-time (on|off).
> 
> -b before:  The number of iterations before the suspicious iteration to
> be logged (default 5).
> -a after:   The number of iterations after the suspicious iteration to
> be logged (default 5).
> -q qlen:Suspicious vhost queue fill level threshold. Increase this
> to 512 if the Qemu supports 1024 virtio queue length.
> (default 128).
> -us usec:   change the duration threshold for a suspicious iteration
> (default 250 us).
> 
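A minimal, self-contained sketch of the detection rule described above (the
thresholds mirror the commit message; the names and everything else are
illustrative, not the patch's code):

    /* Sketch: flag an iteration as suspicious. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define US_THR 250    /* default -us threshold, microseconds */
    #define Q_THR  128    /* default -q threshold, vhost queue fill level */

    struct iter_sample {
        uint64_t duration_us;
        uint32_t max_vhost_qfill;
    };

    static bool
    iteration_is_suspicious(const struct iter_sample *it, const char **reason)
    {
        if (it->duration_us > US_THR) {
            *reason = "long iteration";
            return true;
        }
        if (it->max_vhost_qfill > Q_THR) {
            *reason = "vhost queue overrun risk";
            return true;
        }
        return false;
    }

    int
    main(void)
    {
        struct iter_sample it = { .duration_us = 310, .max_vhost_qfill = 40 };
        const char *reason;

        if (iteration_is_suspicious(&it, &reason)) {
            /* The real code would log this iteration plus the configured
             * number of iterations before and after it. */
            printf("suspicious iteration: %s\n", reason);
        }
        return 0;
    }
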
> If more than 100 iterations before or after a suspicious iteration have been
> looged once, OVS falls back to the safe default values (5/5) to avoid that
> logging itself causes continuos further logging.
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> ---
>  NEWS|   2 +
>  lib/dpif-netdev-perf.c  | 142
> 
>  lib/dpif-netdev-perf.h  |  40 -
>  lib/dpif-netdev.c   |   7 ++-
>  lib/netdev-dpif-unixctl.man |  47 ++-
>  5 files changed, 233 insertions(+), 5 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index 743528e..7d40374 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -47,6 +47,8 @@ Post-v2.8.0
>   * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a
> single PMD
>   * Detailed PMD performance metrics available with new command
>   ovs-appctl dpif-netdev/pmd-perf-show
> + * Supervision of PMD performance metrics and logging of suspicious
> +   iterations
> - vswitchd:
>   * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
>   * Configuring a controller, or unconfiguring all controllers, now 
> deletes diff
> --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index
> e0ef15d..259a6c8 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -24,6 +24,23 @@
> 
>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
> 
> +#define ITER_US_THRESHOLD 250   /* Warning threshold for iteration
> duration
> +   in microseconds. */
> +#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */
> +#define LOG_IT_BEFORE 5 /* Number of iteration to log before
> +   suspicious iteration. */
> +#define LOG_IT_AFTER 5  /* Number of iteration to log after
> +   suspicious iteration. */
[[BO'M]] typo 'Number of iterations...'
> +
> +bool log_on = false;
> +static uint32_t log_it_before = LOG_IT_BEFORE; static uint32_t
> +log_it_after = LOG_IT_AFTER; static uint32_t log_us_thr =
> +ITER_US_THRESHOLD; uint32_t log_q_thr = VHOST_QUEUE_FULL; uint64_t
> +iter_cycle_threshold;
> +
> +static struct vlog_rate_limit latency_rl = VLOG_RATE_LIMIT_INIT(600,
> +600);
> +
>  #ifdef DPDK_NETDEV
>  static uint64_t
>  get_tsc_hz(void)
> @@ -124,6 +141,8 @@ pmd_perf_stats_init(struct pmd_perf_stats *s)
>  histogram_walls_set_log(&s->cycles_per_upcall, 1000, 100);
>  histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
>  s->start_ms = time_msec();
> +s->log_begin_it = UINT

Re: [ovs-dev] [PATCH v7 1/3] netdev: Add rxq callback function rxq_length()

2018-01-19 Thread O Mahony, Billy


> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Thursday, January 18, 2018 4:59 PM
> To: Ilya Maximets <i.maxim...@samsung.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>
> Subject: RE: [PATCH v7 1/3] netdev: Add rxq callback function rxq_length()
> 
> > >>>
> > >>> OK. Not necessary for our use case, as it will only be called by
> > >>> the PMD
> > >> after having received a full batch of 32 packets, but in general I
> > >> agree those checks are needed.
> > >>
> > >> It's necessary because vhost device could be disconnected between
> > >> rxq_recv() and rxq_length(). In this case we will call
> > >> rte_vhost_rx_queue_count() with vid == -1. This will produce access
> > >> to the random memory inside dpdk and likely a segmentation fault.
> > >>
> > >> See commit daf22bf7a826 ("netdev-dpdk: Fix calling vhost API with
> > >> negative
> > >> vid.") for a example of a similar issue. And I'm taking this
> > >> opportunity to recall that you should retrieve the vid only once.
> > >
> > >  [[BO'M]] Is there not also the possibility that the vhost device gets
> disconnected between the call to get_vid() and rxq_recv()?
> >
> > You mean disconnect between netdev_dpdk_get_vid(dev) and
> rte_vhost_dequeue_burst(vid) ?
> > There is no issue in this case, because 'destroy_device()' will wait
> > for other threads to quiesce. This means that device structure inside
> > dpdk will not be freed while we're inside netdev_rxq_recv(). We can
> > safely call any rte_vhost API for the old vid until device not freed inside
> dpdk.
> >
> > >
> > > Also, given these required calls to get_vid (which afaik requires
> > > some slow memory fencing) wouldn't that argue for the original
> > approach where the rxq len is returned from rxq_recv(). As the call to
> > rxq_length()  would be made once per batch once the queue is not being
> drained rxq_recv() the overhead could be significant.
> >
> > I'm not sure (I hope that Jan tested the performance of this version),
> > but I feel that 'rte_vhost_rx_queue_count()' is more heavy operation.
> 
> I have not done any performance tests yet with the new
> netdev_rxq_length() call after adding the vid and other checks. The actual
> 'rte_vhost_rx_queue_count()' is very lightweight. So if the memory fencing
> for get_vid() is expensive it might mean a hit. (Note: The
> rte_eth_queue_count() functions for physical ixgbe queues was much more
> costly).
> 
> Thinking about it, I would tend to agree with Billy that it seems simpler as 
> well
> as more accurate and efficient to let the caller provide an optional output
> parameter "uint32_t *rxq_len" in rqx_recv() if they are interested in the
> queue fill level and retrieve both in one atomic operation, so that we avoid
> the duplicate vid checks.
[[BO'M]] I like the idea of supplying the pointer, which makes it optional and 
so gets around other issues we discussed - such as that a client may only be 
interested in the rxq length for certain netdev types, or may know a priori that 
it's an expensive operation for other netdev types. 
> 
> Can we agree on that?
> 
> BR, Jan

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v7 1/3] netdev: Add rxq callback function rxq_length()

2018-01-18 Thread O Mahony, Billy


> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Thursday, January 18, 2018 6:18 AM
> To: Jan Scheurich <jan.scheur...@ericsson.com>; d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>; O Mahony,
> Billy <billy.o.mah...@intel.com>
> Subject: Re: [PATCH v7 1/3] netdev: Add rxq callback function rxq_length()
> 
> On 18.01.2018 02:21, Jan Scheurich wrote:
> > Thanks for the review. Answers inline.
> > Regards, Jan
> >
> >
> >> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> >> Sent: Wednesday, 17 January, 2018 11:47
> >> Subject: Re: [PATCH v7 1/3] netdev: Add rxq callback function
> >> rxq_length()
> >>
> >> On 16.01.2018 04:51, Jan Scheurich wrote:
> >>> If implememented, this function returns the number of packets in an
> >>> rx queue of the netdev. If not implemented, it returns -1.
> >>
> >> To be conform with other netdev functions it should return meaningful
> >> error codes. As 'rte_eth_rx_queue_count' could return different
> >> errors like -EINVAL or -ENOTSUP, 'netdev_rxq_length' itself should
> >> return -EOPNOTSUPP if not implemented.
> >
> > OK.
> >
> >>
> >>>
> >>> This function will be used in the upcoming commit for PMD
> >>> performance metrics to supervise the rx queue fill level for DPDK
> vhostuser ports.
> >>>
> >>> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> >>> ---
> >>>  lib/netdev-bsd.c  |  1 +
> >>>  lib/netdev-dpdk.c | 36 +++-
> >>>  lib/netdev-dummy.c|  1 +
> >>>  lib/netdev-linux.c|  1 +
> >>>  lib/netdev-provider.h |  3 +++
> >>>  lib/netdev-vport.c|  1 +
> >>>  lib/netdev.c  |  9 +
> >>>  lib/netdev.h  |  1 +
> >>>  8 files changed, 48 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index
> >>> 05974c1..8d1771e 100644
> >>> --- a/lib/netdev-bsd.c
> >>> +++ b/lib/netdev-bsd.c
> >>> @@ -1546,6 +1546,7 @@ netdev_bsd_update_flags(struct netdev
> *netdev_, enum netdev_flags off,
> >>>  netdev_bsd_rxq_recv, \
> >>>  netdev_bsd_rxq_wait, \
> >>>  netdev_bsd_rxq_drain,\
> >>> +NULL, /* rxq_length */   \
> >>>   \
> >>>  NO_OFFLOAD_API   \
> >>>  }
> >>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
> >>> ccda3fc..4200556 100644
> >>> --- a/lib/netdev-dpdk.c
> >>> +++ b/lib/netdev-dpdk.c
> >>> @@ -1839,6 +1839,27 @@ netdev_dpdk_rxq_recv(struct netdev_rxq
> *rxq, struct dp_packet_batch *batch)
> >>>  return 0;
> >>>  }
> >>>
> >>> +static int
> >>> +netdev_dpdk_vhost_rxq_length(struct netdev_rxq *rxq) {
> >>> +struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
> >>> +int qid = rxq->queue_id;
> >>> +
> >>
> >> We must make all the checks as in rxq_recv() function before calling
> >> 'rte_vhost_rx_queue_count'. Otherwise we may crash here if device
> >> will be occasionally disconnected:
> >>
> >> int vid = netdev_dpdk_get_vid(dev);
> >>
> >> if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured
> >>  || !(dev->flags & NETDEV_UP))) {
> >> return -EAGAIN;
> >> }
> >>
> >> Not sure about -EAGAIN, but we need to return some negative errno.
> >
> > OK. Not necessary for our use case, as it will only be called by the PMD
> after having received a full batch of 32 packets, but in general I agree those
> checks are needed.
> 
> It's necessary because vhost device could be disconnected between
> rxq_recv() and rxq_length(). In this case we will call
> rte_vhost_rx_queue_count() with vid == -1. This will produce access to the
> random memory inside dpdk and likely a segmentation fault.
> 
> See commit daf22bf7a826 ("netdev-dpdk: Fix calling vhost API with negative
> vid.") for a example of a similar issue. And I'm taking this opportunity to 
> recall
> that you should retrieve the vid only once.

 [

Re: [ovs-dev] [PATCH v9 1/2] dpif-netdev: Refactor PMD performance into dpif-netdev-perf

2018-01-13 Thread O Mahony, Billy
Acked-by: Billy O'Mahony 

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Friday, January 12, 2018 12:39 AM
> To: d...@openvswitch.org
> Cc: i.maxim...@samsung.com
> Subject: [ovs-dev] [PATCH v9 1/2] dpif-netdev: Refactor PMD performance into
> dpif-netdev-perf
> 
> Add module dpif-netdev-perf to host all PMD performance-related data
> structures and functions in dpif-netdev. Refactor the PMD stats handling in 
> dpif-
> netdev and delegate whatever possible into the new module, using clean
> interfaces to shield dpif-netdev from the implementation details. Accordingly,
> the all PMD statistics members are moved from the main struct
> dp_netdev_pmd_thread into a dedicated member of type struct pmd_perf_stats.
> 
> Include Darrel's prior refactoring of PMD stats contained in [PATCH v5,2/3] 
> dpif-
> netdev: Refactor some pmd stats:
> 
> 1. The cycles per packet counts are now based on packets received rather than
> packet passes through the datapath.
> 
> 2. Packet counters are now kept for packets received and packets recirculated.
> These are kept as separate counters for maintainability reasons. The cost of
> incrementing these counters is negligible.  These new counters are also
> displayed to the user.
> 
> 3. A display statistic is added for the average number of datapath passes per
> packet. This should be useful for user debugging and understanding of packet
> processing.
> 
> 4. The user visible 'miss' counter is used for successful upcalls, rather 
> than the
> sum of sucessful and unsuccessful upcalls. Hence, this becomes what user
> historically understands by OVS 'miss upcall'.
> The user display is annotated to make this clear as well.
> 
> 5. The user visible 'lost' counter remains as failed upcalls, but is 
> annotated to
> make it clear what the meaning is.
> 
> 6. The enum pmd_stat_type is annotated to make the usage of the stats
> counters clear.
> 
> 7. The subtable lookup stats is renamed to make it clear that it relates to 
> masked
> lookups.
> 
> 8. The PMD stats test is updated to handle the new user stats of packets
> received, packets recirculated and average number of datapath passes per
> packet.
> 
> On top of that introduce a "-pmd " option to the PMD info commands to
> filter the output for a single PMD.
> 
> Made the pmd-stats-show output a bit more readable by adding a blank
> between colon and value.
> 
> Signed-off-by: Jan Scheurich 
> Co-authored-by: Darrell Ball 
> Signed-off-by: Darrell Ball 
> ---
>  lib/automake.mk|   2 +
>  lib/dpif-netdev-perf.c |  60 +
>  lib/dpif-netdev-perf.h | 140 +++
>  lib/dpif-netdev.c  | 358 
> -
>  tests/pmd.at   |  30 +++--
>  5 files changed, 369 insertions(+), 221 deletions(-)  create mode 100644 
> lib/dpif-
> netdev-perf.c  create mode 100644 lib/dpif-netdev-perf.h
> 
> diff --git a/lib/automake.mk b/lib/automake.mk index 4b38a11..159319f 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -80,6 +80,8 @@ lib_libopenvswitch_la_SOURCES = \
>   lib/dpdk.h \
>   lib/dpif-netdev.c \
>   lib/dpif-netdev.h \
> + lib/dpif-netdev-perf.c \
> + lib/dpif-netdev-perf.h \
>   lib/dpif-provider.h \
>   lib/dpif.c \
>   lib/dpif.h \
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c new file mode 
> 100644
> index 000..f06991a
> --- /dev/null
> +++ b/lib/dpif-netdev-perf.c
> @@ -0,0 +1,60 @@
> +/*
> + * Copyright (c) 2017 Ericsson AB.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include 
> +
> +#include "openvswitch/dynamic-string.h"
> +#include "openvswitch/vlog.h"
> +#include "dpif-netdev-perf.h"
> +#include "timeval.h"
> +
> +VLOG_DEFINE_THIS_MODULE(pmd_perf);
> +
> +void
> +pmd_perf_stats_init(struct pmd_perf_stats *s) {
> +memset(s, 0 , sizeof(*s));
> +}
> +
> +void
> +pmd_perf_read_counters(struct pmd_perf_stats *s,
> +   uint64_t stats[PMD_N_STATS]) {
> +uint64_t val;
> +
> +/* These loops subtracts reference values (.zero[*]) from the counters.
> + * Since loads and stores are relaxed, it might be possible for a 
> .zero[*]
> + * value to be more recent than the current value we're reading from the
> + * 

Re: [ovs-dev] [PATCH v9 2/2] dpif-netdev: Refactor cycle counting

2018-01-13 Thread O Mahony, Billy
Acked-by: Billy O'Mahony 

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Friday, January 12, 2018 12:39 AM
> To: d...@openvswitch.org
> Cc: i.maxim...@samsung.com
> Subject: [ovs-dev] [PATCH v9 2/2] dpif-netdev: Refactor cycle counting
> 
> Simplify the historically grown TSC cycle counting in PMD threads.
> Cycles are currently counted for the following purposes:
> 
> 1. Measure PMD ustilization
> 
> PMD utilization is defined as ratio of cycles spent in busy iterations (at 
> least one
> packet received or sent) over the total number of cycles.
> 
> This is already done in pmd_perf_start_iteration() and
> pmd_perf_end_iteration() based on a TSC timestamp saved in current iteration
> at start_iteration() and the actual TSC at end_iteration().
> No dependency on intermediate cycle accounting.
> 
> 2. Measure the processing load per RX queue
> 
> This comprises cycles spend on polling and processing packets received from 
> the
> rx queue and the cycles spent on delayed sending of these packets to tx queues
> (with time-based batching).
> 
> The previous scheme using cycles_count_start(), cycles_count_intermediate()
> and cycles-count_end() originally introduced to simplify cycle counting and
> saving calls to rte_get_tsc_cycles() was rather obscuring things.
> 
> Replace by a nestable cycle_timer with with start and stop functions to 
> embrace
> a code segment to be timed. The timed code may contain arbitrary nested
> cycle_timers. The duration of nested timers is excluded from the outer timer.
> 
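A self-contained sketch of the nesting semantics described above (names and
mechanism are assumptions for illustration; the real cycle_timer in
dpif-netdev-perf.h differs in detail): when a nested timer stops, its full
duration is subtracted from the enclosing timer, so each timer reports only
its own cycles.

    /* Sketch: nestable timers; now() stands in for reading the TSC. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t fake_tsc;
    static uint64_t now(void) { return fake_tsc; }

    struct timer {
        struct timer *parent;
        uint64_t start;
        uint64_t suspended;   /* cycles consumed by nested timers */
    };

    static struct timer *cur_timer;

    static void
    timer_start(struct timer *t)
    {
        t->parent = cur_timer;
        t->suspended = 0;
        t->start = now();
        cur_timer = t;
    }

    static uint64_t
    timer_stop(struct timer *t)
    {
        uint64_t total = now() - t->start;

        cur_timer = t->parent;
        if (t->parent) {
            /* Exclude this timer's full duration from the parent. */
            t->parent->suspended += total;
        }
        return total - t->suspended;
    }

    int
    main(void)
    {
        struct timer outer, inner;

        timer_start(&outer);
        fake_tsc += 100;                     /* outer work */
        timer_start(&inner);
        fake_tsc += 40;                      /* nested work */
        printf("inner: %"PRIu64"\n", timer_stop(&inner));
        fake_tsc += 60;                      /* more outer work */
        printf("outer: %"PRIu64"\n", timer_stop(&outer));   /* prints 160 */
        return 0;
    }
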
> The caller must ensure that each call to cycle_timer_start() is followed by a 
> call
> to cycle_timer_end(). Failure to do so will lead to assertion failure or a 
> memory
> leak.
> 
> The new cycle_timer is used to measure the processing cycles per rx queue.
> This is not yet strictly necessary but will be made use of in a subsequent 
> commit.
> 
> All cycle count functions and data are relocated to module dpif-netdev-perf.
> 
> Signed-off-by: Jan Scheurich 
> ---
>  lib/dpif-netdev-perf.h | 110 ---
> -
>  lib/dpif-netdev.c  | 122 
> ++---
>  2 files changed, 135 insertions(+), 97 deletions(-)
> 
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index
> 53d60d3..5993c25 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -23,6 +23,11 @@
>  #include 
>  #include 
> 
> +#ifdef DPDK_NETDEV
> +#include 
> +#include 
> +#endif
> +
>  #include "openvswitch/vlog.h"
>  #include "ovs-atomic.h"
>  #include "timeval.h"
> @@ -59,10 +64,6 @@ enum pmd_stat_type {
>   * recirculation. */
>  PMD_STAT_SENT_PKTS, /* Packets that have been sent. */
>  PMD_STAT_SENT_BATCHES,  /* Number of batches sent. */
> -PMD_CYCLES_POLL_IDLE,   /* Cycles spent unsuccessful polling. */
> -PMD_CYCLES_POLL_BUSY,   /* Cycles spent successfully polling and
> - * processing polled packets. */
> -PMD_CYCLES_OVERHEAD,/* Cycles spent for other tasks. */
>  PMD_CYCLES_ITER_IDLE,   /* Cycles spent in idle iterations. */
>  PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>  PMD_N_STATS
> @@ -85,11 +86,95 @@ struct pmd_counters {
> 
>  struct pmd_perf_stats {
>  /* Start of the current PMD iteration in TSC cycles.*/
> +uint64_t start_it_tsc;
> +/* Latest TSC time stamp taken in PMD. */
>  uint64_t last_tsc;
> +/* If non-NULL, outermost cycle timer currently running in PMD. */
> +struct cycle_timer *cur_timer;
>  /* Set of PMD counters with their zero offsets. */
>  struct pmd_counters counters;
>  };
> 
> +/* Support for accurate timing of PMD execution on TSC clock cycle level.
> + * These functions are intended to be invoked in the context of pmd
> +threads. */
> +
> +/* Read the TSC cycle register and cache it. Any function not requiring
> +clock
> + * cycle accuracy should read the cached value using
> +cycles_counter_get() to
> + * avoid the overhead of reading the TSC register. */
> +
> +static inline uint64_t
> +cycles_counter_update(struct pmd_perf_stats *s) { #ifdef DPDK_NETDEV
> +return s->last_tsc = rte_get_tsc_cycles(); #else
> +return s->last_tsc = 0;
> +#endif
> +}
> +
> +static inline uint64_t
> +cycles_counter_get(struct pmd_perf_stats *s) {
> +return s->last_tsc;
> +}
> +
> +/* A nestable timer for measuring execution time in TSC cycles.
> + *
> + * Usage:
> + * struct cycle_timer timer;
> + *
> + * cycle_timer_start(pmd, );
> + * 
> + * uint64_t cycles = cycle_timer_stop(pmd, );
> + *
> + * The caller must guarantee that a call to cycle_timer_start() is
> +always
> + * paired with a call to cycle_stimer_stop().
> + *
> + * Is is possible to have nested cycles timers within the timed code.
> +The
> + * 

Re: [ovs-dev] [PATCH v5 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-01-09 Thread O Mahony, Billy
Hi Jan,

Thanks these patches are really very useful.

I haven't finished trying them out but thought you'd prefer to get initial 
comments earlier. I'll continue to try them out and revert with any further 
comments later.

Regards,
Billy.

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Thursday, January 4, 2018 12:08 PM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH v5 3/3] dpif-netdev: Detection and logging of
> suspicious PMD iterations
> 
> This patch enhances dpif-netdev-perf to detect iterations with suspicious
> statistics according to the following criteria:
> 
> - iteration lasts longer than US_THR microseconds (default 250).
>   This can be used to capture events where a PMD is blocked or
>   interrupted for such a period of time that there is a risk for
>   dropped packets on any of its Rx queues.
> 
> - max vhost qlen exceeds a threshold Q_THR (default 128). This can
>   be used to infer virtio queue overruns and dropped packets inside
>   a VM, which are not visible in OVS otherwise.
> 
> Such suspicious iterations can be logged together with their iteration
> statistics to be able to correlate them to packet drop or other events outside
> OVS.
> 
> A new command is introduced to enable/disable logging at run-time and to
> adjust the above thresholds for suspicious iterations:
> 
> ovs-appctl dpif-netdev/pmd-perf-log-set [on|off]
> [-b before] [-a after] [-us usec] [-q qlen]
> 
> Turn logging on or off at run-time (on|off).
> 
> -b before:  The number of iterations before the suspicious iteration to
> be logged (default 5).
> -a after:   The number of iterations after the suspicious iteration to
> be logged (default 5).
> -q qlen:Suspicious vhost queue fill level threshold. Increase this
> to 512 if the Qemu supports 1024 virtio queue length.
> (default 128).
> -us usec:   change the duration threshold for a suspicious iteration
> (default 250 us).
> 
> If more than 100 iterations before or after a suspicious iteration have been
> looged once, OVS falls back to the safe default values (5/5) to avoid that
[[BO'M]] typo
> logging itself causes continuos further logging.
[[BO'M]] typo
> 
> Signed-off-by: Jan Scheurich 
> ---
>  lib/dpif-netdev-perf.c | 142
> +
>  lib/dpif-netdev-perf.h |  32 +++
>  lib/dpif-netdev.c  |   5 ++
>  3 files changed, 179 insertions(+)
> 
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index
> a66a48c..3fd19b0 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -28,6 +28,23 @@
> 
>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
> 
> +#define ITER_US_THRESHOLD 250   /* Warning threshold for iteration
> duration
> +   in microseconds. */
> +#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */
> +#define LOG_IT_BEFORE 5 /* Number of iteration to log before
> +   suspicious iteration. */
> +#define LOG_IT_AFTER 5  /* Number of iteration to log after
> +   suspicious iteration. */
> +
> +bool log_on = false;
> +static uint32_t log_it_before = LOG_IT_BEFORE; static uint32_t
> +log_it_after = LOG_IT_AFTER; static uint32_t log_us_thr =
> +ITER_US_THRESHOLD; uint32_t log_q_thr = VHOST_QUEUE_FULL; uint64_t
> +iter_cycle_threshold;
> +
> +static struct vlog_rate_limit latency_rl = VLOG_RATE_LIMIT_INIT(600,
> +600);
> +
>  #ifdef DPDK_NETDEV
>  static uint64_t
>  get_tsc_hz(void)
> @@ -133,6 +150,8 @@ pmd_perf_stats_init(struct pmd_perf_stats *s) {
>  histogram_walls_set_log(&s->cycles_per_upcall, 1000, 100);
>  histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
>  s->start_ms = time_msec();
> +s->log_begin_it = UINT64_MAX;
> +s->log_end_it = UINT64_MAX;
>  }
> 
>  void
> @@ -368,6 +387,129 @@ pmd_perf_stats_clear(struct pmd_perf_stats *s)
>  histogram_clear(&s->max_vhost_qfill);
>  history_init(&s->iterations);
>  history_init(&s->milliseconds);
> +s->log_begin_it = UINT64_MAX;
> +s->log_end_it = UINT64_MAX;
>  s->start_ms = time_msec(); /* Clearing finished. */
>  s->milliseconds.sample[0].timestamp = s->start_ms;  }
> +
> +void
> +pmd_perf_log_suspicious_iteration(struct pmd_perf_stats *s,
> + uint64_t cycles,
> + char *reason) {
> +VLOG_WARN_RL(&latency_rl,
> + "Suspicious iteration (%s): tsc=%"PRIu64
> + " duration=%"PRIu64" us\n",
> + reason, s->current.timestamp,
> + (100L * cycles) / get_tsc_hz());
[[BO'M]] get_tsc_hz calls sleep(1) so won't this block the datapath when 
logging is turned on? I haven't tested trying to inject occasional long 
processing interval.


> +if 

Re: [ovs-dev] [PATCH v5 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-01-09 Thread O Mahony, Billy
Hi Jan,

Thanks these patches are really very useful.

I haven't finished trying them out but thought you'd prefer to get initial 
comments earlier. I'll continue to try them out and revert with any further 
comments later.

Regards,
Billy.

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Thursday, January 4, 2018 12:08 PM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH v5 2/3] dpif-netdev: Detailed performance stats
> for PMDs
> 
> This patch instruments the dpif-netdev datapath to record detailed statistics
> of what is happening in every iteration of a PMD thread.
> 
> The collection of detailed statistics can be controlled by a new configuration
> parameter "other_config:pmd-perf-metrics". By default it is disabled. The
[[BO'M]] Specify table Open_vSwitch
> run-time overhead, when enabled, is in the order of 1%.
[[BO'M]] When I enable metrics I get a reduction in max lossless throughput of 
about 0.6%. That is with simple port forwarding (~200 cycles per packet), so it 
is probably a worst-case scenario. So there is really very little overhead for 
such detailed stats.
> 
> The covered metrics per iteration are:
>   - cycles
>   - packets
>   - (rx) batches
>   - packets/batch
>   - max. vhostuser qlen
>   - upcalls
>   - cycles spent in upcalls
> 
> This raw recorded data is used threefold:
> 
> 1. In histograms for each of the following metrics:
>- cycles/iteration (log.)
>- packets/iteration (log.)
>- cycles/packet
>- packets/batch
>- max. vhostuser qlen (log.)
>- upcalls
>- cycles/upcall (log)
>The histograms bins are divided linear or logarithmic.
> 
> 2. A cyclic history of the above statistics for 1024 iterations
[[BO'M]] I only get a max of 999 iterations printed.
> 
> 3. A cyclic history of the cummulative/average values per millisecond
[[BO'M]] typo
>wall clock for the last 1024 milliseconds:
[[BO'M]] I only get a max of 1000 previous ms printed.
>- number of iterations
>- avg. cycles/iteration
>- packets (Kpps)
>- avg. packets/batch
>- avg. max vhost qlen
>- upcalls
>- avg. cycles/upcall
> 
> The gathered performance statists can be printed at any time with the new
[[BO'M]] typo
> CLI command
> 
> ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
>   [-pmd core] [dp]
> 
> The options are
> 
> -nh:Suppress the histograms
> -it iter_len:   Display the last iter_len iteration stats
> -ms ms_len: Display the last ms_len millisecond stats
> -pmd core:  Display only
> 
> The performance statistics are reset with the existing dpif-netdev/pmd-
> stats-clear command.
> 
> The output always contains the following global PMD statistics, similar to the
> pmd-stats-show command:
> 
> Time: 15:24:55.270
> Measurement duration: 1.008 s
> 
> pmd thread numa_id 0 core_id 1:
> 
>   Cycles:2419034712  (2.40 GHz)
>   Iterations:572817  (1.76 us/it)
>   - idle:486808  (15.9 % cycles)
>   - busy: 86009  (84.1 % cycles)
>   Packets:  2399607  (2381 Kpps, 848 cycles/pkt)
>   Datapath passes:  3599415  (1.50 passes/pkt)
>   - EMC hits:336472  ( 9.3 %)
>   - Megaflow hits:  3262943  (90.7 %, 1.00 subtbl lookups/hit)
>   - Upcalls:  0  ( 0.0 %, 0.0 us/upcall)
>   - Lost upcalls: 0  ( 0.0 %)
> 
> Signed-off-by: Jan Scheurich 
> ---
>  lib/dp-packet.h|   2 +
>  lib/dpif-netdev-perf.c | 309
> -
>  lib/dpif-netdev-perf.h | 173 ++-
>  lib/dpif-netdev.c  | 158 +++--
>  lib/netdev-dpdk.c  |  23 +++-
>  lib/netdev-dpdk.h  |  14 +++
>  ofproto/ofproto-dpif.c |   3 +-
>  7 files changed, 664 insertions(+), 18 deletions(-)
> 
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h index b4b721c..7950247 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -695,8 +695,10 @@ enum { NETDEV_MAX_BURST = 32 }; /* Maximum
> number packets in a batch. */
> 
>  struct dp_packet_batch {
>  size_t count;
> +size_t qfill; /* Number of packets in queue when reading rx burst.
> + */
>  bool trunc; /* true if the batch needs truncate. */
>  struct dp_packet *packets[NETDEV_MAX_BURST];
> +
>  };
> 
>  static inline void
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index
> 7d8b7b2..a66a48c 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -15,6 +15,7 @@
>   */
> 
>  #include 
> +#include 
> 
>  #ifdef DPDK_NETDEV
>  #include 
> @@ -27,13 +28,307 @@
> 
>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
> 
> +#ifdef DPDK_NETDEV
> +static uint64_t
> +get_tsc_hz(void)
> +{
> +return rte_get_tsc_hz();[[BO'M]] maG
> +}
> +#else
> +static uint64_t
> +read_tsc(void)
> +{
> +register uint64_t tsc asm("eax");
> +asm volatile (".byte 15, 

Re: [ovs-dev] [PATCH v5 1/3] dpif-netdev: Refactor PMD performance into dpif-netdev-perf

2018-01-09 Thread O Mahony, Billy
Hi Jan,

Thanks these patches are really very useful.

I haven't finished trying them out but thought you'd prefer to get initial 
comments earlier. I'll continue to try them out and revert with any further 
comments later.

Regards,
Billy. 



> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Thursday, January 4, 2018 12:08 PM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH v5 1/3] dpif-netdev: Refactor PMD performance
> into dpif-netdev-perf
> 
> Add module dpif-netdev-perf to host all PMD performance-related data
> structures and functions in dpif-netdev. Refactor the PMD stats handling in
> dpif-netdev and delegate whatever possible into the new module, using
> clean interfaces to shield dpif-netdev from the implementation details.
> Accordingly, the all PMD statistics members are moved from the main struct
> dp_netdev_pmd_thread into a dedicated member of type struct
> pmd_perf_stats.
> 
> Include Darrel's prior refactoring of PMD stats contained in [PATCH v5,2/3]
> dpif-netdev: Refactor some pmd stats:
> 
> 1. The cycles per packet counts are now based on packets received rather
> than packet passes through the datapath.
> 
> 2. Packet counters are now kept for packets received and packets
> recirculated. These are kept as separate counters for maintainability reasons.
> The cost of incrementing these counters is negligible.  These new counters
> are also displayed to the user.
> 
> 3. A display statistic is added for the average number of datapath passes per
> packet. This should be useful for user debugging and understanding of
> packet processing.
> 
> 4. The user visible 'miss' counter is used for successful upcalls, rather 
> than the
> sum of sucessful and unsuccessful upcalls. Hence, this becomes what user
> historically understands by OVS 'miss upcall'.
> The user display is annotated to make this clear as well.
> 
> 5. The user visible 'lost' counter remains as failed upcalls, but is 
> annotated to
> make it clear what the meaning is.
> 
> 6. The enum pmd_stat_type is annotated to make the usage of the stats
> counters clear.
> 
> 7. The subtable lookup stats is renamed to make it clear that it relates to
> masked lookups.
> 
> 8. The PMD stats test is updated to handle the new user stats of packets
> received, packets recirculated and average number of datapath passes per
> packet.
> 
> On top of that introduce a "-pmd " option to the PMD info commands
> to filter the output for a single PMD.
> 
> Signed-off-by: Jan Scheurich 
> Co-authored-by: Darrell Ball 
> Signed-off-by: Darrell Ball 
> ---
>  lib/automake.mk|   2 +
>  lib/dpif-netdev-perf.c |  66 +
>  lib/dpif-netdev-perf.h | 129 ++
>  lib/dpif-netdev.c  | 353 
> -
>  tests/pmd.at   |  22 +--
>  5 files changed, 357 insertions(+), 215 deletions(-)  create mode 100644
> lib/dpif-netdev-perf.c  create mode 100644 lib/dpif-netdev-perf.h
> 
> diff --git a/lib/automake.mk b/lib/automake.mk index 4b38a11..159319f
> 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -80,6 +80,8 @@ lib_libopenvswitch_la_SOURCES = \
>   lib/dpdk.h \
>   lib/dpif-netdev.c \
>   lib/dpif-netdev.h \
> + lib/dpif-netdev-perf.c \
> + lib/dpif-netdev-perf.h \
>   lib/dpif-provider.h \
>   lib/dpif.c \
>   lib/dpif.h \
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c new file mode
> 100644 index 000..7d8b7b2
> --- /dev/null
> +++ b/lib/dpif-netdev-perf.c
> @@ -0,0 +1,66 @@
> +/*
> + * Copyright (c) 2017 Ericsson AB.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include 
> +
> +#ifdef DPDK_NETDEV
> +#include 
> +#endif
> +
> +#include "openvswitch/dynamic-string.h"
> +#include "openvswitch/vlog.h"
> +#include "dpif-netdev-perf.h"
> +#include "timeval.h"
> +
> +VLOG_DEFINE_THIS_MODULE(pmd_perf);
> +
> +void
> +pmd_perf_stats_init(struct pmd_perf_stats *s) {
> +memset(s, 0 , sizeof(*s));
> +s->start_ms = time_msec();
> +}
> +
> +void
> +pmd_perf_read_counters(struct pmd_perf_stats *s,
> +   uint64_t stats[PMD_N_STATS]) {
> +uint64_t val;
> +
> +/* These loops subtracts reference values ('.zero[*]') from the counters.
> + * Since loads and stores are 

Re: [ovs-dev] [ovs-dev, v5, 1/3] dpif-netdev: Refactor PMD performance into dpif-netdev-perf

2018-01-09 Thread O Mahony, Billy
Hi All,

I know a v6 is under preparation so I just wanted to give some initial thoughts 
early before I review further.

/Billy.

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Ilya Maximets
> Sent: Tuesday, January 9, 2018 8:35 AM
> To: Jan Scheurich ; d...@openvswitch.org
> Subject: Re: [ovs-dev] [ovs-dev, v5, 1/3] dpif-netdev: Refactor PMD
> performance into dpif-netdev-perf
> 
> I'm a bit lost with all the data structures in lib/dpif-netdev-perf.h.
> It's too complex, especially with other patches applied. And no comments
> there.
> Can we simplify? I didn't review this part.
> 
> Comments inline.
> 
> Best regards, Ilya Maximets.
> 
> On 04.01.2018 15:07, Jan Scheurich wrote:
> > Add module dpif-netdev-perf to host all PMD performance-related data
> > structures and functions in dpif-netdev. Refactor the PMD stats
> > handling in dpif-netdev and delegate whatever possible into the new
> > module, using clean interfaces to shield dpif-netdev from the
> > implementation details. Accordingly, the all PMD statistics members
[[BO'M]] typo "the all"
> > are moved from the main struct dp_netdev_pmd_thread into a dedicated
> > member of type struct pmd_perf_stats.
> >
> > Include Darrel's prior refactoring of PMD stats contained in [PATCH
> > v5,2/3] dpif-netdev: Refactor some pmd stats:
> >
> > 1. The cycles per packet counts are now based on packets received
> > rather than packet passes through the datapath.
[[BO'M]] ie. a packet that is recirculated was previously counted as two 
packets but now just as one? Maybe confirm that in commit msg.
> >
> > 2. Packet counters are now kept for packets received and packets
> > recirculated. These are kept as separate counters for maintainability
> > reasons. The cost of incrementing these counters is negligible.  These
> > new counters are also displayed to the user.
> >
> > 3. A display statistic is added for the average number of datapath
> > passes per packet. This should be useful for user debugging and
> > understanding of packet processing.
> >
> > 4. The user visible 'miss' counter is used for successful upcalls,
> > rather than the sum of sucessful and unsuccessful upcalls. Hence, this
[[BO'M]] typo
> > becomes what user historically understands by OVS 'miss upcall'.
> > The user display is annotated to make this clear as well.
> >

> 
> > > +int error = handle_packet_upcall(pmd, packet, &keys[i],
> > > +   &actions, &put_actions);
> > +
> > +if (OVS_UNLIKELY(error)) {
> > +upcall_fail_cnt++;
> > +} else {
> > +upcall_ok_cnt++;
> > +}
> 
> Also, 'error' is not used. How about just:
> 
>  if (!handle_packet_upcall(pmd, packet, &keys[i],
>&actions, &put_actions)) {
>  upcall_ok_cnt++;
>  } else {
>  upcall_fail_cnt++;
>  }
> 
> 
> >  }
[[BO'M]] I find the original much more natural to read. Unless it's shown that 
the alternative compiles to something more efficient, I think most readers of the 
code would prefer the original longhand version.
> >
> >  ofpbuf_uninit();
> > @@ -5212,8 +5148,7 @@ fast_path_processing(struct
> dp_netdev_pmd_thread *pmd,
> >  DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
> >  if (OVS_UNLIKELY(!rules[i])) {
> >  dp_packet_delete(packet);
> > -lost_cnt++;
> > -miss_cnt++;
> > +upcall_fail_cnt++;
> >  }
> >  }
> >  }
> > @@ -5231,10 +5166,14 @@ fast_path_processing(struct
> dp_netdev_pmd_thread *pmd,
> >  dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> n_batches);
> >  }
> >
> > -dp_netdev_count_packet(pmd, DP_STAT_MASKED_HIT, cnt -
> miss_cnt);
> > -dp_netdev_count_packet(pmd, DP_STAT_LOOKUP_HIT, lookup_cnt);
> > -dp_netdev_count_packet(pmd, DP_STAT_MISS, miss_cnt);
> > -dp_netdev_count_packet(pmd, DP_STAT_LOST, lost_cnt);
> > +pmd_perf_update_counter(&pmd->perf_stats,
> PMD_STAT_MASKED_HIT,
> > +cnt - upcall_ok_cnt - upcall_fail_cnt);
> > +pmd_perf_update_counter(&pmd->perf_stats,
> PMD_STAT_MASKED_LOOKUP,
> > +lookup_cnt);
> > +pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MISS,
> > +upcall_ok_cnt);
> > +pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_LOST,
> > +upcall_fail_cnt);
> >  }
> >
> >  /* Packets enter the datapath from a port (or from recirculation) here.
> > diff --git a/tests/pmd.at b/tests/pmd.at index e39a23a..0356f87 100644
> > --- a/tests/pmd.at
> > +++ b/tests/pmd.at
> > @@ -170,13 +170,16 @@ dummy@ovs-dummy: hit:0 missed:0
> > p0 7/1: (dummy-pmd: configured_rx_queues=4,
> > configured_tx_queues=, requested_rx_queues=4,
> > requested_tx_queues=)
> >  ])
> >

Re: [ovs-dev] [PATCH v3 1/3] dpif-netdev: Refactor PMD performance into dpif-netdev-perf

2017-12-08 Thread O Mahony, Billy
Hi Jan,

I had problems applying the later patches in this series, so I'm just reviewing 
this one for now. I tried applying them against several revisions.

The second patch ([ovs-dev,v3,2/3] dpif-netdev: Detailed performance stats for 
PMDs ) fails with 
fatal: patch fragment without header at line 708: @@ -1073,6 +1155,12 @@ 
dpif_netdev_init(void)

Overall, not only is the user-visible output clearer, but the code is also more 
consistent and easier to understand. 

I tested this patch by applying it to: 3728b3b Ben Pfaff 2017-11-20 Merge 
branch 'dpdk_merge' of https://github.com/istokes/ovs into HEAD

These are the issues I did find:
1. make check #1159 "ofproto-dpif patch ports" consistently fails for me with 
this patch applied

2. ./utilities/checkpatch.py reports some line length issues:
== Checking 
"dpif-netdev-perf/ovs-dev-v3-1-3-dpif-netdev-Refactor-PMD-performance-into-dpif-netdev-perf.patch"
 ==
ERROR: Too many signoffs; are you missing Co-authored-by lines?
WARNING: Line length is >79-characters long
#346 FILE: lib/dpif-netdev.c:544:
struct ovs_refcount ref_cnt;/* Every reference must be refcount'ed. 
*/

WARNING: Line has non-spaces leading whitespace
WARNING: Line has trailing whitespace
#347 FILE: lib/dpif-netdev.c:545:


WARNING: Line length is >79-characters long
#349 FILE: lib/dpif-netdev.c:547:
 * XPS disabled for this netdev. All static_tx_qid's are unique and less

WARNING: Line has non-spaces leading whitespace
WARNING: Line has trailing whitespace
#352 FILE: lib/dpif-netdev.c:550:


WARNING: Line length is >79-characters long
WARNING: Line has trailing whitespace
#413 FILE: lib/dpif-netdev.c:610:
OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct pmd_perf_stats perf_stats;

Lines checked: 976, Warnings: 8, Errors: 1

3. Does the new 'pmd' arg to pmd-stats-show interfere with the existing [dp] 
arg? 

sudo ./utilities/ovs-appctl dpif-netdev/pmd-stats-show -pmd 1 
netdev@dpif-netdev
"dpif-netdev/pmd-stats-show" command takes at most 2 arguments
 ovs-appctl: ovs-vswitchd: server returned an error

Otherwise it looks like a really useful patch. And the remainder of the series 
more so.

Thanks,
Billy. 

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Tuesday, November 21, 2017 12:38 AM
> To: 'ovs-dev@openvswitch.org' 
> Subject: [ovs-dev] [PATCH v3 1/3] dpif-netdev: Refactor PMD performance into
> dpif-netdev-perf
> 
> Add module dpif-netdev-perf to host all PMD performance-related data
> structures and functions in dpif-netdev. Refactor the PMD stats handling in 
> dpif-
> netdev and delegate whatever possible into the new module, using clean
> interfaces to shield dpif-netdev from the implementation details. Accordingly,
> the all PMD statistics members are moved from the main struct
> dp_netdev_pmd_thread into a dedicated member of type struct pmd_perf_stats.
> 
> Include Darrel's prior refactoring of PMD stats contained in [PATCH v5,2/3] 
> dpif-
> netdev: Refactor some pmd stats:
> 
> 1. The cycles per packet counts are now based on packets received rather than
> packet passes through the datapath.
> 
> 2. Packet counters are now kept for packets received and packets recirculated.
> These are kept as separate counters for maintainability reasons. The cost of
> incrementing these counters is negligible.  These new counters are also
> displayed to the user.
> 
> 3. A display statistic is added for the average number of datapath passes per
> packet. This should be useful for user debugging and understanding of packet
> processing.
> 
> 4. The user visible 'miss' counter is used for successful upcalls, rather 
> than the
> sum of sucessful and unsuccessful upcalls. Hence, this becomes what user
> historically understands by OVS 'miss upcall'.
> The user display is annotated to make this clear as well.
> 
> 5. The user visible 'lost' counter remains as failed upcalls, but is 
> annotated to
> make it clear what the meaning is.
> 
> 6. The enum pmd_stat_type is annotated to make the usage of the stats
> counters clear.
> 
> 7. The subtable lookup stats is renamed to make it clear that it relates to 
> masked
> lookups.
> 
> 8. The PMD stats test is updated to handle the new user stats of packets
> received, packets recirculated and average number of datapath passes per
> packet.
> 
> On top of that introduce a "-pmd " option to the PMD info commands to
> filter the output for a single PMD.
> 
> Signed-off-by: Jan Scheurich 
> Signed-off-by: Darrell Ball 
> 
> ---
>  lib/automake.mk|   2 +
>  lib/dpif-netdev-perf.c |  67 +
>  lib/dpif-netdev-perf.h | 123 
>  lib/dpif-netdev.c  | 371 
> -
>  tests/pmd.at   |  22 +--
>  5 files changed, 358 insertions(+), 227 deletions(-)  create mode 100644 
> lib/dpif-
> netdev-perf.c  

Re: [ovs-dev] [PATCH 2/2] Adding configuration option to whitelist DPDK physical ports.

2017-12-08 Thread O Mahony, Billy
I can confirm that other_config:dpdk-extra can indeed already be used to change 
the hugepage file prefix (admittedly without inserting a ref to the pid) and to 
specify a PCI whitelist. 
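
For example (the PCI addresses and file prefix below are just placeholders),
something along these lines passes a two-port whitelist and a separate hugepage
file prefix straight through to the DPDK EAL:

    ovs-vsctl set Open_vSwitch . \
        other_config:dpdk-extra="-w 0000:05:00.0 -w 0000:05:00.1 --file-prefix=ovs2"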

Regards,
Billy. 

> -Original Message-
> From: Mooney, Sean K
> Sent: Thursday, December 7, 2017 5:53 PM
> To: Chandran, Sugesh <sugesh.chand...@intel.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; d...@openvswitch.org; b...@ovn.org
> Cc: Mooney, Sean K <sean.k.moo...@intel.com>
> Subject: RE: [ovs-dev] [PATCH 2/2] Adding configuration option to whitelist
> DPDK physical ports.
> 
> 
> 
> > -Original Message-
> > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> > boun...@openvswitch.org] On Behalf Of Chandran, Sugesh
> > Sent: Thursday, December 7, 2017 5:07 PM
> > To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org;
> > b...@ovn.org
> > Subject: Re: [ovs-dev] [PATCH 2/2] Adding configuration option to
> > whitelist DPDK physical ports.
> >
> >
> >
> > Regards
> > _Sugesh
> >
> > > -Original Message-
> > > From: O Mahony, Billy
> > > Sent: Thursday, December 7, 2017 11:47 AM
> > > To: Chandran, Sugesh <sugesh.chand...@intel.com>;
> > d...@openvswitch.org;
> > > b...@ovn.org
> > > Subject: RE: [ovs-dev] [PATCH 2/2] Adding configuration option to
> > > whitelist DPDK physical ports.
> > >
> > > Hi Sugesh,
> > >
> > > > -Original Message-
> > > > From: Chandran, Sugesh
> > > > Sent: Wednesday, December 6, 2017 6:23 PM
> > > > To: O Mahony, Billy <billy.o.mah...@intel.com>;
> > d...@openvswitch.org;
> > > > b...@ovn.org
> > > > Subject: RE: [ovs-dev] [PATCH 2/2] Adding configuration option to
> > > > whitelist DPDK physical ports.
> > > >
> > > > Thank you Billy for the review.
> > > > Please find below my reply.
> > > >
> > > > Regards
> > > > _Sugesh
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: O Mahony, Billy
> > > > > Sent: Wednesday, December 6, 2017 5:31 PM
> > > > > To: Chandran, Sugesh <sugesh.chand...@intel.com>;
> > > > > d...@openvswitch.org; b...@ovn.org
> > > > > Subject: RE: [ovs-dev] [PATCH 2/2] Adding configuration option
> > > > > to whitelist DPDK physical ports.
> > > > >
> > > > > Hi Sugesh,
> > > > >
> > > > > This is definitely a very useful feature. I'm looking forward to
> > > > > running trex on the same DUT as my ovs-dpdk.
> [Mooney, Sean K]  you can all ready to this you just need to set the 
> whitelist In
> other_config:dpdk-extra just repeat "-w $address" for each device.
> To have two dpdk primary processes on the same system you will also need to
> change The hugepage prfix used be dpdk which you can also do via the dpdk-
> extra option.
> 
> After this patch we will still be able to specify the whitelist using
> other_config:dpdk-extra correct? If not this may break ovs-dpdk support in
> openstack installers. I ported our whitelist code in networking-ovs-dpdk to 
> use
> dpdk-extra when when we moved the dpdk params to the db and I also added it
> to kolla.
> im pretty sure tripple0 and fule also do the same.
> 
> > > > >
> > > > > However I'd suggest adding an sscanf or some such to verify that
> > > > > the domain is also specified for each whitelist member. And
> > either
> > > > > add the default of '' or complain loudly if the domain is
> > absent.
> > > > [Sugesh] Will throw an error in that case then .
> > > >
> > > > >
> > > > > Currently (without this patch) you must specify the domain when
> > > > > adding
> > > ports:
> > > > >Vsctl add-port ... options:dpdk-devargs=:05:00.0 Or else
> > an
> > > > > error such as 'Cannot find unplugged device (05:00.0)'  is
> > reported.
> > > > >
> > > > > And with the patch if you include the domain in the other_config
> > (e.g.
> > > > > other_config:dpdk-whitelist-pci-ids=":05:00.0") everything
> > > > > works just as before.
> > > > >
> > > > > However with the patch if you add the whitelist *without* a
> > domain e.g.
> > > > >   ovs-vsctl --no-wait set O

Re: [ovs-dev] [PATCH 2/2] Adding configuration option to whitelist DPDK physical ports.

2017-12-07 Thread O Mahony, Billy
Hi Sugesh,

> -Original Message-
> From: Chandran, Sugesh
> Sent: Wednesday, December 6, 2017 6:23 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org;
> b...@ovn.org
> Subject: RE: [ovs-dev] [PATCH 2/2] Adding configuration option to whitelist
> DPDK physical ports.
> 
> Thank you Billy for the review.
> Please find below my reply.
> 
> Regards
> _Sugesh
> 
> 
> > -Original Message-
> > From: O Mahony, Billy
> > Sent: Wednesday, December 6, 2017 5:31 PM
> > To: Chandran, Sugesh <sugesh.chand...@intel.com>; d...@openvswitch.org;
> > b...@ovn.org
> > Subject: RE: [ovs-dev] [PATCH 2/2] Adding configuration option to
> > whitelist DPDK physical ports.
> >
> > Hi Sugesh,
> >
> > This is definitely a very useful feature. I'm looking forward to
> > running trex on the same DUT as my ovs-dpdk.
> >
> > However I'd suggest adding an sscanf or some such to verify that the
> > domain is also specified for each whitelist member. And either add the
> > default of '' or complain loudly if the domain is absent.
> [Sugesh] Will throw an error in that case then .
> 
> >
> > Currently (without this patch) you must specify the domain when adding 
> > ports:
> >Vsctl add-port ... options:dpdk-devargs=:05:00.0 Or else an
> > error such as 'Cannot find unplugged device (05:00.0)'  is reported.
> >
> > And with the patch if you include the domain in the other_config (e.g.
> > other_config:dpdk-whitelist-pci-ids=":05:00.0") everything works
> > just as before.
> >
> > However with the patch if you add the whitelist *without* a domain e.g.
> > ovs-vsctl --no-wait set Open_vSwitch .
> > other_config:dpdk-whitelist-pci- ids="05:00.0"
> >
> > There is no immediate error. However later when doing add-port if you
> > include the domain (current required practice) you will get an error.
> > If you omit the domain all is well.
> [Sugesh] It looks to me, the dpdk-devargs need the PCI id with the ''.
> But to bind and PCI scan its not necessary.
> So to keep it consistent, I would add check for PCI-ID in whitelist config 
> too, and
> throw error incase pci-id are mentioned wrong(means without ''.
> Does it looks OK to you?

[[BO'M]] I think the error is the right thing to do. It would be tempting to
insert the default '0000' if the domain is omitted, but then you would have a
confusing inconsistency: it would be OK to omit the domain in one place
(whitelist) but not in the other (add-port).
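
For reference, the kind of check I had in mind is roughly the following (a
rough sketch only, not code from the patch):

    #include <stdbool.h>
    #include <stdio.h>

    /* Accept a whitelist entry only if it carries the full domain:bus:dev.func
     * form, e.g. "0000:05:00.0". */
    static bool
    pci_id_has_domain(const char *pci_id)
    {
        unsigned int domain, bus, dev, func;
        int end = 0;

        return sscanf(pci_id, "%x:%x:%x.%x%n",
                      &domain, &bus, &dev, &func, &end) == 4
               && pci_id[end] == '\0';
    }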

> >
> > It's a little bit strange as regardless of domain or no domain in the
> > other_config the PCI probe always reports the NIC as expected:
> > 2017-12-06T16:55:27Z|00013|dpdk|INFO|EAL: PCI device 0000:05:00.0
> > on NUMA socket -1
> > 2017-12-06T16:55:27Z|00014|dpdk|WARN|EAL:   Invalid NUMA socket,
> > default to 0
> > 2017-12-06T16:55:27Z|00015|dpdk|INFO|EAL:   probe driver: 8086:1572
> > net_i40e
> >
> > I'll be using the other patch in this series "isolate rte-mempool
> > allocation" over the next few days so I'll review that in due course.
> >
> > Thanks,
> > Billy.
> >
> > > -Original Message-
> > > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> > > boun...@openvswitch.org] On Behalf Of Sugesh Chandran
> > > Sent: Friday, November 10, 2017 1:29 AM
> > > To: d...@openvswitch.org; b...@ovn.org
> > > Subject: [ovs-dev] [PATCH 2/2] Adding configuration option to
> > > whitelist DPDK physical ports.
> > >
> > > Adding an OVS configuration option to whitelist DPDK physical ports.
> > > By default, multiple instances of DPDK running on a single platform
> > > cannot use physical ports at the same time even though the ports are distinct.
> > >
> > > The EAL init scans all the ports that are bound to DPDK and
> > > initializes the drivers accordingly. This happens for every DPDK process 
> > > init.
> > > In a multi-instance deployment use case, this causes issues when using
> > > physical NIC ports.
> > > Consider two DPDK processes running on a single platform: the
> > > second DPDK primary process will try to initialize the drivers
> > > for all the physical ports even though they may already be in use by the first DPDK 
> > > process.
> > >
> > > To avoid this situation user can whitelist the ports for each DPDK 
> > > application.
> > > Whitelisting of ports/PCI-ID in a DPDK process will limit the
> > > eal-init only on th

Re: [ovs-dev] [PATCH 2/2] Adding configuration option to whitelist DPDK physical ports.

2017-12-06 Thread O Mahony, Billy
Hi Sugesh,

This is definitely a very useful feature. I'm looking forward to running trex 
on the same DUT as my ovs-dpdk.

However I'd suggest adding an sscanf or some such to verify that the domain is 
also specified for each whitelist member. And either add the default of '0000' 
or complain loudly if the domain is absent.

Currently (without this patch) you must specify the domain when adding ports:
   ovs-vsctl add-port ... options:dpdk-devargs=0000:05:00.0
Or else an error such as 'Cannot find unplugged device (05:00.0)' is reported.

And with the patch if you include the domain in the other_config (e.g. 
other_config:dpdk-whitelist-pci-ids="0000:05:00.0") everything works just as 
before.

However with the patch if you add the whitelist *without* a domain e.g.
ovs-vsctl --no-wait set Open_vSwitch . 
other_config:dpdk-whitelist-pci-ids="05:00.0"

There is no immediate error. However later when doing add-port if you include 
the domain (current required practice) you will get an error. If you omit the 
domain all is well.

It's a little bit strange as regardless of domain or no domain in the 
other_config the PCI probe always reports the NIC as expected:
2017-12-06T16:55:27Z|00013|dpdk|INFO|EAL: PCI device 0000:05:00.0 on NUMA 
socket -1
2017-12-06T16:55:27Z|00014|dpdk|WARN|EAL:   Invalid NUMA socket, default to 0
2017-12-06T16:55:27Z|00015|dpdk|INFO|EAL:   probe driver: 8086:1572 net_i40e

I'll be using the other patch in this series "isolate rte-mempool allocation" 
over the next few days so I'll review that in due course.

Thanks,
Billy.

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Sugesh Chandran
> Sent: Friday, November 10, 2017 1:29 AM
> To: d...@openvswitch.org; b...@ovn.org
> Subject: [ovs-dev] [PATCH 2/2] Adding configuration option to whitelist DPDK
> physical ports.
> 
> Adding an OVS configuration option to whitelist DPDK physical ports. By default,
> multiple instances of DPDK running on a single platform cannot use physical
> ports at the same time even though the ports are distinct.
> 
> The EAL init scans all the ports that are bound to DPDK and initializes the 
> drivers
> accordingly. This happens for every DPDK process init.
> In a multi-instance deployment use case, this causes issues when using physical 
> NIC
> ports.
> Consider two DPDK processes running on a single platform: the second
> DPDK primary process will try to initialize the drivers for all the physical 
> ports
> even though they may already be in use by the first DPDK process.
> 
> To avoid this situation the user can whitelist the ports for each DPDK 
> application.
> Whitelisting of ports/PCI-IDs in a DPDK process will limit the eal-init only 
> to those
> ports.
> 
> To whitelist two physical ports "0000:06:00.0" and "0000:06:00.1", the
> configuration option in OVS would be
>   ovs-vsctl set Open_vSwitch . other_config:dpdk-whitelist-pci-
> ids="0000:06:00.0,0000:06:00.1"
> 
> To update the whitelist ports, OVS daemon has to be restarted.
> 
> Signed-off-by: Sugesh Chandran 
> ---
>  lib/dpdk.c   | 29 +
>  vswitchd/vswitch.xml | 21 +
>  2 files changed, 50 insertions(+)
> 
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index 9d187c7..0f11977 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -323,6 +323,34 @@ dpdk_isolate_rte_mem_config(const struct smap
> *ovs_other_config,  }
> 
>  static void
> +dpdk_whitelist_pci_ids(const struct smap *ovs_other_config, char ***argv,
> +                       int *argc)
> +{
> +    const char *pci_ids;
> +    char *pci_dev;
> +    int len;
> +    int i;
> +    pci_ids = smap_get(ovs_other_config, "dpdk-whitelist-pci-ids");
> +    if (!pci_ids) {
> +        return;
> +    }
> +    len = strlen(pci_ids);
> +    do {
> +        i = strcspn(pci_ids, ",");
> +        pci_dev = xmemdup0(pci_ids, i);
> +        if (!strlen(pci_dev)) {
> +            break;
> +        }
> +        *argv = grow_argv(argv, *argc, 2);
> +        (*argv)[(*argc)++] = xstrdup("-w");
> +        (*argv)[(*argc)++] = pci_dev;
> +        i++;
> +        pci_ids += i;
> +        len -= i;
> +    } while (pci_ids && len > 0);
> +}
> +
> +static void
>  dpdk_init__(const struct smap *ovs_other_config)  {
>  char **argv = NULL, **argv_to_release = NULL; @@ -409,6 +437,7 @@
> dpdk_init__(const struct smap *ovs_other_config)
>  }
> 
>      dpdk_isolate_rte_mem_config(ovs_other_config, &argv, &argc);
> +    dpdk_whitelist_pci_ids(ovs_other_config, &argv, &argc);
>      argv = grow_argv(&argv, argc, 1);
>      argv[argc] = NULL;
> 
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
> 7462b30..0b64b25 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -442,6 +442,27 @@
>  
>
> 
> +  
> +
> +  Specifies list of pci-ids separated by , for whitelisting available
> +  physical NIC ports in OVS. The option valid 

Re: [ovs-dev] [PATCH v4 2/3] dpif-netdev: Rename rxq_cycle_sort to compare_rxq_cycles.

2017-11-24 Thread O Mahony, Billy
Acked-by: Billy O'Mahony

> -Original Message-
> From: Kevin Traynor [mailto:ktray...@redhat.com]
> Sent: Thursday, November 23, 2017 7:42 PM
> To: d...@openvswitch.org; aserd...@ovn.org; i.maxim...@samsung.com; O
> Mahony, Billy <billy.o.mah...@intel.com>; Stokes, Ian <ian.sto...@intel.com>
> Cc: Kevin Traynor <ktray...@redhat.com>
> Subject: [PATCH v4 2/3] dpif-netdev: Rename rxq_cycle_sort to
> compare_rxq_cycles.
> 
> This function is used for comparison between queues as part of the sort. It 
> does
> not do the sort itself.
> As such, give it a more appropriate name.
> 
> Suggested-by: Billy O'Mahony <billy.o.mah...@intel.com>
> Signed-off-by: Kevin Traynor <ktray...@redhat.com>
> ---
> 
> V4: Added patch into series after suggestion by Billy
> 
>  lib/dpif-netdev.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index f5cdd92..657df71 
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3446,5 +3446,5 @@ rr_numa_list_destroy(struct rr_numa_list *rr)
>  /* Sort Rx Queues by the processing cycles they are consuming. */
>  static int
> -rxq_cycle_sort(const void *a, const void *b)
> +compare_rxq_cycles(const void *a, const void *b)
>  {
>  struct dp_netdev_rxq *qa;
> @@ -3535,5 +3535,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned)
> OVS_REQUIRES(dp->port_mutex)
>  /* Sort the queues in order of the processing cycles
>   * they consumed during their last pmd interval. */
> -qsort(rxqs, n_rxqs, sizeof *rxqs, rxq_cycle_sort);
> +qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
>  }
> 
> --
> 1.8.3.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [RFC PATCH v3 8/8] netdev-dpdk: support multi-segment jumbo frames

2017-11-23 Thread O Mahony, Billy
Hi Mark,

Just one comment below.

/Billy

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Mark Kavanagh
> Sent: Tuesday, November 21, 2017 6:29 PM
> To: d...@openvswitch.org; qiud...@chinac.com
> Subject: [ovs-dev] [RFC PATCH v3 8/8] netdev-dpdk: support multi-segment
> jumbo frames
> 
> Currently, jumbo frame support for OvS-DPDK is implemented by increasing the
> size of mbufs within a mempool, such that each mbuf within the pool is large
> enough to contain an entire jumbo frame of a user-defined size. Typically, for
> each user-defined MTU, 'requested_mtu', a new mempool is created, containing
> mbufs of size ~requested_mtu.
> 
> With the multi-segment approach, a port uses a single mempool, (containing
> standard/default-sized mbufs of ~2k bytes), irrespective of the user-requested
> MTU value. To accommodate jumbo frames, mbufs are chained together, where
> each mbuf in the chain stores a portion of the jumbo frame. Each mbuf in the
> chain is termed a segment, hence the name.
> 
> == Enabling multi-segment mbufs ==
> Multi-segment and single-segment mbufs are mutually exclusive, and the user
> must decide on which approach to adopt on init. The introduction of a new
> OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This is a global 
> boolean
> value, which determines how jumbo frames are represented across all DPDK
> ports. In the absence of a user-supplied value, 'dpdk-multi-seg-mbufs' 
> defaults
> to false, i.e. multi-segment mbufs must be explicitly enabled / single-segment
> mbufs remain the default.
> 
[[BO'M]] Would it be more useful if multi-segment were enabled by default?
Does enabling multi-segment mbufs result in much of a performance decrease when
not using jumbo frames? Either because jumbo frames are not coming in on the
ingress port or because the MTU is set not to accept jumbo frames.

Obviously not a blocker to this patch-set. Maybe something to be looked at in 
the future. 
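
For my own understanding of the chaining described in the commit message, each
segment is just a standard-sized mbuf linked via m->next - e.g. (illustrative
sketch only, not code from the patch):

    #include <stdint.h>
    #include <rte_mbuf.h>

    /* Walk a multi-segment mbuf chain and add up the bytes per segment. */
    static uint32_t
    chain_bytes(const struct rte_mbuf *m)
    {
        uint32_t total = 0;

        for (; m != NULL; m = m->next) {
            total += m->data_len;   /* bytes held in this segment only */
        }
        return total;               /* equals pkt_len of the head mbuf */
    }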

> Setting the field is identical to setting existing DPDK-specific OVSDB
> fields:
> 
> ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
> ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
> ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
> 
> Signed-off-by: Mark Kavanagh 
> ---
>  NEWS |  1 +
>  lib/dpdk.c   |  7 +++
>  lib/netdev-dpdk.c| 43 ---
>  lib/netdev-dpdk.h|  1 +
>  vswitchd/vswitch.xml | 20 
>  5 files changed, 69 insertions(+), 3 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index c15dc24..657b598 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -15,6 +15,7 @@ Post-v2.8.0
> - DPDK:
>   * Add support for DPDK v17.11
>   * Add support for vHost IOMMU feature
> + * Add support for multi-segment mbufs
> 
>  v2.8.0 - 31 Aug 2017
>  
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index 8da6c32..4c28bd0 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -450,6 +450,13 @@ dpdk_init__(const struct smap *ovs_other_config)
> 
>  /* Finally, register the dpdk classes */
>  netdev_dpdk_register();
> +
> +bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config,
> +"dpdk-multi-seg-mbufs", false);
> +if (multi_seg_mbufs_enable) {
> +VLOG_INFO("DPDK multi-segment mbufs enabled\n");
> +netdev_dpdk_multi_segment_mbufs_enable();
> +}
>  }
> 
>  void
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index 36275bd..293edad
> 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -65,6 +65,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> 
>  VLOG_DEFINE_THIS_MODULE(netdev_dpdk);
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +static bool dpdk_multi_segment_mbufs = false;
> 
>  #define DPDK_PORT_WATCHDOG_INTERVAL 5
> 
> @@ -500,6 +501,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, uint16_t
> frame_len)
>+ dev->requested_n_txq * dev->requested_txq_size
>+ MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
> NETDEV_MAX_BURST
>+ MIN_NB_MBUF;
> +/* XXX (RFC) - should n_mbufs be increased if multi-seg mbufs are
> + used? */
> 
>  ovs_mutex_lock(_mp_mutex);
>  do {
> @@ -568,7 +570,13 @@ dpdk_mp_free(struct rte_mempool *mp)
> 
>  /* Tries to allocate a new mempool - or re-use an existing one where
>   * appropriate - on requested_socket_id with a size determined by
> - * requested_mtu and requested Rx/Tx queues.
> + * requested_mtu and requested Rx/Tx queues. Some properties of the
> + mempool's
> + * elements are dependent on the value of 'dpdk_multi_segment_mbufs':
> + * - if 'true', then the mempool contains standard-sized mbufs that are 
> chained
> + *   together to accommodate packets of size 'requested_mtu'.
> + * - if 'false', 

Re: [ovs-dev] [RFC PATCH v2] dpif-netdev: Add port/queue tiebreaker to rxq_cycle_sort.

2017-11-23 Thread O Mahony, Billy
Hi Kevin,

My 2c below..

/Billy

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Kevin Traynor
> Sent: Wednesday, November 22, 2017 7:11 PM
> To: d...@openvswitch.org; aserd...@ovn.org
> Subject: [ovs-dev] [RFC PATCH v2] dpif-netdev: Add port/queue tiebreaker to
> rxq_cycle_sort.
> 
> rxq_cycle_sort is used to sort the rx queues by their measured number of 
> cycles.
> In the event that they are equal 0 could be returned.
> However, it is observed that returning 0 results in a different sort order on
> Windows/Linux. This is ok in practice but it causes a unit test failure for
> "1007: PMD - pmd-cpu-mask/distribution of rx queues" on Windows.
> 
> In order to have a consistent sort result, introduce a tiebreaker of 
> port/queue.
> 
> Fixes: 655856ef39b9 ("dpif-netdev: Change rxq_scheduling to use rxq processing
> cycles.")
> Reported-by: Alin Gabriel Serdean 
> Signed-off-by: Kevin Traynor 
> ---
> 
> v2: Inadvertently reversed the order for non-tiebreak cases in v1. Fix that.
> 
>  lib/dpif-netdev.c | 15 ---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 0a62630..57451e9 
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3452,4 +3452,5 @@ rxq_cycle_sort(const void *a, const void *b)
>  uint64_t total_qa, total_qb;
>  unsigned i;
> +int winner = 1;
> 
>  qa = *(struct dp_netdev_rxq **) a;
> @@ -3464,8 +3465,16 @@ rxq_cycle_sort(const void *a, const void *b)
>  dp_netdev_rxq_set_cycles(qb, RXQ_CYCLES_PROC_HIST, total_qb);
> 
[[BO'M]]
I think it's worth adding a comment to state that this compare function never
returns 0, and the reason for this (the sort order difference on different
OSs).

Also the function should probably be called rxq_cycle_compare as it's really
comparing entries rather than sorting a list of entries, and it's used as the
compar arg to qsort.
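
Something along these lines is what I mean (a self-contained sketch with stub
fields, not the patch code itself):

    #include <stdlib.h>

    struct rxq_stub {
        unsigned long long cycles;  /* processing cycles in the last interval */
        int port_no;                /* tiebreak 1 */
        int queue_id;               /* tiebreak 2 */
    };

    /* compar argument for qsort(): sorts by cycles, descending.  Deliberately
     * never returns 0 for two distinct queues so that the final order does not
     * depend on how the host OS qsort() handles equal elements. */
    static int
    rxq_cycle_compare(const void *a_, const void *b_)
    {
        const struct rxq_stub *a = a_;
        const struct rxq_stub *b = b_;

        if (a->cycles != b->cycles) {
            return a->cycles > b->cycles ? -1 : 1;
        }
        if (a->port_no != b->port_no) {
            return a->port_no < b->port_no ? -1 : 1;
        }
        return a->queue_id - b->queue_id;
    }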

> -if (total_qa >= total_qb) {
> -return -1;
> +if (total_qa > total_qb) {
> +winner = -1;
> +} else if (total_qa == total_qb) {
> +/* Cycles are the same.  Tiebreak on port/queue id. */
> +if (qb->port->port_no > qa->port->port_no) {
> +winner = -1;
> +} else if (qa->port->port_no == qb->port->port_no) {
> +winner = netdev_rxq_get_queue_id(qa->rx)
> +- netdev_rxq_get_queue_id(qb->rx);
> +}
>  }
> -return 1;
> +return winner;
>  }
> 
> --
> 1.8.3.1
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] dpif-netdev: Refactor datapath flow cache

2017-11-21 Thread O Mahony, Billy
Hi Jan,

Thanks, that's a really interesting patch.

Currently does not apply to head of master - what rev can I apply it to?

Some more below - including one way down in the code.

Thanks,
/Billy.

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Monday, November 20, 2017 5:33 PM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH] dpif-netdev: Refactor datapath flow cache
> 
> So far the netdev datapath uses an 8K EMC to speed up the lookup of frequently
> used flows by comparing the parsed packet headers against the miniflow of a
> cached flow, using 13 bits of the packet RSS hash as index. The EMC is too 
> small
> for many applications with 100K or more parallel packet flows so that EMC
> thrashing actually degrades performance.
> Furthermore, the size of struct miniflow and the flow copying cost prevents us
> from making it much larger.
> 
> At the same time the lookup cost of the megaflow classifier (DPCLS) is 
> increasing
> as the number of frequently hit subtables grows with the complexity of 
> pipeline
> and the number of recirculations.
> 
> To close the performance gap for many parallel flows, this patch introduces 
> the
> datapath flow cache (DFC) with 1M entries as lookup stage between EMC and
> DPCLS. It directly maps 20 bits of the RSS hash to a pointer to the last hit
> megaflow entry and performs a masked comparison of the packet flow with the
> megaflow key to confirm the hit. This avoids the costly DPCLS lookup even for
> very large number of parallel flows with a small memory overhead.
> 
> Due to the large size of the DFC and the low risk of DFC thrashing, any DPCLS hit
> immediately inserts an entry in the DFC so that subsequent packets get sped
> up. The DFC thus also accelerates short-lived flows.
> 
> To further accelerate the lookup of few elephant flows, every DFC hit 
> triggers a
> probabilistic EMC insertion of the flow. As the DFC entry is already in place 
> the
> default EMC insertion probability can be reduced to
> 1/1000 to minimize EMC thrashing should there still be many fat flows.
> The inverse EMC insertion probability remains configurable.
> 
> The EMC implementation is simplified by removing the possibility to store a 
> flow
> in two slots, as there is no particular reason why two flows should 
> systematically
> collide (the RSS hash is not symmetric).
> The maximum size of the EMC flow key is limited to 256 bytes to reduce the
> memory footprint. This should be sufficient to hold most real life packet flow
> keys. Larger flows are not installed in the EMC.
[[BO'M]] Does miniflow_extract work ok with the reduced miniflow size?
Miniflow comment says: 
  Caller is responsible for initializing 'dst' with enough storage for 
FLOW_U64S * 8 bytes.
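
Also, just to check I've read the DFC lookup description correctly, my mental
model is roughly the following (stub sketch, not code from the patch;
dfc_flow_matches() is a hypothetical stand-in for the masked comparison against
the cached megaflow key):

    #include <stdbool.h>
    #include <stdint.h>

    struct dp_netdev_flow;              /* as in lib/dpif-netdev.c */
    struct netdev_flow_key;             /* as in lib/dpif-netdev.c */
    bool dfc_flow_matches(const struct dp_netdev_flow *,
                          const struct netdev_flow_key *);  /* hypothetical */

    #define DFC_ENTRIES (1u << 20)      /* 1M entries, indexed by 20 hash bits */

    struct dfc_entry {
        struct dp_netdev_flow *flow;    /* last megaflow hit for this hash */
    };

    static struct dp_netdev_flow *
    dfc_lookup(struct dfc_entry *cache, uint32_t rss_hash,
               const struct netdev_flow_key *key)
    {
        struct dfc_entry *e = &cache[rss_hash & (DFC_ENTRIES - 1)];

        if (e->flow && dfc_flow_matches(e->flow, key)) {
            return e->flow;             /* hit: no DPCLS lookup needed */
        }
        return NULL;                    /* miss: fall back to DPCLS */
    }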

> 
> The pmd-stats-show command is enhanced to show both EMC and DFC hits
> separately.
> 
> The sweep speed for cleaning up obsolete EMC and DFC flow entries and freeing
> dead megaflow entries is increased. With a typical PMD cycle duration of 100us
> under load and checking one DFC entry per cycle, the DFC sweep should
> normally complete within 100s.
> 
> In PVP performance tests with an L3 pipeline over VXLAN we determined the
> optimal EMC size to be 16K entries to obtain a uniform speedup compared to
> the master branch over the full range of parallel flows. The measurement below
> is for 64 byte packets and the average number of subtable lookups per DPCLS 
> hit
> in this pipeline is 1.0, i.e. the acceleration already starts for a single 
> busy mask.
> Tests with many visited subtables should show a strong increase of the gain
> through DFC.
> 
> Flows   master  DFC+EMC  Gain
>         [Mpps]  [Mpps]
> ------  ------  -------  -----
> 8       4.45    4.62      3.8%
> 100     4.17    4.47      7.2%
> 1000    3.88    4.34     12.0%
> 2000    3.54    4.17     17.8%
> 5000    3.01    3.82     27.0%
> 10000   2.75    3.63     31.9%
> 20000   2.64    3.50     32.8%
> 50000   2.60    3.33     28.1%
> 100000  2.59    3.23     24.7%
> 500000  2.59    3.16     21.9%
[[BO'M]]
What is the flow distribution here? 

Are there other flow distributions that we want to ensure do not suffer a 
possible regression? I'm not sure what they are exactly - I have some ideas but 
admittedly I've only ever tested with either a uniform flow distribution or 
else a round-robin distribution. 

> 
> 
> Signed-off-by: Jan Scheurich 
> ---
>  lib/dpif-netdev.c | 349 ---
> ---
>  1 file changed, 235 insertions(+), 114 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index db78318..efcf2e9 
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -127,19 +127,19 @@ struct netdev_flow_key {
>  uint64_t buf[FLOW_MAX_PACKET_U64S];  };
> 
> -/* Exact match cache for frequently used flows
> +/* Datapath flow cache (DFC) for frequently used flows
>  

Re: [ovs-dev] [RFC PATCH v2 08/10] vswitch.xml: Detail vxlanipsec user interface.

2017-10-18 Thread O Mahony, Billy
Hi Ian,

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Ian Stokes
> Sent: Friday, August 25, 2017 5:41 PM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [RFC PATCH v2 08/10] vswitch.xml: Detail vxlanipsec user
> interface.
> 
> This commit adds details to the vswitch xml regarding the use of the 
> vxlanipsec
> interface type. This patch is not intended for upstreaming and simply seeks to
> solicit feedback on the user interface design of the vxlanipsec port type as
> described in the vswitch.xml.
> 
> This modifies the vswitch.xml with a proposed vxlanipsec interface.
> It also provides details for the proposed interface options such as SPD 
> creation,
> SA creation and modification, Policy entries for the SPD as well as traffic
> selector options for the policy.
> 
> Signed-off-by: Ian Stokes 
> ---
>  vswitchd/vswitch.xml |  225
> ++
>  1 files changed, 225 insertions(+), 0 deletions(-)
> 
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
> 074535b..27c3c54 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2227,6 +2227,13 @@
>  A pair of virtual devices that act as a patch cable.
>
> 
> +  vxlanipsec
> +  
> +An interface type to provide IPsec RFC4301 functionality for
> +traffic at the IP layer with vxlan, initially for IPv4
> +environments.
> +  
> +
>null
>An ignored interface. Deprecated and slated for removal in
>February 2013.
> @@ -2644,6 +2651,224 @@
>
>  
> 
> +
> +  
> +Only vxlanipsec interfaces support these options.
> +  
> +
> +  
> +  
> +Must be an identifier for the SPD that is to be used by this IPsec
> +interface. If no SPD exists with this ID then it will be created.
> +  
> +  
[[BO'M]] Do we need different security policy databases for different 
Interfaces? One per vSwitch would be enough to start with.
> +
> +  
> +  
> +An identifier representing the ID of a Security Association.
> +If no SA with this ID exists it will be created.
> +  
> +  
> +
> +  
> +  
> +A 32 bit number representing the security policy index for
> +the SA.
[[BO'M]] should this be 'security *parameters* index'? If so, the spec (RFC 4301
sec 4.4.2.1) says "a 32-bit value selected by the
receiving end of an SA to uniquely identify the SA" and "An arbitrary
32-bit value that is used by a receiver to identify
the SA to which an incoming packet should be bound.", so it should not need
to be configured? I think the receiver just assigns an arbitrary 32-bit value.
> +  
> +  
> +
[[BO'M]] The remaining options ipsec_mode, sa_protocol,
ts_remote_port_range really define an SPD (security policy database) entry used
to look at the packet and decide if it needs to be DISCARDed, BYPASSed or
PROTECTed. Would it be
feasible to configure the selectors just as regular OF rules that send traffic
to the vxlanipsec interface and then associate the keys, algorithms, mode
(tunnel/transport) and protocol (AH/ESP) with the vxlanipsec interface?

> +  
> +  
> +The IPsec mode that applies to the SA, one of:
> +  
> +
> +  
> +transport: Provide protection primarily for next
> +layer protocols.
> +  
> +  
> +tunnel: Provide protection to IP layer also (applied
> +to tunneled IP packets).
> +  
> +
> +  
> +Initially only support for transport mode will be implemented.
> +  
> +  
> +
> +  
> +  
> +The security protocol used for IPsec, one of the following:
> +  
> +
> +  
> +ESP: Encapsulating Security Payload.
> +  
> +  
> +AH: Authentication header
> +  
> +
> +  
> +Initially only ESP is supported, users can implement authentication
> +communication only by setting the encryption algorithm to NULL for 
> ESP
> +but specifying the integrity algorithm. In this way there is no need
> +to support AH. If this is acceptable to the OVS community then this
> +option could be removed as it will always be ESP.
> +  
> +  
> +
> +  
> +  
> +The encryption algorithm used for IPsec, one of the following:
> +  
> +
> +  
> +NULL: No encryption. Note NULL is required for the
> +use of ESP with authentication only which is preferred over AH
> +due to NAT traversal.
> +  
> +  
> +AES_CBC: AES_CBC is a non-AEAD algorithm. Note users
> +MUST specify an authentication algorithm to check integrity.
> +  
> +  
> 

Re: [ovs-dev] [RFC PATCH v2 09/10] Docs: Add userspace-ipsec how to guide.

2017-10-18 Thread O Mahony, Billy
Hi Ian,

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Ian Stokes
> Sent: Friday, August 25, 2017 5:41 PM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [RFC PATCH v2 09/10] Docs: Add userspace-ipsec how to
> guide.
> 
> This commit adds a how to guide for using the proposed vxlanipsec userspace
> interface. It is not intended to be upstreamed but simply seeks to solicit
> feedback by providing an example of the proposed vxlanipsec interface design setup
> and usage.
> 
> The how to usecase deals with securing vxlan traffic between 2 VMs as
> described in the userspace-vxlan how to guide. It provides an example of how
> the proposed ipsec interface is configured with an SPD, creation of SAs 
> including
> IPsec protocol, mode, crypto/authentication algorithms/keys, assigning SPD
> entries to SAs for inbound/outbound traffic with traffic selectors and finally
> updating the SA keys.
> 
> Signed-off-by: Ian Stokes 
> ---
>  Documentation/automake.mk   |1 +
>  Documentation/howto/index.rst   |1 +
>  Documentation/howto/userspace-ipsec.rst |  187
> +++
>  3 files changed, 189 insertions(+), 0 deletions(-)  create mode 100644
> Documentation/howto/userspace-ipsec.rst
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk index
> 24fe63d..a8f2a01 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -59,6 +59,7 @@ DOC_SOURCE = \
>   Documentation/howto/tunneling.png \
>   Documentation/howto/tunneling.rst \
>   Documentation/howto/userspace-tunneling.rst \
> + Documentation/howto/userspace-ipsec.rst \
>   Documentation/howto/vlan.png \
>   Documentation/howto/vlan.rst \
>   Documentation/howto/vtep.rst \
> diff --git a/Documentation/howto/index.rst b/Documentation/howto/index.rst
> index 5859a33..97d690a 100644
> --- a/Documentation/howto/index.rst
> +++ b/Documentation/howto/index.rst
> @@ -43,6 +43,7 @@ OVS
> lisp
> tunneling
> userspace-tunneling
> +   userspace-ipsec
> vlan
> qos
> vtep
> diff --git a/Documentation/howto/userspace-ipsec.rst
> b/Documentation/howto/userspace-ipsec.rst
> new file mode 100644
> index 000..2ae2bd8
> --- /dev/null
> +++ b/Documentation/howto/userspace-ipsec.rst
> @@ -0,0 +1,187 @@
> +..
> +  Licensed under the Apache License, Version 2.0 (the "License"); you may
> +  not use this file except in compliance with the License. You may obtain
> +  a copy of the License at
> +
> +  http://www.apache.org/licenses/LICENSE-2.0
> +
> +  Unless required by applicable law or agreed to in writing, software
> +  distributed under the License is distributed on an "AS IS" BASIS, 
> WITHOUT
> +  WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See
> the
> +  License for the specific language governing permissions and limitations
> +  under the License.
> +
> +  Convention for heading levels in Open vSwitch documentation:
> +
> +  ===  Heading 0 (reserved for the title in a document)
> +  ---  Heading 1
> +  ~~~  Heading 2
> +  +++  Heading 3
> +  '''  Heading 4
> +
> +  Avoid deeper levels because they do not render well.
> +
> +==
> +Securing VXLAN traffic between VMs Using IPsec (Userspace)
> +==
> +
> +This document describes how to use IPsec in Open vSwitch to secure
> +traffic between VMs on two different hosts communicating over VXLAN
> +tunnels. This solution works entirely in userspace.
> +
> +.. note::
> +
> +   This guide covers the steps required to configure an IPsec interface to
> +   secure VXLAN tunneling traffic. It does not cover the steps to configure
> +   the vxlan tunnels in userspace. For these configuration steps please refer
> +   to :doc:`userspace-tunneling`.
> +
> +.. TODO(stephenfin): Convert this to a (prettier) PNG with same styling as 
> the
> +   rest of the document
> +
> +::
> +
> ++--+  +--+
> +| vm0  | 192.168.1.1/24192.168.1.2/24 | vm1  |
> ++--+  +--+
> +   (vm_port0)(vm_port1)
> +   | |
> +   | |
> +   | |
> ++--+  +--+
> +|br-int|  |br-int|
> ++--+  +--+
> +| vxlanipsec0  | 172.168.1.1/24172.168.1.2/24 | vxlanipsec0  |
> ++--+ 

Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic

2017-10-17 Thread O Mahony, Billy
Hi All,

As a suggestion for dealing with indicating success or otherwise of ingress
scheduling configuration, and also advertising an Interface's ingress scheduling
capability, I'm suggesting both of these can be written back to the Interface
table's other_config column.

The schema change (change with respect to the current patch-set) would be like 
this. 

   
   
 
  The format of the ingress_sched field is specified in ovs-fields(7) in
  the ``Matching'' and ``FIELD REFERENCE'' sections.
 
   
+  
+
+A comma separated list of ovs-fields(7) that the interface supports for
+ingress scheduling. If ingress scheduling is not supported this column
+is cleared.
+
+  
+  
+
+If the specified ingress scheduling could not be applied, Open vSwitch
+sets this column to an error description in human readable form.
+Otherwise, Open vSwitch clears this column.
+
+  
 


It would be nice to have input on the feasibility of writing back to the
Interface table - there are already a few columns that are written to in the
Interface table - e.g. the stats and ofport columns. But this would make the
other_config column both read and write, which hopefully doesn't confuse the
mechanism that notifies Interface table changes from ovsdb into vswitchd.
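
For the error side, what I have in mind is something like the following (a
purely hypothetical sketch - not existing vswitchd code, and the
"ingress_sched_error" key name is only illustrative, i.e. if the result were
reported via an other_config key):

    #include "smap.h"
    #include "vswitch-idl.h"

    /* Hypothetical: publish (or clear) an ingress scheduling error on the
     * Interface record's other_config column. */
    static void
    iface_report_ingress_sched_error(const struct ovsrec_interface *cfg,
                                     const char *error_or_null)
    {
        struct smap oc;

        smap_clone(&oc, &cfg->other_config);
        if (error_or_null) {
            smap_replace(&oc, "ingress_sched_error", error_or_null);
        } else {
            smap_remove(&oc, "ingress_sched_error");
        }
        ovsrec_interface_set_other_config(cfg, &oc);
        smap_destroy(&oc);
    }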

Regards,
Billy. 


> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Friday, September 22, 2017 12:37 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; Kevin Traynor
> <ktray...@redhat.com>; d...@openvswitch.org
> Cc: Mechthild Buescher <mechthild.buesc...@ericsson.com>; Venkatesan
> Pradeep <venkatesan.prad...@ericsson.com>
> Subject: RE: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic
> 
> Hi Billy,
> 
> > -Original Message-
> > From: O Mahony, Billy [mailto:billy.o.mah...@intel.com]
> > Sent: Friday, 22 September, 2017 10:52
> 
> > > The next question is how to classify the ingress traffic on the NIC
> > > and insert it into rx queues with different priority. Any scheme
> > > implemented should preferably work with as many NICs as possible.
> > > Use of the new rte_flow API in DPDK seems the right direction to go here.
> >
> > [[BO'M]] This may be getting ahead of where we are but is it important to
> know if a NIC does not support a prioritization scheme?
> > Someone, Darrell I believe mentioned a capability discovery mechanism
> > at one point. I was thinking it was not necessary as functionally
> > nothing changes if prioritization is or is not supported. But maybe in 
> > terms of
> an orchestrator it does make sense - as the it may want to want to make other
> arrangements to protect control traffic in the absence of a working
> prioritization mechanism.
> 
> [Jan] In our use case the configuration of filters for prioritization would 
> happen
> "manually" at OVS deployment time with full knowledge of the NIC type and
> capabilities. A run-time capability discovery mechanism is not really needed 
> for
> that. But it would anyway be good to get a feedback if the configured filter 
> is
> supported by the present NIC or if the prioritization will not work.
> 
> > >
> > > We are very interested in starting the dialogue how to configure the
> > > {queue, priority, filter} mapping in OVS and which filters are most
> > > meaningful to start with and supported by most NICs. Candidates
> > > could include VLAN tags and p- bits, Ethertype and IP DSCP.
> 
> Any feedback as to the viability of filtering on those fields with i40e and 
> ixgbe?
> 
> > >
> > > One thing that we consider important and that we would not want to
> > > lose with prioritization is the possibility to share load over a
> > > number of PMDs with RSS. So preferably the prioritization and RSS
> > > spread over a number of rx queues were orthogonal.
> >
> > [[BO'M]] We have a proposed solution for this now. Which is simply to
> > change the RETA table to avoid RSS'd packets 'polluting' the priority
> > queue. It hasn't been implemented but it should work. That's in the context 
> > of
> DPDK/FlowDirector/XL710 but rte_flow api should allow this too.
> 
> [Jan] Does this mean there is work needed to enhance the NIC firmware, the
> i40e DPDK PMD, or the rte_flow API (or any combination of those)? What about
> the ixgbe PMD in this context? Will the Niantic  support similar 
> classification?
> 
> Do you have a pointer to Fortville documentation that would help us to
> understand how i40e implements the rte_flow API.
> 
> Thanks, Jan

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] netdev-dpdk: Fix calling vhost API with negative vid.

2017-10-13 Thread O Mahony, Billy
I couldn't recreate the issue on x86 but after testing with vhostuser and 
vhostuserclient for a few scenarios such as client, server reconnect and 
multi-queue I didn't find any problems with this patch. 

Tested-by: Billy O'Mahony <billy.o.mah...@intel.com> 
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>

> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Friday, October 13, 2017 1:06 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; ovs-dev@openvswitch.org
> Cc: Maxime Coquelin <maxime.coque...@redhat.com>; Heetae Ahn
> <heetae82@samsung.com>
> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Fix calling vhost API with 
> negative
> vid.
> 
> On 13.10.2017 14:38, O Mahony, Billy wrote:
> > Hi Ilya,
> >
> >
> >> Issue can be reproduced by stopping DPDK application (testpmd) inside
> >> guest while heavy traffic flows to this VM.
> >>
> >
> > I tried both quitting testpmd without stopping the forwarding task and> 
> > simply
> killing testpmd without crashing vswitch in the host.
> >
> > What versions of dpdk are you using in the guest and host?
> 
> Versions below, but I don't think that it's so important.
> 
> Host: 17.05.2
> Guest: 16.07-rc1
> 
> >
> > Are you using dpdkvhostuser or dpdkvhostuserclient type ports?
> 
> dpdkvhostuserclient.
> 
> The complete test scenario where I saw this behaviour was:
> 
> 2 VMs with 4 queues per vhostuserclient port.
> VM1 - OVS - VM2
> 
> VM1 runs testpmd with --rxq=4 --txq=4 --nb-cores=4 --eth-peer=0,MAC2 --
> forward-mode=mac
> VM2 runs testpmd with --rxq=4 --txq=4 --nb-cores=4 --eth-peer=0,MAC1 --
> forward-mode=txonly
> 
> OVS with 8 pmd threads (1 core per queue).
> action=NORMAL
> 
> Steps:
> 
> 1. Starting testpmd in both VMs (non-interactive mode)
> 2. Waiting a while
> 3. Pushing  in VM1 console.
>--> OVS crashes while testpmd termination.
> 
> The most important thing, I guess, is that I'm using ARMv8 machine for that.
> It could be not so easy to reproduce on x86 system (I didn't try).
> 
> Best regards, Ilya Maximets.
> 
> >
> > Thanks,
> > Billy.
> >
> >> -Original Message-
> >> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> >> boun...@openvswitch.org] On Behalf Of Ilya Maximets
> >> Sent: Friday, October 6, 2017 11:50 AM
> >> To: ovs-dev@openvswitch.org
> >> Cc: Ilya Maximets <i.maxim...@samsung.com>; Maxime Coquelin
> >> <maxime.coque...@redhat.com>; Heetae Ahn
> <heetae82@samsung.com>
> >> Subject: [ovs-dev] [PATCH] netdev-dpdk: Fix calling vhost API with negative
> vid.
> >>
> >> Currently, rx and tx functions for vhost interfaces always obtain
> >> 'vid' twice. First time inside 'is_vhost_running' for checking the
> >> value and the second time in enqueue/dequeue function calls to
> >> send/receive packets. But second time we're not checking the returned
> >> value. If vhost device will be destroyed between checking and
> >> enqueue/dequeue, DPDK API will be called with '-1' instead of valid 'vid'.
> DPDK API does not validate the 'vid'.
> >> This leads to getting random memory value as a pointer to internal
> >> device structure inside DPDK. Access by this pointer leads to
> >> segmentation fault. For
> >> example:
> >>
> >>   |00503|dpdk|INFO|VHOST_CONFIG: read message
> >> VHOST_USER_GET_VRING_BASE
> >>   [New Thread 0x7fb6754910 (LWP 21246)]
> >>
> >>   Program received signal SIGSEGV, Segmentation fault.
> >>   rte_vhost_enqueue_burst at lib/librte_vhost/virtio_net.c:630
> >>   630 if (dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF))
> >>   (gdb) bt full
> >>   #0  rte_vhost_enqueue_burst at lib/librte_vhost/virtio_net.c:630
> >>   dev = 0x
> >>   #1  __netdev_dpdk_vhost_send at lib/netdev-dpdk.c:1803
> >>   tx_pkts = 
> >>   cur_pkts = 0x7f340084f0
> >>   total_pkts = 32
> >>   dropped = 0
> >>   i = 
> >>   retries = 0
> >>   ...
> >>   (gdb) p *((struct netdev_dpdk *) netdev)
> >>   $8 = { ... ,
> >> flags = (NETDEV_UP | NETDEV_PROMISC), ... ,
> >> vid = {v = -1},
> >> vhost_reconfigured = false, ... }
> >>
> >> Issue can be reproduced by stopping DPDK application (testpmd) inside
> >> guest while heavy traffic flows to this VM.

Re: [ovs-dev] [PATCH] netdev-dpdk: Fix calling vhost API with negative vid.

2017-10-13 Thread O Mahony, Billy
Ok, I'll try with something closer to that configuration... But still x86 :)

> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Friday, October 13, 2017 1:06 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; ovs-dev@openvswitch.org
> Cc: Maxime Coquelin <maxime.coque...@redhat.com>; Heetae Ahn
> <heetae82@samsung.com>
> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Fix calling vhost API with 
> negative
> vid.
> 
> On 13.10.2017 14:38, O Mahony, Billy wrote:
> > Hi Ilya,
> >
> >
> >> Issue can be reproduced by stopping DPDK application (testpmd) inside
> >> guest while heavy traffic flows to this VM.
> >>
> >
> > I tried both quitting testpmd without stopping the forwarding task and> 
> > simply
> killing testpmd without crashing vswitch in the host.
> >
> > What versions of dpdk are you using in the guest and host?
> 
> Versions below, but I don't think that it's so important.
> 
> Host: 17.05.2
> Guest: 16.07-rc1
> 
> >
> > Are you using dpdkvhostuser or dpdkvhostuserclient type ports?
> 
> dpdkvhostuserclient.
> 
> The complete test scenario where I saw this behaviour was:
> 
> 2 VMs with 4 queues per vhostuserclient port.
> VM1 - OVS - VM2
> 
> VM1 runs testpmd with --rxq=4 --txq=4 --nb-cores=4 --eth-peer=0,MAC2 --
> forward-mode=mac
> VM2 runs testpmd with --rxq=4 --txq=4 --nb-cores=4 --eth-peer=0,MAC1 --
> forward-mode=txonly
> 
> OVS with 8 pmd threads (1 core per queue).
> action=NORMAL
> 
> Steps:
> 
> 1. Starting testpmd in both VMs (non-interactive mode)
> 2. Waiting a while
> 3. Pushing  in VM1 console.
>--> OVS crashes while testpmd termination.
> 
> The most important thing, I guess, is that I'm using ARMv8 machine for that.
> It could be not so easy to reproduce on x86 system (I didn't try).
> 
> Best regards, Ilya Maximets.
> 
> >
> > Thanks,
> > Billy.
> >
> >> -Original Message-
> >> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> >> boun...@openvswitch.org] On Behalf Of Ilya Maximets
> >> Sent: Friday, October 6, 2017 11:50 AM
> >> To: ovs-dev@openvswitch.org
> >> Cc: Ilya Maximets <i.maxim...@samsung.com>; Maxime Coquelin
> >> <maxime.coque...@redhat.com>; Heetae Ahn
> <heetae82@samsung.com>
> >> Subject: [ovs-dev] [PATCH] netdev-dpdk: Fix calling vhost API with negative
> vid.
> >>
> >> Currently, rx and tx functions for vhost interfaces always obtain
> >> 'vid' twice. First time inside 'is_vhost_running' for checking the
> >> value and the second time in enqueue/dequeue function calls to
> >> send/receive packets. But second time we're not checking the returned
> >> value. If vhost device will be destroyed between checking and
> >> enqueue/dequeue, DPDK API will be called with '-1' instead of valid 'vid'.
> DPDK API does not validate the 'vid'.
> >> This leads to getting random memory value as a pointer to internal
> >> device structure inside DPDK. Access by this pointer leads to
> >> segmentation fault. For
> >> example:
> >>
> >>   |00503|dpdk|INFO|VHOST_CONFIG: read message
> >> VHOST_USER_GET_VRING_BASE
> >>   [New Thread 0x7fb6754910 (LWP 21246)]
> >>
> >>   Program received signal SIGSEGV, Segmentation fault.
> >>   rte_vhost_enqueue_burst at lib/librte_vhost/virtio_net.c:630
> >>   630 if (dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF))
> >>   (gdb) bt full
> >>   #0  rte_vhost_enqueue_burst at lib/librte_vhost/virtio_net.c:630
> >>   dev = 0x
> >>   #1  __netdev_dpdk_vhost_send at lib/netdev-dpdk.c:1803
> >>   tx_pkts = 
> >>   cur_pkts = 0x7f340084f0
> >>   total_pkts = 32
> >>   dropped = 0
> >>   i = 
> >>   retries = 0
> >>   ...
> >>   (gdb) p *((struct netdev_dpdk *) netdev)
> >>   $8 = { ... ,
> >> flags = (NETDEV_UP | NETDEV_PROMISC), ... ,
> >> vid = {v = -1},
> >> vhost_reconfigured = false, ... }
> >>
> >> Issue can be reproduced by stopping DPDK application (testpmd) inside
> >> guest while heavy traffic flows to this VM.
> >>
> >> Fix that by obtaining and checking the 'vid' only once.
> >>
> >> CC: Ciara Loftus <ciara.lof...@intel.com>
> >> Fixes: 0a0f39df1d5a ("netdev-dpdk: Add support for DPDK 1

Re: [ovs-dev] [PATCH] netdev-dpdk: Fix calling vhost API with negative vid.

2017-10-13 Thread O Mahony, Billy
Hi Ilya,


> Issue can be reproduced by stopping DPDK application (testpmd) inside guest
> while heavy traffic flows to this VM.
>

I tried both quitting testpmd without stopping the forwarding task and simply 
killing testpmd without crashing vswitch in the host.

What versions of dpdk are you using in the guest and host?

Are you using dpdkvhostuser or dpdkvhostuserclient type ports?

Thanks,
Billy. 

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Ilya Maximets
> Sent: Friday, October 6, 2017 11:50 AM
> To: ovs-dev@openvswitch.org
> Cc: Ilya Maximets ; Maxime Coquelin
> ; Heetae Ahn 
> Subject: [ovs-dev] [PATCH] netdev-dpdk: Fix calling vhost API with negative 
> vid.
> 
> Currently, rx and tx functions for vhost interfaces always obtain 'vid' 
> twice. First
> time inside 'is_vhost_running' for checking the value and the second time in
> enqueue/dequeue function calls to send/receive packets. But second time we're
> not checking the returned value. If vhost device will be destroyed between
> checking and enqueue/dequeue, DPDK API will be called with '-1' instead of 
> valid
> 'vid'. DPDK API does not validate the 'vid'.
> This leads to getting random memory value as a pointer to internal device
> structure inside DPDK. Access by this pointer leads to segmentation fault. For
> example:
> 
>   |00503|dpdk|INFO|VHOST_CONFIG: read message
> VHOST_USER_GET_VRING_BASE
>   [New Thread 0x7fb6754910 (LWP 21246)]
> 
>   Program received signal SIGSEGV, Segmentation fault.
>   rte_vhost_enqueue_burst at lib/librte_vhost/virtio_net.c:630
>   630 if (dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF))
>   (gdb) bt full
>   #0  rte_vhost_enqueue_burst at lib/librte_vhost/virtio_net.c:630
>   dev = 0x
>   #1  __netdev_dpdk_vhost_send at lib/netdev-dpdk.c:1803
>   tx_pkts = 
>   cur_pkts = 0x7f340084f0
>   total_pkts = 32
>   dropped = 0
>   i = 
>   retries = 0
>   ...
>   (gdb) p *((struct netdev_dpdk *) netdev)
>   $8 = { ... ,
> flags = (NETDEV_UP | NETDEV_PROMISC), ... ,
> vid = {v = -1},
> vhost_reconfigured = false, ... }
> 
> Issue can be reproduced by stopping DPDK application (testpmd) inside guest
> while heavy traffic flows to this VM.
> 
> Fix that by obtaining and checking the 'vid' only once.
> 
> CC: Ciara Loftus 
> Fixes: 0a0f39df1d5a ("netdev-dpdk: Add support for DPDK 16.07")
> Signed-off-by: Ilya Maximets 
> ---
>  lib/netdev-dpdk.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index c60f46f..bf30bb0 
> 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1637,18 +1637,18 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq
> *rxq,
> struct dp_packet_batch *batch)  {
>  struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
> -int qid = rxq->queue_id;
>  struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
>  uint16_t nb_rx = 0;
>  uint16_t dropped = 0;
> +int qid = rxq->queue_id;
> +int vid = netdev_dpdk_get_vid(dev);
> 
> -if (OVS_UNLIKELY(!is_vhost_running(dev)
> +if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured
>   || !(dev->flags & NETDEV_UP))) {
>  return EAGAIN;
>  }
> 
> -nb_rx = rte_vhost_dequeue_burst(netdev_dpdk_get_vid(dev),
> -qid * VIRTIO_QNUM + VIRTIO_TXQ,
> +nb_rx = rte_vhost_dequeue_burst(vid, qid * VIRTIO_QNUM +
> + VIRTIO_TXQ,
>  dev->dpdk_mp->mp,
>  (struct rte_mbuf **) batch->packets,
>  NETDEV_MAX_BURST); @@ -1783,10 +1783,11 
> @@
> __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>  unsigned int total_pkts = cnt;
>  unsigned int dropped = 0;
>  int i, retries = 0;
> +int vid = netdev_dpdk_get_vid(dev);
> 
>  qid = dev->tx_q[qid % netdev->n_txq].map;
> 
> -if (OVS_UNLIKELY(!is_vhost_running(dev) || qid < 0
> +if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0
>   || !(dev->flags & NETDEV_UP))) {
>  rte_spinlock_lock(>stats_lock);
>  dev->stats.tx_dropped+= cnt;
> @@ -1805,8 +1806,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev,
> int qid,
>  int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
>  unsigned int tx_pkts;
> 
> -tx_pkts = rte_vhost_enqueue_burst(netdev_dpdk_get_vid(dev),
> -  vhost_qid, cur_pkts, cnt);
> +tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, cur_pkts,
> + cnt);
>  if (OVS_LIKELY(tx_pkts)) {
>  /* Packets have been sent.*/
>

Re: [ovs-dev] [RFC 0/2] EMC load-shedding

2017-09-25 Thread O Mahony, Billy
Hi Darrell,

Some more information below. I'll hold off on a v2 for now to give others time 
to comment.

Thanks,
Billy. 

 
> -Original Message-
> From: Darrell Ball [mailto:db...@vmware.com]
> Sent: Friday, September 22, 2017 7:20 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Cc: i.maxim...@samsung.com; jan.scheur...@ericsson.com
> Subject: Re: [RFC 0/2] EMC load-shedding
> 
> Thanks for working on this Billy
> One comment inline.
> 
> On 9/22/17, 6:47 AM, "Billy O'Mahony" <billy.o.mah...@intel.com> wrote:
> 
> Hi All,
> 
> Please find attached RFC patch for EMC load-shedding [1] as promised [2].
> 
> This applies clean on 5ff834 "Increment ct packet counters..." It also 
> uses
> Ilya's patch "Fix per packet cycles statistics." [3] so I've included 
> that in
> the patch set as it wasn't merged when I started the RFC.
> 
> The main goal for this RFC is only to demonstrate the outline of the
> mechanism
> and get feedback & advice for further work.
> 
> However I did some initial testing with promising results. For 8K to 64K
> flows
> the cycles per packet drop from ~1200 to ~1100. For small numbers of flows
> (~16) the cycles per packet remain at ~900 which I believe means no
> increase
> but I didn't baseline that situation.
> 
> There are some TODOs commented in the patch with XXX.
> 
> For one I think the mechanism should take into account the expected
> cycle-cost
> of EMC lookup and EMC miss (dpcls lookup) when deciding how much load
> to shed.
> Rather than the heuristic in this patch which is to keep the emc hit rate 
> (for
> flow which have not been diverted from the EMC) between certain
> bounds.
> 
> 
> [Darrell]
> Could you expand on the description of the algorithm and the rational?
> I know the algorithm was discussed along with other proposed patches, but I
> think it be would be beneficial if the patch (boils down to a single patch)
> described it.
[[BO'M]] 

I'll add that description and some comments to the v2 of the patch.  In the 
meantime reviewers should find this helpful:

As the number of flows increases there will eventually be too many flows
contending for a place in the EMC cache. The EMC becomes a liability when
emc_lookup_cost + (emc_miss_rate * dpcls_lookup_cost) grows to be greater
than a straightforward dpcls_lookup_cost. When this occurs, if some proportion
of flows could be made to skip the EMC (i.e. neither be inserted into nor looked
up in the EMC) it would result in lower lookup costs overall.

This requires an efficient and flexible way to categorize flows into 'skip
EMC' and 'use EMC' categories. The RSS hash can fulfil this role by setting a
threshold whereby RSS hashes under a certain value are skipped from the EMC.

The algorithm in this RFC is based on setting this shed threshold so that the
hit rate on the EMC remains between 50 and 70%, which from observation gives an
efficient use of the EMC (based on cycles per packet). Periodically (after each
3 million packets) the EMC hit rate is checked; if it is over 70% then the
shed threshold is increased (more flows are shed from the EMC) and if it is
below 50% the shed threshold is decreased (fewer flows are shed from the EMC).
The shed threshold has 16 different values (0x0000_0000 to 0xF000_0000) which
allows for no shedding, 1/16th, 2/16ths, ... 15/16ths of flows to be skipped
from the EMC.

Each time the shed_threshold is adjusted it is moved by just one step.

Later revisions will look at the actual lookup cost for flows in the EMC and 
dpcls rather than using hard-coded hit rates to define efficient use of the 
EMC. They may also adjust the shed rate in a proportional manner and adjust on 
a timed interval instead of every N packets.
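
To make the mechanism concrete, the per-packet shed decision itself boils down
to something like this (a minimal sketch, not the RFC code itself):

    #include <stdbool.h>
    #include <stdint.h>

    /* A packet skips the EMC (no insert, no lookup) when its RSS hash is below
     * the current shed threshold.  With thresholds at multiples of
     * 0x1000_0000, between 0/16 and 15/16 of flows are shed. */
    static inline bool
    emc_shed_packet(uint32_t rss_hash, uint32_t shed_threshold)
    {
        return rss_hash < shed_threshold;
    }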


> 
> Probably the code could benefit from some expanded comments as well?
> 
> I see one comment in the code
> +/* As hit rate goes down shed thresh goes up (more is shed from EMC)
> */
> +/* XXX consider increment more if further out of bounds *
> 
[[BO'M]] 
> 
> Also we should decide on at least one flow distribution that would be
> useful
> (i.e. realistic) for EMC testing. The tests above have either been 
> carried out
> with a random (uniform) flow distribution which doesn't play well with 
> flow
> caching or else a round-robin flow distribution which is actually 
> adverserial
> to flow caching. If I have an agreed flow distribution I can then figure 
> out
> how to produce it for testing :).
> 
> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/336509.html

Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic

2017-09-22 Thread O Mahony, Billy
Hi Jan,

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Friday, September 22, 2017 12:37 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; Kevin Traynor
> <ktray...@redhat.com>; d...@openvswitch.org
> Cc: Mechthild Buescher <mechthild.buesc...@ericsson.com>; Venkatesan
> Pradeep <venkatesan.prad...@ericsson.com>
> Subject: RE: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic
> 
> Hi Billy,
> 
> > -Original Message-
> > From: O Mahony, Billy [mailto:billy.o.mah...@intel.com]
> > Sent: Friday, 22 September, 2017 10:52
> 
> > > The next question is how to classify the ingress traffic on the NIC
> > > and insert it into rx queues with different priority. Any scheme
> > > implemented should preferably work with as many NICs as possible.
> > > Use of the new rte_flow API in DPDK seems the right direction to go
> here.
> >
> > [[BO'M]] This may be getting ahead of where we are but is it important to
> know if a NIC does not support a prioritization scheme?
> > Someone, Darrell I believe mentioned a capability discovery mechanism
> > at one point. I was thinking it was not necessary as functionally
> > nothing changes if prioritization is or is not supported. But maybe in terms
> of an orchestrator it does make sense - as the it may want to want to make
> other arrangements to protect control traffic in the absence of a working
> prioritization mechanism.
> 
> [Jan] In our use case the configuration of filters for prioritization would
> happen "manually" at OVS deployment time with full knowledge of the NIC
> type and capabilities. A run-time capability discovery mechanism is not really
> needed for that. But it would anyway be good to get a feedback if the
> configured filter is supported by the present NIC or if the prioritization 
> will
> not work.
> 
[[BO'M]] There is a log warning message but if something more software-friendly 
is required maybe the ovsdb entry for the other_config could be cleared by 
vswitchd if the interface can't perform?
> > >
> > > We are very interested in starting the dialogue how to configure the
> > > {queue, priority, filter} mapping in OVS and which filters are most
> > > meaningful to start with and supported by most NICs. Candidates
> > > could include VLAN tags and p- bits, Ethertype and IP DSCP.
> 
> Any feedback as to the viability of filtering on those fields with i40e and
> ixgbe?
[[BO'M]] There is a flex filter feature which should make this possible for 
XL710. I will verify.
> 
> > >
> > > One thing that we consider important and that we would not want to
> > > lose with prioritization is the possibility to share load over a
> > > number of PMDs with RSS. So preferably the prioritization and RSS
> > > spread over a number of rx queues were orthogonal.
> >
> > [[BO'M]] We have a proposed solution for this now. Which is simply to
> > change the RETA table to avoid RSS'd packets 'polluting' the priority
> > queue. It hasn't been implemented but it should work. That's in the
> context of DPDK/FlowDirector/XL710 but rte_flow api should allow this too.
> 
> [Jan] Does this mean there is work needed to enhance the NIC firmware, the
> i40e DPDK PMD, or the rte_flow API (or any combination of those)? What
> about the ixgbe PMD in this context? Will the Niantic  support similar
> classification?
[[BO'M]] I'd imagine that all NICs implementing RSS have a RETA and I'm sure 
it's accessible by both fdir and rte_flow currently. In terms of Niantic 
supporting queue assignment based on VLAN tags etc I'm not so sure. I'll take 
an AR to dig into this. 
> 
> Do you have a pointer to Fortville documentation that would help us to
> understand how i40e implements the rte_flow API.

 [[BO'M]] AFAIK the flow API is pretty expressive. The issue would be more with 
NIC support. 
There is the XL710 datasheet 
https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf
 which tbh I find hard to figure out how the various filter mechanism interact. 

> 
> Thanks, Jan

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic

2017-09-22 Thread O Mahony, Billy
Hi Jan, Kevin,

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Thursday, September 21, 2017 4:12 PM
> To: Kevin Traynor <ktray...@redhat.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; d...@openvswitch.org
> Cc: Mechthild Buescher <mechthild.buesc...@ericsson.com>; Venkatesan
> Pradeep <venkatesan.prad...@ericsson.com>
> Subject: RE: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic
> 
> Hi all,
> 
> We seriously want to pursue this kind of ingress traffic prioritization from
> physical ports in OVS-DPDK for the use case I mentioned earlier: 
> prioritization
> of in-band control plane traffic running on the same physical network as the
> tenant data traffic.
> 
> We have first focused on testing the effectiveness of the SW queue
> prioritization in Billy's patch. To this end we added two DPDK ports to a PMD:
> dpdk0 with normal priority and dpdk1 with hard-coded high priority (e.g. not
> using the config interface in the patch). We cross-connected dpdk0 to a
> vhostuser port in a VM and dpdk1 to the LOCAL port on the host.
> 
> We overloaded the PMD with 64 byte packets on dpdk0 (~25% rx packet
> drop on dpdk0) and in parallel sent iperf3 UDP traffic (256 byte datagrams) in
> on dpdk1, destined to an iperf3 server running on the host.
> 
> With the dpdk1 queue prioritized, we achieve ~1Gbit/s (460 Kpps) iperf3
> throughput with zero packet drop no matter if the parallel overload traffic on
> dpdk0 is running or not. (The throughput is limited by the UDP/IP stack on
> the client side.) In the same test with non-prioritized dpdk1 queue iperf3
> reports about 28% packet drop, same as experienced by the dpdk0 traffic.
> 
> With that we can conclude that the PMD priority queue polling scheme
> implemented in Billy's patch effectively solves our problem. We haven't
> tested if the inner priority polling loop has any performance impact on the
> normal PMD processing. Not likely, though.

[[BO'M]] That's great to know!

> 
> The next question is how to classify the ingress traffic on the NIC and 
> insert it
> into rx queues with different priority. Any scheme implemented should
> preferably work with as many NICs as possible. Use of the new rte_flow API
> in DPDK seems the right direction to go here.

[[BO'M]] This may be getting ahead of where we are, but is it important to know 
if a NIC does not support a prioritization scheme? Someone (Darrell, I believe) 
mentioned a capability discovery mechanism at one point. I was thinking it was 
not necessary, as functionally nothing changes whether prioritization is or is 
not supported. But maybe in terms of an orchestrator it does make sense, as it 
may want to make other arrangements to protect control traffic in the absence 
of a working prioritization mechanism.

> 
> We are very interested in starting the dialogue how to configure the {queue,
> priority, filter} mapping in OVS and which filters are most meaningful to 
> start
> with and supported by most NICs. Candidates could include VLAN tags and p-
> bits, Ethertype and IP DSCP.
> 
> One thing that we consider important and that we would not want to lose
> with prioritization is the possibility to share load over a number of PMDs 
> with
> RSS. So preferably the prioritization and RSS spread over a number of rx
> queues were orthogonal.

[[BO'M]] We have a proposed solution for this now. Which is simply to change 
the RETA table to avoid RSS'd packets 'polluting' the priority queue. It hasn't 
been implemented but it should work. That's in the context of 
DPDK/FlowDirector/XL710 but rte_flow api should allow this too.
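To sketch what that would look like (rough and untested, using the generic 
ethdev RETA calls; the queue numbering and helper name are made up), the 
redirection table is simply rewritten so no bucket points at the priority 
queue:

#include <string.h>
#include <rte_ethdev.h>

/* Rewrite the RSS redirection table so that no RETA bucket points at
 * 'prio_queue'; RSS traffic is spread over the other rx queues only.
 * Assumes n_rxq > 1 and reta_size is a multiple of RTE_RETA_GROUP_SIZE. */
static int
exclude_queue_from_rss(uint16_t port_id, uint16_t prio_queue, uint16_t n_rxq)
{
    struct rte_eth_dev_info info;
    rte_eth_dev_info_get(port_id, &info);

    struct rte_eth_rss_reta_entry64 reta[info.reta_size / RTE_RETA_GROUP_SIZE];
    memset(reta, 0, sizeof reta);

    uint16_t q = 0;
    for (uint16_t i = 0; i < info.reta_size; i++) {
        if (q == prio_queue) {
            q = (q + 1) % n_rxq;            /* skip the priority queue */
        }
        reta[i / RTE_RETA_GROUP_SIZE].mask |= 1ULL << (i % RTE_RETA_GROUP_SIZE);
        reta[i / RTE_RETA_GROUP_SIZE].reta[i % RTE_RETA_GROUP_SIZE] = q;
        q = (q + 1) % n_rxq;
    }
    return rte_eth_dev_rss_reta_update(port_id, reta, info.reta_size);
}

The priority queue then only ever receives what the filters explicitly steer to 
it, while RSS keeps spreading the rest of the traffic over the other queues.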

> 
> BR, Jan
> 
> 
> Note: There seems to be a significant overlap with the discussion around
> classification HW offload for datapath flow entries currently going on, with
> the exception that the QoS filters here are static and not in any way tied to
> dynamic megaflows.
> 
> 
> > -Original Message-
> > From: Kevin Traynor [mailto:ktray...@redhat.com]
> > Sent: Friday, 18 August, 2017 20:40
> > To: Jan Scheurich <jan.scheur...@ericsson.com>; O Mahony, Billy
> > <billy.o.mah...@intel.com>; d...@openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive
> > traffic
> >
> > On 08/17/2017 05:21 PM, Jan Scheurich wrote:
> > > Good discussion. Some thoughts:
> > >
> > > 1. Prioritizing queues by assigning them to dedicated PMDs is a
> > > simple and effective but very crude method, considering that you
> > have to reserve an entire (logical) core for that. So I am all for a more
> economic and perhaps slightly less deterministic option!
> > >
> >
> > S

Re: [ovs-dev] ovs_dpdk: dpdk-socket-mem usage question

2017-09-19 Thread O Mahony, Billy
Hi Wang,

Typically I reserve between 512M and 1G on each NUMA node.

There is no formula I am aware of for how much memory is actually required.

Fundamentally this is determined by the maximum number and size of packets 
in flight at any given time, which in turn depends on the ingress packet rate, 
the processing time in OVS, and the rate and frequency at which egress queues 
are drained.

The maximum memory requirement is determined by the number of rx and tx queues 
and how many descriptors each has. Longer queues (more descriptors) will also 
protect against packet loss up to a point, so QoS/throughput considerations 
come into play as well. 

On that point, for dpdkvhostuser ports, as far as I know current versions of 
qemu have the virtio queue length fixed at compile time, so those queue lengths 
cannot be modified by OVS at all.

In short I don't think there is any way other than testing and tuning of the 
dpdk application (in this case OVS) and the particular use case while 
monitoring internal queue usage. This should give you an idea of an acceptable 
maximum length for the various queues and a good first guess as to the total 
amount of memory required.
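For example (values are only a starting point, not a formula):

    ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"

pre-allocates 1G of hugepage memory on each of two NUMA nodes; it can then be 
tuned up or down based on the monitoring described above.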

Regards,
Billy.



> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of 王志克
> Sent: Wednesday, September 13, 2017 6:35 AM
> To: ovs-dev@openvswitch.org; ovs-disc...@openvswitch.org
> Subject: [ovs-dev] ovs_dpdk: dpdk-socket-mem usage question
> 
> Hi All,
> 
> I read below doc, and have one question:
> 
> http://docs.openvswitch.org/en/latest/intro/install/dpdk/
> dpdk-socket-mem
> Comma separated list of memory to pre-allocate from hugepages on specific
> sockets.
> 
> Question:
>OVS+DPDK can let user to specify the needed memory using dpdk-socket-
> mem. But the question is that how to know how much memory is needed. Is
> there some algorithm on how to calculate the memory?Thanks.
> 
> Br,
> Wang Zhike
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v5] dpif-netdev: Avoid reading RSS hash when EMC is disabled

2017-09-19 Thread O Mahony, Billy
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>

> -Original Message-
> From: Fischetti, Antonio
> Sent: Tuesday, September 12, 2017 5:34 PM
> To: d...@openvswitch.org
> Cc: Darrell Ball <db...@vmware.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; Fischetti, Antonio
> <antonio.fische...@intel.com>
> Subject: [PATCH v5] dpif-netdev: Avoid reading RSS hash when EMC is
> disabled
> 
> When EMC is disabled the reading of RSS hash is skipped.
> Also, for packets that are not recirculated it retrieves the hash value 
> without
> considering the recirc id.
> 
> CC: Darrell Ball <db...@vmware.com>
> CC: Billy O Mahony <billy.o.mah...@intel.com>
> Signed-off-by: Antonio Fischetti <antonio.fische...@intel.com>
> ---
>  V5
>   - Removed OVS_LIKELY when checking cur_min.
> 
>   - I see a performance improvement for P2P and PVP tescases
> when EMC is disabled, measurements are below.
> 
>   - I also tried different solutions, eg passing md_is_valid
> to dpif_netdev_packet_get_rss_hash - or even the recirc_id
> computed inside dp_execute_cb - as suggested by Billy but
> I didn't see any performance benefit.
> 
>   - Rebased on Commit id
>b9fedfa61f000f49500973d2a51e99a80d9cf9b8
> 
>   - Each measurement was repeated 5 times and an average was
> computed.
> 
> 
> P2P testcase
> 
> Flow setup:
> table=0, in_port=dpdk0 actions=output:dpdk1 table=0, in_port=dpdk1
> actions=output:dpdk0
> 
> Mono-directional, 64B UDP packets. Traffic sent at line-rate.
> PMD threads: 2
> Built with "-O2 -march=native -g"
> 
> Measurements and average are in Mpps.
> 
>Orig   |   5 Measurments  |  Avg
>  -+--+-
>  With EMC | 11.39   11.37   11.46   11.35   11.39| 11.39
>  no EMC   |  8.23    8.22    8.26    8.20    8.22   |  8.23
> 
>   + patch |   5 Measurments  |  Avg
>  -+--+-
>  With EMC |  11.46   11.39   11.40   11.45   11.38   | 11.42
>  no EMC   |   8.42    8.41    8.37    8.43    8.37  |  8.40
> 
> 
> PVP testcase
> 
> Flow setup:
> table=0, in_port=dpdk0 actions=output:dpdkvhostuser0 table=0,
> in_port=dpdkvhostuser0 actions=output:dpdk0 table=0,
> in_port=dpdkvhostuser1 actions=output:dpdk1 table=0, in_port=dpdk1
> actions=output:dpdkvhostuser1
> 
> Bi-directional, 64B UDP packets. Traffic sent at line-rate.
> PMD threads:  2
> Built with "-O2 -march=native -g"
> 
> Measurements and average are in Mpps.
> 
>Orig   |   5 Measurments  |  Avg
>  -+--+-
>  With EMC |   4.59   4.60   4.46   4.59   4.59   |  4.57
>  no EMC   |   3.72   3.72   3.64   3.72   3.72   |  3.70
> 
>   + patch |   5 Measurments  |  Avg
>  -+--+-
>  With EMC |  4.50   4.62   4.60   4.60   4.58|  4.58
>  no EMC   |  3.78   3.86   3.84   3.84   3.83|  3.83
> 
> 
> Recirculation testcase
> --
> In a test setup with a firewall I didn't see any performance difference
> between the original and the patch.
> 
> 
>  V4
>   - reworked to remove dependencies from other patches in
> patchset "Skip EMC for recirc pkts and other optimizations."
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
> August/337320.html
> 
>   - measurements were repeated with the latest head of master.
> ---
>  lib/dpif-netdev.c | 32 
>  1 file changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 0ceef9d..baf65e8
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4765,6 +4765,22 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread
> *pmd, struct dp_packet *packet_,  }
> 
>  static inline uint32_t
> +dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
> +const struct miniflow *mf) {
> +uint32_t hash;
> +
> +if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
> +hash = dp_packet_get_rss_hash(packet);
> +} else {
> +hash = miniflow_hash_5tuple(mf, 0);
> +dp_packet_set_rss_hash(packet, hash);
> +}
> +
> +return hash;
> +}
> +
> +static inline uint32_t
>  dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
>  const struct miniflow *mf)  { @@ -4899,10 
> +

Re: [ovs-dev] [PATCH v5] dpif-netdev: Avoid reading RSS hash when EMC is disabled

2017-09-13 Thread O Mahony, Billy
Hi All,

It's a pity the performance gain couldn't be achieved without introducing the 
new function, but it is a clear performance gain nonetheless.

Regards,
/Billy.

> -Original Message-
> From: Darrell Ball [mailto:db...@vmware.com]
> Sent: Wednesday, September 13, 2017 6:32 AM
> To: Fischetti, Antonio <antonio.fische...@intel.com>; d...@openvswitch.org
> Cc: O Mahony, Billy <billy.o.mah...@intel.com>
> Subject: Re: [PATCH v5] dpif-netdev: Avoid reading RSS hash when EMC is
> disabled
> 
> These results look more clear: 2-3 % improvement for no EMC cases.
> Maybe others have comments?
> 
> 
> On 9/12/17, 9:34 AM, "antonio.fische...@intel.com"
> <antonio.fische...@intel.com> wrote:
> 
> When EMC is disabled the reading of RSS hash is skipped.
> Also, for packets that are not recirculated it retrieves
> the hash value without considering the recirc id.
> 
> CC: Darrell Ball <db...@vmware.com>
> CC: Billy O Mahony <billy.o.mah...@intel.com>
> Signed-off-by: Antonio Fischetti <antonio.fische...@intel.com>
> ---
>  V5
>   - Removed OVS_LIKELY when checking cur_min.
> 
>   - I see a performance improvement for P2P and PVP tescases
> when EMC is disabled, measurements are below.
> 
>   - I also tried different solutions, eg passing md_is_valid
> to dpif_netdev_packet_get_rss_hash - or even the recirc_id
> computed inside dp_execute_cb - as suggested by Billy but
> I didn't see any performance benefit.
> 
>   - Rebased on Commit id
>b9fedfa61f000f49500973d2a51e99a80d9cf9b8
> 
>   - Each measurement was repeated 5 times and an average was
> computed.
> 
> 
> P2P testcase
> 
> Flow setup:
> table=0, in_port=dpdk0 actions=output:dpdk1
> table=0, in_port=dpdk1 actions=output:dpdk0
> 
> Mono-directional, 64B UDP packets. Traffic sent at line-rate.
> PMD threads: 2
> Built with "-O2 -march=native -g"
> 
> Measurements and average are in Mpps.
> 
>Orig   |   5 Measurments  |  Avg
>  -+--+-
>  With EMC | 11.39   11.37   11.46   11.35   11.39| 11.39
>  no EMC   |  8.23    8.22    8.26    8.20    8.22   |  8.23
> 
>   + patch |   5 Measurments  |  Avg
>  -+--+-
>  With EMC |  11.46   11.39   11.40   11.45   11.38   | 11.42
>  no EMC   |   8.42    8.41    8.37    8.43    8.37  |  8.40
> 
> 
> PVP testcase
> 
> Flow setup:
> table=0, in_port=dpdk0 actions=output:dpdkvhostuser0
> table=0, in_port=dpdkvhostuser0 actions=output:dpdk0
> table=0, in_port=dpdkvhostuser1 actions=output:dpdk1
> table=0, in_port=dpdk1 actions=output:dpdkvhostuser1
> 
> Bi-directional, 64B UDP packets. Traffic sent at line-rate.
> PMD threads:  2
> Built with "-O2 -march=native -g"
> 
> Measurements and average are in Mpps.
> 
>Orig   |   5 Measurments  |  Avg
>  -+--+-
>  With EMC |   4.59   4.60   4.46   4.59   4.59   |  4.57
>  no EMC   |   3.72   3.72   3.64   3.72   3.72   |  3.70
> 
>   + patch |   5 Measurments  |  Avg
>  -+--+-
>  With EMC |  4.50   4.62   4.60   4.60   4.58|  4.58
>  no EMC   |  3.78   3.86   3.84   3.84   3.83|  3.83
> 
> 
> Recirculation testcase
> --
> In a test setup with a firewall I didn't see any performance
> difference between the original and the patch.
> 
> 
>  V4
>   - reworked to remove dependencies from other patches in
> patchset "Skip EMC for recirc pkts and other optimizations."
> https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-
> 2DAugust_337320.html=DwIBAg=uilaK90D4TOVoH58JNXRgQ=BVhFA
> 09CGX7JQ5Ih-
> uZnsw=Mlu2yd3PFPpnA0HXsnY9Mq9JpsZYoS9_cSNEv6nMjsI=frcbor3lc
> JkzFUS3Dl5Mmioaz43QeZOA6AwDWjO0Iac=
> 
>   - measurements were repeated with the latest head of master.
> ---
>  lib/dpif-netdev.c | 32 
>  1 file changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 0ceef9d..baf65e8 100644
> --- a/lib/dpif-netdev.c
&

Re: [ovs-dev] [PATCH] dpif-netdev: Simplify emc replacement policy.

2017-09-12 Thread O Mahony, Billy
Hi Darrell,

> The effect here is highly data dependent and in fact dominated by the packet
> distribution which will not be random or rather pseudo-random. I had done
> my own testing with pseudo random flows, fwiw.
> I did not see any thrashing with even at 4000 flows and saw one alive/alive
> choice at 8000.

An agreed standard packet distribution would be useful (essential really) for 
EMC performance characterization. What is the packet distribution you are using?

Thanks,
Billy.


> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Darrell Ball
> Sent: Friday, August 4, 2017 8:38 PM
> To: Ilya Maximets ; Wang, Yipeng1
> ; ovs-dev@openvswitch.org
> Cc: Heetae Ahn 
> Subject: Re: [ovs-dev] [PATCH] dpif-netdev: Simplify emc replacement
> policy.
> 
> 
> 
> -Original Message-
> From: Ilya Maximets 
> Date: Monday, July 31, 2017 at 7:25 AM
> To: Darrell Ball , "Wang, Yipeng1"
> , "ovs-dev@openvswitch.org"  d...@openvswitch.org>
> Cc: Heetae Ahn 
> Subject: Re: [ovs-dev] [PATCH] dpif-netdev: Simplify emc replacement
> policy.
> 
> On 31.07.2017 04:41, Darrell Ball wrote:
> >
> >
> > -Original Message-
> > From:  on behalf of "Wang,
> Yipeng1" 
> > Date: Friday, July 28, 2017 at 11:04 AM
> > To: Ilya Maximets , "ovs-
> d...@openvswitch.org" 
> > Cc: Heetae Ahn 
> > Subject: Re: [ovs-dev] [PATCH] dpif-netdev: Simplify emc replacement
> policy.
> >
> > Good catch. But I think the hash comparison is to "randomly" choose
> one of the two entries to replace when both entries are live.
> > Your change would always replace the first one in such case. It might
> cause some thrashing issue for certain traffic. Meanwhile, to my experience,
> the original "hash comparison" is also not a good way to choose random
> entry, I encountered some thrashing issue before.
> >
> > I think we want some condition like below, but a way to fast choose 
> a
> random entry.
> >
> > if (!to_be_replaced || (emc_entry_alive(to_be_replaced) &&
> !emc_entry_alive(current_entry) )
> > to_be_replaced = current_entry;
> > else if((emc_entry_alive(to_be_replaced) &&
> (emc_entry_alive(current_entry))
> > to_be_replaced = random_entry;
> 
> I agree that we need to have something like random choosing of active
> entry to replace.
> I though about this a little and came up with idea to reuse the random
> value generated
> for insertion probability. This should give a good distribution for
> replacement.
> I'll send v2 soon with that approach.
> 
> The effect here is highly data dependent and in fact dominated by the packet
> distribution which will not be random or rather pseudo-random. I had done
> my own testing with pseudo random flows, fwiw.
> I did not see any thrashing with even at 4000 flows and saw one alive/alive
> choice at 8000.

[[BO'M]] What is the packet distribution you are using? 

> 
> We can also see the data dependency with this patch in this first version.
> This patch removed all randomness when choosing an entry to replace when
> both candidates are alive and instead always choose the first entry.
> 
> However, you observed that this fixed your problem of thrashing with your
> dataset – if so, the dataset used in your testing may not be very random.
> This change would have been worse in the general case, but seemed perfect
> for your dataset.
> 
> 
> > //
> >
> > I agree – we are trying to randomly select one of two live entries with 
> the
> last condition.
> > Something like this maybe makes it more clear what we are trying to do ?
> 
> Your code solves the issue with replacement of alive entries while dead
> ones exists,
> but you're still uses hashes as random values which is not right. Hashes 
> are
> not random
> and there is no any difference in choosing the first entry or the entry 
> with a
> bit
> set in a particular place. There always will be some bad case where you 
> will
> replace
> same entries all the time and performance of EMC will be low.
> 
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 47a9fa0..75cc039 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -2051,12 +2051,15 @@ emc_insert(struct emc_cache *cache, const
> struct netdev_flow_key *key,
> >  }
> >
> >  /* Replacement policy: put the flow in an empty (not alive) 
> entry, or
> > - * in the first entry where it can be */
> > -if (!to_be_replaced
> > -   

Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for physical port

2017-09-11 Thread O Mahony, Billy
Hi Wang,

I believe that the processing cycles reported in the PMD stats include EMC processing time. 

This is just in the context of your results being surprising. It could be a 
factor if you are using code where the bug exists. The patch carries a fixes: 
tag (I think) that should help you figure out if your results were potentially 
affected by this issue.
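To rule it out, clear and then re-read the stats while your traffic is running, 
e.g.:

    ovs-appctl dpif-netdev/pmd-stats-clear
    # ... let the traffic run for a short, fixed interval ...
    ovs-appctl dpif-netdev/pmd-stats-show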

Regards,
/Billy. 

> -Original Message-
> From: 王志克 [mailto:wangzh...@jd.com]
> Sent: Monday, September 11, 2017 3:00 AM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; ovs-
> d...@openvswitch.org; Jan Scheurich <jan.scheur...@ericsson.com>; Darrell
> Ball <db...@vmware.com>; ovs-disc...@openvswitch.org; Kevin Traynor
> <ktray...@redhat.com>
> Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> Hi Billy,
> 
> In my test, almost all traffic went trough via EMC. So the fix does not impact
> the result, especially we want to know the difference (not the exact num).
> 
> Can you test to get some data? Thanks.
> 
> Br,
> Wang Zhike
> 
> -Original Message-
> From: O Mahony, Billy [mailto:billy.o.mah...@intel.com]
> Sent: Friday, September 08, 2017 11:18 PM
> To: 王志克; ovs-dev@openvswitch.org; Jan Scheurich; Darrell Ball; ovs-
> disc...@openvswitch.org; Kevin Traynor
> Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> Hi Wang,
> 
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337309.html
> 
> I see it's been acked and is due to be pushed to master with other changes
> on the dpdk merge branch so you'll have to apply it manually for now.
> 
> /Billy.
> 
> > -Original Message-
> > From: 王志克 [mailto:wangzh...@jd.com]
> > Sent: Friday, September 8, 2017 11:48 AM
> > To: ovs-dev@openvswitch.org; Jan Scheurich
> > <jan.scheur...@ericsson.com>; O Mahony, Billy
> > <billy.o.mah...@intel.com>; Darrell Ball <db...@vmware.com>; ovs-
> > disc...@openvswitch.org; Kevin Traynor <ktray...@redhat.com>
> > Subject: Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> > physical port
> >
> > Hi Billy,
> >
> > I used ovs2.7.0. I searched the git log, and not sure which commit it
> > is. Do you happen to know?
> >
> > Yes, I cleared the stats after traffic run.
> >
> > Br,
> > Wang Zhike
> >
> >
> > From: "O Mahony, Billy" <billy.o.mah...@intel.com>
> > To: "wangzh...@jd.com" <wangzh...@jd.com>, Jan Scheurich
> > <jan.scheur...@ericsson.com>, Darrell Ball <db...@vmware.com>,
> > "ovs-disc...@openvswitch.org" <ovs-disc...@openvswitch.org>,
> > "ovs-dev@openvswitch.org" <ovs-dev@openvswitch.org>, Kevin
> Traynor
> > <ktray...@redhat.com>
> > Subject: Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> > physical port
> > Message-ID:
> > <03135aea779d444e90975c2703f148dc58c19...@irsmsx107.ger.c
> > orp.intel.com>
> >
> > Content-Type: text/plain; charset="utf-8"
> >
> > Hi Wang,
> >
> > Thanks for the figures. Unexpected results as you say. Two things come
> > to
> > mind:
> >
> > I'm not sure what code you are using but the cycles per packet
> > statistic was broken for a while recently. Ilya posted a patch to fix
> > it so make sure you have that patch included.
> >
> > Also remember to reset the pmd stats after you start your traffic and
> > then measure after a short duration.
> >
> > Regards,
> > Billy.
> >
> >
> >
> > From: 王志克 [mailto:wangzh...@jd.com]
> > Sent: Friday, September 8, 2017 8:01 AM
> > To: Jan Scheurich <jan.scheur...@ericsson.com>; O Mahony, Billy
> > <billy.o.mah...@intel.com>; Darrell Ball <db...@vmware.com>; ovs-
> > disc...@openvswitch.org; ovs-dev@openvswitch.org; Kevin Traynor
> > <ktray...@redhat.com>
> > Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> > physical port
> >
> >
> > Hi All,
> >
> >
> >
> > I tested below cases, and get some performance data. The data shows
> > there is little impact for cross NUMA communication, which is
> > different from my expectation. (Previously I mentioned that cross NUMA
> > would add 60% cycles, but I can NOT reproduce it any more).
> >
> >
> >
> > @Jan,
> >
> > You mentioned cross NUMA communication would cost lots more cycles.
> > Can you share your data? I am not sure whether I made some mistake or
> not.
> >
> &

Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for physical port

2017-09-08 Thread O Mahony, Billy
Hi Wang,

https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337309.html

I see it's been acked and is due to be pushed to master with other changes on 
the dpdk merge branch so you'll have to apply it manually for now.

/Billy. 

> -Original Message-
> From: 王志克 [mailto:wangzh...@jd.com]
> Sent: Friday, September 8, 2017 11:48 AM
> To: ovs-dev@openvswitch.org; Jan Scheurich
> <jan.scheur...@ericsson.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; Darrell Ball <db...@vmware.com>; ovs-
> disc...@openvswitch.org; Kevin Traynor <ktray...@redhat.com>
> Subject: Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> Hi Billy,
> 
> I used ovs2.7.0. I searched the git log, and not sure which commit it is. Do 
> you
> happen to know?
> 
> Yes, I cleared the stats after traffic run.
> 
> Br,
> Wang Zhike
> 
> 
> From: "O Mahony, Billy" <billy.o.mah...@intel.com>
> To: "wangzh...@jd.com" <wangzh...@jd.com>, Jan Scheurich
>   <jan.scheur...@ericsson.com>, Darrell Ball <db...@vmware.com>,
>   "ovs-disc...@openvswitch.org" <ovs-disc...@openvswitch.org>,
>   "ovs-dev@openvswitch.org" <ovs-dev@openvswitch.org>, Kevin
> Traynor
>   <ktray...@redhat.com>
> Subject: Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for
>   physical port
> Message-ID:
>   <03135aea779d444e90975c2703f148dc58c19...@irsmsx107.ger.c
> orp.intel.com>
> 
> Content-Type: text/plain; charset="utf-8"
> 
> Hi Wang,
> 
> Thanks for the figures. Unexpected results as you say. Two things come to
> mind:
> 
> I'm not sure what code you are using but the cycles per packet statistic was
> broken for a while recently. Ilya posted a patch to fix it so make sure you
> have that patch included.
> 
> Also remember to reset the pmd stats after you start your traffic and then
> measure after a short duration.
> 
> Regards,
> Billy.
> 
> 
> 
> From: 王志克 [mailto:wangzh...@jd.com]
> Sent: Friday, September 8, 2017 8:01 AM
> To: Jan Scheurich <jan.scheur...@ericsson.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; Darrell Ball <db...@vmware.com>; ovs-
> disc...@openvswitch.org; ovs-dev@openvswitch.org; Kevin Traynor
> <ktray...@redhat.com>
> Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> 
> Hi All,
> 
> 
> 
> I tested below cases, and get some performance data. The data shows there
> is little impact for cross NUMA communication, which is different from my
> expectation. (Previously I mentioned that cross NUMA would add 60%
> cycles, but I can NOT reproduce it any more).
> 
> 
> 
> @Jan,
> 
> You mentioned cross NUMA communication would cost lots more cycles. Can
> you share your data? I am not sure whether I made some mistake or not.
> 
> 
> 
> @All,
> 
> Welcome your data if you have data for similar cases. Thanks.
> 
> 
> 
> Case1: VM0->PMD0->NIC0
> 
> Case2:VM1->PMD1->NIC0
> 
> Case3:VM1->PMD0->NIC0
> 
> Case4:NIC0->PMD0->VM0
> 
> Case5:NIC0->PMD1->VM1
> 
> Case6:NIC0->PMD0->VM1
> 
> 
> 
> Case    VM Tx Mpps   Host Tx Mpps   avg cycles per packet   avg processing cycles per packet
> Case1   1.4          1.4            512                     415
> Case2   1.3          1.3            537                     436
> Case3   1.35         1.35           514                     390
> 
> Case    VM Rx Mpps   Host Rx Mpps   avg cycles per packet   avg processing cycles per packet
> Case4   1.3          1.3            549                     533
> Case5   1.3          1.3            559                     540
> Case6   1.28         1.28           568                     551
> 
> 
> 
> Br,
> 
> Wang Zhike
> 
> 
> 
> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Wednesday, September 06, 2017 9:33 PM
> To: O Mahony, Billy; ???; Darrell Ball; ovs-
> disc...@openvswitch.org<mailto:ovs-disc...@openvswitch.org>; ovs-
> d...@openvswitch.org<mailto:ovs-dev@openvswitch.org>; Kevin Traynor
> Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> 
> 
> Hi Billy,
> 
> 
> 
> > You are going to have to take the hit crossing the NUMA boundary at some
> point if your NIC and VM are on different NUMAs.
> 
> >
>

Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets

2017-09-06 Thread O Mahony, Billy
Hi All,

On the "“RSS hash threshold method” for EMC load shedding I hope to have time 
to do an RFC to illustrate in the next week or so give a better idea of what I 
mean.
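In essence it is just a comparison against the packet's RSS hash; a rough 
sketch (the names and the fixed threshold are illustrative, not the actual 
RFC):

#include <stdbool.h>
#include <stdint.h>

/* Only packets whose RSS hash lands at or above 'threshold' get an EMC
 * lookup.  With threshold = 0.75 * UINT32_MAX and a uniformly distributed
 * hash, roughly 25% of flows use the EMC; the rest go straight to the
 * dpcls.  The decision is stable per flow because the hash is. */
static inline bool
emc_lookup_allowed(uint32_t rss_hash, uint32_t threshold)
{
    return rss_hash >= threshold;
}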

Thanks,
Billy.

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Darrell Ball
> Sent: Thursday, August 17, 2017 7:46 PM
> To: Jan Scheurich ; Darrell Ball
> 
> Cc: d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert
> for recirc packets
> 
> 
> 
> On 8/17/17, 5:22 AM, "Jan Scheurich"  wrote:
> 
> The RSS hash threshold method looks like the only pseudo-random
> criterion that we can use that produces consistent result for every packet of
> a flow and does require more information. Of course elephant flows with an
> unlucky hash value might never get to use the EMC, but that risk we have
> with any stateless selection scheme.
> 
> 
> 
> [Darrell] It is probably something I know by another name, but JTBC, can
> you define the “RSS hash threshold method” ?
> 
> 
> 
> I am referring to Billy's proposal
> (https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-
> 2DAugust_336509.html=DwIGaQ=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-
> YihVMNtXt-
> uEs=dGZmbKhBG9tJHY4odedsGA=_i_IqWudJqU7R3_ZaFm7HHhOQM
> Hwm_U6G-
> EIyOGjkxI=IDMHVf9n5CjmHMI67mzMd0HZegNJ_LntZLfcdpRUvJI= )
> 
> 
> In essence the is suggests to only select packets for EMC lookup whose RSS
> hash is above a certain threshold. The lookup probability is determined by
> the threshold (e.g. threshold of 0.75 * UINT32_MAX corresponds to 25%). It
> is pseudo-random as, assuming that the hash result is uniformly distributed,
> flows will profit from EMC lookup with the same probability.
> 
> 
> [Darrell] ahh, there is no actual patch yet, just an e-mail
> I see, you have a coined the term “RSS hash threshold method” 
> for
> the approach; the nomenclature makes sense now.
> I’ll have separate comments, of course, on the proposal 
> itself.
> 
> 
> 
> The new thing required will be the dynamic adjustment of lookup
> probability to the EMC fill level and/or hit ratio.
> 
> 
> 
> [Darrell] Did you mean insertion probability rather than lookup 
> probability ?
> 
> 
> 
> No, I actually meant dynamic adaptation of lookup probability. We don't
> want to reduce the EMC lookup probability when the EMC is not yet
> overloaded, but only when the EMC hit rate degrades due to collisions.
> When we devise an algorithm to adapt lookup probability, we can study if it
> could make sense to also dynamically adjust the currently fixed
> (configurable) EMC insertion probability based on EMC fill level and/or hit
> rate.
> 
> 
> [Darrell] Now that I know what you are referring to above, it is a lot easier 
> to
> make the linkage.
> 
> 
> 
> BR, Jan
> 
> 
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for physical port

2017-09-06 Thread O Mahony, Billy


> -Original Message-
> From: Kevin Traynor [mailto:ktray...@redhat.com]
> Sent: Wednesday, September 6, 2017 3:02 PM
> To: Jan Scheurich <jan.scheur...@ericsson.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; wangzh...@jd.com; Darrell Ball
> <db...@vmware.com>; ovs-disc...@openvswitch.org; ovs-
> d...@openvswitch.org
> Subject: Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> On 09/06/2017 02:43 PM, Jan Scheurich wrote:
> >>
> >> I think the mention of pinning was confusing me a little. Let me see
> >> if I fully understand your use case:  You don't 'want' to pin
> >> anything but you are using it as a way to force the distribution of rxq 
> >> from
> a single nic across to PMDs on different NUMAs. As without pinning all rxqs
> are assigned to the NUMA-local pmd leaving the other PMD totally unused.
> >>
> >> But then when you used pinning you the PMDs became isolated so the
> >> vhostuser ports rxqs would not be assigned to the PMDs unless they too
> were pinned. Which worked but was not manageable as VM (and vhost
> ports) came and went.
> >>
> >> Yes?
> >
> > Yes!!!
[[BO'M]] Hurrah!
> >
> >>
> >> In that case what we probably want is the ability to pin an rxq to a
> >> pmd but without also isolating the pmd. So the PMD could be assigned
> some rxqs manually and still have others automatically assigned.
> >
> > Wonderful. That is exactly what I have wanted to propose for a while:
> Separate PMD isolation from pinning of Rx queues.
> >
> > Tying these two together makes it impossible to use pinning of Rx queues
> in OpenStack context (without the addition of dedicated PMDs/cores). And
> even during manual testing it is a nightmare to have to manually pin all 48
> vhostuser queues just because we want to pin the two heavy-loaded Rx
> queues to different PMDs.
> >
> 
> That sounds like it would be useful. Do you know in advance of running which
> rxq's they will be? i.e. you know it's particular port and there is only one
> queue. Or you don't know but analyze at runtime and then reconfigure?
> 
> > The idea would be to introduce a separate configuration option for PMDs
> to isolate them, and no longer automatically set that when pinning an rx
> queue to the PMD.
> >
> 
> Please don't break backward compatibility. I think it would be better to keep
> the existing command as is and add a new softer version that allows other
> rxq's to be scheduled on that pmd also.
[[BO'M]] Although, is the implicit isolation feature of pmd-rxq-affinity 
actually used in the wild? Still, it's sensible to introduce the new 'softer 
version' as you say. 
> 
> Kevin.
> 
> > BR, Jan
> >

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for physical port

2017-09-06 Thread O Mahony, Billy


> -Original Message-
> From: Kevin Traynor [mailto:ktray...@redhat.com]
> Sent: Wednesday, September 6, 2017 2:50 PM
> To: Jan Scheurich <jan.scheur...@ericsson.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; wangzh...@jd.com; Darrell Ball
> <db...@vmware.com>; ovs-disc...@openvswitch.org; ovs-
> d...@openvswitch.org
> Subject: Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> On 09/06/2017 02:33 PM, Jan Scheurich wrote:
> > Hi Billy,
> >
> >> You are going to have to take the hit crossing the NUMA boundary at
> some point if your NIC and VM are on different NUMAs.
> >>
> >> So are you saying that it is more expensive to cross the NUMA
> >> boundary from the pmd to the VM that to cross it from the NIC to the
> PMD?
> >
> > Indeed, that is the case: If the NIC crosses the QPI bus when storing
> packets in the remote NUMA there is no cost involved for the PMD. (The QPI
> bandwidth is typically not a bottleneck.) The PMD only performs local
> memory access.
> >
> > On the other hand, if the PMD crosses the QPI when copying packets into a
> remote VM, there is a huge latency penalty involved, consuming lots of PMD
> cycles that cannot be spent on processing packets. We at Ericsson have
> observed exactly this behavior.
> >
> > This latency penalty becomes even worse when the LLC cache hit rate is
> degraded due to LLC cache contention with real VNFs and/or unfavorable
> packet buffer re-use patterns as exhibited by real VNFs compared to typical
> synthetic benchmark apps like DPDK testpmd.
> >
> >>
> >> If so then in that case you'd like to have two (for example) PMDs
> >> polling 2 queues on the same NIC. With the PMDs on each of the NUMA
> nodes forwarding to the VMs local to that NUMA?
> >>
> >> Of course your NIC would then also need to be able know which VM (or
> >> at least which NUMA the VM is on) in order to send the frame to the
> correct rxq.
> >
> > That would indeed be optimal but hard to realize in the general case (e.g.
> with VXLAN encapsulation) as the actual destination is only known after
> tunnel pop. Here perhaps some probabilistic steering of RSS hash values
> based on measured distribution of final destinations might help in the future.
> >
> > But even without that in place, we need PMDs on both NUMAs anyhow
> (for NUMA-aware polling of vhostuser ports), so why not use them to also
> poll remote eth ports. We can achieve better average performance with
> fewer PMDs than with the current limitation to NUMA-local polling.
> >
> 
> If the user has some knowledge of the numa locality of ports and can place
> VM's accordingly, default cross-numa assignment can be harm performance.
> Also, it would make for very unpredictable performance from test to test and
> even for flow to flow on a datapath.
[[BO'M]] Wang's original request would constitute default cross-NUMA 
assignment, but I don't think this modified proposal would, as it still 
requires explicit config to assign to the remote NUMA.
> 
> Kevin.
> 
> > BR, Jan
> >

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for physical port

2017-09-06 Thread O Mahony, Billy
Hi Wang,

I think the mention of pinning was confusing me a little. Let me see if I fully 
understand your use case: you don't 'want' to pin anything, but you are using 
it as a way to force the distribution of rxqs from a single NIC across PMDs on 
different NUMAs, as without pinning all rxqs are assigned to the NUMA-local 
PMD, leaving the other PMD totally unused.

But then when you used pinning the PMDs became isolated, so the vhostuser 
ports' rxqs would not be assigned to those PMDs unless they too were pinned. 
This worked but was not manageable as VMs (and vhost ports) came and went.

Yes? 

In that case what we probably want is the ability to pin an rxq to a pmd but 
without also isolating the pmd. So the PMD could be assigned some rxqs manually 
and still have others automatically assigned. 

But what I still don't understand is why you don't put both PMDs on the same 
NUMA node. Given that you cannot program the NIC to know which VM a frame is 
for, you would have to RSS the frames across rxqs (i.e. across NUMA nodes). 
Of the frames going to the NIC's local NUMA node, 50% would have to cross the 
NUMA boundary once their destination VM was decided - which is okay - they have 
to cross the boundary at some point. But of the frames going to the non-local 
NUMA node, 50% will actually be destined for what was originally the local NUMA 
node. Those packets (25% of all traffic) will cross the NUMA boundary *twice*, 
whereas if all PMDs were on the NIC's NUMA node those frames would never have 
had to pass between NUMA nodes.

In short I think it's more efficient to have both PMDs on the same NUMA node as 
the NIC.

There is one more comments below..

> -Original Message-
> From: 王志克 [mailto:wangzh...@jd.com]
> Sent: Wednesday, September 6, 2017 12:50 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; Darrell Ball
> <db...@vmware.com>; ovs-disc...@openvswitch.org; ovs-
> d...@openvswitch.org; Kevin Traynor <ktray...@redhat.com>
> Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> Hi Billy,
> 
> See my reply in line.
> 
> Br,
> Wang Zhike
> 
> -Original Message-
> From: O Mahony, Billy [mailto:billy.o.mah...@intel.com]
> Sent: Wednesday, September 06, 2017 7:26 PM
> To: 王志克; Darrell Ball; ovs-disc...@openvswitch.org; ovs-
> d...@openvswitch.org; Kevin Traynor
> Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> Hi Wang,
> 
> You are going to have to take the hit crossing the NUMA boundary at some
> point if your NIC and VM are on different NUMAs.
> 
> So are you saying that it is more expensive to cross the NUMA boundary
> from the pmd to the VM that to cross it from the NIC to the PMD?
> 
> [Wang Zhike] I do not have such data. I hope we can try the new behavior
> and get the test result, and then know whether and how much performance
> can be improved.

[[BO'M]] You don't need a code change to compare the performance of these two 
scenarios; you can simulate it by pinning queues to VMs. I'd imagine crossing 
the NUMA boundary during the PCI DMA would be cheaper than crossing it over 
vhost, but I don't know what the result would be, and it would be a pretty 
interesting figure to have, by the way.
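E.g. (purely illustrative queue:core numbers, with cores 1 and 15 assumed to be 
on different NUMA nodes) the two rxqs of the physical port could be pinned to 
PMDs on different nodes with:

    ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:1,1:15"

and the throughput/cycles compared for VMs local versus remote to each queue's 
PMD.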


> 
> If so then in that case you'd like to have two (for example) PMDs polling 2
> queues on the same NIC. With the PMDs on each of the NUMA nodes
> forwarding to the VMs local to that NUMA?
> 
> Of course your NIC would then also need to be able know which VM (or at
> least which NUMA the VM is on) in order to send the frame to the correct
> rxq.
> 
> [Wang Zhike] Currently I do not know how to achieve it. From my view, NIC
> do not know which NUMA should be the destination of the packet. Only
> after OVS handling (eg lookup the fowarding rule in OVS), then it can know
> the destination. If NIC does not know the destination NUMA socket, it does
> not matter which PMD to poll it.
> 
> 
> /Billy.
> 
> > -Original Message-
> > From: 王志克 [mailto:wangzh...@jd.com]
> > Sent: Wednesday, September 6, 2017 11:41 AM
> > To: O Mahony, Billy <billy.o.mah...@intel.com>; Darrell Ball
> > <db...@vmware.com>; ovs-disc...@openvswitch.org; ovs-
> > d...@openvswitch.org; Kevin Traynor <ktray...@redhat.com>
> > Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> > physical port
> >
> > Hi Billy,
> >
> > It depends on the destination of the traffic.
> >
> > I observed that if the traffic destination is across NUMA socket, the
> > "avg processing cycles per packet" would increase 60% than the traffic
> > to same NUMA socket.
> >
> > Br,
> > Wang Zhike
> >
> > -Original Message-
> > F

Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for physical port

2017-09-06 Thread O Mahony, Billy
Hi Wang,

You are going to have to take the hit crossing the NUMA boundary at some point 
if your NIC and VM are on different NUMAs.

So are you saying that it is more expensive to cross the NUMA boundary from the 
PMD to the VM than to cross it from the NIC to the PMD?

If so, then in that case you'd like to have (for example) two PMDs polling 2 
queues on the same NIC, with a PMD on each of the NUMA nodes forwarding to the 
VMs local to that NUMA?

Of course your NIC would then also need to be able to know which VM (or at 
least which NUMA node the VM is on) in order to send the frame to the correct 
rxq. 

/Billy. 

> -Original Message-
> From: 王志克 [mailto:wangzh...@jd.com]
> Sent: Wednesday, September 6, 2017 11:41 AM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; Darrell Ball
> <db...@vmware.com>; ovs-disc...@openvswitch.org; ovs-
> d...@openvswitch.org; Kevin Traynor <ktray...@redhat.com>
> Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> Hi Billy,
> 
> It depends on the destination of the traffic.
> 
> I observed that if the traffic destination is across NUMA socket, the "avg
> processing cycles per packet" would increase 60% than the traffic to same
> NUMA socket.
> 
> Br,
> Wang Zhike
> 
> -Original Message-
> From: O Mahony, Billy [mailto:billy.o.mah...@intel.com]
> Sent: Wednesday, September 06, 2017 6:35 PM
> To: 王志克; Darrell Ball; ovs-disc...@openvswitch.org; ovs-
> d...@openvswitch.org; Kevin Traynor
> Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> Hi Wang,
> 
> If you create several PMDs on the NUMA of the physical port does that have
> the same performance characteristic?
> 
> /Billy
> 
> 
> 
> > -Original Message-
> > From: 王志克 [mailto:wangzh...@jd.com]
> > Sent: Wednesday, September 6, 2017 10:20 AM
> > To: O Mahony, Billy <billy.o.mah...@intel.com>; Darrell Ball
> > <db...@vmware.com>; ovs-disc...@openvswitch.org; ovs-
> > d...@openvswitch.org; Kevin Traynor <ktray...@redhat.com>
> > Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> > physical port
> >
> > Hi Billy,
> >
> > Yes, I want to achieve better performance.
> >
> > The commit "dpif-netdev: Assign ports to pmds on non-local numa node"
> > can NOT meet my needs.
> >
> > I do have pmd on socket 0 to poll the physical NIC which is also on socket 
> > 0.
> > However, this is not enough since I also have other pmd on socket 1. I
> > hope such pmds on socket 1 can together poll physical NIC. In this
> > way, we have more CPU (in my case, double CPU) to poll the NIC, which
> > results in performance improvement.
> >
> > BR,
> > Wang Zhike
> >
> > -Original Message-
> > From: O Mahony, Billy [mailto:billy.o.mah...@intel.com]
> > Sent: Wednesday, September 06, 2017 5:14 PM
> > To: Darrell Ball; 王志克; ovs-disc...@openvswitch.org; ovs-
> > d...@openvswitch.org; Kevin Traynor
> > Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> > physical port
> >
> > Hi Wang,
> >
> > A change was committed to head of master 2017-08-02 "dpif-netdev:
> > Assign ports to pmds on non-local numa node" which if I understand
> > your request correctly will do what you require.
> >
> > However it is not clear to me why you are pinning rxqs to PMDs in the
> > first instance. Currently if you configure at least on pmd on each
> > numa there should always be a PMD available. Is the pinning for
> performance reasons?
> >
> > Regards,
> > Billy
> >
> >
> >
> > > -Original Message-
> > > From: Darrell Ball [mailto:db...@vmware.com]
> > > Sent: Wednesday, September 6, 2017 8:25 AM
> > > To: 王志克 <wangzh...@jd.com>; ovs-disc...@openvswitch.org; ovs-
> > > d...@openvswitch.org; O Mahony, Billy <billy.o.mah...@intel.com>;
> > Kevin
> > > Traynor <ktray...@redhat.com>
> > > Subject: Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> > > physical port
> > >
> > > Adding Billy and Kevin
> > >
> > >
> > > On 9/6/17, 12:22 AM, "Darrell Ball" <db...@vmware.com> wrote:
> > >
> > >
> > >
> > > On 9/6/17, 12:03 AM, "王志克" <wangzh...@jd.com> wrote:
> > >
> > > Hi Darrell,
> > >
> > > pmd-rxq-affinity has below limitation: (so isolated pmd can
> > > not be used for others, which is not my expect

Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for physical port

2017-09-06 Thread O Mahony, Billy
Hi Wang,

If you create several PMDs on the NUMA node of the physical port, does that 
have the same performance characteristic? 
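For example (core numbers purely illustrative; pick cores that all belong to 
the NIC's NUMA node):

    ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3C

would run four PMD threads on cores 2-5, all NUMA-local to the NIC, rather than 
splitting the PMDs across sockets.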

/Billy



> -Original Message-
> From: 王志克 [mailto:wangzh...@jd.com]
> Sent: Wednesday, September 6, 2017 10:20 AM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; Darrell Ball
> <db...@vmware.com>; ovs-disc...@openvswitch.org; ovs-
> d...@openvswitch.org; Kevin Traynor <ktray...@redhat.com>
> Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> Hi Billy,
> 
> Yes, I want to achieve better performance.
> 
> The commit "dpif-netdev: Assign ports to pmds on non-local numa node" can
> NOT meet my needs.
> 
> I do have pmd on socket 0 to poll the physical NIC which is also on socket 0.
> However, this is not enough since I also have other pmd on socket 1. I hope
> such pmds on socket 1 can together poll physical NIC. In this way, we have
> more CPU (in my case, double CPU) to poll the NIC, which results in
> performance improvement.
> 
> BR,
> Wang Zhike
> 
> -Original Message-
> From: O Mahony, Billy [mailto:billy.o.mah...@intel.com]
> Sent: Wednesday, September 06, 2017 5:14 PM
> To: Darrell Ball; 王志克; ovs-disc...@openvswitch.org; ovs-
> d...@openvswitch.org; Kevin Traynor
> Subject: RE: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> physical port
> 
> Hi Wang,
> 
> A change was committed to head of master 2017-08-02 "dpif-netdev: Assign
> ports to pmds on non-local numa node" which if I understand your request
> correctly will do what you require.
> 
> However it is not clear to me why you are pinning rxqs to PMDs in the first
> instance. Currently if you configure at least on pmd on each numa there
> should always be a PMD available. Is the pinning for performance reasons?
> 
> Regards,
> Billy
> 
> 
> 
> > -Original Message-
> > From: Darrell Ball [mailto:db...@vmware.com]
> > Sent: Wednesday, September 6, 2017 8:25 AM
> > To: 王志克 <wangzh...@jd.com>; ovs-disc...@openvswitch.org; ovs-
> > d...@openvswitch.org; O Mahony, Billy <billy.o.mah...@intel.com>;
> Kevin
> > Traynor <ktray...@redhat.com>
> > Subject: Re: [ovs-dev] OVS DPDK NUMA pmd assignment question for
> > physical port
> >
> > Adding Billy and Kevin
> >
> >
> > On 9/6/17, 12:22 AM, "Darrell Ball" <db...@vmware.com> wrote:
> >
> >
> >
> > On 9/6/17, 12:03 AM, "王志克" <wangzh...@jd.com> wrote:
> >
> > Hi Darrell,
> >
> > pmd-rxq-affinity has below limitation: (so isolated pmd can
> > not be used for others, which is not my expectation. Lots of VMs come
> > and go on the fly, and manully assignment is not feasible.)
> >   >>After that PMD threads on cores where RX queues
> > was pinned will become isolated. This means that this thread will poll
> > only pinned RX queues
> >
> > My problem is that I have several CPUs spreading on different
> > NUMA nodes. I hope all these CPU can have chance to serve the rxq.
> > However, because the phy NIC only locates on one certain socket node,
> > non-same numa pmd/CPU would be excluded. So I am wondering whether
> we
> > can have different behavior for phy port rxq:
> >   round-robin to all PMDs even the pmd on different NUMA socket.
> >
> > I guess this is a common case, and I believe it would improve
> > rx performance.
> >
> >
> > [Darrell] I agree it would be a common problem and some
> > distribution would seem to make sense, maybe factoring in some
> > favoring of local numa PMDs ?
> > Maybe an optional config to enable ?
> >
> >
> > Br,
> > Wang Zhike
> >
> >

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1] netdev-dpdk: Implement TCP/UDP TX cksum in ovs-dpdk side

2017-08-24 Thread O Mahony, Billy
Hi Gao,

Thanks for working on this. Lack of checksum offload is a big difference 
between ovs and ovs-dpdk when using the linux stack in the guest.
 
The thing that struck me was that, rather than immediately calculating the L4 
checksum in the host on vhost rx, the calculation should be delayed until it's 
known to be absolutely required to be done on the host. If the packet is for 
another VM a checksum is not required, as the bits are not going over a 
physical medium. And if the packet is destined for a NIC then the checksum can 
be offloaded if the NIC supports it.

I'm not sure why doing the L4 sum in the guest should give a performance gain. 
The processing still has to be done. Maybe the guest code was compiled for an 
older architecture and is not using as efficient a set of instructions?

In any case, the main advantage of having the dpdk virtio device support 
offload is that it can further offload to a NIC, or avoid the cksum entirely 
when the packet is destined for a local VM.
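A very rough sketch of what I mean (not the patch under review; it assumes an 
IPv4/TCP packet with l2_len/l3_len already set on the mbuf, and dest_is_vhost 
and tx_offload_capa supplied by the caller):

#include <stdbool.h>
#include <rte_ethdev.h>
#include <rte_ip.h>
#include <rte_mbuf.h>
#include <rte_tcp.h>

/* Decide at transmit time whether a software L4 checksum is really needed.
 * Note: with hardware offload many PMDs also expect the pseudo-header
 * checksum to be pre-filled (rte_ipv4_phdr_cksum()); omitted here. */
static void
tx_prepare_cksum(struct rte_mbuf *m, bool dest_is_vhost,
                 uint64_t tx_offload_capa)
{
    if (dest_is_vhost) {
        return;                 /* VM-to-VM: no wire, no checksum needed. */
    }
    if (tx_offload_capa & DEV_TX_OFFLOAD_TCP_CKSUM) {
        m->ol_flags |= PKT_TX_IPV4 | PKT_TX_TCP_CKSUM;  /* NIC computes it. */
        return;
    }
    /* Software fallback, only when the egress NIC cannot offload. */
    struct ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
                                                  m->l2_len);
    struct tcp_hdr *tcp = (struct tcp_hdr *)((char *)ip + m->l3_len);
    tcp->cksum = 0;
    tcp->cksum = rte_ipv4_udptcp_cksum(ip, tcp);
}

The point is that the software checksum only runs when the packet is actually 
heading out a NIC that cannot do the work itself.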

Thanks,
Billy. 


> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Gao Zhenyu
> Sent: Wednesday, August 23, 2017 4:12 PM
> To: Loftus, Ciara 
> Cc: d...@openvswitch.org; us...@dpdk.org
> Subject: Re: [ovs-dev] [PATCH v1] netdev-dpdk: Implement TCP/UDP TX
> cksum in ovs-dpdk side
> 
> Yes, maintaining only one implementation is resonable.
> However making ovs-dpdk to support vhost tx-cksum first is doable as well.
> We can have it in ovs, and replace it with new DPDK API once ovs update its
> dpdk version which contains the tx-cksum implementation.
> 
> 
> Thanks
> Zhenyu Gao
> 
> 2017-08-23 21:59 GMT+08:00 Loftus, Ciara :
> 
> > >
> > > Hi Ciara
> > >
> > > You had a general concern below; can we conclude on that before
> > > going further ?
> > >
> > > Thanks Darrell
> > >
> > > “
> > > > On another note I have a general concern. I understand similar
> > functionality
> > > > is present in the DPDK vhost sample app. I wonder if it would be
> > feasible
> > > for
> > > > this to be implemented in the DPDK vhost library and leveraged
> > > > here,
> > > rather
> > > > than having two implementations in two separate code bases.
> >
> > This is something I'd like to see, although I wouldn't block on this
> > patch waiting for it.
> > Maybe we can have the initial implementation as it is (if it proves
> > beneficial), then move to a common DPDK API if/when it becomes
> available.
> >
> > I've cc'ed DPDK users list hoping for some input. To summarise:
> > From my understanding, the DPDK vhost sample application calculates TX
> > checksum for packets received from vHost ports with invalid/0 checksums:
> > http://dpdk.org/browse/dpdk/tree/examples/vhost/main.c#n910
> > The patch being discussed in this thread (also here:
> > https://patchwork.ozlabs.org/patch/802070/) it seems does something
> > very similar.
> > Wondering on the feasibility of putting this functionality in a
> > rte_vhost library call such that we don't have two separate
> implementations?
> >
> > Thanks,
> > Ciara
> >
> > > >
> > > > I have some other comments inline.
> > > >
> > > > Thanks,
> > > > Ciara
> > > “
> > >
> > >
> > >
> > > From: Gao Zhenyu 
> > > Date: Wednesday, August 16, 2017 at 6:38 AM
> > > To: "Loftus, Ciara" 
> > > Cc: "b...@ovn.org" , "Chandran, Sugesh"
> > > , "ktray...@redhat.com"
> > > , Darrell Ball ,
> > > "d...@openvswitch.org" 
> > > Subject: Re: [ovs-dev] [PATCH v1] netdev-dpdk: Implement TCP/UDP TX
> > > cksum in ovs-dpdk side
> > >
> > > Hi Loftus,
> > >I had submitted a new version, please see
> > > https://patchwork.ozlabs.org/patch/802070/
> > >It move the cksum to vhost receive side.
> > > Thanks
> > > Zhenyu Gao
> > >
> > > 2017-08-10 12:35 GMT+08:00 Gao Zhenyu :
> > > I see, for flows in phy-phy setup, they should not be calculate cksum.
> > > I will revise my patch to do the cksum for vhost port only. I will
> > > send
> > a new
> > > patch next week.
> > >
> > > Thanks
> > > Zhenyu Gao
> > >
> > > 2017-08-08 17:53 GMT+08:00 Loftus, Ciara :
> > > >
> > > > Hi Loftus,
> > > >
> > > > Thanks for testing and the comments!
> > > > Can you show more details about your phy-vm-phy,phy-phy setup and
> > > > testing steps? Then I can reproduce it to see if I can solve this
> > > > pps
> > problem.
> > >
> > > You're welcome. I forgot to mention my tests were with 64B packets.
> > >
> > > For phy-phy the setup is a single host with 2 dpdk physical ports
> > > and 1
> > flow
> > > rule port1 -> port2.
> > > See figure 3 here:
> > > https://tools.ietf.org/html/draft-ietf-bmwg-vswitch-
> > > opnfv-04#section-4
> > >
> > > For the phy-vm-phy the setup is a single host with 2 dpdk physical
> > > ports
> > and 2
> > > 

Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic

2017-08-17 Thread O Mahony, Billy
Hi All,

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Thursday, August 17, 2017 5:22 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; Kevin Traynor
> <ktray...@redhat.com>; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic
> 
> Good discussion. Some thoughts:
> 
> 1. Prioritizing queues by assigning them to dedicated PMDs is a simple and
> effective but very crude method, considering that you have to reserve an
> entire (logical) core for that. So I am all for a more economic and perhaps
> slightly less deterministic option!
[[BO'M]] I would agree. I was just drawing attention to the two-part nature of 
the patch, i.e. that it's not dpdk-specific as such but comes with a dpdk 
implementation.
> 
> 2. Offering the option to prioritize certain queues in OVS-DPDK is a highly
> desirable feature. We have at least one important use case in OpenStack
> (prioritizing "in-band" infrastructure control plane traffic over tenant 
> data, in
> case both are carried on the same physical network). In our case the traffic
> separation would be done per VLAN. Can we add this to the list of supported
> filters?
[[BO'M]] Good to know about use-cases. I'll dig a bit on that with respect to 
dpdk drivers and hardware.
> 
> 3. It would be nice to be able to combine priority queues with filters with a
> number of RSS queues without filter. Is this a XL710 HW limitation or only a
> limitation of the drivers and DPDK APIs?
[[BO'M]] Again I'll have to dig on this. Our go-to guy for this is on vacation 
at the moment. Remind me if I don't get back with a response. 
> 
> BR, Jan
> 
> 
> > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> > boun...@openvswitch.org] On Behalf Of O Mahony, Billy
> > Sent: Thursday, 17 August, 2017 18:07
> > To: Kevin Traynor <ktray...@redhat.com>; d...@openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive
> > traffic
> >
> > Hi Kevin,
> >
> > Thanks for the comments - more inline.
> >
> > Billy.
> >
> > > -Original Message-
> > > From: Kevin Traynor [mailto:ktray...@redhat.com]
> > > Sent: Thursday, August 17, 2017 3:37 PM
> > > To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> > > Subject: Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive
> > > traffic
> > >
> > > Hi Billy,
> > >
> > > I just happened to be about to send a reply to the previous
> > > patchset, so adding comments here instead.
> > >
> > > On 08/17/2017 03:24 PM, Billy O'Mahony wrote:
> > > > Hi All,
> > > >
> > > > v2: Addresses various review comments; Applies cleanly on 0bedb3d6.
> > > >
> > > > This patch set provides a method to request ingress scheduling on
> > > interfaces.
> > > > It also provides an implemtation of same for DPDK physical ports.
> > > >
> > > > This allows specific packet types to be:
> > > > * forwarded to their destination port ahead of other packets.
> > > > and/or
> > > > * be less likely to be dropped in an overloaded situation.
> > > >
> > > > It was previously discussed
> > > > https://mail.openvswitch.org/pipermail/ovs-discuss/2017-
> > > May/044395.htm
> > > > l
> > > > and RFC'd
> > > > https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335237.ht
> > > > ml
> > > >
> > > > Limitations of this patch:
> > > > * The patch uses the Flow Director filter API in DPDK and has only
> > > > been tested  on Fortville (XL710) NIC.
> > > > * Prioritization is limited to:
> > > > ** eth_type
> > > > ** Fully specified 5-tuple src & dst ip and port numbers for UDP &
> > > > TCP packets
> > > > * ovs-appctl dpif-netdev/pmd-*-show o/p should indicate rxq
> > > prioritization.
> > > > * any requirements for a more granular prioritization mechanism
> > > >
> > >
> > > In general I like the idea of splitting priority traffic to a
> > > specific queue but I have concerns about the implementation. I
> > > shared most of these when we met already but adding here too. Not a
> detailed review.
> > [[BO'M]] No worries. If we get the high-level sorted out first the
> > details will fall into place :)
> > >
> > > - It is using deprecated DPDK filter API.
> > > http://dpdk.org/doc/guides/rel_notes/

Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic

2017-08-17 Thread O Mahony, Billy
Hi Kevin,

Thanks for the comments - more inline.

Billy.

> -Original Message-
> From: Kevin Traynor [mailto:ktray...@redhat.com]
> Sent: Thursday, August 17, 2017 3:37 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic
> 
> Hi Billy,
> 
> I just happened to be about to send a reply to the previous patchset, so
> adding comments here instead.
> 
> On 08/17/2017 03:24 PM, Billy O'Mahony wrote:
> > Hi All,
> >
> > v2: Addresses various review comments; Applies cleanly on 0bedb3d6.
> >
> > This patch set provides a method to request ingress scheduling on
> interfaces.
> > It also provides an implementation of same for DPDK physical ports.
> >
> > This allows specific packet types to be:
> > * forwarded to their destination port ahead of other packets.
> > and/or
> > * be less likely to be dropped in an overloaded situation.
> >
> > It was previously discussed
> > https://mail.openvswitch.org/pipermail/ovs-discuss/2017-
> May/044395.htm
> > l
> > and RFC'd
> > https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335237.html
> >
> > Limitations of this patch:
> > * The patch uses the Flow Director filter API in DPDK and has only
> > been tested  on Fortville (XL710) NIC.
> > * Prioritization is limited to:
> > ** eth_type
> > ** Fully specified 5-tuple src & dst ip and port numbers for UDP & TCP
> > packets
> > * ovs-appctl dpif-netdev/pmd-*-show o/p should indicate rxq
> prioritization.
> > * any requirements for a more granular prioritization mechanism
> >
> 
> In general I like the idea of splitting priority traffic to a specific queue 
> but I
> have concerns about the implementation. I shared most of these when we
> met already but adding here too. Not a detailed review.
[[BO'M]] No worries. If we get the high-level sorted out first the details will 
fall into place :) 
> 
> - It is using deprecated DPDK filter API.
> http://dpdk.org/doc/guides/rel_notes/deprecation.html
[[BO'M]] Yes it looks like a move to the shiny new Flow API is in order.
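
For illustration only, steering a single ethertype (e.g. PTP) to a dedicated
rx queue with rte_flow could look roughly like the sketch below. This is not
part of the patch, the helper name is made up, and whether a given NIC/driver
accepts this particular pattern/action combination still has to be probed per
device:

#include <rte_flow.h>
#include <rte_byteorder.h>

/* Illustrative: create an rte_flow rule that sends one ethertype to rxq
 * 'queue_id' of 'port_id'.  Returns NULL (with details in 'error') if the
 * driver rejects it. */
static struct rte_flow *
steer_ethertype_to_queue(uint16_t port_id, uint16_t ether_type,
                         uint16_t queue_id)
{
    struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_item_eth eth_spec = {
        .type = rte_cpu_to_be_16(ether_type),
    };
    struct rte_flow_item_eth eth_mask = {
        .type = rte_cpu_to_be_16(0xffff),   /* Match on ethertype only. */
    };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH,
          .spec = &eth_spec, .mask = &eth_mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    struct rte_flow_action_queue queue = { .index = queue_id };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };
    struct rte_flow_error error;

    return rte_flow_create(port_id, &attr, pattern, actions, &error);
}

The same pattern/action structure should extend to the fully specified 5-tuple
case with IPV4 and UDP/TCP items, subject to per-driver support.
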
> 
> - It is an invasive change that seems to be for only one Intel NIC in the DPDK
> datapath. Even then it is very limited as it only works when that Intel NIC is
> using exactly 2 rx queues.
[[BO'M]] That's the current case but it is really a limitation of the
Flow Director API/DPDK/XL710 combination. Maybe the Flow API will allow RSS
over many queues while placing the prioritized traffic on a separate queue.
> 
> - It's a hardcoded opaque QoS which will have a negative impact on
> whichever queues happen to land on the same pmd so it's unpredictable
> which queues will be affected. It could affect other latency sensitive traffic
> that cannot be prioritized because of the limitations above.
> 
> - I guess multiple priority queues could land on the same pmd and starve
> each other?
[[BO'M]] Interaction with pmd assignment is definitely an issue that needs to 
be addressed. I know there is work in-flight in that regard so it will be 
easier to address that when the in-flight work lands.
> 
> I think a more general, less restricted scheme using DPDK rte_flow API with
> controls on the effects to other traffic is needed. Perhaps if a user is very
> concerned with latency on traffic from a port, they would be ok with
> dedicating a pmd to it.
[[BO'M]] You are proposing to prioritize queues by allocating a dedicated pmd to
them rather than by changing the pmd's read algorithm to favor prioritized
queues? For sure that could be another implementation of the solution (a sketch
of the read-algorithm approach is given below for reference).

If we look at the patch set as containing two distinct things, as per the cover
letter - "the patch set provides a method to request ingress scheduling on
interfaces. It also provides an implementation of same for DPDK physical
ports." - then this would change the second part but the first would still be
valid. Each port type would in any case have to come up with its own
implementation - it's just that for non-physical ports, which cannot offload
the prioritization decision, it is not worth the effort - as was noted in an
earlier RFC.
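
For reference, the read-algorithm approach amounts to something like the
sketch below. It is illustrative only - 'struct rxq' and rxq_recv() here are
placeholders, not OVS symbols - but it shows the intent: drain any rxq flagged
as prioritized before giving the remaining rxqs their usual single batch.

#include <stdbool.h>

struct rxq {
    bool prioritized;
    /* ... device/queue handle ... */
};

int rxq_recv(struct rxq *);   /* Assumed helper: reads one batch and returns
                               * the number of packets received. */

static void
poll_rxqs(struct rxq *rxqs, int n_rxqs)
{
    /* Strict priority: empty the prioritized queues first. */
    for (int i = 0; i < n_rxqs; i++) {
        if (rxqs[i].prioritized) {
            while (rxq_recv(&rxqs[i]) > 0) {
                continue;
            }
        }
    }
    /* Then one batch per non-prioritized queue, as today. */
    for (int i = 0; i < n_rxqs; i++) {
        if (!rxqs[i].prioritized) {
            rxq_recv(&rxqs[i]);
        }
    }
}

The starvation concern is obvious in this form - a saturated prioritized queue
keeps the inner loop busy and starves everything else on that pmd - which is
exactly the interaction with pmd assignment that still needs to be worked out.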

> 
> thanks,
> Kevin.
> 
> > Initial results:
> > * even when userspace OVS is very much overloaded and
> >   dropping significant numbers of packets the drop rate for prioritized 
> > traffic
> >   is running at 1/1000th of the drop rate for non-prioritized traffic.
> >
> > * the latency profile of prioritized traffic through userspace OVS is also
> much
> >   improved
> >
> > 1e0 |*
> > |*
>

Re: [ovs-dev] [PATCH 2/4] netdev-dpdk: Apply ingress_sched config to dpdk phy ports

2017-08-17 Thread O Mahony, Billy
Hi Mark,

Thanks for the very useful review comments.

I'll send a rev'd patch set shortly.

/Billy

> -Original Message-
> From: Kavanagh, Mark B
> Sent: Friday, August 4, 2017 4:14 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH 2/4] netdev-dpdk: Apply ingress_sched config
> to dpdk phy ports
> 
> >From: ovs-dev-boun...@openvswitch.org
> >[mailto:ovs-dev-boun...@openvswitch.org]
> >On Behalf Of Billy O'Mahony
> >Sent: Thursday, July 20, 2017 5:11 PM
> >To: d...@openvswitch.org
> >Subject: [ovs-dev] [PATCH 2/4] netdev-dpdk: Apply ingress_sched config
> >to dpdk phy ports
> >
> >Ingress scheduling configuration is given effect by way of Flow
> >Director filters. A small subset of the ingress scheduling possible is
> >implemented in this patch.
> >
> >Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> 
> Hi Billy,
> 
> As a general comment, this patch doesn't apply to HEAD of master, so please
> rebase as part of rework.
> 
> Review comments inline.
> 
> Thanks,
> Mark
> 
> 
> >---
> > include/openvswitch/ofp-parse.h |   3 +
> > lib/dpif-netdev.c   |   1 +
> > lib/netdev-dpdk.c   | 167
> ++-
> >-
> > vswitchd/bridge.c   |   2 +
> > 4 files changed, 166 insertions(+), 7 deletions(-)
> >
> >diff --git a/include/openvswitch/ofp-parse.h
> >b/include/openvswitch/ofp-parse.h index fc5784e..08d6086 100644
> >--- a/include/openvswitch/ofp-parse.h
> >+++ b/include/openvswitch/ofp-parse.h
> >@@ -37,6 +37,9 @@ struct ofputil_table_mod;  struct ofputil_bundle_msg;
> >struct ofputil_tlv_table_mod;  struct simap;
> >+struct tun_table;
> >+struct flow_wildcards;
> >+struct ofputil_port_map;
> > enum ofputil_protocol;
> >
> > char *parse_ofp_str(struct ofputil_flow_mod *, int command, const char
> >*str_, diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> >47a9fa0..d35566f 100644
> >--- a/lib/dpif-netdev.c
> >+++ b/lib/dpif-netdev.c
> >@@ -44,6 +44,7 @@
> > #include "dp-packet.h"
> > #include "dpif.h"
> > #include "dpif-provider.h"
> >+#include "netdev-provider.h"
> 
> If a setter function for modifying netdev->ingress_sched_str is
> implemented, there is no need to include netdev-provider.h
> 
> > #include "dummy.h"
> > #include "fat-rwlock.h"
> > #include "flow.h"
> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
> >e74c50f..e393abf 100644
> >--- a/lib/netdev-dpdk.c
> >+++ b/lib/netdev-dpdk.c
> >@@ -33,6 +33,8 @@
> > #include 
> > #include 
> >
> >+#include 
> >+#include 
> 
> Move these include below, with the other openvswitch include file.
> 
> > #include "dirs.h"
> > #include "dp-packet.h"
> > #include "dpdk.h"
> >@@ -169,6 +171,10 @@ static const struct rte_eth_conf port_conf = {
> > .txmode = {
> > .mq_mode = ETH_MQ_TX_NONE,
> > },
> >+.fdir_conf = {
> >+.mode = RTE_FDIR_MODE_PERFECT,
> 
> As you mentioned in your cover letter, you've only tested on a Fortville NIC.
> How widely supported are the Flow Director features across DPDK-
> supported NICs?
[[BO'M]] I'm not sure. Probably many NICs have the capabilities required - but 
I don't know if the dpdk drivers expose it.
> 
> >+},
> >+
> > };
> >
> > enum { DPDK_RING_SIZE = 256 };
> >@@ -330,6 +336,11 @@ enum dpdk_hw_ol_features {
> > NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,  };
> >
> >+union ingress_filter {
> >+struct rte_eth_ethertype_filter ethertype;
> >+struct rte_eth_fdir_filter fdir;
> >+};
> >+
> > struct netdev_dpdk {
> > struct netdev up;
> > dpdk_port_t port_id;
> >@@ -369,8 +380,11 @@ struct netdev_dpdk {
> > /* If true, device was attached by rte_eth_dev_attach(). */
> > bool attached;
> >
> >-/* Ingress Scheduling config */
> >+/* Ingress Scheduling config & state. */
> > char *ingress_sched_str;
> >+bool ingress_sched_changed;
> >+enum rte_filter_type ingress_filter_type;
> >+union ingress_filter ingress_filter;
> >
> > /* In dpdk_list. */
> > struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex); @@ -653,6
> >+667,15 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int
> n_rxq,
> >int n_txq)
> > int i;
&

Re: [ovs-dev] [PATCH 1/4] netdev: Add set_ingress_sched to netdev api

2017-08-16 Thread O Mahony, Billy
Hi Mark,

> -Original Message-
> From: O Mahony, Billy
> Sent: Wednesday, August 16, 2017 4:53 PM
> To: Kavanagh, Mark B <mark.b.kavan...@intel.com>; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH 1/4] netdev: Add set_ingress_sched to netdev
> api
> 
> Hi Mark,
> 
> I'm continuing with rework/rebase. Some comments below.
> 
> > -Original Message-
> > From: Kavanagh, Mark B
> > Sent: Friday, August 4, 2017 3:49 PM
> > To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> > Subject: RE: [ovs-dev] [PATCH 1/4] netdev: Add set_ingress_sched to
> > netdev api
> >
> > >From: ovs-dev-boun...@openvswitch.org
> > >[mailto:ovs-dev-boun...@openvswitch.org]
> > >On Behalf Of Billy O'Mahony
> > >Sent: Thursday, July 20, 2017 5:11 PM
> > >To: d...@openvswitch.org
> > >Subject: [ovs-dev] [PATCH 1/4] netdev: Add set_ingress_sched to
> > >netdev api
> > >
> > >Passes ingress_sched config item from other_config column of
> > >Interface table to the netdev.
> >
> >
> > Hi Billy,
> >
> > Thanks for the patch - some review comments inline.
> >
> > Cheers,
> > Mark
> >
> > >
> > >Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> > >---
> > > lib/netdev-bsd.c  |  1 +
> > > lib/netdev-dpdk.c | 19 +++
> > > lib/netdev-dummy.c|  1 +
> > > lib/netdev-linux.c|  1 +
> > > lib/netdev-provider.h | 10 ++
> > > lib/netdev-vport.c|  1 +
> > > lib/netdev.c  | 22 ++
> > > lib/netdev.h  |  1 +
> > > vswitchd/bridge.c |  2 ++
> > > 9 files changed, 58 insertions(+)
> > >
> > >diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index
> > >6cc83d3..eadf7bf
> > >100644
> > >--- a/lib/netdev-bsd.c
> > >+++ b/lib/netdev-bsd.c
> > >@@ -1509,6 +1509,7 @@ netdev_bsd_update_flags(struct netdev
> > *netdev_,
> > >enum netdev_flags off,
> > > netdev_bsd_get_etheraddr,\
> > > netdev_bsd_get_mtu,  \
> > > NULL, /* set_mtu */  \
> > >+NULL, /* set_ingress_sched */\
> > > netdev_bsd_get_ifindex,  \
> > > netdev_bsd_get_carrier,  \
> > > NULL, /* get_carrier_resets */   \
> > >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
> > >ea17b97..e74c50f 100644
> > >--- a/lib/netdev-dpdk.c
> > >+++ b/lib/netdev-dpdk.c
> > >@@ -369,6 +369,9 @@ struct netdev_dpdk {
> > > /* If true, device was attached by rte_eth_dev_attach(). */
> > > bool attached;
> > >
> > >+/* Ingress Scheduling config */
> > >+char *ingress_sched_str;
> >
> > Would ingress_sched_cfg be more apt?
> [[BO'M]] I find it useful to have the hint that this is a (human readable) 
> string
> as opposed to a struct.
> >
> > >+
> > > /* In dpdk_list. */
> > > struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
> > >
> > >@@ -1018,6 +1021,7 @@ netdev_dpdk_destruct(struct netdev *netdev)
> > > }
> > >
> > > free(dev->devargs);
> > >+free(dev->ingress_sched_str);
> >
> > There is a bug here.
> >
> > In the case that a user doesn't set an ingress scheduling policy,
> > netdev_dpdk's ingress_sched_str will not have been set. However, since
> > it is not initialized/set to the NULL string anywhere in the code, it
> > could potentially point to a random area of memory. Upon destruction
> > of the port, the call to free(dev->ingress_sched_str) will free said
> > memory, causing undesired behavior for any application/process using it.
> >
> [[BO'M]] I'm happy to put a check in here - just generally in OVS, checks for
> things that ought never to happen are generally not made. But maybe that is
> just in cycle-critical packet handling code paths. The ingress_sched_str ptr
> is set to NULL in common_construct() (that may be in one of the other patches)
> so it will either be NULL or point to a malloc'd location and should not be an
> issue. But TBH I'm happier with a check in front of code like this too.
> >
> > > common_destruct(dev);
> > >
> > > ovs_mutex_unlock(&dpdk_mutex);
> > >@@ -1941

Re: [ovs-dev] [PATCH 1/4] netdev: Add set_ingress_sched to netdev api

2017-08-16 Thread O Mahony, Billy
Hi Mark,

I'm continuing with rework/rebase. Some comments below.

> -Original Message-
> From: Kavanagh, Mark B
> Sent: Friday, August 4, 2017 3:49 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH 1/4] netdev: Add set_ingress_sched to netdev
> api
> 
> >From: ovs-dev-boun...@openvswitch.org
> >[mailto:ovs-dev-boun...@openvswitch.org]
> >On Behalf Of Billy O'Mahony
> >Sent: Thursday, July 20, 2017 5:11 PM
> >To: d...@openvswitch.org
> >Subject: [ovs-dev] [PATCH 1/4] netdev: Add set_ingress_sched to netdev
> >api
> >
> >Passes ingress_sched config item from other_config column of Interface
> >table to the netdev.
> 
> 
> Hi Billy,
> 
> Thanks for the patch - some review comments inline.
> 
> Cheers,
> Mark
> 
> >
> >Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> >---
> > lib/netdev-bsd.c  |  1 +
> > lib/netdev-dpdk.c | 19 +++
> > lib/netdev-dummy.c|  1 +
> > lib/netdev-linux.c|  1 +
> > lib/netdev-provider.h | 10 ++
> > lib/netdev-vport.c|  1 +
> > lib/netdev.c  | 22 ++
> > lib/netdev.h  |  1 +
> > vswitchd/bridge.c |  2 ++
> > 9 files changed, 58 insertions(+)
> >
> >diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index 6cc83d3..eadf7bf
> >100644
> >--- a/lib/netdev-bsd.c
> >+++ b/lib/netdev-bsd.c
> >@@ -1509,6 +1509,7 @@ netdev_bsd_update_flags(struct netdev
> *netdev_,
> >enum netdev_flags off,
> > netdev_bsd_get_etheraddr,\
> > netdev_bsd_get_mtu,  \
> > NULL, /* set_mtu */  \
> >+NULL, /* set_ingress_sched */\
> > netdev_bsd_get_ifindex,  \
> > netdev_bsd_get_carrier,  \
> > NULL, /* get_carrier_resets */   \
> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
> >ea17b97..e74c50f 100644
> >--- a/lib/netdev-dpdk.c
> >+++ b/lib/netdev-dpdk.c
> >@@ -369,6 +369,9 @@ struct netdev_dpdk {
> > /* If true, device was attached by rte_eth_dev_attach(). */
> > bool attached;
> >
> >+/* Ingress Scheduling config */
> >+char *ingress_sched_str;
> 
> Would ingress_sched_cfg be more apt?
[[BO'M]] I find it useful to have the hint that this is a (human readable) 
string as opposed to a struct.
> 
> >+
> > /* In dpdk_list. */
> > struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
> >
> >@@ -1018,6 +1021,7 @@ netdev_dpdk_destruct(struct netdev *netdev)
> > }
> >
> > free(dev->devargs);
> >+free(dev->ingress_sched_str);
> 
> There is a bug here.
> 
> In the case that a user doesn't set an ingress scheduling policy,
> netdev_dpdk's ingress_sched_str will not have been set. However, since it is
> not initialized/set to the NULL string anywhere in the code, it could
> potentially point to a random area of memory. Upon destruction of the port,
> the call to free(dev->ingress_sched_str) will free said memory, causing
> undesired behavior for any application/process using it.
> 
[[BO'M]] I'm happy to put a check in here - just generally in OVS, checks for
things that ought never to happen are generally not made. But maybe that is
just in cycle-critical packet handling code paths. The ingress_sched_str ptr is
set to NULL in common_construct() (that may be in one of the other patches) so
it will either be NULL or point to a malloc'd location and should not be an
issue. But TBH I'm happier with a check in front of code like this too.
> 
> > common_destruct(dev);
> >
> > ovs_mutex_unlock(&dpdk_mutex);
> >@@ -1941,6 +1945,20 @@ netdev_dpdk_set_mtu(struct netdev *netdev,
> int
> >mtu)  }
> >
> > static int
> >+netdev_dpdk_set_ingress_sched(struct netdev *netdev,
> >+  const char *ingress_sched_str) {
> >+struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> >+
> >+free(dev->ingress_sched_str);
> 
> As above.
> 
> >+if (ingress_sched_str) {
> >+dev->ingress_sched_str = xstrdup(ingress_sched_str);
> >+}
> >+
> >+return 0;
> >+}
> >+
> >+static int
> > netdev_dpdk_get_carrier(const struct netdev *netdev, bool *carrier);
> >
> > static int
> >@@ -3246,6 +3264,7 @@ unlock:
> > netdev_dpdk_get_etheraddr,\
> > 

Re: [ovs-dev] [PATCH v2 5/5] dp-packet: Use memcpy on dp_packet elements.

2017-08-01 Thread O Mahony, Billy
Hi Antonio,

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of antonio.fische...@intel.com
> Sent: Wednesday, July 19, 2017 5:05 PM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH v2 5/5] dp-packet: Use memcpy on dp_packet
> elements.
> 
> memcpy replaces the several single copies inside
> dp_packet_clone_with_headroom().
> 
> Signed-off-by: Antonio Fischetti 
> ---
>  lib/dp-packet.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c index 67aa406..f4dbcb7 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -157,8 +157,9 @@ dp_packet_clone(const struct dp_packet *buffer)
>  return dp_packet_clone_with_headroom(buffer, 0);  }
> 
> -/* Creates and returns a new dp_packet whose data are copied from
> 'buffer'.   The
> - * returned dp_packet will additionally have 'headroom' bytes of headroom.
> */
> +/* Creates and returns a new dp_packet whose data are copied from
> 'buffer'.
> + * The returned dp_packet will additionally have 'headroom' bytes of
> + * headroom. */
>  struct dp_packet *
>  dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t
> headroom)  { @@ -167,13 +168,12 @@
> dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t
> headroom)
>  new_buffer =
> dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
>   dp_packet_size(buffer),
>   headroom);
> -new_buffer->l2_pad_size = buffer->l2_pad_size;
> -new_buffer->l2_5_ofs = buffer->l2_5_ofs;
> -new_buffer->l3_ofs = buffer->l3_ofs;
> -new_buffer->l4_ofs = buffer->l4_ofs;
> -new_buffer->md = buffer->md;
> -new_buffer->cutlen = buffer->cutlen;
> -new_buffer->packet_type = buffer->packet_type;
> +/* Copy the following fields into the returned buffer: l2_pad_size,
> + * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
> +memcpy(&new_buffer->l2_pad_size, &buffer->l2_pad_size,
> +sizeof(struct dp_packet) -
> +offsetof(struct dp_packet, l2_pad_size));
> +
[[BO'M]]
Does this change in itself give a measurable performance improvement?

A reference/warning in the dp_packet declaration would be a good idea - anyone
changing or reordering the fields from l2_pad_size to the end of the structure
needs to be aware of this implementation of dp_packet_clone_with_headroom().
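
To make the suggested warning concrete, here is a standalone sketch (not OVS
code - the struct is cut down and renamed) of the pattern together with a
compile-time guard that would catch accidental reordering:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct pkt_meta {
    void *base;            /* Not part of the bulk copy. */
    uint16_t l2_pad_size;  /* First member of the copied range... */
    uint16_t l3_ofs;
    uint16_t l4_ofs;
    uint32_t cutlen;       /* ...last member of the copied range. */
};

static void
pkt_meta_copy_tail(struct pkt_meta *dst, const struct pkt_meta *src)
{
    /* Anyone moving members out of the l2_pad_size..cutlen range must
     * revisit this memcpy(). */
    static_assert(offsetof(struct pkt_meta, cutlen) >
                  offsetof(struct pkt_meta, l2_pad_size),
                  "copied members must stay contiguous and in order");

    memcpy(&dst->l2_pad_size, &src->l2_pad_size,
           sizeof(struct pkt_meta) - offsetof(struct pkt_meta, l2_pad_size));
}

Even just the comment half of this, placed beside the struct declaration,
would cover the concern above.
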

>  #ifdef DPDK_NETDEV
>  new_buffer->mbuf.ol_flags = buffer->mbuf.ol_flags;  #else
> --
> 2.4.11
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] Proposal: EMC load-shedding

2017-08-01 Thread O Mahony, Billy
Hi All,

This proposal is an attempt to make a more general solution to the same issue
of EMC thrashing addressed by
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335940.html. That patch
proposes that when the EMC is overloaded, recirculated packets are neither
inserted into the EMC nor looked up in it.

However, while it is a cheap decision, picking 'recirculated' as the category
of packet to drop from the EMC is not very granular - dropping that category
may leave the EMC under-utilized or still overloaded - it all depends on what
proportion of packets were recirculated.

Once there are too many entries contending for any cache it becomes a
liability, as lookup_cost + (miss_rate * cost_of_miss) grows to be greater than
cost_of_miss. (For example, with, say, a 30-cycle EMC lookup and a 150-cycle
miss that falls through to the dpcls, the EMC stops paying for itself once the
miss rate exceeds (150 - 30) / 150 = 80%.) In that overloaded case any scheme
where a very cheap decision can be made to not insert a certain category of
packets, and to also not check the EMC for that same category, will reduce the
cache miss rate.

Looking at a packet's RSS hash can also provide a "very cheap decision that can
be made to not insert a certain category of packets and to also not check the
EMC for that same category"

I don't want to propose an actual patch right now but if the general principle
is agreed then I would be happy to write at least a first iteration of the
patch or maybe the authors of the patch above would like to do so?

So I suggest that when the EMC becomes "overloaded", load is shed based on the
RSS/5-tuple hash. If the hash is above a certain threshold the packet becomes
eligible to be inserted into the EMC. Packets with a hash below the threshold
are neither inserted nor looked up in the EMC. The threshold can be changed in
an ongoing and granular way to adapt to current traffic conditions and maintain
an optimum EMC utilization.

So in simple, non-optimized (and non-compiling) terms:

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 47a9fa0..df5b9db 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -4599,11 +4599,12 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
 miniflow_extract(packet, &key->mf);
 key->len = 0; /* Not computed yet. */
 key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);

 /* If EMC is disabled skip emc_lookup */
-flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);
+if (key->hash > flow_cache.shed_threshold) {
+flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);
+}
 if (OVS_LIKELY(flow)) {
 dp_netdev_queue_batches(packet, flow, &key->mf, batches,
 n_batches);
 } else {
 /* Exact match cache missed. Group missed packets together at
@@ -4777,11 +4778,12 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 continue;
 }

 flow = dp_netdev_flow_cast(rules[i]);

-emc_probabilistic_insert(pmd, &keys[i], flow);
+if (keys[i].hash > pmd->flow_cache->shed_threshold) {
+emc_probabilistic_insert(pmd, &keys[i], flow);
+}
 dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
 }

 dp_netdev_count_packet(pmd, DP_STAT_MASKED_HIT, cnt - miss_cnt);
 dp_netdev_count_packet(pmd, DP_STAT_LOOKUP_HIT, lookup_cnt);

The other question is setting the shed_threshold value. Generally:

 shed_threshold = some_function_of(emc_load)

Some suggestions for calculating emc_load:
* the num_alive_entries/num_entries ratio (see the sketch after this list)
* the num_evictions/num_insertions ratio (where num_evictions is the number of
  insertions that overwrote an existing alive entry).
* something more adaptable:
  emc_load = some_function_of (cost_of_emc_miss,
   cost_of_emc_lookup,
   probability_of_emc_miss)
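
To make the occupancy-based variant concrete, a minimal sketch (the struct and
function below are made up for illustration, they are not existing OVS code):

#include <stdint.h>

struct emc_stats {
    uint32_t n_entries;        /* Total EMC slots, e.g. 8192. */
    uint32_t n_alive_entries;  /* Currently active slots. */
};

/* Map EMC occupancy onto a hash threshold: 0 means nothing is shed; beyond
 * 'trigger_pct' occupancy, the shed fraction grows with the overload. */
static uint32_t
emc_shed_threshold(const struct emc_stats *s, uint32_t trigger_pct)
{
    uint32_t occupancy_pct = (uint64_t) s->n_alive_entries * 100
                             / s->n_entries;

    if (occupancy_pct <= trigger_pct) {
        return 0;                      /* EMC used exactly as today. */
    }
    uint32_t shed_pct = occupancy_pct - trigger_pct;
    return (uint32_t) (((uint64_t) UINT32_MAX * shed_pct) / 100);
}

emc_processing()/fast_path_processing() would then compare key->hash against
the returned value, as in the diff above, and the emc_load calculation could be
swapped for any of the other suggestions.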


I'd be interested to know what people think of doing something like this and
any more details that can be fleshed out, corner cases and so on.

Regards,
Billy

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 3/5] dpif-netdev: Skip EMC lookup/insert for recirc packets.

2017-08-01 Thread O Mahony, Billy
Hi Antonio,

Unfortunately I think the performance deltas here probably need to be re-worked
given the bug discovered & fixed in the EMC insertion algorithm (link below),
which according to the patch notes will significantly reduce EMC contention for
a given number of flows.

https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336452.html

However, before you commit more effort I would like to post a proposal to the 
list on a more generalized EMC load-shedding mechanism which I think could be 
more effective as it would be more granular than shedding just re-circulated 
traffic. I hope to post that today. 

Regards,
/Billy

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of antonio.fische...@intel.com
> Sent: Wednesday, July 19, 2017 5:05 PM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH v2 3/5] dpif-netdev: Skip EMC lookup/insert for
> recirc packets.
> 
> When OVS is configured as a firewall, with thousands of active concurrent
> connections, the EMC gets quickly saturated and may come under heavy
> thrashing for the reason that original and recirculated packets keep overwriting
> existing active EMC entries due to its limited size (8k).
> 
> This thrashing causes the EMC to be less efficient than the dpcls in terms of
> lookups and insertions.
> 
> This patch allows to use the EMC efficiently by allowing only the 'original'
> packets to hit EMC. All recirculated packets are sent to the classifier 
> directly.
> An empirical threshold (EMC_RECIRCT_NO_INSERT_THRESHOLD - of 50%) for
> EMC occupancy is set to trigger this logic. By doing so when EMC utilization
> exceeds
> EMC_RECIRCT_NO_INSERT_THRESHOLD:
>  - EMC Insertions are allowed just for original packets. EMC insertion
>and look up is skipped for recirculated packets.
>  - Recirculated packets are sent to the classifier.
> 
> This patch is based on patch
> "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show" at:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.html
> Also, this patch depends on the previous one in this series.
> 
> Signed-off-by: Antonio Fischetti 
> Signed-off-by: Bhanuprakash Bodireddy
> 
> Co-authored-by: Bhanuprakash Bodireddy
> 
> ---
> In our Connection Tracker testbench set up with
> 
>  table=0, priority=1 actions=drop
>  table=0, priority=10,arp actions=NORMAL  table=0, priority=100,ct_state=-
> trk,ip actions=ct(table=1)  table=1, ct_state=+new+trk,ip,in_port=1
> actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1
> actions=output:2  table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
> table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
> 
> we saw the following performance improvement.
> 
> We measured packet Rx rate (regardless of packet loss). Bidirectional test
> with 64B UDP packets.
> Each row is a test with a different number of traffic streams. The traffic
> generator is set so that each stream establishes one UDP connection.
> Mpps columns reports the Rx rates on the 2 sides.
> 
>            +----------------------+----------------------+
>            |  Original OvS-DPDK   |  Previous case       |
>            |  + patches #1,2      |  + this patch        |
>   ---------+------------+---------+------------+---------+
>    Traffic |     Rx     |   EMC   |     Rx     |   EMC   |
>    Streams |   [Mpps]   | entries |   [Mpps]   | entries |
>   ---------+------------+---------+------------+---------+
>        10  | 2.60, 2.67 |    20   | 2.60, 2.64 |    20   |
>       100  | 2.53, 2.58 |   200   | 2.59, 2.61 |   201   |
>     1,000  | 2.02, 2.03 |  1929   | 2.15, 2.15 |  1997   |
>     2,000  | 1.94, 1.96 |  3661   | 1.97, 1.98 |  3668   |
>     3,000  | 1.87, 1.90 |  5086   | 1.96, 1.98 |  4736   |
>     4,000  | 1.82, 1.82 |  6173   | 1.95, 1.94 |  5280   |
>    10,000  | 1.68, 1.69 |  7826   | 1.84, 1.84 |  7102   |
>    30,000  | 1.57, 1.58 |  8192   | 1.68, 1.70 |  8192   |
>   ---------+------------+---------+------------+---------+
> 
> This test setup implies 1 recirculation on each received packet.
> We didn't check this patch in a test scenario where more than 1 recirculation
> is occurring per packet.
> 
>  lib/dpif-netdev.c | 63
> ++-
>  1 file changed, 58 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 9562827..79efce6
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4573,6 +4573,9 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>  packet_batch_per_flow_update(batch, pkt, mf);  }
> 
> +/* Threshold to skip EMC for recirculated packets. */ #define
> +EMC_RECIRCT_NO_INSERT_THRESHOLD 0xF000
> +
>  /* Try to process all ('cnt') the 'packets' using only the exact match cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 

Re: [ovs-dev] [PATCH v2 2/5] dpif-netdev: Avoid reading RSS hash when EMC is disabled.

2017-07-31 Thread O Mahony, Billy
Hi Antonio,

This patch is definitely simpler than the original.

However on the original patch I suggested: 

"If so it would be less disturbing to the existing code to just add a bool arg 
to dpif_netdev_packet_get_rss_hash() called do_not_check_recirc_depth and use 
that to return early (before the if (recirc_depth) check). Also in that case 
the patch would require none of the  conditional logic changes (neither the 
original or that suggested in this email) and should be able to just set the 
proposed do_not_check_recirc_depth based on md_is_valid."

I know you checked this and reported the performance gain was lower than with 
the v1 patch. We surmised that it was related to introducing a branch in the 
dpif_netdev_packet_get_rss_hash(). However there are many branches in this 
patch also.

Can you give details of how you are testing? 
* What is the traffic
* the flows/rules and 
* how are you measuring the performance difference  (ie. cycles per packet or 
packet throughput or some other measure).

Apologies for going on about this, but if we can get the same effect with a
two or three line change rather than a 20-line change I think it'll be worth it.
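
For concreteness, the flag-based variant referred to above would look roughly
like this - it reuses the helpers already called by
dpif_netdev_packet_get_rss_hash() in lib/dpif-netdev.c and only adds the
parameter (name as in my earlier comment) and the early return; callers would
pass !md_is_valid:

static inline uint32_t
dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
                                const struct miniflow *mf,
                                bool do_not_check_recirc_depth)
{
    uint32_t hash, recirc_depth;

    if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
        hash = dp_packet_get_rss_hash(packet);
    } else {
        hash = miniflow_hash_5tuple(mf, 0);
        dp_packet_set_rss_hash(packet, hash);
    }

    if (do_not_check_recirc_depth) {
        /* Original (non-recirculated) packet: no recirc id to fold in. */
        return hash;
    }

    /* The RSS hash must account for the recirculation depth to avoid
     * collisions in the exact match cache. */
    recirc_depth = *recirc_depth_get_unsafe();
    if (OVS_UNLIKELY(recirc_depth)) {
        hash = hash_finish(hash, recirc_depth);
        dp_packet_set_rss_hash(packet, hash);
    }
    return hash;
}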

One other comment below

Thanks,
Billy.


> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of antonio.fische...@intel.com
> Sent: Wednesday, July 19, 2017 5:05 PM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH v2 2/5] dpif-netdev: Avoid reading RSS hash when
> EMC is disabled.
> 
> When EMC is disabled the reading of RSS hash is skipped.

[[BO'M]] I think this is already the case with the existing code?  Just 
addition of OVS_UNLIKELY on the check. 

> For packets that are not recirculated it retrieves the hash value without
> considering the recirc id.
> 
> This is mostly a preliminary change for the next patch in this series.
> 
> Signed-off-by: Antonio Fischetti 
> ---
>  lib/dpif-netdev.c | 42 ++
>  1 file changed, 34 insertions(+), 8 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 123e04a..9562827
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4472,6 +4472,22 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread
> *pmd, struct dp_packet *packet_,  }
> 
>  static inline uint32_t
> +dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
> +const struct miniflow *mf) {
> +uint32_t hash;
> +
> +if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
> +hash = dp_packet_get_rss_hash(packet);
> +} else {
> +hash = miniflow_hash_5tuple(mf, 0);
> +dp_packet_set_rss_hash(packet, hash);
> +}
> +
> +return hash;
> +}
> +
> +static inline uint32_t
>  dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
>  const struct miniflow *mf)  { @@ -4572,7 
> +4588,8 @@ static
> inline size_t  emc_processing(struct dp_netdev_pmd_thread *pmd,
> struct dp_packet_batch *packets_,
> struct netdev_flow_key *keys,
> -   struct packet_batch_per_flow batches[], size_t *n_batches)
> +   struct packet_batch_per_flow batches[], size_t *n_batches,
> +   bool md_is_valid)
>  {
>  struct emc_cache *flow_cache = &pmd->flow_cache;
>  struct netdev_flow_key *key = &keys[0]; @@ -4602,10 +4619,19 @@
> emc_processing(struct dp_netdev_pmd_thread *pmd,
> 
>  miniflow_extract(packet, &key->mf);
>  key->len = 0; /* Not computed yet. */
> -key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
> 
> -/* If EMC is disabled skip emc_lookup */
> -flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);
> +/* If EMC is disabled skip hash computation and emc_lookup */
> +if (OVS_LIKELY(cur_min)) {
> +if (!md_is_valid) {
> +key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
> +  &key->mf);
> +} else {
> +key->hash = dpif_netdev_packet_get_rss_hash(packet, 
> &key->mf);
> +}
> +flow = emc_lookup(flow_cache, key);
> +} else {
> +flow = NULL;
> +}
>  if (OVS_LIKELY(flow)) {
>  dp_netdev_queue_batches(packet, flow, &key->mf, batches,
>  n_batches); @@ -4801,7 +4827,7 @@
> fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>   * valid, 'md_is_valid' must be true and 'port_no' will be ignored. */  
> static
> void  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
> -  struct dp_packet_batch *packets)
> +  struct dp_packet_batch *packets, bool md_is_valid)
>  {
>  int cnt = packets->count;
>  #if !defined(__CHECKER__) && !defined(_WIN32) @@ -4818,7 +4844,7 @@
> dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>  odp_port_t in_port;
> 
>  n_batches = 0;
> 

Re: [ovs-dev] [PATCH v2 1/5] dpif-netdev: move pkt metadata init out of emc_processing.

2017-07-31 Thread O Mahony, Billy
There is also a reference to md_is_valid in the comments of emc_processing that
needs to be removed.

> -Original Message-
> From: O Mahony, Billy
> Sent: Monday, July 31, 2017 4:04 PM
> To: 'antonio.fische...@intel.com' <antonio.fische...@intel.com>;
> d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH v2 1/5] dpif-netdev: move pkt metadata init
> out of emc_processing.
> 
> Hi Antonio,
> 
> This looks like a reasonable change to me.
> 
> Can you add some performance statistics for when dealing with re-circulated
> packets?
> 
> /Billy.
> 
> > -Original Message-
> > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> > boun...@openvswitch.org] On Behalf Of antonio.fische...@intel.com
> > Sent: Wednesday, July 19, 2017 5:05 PM
> > To: d...@openvswitch.org
> > Subject: [ovs-dev] [PATCH v2 1/5] dpif-netdev: move pkt metadata init
> > out of emc_processing.
> >
> > Packet metadata initialization is moved into dp_netdev_input to
> > improve performance.
> >
> > Signed-off-by: Antonio Fischetti <antonio.fische...@intel.com>
> > ---
> > In my testbench with the following port to port flow setup:
> > in_port=1,action=output:2
> > in_port=2,action=output:1
> >
> > I measured packet Rx rate (regardless of packet loss) in a
> > Bidirectional test with  64B UDP packets.
> >
> > I saw the following performance improvement
> >
> > Orig:  11.30, 11.54 Mpps
> > Orig + patch:  11.70, 11.76 Mpps
> >
> >  lib/dpif-netdev.c | 21 ++---
> >  1 file changed, 10 insertions(+), 11 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > 98e7765..123e04a
> > 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -4572,8 +4572,7 @@ static inline size_t  emc_processing(struct
> > dp_netdev_pmd_thread *pmd,
> > struct dp_packet_batch *packets_,
> > struct netdev_flow_key *keys,
> > -   struct packet_batch_per_flow batches[], size_t *n_batches,
> > -   bool md_is_valid, odp_port_t port_no)
> > +   struct packet_batch_per_flow batches[], size_t
> > + *n_batches)
> >  {
> > >  struct emc_cache *flow_cache = &pmd->flow_cache;
> > >  struct netdev_flow_key *key = &keys[0]; @@ -4601,9 +4600,6 @@
> > emc_processing(struct dp_netdev_pmd_thread *pmd,
> > >  pkt_metadata_prefetch_init(&packets[i+1]->md);
> >  }
> >
> > -if (!md_is_valid) {
> > > -pkt_metadata_init(&packet->md, port_no);
> > -}
> > >  miniflow_extract(packet, &key->mf);
> >  key->len = 0; /* Not computed yet. */
> >  key->hash = dpif_netdev_packet_get_rss_hash(packet,
> > > &key->mf); @@ -4805,8 +4801,7 @@ fast_path_processing(struct
> > dp_netdev_pmd_thread *pmd,
> >   * valid, 'md_is_valid' must be true and 'port_no' will be ignored.
> > */  static void  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
> > -  struct dp_packet_batch *packets,
> > -  bool md_is_valid, odp_port_t port_no)
> > +  struct dp_packet_batch *packets)
> >  {
> >  int cnt = packets->count;
> >  #if !defined(__CHECKER__) && !defined(_WIN32) @@ -4823,8 +4818,7
> @@
> > dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
> >  odp_port_t in_port;
> >
> >  n_batches = 0;
> > > -emc_processing(pmd, packets, keys, batches, &n_batches,
> > > -md_is_valid, port_no);
> > > +emc_processing(pmd, packets, keys, batches, &n_batches);
> >  if (!dp_packet_batch_is_empty(packets)) {
> >  /* Get ingress port from first packet's metadata. */
> >  in_port = packets->packets[0]->md.in_port.odp_port;
> > @@ -4856,14 +4850,19 @@ dp_netdev_input(struct
> dp_netdev_pmd_thread
> > *pmd,
> >  struct dp_packet_batch *packets,
> >  odp_port_t port_no)
> >  {
> > -dp_netdev_input__(pmd, packets, false, port_no);
> > +struct dp_packet *packet;
> > +DP_PACKET_BATCH_FOR_EACH (packet, packets) {
> > > +pkt_metadata_init(&packet->md, port_no);
> > +}
> > +
> > +dp_netdev_input__(pmd, packets);
> >  }
> >
> >  static void
> >  dp_netdev_recirculate(struct dp_netdev_pmd_thread *pmd,
> >struct dp_packet_batch *packets)  {
> > -dp_netdev_input__(pmd, packets, true, 0);
> > +dp_netdev_input__(pmd, packets);
> >  }
> >
> >  struct dp_netdev_execute_aux {
> > --
> > 2.4.11
> >
> > ___
> > dev mailing list
> > d...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 1/5] dpif-netdev: move pkt metadata init out of emc_processing.

2017-07-31 Thread O Mahony, Billy
Hi Antonio,

This looks like a reasonable change to me.

Can you add some performance statistics for when dealing with re-circulated 
packets?

/Billy.

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of antonio.fische...@intel.com
> Sent: Wednesday, July 19, 2017 5:05 PM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH v2 1/5] dpif-netdev: move pkt metadata init out
> of emc_processing.
> 
> Packet metadata initialization is moved into dp_netdev_input to improve
> performance.
> 
> Signed-off-by: Antonio Fischetti 
> ---
> In my testbench with the following port to port flow setup:
> in_port=1,action=output:2
> in_port=2,action=output:1
> 
> I measured packet Rx rate (regardless of packet loss) in a Bidirectional test
> with  64B UDP packets.
> 
> I saw the following performance improvement
> 
> Orig:  11.30, 11.54 Mpps
> Orig + patch:  11.70, 11.76 Mpps
> 
>  lib/dpif-netdev.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 98e7765..123e04a
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4572,8 +4572,7 @@ static inline size_t  emc_processing(struct
> dp_netdev_pmd_thread *pmd,
> struct dp_packet_batch *packets_,
> struct netdev_flow_key *keys,
> -   struct packet_batch_per_flow batches[], size_t *n_batches,
> -   bool md_is_valid, odp_port_t port_no)
> +   struct packet_batch_per_flow batches[], size_t
> + *n_batches)
>  {
>  struct emc_cache *flow_cache = &pmd->flow_cache;
>  struct netdev_flow_key *key = &keys[0]; @@ -4601,9 +4600,6 @@
> emc_processing(struct dp_netdev_pmd_thread *pmd,
>  pkt_metadata_prefetch_init(&packets[i+1]->md);
>  }
> 
> -if (!md_is_valid) {
> -pkt_metadata_init(&packet->md, port_no);
> -}
>  miniflow_extract(packet, &key->mf);
>  key->len = 0; /* Not computed yet. */
>  key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
> @@ -4805,8 +4801,7 @@ fast_path_processing(struct
> dp_netdev_pmd_thread *pmd,
>   * valid, 'md_is_valid' must be true and 'port_no' will be ignored. */  
> static
> void  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
> -  struct dp_packet_batch *packets,
> -  bool md_is_valid, odp_port_t port_no)
> +  struct dp_packet_batch *packets)
>  {
>  int cnt = packets->count;
>  #if !defined(__CHECKER__) && !defined(_WIN32) @@ -4823,8 +4818,7 @@
> dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>  odp_port_t in_port;
> 
>  n_batches = 0;
> -emc_processing(pmd, packets, keys, batches, &n_batches,
> -md_is_valid, port_no);
> +emc_processing(pmd, packets, keys, batches, &n_batches);
>  if (!dp_packet_batch_is_empty(packets)) {
>  /* Get ingress port from first packet's metadata. */
>  in_port = packets->packets[0]->md.in_port.odp_port;
> @@ -4856,14 +4850,19 @@ dp_netdev_input(struct
> dp_netdev_pmd_thread *pmd,
>  struct dp_packet_batch *packets,
>  odp_port_t port_no)
>  {
> -dp_netdev_input__(pmd, packets, false, port_no);
> +struct dp_packet *packet;
> +DP_PACKET_BATCH_FOR_EACH (packet, packets) {
> +pkt_metadata_init(&packet->md, port_no);
> +}
> +
> +dp_netdev_input__(pmd, packets);
>  }
> 
>  static void
>  dp_netdev_recirculate(struct dp_netdev_pmd_thread *pmd,
>struct dp_packet_batch *packets)  {
> -dp_netdev_input__(pmd, packets, true, 0);
> +dp_netdev_input__(pmd, packets);
>  }
> 
>  struct dp_netdev_execute_aux {
> --
> 2.4.11
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 1/4] dpif-netdev: Avoid reading RSS hash when EMC is disabled.

2017-07-26 Thread O Mahony, Billy
Hi Antonio,

Ok, I guess you can't argue with performance! I look forward to the next rev.

No further comments below.

/Billy

> -Original Message-
> From: Fischetti, Antonio
> Sent: Wednesday, July 19, 2017 4:59 PM
> To: Fischetti, Antonio <antonio.fische...@intel.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH 1/4] dpif-netdev: Avoid reading RSS hash
> when EMC is disabled.
> 
> Hi Billy, your suggestion really simplify the code a lot and improve 
> readability
> but unfortunately there's no gain in performance.
> Anyway in the next version I'm adding some further change and I will try to
> take into account your suggestions.
> 
> /Antonio
> 
> > -Original Message-
> > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> > boun...@openvswitch.org] On Behalf Of Fischetti, Antonio
> > Sent: Friday, June 23, 2017 10:53 PM
> > To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH 1/4] dpif-netdev: Avoid reading RSS hash
> > when EMC is disabled.
> >
> > Hi Billy, thanks for your suggestion, it makes the code more clean and
> > readable.
> > Once I get back from vacation I'll give it a try and check if this
> > still gives a performance benefit.
> >
> > /Antonio
> >
> > > -Original Message-
> > > From: O Mahony, Billy
> > > Sent: Friday, June 23, 2017 5:23 PM
> > > To: Fischetti, Antonio <antonio.fische...@intel.com>;
> > d...@openvswitch.org
> > > Subject: RE: [ovs-dev] [PATCH 1/4] dpif-netdev: Avoid reading RSS
> > > hash when EMC is disabled.
> > >
> > > Hi Antonio,
> > >
> > > > -Original Message-
> > > > From: Fischetti, Antonio
> > > > Sent: Friday, June 23, 2017 3:10 PM
> > > > To: O Mahony, Billy <billy.o.mah...@intel.com>;
> > > > d...@openvswitch.org
> > > > Subject: RE: [ovs-dev] [PATCH 1/4] dpif-netdev: Avoid reading RSS
> > > > hash when EMC is disabled.
> > > >
> > > > Hi Billy,
> > > > thanks a lot for you suggestions. Those would really help
> > > > re-factoring
> > > the
> > > > code by avoiding duplications.
> > > > The thing is that this patch 1/4 is mainly a preparation for the
> > > > next
> > > patch 2/4.
> > > > So I did these changes with the next patch 2/4 in mind.
> > > >
> > > > The final result I meant to achieve in patch 2/4 is the following.
> > > > EMC lookup is skipped - not only when EMC is disabled - but also
> > > > when (we're processing recirculated packets) && (the EMC is 'enough'
> full).
> > > > The purpose is to avoid EMC thrashing.
> > > >
> > > > Below is how the code looks like after applying patches 1/4 and 2/4.
> > > > Please let me know if you can find some similar optimizations to
> > > > avoid
> > > code
> > > > duplications, that would be great.
> > > > 
> > > > /*
> > > >  * EMC lookup is skipped when one or both of the following
> > > >  * two cases occurs:
> > > >  *
> > > >  *   - EMC is disabled.  This is detected from cur_min.
> > > >  *
> > > >  *   - The EMC occupancy exceeds EMC_FULL_THRESHOLD and the
> > > >  * packet to be classified is being recirculated.  When
> > this
> > > >  * happens also EMC insertions are skipped for
> > recirculated
> > > >  * packets.  So that EMC is used just to store entries
> > which
> > > >  * are hit from the 'original' packets.  This way the EMC
> > > >  * thrashing is mitigated with a benefit on performance.
> > > >  */
> > > > if (!md_is_valid) {
> > > > pkt_metadata_init(&packet->md, port_no);
> > > > miniflow_extract(packet, &key->mf);  <== this fn must be
> > > > called after pkt_metadata_init
> > > > /* This is not a recirculated packet. */
> > > > if (OVS_LIKELY(cur_min)) {
> > > > /* EMC is enabled.  We can retrieve the 5-tuple hash
> > > >  * without considering the recirc id. */
> > > > if (OVS_LIK

Re: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert for recirculated packets.

2017-07-26 Thread O Mahony, Billy
Hi Antonio,

> -Original Message-
> From: Fischetti, Antonio
> Sent: Friday, June 23, 2017 10:49 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert
> for recirculated packets.
> 
> Thanks a lot Billy, really appreciate your feedback.
> My replies inline.
> 
> /Antonio
> 
> > -Original Message-
> > From: O Mahony, Billy
> > Sent: Friday, June 23, 2017 6:39 PM
> > To: Fischetti, Antonio <antonio.fische...@intel.com>;
> > d...@openvswitch.org
> > Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC
> > lookup/insert for recirculated packets.
> >
> > Hi Antonio,
> >
> > This is a really interesting patch. Comments inline below.
> >
> > Thanks,
> > /Billy.
> >
> > > -Original Message-
> > > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> > > boun...@openvswitch.org] On Behalf Of antonio.fische...@intel.com
> > > Sent: Monday, June 19, 2017 11:12 AM
> > > To: d...@openvswitch.org
> > > Subject: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC
> > > lookup/insert
> > for
> > > recirculated packets.
> > >
> > > From: Antonio Fischetti <antonio.fische...@intel.com>
> > >
> > > When OVS is configured as a firewall, with thousands of active
> > concurrent
> > > connections, the EMC gets quickly saturated and may come under heavy
> > > thrashing for the reason that original and recirculated packets keep
> > > overwriting existing active EMC entries due to its limited size (8k).
> > >
> > > This thrashing causes the EMC to be less efficient than the dpcls in
> > terms of
> > > lookups and insertions.
> > >
> > > This patch allows to use the EMC efficiently by allowing only the
> > 'original'
> > > packets to hit EMC. All recirculated packets are sent to classifier
> > directly.
> > > An empirical threshold (EMC_FULL_THRESHOLD - of 50%) for EMC
> > > occupancy is set to trigger this logic. By doing so when EMC
> > > utilization exceeds EMC_FULL_THRESHOLD.
> > >  - EMC Insertions are allowed just for original packets. EMC insertion
> > >and look up is skipped for recirculated packets.
> > >  - Recirculated packets are sent to classifier.
> > >
> > > This patch depends on the previous one in this series. It's based on
> > patch
> > > "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show"
> > at:
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.h
> > > tml
> > >
> > > Signed-off-by: Antonio Fischetti <antonio.fische...@intel.com>
> > > Signed-off-by: Bhanuprakash Bodireddy
> > > <bhanuprakash.bodire...@intel.com>
> > > Co-authored-by: Bhanuprakash Bodireddy
> > > <bhanuprakash.bodire...@intel.com>
> > > ---
> > > In our Connection Tracker testbench set up with
> > >
> > >  table=0, priority=1 actions=drop
> > >  table=0, priority=10,arp actions=NORMAL  table=0,
> > priority=100,ct_state=-
> > > trk,ip actions=ct(table=1)  table=1, ct_state=+new+trk,ip,in_port=1
> > > actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1
> > > actions=output:2  table=1, ct_state=+new+trk,ip,in_port=2
> > > actions=drop table=1, ct_state=+est+trk,ip,in_port=2
> > > actions=output:1
> > >
> > > we saw the following performance improvement.
> > >
> > > Measured packet Rx rate (regardless of packet loss). Bidirectional
> > > test
> > with
> > > 64B UDP packets.
> > > Each row is a test with a different number of traffic streams. The
> > traffic
> > > generator is set so that each stream establishes one UDP connection.
> > > Mpps columns reports the Rx rates on the 2 sides.
> > >
> > >  Traffic |    Orig    |     Orig      |  +changes  |   +changes
> > >  Streams |   [Mpps]   | [EMC entries] |   [Mpps]   | [EMC entries]
> > > ---------+------------+---------------+------------+---------------
> > >      10  |  3.4, 3.4  |      20       |  3.4, 3.4  |      20
> > >     100  |  2.6, 2.7  |     200       |  2.6, 2.7  |     201
> > >   1,000  |  2.4, 2.4  |    2009       |  2.4, 2.4  |    1994
> > >   2,000  |  2.2, 2.2  |    3903       |  2.2, 2.2  |    3900
> > >   3,000  |  2.1, 2.1  |    5473       |  2.2, 2.2  |    4798
>

Re: [ovs-dev] [RFC v2 2/4] netdev-dpdk: Apply ingress_sched config to dpdk phy ports

2017-07-12 Thread O Mahony, Billy


> -Original Message-
> From: Darrell Ball [mailto:db...@vmware.com]
> Sent: Tuesday, July 11, 2017 6:49 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: Re: [ovs-dev] [RFC v2 2/4] netdev-dpdk: Apply ingress_sched config
> to dpdk phy ports
> 
> 
> 
> On 7/11/17, 9:58 AM, "ovs-dev-boun...@openvswitch.org on behalf of Billy
> O'Mahony" <ovs-dev-boun...@openvswitch.org on behalf of
> billy.o.mah...@intel.com> wrote:
> 
> Ingress scheduling configuration is given effect by way of Flow Director
> filters. A small subset of the possible ingress scheduling possible is
> implemented in this patch.
> 
> Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> ---
>  include/openvswitch/ofp-parse.h |   3 ++
>  lib/dpif-netdev.c   |   1 +
>  lib/netdev-dpdk.c   | 117
> 
>  3 files changed, 121 insertions(+)
> 
> diff --git a/include/openvswitch/ofp-parse.h b/include/openvswitch/ofp-
> parse.h
> index fc5784e..08d6086 100644
> --- a/include/openvswitch/ofp-parse.h
> +++ b/include/openvswitch/ofp-parse.h
> @@ -37,6 +37,9 @@ struct ofputil_table_mod;
>  struct ofputil_bundle_msg;
>  struct ofputil_tlv_table_mod;
>  struct simap;
> +struct tun_table;
> +struct flow_wildcards;
> +struct ofputil_port_map;
>  enum ofputil_protocol;
> 
>  char *parse_ofp_str(struct ofputil_flow_mod *, int command, const char
> *str_,
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 2f224db..66712c7 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -44,6 +44,7 @@
>  #include "dp-packet.h"
>  #include "dpif.h"
>  #include "dpif-provider.h"
> +#include "netdev-provider.h"
>  #include "dummy.h"
>  #include "fat-rwlock.h"
>  #include "flow.h"
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index d14c381..93556e7 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -33,6 +33,8 @@
>  #include 
>  #include 
> 
> +#include 
> +#include 
>  #include "dirs.h"
>  #include "dp-packet.h"
>  #include "dpdk.h"
> @@ -168,6 +170,10 @@ static const struct rte_eth_conf port_conf = {
>  .txmode = {
>  .mq_mode = ETH_MQ_TX_NONE,
>  },
> +.fdir_conf = {
> +.mode = RTE_FDIR_MODE_PERFECT,
> +},
> +
>  };
> 
>  enum { DPDK_RING_SIZE = 256 };
> @@ -652,6 +658,15 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
> *dev, int n_rxq, int n_txq)
>  int i;
>  struct rte_eth_conf conf = port_conf;
> 
> +/* Ingress scheduling requires ETH_MQ_RX_NONE so limit it to when
> exactly
> + * two rxqs are defined. Otherwise MQ will not work as expected. */
> +if (dev->ingress_sched_str && n_rxq == 2) {
> +conf.rxmode.mq_mode = ETH_MQ_RX_NONE;
> +}
> +else {
> +conf.rxmode.mq_mode = ETH_MQ_RX_RSS;
> +}
> +
>  if (dev->mtu > ETHER_MTU) {
>  conf.rxmode.jumbo_frame = 1;
>  conf.rxmode.max_rx_pkt_len = dev->max_packet_len;
> @@ -752,6 +767,106 @@ dpdk_eth_flow_ctrl_setup(struct netdev_dpdk
> *dev) OVS_REQUIRES(dev->mutex)
>  }
>  }
> 
> +static void
> +dpdk_apply_ingress_scheduling(struct netdev_dpdk *dev, int n_rxq)
> +{
> +if (!dev->ingress_sched_str) {
> +return;
> +}
> +
> +if (n_rxq != 2) {
> +VLOG_ERR("Interface %s: Ingress scheduling config ignored; " \
> + "Requires n_rxq==2.", dev->up.name);
> +}
> +
> +int priority_q_id = n_rxq-1;
> +char *key, *val, *str, *iter;
> +
> +ovs_be32 ip_src, ip_dst;
> +ip_src = ip_dst = 0;
> +
> +uint16_t eth_type, port_src, port_dst;
> +eth_type = port_src = port_dst = 0;
> +uint8_t ip_proto = 0;
> +
> +char *mallocd_str; /* str_to_x returns malloc'd str we'll need to 
> free */
> +/* Parse the configuration into local vars */
> +iter = str = xstrdup(dev->ingress_sched_str);
> +while (ofputil_parse_key_value(&iter, &key, &val)) {
> +if (strcmp(key, "nw_src") == 0 || strcmp(key, "ip_src") == 0) {
> +mallocd_str = str_to_

Re: [ovs-dev] [RFC v2 4/4] docs: Document ingress scheduling feature

2017-07-12 Thread O Mahony, Billy
Hi Darrell,

Thanks for reviewing.

> -Original Message-
> From: Darrell Ball [mailto:db...@vmware.com]
> Sent: Tuesday, July 11, 2017 7:29 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: Re: [ovs-dev] [RFC v2 4/4] docs: Document ingress scheduling
> feature
> 
> 
> 
> On 7/11/17, 9:58 AM, "ovs-dev-boun...@openvswitch.org on behalf of Billy
> O'Mahony" <ovs-dev-boun...@openvswitch.org on behalf of
> billy.o.mah...@intel.com> wrote:
> 
> Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> ---
>  Documentation/howto/dpdk.rst | 31
> +++
>  vswitchd/vswitch.xml | 31 +++
>  2 files changed, 62 insertions(+)
> 
> diff --git a/Documentation/howto/dpdk.rst
> b/Documentation/howto/dpdk.rst
> index 93248b4..07fb97d 100644
> --- a/Documentation/howto/dpdk.rst
> +++ b/Documentation/howto/dpdk.rst
> @@ -188,6 +188,37 @@ respective parameter. To disable the flow control
> at tx side, run::
> 
>  $ ovs-vsctl set Interface dpdk-p0 options:tx-flow-ctrl=false
> 
> +Ingress Scheduling
> +--
> +
> +The ingress scheduling feature is described in general in
> +``ovs-vswitchd.conf.db (5)``.
> +
> +Interfaces of type ``dpdk`` support ingress scheduling only for
> +either ether_type or else a fully specificed combination of src and
> +dst ip address and port numbers for TCP or UDP packets.
> +
> +To prioritize packets for Precision Time Protocol:
> +
> +$ ovs-vsctl set Interface dpdk-p0 \
> +other_config:ingress_sched=eth_type=0x88F7
> +
> +To prioritize UDP packets between specific IP source and destination:
> +
> +$ ovs-vsctl set Interface dpdk-p0 \
> +other_config:ingress_sched=udp,ip_src=1.1.1.1,ip_dst=2.2.2.2,\
> +udp_src=11,udp_dst=22
> +
> +If unsupported ingress scheduling configuration is specified or it cannot
> be
> +applied for any reason a warning message is logged and the Interface
> operates
> +as if no ingress scheduling was configured.
> +
> +Interfaces of type ``dpdkvhostuserclient``, ``dpdkr`` and 
> ``dpdkvhostuser``
> do
> +not support ingress scheduling.
> +
> +Currently only the match fields listed above are supported. No
> wildcarding of
> +fields is supported.
> +
> 
> I had a previous comment about this in Patch 2.
> No wildcarding ?; meaning we need to specify either the 5 tuple exact match
> and ethertype or just ethertype ?
> 

[[BO'M]] What I mean is that saying something like
nw_src=10.1.0.0/255.255.0.0
is not supported by netdev-dpdk devices (currently).

> 
> 
>  pdump
>  -
> 
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 0bf986d..299d725 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2842,6 +2842,37 @@
>
>  
> 
> +
> +  
> +   Packets matching the ingress_sched value are prioritized. This 
> means
> +   some combination of:
> +  
> +  
> +
> + prioritized packets are forwarded to their destination port 
> before
> + non-prioritized
> +
> +
> + prioritized packets are less likely to be dropped in an 
> overloaded
> + situation than non-prioritized packets
> +
> +  
> +  
> +   Ingress scheduling is supported with the best effort of the 
> Interface.
> +   It may be dependent on the interface type and its supporting
> +   implementation devices. Different interface types may have 
> different
> +   levels of support for the feature and the same interface type 
> attached
> +   to different devices (physical NICs or vhost ports, device driver,
> +   NIC model) may also offer different levels of support.
> +  
> +  
> +
> + The format of the ingress_sched field is specified in 
> ovs-fields(7) in
> + the ``Matching'' and ``FIELD REFERENCE'' sections.
> +
> +  
> +
> +
>  
>
>  BFD, defined in RFC 5880 and RFC 5881, allows point-to-point
> --
> 2.7.4
> 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [RFC PATCH 3/3] dpif-netdev: Add rxq prioritization

2017-07-11 Thread O Mahony, Billy
Hi Ben,

Thanks for the encouraging feedback.

I've omitted documentation and a few other things from the RFC but am working 
on those currently for a v1 patch, which should arrive next week.

Regards,
Billy. 

> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Monday, July 10, 2017 7:21 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>
> Cc: d...@openvswitch.org
> Subject: Re: [ovs-dev] [RFC PATCH 3/3] dpif-netdev: Add rxq prioritization
> 
> On Fri, Jun 16, 2017 at 05:00:48PM +0100, Billy O'Mahony wrote:
> > If an rxq is marked as 'prioritized' then keep reading from this queue
> > until there are no packets available. Only then proceed to other queues.
> >
> > Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> 
> This idea seems like it can actually work well, unlike the software-only
> solutions I've seen proposed before.  I haven't reviewed the details but the
> idea here certainly seems sound.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-local numa node.

2017-07-10 Thread O Mahony, Billy
Thanks everybody. Will post v10 after lunch.

/Billy

> -Original Message-
> From: Stokes, Ian
> Sent: Monday, July 10, 2017 12:02 PM
> To: Ilya Maximets <i.maxim...@samsung.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; d...@openvswitch.org
> Cc: db...@vmare.com
> Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-
> local numa node.
> 
> > On 10.07.2017 13:42, O Mahony, Billy wrote:
> > >
> > >
> > >> -Original Message-
> > >> From: Stokes, Ian
> > >> Sent: Monday, July 10, 2017 10:41 AM
> > >> To: Ilya Maximets <i.maxim...@samsung.com>; O Mahony, Billy
> > >> <billy.o.mah...@intel.com>; d...@openvswitch.org
> > >> Cc: db...@vmare.com
> > >> Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds
> > >> on non- local numa node.
> > >>
> > >>> On 08.07.2017 22:09, Stokes, Ian wrote:
> > >>>>> Previously if there is no available (non-isolated) pmd on the
> > >>>>> numa node for a port then the port is not polled at all. This
> > >>>>> can result in a non- operational system until such time as nics
> > >>>>> are physically repositioned. It is preferable to operate with a
> > >>>>> pmd on
> > the 'wrong'
> > >>>>> numa node albeit with lower performance. Local pmds are still
> > >>>>> chosen
> > >>> when available.
> > >>>>>
> > >>>>> Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> > >>>>> Signed-off-by: Ilya Maximets <i.maxim...@samsung.com>
> > >>>>> Co-authored-by: Ilya Maximets <i.maxim...@samsung.com>
> > >>>>> ---
> > >>>>> v9: v8 missed some comments on v7
> > >>>>> v8: Some coding style issues; doc tweak
> > >>>>> v7: Incorporate review comments on docs and implementation
> > >>>>> v6: Change 'port' to 'queue' in a warning msg
> > >>>>> v5: Fix warning msg; Update same in docs
> > >>>>> v4: Fix a checkpatch error
> > >>>>> v3: Fix warning messages not appearing when using multiqueue
> > >>>>> v2: Add details of warning messages into docs
> > >>>>>
> > >>>>>  Documentation/intro/install/dpdk.rst | 21 +++---
> > >>>>>  lib/dpif-netdev.c| 41
> > >>>>> +---
> > >>>>>  2 files changed, 56 insertions(+), 6 deletions(-)
> > >>>>>
> > >>>>> diff --git a/Documentation/intro/install/dpdk.rst
> > >>>>> b/Documentation/intro/install/dpdk.rst
> > >>>>> index e83f852..89775d6 100644
> > >>>>> --- a/Documentation/intro/install/dpdk.rst
> > >>>>> +++ b/Documentation/intro/install/dpdk.rst
> > >>>>> @@ -449,7 +449,7 @@ affinitized accordingly.
> > >>>>>
> > >>>>>A poll mode driver (pmd) thread handles the I/O of all DPDK
> > >>> interfaces
> > >>>>>assigned to it. A pmd thread shall poll the ports for
> > >>>>> incoming packets,
> > >>>>> -  switch the packets and send to tx port.  pmd thread is CPU
> > >>>>> bound, and needs
> > >>>>> +  switch the packets and send to tx port.  A pmd thread is CPU
> > >>>>> + bound, and needs
> > >>>>>to be affinitized to isolated cores for optimum performance.
> > >>>>>
> > >>>>>By setting a bit in the mask, a pmd thread is created and
> > >>>>> pinned to the @@ -458,8 +458,23 @@ affinitized accordingly.
> > >>>>>$ ovs-vsctl set Open_vSwitch .
> > >>>>> other_config:pmd-cpu-mask=0x4
> > >>>>>
> > >>>>>.. note::
> > >>>>> -pmd thread on a NUMA node is only created if there is at least
> > one
> > >>>>> DPDK
> > >>>>> -interface from that NUMA node added to OVS.
> > >>>>> +A pmd thread on a NUMA node is only created if there is at
> > >>>>> + least one
> > >>>>> DPDK
> > >>>>> +interface from that NUMA node added to OVS. 

Re: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-local numa node.

2017-07-10 Thread O Mahony, Billy


> -Original Message-
> From: Stokes, Ian
> Sent: Monday, July 10, 2017 10:41 AM
> To: Ilya Maximets <i.maxim...@samsung.com>; O Mahony, Billy
> <billy.o.mah...@intel.com>; d...@openvswitch.org
> Cc: db...@vmare.com
> Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-
> local numa node.
> 
> > On 08.07.2017 22:09, Stokes, Ian wrote:
> > >> Previously if there is no available (non-isolated) pmd on the numa
> > >> node for a port then the port is not polled at all. This can result
> > >> in a non- operational system until such time as nics are physically
> > >> repositioned. It is preferable to operate with a pmd on the 'wrong'
> > >> numa node albeit with lower performance. Local pmds are still
> > >> chosen
> > when available.
> > >>
> > >> Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> > >> Signed-off-by: Ilya Maximets <i.maxim...@samsung.com>
> > >> Co-authored-by: Ilya Maximets <i.maxim...@samsung.com>
> > >> ---
> > >> v9: v8 missed some comments on v7
> > >> v8: Some coding style issues; doc tweak
> > >> v7: Incorporate review comments on docs and implementation
> > >> v6: Change 'port' to 'queue' in a warning msg
> > >> v5: Fix warning msg; Update same in docs
> > >> v4: Fix a checkpatch error
> > >> v3: Fix warning messages not appearing when using multiqueue
> > >> v2: Add details of warning messages into docs
> > >>
> > >>  Documentation/intro/install/dpdk.rst | 21 +++---
> > >>  lib/dpif-netdev.c| 41
> > >> +---
> > >>  2 files changed, 56 insertions(+), 6 deletions(-)
> > >>
> > >> diff --git a/Documentation/intro/install/dpdk.rst
> > >> b/Documentation/intro/install/dpdk.rst
> > >> index e83f852..89775d6 100644
> > >> --- a/Documentation/intro/install/dpdk.rst
> > >> +++ b/Documentation/intro/install/dpdk.rst
> > >> @@ -449,7 +449,7 @@ affinitized accordingly.
> > >>
> > >>A poll mode driver (pmd) thread handles the I/O of all DPDK
> > interfaces
> > >>assigned to it. A pmd thread shall poll the ports for incoming
> > >> packets,
> > >> -  switch the packets and send to tx port.  pmd thread is CPU
> > >> bound, and needs
> > >> +  switch the packets and send to tx port.  A pmd thread is CPU
> > >> + bound, and needs
> > >>to be affinitized to isolated cores for optimum performance.
> > >>
> > >>By setting a bit in the mask, a pmd thread is created and pinned
> > >> to the @@ -458,8 +458,23 @@ affinitized accordingly.
> > >>$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
> > >>
> > >>.. note::
> > >> -pmd thread on a NUMA node is only created if there is at least one
> > >> DPDK
> > >> -interface from that NUMA node added to OVS.
> > >> +A pmd thread on a NUMA node is only created if there is at
> > >> + least one
> > >> DPDK
> > >> +interface from that NUMA node added to OVS.  A pmd thread is
> > >> + created
> > >> by
> > >> +default on a core of a NUMA node or when a specified
> > >> + pmd-cpu-mask
> > has
> > >> +indicated so.  Even though a PMD thread may exist, the thread
> > >> + only
> > >> starts
> > >> +consuming CPU cycles if there is least one receive queue
> > >> + assigned
> > to
> > >> +the pmd.
> > >> +
> > >> +  .. note::
> > >> +On NUMA systems PCI devices are also local to a NUMA node.
> > >> + Unbound
> > >> rx
> > >> +queues for a PCI device will assigned to a pmd on it's local
> > >> + NUMA
> > >
> > > Minor point but should read 'will be assigned'

[[BO'M]] 
+1

> > >> node if a
> > >> +non-isolated PMD exists on that NUMA node.  If not, the queue
> > >> + will
> > be
> > >> +assigned to a non-isolated pmd on a remote NUMA node.  This
> > >> + will
> > >> result in
> > >> +reduced maximum throughput on that device and possibly on
> > >> + other
> > >> devices
> > >> +assigned to that pmd thread. In the case such,

Re: [ovs-dev] [PATCH 4/4] dp-packet: Use memcpy to copy dp_packet fields.

2017-06-28 Thread O Mahony, Billy
Hi Antonio,


> -Original Message-
> From: Fischetti, Antonio
> Sent: Friday, June 23, 2017 11:06 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH 4/4] dp-packet: Use memcpy to copy
> dp_packet fields.
> 
> Hi Billy, thanks for your review.
> Replies inline.
> 
> /Antonio
> 
> > -Original Message-
> > From: O Mahony, Billy
> > Sent: Friday, June 23, 2017 2:27 PM
> > To: Fischetti, Antonio <antonio.fische...@intel.com>;
> > d...@openvswitch.org
> > Subject: RE: [ovs-dev] [PATCH 4/4] dp-packet: Use memcpy to copy
> > dp_packet fields.
> >
> > Hi Antonio,
> >
> > I'm not sure that this approach will work. Mainly the memcpy will not
> > take account of any padding that the compiler may insert in the
> > struct.  Maybe if the struct was defined as packed it could fix this 
> > objection.
> 
> [AF] This patch is somewhat similar to the patch for initializing the
> pkt_metadata struct in pkt_metadata_init() at
> http://patchwork.ozlabs.org/patch/779696/
> 
> 
> >
> > Also anyone editing the structure in future would have to be aware that
> > these elements need to be kept contiguous in order for packet_clone to
> > keep working.
> 
> [AF] Agree, I should add a comment locally to the struct definition.
> 
> >
> > Or if the relevant fields here were all placed in the their own nested
> > struct then sizeof that nested struct could be used in the memcpy call as
> > the sizeof nested_struct would account for whatever padding the compiler
> > inserted.

 [[BO'M]] I think at a minimum this nesting of structures would have to be 
done. Just adding a comment won't address the issue with compiler padding, 
which could change based on compiler, compiler version, compiler flags, target 
architecture, etc.
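
Something like the below is what I have in mind - just a sketch, the field
names are placeholders and not the real struct dp_packet members:

#include <stdint.h>
#include <string.h>

/* Placeholder fields - NOT the real dp_packet layout.  The point is only
 * that all the members to be cloned live in one nested struct. */
struct dp_packet_copy_fields {
    uint32_t field_a;
    uint16_t field_b;
    uint16_t field_c;
};

struct dp_packet_example {
    void *base_;                          /* not copied */
    struct dp_packet_copy_fields fields;  /* copied as one block */
};

void
copy_packet_fields(struct dp_packet_example *dst,
                   const struct dp_packet_example *src)
{
    /* sizeof the nested struct already accounts for any padding the
     * compiler inserts, so adding a member inside it cannot break the
     * copy. */
    memcpy(&dst->fields, &src->fields, sizeof src->fields);
}

Then anyone adding a field to dp_packet only has to decide whether it belongs
inside or outside the nested struct.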

> >
> > Does this change give much of a performance increase?
> 
> I tested this while working on connection tracker. I was using this particular
> set
> of flows - see 4th line - with
> "action=ct(table=1),NORMAL";
> to trigger a call to dp_packet_clone_with_headroom():
> 
> ovs-ofctl del-flows br0;
> ovs-ofctl add-flow br0 table=0,priority=1,action=drop;
> ovs-ofctl add-flow br0 table=0,priority=10,arp,action=normal;
> ovs-ofctl add-flow br0 table=0,priority=100,ip,ct_state=-
> trk,"action=ct(table=1),NORMAL";
> ovs-ofctl add-flow br0
> table=1,in_port=1,ip,ct_state=+trk+new,"action=ct(commit),2";
> ovs-ofctl add-flow br0 table=1,in_port=1,ip,ct_state=+trk+est,"action=2";
> ovs-ofctl add-flow br0
> table=1,in_port=2,ip,ct_state=+trk+new,"action=drop";
> ovs-ofctl add-flow br0 table=1,in_port=2,ip,ct_state=+trk+est,"action=1";
> 
> After running a Hotspot analysis with VTune for 60 secs I had in the original
> that dp_packet_clone_with_headroom was ranked at the 2nd place:
> Function   CPU Time
> -+---
> __libc_malloc5.880s
> dp_packet_clone_with_headroom4.530s
> emc_lookup   4.050s
> free 3.500s
> pthread_mutex_unlock 2.890s
> ...
> 
> Instead after this change the same fn was consuming less cpu cycles:
> Function   CPU Time
> -+---
> __libc_malloc5.900s
> emc_lookup   4.070s
> free 4.010s
> dp_packet_clone_with_headroom3.920s
> pthread_mutex_unlock 3.060s
> 
> 

[[BO'M]] 
So we see a 0.5s saving (out of the 60s test) on the time spent in the changed 
function - almost 1%. However, there is also a change of 0.5s in the usage 
associated with the free() function which we'd expect to not change with this 
patch. So hopefully these changes are not just related to sampling error and 
remain stable across invocations or across longer invocations. Is there any 
change in the cycles per packet statistic with this change? That could be a 
more reliable metric than vtune, which will have some sampling error involved.
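
For the cycles per packet figure something like this should do (clear the
counters, run the 60s test, then read them back; exact output format may
differ between OVS versions):

    $ ovs-appctl dpif-netdev/pmd-stats-clear
    $ ovs-appctl dpif-netdev/pmd-stats-show

pmd-stats-show reports an avg cycles per packet figure per pmd, which should
be less noisy than a sampling profiler.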

Also is dp_packet_clone_with_headroom hit for all packets or is it just used 
when conntrk is enabled? 

> 
> 
> >
> > Also I commented on 1/4 of this patchset about a cover letter - but if the
> > patchset members are independent of each other then maybe they should
> just
> > be separate patches.
> 
> [AF] I grouped these patches together because they all would be some
> optimizations on performance with a focus mainly on conntracker usecase.
> Maybe a better choice was to split them in separate patches.
> 
> 

Re: [ovs-dev] [PATCH v7] dpif-netdev: Assign ports to pmds on non-local numa node.

2017-06-27 Thread O Mahony, Billy
I'll give Darrell a chance to comment before rev'ing. 

> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Tuesday, June 27, 2017 5:11 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Cc: dlu...@gmail.com
> Subject: Re: [PATCH v7] dpif-netdev: Assign ports to pmds on non-local numa
> node.
> 
> On 27.06.2017 18:46, Billy O'Mahony wrote:
> > Previously if there is no available (non-isolated) pmd on the numa
> > node for a port then the port is not polled at all. This can result in
> > a non-operational system until such time as nics are physically
> > repositioned. It is preferable to operate with a pmd on the 'wrong'
> > numa node albeit with lower performance. Local pmds are still chosen
> > when available.
> >
> > Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> > Signed-off-by: Ilya Maximets <i.maxim...@samsung.com>
> > Co-authored-by: Ilya Maximets <i.maxim...@samsung.com>
> > ---
> > v7: Incorporate review comments on docs and implementation
> > v6: Change 'port' to 'queue' in a warning msg
> > v5: Fix warning msg; Update same in docs
> > v4: Fix a checkpatch error
> > v3: Fix warning messages not appearing when using multiqueue
> > v2: Add details of warning messages into docs
> >
> >  Documentation/intro/install/dpdk.rst | 18 +---
> >  lib/dpif-netdev.c| 42 
> > --
> --
> >  2 files changed, 53 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/intro/install/dpdk.rst
> > b/Documentation/intro/install/dpdk.rst
> > index e83f852..a760fb6 100644
> > --- a/Documentation/intro/install/dpdk.rst
> > +++ b/Documentation/intro/install/dpdk.rst
> > @@ -449,7 +449,7 @@ affinitized accordingly.
> >
> >A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
> >assigned to it. A pmd thread shall poll the ports for incoming
> > packets,
> > -  switch the packets and send to tx port.  pmd thread is CPU bound,
> > and needs
> > +  switch the packets and send to tx port.  A pmd thread is CPU bound,
> > + and needs
> >to be affinitized to isolated cores for optimum performance.
> >
> >By setting a bit in the mask, a pmd thread is created and pinned to
> > the @@ -458,8 +458,20 @@ affinitized accordingly.
> >$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
> >
> >.. note::
> > -pmd thread on a NUMA node is only created if there is at least one
> DPDK
> > -interface from that NUMA node added to OVS.
> > +While pmd threads are created based on pmd-cpu-mask, the thread
> only starts
> > +consuming CPU cycles if there is least one receive queue assigned to
> the
> > +pmd.
> > +
> > +  .. note::
> > +
> > +On NUMA systems PCI devices are also local to a NUMA node.
> Unbound Rx
> > +queues for PCI device will assigned to a pmd on it's local NUMA node if
> > +pmd-cpu-mask has created a pmd thread on that NUMA node.  If not
> the queue
> > +will be assigned to a pmd on a remote NUMA node.  This will result in
> > +reduced maximum throughput on that device.
> 
> And possibly on other devices assigned to that pmd thread.
> 
> >   In case such a queue assignment
> > +is made a warning message will be logged: "There's no available (non-
> > +isolated) pmd thread on numa node N. Queue Q on port P will be
> assigned to
> > +the pmd on core C (numa node N'). Expect reduced performance."
> >
> >  - QEMU vCPU thread Affinity
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > 4e29085..38a0fd3 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list *rr, int
> numa_id)
> >  return NULL;
> >  }
> >
> > +/* Returns next NUMA from rr list in round-robin fashion. Returns the first
> > + * NUMA node if 'NULL' or the last node passed, and 'NULL' if list is
> > + * empty. */
> > +static struct rr_numa *
> > +rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa)
> > +{
> > +    struct hmap_node *node = NULL;
> > +
> > +    if (numa) {
> > +        node = hmap_next(&rr->numas, &numa->node);
> > +    }
> > +    if (!node) {
> > +        node = hmap_first(&rr->numas);
> > +    }
> > +
> > +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
> > +}
> 

Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on non-local numa node.

2017-06-26 Thread O Mahony, Billy
No problem, Darrell.

My network is acting up at the moment so I'll resubmit  the documentation 
section tomorrow. 

/Billy

> -Original Message-
> From: Darrell Ball [mailto:db...@vmware.com]
> Sent: Monday, June 26, 2017 5:08 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on non-
> local numa node.
> 
> For the last sentence:
> 
> I think
> “In the case such, a queue assignment is made”
> was meant to be
> “In case such a queue assignment is made”
> 
> 
> On 6/26/17, 8:57 AM, "O Mahony, Billy" <billy.o.mah...@intel.com> wrote:
> 
> Sounds good. I'll incorporate those changes.
> 
> > -Original Message-
> > From: Darrell Ball [mailto:db...@vmware.com]
> > Sent: Monday, June 26, 2017 4:53 PM
> > To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on
> non-
> > local numa node.
> >
> >
> >
> > On 6/26/17, 6:52 AM, "O Mahony, Billy" <billy.o.mah...@intel.com>
> wrote:
> >
> > Hi Darrell,
> >
> >
> >
> > Thanks for reviewing.
> >
>     >
> >
> > > -Original Message-
> >
> > > From: Darrell Ball [mailto:db...@vmware.com]
> >
> > > Sent: Monday, June 26, 2017 8:04 AM
> >
> > > To: O Mahony, Billy <billy.o.mah...@intel.com>;
> d...@openvswitch.org
> >
> > > Subject: Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to 
> pmds
> on
> > non-
> >
> > > local numa node.
> >
> > >
> >
> > > I think this is helpful in some cases where lower performance is 
> an
> >
> > > acceptable tradeoff with more frugal and/or more flexible usage of
> cpu
> >
> > > resources.
> >
> > >
> >
> > > I did not test it since Ian has already done that, but I reviewed 
> the
> code
> >
> > > change and other related code.
> >
> > >
> >
> > > One comment inline regarding the added documentation.
> >
> > >
> >
> > > Otherwise, Acked-by: Darrell Ball <dlu...@gmail.com>
> >
> > >
> >
> > >
> >
> > >
> >
> > > On 5/10/17, 8:59 AM, "ovs-dev-boun...@openvswitch.org on behalf
> of
> > Billy
> >
> > > O'Mahony" <ovs-dev-boun...@openvswitch.org on behalf of
> >
> > > billy.o.mah...@intel.com> wrote:
> >
> > >
> >
> > > From: billyom <billy.o.mah...@intel.com>
> >
> > >
> >
> > > Previously if there is no available (non-isolated) pmd on the 
> numa
> node
> >
> > > for a port then the port is not polled at all. This can 
> result in a
> >
> > > non-operational system until such time as nics are physically
> >
> > > repositioned. It is preferable to operate with a pmd on the 
> 'wrong'
> > numa
> >
> > > node albeit with lower performance. Local pmds are still 
> chosen
> when
> >
> > > available.
> >
> > >
> >
> > > Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> >
> > > ---
> >
> > > v6: Change 'port' to 'queue' in a warning msg
> >
> > > v5: Fix warning msg; Update same in docs
> >
> > > v4: Fix a checkpatch error
> >
> > > v3: Fix warning messages not appearing when using multiqueue
> >
> > > v2: Add details of warning messages into docs
> >
> > >
> >
> > >  Documentation/intro/install/dpdk.rst | 10 +
> >
> > >  lib/dpif-netdev.c| 43
> > +++---
> >
> > > --
> >
> > >  2 files changed, 48 insertions(+), 5 d

Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on non-local numa node.

2017-06-26 Thread O Mahony, Billy
Sounds good. I'll incorporate those changes. 

> -Original Message-
> From: Darrell Ball [mailto:db...@vmware.com]
> Sent: Monday, June 26, 2017 4:53 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on non-
> local numa node.
> 
> 
> 
> On 6/26/17, 6:52 AM, "O Mahony, Billy" <billy.o.mah...@intel.com> wrote:
> 
> Hi Darrell,
> 
> 
> 
> Thanks for reviewing.
> 
> 
> 
> > -Original Message-
> 
> > From: Darrell Ball [mailto:db...@vmware.com]
> 
> > Sent: Monday, June 26, 2017 8:04 AM
> 
> > To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> 
> > Subject: Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on
> non-
> 
> > local numa node.
> 
> >
> 
> > I think this is helpful in some cases where lower performance is an
> 
> > acceptable tradeoff with more frugal and/or more flexible usage of cpu
> 
> > resources.
> 
> >
> 
> > I did not test it since Ian has already done that, but I reviewed the 
> code
> 
> > change and other related code.
> 
> >
> 
> > One comment inline regarding the added documentation.
> 
> >
> 
> > Otherwise, Acked-by: Darrell Ball <dlu...@gmail.com>
> 
> >
> 
> >
> 
> >
> 
> > On 5/10/17, 8:59 AM, "ovs-dev-boun...@openvswitch.org on behalf of
> Billy
> 
> > O'Mahony" <ovs-dev-boun...@openvswitch.org on behalf of
> 
> > billy.o.mah...@intel.com> wrote:
> 
> >
> 
> > From: billyom <billy.o.mah...@intel.com>
> 
> >
> 
> > Previously if there is no available (non-isolated) pmd on the numa 
> node
> 
> > for a port then the port is not polled at all. This can result in a
> 
> > non-operational system until such time as nics are physically
> 
> > repositioned. It is preferable to operate with a pmd on the 'wrong'
> numa
> 
> > node albeit with lower performance. Local pmds are still chosen when
> 
> > available.
> 
> >
> 
> > Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> 
> > ---
> 
> > v6: Change 'port' to 'queue' in a warning msg
> 
> > v5: Fix warning msg; Update same in docs
> 
> > v4: Fix a checkpatch error
> 
> > v3: Fix warning messages not appearing when using multiqueue
> 
> > v2: Add details of warning messages into docs
> 
> >
> 
> >  Documentation/intro/install/dpdk.rst | 10 +
> 
> >  lib/dpif-netdev.c| 43
> +++---
> 
> > --
> 
> >  2 files changed, 48 insertions(+), 5 deletions(-)
> 
> >
> 
> > diff --git a/Documentation/intro/install/dpdk.rst
> 
> > b/Documentation/intro/install/dpdk.rst
> 
> > index d1c0e65..7a66bff 100644
> 
> > --- a/Documentation/intro/install/dpdk.rst
> 
> > +++ b/Documentation/intro/install/dpdk.rst
> 
> > @@ -460,6 +460,16 @@ affinitized accordingly.
> 
> >  pmd thread on a NUMA node is only created if there is at least 
> one
> DPDK
> 
> >  interface from that NUMA node added to OVS.
> 
> >
> 
> > +  .. note::
> 
> > +   On NUMA systems PCI devices are also local to a NUMA node.  Rx
> 
> > queues for
> 
> > +   PCI device will assigned to a pmd on it's local NUMA node if 
> pmd-
> cpu-
> 
> > mask
> 
> > +   has created a pmd thread on that NUMA node.  If not the queue 
> will
> be
> 
> > +   assigned to a pmd on a remote NUMA node.
> 
> >
> 
> >
> 
> >
> 
> > I think the below text is a bit more accurate
> 
> > +  .. note::
> 
> > +   On NUMA systems PCI devices are also local to a NUMA node.  Rx
> queues
> 
> > for
> 
> > +   a PCI device will be assigned to a pmd on it's local NUMA node.
> 
> 
> 
> [[BO'M]]
> 
> Re suggested sentence: "Rx queues for a PCI device will be assigned to a
> pmd on it's local NUMA node."
> 
> However, with this patch that is no longer always the case. Assignment to 
> a
> pmd on it's loc

Re: [ovs-dev] multiple PMD threads + multiple dpdk ports scenario

2017-06-26 Thread O Mahony, Billy
Hi,

It is receive queues that are associated with a PMD. Each port can have one or 
more receive queues, so rxq1 and rxq2 on the same port could be handled by 
different PMDs. 

So when a packet is placed on a txq it is placed there by the PMD that 
originally received the packet. 

The number of txqs for each port is set so that there is at least one txq for 
each PMD plus one txq for the main vswitchd thread. This means that there is no 
locking required for two PMDs that are sending packets on the same port, as 
they are using different transmit queues. 
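
For example, with a two-rxq dpdk port you can see the rxq-to-pmd mapping like
this (output shown from memory, the exact format differs a little between
versions):

    $ ovs-vsctl set Interface dpdk0 options:n_rxq=2
    $ ovs-appctl dpif-netdev/pmd-rxq-show
    pmd thread numa_id 0 core_id 2:
            port: dpdk0     queue-id: 0
    pmd thread numa_id 0 core_id 3:
            port: dpdk0     queue-id: 1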

For future reference I believe questions like this are preferred to be sent to 
ovs-discuss ML. 

Hope that helps,

Billy. 


> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of Joo Kim
> Sent: Wednesday, June 7, 2017 9:11 AM
> To: ovs-dev@openvswitch.org
> Subject: [ovs-dev] multiple PMD threads + multiple dpdk ports scenario
> 
> Hello,
> 
> In userspace dpdk OVS,  as I understand, each PMD thread can poll packets
> from multiple dpdk ports the thread owns.
> 
> Then, what about transmit?
> Suppose 2 PMD threads are running,  if  route lookup in 1st PMD thread
> results in a dpdk port (as outgoing port) which 2nd PMD thread owns, then
> how does the 1st PMD thread sends the packet over the port the 2nd PMD
> thread owns?
> From a concurrency perspective, can the 1st PMD thread safely call the
> transmit API on a port that the 2nd PMD thread owns?
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on non-local numa node.

2017-06-26 Thread O Mahony, Billy
Hi Ilya,

Thanks for further reviewing this patch, which we had previously discussed in 
February 
https://mail.openvswitch.org/pipermail/ovs-dev/2017-February/329182.html. 

You are suggesting adding an iterator function to identify the non-isolated 
cores as the assignment progresses and not building a full list up front. That 
would be much cleaner. I'll have a look and use that idea if it's suitable (I 
think it will be).

Some more comments below.

Regards,
/Billy.

> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Monday, June 26, 2017 2:49 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Cc: Stokes, Ian <ian.sto...@intel.com>; 'Jan Scheurich'
> <jan.scheur...@ericsson.com>; Darrell Ball <db...@vmware.com>
> Subject: Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on non-
> local numa node.
> 
> I don't like the implementation. The bunch of these all_numa_ids* variables
> looks completely unreadable.
> 'all_numa_ids [64];' contains same numa_ids and will be overwritten from
> the start if we have more than 64 PMD threads. => possible broken logic and
> mapping of all the non-local ports to the same node.

 [[BO'M]] 
The wrap-around of all_numa_ids is there to assign to non-local numa nodes in 
a fair fashion. So if one non-local numa node has many PMDs and another has 
few, the queues with no local PMD are mainly assigned to the former rather 
than the latter. 

As thread counts increase over coming years, or on pure packet-switch devices 
(with many PMDs), and depending on precisely how rr_numa_list is populated 
(i.e. whether it is populated in numa id order), a limited (i.e. wrapped-around) 
list of alternative threads does mean there could be situations where the 
allocation of queues to non-local PMDs is unfair. Though nothing as drastic as 
broken logic. 

> 
> Also, the main concern is that  we already have all the required information
> about NUMA nodes in 'rr_numa_list rr'. All we need is to properly iterate
> over it.
> 
> 
> How about something like this (not fully tested):
> 
> >--<
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 4e29085..d17d7e4
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list *rr, int
> numa_id)
>  return NULL;
>  }
> 
> +/* Returns next NUMA from rr list in round-robin fashion. Returns the first
> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if list is
> + * empty. */
> +static struct rr_numa *
> +rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa)
> +{
> +    struct hmap_node *node = NULL;
> +
> +    if (numa) {
> +        node = hmap_next(&rr->numas, &numa->node);
> +    }
> +    if (!node) {
> +        node = hmap_first(&rr->numas);
> +    }
> +
> +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
> +}
> +
>  static void
>  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)  {
> @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp, bool
> pinned) OVS_REQUIRES(dp->port_mutex)  {
>  struct dp_netdev_port *port;
>  struct rr_numa_list rr;
> +struct rr_numa *last_used_nonlocal_numa = NULL;
> 
>  rr_numa_list_populate(dp, &rr);
> 
> @@ -3281,10 +3299,26 @@ rxq_scheduling(struct dp_netdev *dp, bool
> pinned) OVS_REQUIRES(dp->port_mutex)
>  }
>  } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
>  if (!numa) {
> -VLOG_WARN("There's no available (non isolated) pmd 
> thread "
> +numa = rr_numa_list_next(&rr, last_used_nonlocal_numa);
> +if (!numa) {
> +VLOG_ERR("There is no available (non-isolated) pmd "
> + "thread for port \'%s\' queue %d. This 
> queue "
> + "will not be polled. Is pmd-cpu-mask set to 
> "
> + "zero? Or are all PMDs isolated to other "
> + "queues?", netdev_get_name(port->netdev),
> + qid);
> +continue;
> +}
> +
> +q->pmd = rr_numa_get_pmd(numa);
> +VLOG_WARN("There's no available (non-isolated) pmd 
> thread "
>"on numa node %d. Queue %d on port \'%s\' will 
> "
> -  "not be polled.",
> -   

Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on non-local numa node.

2017-06-26 Thread O Mahony, Billy
Hi Darrell,

Thanks for reviewing.

> -Original Message-
> From: Darrell Ball [mailto:db...@vmware.com]
> Sent: Monday, June 26, 2017 8:04 AM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on non-
> local numa node.
> 
> I think this is helpful in some cases where lower performance is an
> acceptable tradeoff with more frugal and/or more flexible usage of cpu
> resources.
> 
> I did not test it since Ian has already done that, but I reviewed the code
> change and other related code.
> 
> One comment inline regarding the added documentation.
> 
> Otherwise, Acked-by: Darrell Ball <dlu...@gmail.com>
> 
> 
> 
> On 5/10/17, 8:59 AM, "ovs-dev-boun...@openvswitch.org on behalf of Billy
> O'Mahony" <ovs-dev-boun...@openvswitch.org on behalf of
> billy.o.mah...@intel.com> wrote:
> 
> From: billyom <billy.o.mah...@intel.com>
> 
> Previously if there is no available (non-isolated) pmd on the numa node
> for a port then the port is not polled at all. This can result in a
> non-operational system until such time as nics are physically
> repositioned. It is preferable to operate with a pmd on the 'wrong' numa
> node albeit with lower performance. Local pmds are still chosen when
> available.
> 
> Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> ---
> v6: Change 'port' to 'queue' in a warning msg
> v5: Fix warning msg; Update same in docs
> v4: Fix a checkpatch error
> v3: Fix warning messages not appearing when using multiqueue
> v2: Add details of warning messages into docs
> 
>  Documentation/intro/install/dpdk.rst | 10 +
>  lib/dpif-netdev.c| 43 
> +++---
> --
>  2 files changed, 48 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/intro/install/dpdk.rst
> b/Documentation/intro/install/dpdk.rst
> index d1c0e65..7a66bff 100644
> --- a/Documentation/intro/install/dpdk.rst
> +++ b/Documentation/intro/install/dpdk.rst
> @@ -460,6 +460,16 @@ affinitized accordingly.
>  pmd thread on a NUMA node is only created if there is at least one 
> DPDK
>  interface from that NUMA node added to OVS.
> 
> +  .. note::
> +   On NUMA systems PCI devices are also local to a NUMA node.  Rx
> queues for
> +   PCI device will assigned to a pmd on it's local NUMA node if pmd-cpu-
> mask
> +   has created a pmd thread on that NUMA node.  If not the queue will be
> +   assigned to a pmd on a remote NUMA node.
> 
> 
> 
> I think the below text is a bit more accurate
> +  .. note::
> +   On NUMA systems PCI devices are also local to a NUMA node.  Rx queues
> for
> +   a PCI device will be assigned to a pmd on it's local NUMA node. 

[[BO'M]] 
Re suggested sentence: "Rx queues for a PCI device will be assigned to a pmd on 
it's local NUMA node."
However, with this patch that is no longer always the case. Assignment to a pmd 
on it's local NUMA node will be preferred but it that is not possible then 
assignment will be done to a PMD on a non-local NUMA node. 

Is the suggested change in order to emphasize that a PMD is created if 
specified in pmd-cpu-mask but not actually scheduled by the OS unless at least 
one rxq is assigned to it? That case is covered in the preceeding paragraph 
(which is not modified by this patch) "A pmd thread on a NUMA node is only 
created if there is at least one DPDK interface from that NUMA node added to 
OVS."  Which, now I read it should be modified to read "A PMD thread only 
becomes runnable if there is at least one DPDK interface assigned to it." 

 A pmd is
> +   created if at least one dpdk interface is added to OVS on that NUMA node
> or
> +   if the pmd-cpu-mask has created a pmd thread on that NUMA node.

 [[BO'M]] 
Is this part of the suggested change in order to emphasize that a PMD is 
created if specified in pmd-cpu-mask but not actually scheduled by the OS 
unless at least one rxq is assigned to it? (introduced 2788a1b). That detail is 
covered somewhat in the preceding paragraph (which is not modified by this 
patch) "A pmd thread on a NUMA node is only created if there is at least one 
DPDK interface from that NUMA node added to OVS."  

Which, now I read it should be modified to actually reflect that. What do you 
think? 

> 
> + This will result in reduced
> +   maximum throughput on that device.  In the case such a queue
> assignment
> +   is made a warning message will be logged: "There's no available (non-
> +   isolated) pmd thread on numa node N. Queu

Re: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert for recirculated packets.

2017-06-23 Thread O Mahony, Billy
Hi Antonio,

This is a really interesting patch. Comments inline below. 

Thanks,
/Billy.

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of antonio.fische...@intel.com
> Sent: Monday, June 19, 2017 11:12 AM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert for
> recirculated packets.
> 
> From: Antonio Fischetti 
> 
> When OVS is configured as a firewall, with thousands of active concurrent
> connections, the EMC gets quickly saturated and may come under heavy
> thrashing because original and recirculated packets keep overwriting
> existing active EMC entries due to its limited size (8k).
> 
> This thrashing causes the EMC to be less efficient than the dpcls in terms of
> lookups and insertions.
> 
> This patch allows to use the EMC efficiently by allowing only the 'original'
> packets to hit EMC. All recirculated packets are sent to classifier directly.
> An empirical threshold (EMC_FULL_THRESHOLD - of 50%) for EMC occupancy
> is set to trigger this logic. By doing so when EMC utilization exceeds
> EMC_FULL_THRESHOLD.
>  - EMC Insertions are allowed just for original packets. EMC insertion
>and look up is skipped for recirculated packets.
>  - Recirculated packets are sent to classifier.
> 
> This patch depends on the previous one in this series. It's based on patch
> "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show" at:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.html
> 
> Signed-off-by: Antonio Fischetti 
> Signed-off-by: Bhanuprakash Bodireddy
> 
> Co-authored-by: Bhanuprakash Bodireddy
> 
> ---
> In our Connection Tracker testbench set up with
> 
>  table=0, priority=1 actions=drop
>  table=0, priority=10,arp actions=NORMAL  table=0, priority=100,ct_state=-
> trk,ip actions=ct(table=1)  table=1, ct_state=+new+trk,ip,in_port=1
> actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1
> actions=output:2  table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
> table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
> 
> we saw the following performance improvement.
> 
> Measured packet Rx rate (regardless of packet loss). Bidirectional test with
> 64B UDP packets.
> Each row is a test with a different number of traffic streams. The traffic
> generator is set so that each stream establishes one UDP connection.
> Mpps columns reports the Rx rates on the 2 sides.
> 
>  Traffic |Orig| Orig  |  +changes  |   +changes
>  Streams |   [Mpps]   | [EMC entries] |   [Mpps]   | [EMC entries]
> -++---++---
>  10  |  3.4, 3.4  |  20   |  3.4, 3.4  |  20
> 100  |  2.6, 2.7  | 200   |  2.6, 2.7  | 201
>   1,000  |  2.4, 2.4  |2009   |  2.4, 2.4  |1994
>   2,000  |  2.2, 2.2  |3903   |  2.2, 2.2  |3900
>   3,000  |  2.1, 2.1  |5473   |  2.2, 2.2  |4798
>   4,000  |  2.0, 2.0  |6478   |  2.2, 2.2  |5663
>  10,000  |  1.8, 1.9  |8070   |  2.0, 2.0  |7347
> 100,000  |  1.7, 1.7  |8192   |  1.8, 1.8  |8192
> 

[[BO'M]] 
A few questions on the test:
Are all the pkts rxd being recirculated?
Are there any flows present where the pkts do not require recirculation? 
Was the rxd rss hash calculation offloaded to the NIC?
For the cases with larger numbers of flows (10K , 100K) did you investigate the 
results when the EMC is simply switched off? 
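
On switching the EMC off - if the probabilistic insertion knob is in the tree
you are testing (I believe it is on current master) then setting the inverse
insertion probability to zero effectively disables the EMC for new flows:

    $ ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0

That would make for a useful baseline for the 10K/100K flow cases.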

Say we have 3000 flows (the lowest figure at which we see a positive effect) 
that means 6000 flows are contending for places in the emc.  
Is the effect we see here to do with disabling recirculated packets in 
particular, or just with reducing contention on the emc in general? I know the 
recirculated pkt hashes require software hashing albeit a small amount so they 
do make a good category of packet to drop from the emc when contention is 
severe.

Once there are too many entries contending for any cache it's going to become 
a liability, as lookup_cost + (miss_rate * cost_of_miss) grows to be greater 
than the cost_of_miss alone. In that case any scheme where a very cheap 
decision can be made to not insert a certain category of packets, and to also 
not check the emc for that same category, will reduce the cache miss rate.
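
To put made-up numbers on that (the cycle counts below are assumptions for
illustration, not measurements):

#include <stdio.h>

/* Average per-packet lookup cost with the EMC in front of the dpcls versus
 * going straight to the dpcls, for a range of EMC miss rates.  The cycle
 * figures are invented for the example. */
int main(void)
{
    const double emc_lookup = 40.0;    /* assumed cost of an EMC probe   */
    const double dpcls_lookup = 300.0; /* assumed cost of a dpcls lookup */

    for (int i = 0; i <= 10; i++) {
        double miss = i / 10.0;
        double with_emc = emc_lookup + miss * dpcls_lookup;
        printf("miss rate %.1f: with EMC %5.0f cycles, dpcls only %5.0f\n",
               miss, with_emc, dpcls_lookup);
    }
    return 0;
}

Once the 'with EMC' column passes the 'dpcls only' one (around a miss rate of
0.87 with these made-up numbers) the cache is a net loss, which is why cheaply
excluding a whole category of packets from both insert and lookup helps.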

But I'm not sure that is_recirculated is the best categorization on which to 
make that decision. Mainly because it is not controllable. In this test case 
50% pkts were recirculated so by ruling these packets out of eligibility for 
the EMC you get a really large reduction in EMC contention. However you can 
never know ahead of time if any packets will be recirc'd. You may have a 
situation where the EMC is totally swamped with 200K flows as above but none of 
them are 

Re: [ovs-dev] [PATCH 1/4] dpif-netdev: Avoid reading RSS hash when EMC is disabled.

2017-06-23 Thread O Mahony, Billy
Hi Antonio,

> -Original Message-
> From: Fischetti, Antonio
> Sent: Friday, June 23, 2017 3:10 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH 1/4] dpif-netdev: Avoid reading RSS hash
> when EMC is disabled.
> 
> Hi Billy,
> thanks a lot for you suggestions. Those would really help re-factoring the
> code by avoiding duplications.
> The thing is that this patch 1/4 is mainly a preparation for the next patch 
> 2/4.
> So I did these changes with the next patch 2/4 in mind.
> 
> The final result I meant to achieve in patch 2/4 is the following.
> EMC lookup is skipped - not only when EMC is disabled - but also when
> (we're processing recirculated packets) && (the EMC is 'enough' full).
> The purpose is to avoid EMC thrashing.
> 
> Below is how the code looks like after applying patches 1/4 and 2/4.
> Please let me know if you can find some similar optimizations to avoid code
> duplications, that would be great.
> 
> /*
>  * EMC lookup is skipped when one or both of the following
>  * two cases occurs:
>  *
>  *   - EMC is disabled.  This is detected from cur_min.
>  *
>  *   - The EMC occupancy exceeds EMC_FULL_THRESHOLD and the
>  * packet to be classified is being recirculated.  When this
>  * happens also EMC insertions are skipped for recirculated
>  * packets.  So that EMC is used just to store entries which
>  * are hit from the 'original' packets.  This way the EMC
>  * thrashing is mitigated with a benefit on performance.
>  */
> if (!md_is_valid) {
>     pkt_metadata_init(&packet->md, port_no);
>     miniflow_extract(packet, &key->mf);  <== this fn must be called after
>                                              pkt_metadata_init
>     /* This is not a recirculated packet. */
>     if (OVS_LIKELY(cur_min)) {
>         /* EMC is enabled.  We can retrieve the 5-tuple hash
>          * without considering the recirc id. */
>         if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
>             key->hash = dp_packet_get_rss_hash(packet);
>         } else {
>             key->hash = miniflow_hash_5tuple(&key->mf, 0);
>             dp_packet_set_rss_hash(packet, key->hash);
>         }
>         flow = emc_lookup(flow_cache, key);
>     } else {
>         /* EMC is disabled, skip emc_lookup. */
>         flow = NULL;
>     }
> } else {
>     /* Recirculated packets. */
>     miniflow_extract(packet, &key->mf);
>     if (flow_cache->n_entries & EMC_FULL_THRESHOLD) {
>         /* EMC occupancy is over the threshold.  We skip EMC
>          * lookup for recirculated packets. */
>         flow = NULL;
>     } else {
>         if (OVS_LIKELY(cur_min)) {
>             key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
>             flow = emc_lookup(flow_cache, key);
>         } else {
>             flow = NULL;
>         }
>     }
> }
> 
> 
> Basically patch 1/4 is mostly a preliminary change for 2/4.
> 
> Yes, patch 1/4 also allows to avoid reading hash when EMC is disabled.
> Or - for packets that are not recirculated - avoids calling
> recirc_depth_get_unsafe() when reading the hash.
> 
> Also, as these functions are critical for performance, I tend to avoid adding
> new Booleans that require new if statements.
[[BO'M]] 

Can you investigate refactoring this patch with something like the below? I 
think it is equivalent. The current patch duplicates miniflow_extract and 
emc_lookup across the md_is_valid and !md_is_valid branches. It also duplicates 
some of the internals of get_rss_hash out into the !md_is_valid case and is 
difficult to follow. 

If the following suggestion works  the change in emc_processing from patch 2/4 
can easily be grafted on to that. 

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 4e29085..a7e854d 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -4442,7 +4442,8 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread *pmd, struct 
dp_packet *packet_,

 static inline uint32_t
 dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
-                                const struct miniflow *mf)
+                                const struct miniflow *mf,
+                                bool use_recirc_depth)
 {
 uint32_t hash, recirc_depth;

@@ -4456,7 +4457,7 @@ dpif_netdev_packe

Re: [ovs-dev] [PATCH 1/4] dpif-netdev: Avoid reading RSS hash when EMC is disabled.

2017-06-23 Thread O Mahony, Billy
Hi Antonio,

In this patch of the patchset there are three lines removed from the direct 
command flow:

-miniflow_extract(packet, &key->mf);
-key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
-flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);

These lines are then replicated in several different logic branches, which is 
a lot of duplication. 

I *think* (I haven't tested it) this can be re-written with less branching like 
this:

 if (!md_is_valid) {
     pkt_metadata_init(&packet->md, port_no);
 }
 miniflow_extract(packet, &key->mf);
 if (OVS_LIKELY(cur_min)) {
     if (md_is_valid) {
         key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
     } else {
         if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
             key->hash = dp_packet_get_rss_hash(packet);
         } else {
             key->hash = miniflow_hash_5tuple(&key->mf, 0);
             dp_packet_set_rss_hash(packet, key->hash);
         }
     }
     flow = emc_lookup(flow_cache, key);
 } else {
     flow = NULL;
 }

Also, if I'm understanding correctly, the final effect of the patch is that in 
the case where !md_is_valid it effectively replicates the work of 
dpif_netdev_packet_get_rss_hash() but leaves out the if (recirc_depth) block 
of that fn. This effectively overrides the return value of 
recirc_depth_get_unsafe() in dpif_netdev_packet_get_rss_hash() and 
forces/assumes that it is zero. 

If so it would be less disturbing to the existing code to just add a bool arg 
to dpif_netdev_packet_get_rss_hash() called do_not_check_recirc_depth and use 
that to return early (before the if (recirc_depth) check). Also in that case 
the patch would require none of the conditional logic changes (neither the 
original nor that suggested in this email) and should be able to just set the 
proposed do_not_check_recirc_depth based on md_is_valid.
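
Roughly what I have in mind - a sketch based on the existing function, not
compiled or tested, and the parameter name is just a suggestion:

static inline uint32_t
dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
                                const struct miniflow *mf,
                                bool do_not_check_recirc_depth)
{
    uint32_t hash, recirc_depth;

    if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
        hash = dp_packet_get_rss_hash(packet);
    } else {
        hash = miniflow_hash_5tuple(mf, 0);
        dp_packet_set_rss_hash(packet, hash);
    }

    if (do_not_check_recirc_depth) {
        /* The caller knows md is not valid, i.e. recirc depth is zero,
         * so skip mixing it into the hash. */
        return hash;
    }

    /* The RSS hash must account for the recirculation depth to avoid
     * collisions in the exact match cache. */
    recirc_depth = *recirc_depth_get_unsafe();
    if (OVS_UNLIKELY(recirc_depth)) {
        hash = hash_finish(hash, recirc_depth);
        dp_packet_set_rss_hash(packet, hash);
    }

    return hash;
}

The call in emc_processing would then just be
dpif_netdev_packet_get_rss_hash(packet, &key->mf, !md_is_valid).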

Also, as this is showing up as a patch set, can you add a cover letter to 
outline the overall goal of the patchset?

Thanks,
Billy. 


> -Original Message-
> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> boun...@openvswitch.org] On Behalf Of antonio.fische...@intel.com
> Sent: Monday, June 19, 2017 11:12 AM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH 1/4] dpif-netdev: Avoid reading RSS hash when
> EMC is disabled.
> 
> From: Antonio Fischetti 
> 
> When EMC is disabled the reading of RSS hash is skipped.
> For packets that are not recirculated it retrieves the hash value without
> considering the recirc id.
> 
> This is mostly a preliminary change for the next patch in this series.
> 
> Signed-off-by: Antonio Fischetti 
> ---
> In our testbench we used monodirectional traffic with 64B UDP packets PDM
> threads:  2 Traffic gen. streams: 1
> 
> we saw the following performance improvement:
> 
> Orig   11.49 Mpps
> With Patch#1:  11.62 Mpps
> 
>  lib/dpif-netdev.c | 30 +-
>  1 file changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 02af32e..fd2ed52
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4584,13 +4584,33 @@ emc_processing(struct dp_netdev_pmd_thread
> *pmd,
> 
>          if (!md_is_valid) {
>              pkt_metadata_init(&packet->md, port_no);
> +            miniflow_extract(packet, &key->mf);
> +            /* This is not a recirculated packet. */
> +            if (OVS_LIKELY(cur_min)) {
> +                /* EMC is enabled.  We can retrieve the 5-tuple hash
> +                 * without considering the recirc id. */
> +                if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
> +                    key->hash = dp_packet_get_rss_hash(packet);
> +                } else {
> +                    key->hash = miniflow_hash_5tuple(&key->mf, 0);
> +                    dp_packet_set_rss_hash(packet, key->hash);
> +                }
> +                flow = emc_lookup(flow_cache, key);
> +            } else {
> +                /* EMC is disabled, skip emc_lookup. */
> +                flow = NULL;
> +            }
> +        } else {
> +            /* Recirculated packets. */
> +            miniflow_extract(packet, &key->mf);
> +            if (OVS_LIKELY(cur_min)) {
> +                key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
> +                flow = emc_lookup(flow_cache, key);
> +            } else {
> +                flow = NULL;
> +            }
>          }
> -        miniflow_extract(packet, &key->mf);
>          key->len = 0; /* Not computed yet. */
> -        key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
> -
> -        /* If EMC is disabled skip emc_lookup */
> -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, 

Re: [ovs-dev] [PATCH v9] netdev-dpdk: Increase pmd thread priority.

2017-06-23 Thread O Mahony, Billy
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com> 

> -Original Message-
> From: Bodireddy, Bhanuprakash
> Sent: Thursday, June 22, 2017 9:51 PM
> To: d...@openvswitch.org
> Cc: O Mahony, Billy <billy.o.mah...@intel.com>; Bodireddy, Bhanuprakash
> <bhanuprakash.bodire...@intel.com>
> Subject: [PATCH v9] netdev-dpdk: Increase pmd thread priority.
> 
> Increase the DPDK pmd thread scheduling priority by lowering the nice value.
> This will advise the kernel scheduler to prioritize pmd thread over other
> processes and will help PMD to provide deterministic performance in out-of-
> the-box deployments.
> 
> This patch sets the nice value of PMD threads to '-20'.
> 
>   $ ps -eLo comm,policy,psr,nice | grep pmd
> 
>COMMAND  POLICY  PROCESSORNICE
> pmd62 TS3-20
> pmd63 TS0-20
> pmd64 TS1-20
> pmd65 TS2-20
> 
> Signed-off-by: Bhanuprakash Bodireddy
> <bhanuprakash.bodire...@intel.com>
> Tested-by: Billy O'Mahony <billy.o.mah...@intel.com>
> ---
> v8->v9:
> * Rebase
> 
> v7->v8:
> * Rebase
> * Update the documentation file @Documentation/intro/install/dpdk-
> advanced.rst
> 
> v6->v7:
> * Remove realtime scheduling policy logic.
> * Increase pmd thread scheduling priority by lowering nice value to -20.
> * Update doc accordingly.
> 
> v5->v6:
> * Prohibit spawning pmd thread on the lowest core in dpdk-lcore-mask if
>   lcore-mask and pmd-mask affinity are identical.
> * Updated Note section in INSTALL.DPDK-ADVANCED doc.
> * Tested below cases to verify system stability with pmd priority patch
> 
> v4->v5:
> * Reword Note section in DPDK-ADVANCED.md
> 
> v3->v4:
> * Document update
> * Use ovs_strerror for reporting errors in lib-numa.c
> 
> v2->v3:
> * Move set_priority() function to lib/ovs-numa.c
> * Apply realtime scheduling policy and priority to pmd thread only if
>   pmd-cpu-mask is passed.
> * Update INSTALL.DPDK-ADVANCED.
> 
> v1->v2:
> * Removed #ifdef and introduced dummy function
> "pmd_thread_setpriority"
>   in netdev-dpdk.h
> * Rebase
> 
>  Documentation/intro/install/dpdk.rst |  8 +++-
>  lib/dpif-netdev.c|  4 
>  lib/ovs-numa.c   | 21 +
>  lib/ovs-numa.h   |  1 +
>  4 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/intro/install/dpdk.rst
> b/Documentation/intro/install/dpdk.rst
> index e83f852..b5c26ba 100644
> --- a/Documentation/intro/install/dpdk.rst
> +++ b/Documentation/intro/install/dpdk.rst
> @@ -453,7 +453,8 @@ affinitized accordingly.
>to be affinitized to isolated cores for optimum performance.
> 
>By setting a bit in the mask, a pmd thread is created and pinned to the
> -  corresponding CPU core. e.g. to run a pmd thread on core 2::
> +  corresponding CPU core with nice value set to -20.
> +  e.g. to run a pmd thread on core 2::
> 
>$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
> 
> @@ -493,6 +494,11 @@ improvements as there will be more total CPU
> occupancy available::
> 
>  NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1
> 
> +  .. note::
> +It is recommended that the OVS control thread and pmd thread shouldn't
> be
> +pinned to the same core i.e 'dpdk-lcore-mask' and 'pmd-cpu-mask' cpu
> mask
> +settings should be non-overlapping.
> +
>  DPDK Physical Port Rx Queues
>  
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index f83b632..6bbd786
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3712,6 +3712,10 @@ pmd_thread_main(void *f_)
>  ovs_numa_thread_setaffinity_core(pmd->core_id);
>  dpdk_set_lcore_id(pmd->core_id);
>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
> +
> +    /* Set pmd thread's nice value to -20 */
> +#define MIN_NICE -20
> +    ovs_numa_thread_setpriority(MIN_NICE);
>  reload:
>  emc_cache_init(>flow_cache);
> 
> diff --git a/lib/ovs-numa.c b/lib/ovs-numa.c index 98e97cb..a1921b3 100644
> --- a/lib/ovs-numa.c
> +++ b/lib/ovs-numa.c
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #endif /* __linux__ */
> @@ -570,3 +571,23 @@ int ovs_numa_thread_setaffinity_core(unsigned
> core_id OVS_UNUSED)
>  return EOPNOTSUPP;
>  #endif /* __linux__ */
>  }
> +
> +int
> +ovs_numa_thread_setpriority(int nice OVS_UNUSED) {
> +if (dummy_numa) {
> +return 0;
>

Re: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on non-local numa node.

2017-06-13 Thread O Mahony, Billy
Hi All,

Does anyone else have any comments on this patch?

I'm adding Ilya and Jan to the CC as I believe you both had comments on this 
previously. Apologies if I've forgotten anyone else that commented from the CC!

Regards,
/Billy

> -Original Message-
> From: Stokes, Ian
> Sent: Thursday, May 11, 2017 12:09 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH v6] dpif-netdev: Assign ports to pmds on non-
> local numa node.
> 
> > Previously if there is no available (non-isolated) pmd on the numa
> > node for a port then the port is not polled at all. This can result in
> > a non- operational system until such time as nics are physically
> > repositioned. It is preferable to operate with a pmd on the 'wrong'
> > numa node albeit with lower performance. Local pmds are still chosen
> when available.
> >
> > Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> > ---
> > v6: Change 'port' to 'queue' in a warning msg
> > v5: Fix warning msg; Update same in docs
> > v4: Fix a checkpatch error
> > v3: Fix warning messages not appearing when using multiqueue
> > v2: Add details of warning messages into docs
> >
> >  Documentation/intro/install/dpdk.rst | 10 +
> >  lib/dpif-netdev.c| 43
> > +++-
> >  2 files changed, 48 insertions(+), 5 deletions(-)
> >
> > diff --git a/Documentation/intro/install/dpdk.rst
> > b/Documentation/intro/install/dpdk.rst
> > index d1c0e65..7a66bff 100644
> > --- a/Documentation/intro/install/dpdk.rst
> > +++ b/Documentation/intro/install/dpdk.rst
> > @@ -460,6 +460,16 @@ affinitized accordingly.
> >  pmd thread on a NUMA node is only created if there is at least
> > one DPDK
> >  interface from that NUMA node added to OVS.
> >
> > +  .. note::
> > +   On NUMA systems PCI devices are also local to a NUMA node.  Rx
> > + queues
> > for
> > +   PCI device will assigned to a pmd on it's local NUMA node if
> > + pmd-cpu-
> > mask
> > +   has created a pmd thread on that NUMA node.  If not the queue will be
> > +   assigned to a pmd on a remote NUMA node.  This will result in reduced
> > +   maximum throughput on that device.  In the case such a queue
> > assignment
> > +   is made a warning message will be logged: "There's no available (non-
> > +   isolated) pmd thread on numa node N. Queue Q on port P will be
> > assigned to
> > +   the pmd on core C (numa node N'). Expect reduced performance."
> > +
> >  - QEMU vCPU thread Affinity
> >
> >A VM performing simple packet forwarding or running complex packet
> > pipelines diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > b3a0806..34f1963 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -3149,10 +3149,13 @@ rr_numa_list_lookup(struct rr_numa_list *rr,
> > int
> > numa_id)  }
> >
> >  static void
> > -rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
> > +rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr,
> > +  int *all_numa_ids, unsigned all_numa_ids_sz,
> > +  int *num_ids_written)
> >  {
> >  struct dp_netdev_pmd_thread *pmd;
> >  struct rr_numa *numa;
> > +unsigned idx = 0;
> >
> >  hmap_init(&rr->numas);
> >
> > @@ -3170,7 +3173,11 @@ rr_numa_list_populate(struct dp_netdev *dp,
> > struct rr_numa_list *rr)
> >  numa->n_pmds++;
> >  numa->pmds = xrealloc(numa->pmds, numa->n_pmds * sizeof
> > *numa-
> > >pmds);
> >  numa->pmds[numa->n_pmds - 1] = pmd;
> > +
> > +all_numa_ids[idx % all_numa_ids_sz] = pmd->numa_id;
> > +idx++;
> >  }
> > +*num_ids_written = idx;
> >  }
> >
> >  static struct dp_netdev_pmd_thread *
> > @@ -3202,8 +3209,15 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
> >  {
> >      struct dp_netdev_port *port;
> >      struct rr_numa_list rr;
> > +    int all_numa_ids [64];
> > +    int all_numa_ids_sz = sizeof all_numa_ids / sizeof all_numa_ids[0];
> > +    unsigned all_numa_ids_idx = 0;
> > +    int all_numa_ids_max_idx = 0;
> > +    int num_numa_ids = 0;
> >
> > -    rr_numa_list_populate(dp, &rr);
> > +    rr_numa_list_populate(dp, &rr, all_numa_ids, all_numa_ids_sz,
> > +  
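
The behaviour the patch adds boils down to: prefer a pmd on the rx queue's
local NUMA node, and only if none exists there fall back to a pmd on another
node and log the warning quoted in the documentation hunk above. A minimal,
self-contained sketch of that policy (the struct, helper and variable names
are invented for illustration and are not the real dp_netdev structures):

    /* Pick a pmd for a queue whose device is local to 'port_numa':
     * local NUMA node first, otherwise round-robin over all pmds. */
    #include <stdio.h>
    #include <stddef.h>

    struct pmd {
        int core_id;
        int numa_id;
    };

    static const struct pmd *
    pmd_for_queue(const struct pmd *pmds, size_t n_pmds, int port_numa,
                  size_t *rr_local, size_t *rr_any)
    {
        /* First choice: a pmd on the port's own NUMA node. */
        for (size_t i = 0; i < n_pmds; i++) {
            size_t idx = (*rr_local + i) % n_pmds;
            if (pmds[idx].numa_id == port_numa) {
                *rr_local = idx + 1;
                return &pmds[idx];
            }
        }
        /* Fallback: any pmd at all, with a warning, rather than leaving
         * the queue unpolled. */
        if (n_pmds) {
            size_t idx = (*rr_any)++ % n_pmds;
            fprintf(stderr, "no pmd on numa node %d; queue assigned to core "
                    "%d (numa node %d), expect reduced performance\n",
                    port_numa, pmds[idx].core_id, pmds[idx].numa_id);
            return &pmds[idx];
        }
        return NULL;
    }

    int
    main(void)
    {
        /* Two pmds, both on NUMA node 0; the port's NIC sits on node 1. */
        struct pmd pmds[] = { { 2, 0 }, { 4, 0 } };
        size_t rr_local = 0, rr_any = 0;
        const struct pmd *p = pmd_for_queue(pmds, 2, 1, &rr_local, &rr_any);
        if (p) {
            printf("queue 0 -> pmd on core %d (numa node %d)\n",
                   p->core_id, p->numa_id);
        }
        return 0;
    }

With pmd-cpu-mask covering only node 0, as in main() above, the fallback
branch fires; widening the mask to include a core on node 1 would make the
local pass succeed instead.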

Re: [ovs-dev] [PATCH v4 2/3] netdev-dpdk: Fix device leak on port deletion.

2017-05-26 Thread O Mahony, Billy
Oops, the patch series is probably the reason.

> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Friday, May 26, 2017 2:51 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org;
> Daniele Di Proietto <diproiet...@ovn.org>; Darrell Ball
> <db...@vmware.com>
> Cc: Heetae Ahn <heetae82@samsung.com>
> Subject: Re: [ovs-dev] [PATCH v4 2/3] netdev-dpdk: Fix device leak on port
> deletion.
> 
> On 26.05.2017 16:38, O Mahony, Billy wrote:
> > Hi Ilya,
> >
> > This patch does not apply to head of master, currently "c899576 build-windows:
> > cccl fail compilation on Wimplicit-function-declaration".
> 
> Hmm. I'm able to apply it using 'git am'. Have you applied the first patch of
> this series?
> 
> > I don't have any comments on the code right now, but if you can tell me
> > the commit it's based on I'll check it out.
> 
> Originally, these patches are made on top of
> 126fb3e8abc2 ("ovn-ctl: Start ovn-northd even if ovsdb-servers are not
> running") but they are still applicable on top of current master.
> 
> Best regards, Ilya Maximets.
> 
> >
> > Thanks,
> > Billy
> >
> >> -Original Message-
> >> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> >> boun...@openvswitch.org] On Behalf Of Ilya Maximets
> >> Sent: Friday, May 19, 2017 2:38 PM
> >> To: d...@openvswitch.org; Daniele Di Proietto <diproiet...@ovn.org>;
> >> Darrell Ball <db...@vmware.com>
> >> Cc: Ilya Maximets <i.maxim...@samsung.com>; Heetae Ahn
> >> <heetae82@samsung.com>
> >> Subject: [ovs-dev] [PATCH v4 2/3] netdev-dpdk: Fix device leak on
> >> port deletion.
> >>
> >> Currently, a device once created in dpdk will exist forever even after
> >> the del-port operation until we manually call 'ovs-appctl
> >> netdev-dpdk/detach <name>', where <name> is not the port's name but
> >> the name of the dpdk eth device or its pci address.
> >>
> >> Few issues with current implementation:
> >>
> >>1. Different API for usual (system) and DPDK devices.
> >>   (We have to call 'ovs-appctl netdev-dpdk/detach' each
> >>time after 'del-port' to actually free the device)
> >>   This is a big issue mostly for virtual DPDK devices.
> >>
> >>2. Follows from 1:
> >>   For DPDK devices 'del-port' leads only to
> >>   'rte_eth_dev_stop', and a subsequent 'add-port' will
> >>   just start the already existing device. Such behaviour
> >>   will not reset the device to its initial state as might
> >>   be expected. For example: a virtual pcap pmd will continue
> >>   reading its input file instead of reading it from the beginning.
> >>
> >>3. Follows from 2:
> >>   After execution of the following commands 'port1' will be
> >>   configured with the 'old-options' while 'ovs-vsctl show'
> >>   will show us 'new-options' in dpdk-devargs field:
> >>
> >> ovs-vsctl add-port port1 -- set interface port1 type=dpdk \
> >>   options:dpdk-devargs=<device>,<old-options>
> >> ovs-vsctl del-port port1
> >> ovs-vsctl add-port port1 -- set interface port1 type=dpdk \
> >>   options:dpdk-devargs=<device>,<new-options>
> >>
> >>4. Follows from 1:
> >>   A device that is not detached still consumes a 'port_id'. Since we
> >>   have a very limited number of 'port_id's (32 in the common case)
> >>   this may lead to quick exhaustion of the id pool and the inability
> >>   to add any other port.
> >>
> >> To avoid the above issues we need to detach all the attached devices on
> >> port destruction.
> >> The 'netdev-dpdk/detach' appctl is removed because it is not needed anymore.
> >>
> >> We need to use an internal 'attached' variable to track ports on which
> >> rte_eth_dev_attach() was called and returned successfully, to avoid
> >> closing and detaching devices that do not support hotplug or that were
> >> attached by other means, e.g. via the 'dpdk-extra' cmdline options.
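
A rough sketch of that detach-on-destruction idea, assuming a DPDK release
contemporary with this thread (~17.05, where rte_eth_dev_detach() still
exists); the struct and function names are invented and this is not the
actual netdev-dpdk code:

    #include <stdbool.h>
    #include <stdio.h>
    #include <rte_ethdev.h>

    struct my_dpdk_port {
        uint8_t port_id;
        bool attached;   /* set only if we attached the device ourselves */
    };

    static void
    my_dpdk_port_destroy(struct my_dpdk_port *port)
    {
        char devname[RTE_ETH_NAME_MAX_LEN];

        rte_eth_dev_stop(port->port_id);
        rte_eth_dev_close(port->port_id);

        /* Only detach devices we attached with rte_eth_dev_attach();
         * devices brought up via 'dpdk-extra' EAL arguments or without
         * hotplug support are left untouched. */
        if (port->attached) {
            if (rte_eth_dev_detach(port->port_id, devname)) {
                fprintf(stderr, "failed to detach port %u\n",
                        (unsigned) port->port_id);
            } else {
                printf("detached device '%s'\n", devname);
            }
        }
    }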
> >>
> >> CC: Ciara Loftus <ciara.lof...@intel.com>
> >> Fixes: 55e075e65ef9 ("netdev-dpdk: Arbitrary 'dpdk' port naming")
> >> Fixes: 69876ed78611 ("netdev-dpdk: Add support for virtual DPDK PMDs
> >> (vdevs)")
> >> Signed-off-by: Ilya Maximets <i.maxim...@samsung.com>
> >> ---
> >>  Documentation/howto/dpdk.rst |  5 ++-
> >>  lib/netdev-dpdk.c  
