Re: [ovs-dev] [PATCH v9 0/7] OVS-DPDK flow offload with rte_flow

2018-06-22 Thread Finn Christensen
Shahaf,

No, I haven't looked into that issue, mainly because it seemed NIC/PMD related
rather than OVS related.
I can try to make a simple comparison with a loopback test in OVS using a
Napatech NIC, with and without HW acceleration.
I will get back to you with my findings.

Finn

From: Shahaf Shuler 
Sent: 22. juni 2018 09:07
To: Finn Christensen ; Koujalagi, MalleshX 
; y...@fridaylinux.org
Cc: d...@openvswitch.org; Ergin, Mesut A ; Tsai, James 
; Olga Shern 
Subject: Re: [ovs-dev] [PATCH v9 0/7] OVS-DPDK flow offload with rte_flow

Finn,

It is great that we are aligned w/ the finding.

I agree w/ your approach to extract the relevant fields in the TCP parsing and to
limit the rules offloaded to HW based on validated actions.

There was one more issue from Flavio about the single-core performance w/ HWOL.
He saw a performance drop when enabling HWOL in a simple phy2phy loopback test.
I wasn't able to reproduce it; in fact, I saw a big improvement.
Have you tried it? I would really like to address this issue before v11.

From: Finn Christensen mailto:f...@napatech.com>>
Sent: Thursday, June 21, 2018 10:23:22 PM
To: Shahaf Shuler; Koujalagi, MalleshX; 
y...@fridaylinux.org<mailto:y...@fridaylinux.org>
Cc: d...@openvswitch.org<mailto:d...@openvswitch.org>; Ergin, Mesut A; Tsai, 
James; Olga Shern
Subject: RE: [ovs-dev] [PATCH v9 0/7] OVS-DPDK flow offload with rte_flow


Hi Shahaf,



These are exactly the same bugs I found today.

I added the calculation of the l3_ofs and l4_ofs offsets to parse_tcp_flags(),
where they can be computed with little or no additional performance penalty.
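A minimal sketch of how those offsets could be filled in while the TCP flags are
being read; the function name and the l3_ofs/l4_ofs fields only mirror the
discussion above, and the layout assumes an untagged Ethernet/IPv4/TCP frame:

    #include <stdint.h>
    #include <stddef.h>

    /* Hedged sketch only; the real parse_tcp_flags() change may differ. */
    static uint16_t
    parse_tcp_flags_sketch(const uint8_t *data, size_t len,
                           uint16_t *l3_ofs, uint16_t *l4_ofs)
    {
        const size_t eth_len = 14;                     /* untagged Ethernet */

        if (len < eth_len + 20 ||
            ((data[12] << 8) | data[13]) != 0x0800) {  /* IPv4 only here */
            return 0;
        }
        *l3_ofs = eth_len;
        *l4_ofs = eth_len + (data[eth_len] & 0x0f) * 4;    /* IHL is in words */
        if (data[eth_len + 9] != 6 || len < (size_t) *l4_ofs + 14) {
            return 0;                                  /* not TCP / truncated */
        }
        /* TCP flags: NS bit in byte 12, CWR..FIN in byte 13 of the TCP header. */
        return ((data[*l4_ofs + 12] & 0x01) << 8) | data[*l4_ofs + 13];
    }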

I also made emc_processing ignore the MARK when the recirculation depth is > 0,
and VXLAN with partial HW offload then worked. However, the MARK could also be
cleared in the tunnel pop function, before recirculation. Using the md_is_valid
flag is a bit more confusing to me.
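A rough sketch of that recirculation guard; recirc_depth_get() is the existing
per-thread depth counter in dpif-netdev.c, while the mark helpers and the exact
placement shown here are illustrative only:

    /* Hedged sketch: only trust the HW flow mark on the first pass, i.e. when
     * the packet has not been recirculated after a tunnel pop. */
    uint32_t mark;
    struct dp_netdev_flow *flow = NULL;

    if (*recirc_depth_get() == 0 && dp_packet_has_flow_mark(packet, &mark)) {
        flow = mark_to_flow_find(pmd, mark);   /* skip miniflow/EMC/dpcls */
    }
    if (!flow) {
        /* Recirculated or unmarked packet: fall back to the normal
         * miniflow extract + EMC/dpcls lookup path. */
    }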



I very much agree with your concerns; however, I do believe we must bypass the
miniflow_extract, because otherwise nothing is gained. Then only full offload
makes sense (or some other partial/full offload scheme, not defined yet).



To make sure we do not hit such an unforeseen issue again, we could change the
flow offload check so that we also inspect the flow's action list before
offloading to HW, and only accept the actions we have tested and know to work.

We can then extend that list as we continue to improve partial HW offload to
cover more and more flows. To me this also makes sense when extending to full
offload, where all actions will have to be validated and accepted anyway.
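One possible shape of such a check, as a sketch only; the netlink iteration
helpers are standard OVS ones (lib/netlink.h), but which actions end up on the
accepted list is purely illustrative:

    /* Hedged sketch: refuse to offload a flow whose action list contains
     * anything that has not been explicitly validated for HW offload. */
    static bool
    flow_actions_offloadable(const struct nlattr *actions, size_t actions_len)
    {
        const struct nlattr *a;
        size_t left;

        NL_ATTR_FOR_EACH_UNSAFE (a, left, actions, actions_len) {
            switch (nl_attr_type(a)) {
            case OVS_ACTION_ATTR_OUTPUT:      /* validated: plain forward */
                break;
            default:                          /* tunnel pop, recirc, ... */
                return false;                 /* keep this flow in software */
            }
        }
        return true;
    }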



Regarding inner matches, I think they may be better handled by offloading the
VTEP/tunnel encap/decap in HW and then also having the possibility to match on
the inner packet while still in HW. But those are just my thoughts.



I'm all for starting limited and then improving later.



/Finn



From: Shahaf Shuler mailto:shah...@mellanox.com>>
Sent: 21. juni 2018 19:26
To: Finn Christensen mailto:f...@napatech.com>>; Koujalagi, 
MalleshX mailto:malleshx.koujal...@intel.com>>; 
y...@fridaylinux.org<mailto:y...@fridaylinux.org>
Cc: d...@openvswitch.org<mailto:d...@openvswitch.org>; Ergin, Mesut A 
mailto:mesut.a.er...@intel.com>>; Tsai, James 
mailto:james.t...@intel.com>>; Olga Shern 
mailto:ol...@mellanox.com>>
Subject: RE: [ovs-dev] [PATCH v9 0/7] OVS-DPDK flow offload with rte_flow



Hi Finn,



Finally I was able to reproduce the error behavior w/ VXLAN traffic.



I found 2 issues related to the current design:

  1.  The VXLAN decap requires more fields apart from the TCP header, like the
l3_ofs and l4_ofs, which are missing because the miniflow extract is bypassed.
  2.  Tunnel packets are recirculated after the outer headers are decapped
(dp_netdev_recirculate). On the second processing stage the packet still
carries the flow mark, therefore the action taken for the inner headers is
also pop_tunnel.



I haven't yet established the final way to solve these issues. My current
thoughts:

  1.  Even though it is attractive, we cannot bypass the miniflow extract. VXLAN
is just one case; other cases will probably need different fields from the flow.
  2.  Use the md_is_valid flag in dp_netdev_input__ to recognize the non-first
pass and skip the mark-to-flow search. I don't really like this approach
because:

 *   It is a bit hackish to rely on md_is_valid. Alternatively we could clear
the mbuf mark flag, but that is basically the same (see the sketch after this
list).
 *   With tunnel packets the bulk of the flow rules is in the inner headers;
the outer headers don't change much. With the current design it means we
offload the less demanding part.
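A rough sketch of the mbuf-mark-clearing alternative mentioned in the list
above; in DPDK 17.11 a valid flow mark is signalled by PKT_RX_FDIR_ID and
carried in mbuf->hash.fdir.hi, but where exactly this would be called (e.g. the
tunnel pop path, before recirculation) is only a suggestion:

    #include <rte_mbuf.h>

    /* Hedged sketch: drop the HW flow mark so the recirculated (inner) pass
     * cannot accidentally match the rule installed for the outer headers. */
    static inline void
    dp_packet_clear_flow_mark_sketch(struct rte_mbuf *m)
    {
        m->ol_flags &= ~PKT_RX_FDIR_ID;    /* mark no longer valid */
        m->hash.fdir.hi = 0;
    }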

Having said that, the above approach looks like the fastest option without
changing the current design. We can improve this part further later.



I'd be happy to hear more suggestions.





From: Finn Christensen [mailto:f...@napatech.com]
Sent: Thursday, June 21, 2018 5:46 PM
T

Re: [ovs-dev] [PATCH v9 0/7] OVS-DPDK flow offload with rte_flow

2018-06-21 Thread Finn Christensen
Hi Shahaf,

These are exactly the same bugs I found today.
I added the calculation of the l3_ofs and l4_ofs offsets to parse_tcp_flags(),
where they can be computed with little or no additional performance penalty.
I also made emc_processing ignore the MARK when the recirculation depth is > 0,
and VXLAN with partial HW offload then worked. However, the MARK could also be
cleared in the tunnel pop function, before recirculation. Using the md_is_valid
flag is a bit more confusing to me.

I very much agree with your concerns; however, I do believe we must bypass the
miniflow_extract, because otherwise nothing is gained. Then only full offload
makes sense (or some other partial/full offload scheme, not defined yet).

To make sure we do not hit such an unforeseen issue again, we could change the
flow offload check so that we also inspect the flow's action list before
offloading to HW, and only accept the actions we have tested and know to work.
We can then extend that list as we continue to improve partial HW offload to
cover more and more flows. To me this also makes sense when extending to full
offload, where all actions will have to be validated and accepted anyway.

Regarding inner matches, I think they may be better handled by offloading the
VTEP/tunnel encap/decap in HW and then also having the possibility to match on
the inner packet while still in HW. But those are just my thoughts.

I'm all for starting limited and then improving later.

/Finn

From: Shahaf Shuler 
Sent: 21. juni 2018 19:26
To: Finn Christensen ; Koujalagi, MalleshX 
; y...@fridaylinux.org
Cc: d...@openvswitch.org; Ergin, Mesut A ; Tsai, James 
; Olga Shern 
Subject: RE: [ovs-dev] [PATCH v9 0/7] OVS-DPDK flow offload with rte_flow

Hi Finn,

Finally I was able to reproduce the error behavior w/ VXLAN traffic.

I found 2 issues related to the current design:

  1.  The VXLAN decap requires more fields apart from the TCP header, like the
l3_ofs and l4_ofs, which are missing because the miniflow extract is bypassed.
  2.  Tunnel packets are recirculated after the outer headers are decapped
(dp_netdev_recirculate). On the second processing stage the packet still
carries the flow mark, therefore the action taken for the inner headers is
also pop_tunnel.

I haven't yet established the final way to solve these issues. My current
thoughts:

  1.  Even though it is attractive, we cannot bypass the miniflow extract. VXLAN
is just one case; other cases will probably need different fields from the flow.
  2.  Use the md_is_valid flag in dp_netdev_input__ to recognize the non-first
pass and skip the mark-to-flow search. I don't really like this approach
because:

 *   It is a bit hackish to rely on md_is_valid. Alternatively we could clear
the mbuf mark flag, but that is basically the same.
 *   With tunnel packets the bulk of the flow rules is in the inner headers;
the outer headers don't change much. With the current design it means we
offload the less demanding part.

Having said that, the above approach looks like the fastest option without
changing the current design. We can improve this part further later.

I'd be happy to hear more suggestions.


From: Finn Christensen [mailto:f...@napatech.com]
Sent: Thursday, June 21, 2018 5:46 PM
To: Koujalagi, MalleshX 
mailto:malleshx.koujal...@intel.com>>; Shahaf 
Shuler mailto:shah...@mellanox.com>>; 
y...@fridaylinux.org<mailto:y...@fridaylinux.org>
Cc: d...@openvswitch.org<mailto:d...@openvswitch.org>; Ergin, Mesut A 
mailto:mesut.a.er...@intel.com>>; Tsai, James 
mailto:james.t...@intel.com>>
Subject: RE: [ovs-dev] [PATCH v9 0/7] OVS-DPDK flow offload with rte_flow

Hi Mallesh and Shahaf,

I have tried to reproduce the issue Mallesh is reporting, using a Napatech NIC.
The result is that I am able to reproduce the error and do not get the decap
functionality to work when using VXLAN tunneling together with partial
HW offload. Specifically, the VXLAN pop is not performed in OVS in my setup.
The VXLAN setup I'm using is straightforward, with two hosts and br-int and
br-phy bridges, like the typical examples.

I don't know how far you have got in debugging this, but what I see is that
the VXLAN pop functionality needs packet information originally gathered in the
miniflow extract function, which is skipped with HW offload. Therefore, it does
not work.

I'm still investigating how this may be solved, but I wanted to hear where you
are in debugging this issue, and whether you have already arrived at a solution.

Regards,
Finn

From: Koujalagi, MalleshX 
mailto:malleshx.koujal...@intel.com>>
Sent: 20. juni 2018 00:59
To: Shahaf Shuler mailto:shah...@mellanox.com>>; 
y...@fridaylinux.org<mailto:y...@fridaylinux.org>; Finn Christensen 
mailto:f...@napatech.com>>
Cc: d...@openvswitch.org<mailto:d...@openvswitch.org>; Ergin, Mesut A 
mailto:mesut.a.er...@intel.com>>; Tsai, James 
mailto:james.t...@intel.com>>
Subject: RE: [ovs-dev] [PAT

Re: [ovs-dev] [PATCH v9 0/7] OVS-DPDK flow offload with rte_flow

2018-06-21 Thread Finn Christensen
Hi Mallesh and Shahaf,

I have tried to reproduce the issue Mallesh is reporting, using a Napatech NIC.
The result is that I am able to reproduce the error and do not get the decap
functionality to work when using VXLAN tunneling together with partial
HW offload. Specifically, the VXLAN pop is not performed in OVS in my setup.
The VXLAN setup I'm using is straightforward, with two hosts and br-int and
br-phy bridges, like the typical examples.

I don't know how far you have got in debugging this, but what I see is that
the VXLAN pop functionality needs packet information originally gathered in the
miniflow extract function, which is skipped with HW offload. Therefore, it does
not work.

I'm still investigating how this may be solved, but I wanted to hear where you
are in debugging this issue, and whether you have already arrived at a solution.

Regards,
Finn

From: Koujalagi, MalleshX 
Sent: 20. juni 2018 00:59
To: Shahaf Shuler ; y...@fridaylinux.org; Finn 
Christensen 
Cc: d...@openvswitch.org; Ergin, Mesut A ; Tsai, James 

Subject: RE: [ovs-dev] [PATCH v9 0/7] OVS-DPDK flow offload with rte_flow

Hi Shahaf,

Thanks for setting up VXLAN env. and trying to reproduce, appreciate!!

As you suggested, I moved to DPDK 17.11.3 stable and OVS 2.9.0, and still
observed the same issue.

I attached Vxlan_diagram.jpg to give a clearer idea. We are seeing the issue
when traffic is injected in the direction Traffic Generator --> HOST B
[eth1-->Vxlan-->eth0] --> VXLAN tunnel packet --> HOST A [eth0-->Vxlan-->eth1]
--> Traffic Generator. At Host A (eth0-->Vxlan-->eth1) we see the VXLAN decap
issue when HW offload is enabled: VXLAN tunnel packets are not decapsulated
from eth0 to eth1, and no packets are received at eth1.

We injected random destination IP addresses (196.0.0.0/196.0.0.1):

  1.  Find the tcpdump at Host A (eth0) below:
07:24:12.013848 68:05:ca:33:ff:99 > 68:05:ca:33:fe:f9, ethertype IPv4 (0x0800), 
length 110: 172.168.1.2.46816 > 172.168.1.1.4789: VXLAN, flags [I] (0x08), vni 0
00:10:94:00:00:12 > 00:00:01:00:00:01, ethertype IPv4 (0x0800), length 60: 
197.0.0.0.1024 > 196.0.0.1.1024: UDP, length 18
07:24:12.013860 68:05:ca:33:ff:99 > 68:05:ca:33:fe:f9, ethertype IPv4 (0x0800), 
length 110: 172.168.1.2.50892 > 172.168.1.1.4789: VXLAN, flags [I] (0x08), vni 0
00:10:94:00:00:12 > 00:00:01:00:00:01, ethertype IPv4 (0x0800), length 60: 
197.0.0.0.1024 > 196.0.0.0.1024: UDP, length 18


  2.  OVS logs are attached.
  3.  Sure, we can further debug this issue in a Skype/WebEx session. Please
let me know a good time.


Best regards
-/Mallesh


From: Shahaf Shuler [mailto:shah...@mellanox.com]
Sent: Monday, June 18, 2018 4:05 AM
To: Koujalagi, MalleshX 
mailto:malleshx.koujal...@intel.com>>; 
y...@fridaylinux.org<mailto:y...@fridaylinux.org>; 
f...@napatech.com<mailto:f...@napatech.com>
Cc: d...@openvswitch.org<mailto:d...@openvswitch.org>; Ergin, Mesut A 
mailto:mesut.a.er...@intel.com>>; Tsai, James 
mailto:james.t...@intel.com>>
Subject: RE: [ovs-dev] [PATCH v9 0/7] OVS-DPDK flow offload with rte_flow

Mallesh,

I was finally able to set up the VXLAN testing w/ OVS. Instead of using OVS on
both sides, I used a VXLAN interface to inject the traffic on one host and ran
OVS with the VXLAN tunnel configuration you specified on the other.

I was not able to reproduce your case. Actually, I see the rules are created
successfully (see the attached OVS log "vxlan_tep").

A few observations:

  1.  The rule created for the VXLAN is for the outer pattern up to the UDP,
hence it is supported by the device:

 *   2018-06-18T08:11:17.466Z|00057|dpif_netdev(pmd53)|DBG|flow match:
recirc_id=0,eth,udp,vlan_tci=0x,dl_src=e4:1d:2d:ca:ca:6a,dl_dst=e4:1d:2d:a3:aa:74,nw_dst=2.2.2.1,nw_frag=no,tp_dst=4789

  2.  I do see a segfault in the mlx5 PMD on 17.11.0; this was resolved in
17.11.3 stable.
  3.  I used MLNX_OFED_LINUX-4.3-2.0.1.0, but I don't expect it to cause
any difference.

The packets being sent to the OVS VXLAN tunnel logic are [1]. The tcpdump after
the OVS logic is [2]; the packets' outer headers were removed as expected.

On top of all that, I tried to force the validate function to fail. Indeed, the
flow was not created, but the decap functionality still happens (see the
attached OVS log "vxlan_tep_force_err").

Can you please:

  1.  Provide me with the type of packets you inject and don't see being
decapped.
  2.  Try w/ 17.11.3 stable, to avoid any issues from the DPDK side.
  3.  If you still see the issue, can we set up a WebEx/Skype meeting for a
joint debug? Currently I don't see the error you reported.


[1]
###[ Ethernet ]###
  dst= e4:1d:2d:a3:aa:74
  src= e4:1d:2d:ca:ca:6a
  type= IPv4
###[ IP ]###
 version= 4
 ihl= None
 tos= 0x0
 len= None
 id= 1
 flags=
 frag= 0
 ttl= 64
 proto= udp
 chksum= None
 src= 2.2.2.2
 dst= 2.2.2.1
 \options\
###[ UDP ]###
sport= 34550
dport= 4789
len= None
chksum= None
###[ VXLAN

Re: [ovs-dev] [PATCH v10 4/7] netdev-dpdk: implement flow offload with rte flow

2018-06-18 Thread Finn Christensen
>-Original Message-
>From: Shahaf Shuler 
>Sent: 18. juni 2018 16:29
>To: Andrew Rybchenko ; Finn Christensen
>; ian.sto...@intel.com
>Cc: d...@openvswitch.org; simon.hor...@netronome.com; f...@redhat.com
>Subject: RE: [ovs-dev] [PATCH v10 4/7] netdev-dpdk: implement flow offload
>with rte flow
>
>Monday, June 18, 2018 1:17 PM, Andrew Rybchenko:
>> Subject: Re: [ovs-dev] [PATCH v10 4/7] netdev-dpdk: implement flow
>> offload with rte flow
>>
>> On 05/18/2018 12:14 PM, Shahaf Shuler wrote:
>> > From: Finn Christensen 
>> >
>> > The basic yet the major part of this patch is to translate the "match"
>> > to rte flow patterns. And then, we create a rte flow with MARK + RSS
>> > actions. Afterwards, all packets matching the flow will have the mark
>> > id in the mbuf.
>> >
>> > The reason RSS is needed is, for most NICs, a MARK only action is
>> > not allowed. It has to be used together with some other actions,
>> > such as QUEUE, RSS, etc. However, QUEUE action can specify one queue
>> > only, which may break the RSS. Likely, the RSS action is currently the
>> > best we can do for now. Thus, the RSS action is chosen.
>> >
>> > For any unsupported flows, such as MPLS, -1 is returned, meaning the
>> > flow offload has failed and is then skipped.
>> >
>> > Co-authored-by: Yuanhan Liu 
>> > Signed-off-by: Finn Christensen 
>> > Signed-off-by: Yuanhan Liu 
>> > Signed-off-by: Shahaf Shuler 
>>
>> <...>
>>
>> > +/* VLAN */
>> > +struct rte_flow_item_vlan vlan_spec;
>> > +struct rte_flow_item_vlan vlan_mask;
>> > +memset(&vlan_spec, 0, sizeof(vlan_spec));
>> > +memset(&vlan_mask, 0, sizeof(vlan_mask));
>> > +if (match->wc.masks.vlans[0].tci && match->flow.vlans[0].tci) {
>> > +vlan_spec.tci  = match->flow.vlans[0].tci;
>> > +vlan_mask.tci  = match->wc.masks.vlans[0].tci;
>>
>> As I understand VLAN_CFI bit is used inside OVS to distinguish no VLAN
>> and zero VLAN cases (aka VLAN_TAG_PRESENT). So above two lines should
>> drop the VLAN_CFI bit, something like:
>>
>>      vlan_spec.tci = match->flow.vlans[0].tci & ~htons(VLAN_CFI);
>>      vlan_mask.tci = match->wc.masks.vlans[0].tci & ~htons(VLAN_CFI);
>
>Yes, thanks for spotting that. I guess matching on the CFI bit is not so
>important from an offload perspective; I don't think we will have different
>actions for those.
>Will be addressed in v11.

Yes, I agree with this. Thanks Shahaf for correcting this in v11.
Finn

>
>>
>> Andrew.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v8 3/6] netdev-dpdk: implement flow offload with rte flow

2018-04-11 Thread Finn Christensen


>-Original Message-
>From: Stokes, Ian <ian.sto...@intel.com>
>Sent: 10. april 2018 21:59
>To: Shahaf Shuler <shah...@mellanox.com>; Finn Christensen
><f...@napatech.com>
>Cc: simon.hor...@netronome.com; ovs-dev@openvswitch.org; Chandran,
>Sugesh <sugesh.chand...@intel.com>; db...@vmware.com; Yuanhan Liu
><y...@fridaylinux.org>
>Subject: RE: [PATCH v8 3/6] netdev-dpdk: implement flow offload with rte
>flow
>
>> The basic yet the major part of this patch is to translate the "match"
>> to rte flow patterns. And then, we create a rte flow with MARK + RSS
>> actions. Afterwards, all packets matching the flow will have the mark id
>> in the mbuf.
>>
>> The reason RSS is needed is, for most NICs, a MARK only action is not
>> allowed. It has to be used together with some other actions, such as
>> QUEUE, RSS, etc. However, QUEUE action can specify one queue only,
>> which may break the rss. Likely, RSS action is currently the best we could
>now.
>> Thus, RSS action is choosen.
>>
>> For any unsupported flows, such as MPLS, -1 is returned, meaning the
>> flow offload has failed and is then skipped.
>>
>> Co-authored-by: Yuanhan Liu <y...@fridaylinux.org>
>> Signed-off-by: Finn Christensen <f...@napatech.com>
>> Signed-off-by: Yuanhan Liu <y...@fridaylinux.org>
>> Signed-off-by: Shahaf Shuler <shah...@mellanox.com>
>> ---
>>  lib/netdev-dpdk.c | 563
>> -
>>  1 file changed, 562 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
>> af9843a..df4d480
>> 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -38,7 +38,9 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>> +#include "cmap.h"
>>  #include "dirs.h"
>>  #include "dp-packet.h"
>>  #include "dpdk.h"
>> @@ -51,6 +53,7 @@
>>  #include "openvswitch/list.h"
>>  #include "openvswitch/ofp-print.h"
>>  #include "openvswitch/vlog.h"
>> +#include "openvswitch/match.h"
>>  #include "ovs-numa.h"
>>  #include "ovs-thread.h"
>>  #include "ovs-rcu.h"
>> @@ -60,6 +63,7 @@
>>  #include "sset.h"
>>  #include "unaligned.h"
>>  #include "timeval.h"
>> +#include "uuid.h"
>>  #include "unixctl.h"
>>
>>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM}; @@ -170,6 +174,17 @@
>> static const struct rte_eth_conf port_conf = {  };
>>
>>  /*
>> + * A mapping from ufid to dpdk rte_flow.
>> + */
>> +static struct cmap ufid_to_rte_flow = CMAP_INITIALIZER;
>> +
>> +struct ufid_to_rte_flow_data {
>> +struct cmap_node node;
>> +ovs_u128 ufid;
>> +struct rte_flow *rte_flow;
>> +};
>> +
>> +/*
>>   * These callbacks allow virtio-net devices to be added to vhost
>> ports when
>>   * configuration has been fully completed.
>>   */
>> @@ -3709,6 +3724,552 @@ unlock:
>>  return err;
>>  }
>>
>> +
>> +/* Find rte_flow with @ufid */
>> +static struct rte_flow *
>> +ufid_to_rte_flow_find(const ovs_u128 *ufid) {
>> +size_t hash = hash_bytes(ufid, sizeof(*ufid), 0);
>> +struct ufid_to_rte_flow_data *data;
>> +
>> +CMAP_FOR_EACH_WITH_HASH (data, node, hash, &ufid_to_rte_flow) {
>> +if (ovs_u128_equals(*ufid, data->ufid)) {
>> +return data->rte_flow;
>> +}
>> +}
>> +
>> +return NULL;
>> +}
>> +
>> +static inline void
>> +ufid_to_rte_flow_associate(const ovs_u128 *ufid,
>> +   struct rte_flow *rte_flow) {
>> +size_t hash = hash_bytes(ufid, sizeof(*ufid), 0);
>> +struct ufid_to_rte_flow_data *data = xzalloc(sizeof(*data));
>> +
>> +/*
>> + * We should not simply overwrite an existing rte flow.
>> + * We should have deleted it first before re-adding it.
>> + * Thus, if following assert triggers, something is wrong:
>> + * the rte_flow is not destroyed.
>> + */
>> +ovs_assert(ufid_to_rte_flow_find(ufid) == NULL);
>> +
>> +data->ufid = *ufid;
>> +data->rte_flow = rte_flow;
>> +
>> +cmap_insert(&ufid_to_rte_flow,
>> +CONST_CAST(struct cmap_node *, &data->node), hash); }
>> +
>> +static inline void
>> +ufid_to_rte_

Re: [ovs-dev] [PATCH v7 3/6] netdev-dpdk: implement flow offload with rte flow

2018-03-21 Thread Finn Christensen


>-Original Message-
>From: Stokes, Ian <ian.sto...@intel.com>
>Sent: 21. marts 2018 14:38
>To: Finn Christensen <f...@napatech.com>; 'Yuanhan Liu'
><y...@fridaylinux.org>; d...@openvswitch.org; Shahaf Shuler
><shah...@mellanox.com>
>Cc: Darrell Ball <db...@vmware.com>; Chandran, Sugesh
><sugesh.chand...@intel.com>; Simon Horman
><simon.hor...@netronome.com>
>Subject: RE: [PATCH v7 3/6] netdev-dpdk: implement flow offload with rte
>flow
>
>> -Original Message-
>> From: Finn Christensen [mailto:f...@napatech.com]
>> Sent: Friday, March 16, 2018 12:44 PM
>> To: Stokes, Ian <ian.sto...@intel.com>; 'Yuanhan Liu'
>> <y...@fridaylinux.org>; d...@openvswitch.org; Shahaf Shuler
>> <shah...@mellanox.com>
>> Cc: Darrell Ball <db...@vmware.com>; Chandran, Sugesh
>> <sugesh.chand...@intel.com>; Simon Horman
><simon.hor...@netronome.com>
>> Subject: RE: [PATCH v7 3/6] netdev-dpdk: implement flow offload with
>> rte flow
>>
>> Hi Ian,
>>
>> Thanks for the review.
>> I'll fix the compilation issues and the comments you have below.
>> Please see my answers below.
>>
>> Thanks
>> Finn
>>
>> >-Original Message-
>> >From: Stokes, Ian <ian.sto...@intel.com>
>> >Sent: 12. marts 2018 16:00
>> >To: 'Yuanhan Liu' <y...@fridaylinux.org>; d...@openvswitch.org
>> >Cc: Finn Christensen <f...@napatech.com>; Darrell Ball
>> ><db...@vmware.com>; Chandran, Sugesh <sugesh.chand...@intel.com>;
>> >Simon Horman <simon.hor...@netronome.com>
>> >Subject: RE: [PATCH v7 3/6] netdev-dpdk: implement flow offload with
>> >rte flow
>> >
>> >> -Original Message-
>> >> From: Yuanhan Liu [mailto:y...@fridaylinux.org]
>> >> Sent: Monday, January 29, 2018 7:00 AM
>> >> To: d...@openvswitch.org
>> >> Cc: Stokes, Ian <ian.sto...@intel.com>; Finn Christensen
>> >> <f...@napatech.com>; Darrell Ball <db...@vmware.com>; Chandran,
>> >> Sugesh <sugesh.chand...@intel.com>; Simon Horman
>> >> <simon.hor...@netronome.com>; Yuanhan Liu <y...@fridaylinux.org>
>> >> Subject: [PATCH v7 3/6] netdev-dpdk: implement flow offload with
>> >> rte flow
>> >>
>> >> From: Finn Christensen <f...@napatech.com>
>> >>
>> >> The basic yet the major part of this patch is to translate the "match"
>> >> to rte flow patterns. And then, we create a rte flow with MARK +
>> >> RSS actions. Afterwards, all packets matching the flow will have the
>> >> mark id in the mbuf.
>> >>
>> >> The reason RSS is needed is, for most NICs, a MARK only action is
>> >> not allowed. It has to be used together with some other actions,
>> >> such as QUEUE, RSS, etc. However, QUEUE action can specify one
>> >> queue only, which may break the RSS. Likely, the RSS action is
>> >> currently the best we can do for now.
>> >> Thus, the RSS action is chosen.
>> >>
>> >> For any unsupported flows, such as MPLS, -1 is returned, meaning
>> >> the flow offload has failed and is then skipped.
>> >>
>> >> Co-authored-by: Yuanhan Liu <y...@fridaylinux.org>
>> >> Signed-off-by: Finn Christensen <f...@napatech.com>
>> >> Signed-off-by: Yuanhan Liu <y...@fridaylinux.org>
>> >
>> >Hi Finn,
>> >
>> >Thanks for working on this. There are compilation errors introduced
>> >by this commit, details below along with other comments.
>> >
>> >For completeness here is the link to the OVS Travis build with
>> >associated failures
>> >
>> >https://travis-ci.org/istokes/ovs/builds/352283122
>> >
>> >
>> >Thanks
>> >Ian
>> >
>> >> ---
>> >>
>> >> v7: - set the rss_conf for rss action to NULL, to workaround a mlx5
>> change
>> >>   in DPDK v17.11. Note that it will obey the rss settings
>> >> OVS-DPDK
>> has
>> >>   set in the beginning. Thus, nothing should be effected.
>> >>
>> >> ---
>> >>  lib/netdev-dpdk.c | 559
>> >> +-
>> >>  1 file changed, 558 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
>>

Re: [ovs-dev] [PATCH v7 3/6] netdev-dpdk: implement flow offload with rte flow

2018-03-16 Thread Finn Christensen
Hi Ian,

Thanks for the review.
I'll fix the compilation issues and the comments you have below.
Please see my answers below.

Thanks
Finn

>-Original Message-
>From: Stokes, Ian <ian.sto...@intel.com>
>Sent: 12. marts 2018 16:00
>To: 'Yuanhan Liu' <y...@fridaylinux.org>; d...@openvswitch.org
>Cc: Finn Christensen <f...@napatech.com>; Darrell Ball <db...@vmware.com>;
>Chandran, Sugesh <sugesh.chand...@intel.com>; Simon Horman
><simon.hor...@netronome.com>
>Subject: RE: [PATCH v7 3/6] netdev-dpdk: implement flow offload with rte
>flow
>
>> -Original Message-
>> From: Yuanhan Liu [mailto:y...@fridaylinux.org]
>> Sent: Monday, January 29, 2018 7:00 AM
>> To: d...@openvswitch.org
>> Cc: Stokes, Ian <ian.sto...@intel.com>; Finn Christensen
>> <f...@napatech.com>; Darrell Ball <db...@vmware.com>; Chandran, Sugesh
>> <sugesh.chand...@intel.com>; Simon Horman
>> <simon.hor...@netronome.com>; Yuanhan Liu <y...@fridaylinux.org>
>> Subject: [PATCH v7 3/6] netdev-dpdk: implement flow offload with rte
>> flow
>>
>> From: Finn Christensen <f...@napatech.com>
>>
>> The basic yet the major part of this patch is to translate the "match"
>> to rte flow patterns. And then, we create a rte flow with MARK + RSS
>> actions. Afterwards, all packets matching the flow will have the mark id
>> in the mbuf.
>>
>> The reason RSS is needed is, for most NICs, a MARK only action is not
>> allowed. It has to be used together with some other actions, such as
>> QUEUE, RSS, etc. However, QUEUE action can specify one queue only,
>> which may break the RSS. Likely, the RSS action is currently the best we
>> can do for now.
>> Thus, the RSS action is chosen.
>>
>> For any unsupported flows, such as MPLS, -1 is returned, meaning the
>> flow offload has failed and is then skipped.
>>
>> Co-authored-by: Yuanhan Liu <y...@fridaylinux.org>
>> Signed-off-by: Finn Christensen <f...@napatech.com>
>> Signed-off-by: Yuanhan Liu <y...@fridaylinux.org>
>
>Hi Finn,
>
>Thanks for working on this. There are compilation errors introduced by this
>commit, details below along with other comments.
>
>For completeness here is the link to the OVS Travis build with associated
>failures
>
>https://travis-ci.org/istokes/ovs/builds/352283122
>
>
>Thanks
>Ian
>
>> ---
>>
>> v7: - set the rss_conf for rss action to NULL, to workaround a mlx5 change
>>   in DPDK v17.11. Note that it will obey the rss settings OVS-DPDK has
>>   set in the beginning. Thus, nothing should be effected.
>>
>> ---
>>  lib/netdev-dpdk.c | 559
>> +-
>>  1 file changed, 558 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
>> ac2e38e..4bd0503
>> 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -38,7 +38,9 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>> +#include "cmap.h"
>>  #include "dirs.h"
>>  #include "dp-packet.h"
>>  #include "dpdk.h"
>> @@ -60,6 +62,7 @@
>>  #include "sset.h"
>>  #include "unaligned.h"
>>  #include "timeval.h"
>> +#include "uuid.h"
>>  #include "unixctl.h"
>>
>>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM}; @@ -3633,6 +3636,560
>@@
>> unlock:
>>  return err;
>>  }
>>
>> +/*
>> + * A mapping from ufid to dpdk rte_flow.
>> + */
>> +static struct cmap ufid_to_rte_flow = CMAP_INITIALIZER;
>> +
>> +struct ufid_to_rte_flow_data {
>> +struct cmap_node node;
>> +ovs_u128 ufid;
>> +struct rte_flow *rte_flow;
>> +};
>
>Might be cleaner to declare above along with the other structs/maps specific
>to netdev-dpdk.c towards the beginning of this file.

Ok done.

>
>> +
>> +/* Find rte_flow with @ufid */
>> +static struct rte_flow *
>> +ufid_to_rte_flow_find(const ovs_u128 *ufid) {
>> +size_t hash = hash_bytes(ufid, sizeof(*ufid), 0);
>> +struct ufid_to_rte_flow_data *data;
>> +
>> +CMAP_FOR_EACH_WITH_HASH (data, node, hash, &ufid_to_rte_flow) {
>> +if (ovs_u128_equals(*ufid, data->ufid)) {
>> +return data->rte_flow;
>> +}
>> +}
>> +
>> +return NULL;
>> +}
>> +
>> +static inline void
>> +ufid_to_rte_flow_asso

Re: [ovs-dev] Hardware Acceleration in OVS-DPDK

2018-02-27 Thread Finn Christensen
Hi All,

As agreed in the last meeting, I have created an OVS branch of our OVS fork
(from late January, v2.9) and added the partial HW offload proposal, followed by
our full offload extension as 3 additional commits.
It should compile against DPDK 17.11; for that to be possible, the existing
RTE_FLOW_ACTION_TYPE_VF action is (mis-)used to carry the port id to the PMD.
Furthermore, the tcp_flags update from the NIC is commented out due to the lack
of that field in the rte_flow query structure.
Please see this as a PoC; it is not yet ready to be an actual proposal, though
it is fully functional in our lab.

https://github.com/napatech/ovs/tree/hw-full-offload-v1
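A hedged sketch of that interim trick, with the structures taken from DPDK
17.11's rte_flow API and the dest_port_id parameter purely illustrative:

    #include <stdint.h>
    #include <rte_flow.h>

    /* Hedged sketch only: until DPDK grows a dedicated "port" action, the
     * existing VF action is (mis-)used so its id field carries the
     * destination port rather than a VF number. */
    static void
    sketch_build_output_action(struct rte_flow_action actions[2],
                               struct rte_flow_action_vf *vf,
                               uint32_t dest_port_id)
    {
        vf->original = 0;
        vf->id = dest_port_id;                 /* destination port, not a VF */
        actions[0].type = RTE_FLOW_ACTION_TYPE_VF;
        actions[0].conf = vf;
        actions[1].type = RTE_FLOW_ACTION_TYPE_END;
        actions[1].conf = NULL;
    }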

Further notes:
As mentioned at the last meeting, this proposal is based on vPorts on the NIC
(be it VF, virtio or another vPort type), handled completely outside OVS. The
vPorts are then connected and configured in OVS as "normal" type=dpdk ports. I
know this is not in line with Intel's proposal; however, we think it might be a
good idea. It makes things simpler in OVS, since we only need either
capabilities or trial & error to do transparent full HW offload.
Anyway, this is our current proposal for the next discussion meeting.

Thanks,
Finn

From: Chandran, Sugesh [mailto:sugesh.chand...@intel.com]
Sent: 20. februar 2018 19:18
To: d...@openvswitch.org; Darrell Ball <db...@vmware.com>; Simon Horman 
<simon.hor...@netronome.com>; Stokes, Ian <ian.sto...@intel.com>; Yuanhan Liu 
<y...@fridaylinux.org>; Finn Christensen <f...@napatech.com>; 'jiaquan song' 
<songjiaqua...@gmail.com>; 'pieter.jansenvanvuu...@netronome.com' 
<pieter.jansenvanvuu...@netronome.com>; Doherty, Declan 
<declan.dohe...@intel.com>; 'frikkie.scho...@netronome.com' 
<frikkie.scho...@netronome.com>; Bodireddy, Bhanuprakash 
<bhanuprakash.bodire...@intel.com>; Keane, Lorna <lorna.ke...@intel.com>; 
Giller, Robin <robin.gil...@intel.com>; Loftus, Ciara <ciara.lof...@intel.com>; 
Awal, Mohammad Abdul <mohammad.abdul.a...@intel.com>; Eelco Chaudron 
<echau...@redhat.com>
Subject: RE: Hardware Acceleration in OVS-DPDK

Hello All,

As discussed in the last meeting, I have created an OVS 2.7 fork with our
hardware acceleration implementation, as below.


https://github.com/sugchand/ovs.git (branch - 
dpdk-hw-accel-intel<https://github.com/sugchand/ovs/tree/dpdk-hw-accel-intel>)

Few points on the implementation.
1.   This implementation is just for reference to show the proposal.
2.   The code is still 2.7 based. We will merge to latest branch once we 
have finalized on the approach.
3.   Some of the hardware acceleration functionalities still missing in 
this implementation, such as flow offload thread , flow stat and tcp-flag 
handling. We are working on it to add those support.
4.   This implementation uses some of hardware specific APIs that are not 
yet available in the DPDK main tree. So the code may not build properly.

Please review the implementation (the last 12 commits). I will set up a
follow-up call to discuss this further.

Thank you!


Regards
_Sugesh


_
From: Chandran, Sugesh
Sent: Monday, January 29, 2018 1:23 PM
To: d...@openvswitch.org<mailto:d...@openvswitch.org>; Darrell Ball 
<db...@vmware.com<mailto:db...@vmware.com>>; Simon Horman 
<simon.hor...@netronome.com<mailto:simon.hor...@netronome.com>>; Stokes, Ian 
<ian.sto...@intel.com<mailto:ian.sto...@intel.com>>; Yuanhan Liu 
<y...@fridaylinux.org<mailto:y...@fridaylinux.org>>; 'Finn Christensen' 
<f...@napatech.com<mailto:f...@napatech.com>>; 'jiaquan song' 
<songjiaqua...@gmail.com<mailto:songjiaqua...@gmail.com>>; 
'pieter.jansenvanvuu...@netronome.com' 
<pieter.jansenvanvuu...@netronome.com<mailto:pieter.jansenvanvuu...@netronome.com>>;
 Doherty, Declan <declan.dohe...@intel.com<mailto:declan.dohe...@intel.com>>; 
'frikkie.scho...@netronome.com' 
<frikkie.scho...@netronome.com<mailto:frikkie.scho...@netronome.com>>; 
Bodireddy, Bhanuprakash 
<bhanuprakash.bodire...@intel.com<mailto:bhanuprakash.bodire...@intel.com>>; 
Keane, Lorna <lorna.ke...@intel.com<mailto:lorna.ke...@intel.com>>; Giller, 
Robin <robin.gil...@intel.com<mailto:robin.giller@inte
 l.com>>; Loftus, Ciara 
<ciara.lof...@intel.com<mailto:ciara.lof...@intel.com>>; Awal, Mohammad Abdul 
<mohammad.abdul.a...@intel.com<mailto:mohammad.abdul.a...@intel.com>>; Eelco 
Chaudron <echau...@redhat.com<mailto:echau...@redhat.com>>; NPG SW Data Plane 
Virtual Switching and FPGA 
<npg.sw.data.plane.virtual.switching.and.f...@intel.com<mailto:npg.sw.data.plane.virtual.switching.and.f...@intel.com>>
Subject: RE: Hardware Acceleration in OVS-DPDK


Thank you all for attending today's call.
I have updated the MOM in the following d

Re: [ovs-dev] [PATCH v5 1/5] dpif-netdev: associate flow with a mark id

2018-01-29 Thread Finn Christensen
Hi Yuanhan,

This will not break our PMD. I think PMDs should be able to handle rss_conf ==
NULL and then fall back to the default or the initially configured rss_conf.
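
For reference, a minimal sketch of building the RSS action with rss_conf left
NULL (DPDK 17.11 struct layout, with a flexible queue[] array); the queue
numbering and the helper name are illustrative only:

    #include <stdint.h>
    #include <stdlib.h>
    #include <rte_flow.h>

    /* Hedged sketch: size the action for rxq_n queues and leave rss_conf
     * NULL so the PMD falls back to the device's configured RSS settings. */
    static struct rte_flow_action_rss *
    sketch_build_rss_conf(uint16_t rxq_n)
    {
        struct rte_flow_action_rss *rss =
            malloc(sizeof *rss + rxq_n * sizeof(uint16_t));

        if (!rss) {
            return NULL;
        }
        rss->rss_conf = NULL;        /* reuse the RSS config set at dev init */
        rss->num = rxq_n;
        for (uint16_t i = 0; i < rxq_n; i++) {
            rss->queue[i] = i;       /* spread over all configured rx queues */
        }
        return rss;                  /* caller attaches it to an RSS action */
    }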

Finn

>-Original Message-
>From: Yuanhan Liu [mailto:y...@fridaylinux.org]
>Sent: 29. januar 2018 07:59
>To: Stokes, Ian <ian.sto...@intel.com>
>Cc: d...@openvswitch.org; Finn Christensen <f...@napatech.com>; Darrell Ball
><db...@vmware.com>; Chandran, Sugesh <sugesh.chand...@intel.com>;
>Simon Horman <simon.hor...@netronome.com>
>Subject: Re: [PATCH v5 1/5] dpif-netdev: associate flow with a mark id
>
>On Fri, Jan 26, 2018 at 04:19:30PM +0800, Yuanhan Liu wrote:
>> > > +static bool
>> > > +flow_mark_has_no_ref(uint32_t mark) {
>> > > +struct dp_netdev_flow *flow;
>> > > +
>> >
>> > Maybe I'm missing something below, but I expected a hash to be
>computed for mark before being called with CMAP_FOR_EACH_WITH_HASH?
>>
>> I treat "mark" as the hash, as it's with the same type of "hash" and
>> it's uniq. But you are probably right, it might be better to get the
>> hash by "hash_int". Will fix it.
>
>Oops, I forgot to do the corresponding change for the mark_to_flow find. Thus,
>the partial offload is completely skipped, as retrieving the flow from the mark
>would always fail (returning NULL).
>
>The reason I missed it while I was testing v6 is that I found the flow creation
>failed. I didn't really change anything between v5 and v6, and when I went back
>to v5, I also hit the same issue. I then thought it might be introduced by the
>OFED or firmware change I had done (for switching to other projects).
>I then thought it was something I could figure out later. Thus, v6 was sent out
>basically just with a build test.
>
>I then figured out today that the flow creation failure is introduced by a MLX5
>PMD driver change in DPDK v17.11. It looks like a bug to me. And there are 2
>solutions for that:
>
>- fix it in MLX5 PMD (if it's a bug); I was talking to the author made
>  such change.
>- set rss_conf to NULL, which will let DPDK to follow the one OVS-DPDK
>  has set in the beginning.
>
>I chose 2, since option 1 won't change the fact that it won't work with DPDK
>v17.11.
>
>And Finn, I probably need your help to verify that option 2 won't break Napa
>PMD drivers.
>
>I will send v7 soon. Please help review.
>
>Thanks.
>
>   --yliu
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow

2018-01-23 Thread Finn Christensen
Hi,
Just added the Napatech NIC support to the list.

Finn

---
diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
index 40f9d9649..442848870 100644
--- a/Documentation/howto/dpdk.rst
+++ b/Documentation/howto/dpdk.rst
@@ -727,3 +727,20 @@ devices to bridge ``br0``. Once complete, follow the below 
steps:
Check traffic on multiple queues::
 
$ cat /proc/interrupts | grep virtio
+
+.. _dpdk-flow-hardware-offload:
+
+Flow Hardware Offload (Experimental)
+------------------------------------
+
+The flow hardware offload is disabled by default and can be enabled by::
+
+$ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
+
+So far only partial flow offload is implemented. Moreover, it only
+works with PMD drivers that have the rte flow action "MARK + RSS" support.
+
+The validated NICs are:
+
+- Mellanox (ConnectX-4, ConnectX-4 Lx, ConnectX-5)
+- Napatech (NT200B01)
diff --git a/NEWS b/NEWS
index d7c83c21e..b9f9bd415 100644
--- a/NEWS
+++ b/NEWS
@@ -44,6 +44,7 @@ v2.9.0 - xx xxx 
- DPDK:
  * Add support for DPDK v17.11
  * Add support for vHost IOMMU
+ * Add experimental flow hardware offload support
  * New debug appctl command 'netdev-dpdk/get-mempool-info'.
  * All the netdev-dpdk appctl commands described in ovs-vswitchd man page.
  * Custom statistics:




>-Original Message-
>From: Yuanhan Liu [mailto:y...@fridaylinux.org]
>Sent: 17. januar 2018 11:02
>To: Stokes, Ian <ian.sto...@intel.com>
>Cc: d...@openvswitch.org; Finn Christensen <f...@napatech.com>; Darrell Ball
><db...@vmware.com>; Chandran, Sugesh <sugesh.chand...@intel.com>;
>Simon Horman <simon.hor...@netronome.com>
>Subject: Re: [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow
>
>On Mon, Jan 15, 2018 at 05:28:18PM +, Stokes, Ian wrote:
>> > Hi,
>> >
>> > Here is a joint work from Mellanox and Napatech, to enable the flow
>> > hw offload with the DPDK generic flow interface (rte_flow).
>> >
>> > The basic idea is to associate the flow with a mark id (a uint32_t
>> > number).
>> > Later, we then get the flow directly from the mark id, which could
>> > bypass some heavy CPU operations, including but not limited to mini
>> > flow extract, emc lookup, dpcls lookup, etc.
>> >
>> > The association is done with CMAP in patch 1. The CPU workload
>> > bypassing is done in patch 2. The flow offload is done in patch 3,
>> > which mainly does two things:
>> >
>> > - translate the ovs match to DPDK rte flow patterns
>> > - bind those patterns with a RSS + MARK action.
>> >
>> > Patch 5 makes the offload work happen in another thread, for leaving
>> > the datapath as light as possible.
>> >
>> > A PHY-PHY forwarding with 1000 mega flows (udp,tp_src=1000-1999) and
>> > 1 million streams (tp_src=1000-1999, tp_dst=2000-2999) shows more
>> > than a 260% performance boost.
>> >
>> > Note that it's disabled by default, which can be enabled by:
>> >
>> > $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>>
>> Hi Yuanhan,
>>
>> Thanks for working on this, I'll be looking at this over the coming week so
>don't consider this a full review.
>>
>> Just a general comment, at first glance there doesn't seem to be any
>documentation in the patchset?
>
>Right, my bad.
>
>> I would expect a patch for the DPDK section of the OVS docs at minimum
>detailing:
>>
>> (i) HWOL requirements (Any specific SW/drivers required or DPDK
>libs/PMDs etc. that have to be enabled for HWOL).
>> (ii) HWOL Usage (Enablement and disablement as shown above).
>> (iii) List of Validated HW devices (As HWOL functionality will vary between
>devices it would be good to provide some known 'verified' cards to use with it.
>At this stage we don't have to have every card that it will work with it but 
>I'd
>like to see a list of NIC Models that this has been validated on to date).
>> (iv) Any Known limitations.
>>
>> You'll also have to add an entry to the NEWS document to flag that HWOL
>has been introduced to OVS with DPDK.
>>
>> As discussed previously on the community call, at this point the feature
>should be marked experimental in both NEWS and the documentation, this is
>just to signify that it is subject to change as more capabilities such as full
>offload is added over time.
>
>For not flooding the mailing list, here I listed the changes I made.
>Please help review it. Also, please let me know whether I should send out a
>new version (with the doc change) soon or I should wait for you full review.
>
>Rega

Re: [ovs-dev] [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow

2018-01-18 Thread Finn Christensen
Yes, I did a review of this patch and gave comments back on it a while ago.
I have also done a lot of testing as a result of the performance analysis I
made for the PV and PVP cases.
These tests include full integration of the feature, meaning the HW offload
follows the addition/deletion of flows in the megaflow cache. So, addition and
deletion in a single-/multi-queue setup have been tested.
Furthermore, I'm working on this patchset for continued development, so it is
continuously being used by me as basic functionality.
So far, it has been running stably.

Finn


>-Original Message-
>From: Yuanhan Liu [mailto:y...@fridaylinux.org]
>Sent: 18. januar 2018 04:39
>To: Ian Stokes <ian.sto...@intel.com>
>Cc: Finn Christensen <f...@napatech.com>; d...@openvswitch.org; Darrell Ball
><db...@vmware.com>; Chandran Sugesh <sugesh.chand...@intel.com>;
>Simon Horman <simon.hor...@netronome.com>; Thomas Monjalon
><tho...@monjalon.net>; Olga Shern <ol...@mellanox.com>; Shahaf Shuler
><shah...@mellanox.com>
>Subject: Re: [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow
>
>I was told more tests and reviews are needed from yesterday's meeting.
>
>For testing, as you can see below, Finn already have done some performance
>testing covering quite few testcases. I have also done some very basic
>performance testing, basically just with rxonly.
>
>Besides that, I have actually spent a lot of effort on functional testing,
>to make sure it doesn't break anything. The limited testing
>includes:
>
>- 1 queue
>- 2 queues
>- 1 flow
>- 1000 flows
>- adding/deleting flows without traffic
>- adding/deleting flows with traffic
>- keep adding and then deleting flows with traffic
>
>For review, I think Finn from Napatech should also have done it. He has
>provided some valuable comments after all :)
>
>Thanks.
>
>   --yliu
>
>On Fri, Dec 22, 2017 at 11:28:45AM +, Finn Christensen wrote:
>> Hi,
>>
>> The addition of the offload thread is a great addition; furthermore, RSS
>action usage for target queues now works well with our NICs, I just had to do
>some additional driver work to make it all work.
>>
>> The support for RSS removed the problem with rte-flow queue selection,
>but RSS in general added an additional challenge as now multiple pmd's may
>request the same flow offload (as Yuanhan pointed out). The solution has
>worked without issues for me. I have run tests with 1 and 2 queues in a PV
>and PVP setup going over virtio.
>> Yuanhan mentioned tests in a P2P setup with high throughput acceleration. I
>have focused on a PVP scenario, which, in addition, crosses virtio, and
>most importantly, has acceleration in Rx direction only (seen from VM). It 
>still
>gives fairly good performance improvements.
>>
>> Here are my initial test results.
>>
>> Percentage improvements:
>> 1 RX/Tx queue:
>> 1 megaflow - 10K flows         PV: 73%
>> 10 megaflows - 100K flows      PVP: 50%    PV: 56%
>> 100 megaflows - 1M flows       PVP: 42%    PV: 49%
>> 1000 megaflows - 10M flows     PVP: 40%    PV: 42%
>>
>> With RSS: 2 Rx/Tx queue pairs:
>> 1 megaflow - 10K flows         PV: 55%
>> 10 megaflows - 100K flows      PVP: 36%    PV: 38%
>> 100 megaflows - 1M flows       PVP: 27%    PV: 24%
>> 1000 megaflows - 10M flows     PVP: 21%    PV: 24%
>>
>> PVP for 1 megaflow offload is omitted because my VM-app imposed a
>bottle-neck.
>>
>> Regards,
>> Finn
>>
>> -Original Message-
>> From: Yuanhan Liu [mailto:y...@fridaylinux.org]
>> Sent: 20. december 2017 15:45
>> To: d...@openvswitch.org
>> Cc: Finn Christensen <f...@napatech.com>; Darrell Ball
>> <db...@vmware.com>; Chandran Sugesh <sugesh.chand...@intel.com>;
>> Simon Horman <simon.hor...@netronome.com>; Ian Stokes
>> <ian.sto...@intel.com>; Yuanhan Liu <y...@fridaylinux.org>
>> Subject: [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow
>>
>> Hi,
>>
>> Here is a joint work from Mellanox and Napatech, to enable the flow hw
>> offload with the DPDK generic flow interface (rte_flow).
>>
>> The basic idea is to associate the flow with a mark id (a uint32_t number).
>> number).
>> Later, we then get the flow directly from the mark id, which could bypass
>> some heavy CPU operations, including but not limiting to mini flow 
>> extract,
>> emc lookup, dpcls lookup, etc.
>>
>> The association is done with CMAP in patch 1. The CPU workload
>bypassing
>> is done in patch 2. The flow offload is done in patch 3, which mainly 
>> doe

Re: [ovs-dev] [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow

2018-01-12 Thread Finn Christensen
>-Original Message-
>From: Yuanhan Liu [mailto:y...@fridaylinux.org]
>Sent: 11. januar 2018 15:11
>To: Chandran, Sugesh <sugesh.chand...@intel.com>
>Cc: Finn Christensen <f...@napatech.com>; d...@openvswitch.org; Darrell Ball
><db...@vmware.com>; Simon Horman <simon.hor...@netronome.com>;
>Stokes, Ian <ian.sto...@intel.com>
>Subject: Re: [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow
>
>On Wed, Jan 10, 2018 at 02:38:13PM +, Chandran, Sugesh wrote:
>> Hi Yuanhan,
>>
>> Thank you for your reply!..and for the wishes too :)
>>
>> At very high level, my comments are on the approach taken for enabling
>hardware acceleration in OVS-DPDK. In general, the proposal works very well
>for the specific partial offload use case. i.e alleviate the miniflow extract 
>and
>optimize the megaflow lookup. I totally agree on the performance advantage
>aspect of it. However I expect the hardware acceleration in OVS-DPDK should
>be in much more broad scope than just limited to a flow lookup optimization. I
>had raised a few comments around these point earlier when the previous
>versions of patches were released.
>>
>> Once again I am sharing my thoughts here to initiate a discussion on
>> this area. In my point of view, hardware acceleration in OVS-DPDK will
>> be,
>
>Thank you for sharing it!
>
>> 1)   Must understand the hardware and its capabilities. I would
>prefer a proactive hardware management than the reactive model. This
>approach let the software collect the relevant hardware information in
>advance before interacting with it. It will have impact on all the port 
>specific
>operations including port init, device programming and etc.
>
>Yes, I agree, and I think we have also discussed it before. Likely, doing 
>probes
>proactively seems to work.
>
>> 2)   Define hardware acceleration modes to support different
>acceleration methods. It will allow to work with various hardware devices
>based on its capabilities.

For 1) and 2):
I completely agree. I think this will be a key issue for full offload. The
challenge that we see with the next full-offload step is:
a. Per-port knowledge of which actions can be offloaded.
b. Based on this information (and whether the output action is offloadable),
full offload may be tried.
c. Fall back to partial offload.
Thus, how do we choose?
However, I think maybe the first full-offload patch could try full offload and,
if that fails, do partial (sketched below)?
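A hedged pseudocode sketch of that stepwise fallback; every type and function
name here is hypothetical, not from the patch series:

    /* Hedged sketch: try full offload first, then MARK + RSS partial
     * offload, and finally leave the flow entirely in software. */
    static void
    sketch_install_flow(struct offload_port *port, const struct match *match,
                        const struct nlattr *actions)
    {
        if (!try_full_offload(port, match, actions)) {   /* steps a + b */
            return;                                      /* NIC took it all */
        }
        if (!try_partial_offload(port, match)) {         /* step c */
            return;                                      /* MARK + RSS only */
        }
        /* Neither mode was accepted: the flow stays fully in software. */
    }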


>> 3)   Its nice to have to make OVS software to use device + port
>model. This is helpful when enumerating device bound characteristics (such as
>supported protocols, tcam size, and etc). Also it helps when handling flows
>that span across different hardware acceleration devices.
>> 4)   Having a device model as mentioned in (3) helps OVS to
>distribute hardware flow installation across multiple threads rather than just
>the single thread in
>the current implementation. Maybe it's possible to have one flow thread per
>device to avoid the contention.
>
>Single thread is chosen for simplicity, and I do think it could be extended to
>multiple threads when that becomes necessary.
>
>> 5)   The action translation and flow validation for the hardware
>acceleration has to be device centric. So having one set of functions may not
>work for other use cases. Consider another partial hardware acceleration
>approach 'zero copy' where packets are copied directly to the VM queues by
>the NIC. For zero copy acceleration,  the actions cannot be 'mark and RSS '
>instead it  will be specific queue action. Similarly different acceleration 
>mode
>will have its own validations and set of actions. May be its good to have
>function hooks for these operation based on acceleration mode.
>
>Sure, the idea case would be choose different flow actions (or combinations)
>based on different model. And I think it would be easy to do such change.
>
>>
>> Considering there is no DPDK support available to handle most of the
>above points, I don't mind pushing this patch into OVS mainline. However I
>would expect a refactoring of that code in the future to make it generic
>enough for supporting different use cases and modes. By having a generic
>approach it is possible to support various partial acceleration & full
>acceleration modes based on the available hardware.
>>
>
>Agree. And that's also what I have been thinking of. It's also easier to make
>decisions when we are at the step of doing real extension (say, adding the full
>offload).

Agree, we need to do this stepwise.

Here are some of my thoughts:
We see the next full-offload step as potentially containing (roughly):
a. Do full offload if possible, with fallback to partial (and finally none).
b. Differentiate b

Re: [ovs-dev] [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow

2018-01-12 Thread Finn Christensen
>-Original Message-
>From: Simon Horman [mailto:simon.hor...@netronome.com]
>Sent: 12. januar 2018 08:55
>To: Chandran, Sugesh <sugesh.chand...@intel.com>
>Cc: Yuanhan Liu <y...@fridaylinux.org>; Finn Christensen
><f...@napatech.com>; Darrell Ball <db...@vmware.com>; Stokes, Ian
><ian.sto...@intel.com>; Loftus, Ciara <ciara.lof...@intel.com>;
>d...@openvswitch.org
>Subject: Re: [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow
>
>On Wed, Jan 10, 2018 at 02:53:19PM +, Chandran, Sugesh wrote:
>> Hi All,
>>
>> Based on the comments below, I wanted to call out for a meeting to decide
>what further needs to be done to handle different kinds of hardware
>acceleration in OVS-DPDK. We have some POC implementation that partially
>handles some of the cases , But it only work with some of Intel hardware
>acceleration products. I also did a presentation about that proposal in the 
>last
>OVS conference. I feel its time to discuss now to come up with a design that
>work for all different hardware acceleration devices and possible acceleration
>modes.
>> If all are agree on the above suggestion, I can send out for a meeting invite
>sometime next week.(May be Jan15th, Monday 5:00PM GMT). Let me know
>if this timeslot is OK for everyone.
>>
>> I am ccing the 'dev' mailing list as well here in case if anyone else wanted 
>> to
>join the call and provide the inputs.
>
>As per the brief discussion of this at the OVS-DPDK meeting on Wednesday I
>agree this is a good idea. Please include me on the invitation (or point me to
>the invitation if its already out).

Please invite me also. I want to get back as soon as possible with our view of
full offload, but I haven't got around to it yet. We have been working on it for
a while and want to participate actively in this discussion.

Thanks,
Finn
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] netdev-dpdk: fix port addition for ports sharing same PCI id

2018-01-04 Thread Finn Christensen
   -Original Message-
From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
boun...@openvswitch.org] On Behalf Of Yuanhan Liu
Sent: 2. januar 2018 15:11
To: d...@openvswitch.org
Cc: Thomas Monjalon 
Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: fix port addition for ports
sharing same PCI id

ping ...

On Thu, Dec 21, 2017 at 12:03:36AM +0800, Yuanhan Liu wrote:
> Some NICs have only one PCI address associated with multiple ports.
> This patch extends the dpdk-devargs option's format to cater for such
devices.
>
> To achieve that, this patch uses a new syntax that will be adapted and
> implemented in future DPDK release (likely, v18.05):
> http://dpdk.org/ml/archives/dev/2017-December/084234.html
>
> And since it's the DPDK duty to parse the (complete and full) syntax
> and this patch is more likely to serve as an intermediate workaround,
> here I take a simpler and shorter syntax from it (note it's allowed to
> have only one category being provided):
> class=eth,mac=00:11:22:33:44:55:66
>
> Also, old compatibility is kept. Users can still go on with using the
> PCI id to add a port (if that's enough for them). Meaning, this patch
> will not break anything.

[Finn]
Our NICs also have this "feature" with multiple ports on the same PCI bus ID.
Agreed, this is DPDK functionality and should be removed when DPDK
fully supports a similar mechanism.
However, this temporary solution will be fine for our needs.

>
> This patch is basically based on the one from Ciara:
>
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
October/339496.htm
> l
>
> Cc: Loftus Ciara 
> Cc: Thomas Monjalon 
> Cc: Kevin Traynor 
> Signed-off-by: Yuanhan Liu 
> ---
>  lib/netdev-dpdk.c | 77
> ---
>  1 file changed, 62 insertions(+), 15 deletions(-)
>
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
> 45fcc74..4e5cc25 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1205,30 +1205,77 @@
netdev_dpdk_lookup_by_port_id(dpdk_port_t port_id)
>  return NULL;
>  }
>
> +static int
> +netdev_dpdk_str_to_ether(const char *mac, struct ether_addr *ea) {
> +unsigned int bytes[6];
> +int i;
> +
> +if (sscanf(mac, "%x:%x:%x:%x:%x:%x",
> +   &bytes[0], &bytes[1], &bytes[2],
> +   &bytes[3], &bytes[4], &bytes[5]) != 6) {
> +return -1;
> +}
> +
> +for (i = 0; i < 6; i++) {
> +ea->addr_bytes[i] = bytes[i];
> +}
> +
> +return 0;
> +}
> +
> +static dpdk_port_t
> +netdev_dpdk_get_port_by_mac(const char *mac_str) {
> +int i;
> +struct ether_addr mac;
> +struct ether_addr port_mac;
> +
> +netdev_dpdk_str_to_ether(mac_str, &mac);
> +for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
> +if (!rte_eth_dev_is_valid_port(i)) {
> +continue;
> +}
> +
> +rte_eth_macaddr_get(i, &port_mac);
> +if (is_same_ether_addr(&mac, &port_mac)) {
> +return i;
> +}
> +}
> +
> +return DPDK_ETH_PORT_ID_INVALID;
> +}
> +
>  static dpdk_port_t
>  netdev_dpdk_process_devargs(struct netdev_dpdk *dev,
>  const char *devargs, char **errp)  {
> -/* Get the name up to the first comma. */
> -char *name = xmemdup0(devargs, strcspn(devargs, ","));
> +char *name;
>  dpdk_port_t new_port_id = DPDK_ETH_PORT_ID_INVALID;
>
> -if (rte_eth_dev_get_port_by_name(name, &new_port_id)
> -|| !rte_eth_dev_is_valid_port(new_port_id)) {
> -/* Device not found in DPDK, attempt to attach it */
> -if (!rte_eth_dev_attach(devargs, &new_port_id)) {
> -/* Attach successful */
> -dev->attached = true;
> -VLOG_INFO("Device '%s' attached to DPDK", devargs);
> -} else {
> -/* Attach unsuccessful */
> -new_port_id = DPDK_ETH_PORT_ID_INVALID;
> -VLOG_WARN_BUF(errp, "Error attaching device '%s' to DPDK",
> -  devargs);
> +if (strncmp(devargs, "class=eth,mac=", 14) == 0) {
> +new_port_id = netdev_dpdk_get_port_by_mac(&devargs[14]);
> +} else {
> +name = xmemdup0(devargs, strcspn(devargs, ","));
> +if (rte_eth_dev_get_port_by_name(name, &new_port_id)
> +|| !rte_eth_dev_is_valid_port(new_port_id)) {
> +/* Device not found in DPDK, attempt to attach it */
> +if (!rte_eth_dev_attach(devargs, &new_port_id)) {
> +/* Attach 

Re: [ovs-dev] [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow

2017-12-22 Thread Finn Christensen
Hi,

The addition of the offload thread is an great addition, furthermore, RSS 
action usage for target queues, works now well with our NICs, I just had to do 
some additional drive work, to make it all work.

The support for RSS removed the problem with rte-flow queue selection, but RSS 
in general added an additional challenge as now multiple pmd's may request the 
same flow offload (as Yuanhan pointed out). The solution has worked without 
issues for me. I have run tests with 1 and 2 queues in a PV and PVP setup going 
over virtio. 
Yuanhan mentioned tests in a P2P setup with high throughput acceleration. I 
have focused on a PVP scenario, which, in addition, will cross virtio, and 
most importantly, has acceleration in Rx direction only (seen from VM). It 
still gives fairly good performance improvements.

Here are my initial test results.

Percentage improvements:
1 Rx/Tx queue:
1 megaflow - 10K flows        PV: 73%
10 megaflows - 100K flows     PVP: 50%    PV: 56%
100 megaflows - 1M flows      PVP: 42%    PV: 49%
1000 megaflows - 10M flows    PVP: 40%    PV: 42%

With RSS: 2 Rx/Tx queue pairs:
1 megaflow - 10K flows        PV: 55%
10 megaflows - 100K flows     PVP: 36%    PV: 38%
100 megaflows - 1M flows      PVP: 27%    PV: 24%
1000 megaflows - 10M flows    PVP: 21%    PV: 24%

PVP for 1 megaflow offload is omitted because my VM-app imposed a bottleneck.

Regards,
Finn

-Original Message-
From: Yuanhan Liu [mailto:y...@fridaylinux.org]
Sent: 20. december 2017 15:45
To: d...@openvswitch.org
Cc: Finn Christensen <f...@napatech.com>; Darrell Ball
<db...@vmware.com>; Chandran Sugesh <sugesh.chand...@intel.com>;
Simon Horman <simon.hor...@netronome.com>; Ian Stokes
<ian.sto...@intel.com>; Yuanhan Liu <y...@fridaylinux.org>
Subject: [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow

Hi,

Here is a joint work from Mellanox and Napatech, to enable the flow hw
offload with the DPDK generic flow interface (rte_flow).

The basic idea is to associate the flow with a mark id (a uint32_t number).
Later, we then get the flow directly from the mark id, which could bypass
some heavy CPU operations, including but not limited to miniflow extract,
emc lookup, dpcls lookup, etc.

The association is done with CMAP in patch 1. The CPU workload bypassing
is done in patch 2. The flow offload is done in patch 3, which mainly does
two things:

- translate the ovs match to DPDK rte flow patterns
- bind those patterns with a RSS + MARK action.
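
As a rough illustration of those two steps (a sketch only, not the patch code;
'patterns', 'rss_conf' and 'mark_id' are assumed to be built elsewhere from the
OVS match, the port's queue setup and the allocated mark):

#include <rte_flow.h>

/* Illustrative sketch only -- not the patch code. */
static struct rte_flow *
offload_with_mark_and_rss(uint16_t port_id,
                          const struct rte_flow_item patterns[],
                          const struct rte_flow_action_rss *rss_conf,
                          uint32_t mark_id)
{
    struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_action_mark mark = { .id = mark_id };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
        { .type = RTE_FLOW_ACTION_TYPE_RSS,  .conf = rss_conf },
        { .type = RTE_FLOW_ACTION_TYPE_END,  .conf = NULL },
    };
    struct rte_flow_error error;

    /* If creation fails, the flow simply stays in the software datapath. */
    return rte_flow_create(port_id, &attr, patterns, actions, &error);
}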

Patch 5 makes the offload work happen in another thread, for leaving the
datapath as light as possible.

A PHY-PHY forwarding with 1000 mega flows (udp,tp_src=1000-1999) and 1
million streams (tp_src=1000-1999, tp_dst=2000-2999) shows more than a
260% performance boost.

Note that it's disabled by default, which can be enabled by:

$ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true


v5: - fixed an issue that it took too long if we do flow add/remove
  repeatedly.
- removed an unused mutex lock
- turned most of the log level to DBG
- rebased on top of the latest code

v4: - use RSS action instead of QUEUE action with MARK
- make it work with multiple queue (see patch 1)
- rebased on top of latest code

v3: - The mark and id association is done with array instead of CMAP.
- Added a thread to do hw offload operations
- Removed macros completely
- dropped the patch to set FDIR_CONF, which is a workround some
  Intel NICs.
- Added a debug patch to show all flow patterns we have created.
- Misc fixes

v2: - workaround the queue action issue
- fixed the tcp_flags being skipped issue, which also fixed the
  build warnings
- fixed l2 patterns for Intel nic
- Converted some macros to functions
- did not hardcode the max number of flow/action
- rebased on top of the latest code

Thanks.

    --yliu
    

---
Finn Christensen (1):
  netdev-dpdk: implement flow offload with rte flow

Yuanhan Liu (4):
  dpif-netdev: associate flow with a mark id
  dpif-netdev: retrieve flow directly from the flow mark
  netdev-dpdk: add debug for rte flow patterns
  dpif-netdev: do hw flow offload in a thread

 lib/dp-packet.h   |  13 +
 lib/dpif-netdev.c | 473 ++-
 lib/flow.c| 155 +---
 lib/flow.h|   1 +
 lib/netdev-dpdk.c | 732
+-
 lib/netdev.h  |   6 +
 6 files changed, 1341 insertions(+), 39 deletions(-)

--
2.7.4


Re: [ovs-dev] [PATCH v4 1/5] dpif-netdev: associate flow with a mark id

2017-12-12 Thread Finn Christensen
Hi Yuanhan,

Nice work. I have started testing it on our NICs.
A few comments below.

Regards,
Finn Christensen

-Original Message-
From: Yuanhan Liu [mailto:y...@fridaylinux.org]
Sent: 27. november 2017 08:43
To: d...@openvswitch.org
Cc: Finn Christensen <f...@napatech.com>; Darrell Ball
<db...@vmware.com>; Chandran Sugesh <sugesh.chand...@intel.com>;
Simon Horman <simon.hor...@netronome.com>; Yuanhan Liu
<y...@fridaylinux.org>
Subject: [PATCH v4 1/5] dpif-netdev: associate flow with a mark id

Most modern NICs have the ability to bind a flow with a mark, so that
every pkt that matches such a flow will have that mark present in its desc.

The basic idea of doing that is, when we receive pkts later, we could
directly get the flow from the mark. That could avoid some very costly CPU
operations, including (but not limited to) miniflow_extract, emc lookup,
dpcls lookup, etc. Thus, performance could be greatly improved.

Thus, the major work of this patch is to associate a flow with a mark id (a
uint32_t number). The association in the netdev datapath is done by CMAP,
while in hardware it's done by the rte_flow MARK action.

One tricky thing in OVS-DPDK is, the flow table is per-PMD. For the case
there is only one phys port but with 2 queues, there could be 2 PMDs. In
other words, even for a single mega flow (i.e. udp,tp_src=1000), there
could be 2 different dp_netdev flows, one for each PMD. That could
result in the same mega flow being offloaded twice in the hardware;
worse, we may get 2 different marks and only the last one will work.

To avoid that, a megaflow_to_mark CMAP is created. An entry will be
added for the first PMD that wants to offload a flow. Later PMDs will see
that such a megaflow is already offloaded, so the flow will not be offloaded to
HW twice.

Meanwhile, the mark to flow mapping becomes a 1:N mapping. That is
what the mark_to_flow CMAP is for. The first PMD that wants to offload a
flow allocates a new mark and does the flow offload by reusing the
->flow_put method. When it succeeds, a "mark to flow" entry will be
added. Later PMDs will get the corresponding mark from the above
megaflow_to_mark CMAP. Then, another "mark to flow" entry will be
added.
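
As a rough sketch of that association logic (helper names and signatures below
are illustrative only, in dpif-netdev.c terms; they are not the actual patch
functions):

/* Assumed helpers over the megaflow_to_mark / mark_to_flow CMAPs and the
 * mark id pool described above (illustrative declarations). */
static uint32_t megaflow_to_mark_find(const ovs_u128 *mega_ufid);
static void megaflow_to_mark_insert(const ovs_u128 *mega_ufid, uint32_t mark);
static void mark_to_flow_insert(uint32_t mark, struct dp_netdev_flow *flow);
static uint32_t flow_mark_alloc(void);

static uint32_t
associate_flow_with_mark(const ovs_u128 *mega_ufid, struct dp_netdev_flow *flow)
{
    uint32_t mark = megaflow_to_mark_find(mega_ufid);   /* CMAP #1 lookup */

    if (mark == INVALID_FLOW_MARK) {
        /* First PMD offloading this megaflow: allocate a new mark. */
        mark = flow_mark_alloc();
        megaflow_to_mark_insert(mega_ufid, mark);        /* CMAP #1 */
    }
    /* Every PMD adds its own dp_netdev_flow under the same mark (1:N). */
    mark_to_flow_insert(mark, flow);                     /* CMAP #2 */
    return mark;
}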

Another thing worth mentioning is that the megaflow is created by
masking all the bytes from match->flow with match->wc. It works well so
far, but I have a feeling that is not the best way.

Co-authored-by: Finn Christensen <f...@napatech.com>
Signed-off-by: Yuanhan Liu <y...@fridaylinux.org>
Signed-off-by: Finn Christensen <f...@napatech.com>
---
 lib/dpif-netdev.c | 272
++
 lib/netdev.h  |   6 ++
 2 files changed, 278 insertions(+)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 0a62630..8579474
100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -77,6 +77,7 @@
 #include "tnl-ports.h"
 #include "unixctl.h"
 #include "util.h"
+#include "uuid.h"

 VLOG_DEFINE_THIS_MODULE(dpif_netdev);

@@ -442,7 +443,9 @@ struct dp_netdev_flow {
 /* Hash table index by unmasked flow. */
 const struct cmap_node node; /* In owning dp_netdev_pmd_thread's
*/
  /* 'flow_table'. */
+const struct cmap_node mark_node; /* In owning flow_mark's
+ mark_to_flow */
 const ovs_u128 ufid; /* Unique flow identifier. */
+const ovs_u128 mega_ufid;
 const unsigned pmd_id;   /* The 'core_id' of pmd thread owning this
*/
  /* flow. */

@@ -453,6 +456,7 @@ struct dp_netdev_flow {
 struct ovs_refcount ref_cnt;

 bool dead;
+uint32_t mark;   /* Unique flow mark assigned to a flow */

 /* Statistics. */
 struct dp_netdev_flow_stats stats;
@@ -1854,6 +1858,175 @@ dp_netdev_pmd_find_dpcls(struct
dp_netdev_pmd_thread *pmd,
 return cls;
 }

+#define MAX_FLOW_MARK   (UINT32_MAX - 1)
+#define INVALID_FLOW_MARK   (UINT32_MAX)
+
+struct megaflow_to_mark_data {
+const struct cmap_node node;
+ovs_u128 mega_ufid;
+uint32_t mark;
+};
+
+struct flow_mark {
+struct cmap megaflow_to_mark;
+struct cmap mark_to_flow;
+struct id_pool *pool;
+struct ovs_mutex mutex;

[Finn] Is this mutex needed? - the structure seems to be used by the offload 
thread only.

+};
+
+struct flow_mark flow_mark = {
+.megaflow_to_mark = CMAP_INITIALIZER,
+.mark_to_flow = CMAP_INITIALIZER,
+.mutex

Re: [ovs-dev] [PATCH v3 0/9] OVS-DPDK flow offload with rte_flow

2017-09-27 Thread Finn Christensen
-Original Message-
From: Yuanhan Liu [mailto:y...@fridaylinux.org]
Sent: 27. september 2017 11:47
To: Finn Christensen <f...@napatech.com>
Cc: Darrell Ball <db...@vmware.com>; d...@openvswitch.org; Chandran
Sugesh <sugesh.chand...@intel.com>; Simon Horman
<simon.hor...@netronome.com>
Subject: Re: [PATCH v3 0/9] OVS-DPDK flow offload with rte_flow

On Wed, Sep 27, 2017 at 09:12:49AM +, Finn Christensen wrote:
>
> [Darrell] It looks fine; of course, if we could drop dependencies on cap
> probe, it would be ideal (.
>
>
> [Finn]
> From a Napatech point of view, we would definitely want to use the
> RTE_FLOW_ACTION_TYPE_RSS if it were implemented. It definitely makes
> good sense to me. I have 2 reasons for this:
> 1. It would fit nicely into the way Napatech does flow programming in nic
> 2. We would be fully RSS controlled through OVS
> Furthermore, future RSS splitting on a per megaflow basis will be possible.
> IMO the "pure_mark_cap_probed" function is a temp work-around to make it work.
> The RSS seems to be a solution to the queue problem.

Finn, that's really good to know! I do agree without this probe, it makes the
code simpler and cleaner.

Few questions though. Has Napatech already implemented the rss action? If
not, what's the plan? 

[Finn]
Our nic handles rss on a per flow basis, but we have not yet used the rte rss action
for OVS. In OVS we simply handle RSS config outside it.
The full control of rss through OVS is better though.

And are you okay with following code?

add_mark_action();
add_rss_action_unconditionally();

flow = rte_create_flow(...);
if (!flow)
return -1;

That is, no probes, no feedbacks. If it failed, it just failed (flow offload then
is just skipped).

[Finn]
Yes, we can easily make this work. Good suggestion!


But I do think some feedback is necessary. If we know the rte_flow cap
in the beginning, we could simply disable the flow offload for a specific port
if we know it doesn't have offload ability. This would avoid the repeated
effort of trying to do hw offload completely.

[Finn]
This seems to be a good idea.


--yliu
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3 0/9] OVS-DPDK flow offload with rte_flow

2017-09-27 Thread Finn Christensen
-Original Message-
From: Darrell Ball [mailto:db...@vmware.com]
Sent: 27. september 2017 05:54
To: Yuanhan Liu <y...@fridaylinux.org>
Cc: d...@openvswitch.org; Finn Christensen <f...@napatech.com>;
Chandran Sugesh <sugesh.chand...@intel.com>; Simon Horman
<simon.hor...@netronome.com>
Subject: Re: [PATCH v3 0/9] OVS-DPDK flow offload with rte_flow



On 9/26/17, 8:27 PM, "Yuanhan Liu" <y...@fridaylinux.org> wrote:

On Wed, Sep 27, 2017 at 03:13:33AM +, Darrell Ball wrote:
>
>
> On 9/26/17, 6:25 PM, "Yuanhan Liu" <y...@fridaylinux.org> wrote:
>
> On Tue, Sep 26, 2017 at 09:58:16PM +, Darrell Ball wrote:
> > The second major change is there is a thread introduced to do the real
> > flow offload. This is for diminishing the overhead by hw flow offload
> > installation/deletion at data path. See patch 9 for more detailed info.
> >
> > [Darrell] There might be other options to handle this such as dividing time
> > to HWOL in a given PMD.
> > Another option to have PMD isolation.
>
> Good to know, though I'm not sure I understand it so far. But it can be
> discussed later. That's also the reason I put it in the last patch, so
> that we could easily turn it to another solution, or even simpler (just
> remove it).
>
> > In the last discussion, RSS action was suggested to use to get rid of
> > the QUEUE action workaround. Unfortunately, it didn't work. The flow
> > creation failed with MLX5 PMD driver, and that's the only driver in
> > DPDK that supports RSS action so far.
> >
> >
> > [Darrell]
> > I wonder if we should take a pause before jumping into the next version of the code.
>
> I have no objections.
>
> > If workarounds are required in OVS code, then so be it as long as they are not
> > overly disruptive to the existing code and hard to maintain.
> > In the case of RTE_FLOW_ACTION_TYPE_RSS, we might have a reasonable option
> > to avoid some unpleasant OVS workarounds.
> > This would make a significant difference in the code paths, if we supported it, so
> > we need to be sure as early as possible.
>
> I agree.
>
> > The support needed would be in some drivers and seems reasonably doable.
> > Moreover, this was discussed in the last dpdk meeting and the support was
> > indicated as existing?, although I only verified the MLX code, myself.
>
> From the code point of view, yes, the code was there months ago.
>
> > I had seen the MLX code supporting _RSS action and there are some checks for
> > supported cases; when you say “it didn't work”, what was the issue?
>
> I actually have met the error months ago, I even debugged it. IIRC,
> the error is from libibverbs, when trying to create the hw flow. I
> think I need double-confirm it with our colleague who developed this
> feature.
>
> > Let us have a discussion also about the Intel nic side and the Napatech side.
>
> I think it's not necessary for Napatech, because they can support
> MARK only action.
>
> It is not necessary for Napatech; however to avoid the need to detect the Napatech
> special (good) case, ‘ideally’ we do the exact same programming even in Napatech case.

Will following look okay to you (assuming we have the probe ability and DPDK
has some basic capability feedback)?

if (!pure_mark_cap_probed) {
    if (dpdk_rte_flow_has_rss_action_support) {
        append_rss_action();
    } else {
        fail and return;
    }
}

/* create flow once, with no retries, if fails, let it fail */
flow = rte_flow_create(...);

I assume that's what the code will finally look like. What do you think?


[Darrell] It looks fine; of course, if we could drop dependencies on cap
probe, it would be ideal (.


[Finn]

Re: [ovs-dev] [PATCH v2 8/8] netdev-dpdk: set FDIR config

2017-09-21 Thread Finn Christensen


-Original Message-
From: Darrell Ball [mailto:db...@vmware.com]
Sent: 21. september 2017 11:00
To: Yuanhan Liu <y...@fridaylinux.org>
Cc: Chandran, Sugesh <sugesh.chand...@intel.com>; Finn Christensen
<f...@napatech.com>; d...@openvswitch.org
Subject: Re: [ovs-dev] [PATCH v2 8/8] netdev-dpdk: set FDIR config



On 9/21/17, 1:54 AM, "Yuanhan Liu" <y...@fridaylinux.org> wrote:

On Thu, Sep 21, 2017 at 08:04:45AM +, Darrell Ball wrote:
> Hi Yuanhan/Finn
>
> I think we may need to caveat the Fortville nics due to the global mask
> limitation;

Sorry, I didn't follow you. Like how? Or what specifically I could/should do?


I meant to say that we would not need this patch 8, since it can only allow
exact match anyways.
This would not fit well with the other nics support and the overall design.
We would also add some comments to the documentation describing the
non-support for Fortville for the feature.

Darrell

[Finn] Agreed. This will not work well.

--yliu

> we also discussed this in the dpdk meeting yesterday.
>
> What do you think ?
>
> Thanks Darrell
>
> On 9/20/17, 6:47 AM, "Chandran, Sugesh"
<sugesh.chand...@intel.com> wrote:
>
>
>
> Regards
> _Sugesh
>
>
> > -Original Message-
> > From: Darrell Ball [mailto:db...@vmware.com]
> > Sent: Thursday, September 14, 2017 6:46 PM
>     > To: Chandran, Sugesh <sugesh.chand...@intel.com>; Yuanhan Liu
> > <y...@fridaylinux.org>
> > Cc: Finn Christensen <f...@napatech.com>; d...@openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH v2 8/8] netdev-dpdk: set FDIR 
config
> >
> >
> >
> > On 9/14/17, 10:36 AM, "Chandran, Sugesh"
<sugesh.chand...@intel.com>
> > wrote:
> >
> >
> >
> > Regards
> > _Sugesh
> >
> > > -Original Message-
> > > From: Yuanhan Liu [mailto:y...@fridaylinux.org]
> > > Sent: Thursday, September 14, 2017 4:19 AM
> > > To: Darrell Ball <db...@vmware.com>
> > > Cc: Finn Christensen <f...@napatech.com>; Chandran, Sugesh
> > > <sugesh.chand...@intel.com>; d...@openvswitch.org
> > > Subject: Re: [ovs-dev] [PATCH v2 8/8] netdev-dpdk: set 
FDIR
config
> > >
> > > On Wed, Sep 13, 2017 at 01:57:22AM +, Darrell Ball 
wrote:
> > > >
> > > >
> > > > On 9/11/17, 1:14 AM, "ovs-dev-boun...@openvswitch.org
on behalf of
> > > Finn Christensen" <ovs-dev-boun...@openvswitch.org on
behalf of
> > > f...@napatech.com> wrote:
> > > >
> > > > -Original Message-
> > > > From: ovs-dev-boun...@openvswitch.org 
[mailto:ovs-dev-
> > > boun...@openvswitch.org] On Behalf Of Yuanhan Liu
> > > > Sent: 11. september 2017 09:55
> > > > To: Chandran, Sugesh <sugesh.chand...@intel.com>
> > > > Cc: d...@openvswitch.org
> > > > Subject: Re: [ovs-dev] [PATCH v2 8/8] netdev-dpdk: 
set
FDIR config
> > > >
> > > > On Mon, Sep 11, 2017 at 07:42:57AM +, Chandran,
Sugesh wrote:
> > > > >
> > > > >
> > > > > Regards
> > > > > _Sugesh
    >     > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: ovs-dev-boun...@openvswitch.org 
[mailto:ovs-
dev-
> > > > > > boun...@openvswitch.org] On Behalf Of Yuanhan 
Liu
> > > > > > Sent: Tuesday, September 5, 2017 10:23 AM
> > > > > > To: d.

Re: [ovs-dev] [PATCH v2 6/8] netdev-dpdk: retry with queue action

2017-09-15 Thread Finn Christensen


-Original Message-
From: Darrell Ball [mailto:db...@vmware.com]
Sent: 15. september 2017 20:09
To: Finn Christensen <f...@napatech.com>
Cc: d...@openvswitch.org
Subject: Re: [ovs-dev] [PATCH v2 6/8] netdev-dpdk: retry with queue action



On 9/15/17, 2:16 AM, "Finn Christensen" <f...@napatech.com> wrote:



-Original Message-
From: Darrell Ball [mailto:db...@vmware.com]
Sent: 14. september 2017 21:35
    To: Finn Christensen <f...@napatech.com>
Cc: d...@openvswitch.org
Subject: Re: [ovs-dev] [PATCH v2 6/8] netdev-dpdk: retry with queue action



    On 9/14/17, 1:14 AM, "Finn Christensen" <f...@napatech.com> wrote:





-Original Message-
From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Darrell Ball
Sent: 13. september 2017 18:18
To: Simon Horman <simon.hor...@netronome.com>
Cc: d...@openvswitch.org
Subject: Re: [ovs-dev] [PATCH v2 6/8] netdev-dpdk: retry with queue action

On 9/13/17, 2:57 AM, "Simon Horman" <simon.hor...@netronome.com> wrote:

On Tue, Sep 12, 2017 at 08:36:19AM +, Darrell Ball wrote:
>
> On 9/10/17, 11:14 PM, "ovs-dev-boun...@openvswitch.org on behalf of Yuanhan Liu" <ovs-dev-boun...@openvswitch.org on behalf of y...@fridaylinux.org> wrote:
>
> On Fri, Sep 08, 2017 at 06:48:50PM +0200, Simon Horman wrote:
> > On Tue, Sep 05, 2017 at 05:22:59PM +0800, Yuanhan Liu wrote:
> > > From: Finn Christensen <f...@napatech.com>
> > >
> > > AFAIK, most (if not all) NICs (including Mellanox and Intel) do not
> > > support a pure MARK action.  It's required to be used together with
> > > some other actions, like QUEUE.
> > >
> > > To workaround it, retry with a queue action when first try failed.
> > >
> > > Moreover, some Intel's NIC (say XL710) needs the QUEUE action set
> > > before the MARK action.
> > >
> > > Co-authored-by: Yuanhan Liu <y...@fridaylinux.org>
> > > Signed-off-by: Finn Christensen <f...@napatech.com>
> > > Signed-off-by: Yuanhan Liu <y...@fridaylinux.org>
> >
> > This feels a bit like the tail wagging the dog.
> > Is this the lowest level at which it makes sense to implement
> > this logic?
> >
> > If so then I wonder if some sort of probing would be in order
> > to avoid the cost of trying to add the flow twice to hardware
> > where the queue is required.
>
> Do you mean something like rte_flow capability query, like whether
> a queue action is needed for a mark action? If so, yes, I do think
> we miss an interface like this.
>
> Note that even in this solution, the flow won't be created twice
> to the hardware, because the first try would be failed.
>
> [Darrell]
>
>   Having an api to query capability and avoid the first try to HW would be nice, but there are dependencies
>   on RTE, drivers etc and I don't know definitively what the api would be.
>
>   Also, as nics are added this capability needs to be done and state needs to be kept i

Re: [ovs-dev] [PATCH v2 6/8] netdev-dpdk: retry with queue action

2017-09-14 Thread Finn Christensen


-Original Message-
From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-boun...@openvswitch.org] 
On Behalf Of Darrell Ball
Sent: 13. september 2017 18:18
To: Simon Horman <simon.hor...@netronome.com>
Cc: d...@openvswitch.org
Subject: Re: [ovs-dev] [PATCH v2 6/8] netdev-dpdk: retry with queue action



On 9/13/17, 2:57 AM, "Simon Horman" <simon.hor...@netronome.com> wrote:

On Tue, Sep 12, 2017 at 08:36:19AM +, Darrell Ball wrote:
> 
> On 9/10/17, 11:14 PM, "ovs-dev-boun...@openvswitch.org on behalf of Yuanhan Liu" <ovs-dev-boun...@openvswitch.org on behalf of y...@fridaylinux.org> wrote:
> 
> On Fri, Sep 08, 2017 at 06:48:50PM +0200, Simon Horman wrote:
> > On Tue, Sep 05, 2017 at 05:22:59PM +0800, Yuanhan Liu wrote:
> > > From: Finn Christensen <f...@napatech.com>
> > > 
> > > AFAIK, most (if not all) NICs (including Mellanox and Intel) do not
> > > support a pure MARK action.  It's required to be used together with
> > > some other actions, like QUEUE.
> > > 
> > > To workaround it, retry with a queue action when first try failed.
> > > 
> > > Moreover, some Intel's NIC (say XL710) needs the QUEUE action set
> > > before the MARK action.
> > > 
> > > Co-authored-by: Yuanhan Liu <y...@fridaylinux.org>
> > > Signed-off-by: Finn Christensen <f...@napatech.com>
> > > Signed-off-by: Yuanhan Liu <y...@fridaylinux.org>
> > 
> > This feels a bit like the tail wagging the dog.
> > Is this the lowest level at which it makes sense to implement
> > this logic?
> > 
> > If so then I wonder if some sort of probing would be in order
> > to avoid the cost of trying to add the flow twice to hardware
> > where the queue is required.
> 
> Do you mean something like rte_flow capability query, like whether
> a queue action is needed for a mark action? If so, yes, I do think
> we miss an interface like this.
> 
> Note that even in this solution, the flow won't be created twice
> to the hardware, because the first try would be failed.
> 
> [Darrell]
> 
>   Having an api to query capability and avoid the first try to HW would be nice, but there are dependencies
>   on RTE, drivers etc and I don't know definitively what the api would be.
> 
>   Also, as nics are added this capability needs to be done and state needs to be kept in all cases.
> 
>It is an enhancement and if done should be reliable.

Agreed. Though I was more thinking of probing the hardware rather than
having a capability API - I expect this would remove several of the
dependencies you describe above.


[Darrell] I have been pondering the probing option as well. It is certainly a 
valid option; we use it in other cases such as datapath probing. One of the 
aspects that worries me here is maintaining the correct per interface 
(essentially; although the attribute is per nic) state across various events 
such as new ports being added, vswitchd restarts, races with flow creation. It 
would be non-trivial I guess and probably appropriate for the next patch 
series, if done.

In this case, we have what seems like a clear distinction b/w Napatech which 
does not need the queue action workaround and everything else, which does.
Besides the non-Napatech behavior, which is worrisome, maintaining the 
difference for flow handling under the covers is concerning.

I wonder if we should be upfront as possible here and just have a dpdk 
interface configuration – maybe something like “supports native HWOL mark 
action” since the better behavior is the exception?
The interface config would be more robust than probing.
This would need documentation, of course.

[Finn]
I think the rte queue action should never be used here when using partial HWOL. 
Not the way OVS handles multi queues today. 
Maybe a "default queue" could be used in the dpdk PMDs when no queue is 
specified in rte flow?
Essentially, this is a mismatch between the rte flow impl functionality in PMD 
and the needs in OVS for flow classification, in a partial HWOL setup.
However, we would not mind Darrell's proposal; it makes sense since most nic PMDs 
unfortunately need this and it is only relevant for partial HWOL. We are 
mostly concerned with full HWOL and therefore see partial HWOL as a failover 
when full HWOL cannot be handled in the nic, if enabled.


I think anyways we need documentation describing the difference b/w nics in the 
dp

Re: [ovs-dev] [PATCH v2 8/8] netdev-dpdk: set FDIR config

2017-09-11 Thread Finn Christensen
-Original Message-
From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-boun...@openvswitch.org] 
On Behalf Of Yuanhan Liu
Sent: 11. september 2017 09:55
To: Chandran, Sugesh <sugesh.chand...@intel.com>
Cc: d...@openvswitch.org
Subject: Re: [ovs-dev] [PATCH v2 8/8] netdev-dpdk: set FDIR config

On Mon, Sep 11, 2017 at 07:42:57AM +, Chandran, Sugesh wrote:
> 
> 
> Regards
> _Sugesh
> 
> 
> > -Original Message-
> > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev- 
> > boun...@openvswitch.org] On Behalf Of Yuanhan Liu
> > Sent: Tuesday, September 5, 2017 10:23 AM
> > To: d...@openvswitch.org
> > Subject: [ovs-dev] [PATCH v2 8/8] netdev-dpdk: set FDIR config
> > 
> > From: Finn Christensen <f...@napatech.com>
> > 
> > The Intel i40e PMD driver requires the fdir mode set to 
> > RTE_FDIR_MODE_PERFECT, otherwise, the flow creation would be failed.
> [Sugesh] this means it doesn't honor the flow masks which passed onto 
> rte_flow_*?

IIRC, that's what I found after digging into the code. It's an issue reported/fixed by 
Finn. I also don't have the nic for testing.

[Finn] Yes, this was needed to make our test setup using an XL710 work with 
the rte_flow implementation.
It was a while ago, so I don't exactly remember how we ended up with this 
solution. However, we are definitely not Intel XL710 experts, so there might be 
other ways to achieve the rte_flow functionality.
This issue, and the problem raised about the overall impact of this configuration 
change on NICs using this setting (Napatech does not use it), should I think be 
reviewed/verified by the NIC vendors using it.

--yliu
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 0/8] OVS-DPDK flow offload with rte_flow

2017-09-06 Thread Finn Christensen

-Original Message-
From: Patil, Harish [mailto:harish.pa...@cavium.com] 
Sent: 6. september 2017 08:34
To: Yuanhan Liu <y...@fridaylinux.org>; d...@openvswitch.org
Cc: Finn Christensen <f...@napatech.com>; db...@vmware.com
Subject: Re: [ovs-dev] [PATCH v2 0/8] OVS-DPDK flow offload with rte_flow



-Original Message-
From: <ovs-dev-boun...@openvswitch.org> on behalf of Yuanhan Liu 
<y...@fridaylinux.org>
Date: Tuesday, September 5, 2017 at 2:22 AM
To: "d...@openvswitch.org" <d...@openvswitch.org>
Subject: [ovs-dev] [PATCH v2 0/8] OVS-DPDK flow offload with rte_flow

>Hi,
>
>Here is a joint work from Mellanox and Napatech, to enable the flow hw 
>offload with the DPDK generic flow interface (rte_flow).
>
>The basic idea is to associate the flow with a mark id (a uint32_t 
>number).
>Later, we then get the flow directly from the mark id, bypassing the 
>heavy emc processing, including miniflow_extract.
>
>The association is done with CMAP in patch 1. It also reuses the flow 
>APIs introduced while adding the tc offloads. The emc bypassing is done 
>in patch 2. The flow offload is done in patch 4, which mainly does two
>things:
>
>- translate the ovs match to DPDK rte flow patterns
>- bind those patterns with a MARK action.
>
>Afterwards, the NIC will set the mark id in every pkt's mbuf when it 
>matches the flow. That's basically how we could get the flow directly 
>from the received mbuf.
>
>While testing with PHY-PHY forwarding with one core and one queue, I 
>got about 54% performance boost. For PHY-vhost forwarding, I got about 
>41% performance boost. The reason it's lower than v1 is I added the 
>logic to get the correct tcp_flags, which examines all packets received.
>
>The major issue mentioned in last version is also workarounded: the 
>queue index is never set to 0 blindly anymore, but set to the rxq that 
>first receives the upcall pkt.
>
>Note that it's disabled by default, which can be enabled by:
>
>$ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>
>
>v2: - workaround the queue action issue
>- fixed the tcp_flags being skipped issue, which also fixed the
>  build warnings
>- fixed l2 patterns for Intel nic
>- Converted some macros to functions
>- did not hardcode the max number of flow/action
>- rebased on top of the lastest code
>
>Thanks.
>
>--yliu
>
>
>---
>Finn Christensen (3):
>  netdev-dpdk: implement flow put with rte flow
>  netdev-dpdk: retry with queue action
>  netdev-dpdk: set FDIR config
>
>Shachar Beiser (1):
>  dpif-netdev: record rx queue id for the upcall
>
>Yuanhan Liu (4):
>  dpif-netdev: associate flow with a mark id
>  dpif-netdev: retrieve flow directly from the flow mark
>  netdev-dpdk: convert ufid to dpdk flow
>  netdev-dpdk: remove offloaded flow on deletion
>
> lib/dp-packet.h   |  14 ++
> lib/dpif-netdev.c | 132 +++--
> lib/flow.c|  78 
> lib/flow.h|   1 +
> lib/netdev-dpdk.c | 574
>+-
> lib/netdev.c  |   1 +
> lib/netdev.h  |   7 +
> 7 files changed, 795 insertions(+), 12 deletions(-)
>
>--
>2.7.4
>

Hi all,

Can you please confirm that you are supporting offloading of both the EMC
flows and DPCLs (megaflows) here, i.e. OVS would skip hash table lookups
in both the cases if UFID is provided in the MBUF. Assuming that is
correct, when a match is found in dpcls, does OVS insert that new flow
back into the EMC cache?

Thanks,
Harish

[Finn]
Yes, you are correct. Once the megaflow is offloaded into the NIC, using the flow 
UFID, the EMC and megaflow cache (dpcls) lookups are skipped when a UFID is 
received in the mbuf. When receiving these pre-classified packets the EMC is not 
needed. However, the initial packet that creates the megaflow (and then also 
creates the NIC rte flow) will be inserted into the EMC.
But new flows that would use the same megaflow, yet would create a different EMC 
entry, will not be inserted/created in the EMC when offloaded by the NIC.
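
A minimal sketch of that receive-side shortcut, assuming a mark_to_flow_find()
style helper over the mark-to-flow map (illustrative, not the patch code):

#include <rte_mbuf.h>

struct dp_netdev_flow;
static struct dp_netdev_flow *mark_to_flow_find(uint32_t mark);  /* assumed helper */

/* Illustrative sketch: if the NIC delivered a flow mark, use it to find the
 * dp_netdev_flow directly; otherwise fall back to the normal
 * miniflow_extract + EMC/dpcls path. */
static struct dp_netdev_flow *
flow_from_mark(const struct rte_mbuf *m)
{
    if (m->ol_flags & PKT_RX_FDIR_ID) {
        return mark_to_flow_find(m->hash.fdir.hi);
    }
    return NULL;
}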



>

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 0/7 RFC] OVS-DPDK flow offload with rte_flow

2017-08-30 Thread Finn Christensen
Hi,
Just wanted to add one comment to this correspondence. See below.

Thanks,
Finn Christensen

-Original Message-
From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-boun...@openvswitch.org] 
On Behalf Of Yuanhan Liu
Sent: 30. august 2017 04:25
To: Darrell Ball <db...@vmware.com>
Cc: d...@openvswitch.org
Subject: Re: [ovs-dev] [PATCH 0/7 RFC] OVS-DPDK flow offload with rte_flow

On Wed, Aug 30, 2017 at 01:32:10AM +, Darrell Ball wrote:
> 
> > Though that being said, this patchset still has issues unresolved. The
> > major issue is that maybe most NIC (for instance, Mellanox and Intel)
> > can not support a pure MARK action. It has to be used together with a
> > QUEUE action, which in turn needs a queue index. That comes to the issue:
> > the queue index is not given in the flow context. To make it work, patch
> > 5 just set the queue index to 0, which is obviously wrong. One possible
> > solution is to record the rxq and pass it down to the flow creation
> > stage. It would be much better, but it's still far away from being perfect.
> > Because it might have changed the steering rules stealthily, which may
> > break the default RSS setup by OVS-DPDK.
> > 
> > If this cannot be solved by removing this restriction, I guess another
> > alternative is to actively manage flow-queue associations.
> 
> do you mean let user provide the set_queue action?
> 
> I mean in the worst case, if the restriction cannot be lifted, we 
> might need to do flow distribution across queues with additional 
> added logic/tracking,

Like the way mentioned above: record the rxq at recv stage and pass it down to 
flow_put?

> because we would not really want to keep this restriction without a 
> workaround.

For sure.

> This would be a fair bit of work, however.

Indeed. Another reason I don't really like above solution is it changes the 
core logic of OVS (just for OVS-DPDK use case). What's worse, it doesn't even 
solve the issue perfectly: it's still a workaround!

That being said, if we have to provide a workaround, I think maybe a simple 
round-robin queue distribution is better, judging it's our first attempt to 
have hw offload? Later, we could either improve the workaround, or remove the 
restriction from DPDK side (if it's possible). Good to you?

[Finn]
I think we should not further intermix the rxqs distributed to different pmd's, 
other than initially configured, when setting up hw-offload. If we make a 
round-robin distribution of the rxqs, a different pmd will most likely receive 
the hw-offloaded packets - not the same pmd that ran the slow-path originally 
creating the flow.
It is usual to optimize caches etc. per pmd and that would not work then. Maybe 
the further processing of the hw-offloaded packets does not need these 
optimizations at the moment, however, IMHO I think we would be better off using 
the first proposal above (use the same rxq as the one creating the flow). 
We will not be able to do RSS on a specific hw-offloaded flow anyway. That 
would require a larger rework of the RSS configuration in OVS, as Darrell points 
out.


> Alternatively, pushing this to the user in general may be too much 
> overhead; what do you think?
> As a user specification, it could be optional, however?

Yes, that's what I was thinking. For the users that know their NIC supports 
MARK with QUEUE action, they could not add such option. For others, it's 
suggested to add it. Or, it's a must (if we do not do workaround like above). 
Makes sense?

> > 
> > The reason I still want to send it out is to get more 
> comments/thoughts
> > from community on this whole patchset. Meanwhile, I will try to 
> resolve
> > the QUEUE action issue.
> > 
> > Note that it's disabled by default, which can be enabled by:
> > 
> > $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
> > 
> > Maybe per in-port configuration would alleviate the issue to a certain degree.
> 
> Yes, it could be done. I choose it for following reasons:
> 
> - the option is already there, used by tc offloads.
> - it also simplifies the (first) patchset a bit, IMO.
> 
> Of course, I understand.
> 
> However, I'm okay with making it per port. What's your suggestion for
> this? Making "hw-offload" be port, or introducing another one? If so,
> what's your suggestion on the naming?
> 
> I am not suggesting to drop the global configuration Mainly w

Re: [ovs-dev] [RFC] [PATCH 00/11] Data Path Classifier Offloading

2017-07-06 Thread Finn Christensen
It seems that this patch implements a partial hw-offload using DPDK RTE FLOW 
api.
Using a flow_tag to circumvent the EMC, and get the flow from NIC-delivered 
flow_tag on input.

In general I think that RTE FLOW is capable of doing more, like full flow 
offload, and therefore we should aim at a more full implementation of RTE FLOW 
usage in OVS. We may not always be in the output datapath in OVS, but OVS 
should still be fully updated and in control. 
At least we will need that.
The RTE FLOW feature: rte_flow_query() can be used to retrieve NIC flow 
statistics. And that is what we need here.
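
As a rough illustration of that idea (the API is shown roughly as it looked in
the DPDK releases discussed in these threads; a sketch, not code from the
proposal):

#include <rte_flow.h>

/* Illustrative sketch: query hit/byte counters for an offloaded flow.
 * This requires the flow to have been created with a COUNT action. */
static int
query_offloaded_flow_stats(uint16_t port_id, struct rte_flow *flow,
                           uint64_t *hits, uint64_t *bytes)
{
    struct rte_flow_query_count count = { .reset = 0 };
    struct rte_flow_error error;
    int ret = rte_flow_query(port_id, flow, RTE_FLOW_ACTION_TYPE_COUNT,
                             &count, &error);

    if (!ret) {
        *hits = count.hits;
        *bytes = count.bytes;
    }
    return ret;
}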

Furthermore, we would also like to have the possibility to offload virtual 
ports (at least at some point later), so that the in_port may not be the native 
pmd port_id. This feature is also available in RTE FLOW - item type: 
RTE_FLOW_ITEM_TYPE_PORT
And again, this RTE FLOW feature is more or less, what is needed for this. 

Besides that, I have some questions about this implementation. It is probably 
me who has not read the code carefully enough, but maybe you could clarify 
them for me:

You build an item array out of the flow wildcard mask:
...
if (memcmp(&mask->dl_dst, &zero_mac, sizeof(struct eth_addr)) != 0
    || memcmp(&mask->dl_src, &zero_mac, sizeof(struct eth_addr)) != 0) {
    VLOG_INFO("rte_item_eth\n");
    item_any_flow[ii++].set = rte_item_set_eth;
    *buf_size += sizeof(struct rte_flow_item_eth);
    *buf_size += sizeof(struct rte_flow_item_eth);
    if (mask->nw_src != 0 || mask->nw_dst != 0) {
        VLOG_INFO("rte_item_ip\n");
        item_any_flow[ii++].set = rte_item_set_ip;
        *buf_size += sizeof(struct rte_flow_item_ipv4);
        *buf_size += sizeof(struct rte_flow_item_ipv4);
        if (mask->tp_dst != 0 || mask->tp_src != 0) {
            item_any_flow[ii++].set = rte_item_set_udp;
            *buf_size += sizeof(struct rte_flow_item_udp);
            *buf_size += sizeof(struct rte_flow_item_udp);
        }
    }
}
Taken from the function hw_pipeline_item_array_build().

I would like to know how you make sure that all important bits in the flow/mask 
do match the flow that you are about to program into the NIC?
Theoretically, would it be possible to program a UDP RTE FLOW filter in the NIC by 
using a TCP flow?


It seems as if the flows are never programmed into NIC:

From hw_pipeline_thread():
...
while (1) {
    // listen to read_socket :
    // call the rte_flow_create ( flow , wildcard mask)
    ret = hw_pipeline_msg_queue_dequeue(msgq, &ptr_rule);
    if (ret != 0) {
        continue;
    }
    if (ptr_rule.mode == HW_PIPELINE_REMOVE_RULE) {
        ret = hw_pipeline_remove_flow(dp, &ptr_rule.data.rm_flow);
        if (OVS_UNLIKELY(ret)) {
            VLOG_ERR(" hw_pipeline_remove_flow failed to remove flow  \n");
        }
    }
    ptr_rule.mode = HW_PIPELINE_NO_RULE;
}
Shouldn't hw_pipeline_insert_flow() function be called?


Finn Christensen


-Original Message-
From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-boun...@openvswitch.org] 
On Behalf Of Shachar Beiser
Sent: 6. juli 2017 06:33
To: ovs-dev@openvswitch.org
Subject: Re: [ovs-dev] [RFC] [PATCH 00/11] Data Path Classifier Offloading

Hi ,

 I would like to clarify, since I did not add the label [RFC] to the subject 
of my patches but only to the cover letter, that all the patches I have sent 
were sent as an RFC:

[RFC] [PATCH 00/11] Data Path Classifier Offloading [RFC][PATCH 1/11] 
ovs/dp-cls 
[RFC][PATCH 2/11] ovs/dp-cls 
[RFC] [PATCH 3/11] ovs/dp-cls 
[RFC] [PATCH 4/11] ovs/dp-cls 
[RFC] [PATCH 5/11] ovs/dp-cls 
[RFC][PATCH 6/11] ovs/dp-cls 
[RFC] [PATCH 7/11] ovs/dp-cls 
[RFC] [PATCH 8/11] ovs/dp-cls 
[RFC] [PATCH 9/11] ovs/dp-cls 
[RFC] [PATCH 10/11] ovs/dp-cls 
[RFC] [PATCH 11/11] ovs/dp-cls 

-Shachar Beiser


-Original Message-
From: Shachar Beiser [mailto:shacha...@mellanox.com]
Sent: Wednesday, July 5, 2017 3:27 PM
To: ovs-dev@openvswitch.org
Cc: Shachar Beiser <shacha...@mellanox.com>; Mark Bloch <ma...@mellanox.com>; 
Olga Shern <ol...@mellanox.com>
Subject: [PATCH 00/11] Data Path Classifier Offloading

Request for comments Subject: OVS data-path classifier offload 

Versions:
  OVS master.
  DPDK 17.05.1.

Purpose & Scope
This RFC describes the flows’ hardware offloading over DPDK.
The motivation of hardware flows’ offloading is accelerating the OVS DPDK data 
plane. The classification is done by the hardware.
A flow tag that represents the matched rule in the hardware is received by 
OVS and saves the lookup time.
The OVS data-path classifier has to support additional functionality.
If the hardware supports flow table offloading and the user activates the 
feature, the classifier rules are offloaded to the hardware classifier in 
addition to the data-path classifier.
W

Re: [ovs-dev] [RFC] Introducing DPDK RTE FLOW HW offload support for openvswitch

2017-06-28 Thread Finn Christensen
Hi,

The link provided by Michael below is a diff between OVS release tag v2.7.0 and 
the commit done for this proposal. It should be equal to a patch against v2.7.0.

Regarding the question about the 20% performance boost: this is what we can actually 
measure. The packet has a flow_id included in the mbuf header 
(mbuf.hash.fdir.hi, inserted by RTE_ACTION_TYPE_MARK and returned by the NIC on flow 
hit) which contains the output port, odp_port_id or hw-port-id. If it is a 
hw-port-id, it is mapped to an odp_port_id. Then the batch is transmitted to that 
port. You can inspect the code below or in the diff.

The approx. 20% is gained by omitting the need to do packet decoding and 
miniflow generation in OVS, which it needs in order to do an EMC lookup. Furthermore, 
if the packets in a batch are going to the same destination, only one lookup is done 
to get the target port. This is measured with 64 byte packet size at approx. 9 Mpps.
The test setup has been a sink application running in a VM connected to OVS 
through a dpdk virtio port.
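
A small sketch of that forwarding shortcut; the encoding check below is an
assumption for illustration, and hw_local_port_map / MAX_HW_PORT_CNT refer to
the mapping array and constant shown in the proposal further down:

#include <rte_mbuf.h>

#define MAX_HW_PORT_CNT 64              /* illustrative value */
extern uint32_t hw_local_port_map[];    /* hw-port-id -> odp port id mapping */

/* Illustrative sketch: resolve the NIC-delivered flow_id to an output port.
 * The whole batch is then transmitted to that port, skipping miniflow
 * extraction and EMC lookup. */
static uint32_t
output_port_from_flow_id(const struct rte_mbuf *m)
{
    uint32_t flow_id = m->hash.fdir.hi;

    if (flow_id < MAX_HW_PORT_CNT) {    /* assumed encoding of hw-port-ids */
        return hw_local_port_map[flow_id];
    }
    return flow_id;                     /* already an odp port id */
}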

 /Finn

-Original Message-
From: Michael Lilja 
Sent: 28. juni 2017 18:55
To: Darrell Ball <db...@vmware.com>; Finn Christensen <f...@napatech.com>; 
ovs-dev@openvswitch.org
Subject: RE: [ovs-dev] [RFC] Introducing DPDK RTE FLOW HW offload support for 
openvswitch

Hi Darell,

The patch against OVS 2.7 can be found here:
https://github.com/napatech/ovs/commit/467f076835143a9d0c17ea514d4e5c0c33d72c98.diff

The link was actually further down the post, but somehow it wasn't "clickable".

Hope it helps. Is it acceptable to provide the patch as above, or must it be 
embedded in the ML post?

/Michael

-Original Message-
From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-boun...@openvswitch.org] 
On Behalf Of Darrell Ball
Sent: 28. juni 2017 18:06
To: Finn Christensen <f...@napatech.com>; ovs-dev@openvswitch.org
Subject: Re: [ovs-dev] [RFC] Introducing DPDK RTE FLOW HW offload support for 
openvswitch

Thanks for getting this started.
Few initial questions.


On 6/28/17, 5:44 AM, "ovs-dev-boun...@openvswitch.org on behalf of Finn 
Christensen" <ovs-dev-boun...@openvswitch.org on behalf of f...@napatech.com> 
wrote:

We would like to introduce the first OVS patch proposal to a flow 
classification hw-offload for userspace datapath in OVS.
It is implemented on DPDK, using the generic RTE_FLOW API. However, the 
hook into the userspace datapath is not DPDK dependent. 
Since OVS still only supports DPDK 16.11, and the first RTE_FLOW 
implementation in DPDK came in 17.02, we have chosen to build a branch out of 
OVS 2.7-tag, and modified it to run on DPDK 17.02 and implemented the generic 
hw-offload functionality.

This OVS-2.7-branch with generic hw offload feature can be found at Github: 
https://github.com/napatech/ovs/tree/2.7-hw-offload

When OVS is ported to the next version of DPDK (probably 17.05.1) we will 
be able to create a proper OVS patch. 


Are you able to create at least a few real patches that can be applied to 
master or even 2.7. I understand you would
have to adapt some APIs to 17.02/17.05 for example. Maybe one patch would make 
the adjustments to 17.0x
and another 3 patches primarily by layer and/or functional role (flow/stats for 
example).

However, we think that it is highly relevant to quickly move forward with a 
generic userspace datapath hw-offload feature using DPDK and this is our 
proposal to such an implementation. This proposal adds north acceleration, and 
can be run on NICs with RTE_FLOW support (e.g. Intel XL710) and give approx. 
20% performance boost, without modifying the DPDK-PMDs.

20% boost is for matching a 5 tuple and then output to port directly ?

Extended work, using a SmartNIC, and based on this proposal enabling north, 
south and east-west traffic offload, shows performance improvements of several 
100%.



How it works:
This solution tries to hw-offload flows in the flow classification cache. 
On success the NIC does the classification and hints the packets with a flow 
id, so that it can direct the packet directly to output target port, and update 
the flow cache statistics, either retrieved from NIC or sw-handled. If the 
hw-offload fails normal OVS SW handling applies.
The flow cache contains match and action structures that is used to 
generate a flow classification offload request to device, using the flow 
specification and the wildcard bitmask from the match structure (megaflow). 
This is done on add, modify and delete of flow cache elements in the 
dpif-netdev userspace module.
The actual flow offload request to the device is done by adding two 
netdev_class functions "hw_flow_offload" and "get_flow_stats".
In netd

[ovs-dev] [RFC] Introducing DPDK RTE FLOW HW offload support for openvswitch

2017-06-28 Thread Finn Christensen
We would like to introduce the first OVS patch proposal to a flow 
classification hw-offload for userspace datapath in OVS.
It is implemented on DPDK, using the generic RTE_FLOW API. However, the hook 
into the userspace datapath is not DPDK dependent. 
Since OVS still only supports DPDK 16.11, and the first RTE_FLOW implementation 
in DPDK came in 17.02, we have chosen to build a branch out of OVS 2.7-tag, and 
modified it to run on DPDK 17.02 and implemented the generic hw-offload 
functionality.

This OVS-2.7-branch with generic hw offload feature can be found at Github : 
https://github.com/napatech/ovs/tree/2.7-hw-offload

When OVS is ported to the next version of DPDK (probably 17.05.1) we will be 
able to create a proper OVS patch. 
However, we think that it is highly relevant to quickly move forward with a 
generic userspace datapath hw-offload feature using DPDK and this is our 
proposal to such an implementation. This proposal adds north acceleration, and 
can be run on NICs with RTE_FLOW support (e.g. Intel XL710) and give approx. 
20% performance boost, without modifying the DPDK-PMDs.
Extended work, using a SmartNIC, and based on this proposal enabling north, 
south and east-west traffic offload, shows performance improvements of several 
100%.


How it works:
This solution tries to hw-offload flows in the flow classification cache. On 
success the NIC does the classification and hints the packets with a flow id, 
so that it can direct the packet directly to output target port, and update the 
flow cache statistics, either retrieved from NIC or sw-handled. If the 
hw-offload fails normal OVS SW handling applies.
The flow cache contains match and action structures that is used to generate a 
flow classification offload request to device, using the flow specification and 
the wildcard bitmask from the match structure (megaflow). This is done on add, 
modify and delete of flow cache elements in the dpif-netdev userspace module.
The actual flow offload request to the device is done by adding two 
netdev_class functions "hw_flow_offload" and "get_flow_stats".
In netdev-dpdk, these functions are then implemented using the DPDK RTE FLOW 
API.
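
Sketched out, that netdev_class extension could look roughly like the following
(only the two member names come from the text above; the surrounding struct and
the signatures are assumptions, not the branch code):

struct netdev;
struct dpif_flow_stats;

/* Illustrative sketch of the two added provider hooks. */
struct netdev_class_sketch {
    /* ... existing netdev_class callbacks ... */

    /* Offload a flow (match + actions from the flow cache) to the device;
     * on failure, normal OVS software handling applies. */
    int (*hw_flow_offload)(struct netdev *netdev, const void *offload_request);

    /* Retrieve statistics for a previously offloaded flow. */
    int (*get_flow_stats)(struct netdev *netdev, const void *ufid,
                          struct dpif_flow_stats *stats);
};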


  -
  | OF Tables |
  -
^
|
 ---v-
 | Dpif   --  |  Hw  |
 | netdev | Flow cache |---> offload |
 |--  |  |
 |   ^|  |
 |   ||  |
 |   v|  |
 |--- |  |
 || EMC | |  |
 |--- |  |
 -
^   ^
 netdev_class --|---|--- API
v   v
 -
 |   |
 |  netdev-dpdk  |
 |   |
 -
   ^
DPDK --| API (incl RTE FLOW 
API)
   v
 -
 | DPDK PMD  |
 -



How to enable hw offload:
Using OVSDB and add hw-port-id to interface options:

# ovs-vsctl set interface dpdk0 options:hw-port-id=0


Implementation details (hopefully not too messy - the diff can be retrieved from 
github: 
https://github.com/napatech/ovs/commit/467f076835143a9d0c17ea514d4e5c0c33d72c98.diff):

1) OVSDB option hw-port-id is used to enable hw-offload and maps the port 
interface to the hw port id. If, for a dpdkX port, the X does not match the 
hw-port-id, an in-port remap is requested using RTE FLOW ITEM PORT. This enables 
virtual ports in the NIC. This is needed to do south, east and west flow 
classification in the NIC. Note: not supported on all NICs supporting RTE FLOW.
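
As a rough illustration of such an in-port remap with the rte_flow API of that
time (a sketch, not the branch code; hw_port_id stands for the configured
hw-port-id):

#include <rte_flow.h>

/* Illustrative sketch: prepend a PORT item so the rule matches traffic on the
 * intended (possibly virtual) hw port rather than the port the rule is
 * installed through. */
static void
build_items_with_in_port(uint32_t hw_port_id)
{
    struct rte_flow_item_port port_spec = { .index = hw_port_id };
    struct rte_flow_item items[] = {
        { .type = RTE_FLOW_ITEM_TYPE_PORT, .spec = &port_spec },
        /* ... the flow's match items would follow here ... */
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };

    (void) items;   /* items[] would then be passed to rte_flow_create() */
}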

dpif-netdev.c
=

static odp_port_t hw_local_port_map[MAX_HW_PORT_CNT];

This array keeps a mapping between all created ports in dpif-netdev which have an 
associated hw-port-id, and the dpif-selected odp port id. This is used to map the 
port id used in flows to the hw-port-id used in the NIC.

in do_add_port():

    if (netdev_is_pmd(port->netdev)) {
        struct smap netdev_args;

        smap_init(&netdev_args);
        netdev_get_config(port->netdev, &netdev_args);
        port->hw_port_id = smap_get_int(&netdev_args, "hw-port-id", 
                                        (int)(MAX_HW_PORT_CNT + 1));
        smap_destroy(&netdev_args);

        VLOG_DBG("ADD PORT (%s) has hw-port-id %i, odp_port_id %i\n", devname, 
                 port->hw_port_id, port_no);

        if (port->hw_port_id <= MAX_HW_PORT_CNT) {