Re: [ovs-dev] [PATCH 3/3] netdev-offload-tc: Add VxLAN encap support.

2023-07-06 Thread William Tu
On Mon, Jun 26, 2023 at 7:53 AM Simon Horman  wrote:
>
> On Wed, Jul 08, 2020 at 09:10:36AM -0700, William Tu wrote:
> > The patch adds VxLAN encap tc-offload support.  The userspace datapath, 
> > dpif-netdev,
> > flow format differs than the kernel datapath in case of tunnel encap.  
> > Unlike kernel,
> > the dpif-netdev does not use set and output action, but uses a single clone 
> > action with
> > all the tunnel info nested inside.  As an exmaple blow:
> > actions:clone(tnl_push(tnl_port(5),
> >   
> > header(size=50,type=4,eth(dst=06:1d:6e:a3:f1:61,src=26:df:25:f6:7b:4f,dl_type=0x0800),
> > ipv4(src=172.31.1.100,dst=172.31.1.1,proto=17,tos=0,ttl=64,frag=0x4000),
> > udp(src=0,dst=4789,csum=0x0),
> > vxlan(flags=0x800,vni=0x0)),out_port(2)
> >   ), 3)
> >
> > The patch parses the above tunnel encap format and passes to
> > the tc for offloading the VxLAN tunnel.
> >
> > Example of tc format:
> > $ tc -s filter show dev ovs-p1 ingress
> > filter protocol ip pref 3 flower chain 0
> > filter protocol ip pref 3 flower chain 0 handle 0x1
> >   dst_mac 56:2a:1f:3c:bb:f2
> >   src_mac 96:0c:a7:b0:60:a4
> >   eth_type ipv4
> >   ip_tos 0/0x3
> >   ip_flags nofrag
> >   skip_hw
> >   not_in_hw
> >   action order 1: tunnel_key  set
> >   src_ip 172.31.1.100
> >   dst_ip 172.31.1.1
> >   key_id 0
> >   dst_port 4789
> >   nocsum
> >   ttl 64 pipe
> >index 2 ref 1 bind 1 installed 0 sec used 0 sec
> >   Action statistics:
> >   Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> >   no_percpu
> >
> >   action order 2: mirred (Egress Redirect to device ovs-p0) stolen
> >   index 2 ref 1 bind 1 installed 0 sec used 0 sec
> >   Action statistics:
> >   Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
> >   backlog 0b 0p requeues 0
> >   cookie b46e99079448ce581d0fe7a9853c0bb5
> >   no_percpu
> >
> > Signed-off-by: William Tu 
>
> Hi William, all,
>
> I'm a little unclear on the history of this patchset [1].
> But it seems to me that while patches 1/2 and 2/3 were applied as:
>
> * 48c1ab5d74ec netdev: Allow storing dpif type into netdev structure.
> * 8842fdf1b318 netdev-offload: Use dpif type instead of class.
>
> This patch was not. As we are now getting towards it's third birthday
> I'm going to declare it stale and mark it as Changes Requested
> in patchwork.
>
> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2020-July/372699.html

Hi Simon,
It's obsolete and vxlan offload is already supported, so feel free to
declare it stale.
thanks
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2] MAINTAINERS.rst: Move several people to emeritus status

2023-05-26 Thread William Tu
On Thu, May 25, 2023 at 2:23 PM Ansis  wrote:
>
> On Fri, May 19, 2023 at 9:53 AM Russell Bryant  wrote:
> >
> > The following document discusses emeritus committer status:
> >
> > https://docs.openvswitch.org/en/latest/internals/committer-emeritus-status/
> >
> > There are several people who I would guess consider themselves
> > emeritus committers but have not formally declared it. Those moved to
> > emeritus status in this commit have either explicitly communicated
> > their desire to move or have both not been active in the last year and
> > have not yet replied to this patch.
> >
> > It is easy to re-add people in the future should any emeritus
> > committer desire to become active again.
> >
> > Per our policies, a vote of the majority of current committers (or
> > the list of maintainers prior to this change) is required to move a
> > committer to emeritus status.
> >
> > Signed-off-by: Russell Bryant 
> > CC: Alin Serdean 
> > CC: Andy Zhou 
> > CC: Ansis Atteka 
> > CC: Daniele Di Proietto 
> > CC: Gurucharan Shetty 
> > CC: Ian Stokes 
> > CC: Ilya Maximets 
> > CC: Jarno Rajahalme 
> > CC: Jesse Gross 
> > CC: Justin Pettit 
> > CC: Pravin B Shelar 
> > CC: Simon Horman 
> > CC: Thomas Graf 
> > CC: William Tu 
> > CC: YAMAMOTO Takashi 
> > ---
> >  MAINTAINERS.rst | 42 +-
> >  1 file changed, 21 insertions(+), 21 deletions(-)
> >
> > diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst
> > index 1dc406170..85b8e6416 100644
> > --- a/MAINTAINERS.rst
> > +++ b/MAINTAINERS.rst
> > @@ -41,40 +41,20 @@ This is the current list of active Open vSwitch 
> > committers:
> >
> > * - Name
> >   - Email
> > -   * - Alex Wang
> > - - ee07b...@gmail.com
> > * - Alin Serdean
> >   - aserd...@ovn.org
> > -   * - Andy Zhou
> > - - az...@ovn.org
> > * - Ansis Atteka
> > - - aatt...@nicira.com
> > -   * - Daniele Di Proietto
> > - - daniele.di.proie...@gmail.com
> > -   * - Gurucharan Shetty
> > - - g...@ovn.org
> > + - ansisatt...@gmail.com
> > * - Ian Stokes
> >   - isto...@ovn.org
> > * - Ilya Maximets
> >   - i.maxim...@ovn.org
> > -   * - Jarno Rajahalme
> > - - ja...@ovn.org
> > -   * - Jesse Gross
> > - - je...@kernel.org
> > -   * - Justin Pettit
> > - - jpet...@ovn.org
> > -   * - Pravin B Shelar
> > - - pshe...@ovn.org
> > * - Russell Bryant
> >   - russ...@ovn.org
> > * - Simon Horman
> >   - ho...@ovn.org
> > -   * - Thomas Graf
> > - - tg...@noironetworks.com
> > * - William Tu
> >   - u9012...@gmail.com
> > -   * - YAMAMOTO Takashi
> > - - yamam...@midokura.com
> >
> >  The project also maintains a list of Emeritus Committers (or Maintainers).
> >  More information about Emeritus Committers can be found here:
> > @@ -85,12 +65,32 @@ More information about Emeritus Committers can be found 
> > here:
> >
> > * - Name
> >   - Email
> > +   * - Alex Wang
> > + - ee07b...@gmail.com
> > +   * - Andy Zhou
> > + - az...@ovn.org
> > * - Ben Pfaff
> >   - b...@ovn.org
> > +   * - Daniele Di Proietto
> > + - daniele.di.proie...@gmail.com
> > * - Ethan J. Jackson
> >   - e...@eecs.berkeley.edu
> > +   * - Gurucharan Shetty
> > + - g...@ovn.org
> > +   * - Jarno Rajahalme
> > + - ja...@ovn.org
> > +   * - Jesse Gross
> > + - je...@kernel.org
> > * - Joe Stringer
> >   - j...@ovn.org
> > +   * - Justin Pettit
> > + - jpet...@ovn.org
> > +   * - Pravin B Shelar
> > + - pshe...@ovn.org
> > +   * - Thomas Graf
> > + - tg...@tgraf.ch
> > +   * - YAMAMOTO Takashi
> > + - yamam...@midokura.com
> >
> Acked-by: Ansis Atteka 
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] datapath-windows: Detect CT timeout.

2022-05-31 Thread William Tu
On Tue, May 31, 2022 at 12:09 PM Ilya Maximets  wrote:
>
> On 5/31/22 20:29, William Tu wrote:
> > CT timeout is not supported in Windows datapath, but
> > currently it incorrectly reports to ovs-vswitchd as supported blow
> > "system@ovs-system: Datapath supports timeout policy in conntrack
> > action". The patches detects it and returns not support.
> >
> > Cc: Alin-Gabriel Serdean 
> > Cc: Wilson Peng 
> > Signed-off-by: William Tu 
> > ---
> >  datapath-windows/ovsext/Conntrack.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/datapath-windows/ovsext/Conntrack.c 
> > b/datapath-windows/ovsext/Conntrack.c
> > index 471bf961b..b33676cc1 100644
> > --- a/datapath-windows/ovsext/Conntrack.c
> > +++ b/datapath-windows/ovsext/Conntrack.c
> > @@ -1436,6 +1436,9 @@ OvsExecuteConntrackAction(OvsForwardingContext 
> > *fwdCtx,
> >  }
> >  }
> >  break;
> > +case OVS_CT_ATTR_TIMEOUT:
> > +return NDIS_STATUS_NOT_SUPPORTED;
> > +break;
>
> 'break;' here should not be necessary, I guess.
>
> >  default:
> >  OVS_LOG_TRACE("Invalid netlink attr type: %u", 
> > NlAttrType(ctAttr));
> >  break;
>
> Similar to the other patch, shouldn't the return just be
> part of the 'default' case instead of a 'break'.
> Skipping unknown attributes doesn't seem correct.
>
Thanks for taking a look.
yes, we should log it with the attribute info and return.

I think the controller today should probe datapath features in
ovsdb (ex: ovs-vsctl list-dp-cap) and based on the result to
program each ovs-vswitchd, on Windows or Linux.

Unfortunately the "ovs-vsctl list-dp-cap" is also broken in
Windows, I will fix it too.

Regards,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] datapath-windows: Detect truncate action.

2022-05-31 Thread William Tu
On Tue, May 31, 2022 at 12:06 PM Ilya Maximets  wrote:
>
> On 5/31/22 20:27, William Tu wrote:
> > Truncate action is not supported in Windows datapath, but
> > currently it incorrectly reports to ovs-vswitchd as supported blow
> > "system@ovs-system: Datapath supports truncate action".
> > The patches detects it and returns not support.
> >
> > Cc: Alin-Gabriel Serdean 
> > Cc: Wilson Peng 
> > Signed-off-by: William Tu 
> > ---
> >  datapath-windows/ovsext/Actions.c | 5 +
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/datapath-windows/ovsext/Actions.c 
> > b/datapath-windows/ovsext/Actions.c
> > index 0f7f78932..3a5e71ee9 100644
> > --- a/datapath-windows/ovsext/Actions.c
> > +++ b/datapath-windows/ovsext/Actions.c
> > @@ -2502,6 +2502,11 @@ OvsDoExecuteActions(POVS_SWITCH_CONTEXT 
> > switchContext,
> >  }
> >  break;
> >  }
> > +case OVS_ACTION_ATTR_TRUNC:
> > +status = NDIS_STATUS_NOT_SUPPORTED;
> > +dropReason = L"OVS-truncate action not supported";
> > +goto dropit;
> > +break;
>
> Hmm.  I don't know much about windows datapath, but shouldn't
> this just be a behavior of the 'default' case?  i.e.:
>
> default:
> status = NDIS_STATUS_NOT_SUPPORTED;
> dropReason = L"Unknown action"; // or generate the string
> // dynamically with a number?
> break;
>
> Current code seems to skip all the unknown actions.
> That doesn't seem correct.

Thanks for taking a look and you're right.
Current code simply skips if the action is not supported, and there
is no warning/error message logged at all.

Ideally ovs-vswitchd should detect the right action supported in
datapath, and generate/translate the corresponding datapath flow from
openflow flow.
I will work on the next patch to drop the unsupported action.

William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] datapath-windows: Detect CT timeout.

2022-05-31 Thread William Tu
CT timeout is not supported in Windows datapath, but
currently it incorrectly reports to ovs-vswitchd as supported blow
"system@ovs-system: Datapath supports timeout policy in conntrack
action". The patches detects it and returns not support.

Cc: Alin-Gabriel Serdean 
Cc: Wilson Peng 
Signed-off-by: William Tu 
---
 datapath-windows/ovsext/Conntrack.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/datapath-windows/ovsext/Conntrack.c 
b/datapath-windows/ovsext/Conntrack.c
index 471bf961b..b33676cc1 100644
--- a/datapath-windows/ovsext/Conntrack.c
+++ b/datapath-windows/ovsext/Conntrack.c
@@ -1436,6 +1436,9 @@ OvsExecuteConntrackAction(OvsForwardingContext *fwdCtx,
 }
 }
 break;
+case OVS_CT_ATTR_TIMEOUT:
+return NDIS_STATUS_NOT_SUPPORTED;
+break;
 default:
 OVS_LOG_TRACE("Invalid netlink attr type: %u", 
NlAttrType(ctAttr));
 break;
-- 
2.33.0.windows.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] datapath-windows: Detect truncate action.

2022-05-31 Thread William Tu
Truncate action is not supported in Windows datapath, but
currently it incorrectly reports to ovs-vswitchd as supported blow
"system@ovs-system: Datapath supports truncate action".
The patches detects it and returns not support.

Cc: Alin-Gabriel Serdean 
Cc: Wilson Peng 
Signed-off-by: William Tu 
---
 datapath-windows/ovsext/Actions.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/datapath-windows/ovsext/Actions.c 
b/datapath-windows/ovsext/Actions.c
index 0f7f78932..3a5e71ee9 100644
--- a/datapath-windows/ovsext/Actions.c
+++ b/datapath-windows/ovsext/Actions.c
@@ -2502,6 +2502,11 @@ OvsDoExecuteActions(POVS_SWITCH_CONTEXT switchContext,
 }
 break;
 }
+case OVS_ACTION_ATTR_TRUNC:
+status = NDIS_STATUS_NOT_SUPPORTED;
+dropReason = L"OVS-truncate action not supported";
+goto dropit;
+break;
 default:
 status = NDIS_STATUS_NOT_SUPPORTED;
 break;
-- 
2.33.0.windows.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2] datapath-windows: Add ERSPAN v1 support.

2022-05-29 Thread William Tu
Similar to Linux kernel datapath and userspace datapath,
the patch adds ERSPAN v1 implementation to Windows datapath.
The otpions and interface remains the same, ex: a ERSPAN
tunnel can be created by:
  PS> ovs-vsctl add-port br-tun erspan1 -- set int erspan1 type=erspan
  options:remote_ip=171.31.2.183 options:local_ip=171.31.2.163
  options:erspan_ver=1
  PS> ovs-vsctl -- set int erspan1 options:erspan_idx=0
  PS> ovs-vsctl -- set int erspan1 options:key=0

Tested on Windows server 2019. The bridge configuration is similar to
GRE setup mentioned here[1], but replace GRE with ERSPAN tunnel type.
1. https://cloudbase.it/open-vswitch-2-5-hyper-v-gre-part-3/
A short demo using One Linux server and one Windows server:
https://youtu.be/_01lWBRnrd4

Cc: Alin-Gabriel Serdean 
Signed-off-by: William Tu 
---
v2: fix tab and replaced with space
Tested-at:
https://github.com/williamtu/ovs/actions/runs/2406446843
---
 Documentation/faq/releases.rst |   2 +-
 NEWS   |   1 +
 datapath-windows/automake.mk   |   2 +
 datapath-windows/ovsext/Actions.c  |  18 ++
 datapath-windows/ovsext/BufferMgmt.c   |   2 +-
 datapath-windows/ovsext/Erspan.c   | 403 +
 datapath-windows/ovsext/Erspan.h   | 197 
 datapath-windows/ovsext/Flow.c |   7 +
 datapath-windows/ovsext/Flow.h |   4 +-
 datapath-windows/ovsext/Gre.h  |   5 +-
 datapath-windows/ovsext/Util.h |   1 +
 datapath-windows/ovsext/Vport.c|  11 +
 datapath-windows/ovsext/Vport.h|   8 +-
 datapath-windows/ovsext/ovsext.vcxproj |   2 +
 14 files changed, 657 insertions(+), 6 deletions(-)
 create mode 100644 datapath-windows/ovsext/Erspan.c
 create mode 100644 datapath-windows/ovsext/Erspan.h

diff --git a/Documentation/faq/releases.rst b/Documentation/faq/releases.rst
index c12ffaf4a..af882f589 100644
--- a/Documentation/faq/releases.rst
+++ b/Documentation/faq/releases.rst
@@ -146,7 +146,7 @@ Q: Are all features available with all datapaths?
 Tunnel - GRE-IPv6   4.18   2.6  2.6  NO
 Tunnel - VXLAN-IPv6 4.32.6  2.6  NO
 Tunnel - Geneve-IPv64.42.6  2.6  2.18
-Tunnel - ERSPAN 4.18   2.10 2.10 NO
+Tunnel - ERSPAN 4.18   2.10 2.10 2.18
 Tunnel - ERSPAN-IPv64.18   2.10 2.10 NO
 Tunnel - GTP-U  NO NO   2.14 NO
 Tunnel - Bareudp5.7NO   NO   NO
diff --git a/NEWS b/NEWS
index eece0d0b2..8874b2592 100644
--- a/NEWS
+++ b/NEWS
@@ -29,6 +29,7 @@ Post-v2.17.0
- Windows:
  * Conntrack support for TCPv6, UDPv6, ICMPv6, FTPv6.
  * IPv6 Geneve tunnel support.
+ * IPv4 ERSPAN tunnel support.
 
 
 v2.17.0 - 17 Feb 2022
diff --git a/datapath-windows/automake.mk b/datapath-windows/automake.mk
index 60b3d6033..2e040d799 100644
--- a/datapath-windows/automake.mk
+++ b/datapath-windows/automake.mk
@@ -28,6 +28,8 @@ EXTRA_DIST += \
datapath-windows/ovsext/Debug.h \
datapath-windows/ovsext/DpInternal.h\
datapath-windows/ovsext/Driver.c \
+   datapath-windows/ovsext/Erspan.c \
+   datapath-windows/ovsext/Erspan.h \
datapath-windows/ovsext/Ethernet.h \
datapath-windows/ovsext/Event.c \
datapath-windows/ovsext/Event.h \
diff --git a/datapath-windows/ovsext/Actions.c 
b/datapath-windows/ovsext/Actions.c
index 0f7f78932..425a61bda 100644
--- a/datapath-windows/ovsext/Actions.c
+++ b/datapath-windows/ovsext/Actions.c
@@ -34,6 +34,7 @@
 #include "Vport.h"
 #include "Vxlan.h"
 #include "Geneve.h"
+#include "Erspan.h"
 #include "IpFragment.h"
 
 #ifdef OVS_DBG_MOD
@@ -46,6 +47,8 @@
 typedef struct _OVS_ACTION_STATS {
 UINT64 rxGre;
 UINT64 txGre;
+UINT64 rxErspan;
+UINT64 txErspan;
 UINT64 rxVxlan;
 UINT64 txVxlan;
 UINT64 rxStt;
@@ -223,6 +226,9 @@ OvsDetectTunnelRxPkt(OvsForwardingContext *ovsFwdCtx,
 case OVS_VPORT_TYPE_GRE:
 ovsActionStats.rxGre++;
 break;
+case OVS_VPORT_TYPE_ERSPAN:
+ovsActionStats.rxErspan++;
+break;
 }
 }
 }
@@ -310,6 +316,9 @@ OvsDetectTunnelPkt(OvsForwardingContext *ovsFwdCtx,
 case OVS_VPORT_TYPE_GRE:
 ovsActionStats.txGre++;
 break;
+case OVS_VPORT_TYPE_ERSPAN:
+ovsActionStats.txErspan++;
+break;
 case OVS_VPORT_TYPE_VXLAN:
 ovsActionStats.txVxlan++;
 break;
@@ -665,6 +674,11 @@ OvsTunnelPortTx(OvsForwardingContext *ovsFwdCtx)
  >tunKey, ovsFwdCtx->switchContext,

[ovs-dev] [PATCH v2] datapath-windows: Fix GRE/VxLAN/STT Tunnel RX.

2022-05-29 Thread William Tu
GRE/Vxlan/STT tunnel RX is broken due to incorrecly checking the
'tunKey->dst.si_family != AF_INET', which is actually
set later after parsing the GRE header. Removing such
chunk makes tunnel works.

Fixes: edb2335861d6 ("datapath-windows: Add IPv6 Geneve tunnel support in 
Windows")
Cc: Alin-Gabriel Serdean 
Signed-off-by: William Tu 
---
v2: talked to Wilson and add fixes to Vxlan.c and Stt.c
---
 datapath-windows/ovsext/Gre.c   | 5 -
 datapath-windows/ovsext/Stt.c   | 6 --
 datapath-windows/ovsext/Vxlan.c | 5 -
 3 files changed, 16 deletions(-)

diff --git a/datapath-windows/ovsext/Gre.c b/datapath-windows/ovsext/Gre.c
index d87864029..54725dd17 100644
--- a/datapath-windows/ovsext/Gre.c
+++ b/datapath-windows/ovsext/Gre.c
@@ -332,11 +332,6 @@ OvsDecapGre(POVS_SWITCH_CONTEXT switchContext,
 
 *newNbl = NULL;
 
-if (tunKey->dst.si_family != AF_INET) {
-/*V6 tunnel support will be supported later*/
-return NDIS_STATUS_FAILURE;
-}
-
 status = OvsExtractLayers(curNbl, );
 if (status != NDIS_STATUS_SUCCESS) {
 return status;
diff --git a/datapath-windows/ovsext/Stt.c b/datapath-windows/ovsext/Stt.c
index ebdebb690..09d317934 100644
--- a/datapath-windows/ovsext/Stt.c
+++ b/datapath-windows/ovsext/Stt.c
@@ -931,12 +931,6 @@ OvsDecapStt(POVS_SWITCH_CONTEXT switchContext,
 UINT32 advanceCnt, hdrLen;
 OVS_PACKET_HDR_INFO layers = { 0 };
 
-
-if (tunKey->dst.si_family != AF_INET) {
-/*V6 tunnel support will be supported later*/
-return NDIS_STATUS_FAILURE;
-}
-
 status = OvsExtractLayers(curNbl, );
 if (status != NDIS_STATUS_SUCCESS) {
 return status;
diff --git a/datapath-windows/ovsext/Vxlan.c b/datapath-windows/ovsext/Vxlan.c
index d2c7a4a46..b268e7de2 100644
--- a/datapath-windows/ovsext/Vxlan.c
+++ b/datapath-windows/ovsext/Vxlan.c
@@ -412,11 +412,6 @@ OvsDecapVxlan(POVS_SWITCH_CONTEXT switchContext,
 NDIS_STATUS status;
 OVS_PACKET_HDR_INFO layers = { 0 };
 
-if (tunKey->dst.si_family != AF_INET) {
-/*V6 tunnel support will be supported later*/
-return NDIS_STATUS_FAILURE;
-}
-
 status = OvsExtractLayers(curNbl, );
 if (status != NDIS_STATUS_SUCCESS) {
 return status;
-- 
2.33.0.windows.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] datapath-windows: Add ERSPAN v1 support.

2022-05-28 Thread William Tu
Similar to Linux kernel datapath and userspace datapath,
the patch adds ERSPAN v1 implementation to Windows datapath.
The otpions and interface remains the same, ex: a ERSPAN
tunnel can be created by:
  PS> ovs-vsctl add-port br-tun erspan1 -- set int erspan1 type=erspan
  options:remote_ip=171.31.2.183 options:local_ip=171.31.2.163
  options:erspan_ver=1
  PS> ovs-vsctl -- set int erspan1 options:erspan_idx=0
  PS> ovs-vsctl -- set int erspan1 options:key=0

Tested on Windows server 2019. The bridge configuration is similar to
GRE setup mentioned here[1], but replace GRE with ERSPAN tunnel type.
1. https://cloudbase.it/open-vswitch-2-5-hyper-v-gre-part-3/
A short demo using One Linux server and one Windows server:
https://youtu.be/_01lWBRnrd4

Cc: Alin-Gabriel Serdean 
Signed-off-by: William Tu 
---
 Documentation/faq/releases.rst |   2 +-
 NEWS   |   1 +
 datapath-windows/automake.mk   |   2 +
 datapath-windows/ovsext/Actions.c  |  18 ++
 datapath-windows/ovsext/BufferMgmt.c   |   2 +-
 datapath-windows/ovsext/Erspan.c   | 404 +
 datapath-windows/ovsext/Erspan.h   | 197 
 datapath-windows/ovsext/Flow.c |   7 +
 datapath-windows/ovsext/Flow.h |   4 +-
 datapath-windows/ovsext/Gre.h  |   5 +-
 datapath-windows/ovsext/Util.h |   1 +
 datapath-windows/ovsext/Vport.c|  11 +
 datapath-windows/ovsext/Vport.h|   8 +-
 datapath-windows/ovsext/ovsext.vcxproj |   2 +
 14 files changed, 658 insertions(+), 6 deletions(-)
 create mode 100644 datapath-windows/ovsext/Erspan.c
 create mode 100644 datapath-windows/ovsext/Erspan.h

diff --git a/Documentation/faq/releases.rst b/Documentation/faq/releases.rst
index c12ffaf4a..af882f589 100644
--- a/Documentation/faq/releases.rst
+++ b/Documentation/faq/releases.rst
@@ -146,7 +146,7 @@ Q: Are all features available with all datapaths?
 Tunnel - GRE-IPv6   4.18   2.6  2.6  NO
 Tunnel - VXLAN-IPv6 4.32.6  2.6  NO
 Tunnel - Geneve-IPv64.42.6  2.6  2.18
-Tunnel - ERSPAN 4.18   2.10 2.10 NO
+Tunnel - ERSPAN 4.18   2.10 2.10 2.18
 Tunnel - ERSPAN-IPv64.18   2.10 2.10 NO
 Tunnel - GTP-U  NO NO   2.14 NO
 Tunnel - Bareudp5.7NO   NO   NO
diff --git a/NEWS b/NEWS
index 1e107340f..101965b28 100644
--- a/NEWS
+++ b/NEWS
@@ -26,6 +26,7 @@ Post-v2.17.0
- Windows:
  * Conntrack support for TCPv6, UDPv6, ICMPv6, FTPv6.
  * IPv6 Geneve tunnel support.
+ * IPv4 ERSPAN tunnel support.
 
 
 v2.17.0 - 17 Feb 2022
diff --git a/datapath-windows/automake.mk b/datapath-windows/automake.mk
index 60b3d6033..2e040d799 100644
--- a/datapath-windows/automake.mk
+++ b/datapath-windows/automake.mk
@@ -28,6 +28,8 @@ EXTRA_DIST += \
datapath-windows/ovsext/Debug.h \
datapath-windows/ovsext/DpInternal.h\
datapath-windows/ovsext/Driver.c \
+   datapath-windows/ovsext/Erspan.c \
+   datapath-windows/ovsext/Erspan.h \
datapath-windows/ovsext/Ethernet.h \
datapath-windows/ovsext/Event.c \
datapath-windows/ovsext/Event.h \
diff --git a/datapath-windows/ovsext/Actions.c 
b/datapath-windows/ovsext/Actions.c
index 0f7f78932..51e447d30 100644
--- a/datapath-windows/ovsext/Actions.c
+++ b/datapath-windows/ovsext/Actions.c
@@ -34,6 +34,7 @@
 #include "Vport.h"
 #include "Vxlan.h"
 #include "Geneve.h"
+#include "Erspan.h"
 #include "IpFragment.h"
 
 #ifdef OVS_DBG_MOD
@@ -46,6 +47,8 @@
 typedef struct _OVS_ACTION_STATS {
 UINT64 rxGre;
 UINT64 txGre;
+UINT64 rxErspan;
+UINT64 txErspan;
 UINT64 rxVxlan;
 UINT64 txVxlan;
 UINT64 rxStt;
@@ -223,6 +226,9 @@ OvsDetectTunnelRxPkt(OvsForwardingContext *ovsFwdCtx,
 case OVS_VPORT_TYPE_GRE:
 ovsActionStats.rxGre++;
 break;
+case OVS_VPORT_TYPE_ERSPAN:
+ovsActionStats.rxErspan++;
+   break;
 }
 }
 }
@@ -310,6 +316,9 @@ OvsDetectTunnelPkt(OvsForwardingContext *ovsFwdCtx,
 case OVS_VPORT_TYPE_GRE:
 ovsActionStats.txGre++;
 break;
+case OVS_VPORT_TYPE_ERSPAN:
+ovsActionStats.txErspan++;
+break;
 case OVS_VPORT_TYPE_VXLAN:
 ovsActionStats.txVxlan++;
 break;
@@ -665,6 +674,11 @@ OvsTunnelPortTx(OvsForwardingContext *ovsFwdCtx)
  >tunKey, ovsFwdCtx->switchContext,
  >layers, , );
 break;
+case OVS_VPORT_TYPE_ERSPAN:
+ 

[ovs-dev] [PATCH] datapath-windows: Fix GRE tunnel.

2022-05-27 Thread William Tu
GRE tunnel RX is broken due to incorrecly checking the
'tunKey->dst.si_family != AF_INET', which is actually
set later after parsing the GRE header. Removing such
chunk makes GRE tunnel works.

Fixes: edb2335861d6 ("datapath-windows: Add IPv6 Geneve tunnel support in 
Windows")
Cc: Alin-Gabriel Serdean 
Signed-off-by: William Tu 
---
 datapath-windows/ovsext/Gre.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/datapath-windows/ovsext/Gre.c b/datapath-windows/ovsext/Gre.c
index d87864029303..54725dd176c1 100644
--- a/datapath-windows/ovsext/Gre.c
+++ b/datapath-windows/ovsext/Gre.c
@@ -332,11 +332,6 @@ OvsDecapGre(POVS_SWITCH_CONTEXT switchContext,
 
 *newNbl = NULL;
 
-if (tunKey->dst.si_family != AF_INET) {
-/*V6 tunnel support will be supported later*/
-return NDIS_STATUS_FAILURE;
-}
-
 status = OvsExtractLayers(curNbl, );
 if (status != NDIS_STATUS_SUCCESS) {
 return status;
-- 
2.30.1 (Apple Git-130)

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2] daemon-unix: Close log file in monitor process while waiting on child.

2022-02-14 Thread William Tu
On Sat, Jan 29, 2022 at 11:38 AM Ben Pfaff  wrote:
>
> Until now, the monitor process held its log file open while it waited for
> the child to exit, and then it reopened it after the child exited.  Holding
> the log file open this way prevented log rotation from freeing disk space.
> This commit avoids the problem by closing the log file before waiting, then
> reopening it afterward.
>
> Signed-off-by: Ben Pfaff 
> Reported-by: Antonin Bas 
> VMware-BZ: #2743409
> ---
> v1->v2: Fix memory leak in monitor_daemon() reported by William.
>
I applied to master,  thank you
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] acinclude: Detect avx512 vpopcntdq compiler support.

2022-01-26 Thread William Tu
Ubuntu Xenial 16.04 is using GCC 5.4 and it does not support
target "-mavx512vpopcntdq" and cuases error
  lib/dpif-netdev-lookup-avx512-gather.c:356:47:
  error: attribute(target("avx512vpopcntdq")) is unknown
GCC 7+ supports vpopcntdq:
https://gcc.gnu.org/gcc-7/changes.html
The patch detects vpopcntdq and disables AVX512 when not found.

Reported-by: Greg Rose 
Signed-off-by: William Tu 
---
 acinclude.m4 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/acinclude.m4 b/acinclude.m4
index 5c971e98ce91..0c360fd1ef73 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -77,7 +77,7 @@ dnl Checks if compiler and binutils supports AVX512.
 AC_DEFUN([OVS_CHECK_AVX512], [
   OVS_CHECK_BINUTILS_AVX512
   OVS_CHECK_CC_OPTION(
-[-mavx512f], [ovs_have_cc_mavx512f=yes], [ovs_have_cc_mavx512f=no])
+[-mavx512f -mavx512vpopcntdq], [ovs_have_cc_mavx512f=yes], 
[ovs_have_cc_mavx512f=no])
   AM_CONDITIONAL([HAVE_AVX512F], [test $ovs_have_cc_mavx512f = yes])
   if test "$ovs_have_cc_mavx512f" = yes; then
 AC_DEFINE([HAVE_AVX512F], [1],
-- 
2.30.1 (Apple Git-130)

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] daemon-unix: Close log file in monitor process while waiting on child.

2022-01-26 Thread William Tu
On Wed, Jan 26, 2022 at 8:56 AM Ben Pfaff  wrote:
>
> Until now, the monitor process held its log file open while it waited for
> the child to exit, and then it reopened it after the child exited.  Holding
> the log file open this way prevented log rotation from freeing disk space.
> This commit avoids the problem by closing the log file before waiting, then
> reopening it afterward.
>
> Signed-off-by: Ben Pfaff 
> Reported-by: Antonin Bas 
> VMware-BZ: #2743409
> ---

Hi Ben,
Thanks, ASAN reports a memory leak,
2022-01-27T02:17:41.3146557Z 1788: ovsdb-server/add-db with --monitor
FAILED (ovs-macros.at:217)
2022-01-27T02:17:41.3467168Z 1789: ovsdb-server/add-db and remove-db
with --monitor FAILED (ovs-macros.at:217)

==29645==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 99 byte(s) in 1 object(s) allocated from:
#0 0x49633d in malloc
(/home/runner/work/ovs/ovs/openvswitch-2.17.90/_build/sub/ovsdb/ovsdb-server+0x49633d)
#1 0x56df34 in xmalloc__
/home/runner/work/ovs/ovs/openvswitch-2.17.90/_build/sub/../../lib/util.c:137:15
#2 0x56e098 in xmemdup0
/home/runner/work/ovs/ovs/openvswitch-2.17.90/_build/sub/../../lib/util.c:193:15
#3 0x576800 in vlog_get_log_file
/home/runner/work/ovs/ovs/openvswitch-2.17.90/_build/sub/../../lib/vlog.c:352:16
#4 0x57a80e in monitor_daemon
/home/runner/work/ovs/ovs/openvswitch-2.17.90/_build/sub/../../lib/daemon-unix.c:363:30
#5 0x57a05a in daemonize_start
/home/runner/work/ovs/ovs/openvswitch-2.17.90/_build/sub/../../lib/daemon-unix.c:481:13
#6 0x4c5e66 in main
/home/runner/work/ovs/ovs/openvswitch-2.17.90/_build/sub/../../ovsdb/ovsdb-server.c:356:5
#7 0x7f0effa20bf6 in __libc_start_main
(/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)

SUMMARY: AddressSanitizer: 99 byte(s) leaked in 1 allocation(s).
I think we should "free(log_file)" after vlog_set_log_file(log_file)?

William

>  include/openvswitch/vlog.h |  2 +
>  lib/daemon-unix.c  |  4 +-
>  lib/vlog.c | 93 ++
>  3 files changed, 70 insertions(+), 29 deletions(-)
>
> diff --git a/include/openvswitch/vlog.h b/include/openvswitch/vlog.h
> index 886fce5e0..e53ce6d81 100644
> --- a/include/openvswitch/vlog.h
> +++ b/include/openvswitch/vlog.h
> @@ -131,7 +131,9 @@ void vlog_set_verbosity(const char *arg);
>
>  /* Configuring log destinations. */
>  void vlog_set_pattern(enum vlog_destination, const char *pattern);
> +char *vlog_get_log_file(void);
>  int vlog_set_log_file(const char *file_name);
> +void vlog_close_log_file(void);
>  int vlog_reopen_log_file(void);
>  #ifndef _WIN32
>  void vlog_change_owner_unix(uid_t, gid_t);
> diff --git a/lib/daemon-unix.c b/lib/daemon-unix.c
> index 34d45b82a..10806bf81 100644
> --- a/lib/daemon-unix.c
> +++ b/lib/daemon-unix.c
> @@ -360,12 +360,14 @@ monitor_daemon(pid_t daemon_pid)
> (unsigned long int) daemon_pid, status_msg);
>
>  if (child_ready) {
> +char *log_file = vlog_get_log_file();
> +vlog_close_log_file();
>  int error;
>  do {
>  retval = waitpid(daemon_pid, , 0);
>  error = retval == -1 ? errno : 0;
>  } while (error == EINTR);
> -vlog_reopen_log_file();
> +vlog_set_log_file(log_file);
>  if (error) {
>  VLOG_FATAL("waitpid failed (%s)", ovs_strerror(error));
>  }
> diff --git a/lib/vlog.c b/lib/vlog.c
> index 533f93755..0a615bb66 100644
> --- a/lib/vlog.c
> +++ b/lib/vlog.c
> @@ -343,13 +343,25 @@ vlog_set_pattern(enum vlog_destination destination, 
> const char *pattern)
>  }
>  }
>
> -/* Sets the name of the log file used by VLF_FILE to 'file_name', or to the
> - * default file name if 'file_name' is null.  Returns 0 if successful,
> - * otherwise a positive errno value. */
> -int
> -vlog_set_log_file(const char *file_name)
> +/* Returns a copy of the name of the log file used by VLF_FILE, or NULL if 
> none
> + * is configured.  The caller must eventually free the returned string. */
> +char *
> +vlog_get_log_file(void)
> +{
> +ovs_mutex_lock(_file_mutex);
> +char *fn = nullable_xstrdup(log_file_name);
> +ovs_mutex_unlock(_file_mutex);
> +
> +return fn;
> +}
> +
> +/* Sets the name of the log file used by VLF_FILE to 'new_log_file_name', or
> + * closes the current log file if 'new_log_file_name' is NULL.  Takes 
> ownership
> + * of 'new_log_file_name'.  Returns 0 if successful, otherwise a positive 
> errno
> + * value. */
> +static int
> +vlog_set_log_file__(char *new_log_file_name)
>  {
> -char *new_log_file_name;
>  struct vlog_module *mp;
>  struct stat old_stat;
>  struct stat new_stat;
> @@ -358,25 +370,29 @@ vlog_set_log_file(const char *file_name)
>  bool log_close;
>
>  /* Open new log file. */
> -new_log_file_name = (file_name
> - ? xstrdup(file_name)
> - : 

[ovs-dev] [PATCH RFC v2] netdev-dpdk: Add Windows support.

2021-11-05 Thread William Tu
) NA (there is no such example code)
C) 31Mpps
D) Not supported yet (no action=drop action support)
E) 23Mpps

l2fwd
A) 1.2Mpps
B) 120Kpps
C) 29Mpps
D) Not supported yet (no action=port id support)
E) 23Mpps

Deploying on the Cloud
--
Currently I can only make it works on my bare-metal setup.
I tried the three cloud providers:
a. AWS EC2:
AWS supports ixgbevf and DPDK supports it.
However, I can load the dpdk-testpmd but the traffic doesn't
go through. Otherwise, we need ENA driver support.
b. Azure:
Partially works with MLX5 VF, I tested using Standard D2sv3 VM.
Other than VF, there is no virtual NIC support at this moment.
c. Google cloud:
There is no VF support, need to add virtio-net PMD support.

Thoughts and Future Work

a. Resolve all the compiler warnings
b. Can we run ovs-dpdk on public cloud
c. Interrupt mode support

Reference
-
[1] https://dl.acm.org/doi/10.1145/3452296.3472914
[2] 
http://patchwork.ozlabs.org/project/openvswitch/cover/20210808014931.320242-1-sergey.madami...@gmail.com/
[3] https://doc.dpdk.org/guides/windows_gsg/
[4] Porting OvS-DPDK to Windows with Meson
https://github.com/smadaminov/ovs-dpdk-meson-issues
[5] ebpf-for-windows: https://github.com/microsoft/ebpf-for-windows

Signed-off-by: William Tu 
---
v2:
* add Dockerfile for building the project
* add performance number and how to start ovs
* enable HW rte_flow offload
---
 ovs/Dockerfile | 90 +++
 ovs/config.h.meson |  7 ---
 ovs/include/openvswitch/compiler.h |  4 +-
 ovs/include/openvswitch/util.h |  2 +-
 ovs/lib/dpdk.c | 23 ---
 ovs/lib/dpif-netdev.c  |  4 ++
 ovs/lib/meson.build| 13 +++-
 ovs/lib/netdev-dpdk.c  | 99 --
 ovs/lib/sflow_api.h|  1 +
 ovs/lib/util.c |  4 --
 ovs/lib/vlog.c |  7 +++
 ovs/meson.build| 88 ++
 ovs/meson_options.txt  |  2 +
 ovs/ofproto/meson.build|  3 +
 ovs/ovsdb/meson.build  |  3 +
 ovs/utilities/meson.build  |  3 +
 ovs/vswitchd/meson.build   | 10 +++
 ovs/vtep/meson.build   |  3 +
 18 files changed, 298 insertions(+), 68 deletions(-)
 create mode 100644 ovs/Dockerfile

diff --git a/ovs/Dockerfile b/ovs/Dockerfile
new file mode 100644
index 000..6467fab
--- /dev/null
+++ b/ovs/Dockerfile
@@ -0,0 +1,90 @@
+# escape=`
+# I use some of the content from:
+# https://github.com/aserdean/ovs_docker_file/blob/master/Dockerfile
+# and add the DPDK windows built with OVS
+
+ARG WIN_VER="ltsc2019"
+FROM mcr.microsoft.com/windows/servercore:$WIN_VER AS BUILDTOOLS
+
+# Downloa0d VS buildtool, choco installer, and WDK
+ADD https://aka.ms/vs/16/release/vs_buildtools.exe C:\TEMP\vs_buildtools.exe
+ADD https://chocolatey.org/install.ps1 C:\TEMP\choco-install.ps1
+ADD https://go.microsoft.com/fwlink/?linkid=2085767 C:\TEMP\wdksetup.exe
+
+# Let's be explicit about the shell that we're going to use.
+SHELL ["cmd", "/S", "/C"]
+
+# Download pthread4w
+RUN mkdir c:\pthreads
+RUN powershell Invoke-WebRequest 
'ftp://sourceware.org/pub/pthreads-win32/pthreads-w32-2-9-1-release.zip' 
-OutFile 'C:\pthreads\pthreads-win32.zip'
+
+# Install Build Tools. A 3010 error signals that requested operation is
+# successfull but changes will not be effective until the system is rebooted.
+RUN C:\TEMP\vs_buildtools.exe --quiet --wait --norestart --nocache 
--installPath C:\BuildTools --add Microsoft.VisualStudio.Workload.VCTools --add 
Microsoft.VisualStudio.Workload.MSBuildTools  --add 
Microsoft.VisualStudio.Component.VC.Runtimes.x86.x64.Spectre --add 
Microsoft.VisualStudio.Component.VC.Tools.x86.x64 --add 
Microsoft.VisualStudio.Component.Windows10SDK.18362 --add 
Microsoft.VisualStudio.Component.VC.14.24.x86.x64 --add 
Microsoft.VisualStudio.Component.VC.14.24.x86.x64.Spectre || IF 
"%ERRORLEVEL%"=="3010" EXIT 0
+
+# Install choco
+RUN powershell C:\TEMP\choco-install.ps1
+
+RUN choco install git -y
+#RUN choco install msys2 --params "/NoUpdate /InstallDir:C:\msys2" -y
+RUN choco install python3 -y
+RUN choco install 7zip.install -y
+
+RUN 7z x C:\pthreads\pthreads-win32.zip -OC:\pthreads
+
+# OVS build using mingw and msys2
+#RUN msys2_shell.cmd -defterm -no-start -use-full-path -c "pacman --noconfirm 
-S automake autoconf libtool make patch"
+#RUN msys2_shell.cmd -defterm -no-start -use-full-path -c "cp `which python` 
`which python`3"
+#RUN msys2_shell.cmd -defterm -no-start -use-full-path -c "mv /usr/bin/link 
/usr/bin/link_bkup"
+RUN python -m pip install pypiwin32 --disable-pip-version-check
+
+# Install WDK excluding WDK.vsix.
+RUN C:\TEMP\wdksetup.exe /q
+
+# Install WDK.vsix in manual manner.
+RUN copy "C:\Program Files (x86)\Windows Kits\

Re: [ovs-dev] [PATCH] RFC: netdev-dpdk: Add Windows support.

2021-11-04 Thread William Tu
On Thu, Nov 4, 2021 at 4:35 AM Thomas Monjalon  wrote:
>
> 06/10/2021 19:00, William Tu:
> > The patch adds OVS-DPDK supports on Windows.
> [...]
> > Implementation on Windows
> > -
> > It's harder than I thought due to my Linux only background.
> > Sergey and I first need to add meson build support to OVS, in order
> > to make compiling and linking to Windows DPDK library easier[2].
> > In this patch, we use clang compiler on Windows to compile DPDK and
> > OVS, and OVS links to DPDK static library (Dynamic lib still not
> > supported yet in DPDK Windows).
> >
> > Windows DPDK library hasn't finished all its library/driver porting
> > to Windows. So a couple of DPDK library code used by OVS doesn't
> > compile. For examples:
> >   1) ovs-numa.c: used to detect lcores used by DPDK.
> >   2) open_memstream, fopencookies: used to redirect DPDK logs.
> >   3) vhostuser doesn't support: used to connect VM/container.
> >   4) dpdk-socket-mem not support: configuration
> > I simply remove them in this patch.
> > In the future, I will probably refactor or #ifdef.
>
> Yes please, removing code is clearly not the right direction :)
>
> [...]
> > Future Work
> > ---
> > In priority ordering
> > a. Add CI or Dockerfile for building the code
> > b. Performance numbers, compared with OVS kernel datapath
> > c. Try rte_flow HW offload on Windows
> > d. Resolve all the compiler warnings
> > e. Refactor the vhostuser code
> >
> > Some thoughts:
> > f. With clang support on Windows, can we remove MSVC?
> >so OVS will have less compiler-specific code, ex:
> >include\openvswitch\compiler.h
> > g. DPDK starts to implement pthread in Windows, can
> >OVS uses pthread library from DPDK, instead of pthreads4w?
>
> The thread API is not ready yet in DPDK.
> What would be the benefit for OVS to use the same pthread layer as DPDK?
> OVS needs to be able to compile without DPDK.

Hi Thomas,

Yes, you're right.
I was thinking about copying the pthread code from DPDK to OVS code base.
This removes the dependencies on pthreads4w.

I'm working on the RFC v2 patch, will send out this week.

Thanks for all the feedback!
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] RFC: netdev-dpdk: Add Windows support.

2021-10-06 Thread William Tu
The patch adds OVS-DPDK supports on Windows.

Motivation
--
Currently OVS supports multiple datapath implementations: Linux kernel,
Windows kernel, userspace with OVS-DPDK Linux, and HW offload.
Adding any new feature to OVS datapath requires OS-specific expertise
and usually ends up with feature mismatch, ex: Linux kernel supports
feature A, but Windows does not, and high maintenace cost [1].

It would be great if OVS just uses single datapath across different
platforms, and the datapath is portatble, performant, and easy
to maintain. The natural choice is the dpif-netdev, the usersapce
netdev datapath currently used by OVS-DPDK and AF_XDP, currently
only works on Linux platform.

So the last piece is to make OVS-DPDK runs on Windows. With this work,
the OVS userspace datapath becomes the one runs on Windows/Linux/FreeBSD
and any new datapath feature should naturally work on different OSes.
Performance of userspace datapath should be equal or better to the kernel
datapath, due to the kernel bypass design of DPDK/AF_XDP, and
optimizations in OVS userspace datapath.

OVS also translates its datapath flows into HW offload-capable NICs,
such as Mellanox CX6. With userspace datapath, HW offload works by
using tc-flower (when using AF_XDP netdev) or using rte_flow API,
(when using DPDK netdev). We've tried non-triival OVS actions such as
conntrack or tunnel actions offloaded into Mellanox card.

Moving forward, OVS will have better cross-platform support, better
performance, and easier to maintain. So far I haven't seen any virtual
switch capable of doing all of the above.

Implementation on Windows
-
It's harder than I thought due to my Linux only background.
Sergey and I first need to add meson build support to OVS, in order
to make compiling and linking to Windows DPDK library easier[2].
In this patch, we use clang compiler on Windows to compile DPDK and
OVS, and OVS links to DPDK static library (Dynamic lib still not
supported yet in DPDK Windows).

Windows DPDK library hasn't finished all its library/driver porting
to Windows. So a couple of DPDK library code used by OVS doesn't
compile. For examples:
  1) ovs-numa.c: used to detect lcores used by DPDK.
  2) open_memstream, fopencookies: used to redirect DPDK logs.
  3) vhostuser doesn't support: used to connect VM/container.
  4) dpdk-socket-mem not support: configuration
I simply remove them in this patch.
In the future, I will probably refactor or #ifdef.

In addition, only a few DPDK PMD drivers are supported.
(please check the DPDK doc[3])
  1) Physical: ice, i40e, mlx5, ixgbe
  3) Virtual: none (still work in progress)

For ice, i40, ixgbe, you will need additional two Windows drivers,
netuio and virt2phys. For mlx5, you will need WINOF-2.
I tested this patch using mlx5 pmd on Azure cloud and on-prem
and ixgbevf on AWS EC2, using Windows server 2019.
And I only tested basic OpenFlow actions such as matching ipv4 headers,
drop and output.

Future Work
---
In priority ordering
a. Add CI or Dockerfile for building the code
b. Performance numbers, compared with OVS kernel datapath
c. Try rte_flow HW offload on Windows
d. Resolve all the compiler warnings
e. Refactor the vhostuser code

Some thoughts:
f. With clang support on Windows, can we remove MSVC?
   so OVS will have less compiler-specific code, ex:
   include\openvswitch\compiler.h
g. DPDK starts to implement pthread in Windows, can
   OVS uses pthread library from DPDK, instead of pthreads4w?
h. Mingw/msys is pretty slow, by switching to meson+clang,
   what's the improvement of build process?
f. How does OVS-DPDK connect to VM/Container in Windows?
   There is no vhostuser, we should look at netvsc/vmbus.

I got lots of help from many people in DPDK community, thanks!

Reference
-
[1] https://dl.acm.org/doi/10.1145/3452296.3472914
[2] 
http://patchwork.ozlabs.org/project/openvswitch/cover/20210808014931.320242-1-sergey.madami...@gmail.com/
[3] https://doc.dpdk.org/guides/windows_gsg/
[4] Porting OvS-DPDK to Windows with Meson
https://github.com/smadaminov/ovs-dpdk-meson-issues

Signed-off-by: William Tu 
---
 ovs/config.h.meson |   7 -
 ovs/include/openvswitch/compiler.h |   4 +-
 ovs/include/openvswitch/util.h |   2 +-
 ovs/lib/dpdk.c | 114 +---
 ovs/lib/dpif-netdev.c  |   4 +
 ovs/lib/meson.build|  12 +-
 ovs/lib/netdev-dpdk.c  | 934 +
 ovs/lib/sflow_api.h|   1 +
 ovs/lib/util.c |   4 -
 ovs/lib/vlog.c |   7 +
 ovs/meson.build|  88 ++-
 ovs/meson_options.txt  |   2 +
 ovs/ofproto/meson.build|   3 +
 ovs/ovsdb/meson.build  |   3 +
 ovs/utilities/meson.build  |   3 +
 ovs/vswitchd/meson.build   |  10 +
 ovs/vtep/meson.build   |   3 +
 17 files changed, 140 insertions(+), 1061 deletions(-)

diff --git

Re: [ovs-dev] [PATCH ovn] MAINTAINERS: Transition myself to emeritus status.

2021-10-05 Thread William Tu
On Tue, Oct 5, 2021 at 8:05 PM William Tu  wrote:
>
> On Tue, Oct 5, 2021 at 11:33 AM Numan Siddique  wrote:
> >
> > On Tue, Oct 5, 2021 at 1:46 PM Han Zhou  wrote:
> > >
> > > On Tue, Oct 5, 2021 at 9:39 AM Ben Pfaff  wrote:
> > > >
> > > > I have not been actively maintaining OVN for some time now.  I don't
> > > expect
> > > > that to change.  I think that the responsible thing to do is to
> > > > acknowledge this properly by transitioning to emeritus status.
> > > >
> > > > Signed-off-by: Ben Pfaff 
> > > > ---
> > > >  MAINTAINERS.rst | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst
> > > > index 0d19bd622..fff2cd977 100644
> > > > --- a/MAINTAINERS.rst
> > > > +++ b/MAINTAINERS.rst
> > > > @@ -41,8 +41,6 @@ This is the current list of active OVN committers:
> > > >
> > > > * - Name
> > > >   - Email
> > > > -   * - Ben Pfaff
> > > > - - b...@ovn.org
> > > > * - Gurucharan Shetty
> > > >   - g...@ovn.org
> > > > * - Han Zhou
> > > > @@ -67,3 +65,5 @@ More information about Emeritus Committers can be 
> > > > found
> > > >
> > > > * - Name
> > > >   - Email
> > > > +   * - Ben Pfaff
> > > > + - b...@ovn.org
> > > > --
> > > > 2.31.1
> > > >
> > > > ___
> > > > dev mailing list
> > > > d...@openvswitch.org
> > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > >
> > > Hi Ben,
> > >
> > > The patch LGTM, although I really don't like the solution :(
> > > Thank you so much for building the amazing stuff from scratch!
> > >
> >
> > Agree, Thank you for all your amazing work!
> >
> > Thanks
> > Numan
> >
> >
> > > Acked-by: Han Zhou 
>
> Thanks Ben!
> I applied to master.
> William

and feel free to revert the commit if you change your mind :)
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] MAINTAINERS: Transition myself to emeritus status.

2021-10-05 Thread William Tu
On Tue, Oct 5, 2021 at 11:33 AM Numan Siddique  wrote:
>
> On Tue, Oct 5, 2021 at 1:46 PM Han Zhou  wrote:
> >
> > On Tue, Oct 5, 2021 at 9:39 AM Ben Pfaff  wrote:
> > >
> > > I have not been actively maintaining OVN for some time now.  I don't
> > expect
> > > that to change.  I think that the responsible thing to do is to
> > > acknowledge this properly by transitioning to emeritus status.
> > >
> > > Signed-off-by: Ben Pfaff 
> > > ---
> > >  MAINTAINERS.rst | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst
> > > index 0d19bd622..fff2cd977 100644
> > > --- a/MAINTAINERS.rst
> > > +++ b/MAINTAINERS.rst
> > > @@ -41,8 +41,6 @@ This is the current list of active OVN committers:
> > >
> > > * - Name
> > >   - Email
> > > -   * - Ben Pfaff
> > > - - b...@ovn.org
> > > * - Gurucharan Shetty
> > >   - g...@ovn.org
> > > * - Han Zhou
> > > @@ -67,3 +65,5 @@ More information about Emeritus Committers can be found
> > >
> > > * - Name
> > >   - Email
> > > +   * - Ben Pfaff
> > > + - b...@ovn.org
> > > --
> > > 2.31.1
> > >
> > > ___
> > > dev mailing list
> > > d...@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
> > Hi Ben,
> >
> > The patch LGTM, although I really don't like the solution :(
> > Thank you so much for building the amazing stuff from scratch!
> >
>
> Agree, Thank you for all your amazing work!
>
> Thanks
> Numan
>
>
> > Acked-by: Han Zhou 

Thanks Ben!
I applied to master.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] rhel: Fix dual kernel rpm install for RHEL 8.4

2021-09-03 Thread William Tu
On Wed, Aug 25, 2021 at 3:08 PM Gregory Rose  wrote:
>
>
>
> On 8/25/2021 11:42 AM, Yifeng Sun wrote:
> > LGTM, thanks Greg.
> >
> > Reviewed-by: Yifeng Sun 
>
Thanks, I pushed to master.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH RFC 0/1] use meson and ninja as a build system for ovs

2021-08-13 Thread William Tu
> >>> Meson is a nice system in many aspects, but its support for tests is very
> >>> limited.  IIUC, it can only run a single binary and check the error codes.
> >>> Most of our tests starts several daemons and performs several fairly 
> >>> complex
> >>> operations and checks.  I'm afraid that we will end up writing our own
> >>> separate testing framework that will mimic autotest in order to be able to
> >>> run our tests from meson.
> >>>
> >>> Did you think about this problem?  Do you have a solution in mind?
> >>
> >> Thanks, we thought about it but I don't have a solution in mind at this 
> >> moment.
>
> PoC of the build process is fine and I had no doubts that it can be done.
> I'd suggest to make a PoC for at least one unit test that we have (e.g.
> "ofproto-dpif - MPLS handling with goto_table") or two tests to actually
> figure out what we need to do with them.
>
Thanks! That makes sense.

Sergey is looking for two possible solutions:
Avocado testing framework
https://avocado-framework.readthedocs.io/en/90.0/quickstart/index.html
or
Unity test framework
https://docs.unity3d.com/Packages/com.unity.test-framework@1.1/manual/index.html
or
any suggestions are welcome!
Regards,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH RFC 0/1] use meson and ninja as a build system for ovs

2021-08-09 Thread William Tu
Hi Ilya,

Thanks for your feedback!

> Hi, Sergey and William.  Thanks for working on this.
>
> I think that it might be a good idea to move to a different build system
> that will not be that painful to run on Windows.  I'm not working on
> Windows parts, but it would be great to have a fast CI that can confirm
> that everything still working fine.

Yes, and avoiding the msys/mingw makes coding and debugging on windows
much easier.

>
> However, as I already said in the discussion on GitHub, it seems to be
> very hard to migrate our testsuite that heavily depends on autotest.
> And without migrating the testsuite we will, probably, have to maintain
> two different build systems to be able to run tests.  This, kind of,
> defeats the whole purpose of the change.

Why is it very hard?
I thought once we find a way to add tests to meson, it's just
"mechanically" copying all these scripts into a new test framework.
It's a lot of tedious work, but hopefully not a difficult or impossible task.

Or does current OVS autotests heavily depend on autotool/autoconf?
I see the test cases are pretty independent.
>
> Meson is a nice system in many aspects, but its support for tests is very
> limited.  IIUC, it can only run a single binary and check the error codes.
> Most of our tests starts several daemons and performs several fairly complex
> operations and checks.  I'm afraid that we will end up writing our own
> separate testing framework that will mimic autotest in order to be able to
> run our tests from meson.
>
> Did you think about this problem?  Do you have a solution in mind?

Thanks, we thought about it but I don't have a solution in mind at this moment.

Regards,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4] netlink: removed incorrect optimization

2021-06-15 Thread William Tu
On Tue, Jun 15, 2021 at 11:43 AM Ansis  wrote:
>
> On Mon, Jun 14, 2021 at 10:22 PM Ansis  wrote:
> >
> > On Mon, Jun 7, 2021 at 1:31 PM Toms Atteka  wrote:
> > >
> > > This optimization caused FLOW_TNL_F_UDPIF flag not to be used in
> > > hash calculation for geneve tunnel when revalidating flows which
> > > resulted in different cache hash values and incorrect behaviour.
> > >
> > > Added test to prevent regression.
> > >
> > > CC: Jesse Gross 
> > > Fixes: 6728d578f64e ("dpif-netdev: Translate Geneve options per-flow, not 
> > > per-packet.")
> > > Reported-at: https://github.com/vmware-tanzu/antrea/issues/897
> > > Signed-off-by: Toms Atteka 
> > > ---
> > >  lib/tun-metadata.c  |  2 +-
> > >  tests/system-traffic.at | 54 +
> > >  2 files changed, 55 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/lib/tun-metadata.c b/lib/tun-metadata.c
> > > index c0b0ae044..af0bcbde8 100644
> > > --- a/lib/tun-metadata.c
> > > +++ b/lib/tun-metadata.c
> > > @@ -828,7 +828,7 @@ tun_metadata_to_geneve_nlattr(const struct flow_tnl 
> > > *tun,
> > >  } else {
> > >  tun_metadata_to_geneve_nlattr_mask(key, tun, flow, b);
> > >  }
> > > -} else if (flow->metadata.present.len || is_mask) {
> > > +} else {
> >
> > I reverted the line above, but the regression prevention test you
> > added below still seems to pass. So - does this regression prevention
> > test serve any purpose or am I doing something wrong? Here is proof:
> >
> >  18: datapath - n ok
>
> I take it back. I ran make check-kernel. I had to run make
> check-system-userspace to see your regression test in action.
>
> William, do you have any comments for this patch? You indicated you
> wanted to look at it too.
>
Don't have any comments.
Thanks.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 2/5] odp-util: Fix use of uninitialized erspan metadata.

2021-04-05 Thread William Tu
On Sun, Apr 4, 2021 at 10:31 AM Ilya Maximets  wrote:
>
> 'struct erspan_metadata' contains union with fields of different
> sizes, hence not all the memory initiliazed.  This memory goes
> to syscalls and also used to compare ukeys with memcmp which may
> cause unexpected behavior.
>
>  Thread 15 revalidator13:
>  Conditional jump or move depends on uninitialised value(s)
> at 0x4C377B6: bcmp (vg_replace_strmem.c:)
> by 0x43F844: ofpbuf_equal (ofpbuf.h:273)
> by 0x43F844: revalidate_ukey__ (ofproto-dpif-upcall.c:2227)
> by 0x43F9C9: revalidate_ukey (ofproto-dpif-upcall.c:2294)
> by 0x4425C2: revalidate.isra.33 (ofproto-dpif-upcall.c:2734)
> by 0x4434B8: udpif_revalidator (ofproto-dpif-upcall.c:943)
> by 0x4FDE2C: ovsthread_wrapper (ovs-thread.c:383)
> by 0x5E19159: start_thread (in /usr/lib64/libpthread-2.28.so)
> by 0x69ECF72: clone (in /usr/lib64/libc-2.28.so)
>   Uninitialised value was created by a stack allocation
>     at 0x4B1CE0: tun_key_to_attr (odp-util.c:3129)
>
> CC: William Tu 
> Fixes: 98514eea21f4 ("erspan: add kernel datapath support")
> Signed-off-by: Ilya Maximets 
> ---

LGTM, Thanks.
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] RFC: netdev-afxdp: Support for XDP metadata HW hints.

2021-03-05 Thread William Tu
Thanks Björn and Jesper for feedbacks.

On Fri, Mar 5, 2021 at 7:34 AM Björn Töpel  wrote:
>
> On 2021-03-05 16:07, Jesper Dangaard Brouer wrote:
> > On Fri, 5 Mar 2021 14:56:02 +0100
> > Björn Töpel  wrote:
> >
> >> On 2021-03-05 11:13, Jesper Dangaard Brouer wrote:
>
>
> [...]
> >>
> >> You'd like to use the length of metadata to express if it's available or
> >> not. I'm stating that we don't need that for AF_XDP. In XDP if we're
> >> accessing the field, we need to validate access due to the verifier. But
> >> it's not to know if it's actually there. When we attach a certain XDP to
> >> a device, we can query what offload it supports, and the the application
> >> knows that meta data is structured in a certain way.
> >>
> >> Some application has simple metadata:
> >> struct tstamp { u64 rx; };
> >>
> >> That application will unconditionally look at (struct tstamp
> >> *)(pkt-sizeof(struct tstamp)->rx;.
> >
> > I imagine that NIC hardware will/can provide packets with different
> > metadata.  Your example with timestamps are a good example, as some
> > drivers/HW only support timestamping PTP packets.  Thus, I don't want
> > to provide metadata info on tstamp if HW didn't provide it (and I also
> > don't want to spend CPU-cycles on clearing the u64 field with zero).
> > Another example is vlan IDs is sometimes provided by hardware.
> >
> > Thus, on a per packet basis the metadata can change.
> >
Is this future work?

Looking at Saeed's patch, I thought we query the netdev through
BTF netlink API, and we get a BTF ID and C-struct per-netdev.
And this will be done when OVS control plane, when a user attaches
a device to its bridge, we query its metadata structure.

>
> Yes, I agree. For a certain set of packets, there can be different
> metadata per packet that, as you pointed out, can be dispatched do
> different handlers.
>
> >> Some might be more flexible and have:
> >> struct metadata { char storage[64]; int btf_id; };
> >
> > I really like the idea of placing the btf_id as the last entry, that
> > way we know its location.  I can get the btf_id via packet minus-offset
> > sizeof(int).
> >
[...]

> >
> > BTF sort-of make the struct names identifiers and API.  And gives *BPF*
> > programs the ability to adjust offsets, when loading the BPF-prog into
> > the kernel.
> >
> > Driver-1 choose:
> >
> >   struct xdp_md_desc {
> >   uint32_t flow_mark;
> >   uint32_t hash32;
> >   } __attribute__((preserve_access_index));
> >
> > Driver-2 choose:
> >
> >   struct xdp_md_desc {
> >   uint32_t hash32;
> >   uint32_t flow_mark;
> >   } __attribute__((preserve_access_index));
> >
> > The BPF code can access the members, and kernel will remap-struct
> > access, and internally in the kernel create two different BPF-programs
> > with adjusted offsets in the byte-code.
> >
> > How do we handle this in the OVS userspace C-code?
> >

BTW, do we assume all vendors use the same field name
for same purpose? For example:
When OVS query Intel or mlx, if rxhash is available, they should
use the same name as "hash32".
Otherwise, I don't know how to associate it into OVS code logic.

If that's the case, I can do (in userspace C code, not BPF program)
1) When attaching a device, query its MD struct and MD's size
using BTF info.
2) From the BTF info returned, check if string "hash32" exists.
if yes, then calculate its offset from the beginning of the MD.
3) OVS cat set it by calling "dp_packet_set_rss(packet, md[rxhash_offset]);"

>
> Longer-term, I think extending the linkers to support struct member
> relocations would be a good thing! I.e. simply support
> __attribute__((preserve_access_index)) in userland. That would enable
> very powerful userland/kernel interface, like what we're discussing now.
>
> But I wonder if it's too big of an initial undertaking to do that as a
> first step? Maybe non-relocatable struct could be a start.
>
Having relocatable struct would be great. So I can just use 'md->hash32',
instead of offset like above.

> I definitely agree that the goal should be relocatable structs!
>
> > Maybe/hopefully you have a better idea than mine, which is simply to
> > compile two C-programs (or program for-each know BTF-id) and dispatch
> > based on the BTF-id.
> >

Thanks
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] RFC: netdev-afxdp: Support for XDP metadata HW hints.

2021-03-04 Thread William Tu
 SMC hits:   0  (  0.0 %)
  - Megaflow hits:  0  (  0.0 %, 0.00 subtbl lookups/hit)
  - Upcalls:0  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:   0  (  0.0 %)
  Tx packets:   0

Perf record/report:
  22.96%  pmd-c07/id:8  ovs-vswitchd[.] netdev_afxdp_rxq_recv
  10.43%  pmd-c07/id:8  ovs-vswitchd[.] miniflow_extract
   7.20%  pmd-c07/id:8  ovs-vswitchd[.] dp_packet_init__
   7.00%  pmd-c07/id:8  ovs-vswitchd[.] dp_netdev_input__
   6.79%  pmd-c07/id:8  ovs-vswitchd[.] dp_netdev_process_rxq_port
   6.62%  pmd-c07/id:8  ovs-vswitchd[.] pmd_thread_main
   5.65%  pmd-c07/id:8  ovs-vswitchd[.] pmd_perf_end_iteration
   5.04%  pmd-c07/id:8  [vdso]  [.] __vdso_clock_gettime
   3.60%  pmd-c07/id:8  ovs-vswitchd[.] time_timespec__
   3.10%  pmd-c07/id:8  ovs-vswitchd[.] umem_elem_push
   2.74%  pmd-c07/id:8  libc-2.27.so[.] __memcmp_avx2_movbe
   2.62%  pmd-c07/id:8  ovs-vswitchd[.] time_usec
   2.14%  pmd-c07/id:8  ovs-vswitchd[.] dp_packet_use_afxdp
   1.58%  pmd-c07/id:8  ovs-vswitchd[.] netdev_rxq_recv
   1.47%  pmd-c07/id:8  ovs-vswitchd[.] netdev_get_class
   1.34%  pmd-c07/id:8  ovs-vswitchd[.] pmd_perf_start_iteration

Signed-off-by: William Tu 
---
 lib/netdev-afxdp.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index 482400d8d135..49881a8cc0cb 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -169,6 +169,17 @@ struct netdev_afxdp_tx_lock {
 );
 };
 
+/* FIXME:
+ * This should be done dynamically by query the device's
+ * XDP metadata structure. Ex:
+ *   $ bpftool net xdp md_btf cstyle dev enp2s0f0np0
+ */
+struct xdp_md_desc {
+uint32_t flow_mark;
+uint32_t hash32;
+uint16_t vlan;
+};
+
 #ifdef HAVE_XDP_NEED_WAKEUP
 static inline void
 xsk_rx_wakeup_if_needed(struct xsk_umem_info *umem,
@@ -849,6 +860,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct 
dp_packet_batch *batch,
 struct dp_packet_afxdp *xpacket;
 const struct xdp_desc *desc;
 struct dp_packet *packet;
+struct xdp_md_desc *md;
 uint64_t addr, index;
 uint32_t len;
 char *pkt;
@@ -858,6 +870,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct 
dp_packet_batch *batch,
 len = desc->len;
 
 pkt = xsk_umem__get_data(umem->buffer, addr);
+md = pkt - sizeof *md;
 index = addr >> FRAME_SHIFT;
 xpacket = >xpool.array[index];
 packet = >packet;
@@ -868,6 +881,12 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct 
dp_packet_batch *batch,
 OVS_XDP_HEADROOM);
 dp_packet_set_size(packet, len);
 
+/* FIXME: This should be done by detecting whether
+ * XDP MD is enabled or not. Ex:
+ * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
+ */
+dp_packet_set_rss_hash(packet, md->hash32);
+
 /* Add packet into batch, increase batch->count. */
 dp_packet_batch_add(batch, packet);
 
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] tunnel:fix vlan bug for tunnel forwarding

2021-03-01 Thread William Tu
On Mon, Mar 1, 2021 at 3:47 AM Ilya Maximets  wrote:
>
> On 1/10/21 3:16 AM, Hongzhi Guo wrote:
> > From: hongzhi guo 
> >
> > I found the basic vxlan topology cannot work when upgrade ovs 2.12 version.
> > Through debugging, I found that when the native tunnel forward,
> > the inner vlan was used for forwarding mistakely.
> > Which resulted in packet loss on the tunnel_bearing port.
> >
> > Then I abstracted the scene where the problem occurred.
> >
> > Test:
> > |---|
> > | br1 o br1(tag=2)
> > |p--|
> >  |br1-2(tag=2)
> >  |
> >  |ip = 10.10.10.22
> >  |br2-1(tag=2)   p3(tag=3)in namespace
> > |p--||o--|
> > |   br2 o|   br3 |
> > |v--||-t-|
> > vxlan   tunnel_bearing(tag=3)
> >  (remote_ip=10.10.10.22) (10.10.10.21)
> >
> > Bridge br3
> > datapath_type: netdev
> > Port p3
> > tag: 3
> > Interface p3
> > type: internal
> > Port tunnel_bearing
> > tag: 3
> > Interface tunnel_bearing
> > type: internal
> > Port br3
> > Interface br3
> > type: internal
> > Bridge br2
> > datapath_type: netdev
> > Port br2
> > Interface br2
> > type: internal
> > Port br2-1
> > tag: 2
> > Interface br2-1
> > type: patch
> > options: {peer=br1-2}
> > Port vxlan
> > Interface vxlan
> > type: vxlan
> > options: {remote_ip="10.10.10.22"}
> > Bridge br1
> > datapath_type: netdev
> > Port br1
> > tag: 2
> > Interface br1
> > type: internal
> > Port br1-2
> > tag: 2
> > Interface br1-2
> > type: patch
> > options: {peer=br2-1}
> >
> > Trace:
> > ovs-appctl ofproto/trace br1 in_port=br1,\
> >   dl_src=fa:16:3e:8c:eb:5b,dl_dst=fa:16:3e:a5:15:f3,\
> >   ip,nw_src=20.20.20.104,nw_dst=20.20.20.101,\
> >   nw_proto=6 -generate
> >
> > Trace-result:
> > bridge("br1")
> > -
> >  0. priority 0
> > NORMAL
> >  -> no learned MAC for destination, flooding
> >
> > bridge("br2")
> > -
> >  0. priority 0
> > NORMAL
> >  -> no learned MAC for destination, flooding
> >  -> output to native tunnel
> >  -> tunneling to 10.10.10.22 via tunnel_bearing
> >  -> tunneling from 96:2d:94:57:6a:55 10.10.10.21\
> >   to c6:bc:3e:69:83:cc 10.10.10.22
> >
> > bridge("br3")
> > -
> >  0. priority 0
> > NORMAL
> >   dropping VLAN 2 tagged packet received on\
> >  port tunnel_bearing configured as VLAN 3 access port 
> >  >> disallowed VLAN VID for this input port, dropping

Why is this an issue?
I looked at your setup, the inner packet has a VLAN id = 2, and
you assign tunnel port to be VLAN id = 3, so it makes sens to
drop the packet, right?

William
>
> Hi.  I didn't look closely to the code yet, but since you have a test
> scenario, could you, please, create a unit test in tests/tunnel.at for it?
>
> Best regards, Ilya Maximets.
>
> >
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-27 Thread William Tu
Hi Hepeng,
Thanks, now I understand.

On Sat, Feb 27, 2021 at 12:54 AM 贺鹏  wrote:
>
> Hi,
>
> Thanks William and Ilya for a detailed revisiting of the origin of the
> problem. I learned a lot.
>
> I now understand that the mix using of RCU and refcounts is not
> intended in the first place.
> But my point is that mix using RCU and refcounts now gives you more
> choices, and actually eases the code changes.
>
> For example, the code for * ofproto_dpif_lookup_by_name* or other
> ofproto lookup function,
> when only using refcounts, you need to change it to:
>
>  struct ofproto_dpif *
>  ofproto_dpif_lookup_by_uuid(const struct uuid *uuid)
>  {
>  struct ofproto_dpif *ofproto;
>
>  HMAP_FOR_EACH_WITH_HASH (ofproto, all_ofproto_dpifs_by_uuid_node,
>   uuid_hash(uuid), _ofproto_dpifs_by_uuid) {
>  if (uuid_equals(>uuid, uuid)) {
>
>  ---> if  ovs_refcount_try_ref(ofproto)
>
>  return ofproto;
>  }
>  }
>  return NULL;
>  }
>
> and after finish its usage, you have to do ofproto_unref the ofproto.
>
> This is why I said, most accessing to ofproto is ephemeral. If you
> change to use the pure refcounts solution, you have
> to add refcount and release it every time you access the ofproto. We
> should be more careful and remember to unref
> the ofproto.
>
> However, if using RCU and refcounts, in the above case, you do not
> need to change the code, since the RCU ensures that
> these ephemeral accesses are safe.

How do we know which access is ephemeral, so no refcount is needed,
and which access is not, so we have to add refcount?

>
> you only need to add refcount, when you find that the pointer to
> ofproto lives longer than one grace period.
>
>
> This is why in my patch, I do not add ref to ofproto after its
> creation. I agree the patch is not complete and has issues,
> and understand it could confuse people if changes into mix RCU and
> refcounts version.
>
I see, thanks!
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 0/5] XDP offload using flow API provider

2021-02-25 Thread William Tu
On Thu, Feb 25, 2021 at 7:26 AM Toshiaki Makita
 wrote:
>
> On 2021/02/24 3:37, William Tu wrote:
> >>>>> I don't know if this is too much to ask for.
> >>>>> I wonder if you, or we can work together, to add at least a tunnel
> >>>>> support, ex: vxlan?
> >>>>> The current version is a good prototype for people to test an L2/L3
> >>>>> XDP offload switch,
> >>>>> but without a good use case, it's hard to attract more people to
> >>>>> contribute or use it.
> >>>>
> >>>> I think we have discussed this before.
> >>>> Vxlan or other tunneling is indeed important, but that's not 
> >>>> straightforward.
> >>>> Push is easy, but pop is not. Pop requires two rules and recirculation.
> >>>> Recirculation is highly likely to cause eBPF 1M insn limit error.
> >>>
> >>> Recirculation is pretty important. For example connection tracking also
> >>> relies on recirc. Can we break into multiple program and tail call?
> >>> For recirc action, can we tail call the main ebpf program, and let the
> >>> packet goes through parse/megaflow lookup/action?
> >>
> >> OK, will try using tail calls.
> >> This will require vswitchd to load another bpf program for tail calls.
> >> I guess such a program can be specified in main bpf program meta data.
> >> I'll check if it works.
> >>
> >
> > Do you know whether using bpf function call helps solving the
> > 1M insn limit or stack limitation?
> > https://lwn.net/Articles/741773/
>
> Looking at check_func_call() in verifier.c, it seems not.
>
> > IIUC, If we have a flow that requires executing multiple actions,
> > using bpf function call can save the stack size, but the total
> > instruction limit is still 1M.
> > When using tail call, we have more work to do to break into
> > individual ebpf program, but each program can have 1M insn
> > and stack size.
>
> Yes I think tail calls do help.
> But on the second thought, I think I should identify the verifier logic
> which causes consumption of that many instructions. Then we can determine if
> we have another option to mitigate that.
> E.g. The problem happens around loops. Loop unrolling may help mitigating it 
> or not?
>
I thought the problem is loop unrolling?
because of loop unrolling, the number of instructions increases a lot,
and might cause over 1M insn.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-25 Thread William Tu
On Thu, Feb 25, 2021 at 4:54 AM Ilya Maximets  wrote:
>
> On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote:
> > Refcount and RCU are not mutually exclusive.
> > IMO, the main reason for this problem is that the rule uses both refcount 
> > and rcu, while the ofproto uses only rcu, and the rule retains the pointer 
> > of the ofproto. More importantly, ofproto cannot guarantee a longer grace 
> > period than the rule.
> >
>
> I understand that refcount and RCU are not mutually exclusive.
> However, in this particular case RCU for ofproto was introduced for one
> and only one reason - to avoid freeing ofproto while rules are still
> alive.  This was done in commit f416c8d61601 ("ofproto: RCU postpone
> rule destruction.").  The goal was to allow using rules without
> refcounting them within a single grace period.  And that forced us
> to postpone destruction of the ofproto for a single grace period.
> Later commit 39c9459355b6 ("Use classifier versioning.") made it
> possible for rules to be alive for more than one grace period, so
> the commit made ofproto wait for 2 grace periods by double postponing.
> As we can see now, that wasn't enough and we have to wait for more
> than 2 grace periods in certain cases.
>
> Now we're introducing refcounts to wait for all rules to be destroyed
> before destroying the ofproto.  But what is the point of having
> RCU if we already know that if refcount is zero than all rules are
> already destroyed and we don't need to wait anything and could just
> destroy the ofproto?
>
> RCU doesn't protect ofproto itself here, it actually protects rules,
> i.e. keeps ofproto alive while rules alive, but we can fully replace
> this with refcounts and I see no value in having RCU additionally.
>
> To have a full picture: right now we also have groups protected by
> RCU, so we need to refcount ofproto for them too.  But that is almost
> exactly same situation as we have with rules.
>
Thanks Guohongzhi, Hepeng, and Ilya, I learned a lot from your discussions.

I think refcount and RCU are mutually exclusive. They are two different ways of
doing synchronization and somehow we mix them together by using RCU to
optimize refcount.

Before the commit f416c8d61601 ("ofproto: RCU postpone rule destruction."),
we are using refcount to protect rules, and as a result every time OVS
references
a rule, we have to take refcount. It works ok but this probably has
high performance
overhead because it's doing atomic operations.

So the commit f416c8d61601 optimizes it by doing
1) not taking refcount of rule "within a grace period"
2) introduce RCU for rule
The assumption is that a grace period is like refcount == 0, so it's
valid to do so.
A side effect is that ofproto_destroy__()  needs to be postponed.
Note that if a rule is alive across grace period, it has to take refcount.

However, later commit 39c9459355b6 ("Use classifier versioning.")
makes rule alive across grace period. It breaks the first commit's assumption
and it has to introduce double postponing for ofproto, which we found
out is the problem now.

Since my point is RCU and refcount are mutually exclusive (A grace period
is like refcount ==0) , I agree with Ilya that we can just use refcount to fix
ofproto issue.

Regards,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [ovs-discuss] Do you know how I can set nd_options_type field for ipv6 ND message?

2021-02-24 Thread William Tu
On Tue, Feb 23, 2021 at 6:03 PM Yi Yang (杨燚)-云服务集团  wrote:
>
> Out of curious, I remember OVN doesn't support OVS DPDK, I believe OVN also 
> does IPv6 ND by openflow, is it acceptable to use slow path to handle IPv6 ND 
> for OVS kernel datapath?
>
Maybe OVN uses IPv6 ND, but not setting/matching 'nd_options_type'?
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 6/9] conntrack: Do not schedule zero ms timers

2021-02-24 Thread William Tu
On Tue, Feb 23, 2021 at 4:35 PM Gaëtan Rivet  wrote:
>
> On Tue, Feb 23, 2021, at 22:56, William Tu wrote:
> > On Wed, Feb 17, 2021 at 8:34 AM Gaetan Rivet  wrote:
> > >
> > > When ct_sweep() is far behind on its work, the 'next_wake' returned can
> > > be before the moment it started. When it happens, the thread schedules a
> > > zero ms timer that is logged as an error.
> > >
> > > Instead, mark the thread for immediate wake in the next poll_block().
> > >
> > > Signed-off-by: Gaetan Rivet 
> > > Reviewed-by: Eli Britstein 
> > > ---
> >
> > Looks ok to me.
> > I guess previously we don't want to clean too often, so there is
> > a minimal CT_CLEAN_MIN_INTERVAL.
> > With this change, we might end up busy doing ct_sweep() and
> > hit 100% cpu?
> >
>
> Yes, this patch and the next will remove CT_CLEAN_MIN_INTERVAL.
> The thread will still call poll_block() and quiesce, but it has the potential 
> to hog the core if work is still needed.
>
> Is it an issue? If so instead it could yield for a short time.
> The main idea is to avoid waiting for 200 ms, as it is not useful now.
>
I don't have a strong opinion on this.
It's just sometimes customers complain OVS is using too much CPU.
And they start to look at which process/thread is using 100%.
Maybe we can find a balance between a reasonable backlog, accuracy of
timeout, and cpu utilizaion.

William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 1/9] conntrack: Use rcu-lists to store conn expirations

2021-02-24 Thread William Tu
On Tue, Feb 23, 2021 at 4:34 PM Gaëtan Rivet  wrote:
>
>
>
> On Tue, Feb 23, 2021, at 22:55, William Tu wrote:
> > Hi Gaetan,
> >
> > Thanks for the patch, looks very useful.
> > I haven't tested it yet.
> > Minor question/comments inline.
>
> Hi William, thanks for taking a look!
>
> [...]
> > >
> > > +/* Timeouts: all the possible timeout states passed to 
> > > update_expiration()
> > > + * are listed here. The name will be prefix by CT_TM_ and the value is in
> > > + * milliseconds */
> > > +#define CT_TIMEOUTS \
> > > +CT_TIMEOUT(TCP_FIRST_PACKET) \
> > > +CT_TIMEOUT(TCP_OPENING) \
> > > +CT_TIMEOUT(TCP_ESTABLISHED) \
> > > +CT_TIMEOUT(TCP_CLOSING) \
> > > +CT_TIMEOUT(TCP_FIN_WAIT) \
> > > +CT_TIMEOUT(TCP_CLOSED) \
> > > +CT_TIMEOUT(OTHER_FIRST) \
> > > +CT_TIMEOUT(OTHER_MULTIPLE) \
> > > +CT_TIMEOUT(OTHER_BIDIR) \
> > > +CT_TIMEOUT(ICMP_FIRST) \
> > > +CT_TIMEOUT(ICMP_REPLY)
> > > +
> > > +enum ct_timeout {
> > > +#define CT_TIMEOUT(NAME) CT_TM_##NAME,
> > > +CT_TIMEOUTS
> > > +#undef CT_TIMEOUT
> > > +N_CT_TM
> > > +};
> > > +
> > any reason for moving this macro to here?
> >
>
> I embed the enum ct_timeout within the conn expiration node, so I must have 
> it defined first.
> I did not want to put too much code between 'ct_conn_type' below and the 
> 'conn' struct itself,
> which is why I put 'ct_timeout' just above.
>
> Otherwise, this could be avoided by linking the rculist pointer directly 
> instead of storing its index.
> But I thought it would be unwise as it would more tightly couple the list 
> type to the expiration node, and it means taking a full pointer vs. storing 
> an enum.
>

I see, make sense, thanks!

> > >  enum OVS_PACKED_ENUM ct_conn_type {
> > >  CT_CONN_TYPE_DEFAULT,
> > >  CT_CONN_TYPE_UN_NAT,
> > >  };
> > >
> > > +struct conn_expire {
> > > +/* Set once when initializing the expiration node. */
> > > +struct conntrack *ct;
> > > +/* Timeout state of the connection.
> > > + * It follows the connection state updates.
> > > + */
> > > +enum ct_timeout tm;
> > > +/* Insert and remove the expiration node only once per RCU syncs.
> > > + * If multiple threads update the connection, its expiration should
> > > + * be removed only once and added only once to timeout lists.
> > > + */
> > > +atomic_flag insert_once;
> > > +atomic_flag remove_once;
> > > +struct rculist node;
> > > +};
> > > +
> > >  struct conn {
> > >  /* Immutable data. */
> > >  struct conn_key key;
> > >  struct conn_key rev_key;
> > >  struct conn_key parent_key; /* Only used for orig_tuple support. */
> > > -struct ovs_list exp_node;
> > >  struct cmap_node cm_node;
> > >  struct nat_action_info_t *nat_info;
> > >  char *alg;
> > > @@ -104,6 +143,7 @@ struct conn {
> > >
> > >  /* Mutable data. */
> > >  struct ovs_mutex lock; /* Guards all mutable fields. */
> > > +struct conn_expire exp;
> > >  ovs_u128 label;
> > >  long long expiration;
> > >  uint32_t mark;
> > > @@ -132,33 +172,10 @@ enum ct_update_res {
> > >  CT_UPDATE_VALID_NEW,
> > >  };
> > >
> > > -/* Timeouts: all the possible timeout states passed to 
> > > update_expiration()
> > > - * are listed here. The name will be prefix by CT_TM_ and the value is in
> > > - * milliseconds */
> > > -#define CT_TIMEOUTS \
> > > -CT_TIMEOUT(TCP_FIRST_PACKET) \
> > > -CT_TIMEOUT(TCP_OPENING) \
> > > -CT_TIMEOUT(TCP_ESTABLISHED) \
> > > -CT_TIMEOUT(TCP_CLOSING) \
> > > -CT_TIMEOUT(TCP_FIN_WAIT) \
> > > -CT_TIMEOUT(TCP_CLOSED) \
> > > -CT_TIMEOUT(OTHER_FIRST) \
> > > -CT_TIMEOUT(OTHER_MULTIPLE) \
> > > -CT_TIMEOUT(OTHER_BIDIR) \
> > > -CT_TIMEOUT(ICMP_FIRST) \
> > > -CT_TIMEOUT(ICMP_REPLY)
> > > -
> > > -enum ct_timeout {
> > > -#define CT_TIMEOUT(NAME) CT_TM_##NAME,
> > > -CT_TIMEOUTS
> > > -#undef CT_TIMEOUT
> > > -N_CT_TM
> > > -};
> > > -
> > >  struct conntrack {
> > >  struct ovs_mutex ct_lock; /* Prot

Re: [ovs-dev] [PATCH v1 9/9] conntrack: Use an atomic conn expiration value

2021-02-23 Thread William Tu
On Wed, Feb 17, 2021 at 8:34 AM Gaetan Rivet  wrote:
>
> A lock is taken during conn_lookup() to check whether a connection is
> expired before returning it. This lock can have some contention.
>
> Even though this lock ensures a consistent sequence of writes, it does
> not imply a specific order. A ct_clean thread taking the lock first
> could read a value that would be updated immediately after by a PMD
> waiting on the same lock, just as well as the opposite order.
>
> As such, the expiration time can be stale anytime it is read. In this
> context, using an atomic will ensure the same write consistency while
> keeping the same (lack of) guarantee for reads. Reading the atomic will
> however be less costly than taking and releasing the lock.
>
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
LGTM! thanks
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 6/9] conntrack: Do not schedule zero ms timers

2021-02-23 Thread William Tu
On Wed, Feb 17, 2021 at 8:34 AM Gaetan Rivet  wrote:
>
> When ct_sweep() is far behind on its work, the 'next_wake' returned can
> be before the moment it started. When it happens, the thread schedules a
> zero ms timer that is logged as an error.
>
> Instead, mark the thread for immediate wake in the next poll_block().
>
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---

Looks ok to me.
I guess previously we don't want to clean too often, so there is
a minimal CT_CLEAN_MIN_INTERVAL.
With this change, we might end up busy doing ct_sweep() and
hit 100% cpu?



>  lib/conntrack.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 5aad64994..71f79a790 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -1628,6 +1628,8 @@ clean_thread_main(void *f_)
>  next_wake = conntrack_clean(ct, now);
>
>  if (next_wake < now) {
> +poll_immediate_wake();
> +} else if (next_wake < now + CT_CLEAN_MIN_INTERVAL) {
>  poll_timer_wait_until(now + CT_CLEAN_MIN_INTERVAL);
>  } else {
>  poll_timer_wait_until(MAX(next_wake, now + CT_CLEAN_INTERVAL));
> --
> 2.30.0
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 7/9] conntrack: Do not rate limit ct-sweep

2021-02-23 Thread William Tu
On Wed, Feb 17, 2021 at 8:34 AM Gaetan Rivet  wrote:
>
> The current rate limit is set to allow other threads to update the
> connections when applicable. This was valid when taking the 'ct_lock'
> was needed with a global critical section.
>
> Now that the size of the critical section for 'ct_lock' is reduced, it
> is not necessary to rate limit calls to ct_sweep() anymore.
>
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/conntrack.c | 24 +++-
>  1 file changed, 7 insertions(+), 17 deletions(-)
>
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 71f79a790..1b21b79bd 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -1602,20 +1602,12 @@ conntrack_clean(struct conntrack *ct, long long now)
>   * there is an actual connection that expires, or because a new connection
>   * might be created with the minimum timeout).
>   *
> - * The logic below has two goals:
> - *
> - * - We want to reduce the number of wakeups and batch connection cleanup
> - *   when the load is not very high.  CT_CLEAN_INTERVAL ensures that if we
> - *   are coping with the current cleanup tasks, then we wait at least
> - *   5 seconds to do further cleanup.
> - *
> - * - We don't want to keep the map locked too long, as we might prevent
> - *   traffic from flowing.  CT_CLEAN_MIN_INTERVAL ensures that if cleanup is
> - *   behind, there is at least some 200ms blocks of time when the map will be
> - *   left alone, so the datapath can operate unhindered.
> + * We want to reduce the number of wakeups and batch connection cleanup
> + * when the load is not very high.  CT_CLEAN_INTERVAL ensures that if we
> + * are coping with the current cleanup tasks, then we wait at least
> + * 5 seconds to do further cleanup.
>   */

IIUC, it's either wait for next 5-second interval, or keep cleaning when behind.
It depends on how fine grained people program the timeout value.
If users program s.t like 2-second, probably in reality it takes longer
to timeout.


William

>  #define CT_CLEAN_INTERVAL 5000 /* 5 seconds */
> -#define CT_CLEAN_MIN_INTERVAL 200  /* 0.2 seconds */
>
>  static void *
>  clean_thread_main(void *f_)
> @@ -1627,12 +1619,10 @@ clean_thread_main(void *f_)
>  long long now = time_msec();
>  next_wake = conntrack_clean(ct, now);
>
> -if (next_wake < now) {
> -poll_immediate_wake();
> -} else if (next_wake < now + CT_CLEAN_MIN_INTERVAL) {
> -poll_timer_wait_until(now + CT_CLEAN_MIN_INTERVAL);
> +if (next_wake > now) {
> +poll_timer_wait_until(MIN(next_wake, now + CT_CLEAN_INTERVAL));
>  } else {
> -poll_timer_wait_until(MAX(next_wake, now + CT_CLEAN_INTERVAL));
> +poll_immediate_wake();
>  }
>  latch_wait(>clean_thread_exit);
>  poll_block();
> --
> 2.30.0
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 8/9] conntrack: Do not log empty ct-sweep

2021-02-23 Thread William Tu
On Wed, Feb 17, 2021 at 8:34 AM Gaetan Rivet  wrote:
>
> Do not add noise to the DBG log for empty sweeps.
> Only log time taken when some connections were cleaned.
>
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---

LGTM
Acked-by: William Tu 


>  lib/conntrack.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 1b21b79bd..e042683aa 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -1574,8 +1574,10 @@ ct_sweep(struct conntrack *ct, long long now, size_t 
> limit)
>  }
>
>  out:
> -VLOG_DBG("conntrack cleanup %"PRIuSIZE" entries in %lld msec", count,
> - time_msec() - now);
> +if (count > 0) {
> +VLOG_DBG("conntrack cleanup %"PRIuSIZE" entries in %lld msec", count,
> + time_msec() - now);
> +}
>  return min_expiration;
>  }
>
> --
> 2.30.0
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 5/9] conntrack: Inverse conn and ct lock precedence

2021-02-23 Thread William Tu
On Wed, Feb 17, 2021 at 8:34 AM Gaetan Rivet  wrote:
>
> The lock priority order is for the global 'ct_lock' to be taken first
> and then 'conn->lock'. This is an issue, as multiple operations on
> connections are thus blocked between threads contending on the
> global 'ct_lock'.
>
> This was previously necessary due to how the expiration lists, timeout
> policies and zone limits were managed. They are now using RCU-friendly
> structures that allow concurrent readers. The mutual exclusion now only
> needs to happen during writes.
>
> This allows reducing the 'ct_lock' precedence, and to only take it
> when writing the relevant structures. This will reduce contention on
> 'ct_lock', which impairs scalability when the connection tracker is
> used by many threads.
>
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---

Thanks! it's pretty cool to see locks being removed or optimized.
The changes make sense to me. But maybe others should also
take a second look.


William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 4/9] conntrack-tp: Use a cmap to store timeout policies

2021-02-23 Thread William Tu
On Wed, Feb 17, 2021 at 8:34 AM Gaetan Rivet  wrote:
>
> Multiple lookups are done to stored timeout policies, each time blocking
> the global 'ct_lock'. This is usually not necessary and it should be
> acceptable to get policy updates slightly delayed (by one RCU sync
> at most). Using a CMAP reduces multiple lock taking and releasing in
> the connection insertion path.
>
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---

Read through the patch and everything looks good to me
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 2/9] conntrack: Use a cmap to store zone limits

2021-02-23 Thread William Tu
Thanks, I have one question inline.

On Wed, Feb 17, 2021 at 8:34 AM Gaetan Rivet  wrote:
>
> Change the data structure from hmap to cmap for zone limits.
> As they are shared amongst multiple conntrack users, multiple
> readers want to check the current zone limit state before progressing in
> their processing. Using a CMAP allows doing lookups without taking the
> global 'ct_lock', thus reducing contention.
>
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/conntrack-private.h |  2 +-
>  lib/conntrack.c | 72 -
>  lib/conntrack.h |  2 +-
>  lib/dpif-netdev.c   |  5 +--
>  4 files changed, 55 insertions(+), 26 deletions(-)
>
> diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
> index 4b6f9eae3..f2cbf657e 100644
> --- a/lib/conntrack-private.h
> +++ b/lib/conntrack-private.h
> @@ -176,7 +176,7 @@ struct conntrack {
>  struct ovs_mutex ct_lock; /* Protects 2 following fields. */
>  struct cmap conns OVS_GUARDED;
>  struct rculist exp_lists[N_CT_TM] OVS_GUARDED;
> -struct hmap zone_limits OVS_GUARDED;
> +struct cmap zone_limits OVS_GUARDED;
>  struct hmap timeout_policies OVS_GUARDED;
>  uint32_t hash_basis; /* Salt for hashing a connection key. */
>  pthread_t clean_thread; /* Periodically cleans up connection tracker. */
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index ac12f9196..c218200dc 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -79,7 +79,7 @@ enum ct_alg_ctl_type {
>  };
>
>  struct zone_limit {
> -struct hmap_node node;
> +struct cmap_node node;
>  struct conntrack_zone_limit czl;
>  };
>
> @@ -303,7 +303,7 @@ conntrack_init(void)
>  for (unsigned i = 0; i < ARRAY_SIZE(ct->exp_lists); i++) {
>  rculist_init(>exp_lists[i]);
>  }
> -hmap_init(>zone_limits);
> +cmap_init(>zone_limits);
>  ct->zone_limit_seq = 0;
>  timeout_policy_init(ct);
>  ovs_mutex_unlock(>ct_lock);
> @@ -339,12 +339,25 @@ zone_key_hash(int32_t zone, uint32_t basis)
>  }
>
>  static struct zone_limit *
> -zone_limit_lookup(struct conntrack *ct, int32_t zone)
> +zone_limit_lookup_protected(struct conntrack *ct, int32_t zone)
>  OVS_REQUIRES(ct->ct_lock)
>  {
>  uint32_t hash = zone_key_hash(zone, ct->hash_basis);
>  struct zone_limit *zl;
> -HMAP_FOR_EACH_IN_BUCKET (zl, node, hash, >zone_limits) {
> +CMAP_FOR_EACH_WITH_HASH_PROTECTED (zl, node, hash, >zone_limits) {
> +if (zl->czl.zone == zone) {
> +return zl;
> +}
> +}
> +return NULL;
> +}
> +
> +static struct zone_limit *
> +zone_limit_lookup(struct conntrack *ct, int32_t zone)
> +{
> +uint32_t hash = zone_key_hash(zone, ct->hash_basis);
> +struct zone_limit *zl;
> +CMAP_FOR_EACH_WITH_HASH (zl, node, hash, >zone_limits) {
>  if (zl->czl.zone == zone) {
>  return zl;
>  }
> @@ -354,7 +367,6 @@ zone_limit_lookup(struct conntrack *ct, int32_t zone)
>
>  static struct zone_limit *
>  zone_limit_lookup_or_default(struct conntrack *ct, int32_t zone)
> -OVS_REQUIRES(ct->ct_lock)
>  {
>  struct zone_limit *zl = zone_limit_lookup(ct, zone);
>  return zl ? zl : zone_limit_lookup(ct, DEFAULT_ZONE);
> @@ -363,13 +375,16 @@ zone_limit_lookup_or_default(struct conntrack *ct, 
> int32_t zone)
>  struct conntrack_zone_limit
>  zone_limit_get(struct conntrack *ct, int32_t zone)
>  {
> -ovs_mutex_lock(>ct_lock);
> -struct conntrack_zone_limit czl = {DEFAULT_ZONE, 0, 0, 0};
> +struct conntrack_zone_limit czl = {
> +.zone = DEFAULT_ZONE,
> +.limit = 0,
> +.count = ATOMIC_COUNT_INIT(0),
> +.zone_limit_seq = 0,
> +};
>  struct zone_limit *zl = zone_limit_lookup_or_default(ct, zone);
>  if (zl) {
>  czl = zl->czl;
>  }
> -ovs_mutex_unlock(>ct_lock);
>  return czl;
>  }
>
> @@ -377,13 +392,19 @@ static int
>  zone_limit_create(struct conntrack *ct, int32_t zone, uint32_t limit)
>  OVS_REQUIRES(ct->ct_lock)
>  {
> +struct zone_limit *zl = zone_limit_lookup_protected(ct, zone);
> +
> +if (zl) {
> +return 0;
> +}
> +
>  if (zone >= DEFAULT_ZONE && zone <= MAX_ZONE) {
> -struct zone_limit *zl = xzalloc(sizeof *zl);
> +zl = xzalloc(sizeof *zl);
>  zl->czl.limit = limit;
>  zl->czl.zone = zone;
>  zl->czl.zone_limit_seq = ct->zone_limit_seq++;
>  uint32_t hash = zone_key_hash(zone, ct->hash_basis);
> -hmap_insert(>zone_limits, >node, hash);
> +cmap_insert(>zone_limits, >node, hash);
>  return 0;
>  } else {
>  return EINVAL;
> @@ -394,13 +415,14 @@ int
>  zone_limit_update(struct conntrack *ct, int32_t zone, uint32_t limit)
>  {
>  int err = 0;
> -ovs_mutex_lock(>ct_lock);
>  struct zone_limit *zl = zone_limit_lookup(ct, zone);
>  if (zl) {
>  zl->czl.limit = limit;
>  VLOG_INFO("Changed 

Re: [ovs-dev] [PATCH v1 3/9] conntrack: Init hash basis first at creation

2021-02-23 Thread William Tu
On Wed, Feb 17, 2021 at 8:34 AM Gaetan Rivet  wrote:
>
> The 'hash_basis' field is used sometimes during sub-systems init
> routine. It will be 0 by default before randomization. Sub-systems would
> then init some nodes with incorrect hash values.
>
> The timeout policies module is affected, making the default policy being
> referenced using an incorrect hash value.
>
> Fixes: 2078901a4c14 ("userspace: Add conntrack timeout policy support.")
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---

LGTM, thanks for the fix.
Acked-by: William Tu 


>  lib/conntrack.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index c218200dc..8062dcd6b 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -291,6 +291,11 @@ conntrack_init(void)
>  static struct ovsthread_once setup_l4_once = OVSTHREAD_ONCE_INITIALIZER;
>  struct conntrack *ct = xzalloc(sizeof *ct);
>
> +/* This value can be used during init (e.g. timeout_policy_init()),
> + * set it first to ensure it is available.
> + */
> +ct->hash_basis = random_uint32();
> +
>  ovs_rwlock_init(>resources_lock);
>  ovs_rwlock_wrlock(>resources_lock);
>  hmap_init(>alg_expectations);
> @@ -308,7 +313,6 @@ conntrack_init(void)
>  timeout_policy_init(ct);
>  ovs_mutex_unlock(>ct_lock);
>
> -ct->hash_basis = random_uint32();
>  atomic_count_init(>n_conn, 0);
>  atomic_init(>n_conn_limit, DEFAULT_N_CONN_LIMIT);
>  atomic_init(>tcp_seq_chk, true);
> --
> 2.30.0
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 1/9] conntrack: Use rcu-lists to store conn expirations

2021-02-23 Thread William Tu
Hi Gaetan,

Thanks for the patch, looks very useful.
I haven't tested it yet.
Minor question/comments inline.

On Wed, Feb 17, 2021 at 8:34 AM Gaetan Rivet  wrote:
>
> Change the connection expiration lists from ovs_list to rculist.
> This is a pre-step towards reducing the granularity of 'ct_lock'.
>
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/conntrack-private.h | 76 +++--
>  lib/conntrack-tp.c  | 60 +---
>  lib/conntrack.c | 14 
>  3 files changed, 105 insertions(+), 45 deletions(-)
>
> diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
> index e8332bdba..4b6f9eae3 100644
> --- a/lib/conntrack-private.h
> +++ b/lib/conntrack-private.h
> @@ -29,6 +29,7 @@
>  #include "openvswitch/list.h"
>  #include "openvswitch/types.h"
>  #include "packets.h"
> +#include "rculist.h"
>  #include "unaligned.h"
>  #include "dp-packet.h"
>
> @@ -86,17 +87,55 @@ struct alg_exp_node {
>  bool nat_rpl_dst;
>  };
>
> +/* Timeouts: all the possible timeout states passed to update_expiration()
> + * are listed here. The name will be prefix by CT_TM_ and the value is in
> + * milliseconds */
> +#define CT_TIMEOUTS \
> +CT_TIMEOUT(TCP_FIRST_PACKET) \
> +CT_TIMEOUT(TCP_OPENING) \
> +CT_TIMEOUT(TCP_ESTABLISHED) \
> +CT_TIMEOUT(TCP_CLOSING) \
> +CT_TIMEOUT(TCP_FIN_WAIT) \
> +CT_TIMEOUT(TCP_CLOSED) \
> +CT_TIMEOUT(OTHER_FIRST) \
> +CT_TIMEOUT(OTHER_MULTIPLE) \
> +CT_TIMEOUT(OTHER_BIDIR) \
> +CT_TIMEOUT(ICMP_FIRST) \
> +CT_TIMEOUT(ICMP_REPLY)
> +
> +enum ct_timeout {
> +#define CT_TIMEOUT(NAME) CT_TM_##NAME,
> +CT_TIMEOUTS
> +#undef CT_TIMEOUT
> +N_CT_TM
> +};
> +
any reason for moving this macro to here?

>  enum OVS_PACKED_ENUM ct_conn_type {
>  CT_CONN_TYPE_DEFAULT,
>  CT_CONN_TYPE_UN_NAT,
>  };
>
> +struct conn_expire {
> +/* Set once when initializing the expiration node. */
> +struct conntrack *ct;
> +/* Timeout state of the connection.
> + * It follows the connection state updates.
> + */
> +enum ct_timeout tm;
> +/* Insert and remove the expiration node only once per RCU syncs.
> + * If multiple threads update the connection, its expiration should
> + * be removed only once and added only once to timeout lists.
> + */
> +atomic_flag insert_once;
> +atomic_flag remove_once;
> +struct rculist node;
> +};
> +
>  struct conn {
>  /* Immutable data. */
>  struct conn_key key;
>  struct conn_key rev_key;
>  struct conn_key parent_key; /* Only used for orig_tuple support. */
> -struct ovs_list exp_node;
>  struct cmap_node cm_node;
>  struct nat_action_info_t *nat_info;
>  char *alg;
> @@ -104,6 +143,7 @@ struct conn {
>
>  /* Mutable data. */
>  struct ovs_mutex lock; /* Guards all mutable fields. */
> +struct conn_expire exp;
>  ovs_u128 label;
>  long long expiration;
>  uint32_t mark;
> @@ -132,33 +172,10 @@ enum ct_update_res {
>  CT_UPDATE_VALID_NEW,
>  };
>
> -/* Timeouts: all the possible timeout states passed to update_expiration()
> - * are listed here. The name will be prefix by CT_TM_ and the value is in
> - * milliseconds */
> -#define CT_TIMEOUTS \
> -CT_TIMEOUT(TCP_FIRST_PACKET) \
> -CT_TIMEOUT(TCP_OPENING) \
> -CT_TIMEOUT(TCP_ESTABLISHED) \
> -CT_TIMEOUT(TCP_CLOSING) \
> -CT_TIMEOUT(TCP_FIN_WAIT) \
> -CT_TIMEOUT(TCP_CLOSED) \
> -CT_TIMEOUT(OTHER_FIRST) \
> -CT_TIMEOUT(OTHER_MULTIPLE) \
> -CT_TIMEOUT(OTHER_BIDIR) \
> -CT_TIMEOUT(ICMP_FIRST) \
> -CT_TIMEOUT(ICMP_REPLY)
> -
> -enum ct_timeout {
> -#define CT_TIMEOUT(NAME) CT_TM_##NAME,
> -CT_TIMEOUTS
> -#undef CT_TIMEOUT
> -N_CT_TM
> -};
> -
>  struct conntrack {
>  struct ovs_mutex ct_lock; /* Protects 2 following fields. */
>  struct cmap conns OVS_GUARDED;
> -struct ovs_list exp_lists[N_CT_TM] OVS_GUARDED;
> +struct rculist exp_lists[N_CT_TM] OVS_GUARDED;
>  struct hmap zone_limits OVS_GUARDED;
>  struct hmap timeout_policies OVS_GUARDED;
>  uint32_t hash_basis; /* Salt for hashing a connection key. */
> @@ -204,4 +221,13 @@ struct ct_l4_proto {
> struct ct_dpif_protoinfo *);
>  };
>
> +static inline void
> +conn_expire_remove(struct conn_expire *exp)
> +{
> +if (!atomic_flag_test_and_set(>remove_once)
> +&& rculist_next(>node)) {
> +rculist_remove(>node);
> +}
> +}
> +
>  #endif /* conntrack-private.h */
> diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
> index a586d3a8d..30ba4bda8 100644
> --- a/lib/conntrack-tp.c
> +++ b/lib/conntrack-tp.c
> @@ -230,6 +230,50 @@ tm_to_ct_dpif_tp(enum ct_timeout tm)
>  return CT_DPIF_TP_ATTR_MAX;
>  }
>
> +static void
> +conn_expire_init(struct conn *conn, struct conntrack *ct)
> +{
> +struct conn_expire *exp = >exp;
> +
> +if (exp->ct != NULL) {
> +return;
> +}
> +
> +

Re: [ovs-dev] [PATCH v4 3/5] netdev-offload: Add xdp flow api provider

2021-02-23 Thread William Tu
> >>> +
> >>> +return 0;
> >>> +}
> >>> +#endif
> >>> +
> >>>   static int
> >>>   xsk_load_prog(struct netdev *netdev, const char *path,
> >>> struct bpf_object **pobj, int *prog_fd)
> >>>   {
> >>> +struct netdev_linux *dev OVS_UNUSED = netdev_linux_cast(netdev);
> >>>   struct bpf_object_open_attr attr = {
> >>>   .prog_type = BPF_PROG_TYPE_XDP,
> >>>   .file = path,
> >>> @@ -298,6 +496,14 @@ xsk_load_prog(struct netdev *netdev, const char 
> >>> *path,
> >>>   goto err;
> >>>   }
> >>>
> >>> +#ifdef HAVE_XDP_OFFLOAD
> >>> +if (!xdp_preload(netdev, obj)) {
> >>
> >> I forgot what's the purpose of doing xdp_preload, before we call
> >> bpf_object__load below.
> >> Does it just to check whether the flowtable_afxdp.o has every map we want?
> >> Or it does initialize something else.
>
> - Create map-in-maps as they cannot be automatically created.
> - Create xsks_map with the proper n_rxq parameter.
> - Use global output_map table instead of creating new one.
> All of them need to be done before bpf_object__load.

Now I understand, thanks!

>
> >>> +VLOG_INFO("%s: Detected flowtable support in XDP program",
> >>> +  netdev_get_name(netdev));
> >>> +dev->has_xdp_flowtable = true;
> >>> +}
> >>> +#endif
> >>> +
> >>>   if (bpf_object__load(obj)) {
> >>>   VLOG_ERR("%s: Can't load XDP program at '%s'",
> >>>netdev_get_name(netdev), path);
> >>> @@ -1297,7 +1503,17 @@ libbpf_print(enum libbpf_print_level level,
> >>>
> >>>   int netdev_afxdp_init(void)
> >>>   {
> >>> -libbpf_set_print(libbpf_print);
> >>> +static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> >>> +
> >>> +if (ovsthread_once_start()) {
> >>> +libbpf_set_print(libbpf_print);
> >>> +#ifdef HAVE_XDP_OFFLOAD
> >>> +if (netdev_register_flow_api_provider(_offload_xdp)) {
> >>> +VLOG_WARN("Failed to register XDP flow api provider");
> >>> +}
> >>> +#endif
> >>> +ovsthread_once_done();
> >>> +}
> >>>   return 0;
> >>>   }
> >>>
> >>> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> >>> index e91cd102d..324152e8f 100644
> >>> --- a/lib/netdev-afxdp.h
> >>> +++ b/lib/netdev-afxdp.h
> >>> @@ -44,6 +44,7 @@ struct netdev_stats;
> >>>   struct smap;
> >>>   struct xdp_umem;
> >>>   struct xsk_socket_info;
> >>> +struct bpf_object;
> >>>
> >>>   int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
> >>>   void netdev_afxdp_rxq_destruct(struct netdev_rxq *rxq_);
> >>> @@ -70,6 +71,8 @@ int netdev_afxdp_get_custom_stats(const struct netdev 
> >>> *netdev,
> >>>   void free_afxdp_buf(struct dp_packet *p);
> >>>   int netdev_afxdp_reconfigure(struct netdev *netdev);
> >>>   void signal_remove_xdp(struct netdev *netdev);
> >>> +bool has_xdp_flowtable(struct netdev *netdev);
> >>> +struct bpf_object *get_xdp_object(struct netdev *netdev);
> >>>
> >>>   #else /* !HAVE_AF_XDP */
> >>>
> >> Thanks
> >> William
> >
> > btw, looking at lib/netdev-offload-xdp.c
> > the netdev_xdp_flow_get is not implemented.
> > Does that mean we will not detect flow that alread exists and always
> > replace with the new flow?
>
> IIUC flows are inserted when upcall happens. It does not use flow_get API and
> just insert flows.
> Without flow_get, we cannot use dpctl/get-flow or we cannot get flows stats.
> But implementing stats is not a trivial work as we need to prepare per-cpu 
> counters
> for flows so we don't slow down the XDP program. So I left it as a future 
> work.

I think we might want to consider adding it.
Because once people start to use XDP datapath, we will surely want to know
which flows are processed in XDP and which are not.

btw, I tested your patch again, and after ping between two namespaces,
then $ovs-ofctl dump-flows
show correct stats. I don't understand why without flow_get, it still shows
correct flow stats...

Thanks!
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 0/5] XDP offload using flow API provider

2021-02-23 Thread William Tu
> >>> I don't know if this is too much to ask for.
> >>> I wonder if you, or we can work together, to add at least a tunnel
> >>> support, ex: vxlan?
> >>> The current version is a good prototype for people to test an L2/L3
> >>> XDP offload switch,
> >>> but without a good use case, it's hard to attract more people to
> >>> contribute or use it.
> >>
> >> I think we have discussed this before.
> >> Vxlan or other tunneling is indeed important, but that's not 
> >> straightforward.
> >> Push is easy, but pop is not. Pop requires two rules and recirculation.
> >> Recirculation is highly likely to cause eBPF 1M insn limit error.
> >
> > Recirculation is pretty important. For example connection tracking also
> > relies on recirc. Can we break into multiple program and tail call?
> > For recirc action, can we tail call the main ebpf program, and let the
> > packet goes through parse/megaflow lookup/action?
>
> OK, will try using tail calls.
> This will require vswitchd to load another bpf program for tail calls.
> I guess such a program can be specified in main bpf program meta data.
> I'll check if it works.
>

Do you know whether using bpf function call helps solving the
1M insn limit or stack limitation?
https://lwn.net/Articles/741773/

IIUC, If we have a flow that requires executing multiple actions,
using bpf function call can save the stack size, but the total
instruction limit is still 1M.
When using tail call, we have more work to do to break into
individual ebpf program, but each program can have 1M insn
and stack size.

William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [ovs-discuss] Do you know how I can set nd_options_type field for ipv6 ND message?

2021-02-23 Thread William Tu
On Sun, Feb 21, 2021 at 9:39 PM Yi Yang (杨燚)-云服务集团  wrote:
>
> Thanks William, it is my ovs-ofctl issue, my ovs-vswitchd is new, but 
> ovs-ofctl is old, but after I used ovs-ofctl, I still saw issues:
>
> OFPT_ERROR (OF1.3) (xid=0x6): OFPBAC_BAD_SET_ARGUMENT
> OFPT_FLOW_MOD (OF1.3) (xid=0x6): ADD 
> icmp6,icmp_type=135,icmp_code=0,nd_sll=22:70:e0:d0:fc:75 
> cookie:0x1234567890123456 
> actions=set_field:136->icmpv6_type,set_field:0->icmpv6_code,set_field:2->nd_options_type,resubmit(,0)
> OFPT_ERROR (OF1.3) (xid=0x6): OFPBAC_BAD_SET_ARGUMENT
> OFPT_FLOW_MOD (OF1.3) (xid=0x6): ADD icmp6,icmp_type=136,icmp_code=0 
> cookie:0x1234567890123456 
> actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],set_field:22:70:e0:d0:fc:76->eth_src,move:NXM_NX_IPV6_SRC[]->NXM_NX_IPV6_DST[],set_field:fe80::2070:e0ff:fed0:fc76->ipv6_src,set_field:22:70:e0:d0:fc:76->nd_tll,set_field:57344->nd_reserved,IN_PORT
>
> I saw this thread 
> https://www.mail-archive.com/ovs-dev@openvswitch.org/msg47815.html discussed 
> it,  it can work after I applied patch in 
> https://www.mail-archive.com/ovs-dev@openvswitch.org/msg47815.html.
>
> So @Flavio, maybe you need to apply the patch in 
> https://www.mail-archive.com/ovs-dev@openvswitch.org/msg47815.html to fix 
> this issue in kernel datapath, per my check, kernel datapath with this patch 
> can work, I verified it by using openflow to do NS reply and ICMPv6 ping 
> reply.
>
> William, I guess  you're using ovs userspace, so it doesn't have this issue.

Yes, I'm testing it using userspace datapath. I think kernel datapath
doesn't support setting nd_ext.
Look like you can get it working by doing it in slow path, like the
patch you pointed to.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [ovs-discuss] Do you know how I can set nd_options_type field for ipv6 ND message?

2021-02-21 Thread William Tu
On Sat, Feb 20, 2021 at 2:11 AM Yi Yang (杨燚)-云服务集团  wrote:
>
> Hi, folks
>
>
>
> I need to set nd_options_type to 2 for NS message to respond IPv6 NS, my flow 
> is below, why nd_options_type can’t be set? Per commit 
> 9b2b84973db76e1138d9234ff1b84bb6bb156011, it should work, what’s wrong? 
> Appreciate your help in advance, thank you.
>
>
>
> $ sudo ovs-ofctl -Oopenflow13 add-flow br-int 
> "table=0,ipv6,icmp6,icmp_type=135,icmp_code=0,nd_target=fe80::505c:cff:fe88:392f,nd_sll=52:5c:0c:88:39:2f,actions=set_field:136->icmpv6_type,set_field:0->icmpv6_code,move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],set_field:52:5c:0c:88:39:3f->eth_src,move:NXM_NX_IPV6_SRC[]->NXM_NX_IPV6_DST[],set_field:fe80::505c:cff:fe88:392f->ipv6_src,set_field:52:5c:0c:88:39:3f->nd_tll,set_field:2->nd_options_type,set_field:OxE000->nd_reserved,output:IN_PORT"
>
> ovs-ofctl: nd_options_type is not a valid OXM field name
>
> $ sudo ovs-ofctl -Oopenflow13 add-flow br-int 
> "table=0,ipv6,icmp6,icmp_type=135,icmp_code=0,nd_target=fe80::505c:cff:fe88:392f,nd_sll=52:5c:0c:88:39:2f,actions=set_field:136->icmpv6_type,set_field:0->icmpv6_code,move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],set_field:52:5c:0c:88:39:3f->eth_src,move:NXM_NX_IPV6_SRC[]->NXM_NX_IPV6_DST[],set_field:fe80::505c:cff:fe88:392f->ipv6_src,set_field:52:5c:0c:88:39:3f->nd_tll,load:2->ERICOXM_OF_ICMPV6_ND_OPTIONS_TYPE[],set_field:OxE000->nd_reserved,output:IN_PORT"
>
> ovs-ofctl: ERICOXM_OF_ICMPV6_ND_OPTIONS_TYPE[]: unknown field 
> `ERICOXM_OF_ICMPV6_ND_OPTIONS_TYPE'

I tested by doing
roos:~/ovs# ovs-ofctl add-flow br0 "in_port=1
icmp6,icmpv6_code=0,icmpv6_type=135
actions=set_field:2->nd_options_type, 2"
roos:~/ovs# ovs-ofctl dump-flows br0
 cookie=0x0, duration=8.492s, table=0, n_packets=0, n_bytes=0,
icmp6,in_port="afxdp-p0",icmp_type=135,icmp_code=0
actions=load:0x2->ERICOXM_OF_ICMPV6_ND_OPTIONS_TYPE[],output:2
 cookie=0x0, duration=1078.248s, table=0, n_packets=0, n_bytes=0,
priority=0 actions=NORMAL

Looks ok.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCHv2] connmgr: Check nullptr inside ofmonitor_report()

2021-02-21 Thread William Tu
On Fri, Feb 19, 2021 at 6:29 PM 贺鹏  wrote:
>
> Hi, Ilya
>
> Ilya Maximets  于2021年2月19日周五 下午7:19写道:
> >
> > On 2/19/21 3:12 AM, 贺鹏 wrote:
> > > Hi,
> > >
> > > Looks like this bug is caused by violating the fact that if a rule is
> > > referenced, the related ofproto should not be destroyed.
> > >
> > > If so, I have a patch that also fixes the problem, not sure if this helps.
> > >
> > > http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0...@bytedance.com/
> >
> > There is at least one more problem that is not strictly related but
> > in more or less the same part of the code:
> >   https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html
>
> So maybe before add it into *ovsrcu_postpone*, we should add refcount
> of ofproto also?
>
Yes, I think that will fix the issue. But are we able to find out all the
places that we need to add refcount of ofproto?

Looks like we might have multiple rcu postponed function that might
access the 'ofproto'. And 'ofproto' might be freed already and cause segfault.

Hepeng's patch fixes two places.
  
http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0...@bytedance.com/
Ilya pointed out another place
  https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html
yifeng's case is a little different (not due to ofproto = NULL, but
due to setting
the p->connmgr = NULL before postponed)

Regards,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCHv2] connmgr: Check nullptr inside ofmonitor_report()

2021-02-18 Thread William Tu
On Wed, Feb 17, 2021 at 1:09 PM Yifeng Sun  wrote:
>
> ovs-vswitchd could crash under these circumstances:
> 1. When one bridge is being destroyed, ofproto_destroy() is called and
> connmgr pointer of its ofproto struct is nullified. This ofproto struct is
> deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'.
> 2. Before RCU enters quiesce state to actually free this ofproto struct,
> revalidator thread calls udpif_revalidator(), which could handle
> a learn flow and calls ofproto_flow_mod_learn(), it later calls
> ofmonitor_report() and ofproto struct's connmgr pointer is accessed.
>
> The crash stack trace is shown below:
>
> 0  ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, 
> event=event@entry=NXFME_ADDED,
> reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, abbrev_xid=0, 
> old_actions=old_actions@entry=0x0)
> at ofproto/connmgr.c:2160
> 1  0x7fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, 
> ofm=, req=req@entry=0x0)
> at ofproto/ofproto.c:5221
> 2  0x7fa4d68036af in modify_flows_finish (req=0x0, ofm=0x7fa4980753f0, 
> ofproto=0x55d9075d4ab0)
> at ofproto/ofproto.c:5823
> 3  ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, 
> ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0)
> at ofproto/ofproto.c:8088
> 4  0x7fa4d680372d in ofproto_flow_mod_learn_finish 
> (ofm=ofm@entry=0x7fa4980753f0,
> orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439
> 5  0x7fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, 
> keep_ref=keep_ref@entry=true,
> limit=, below_limitp=below_limitp@entry=0x0) at 
> ofproto/ofproto.c:5499
> 6  0x7fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, 
> stats=stats@entry=0x7fa4d2701a10,
> offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:127
> 7  0x7fa4d6835e3a in xlate_push_stats (xcache=, 
> stats=stats@entry=0x7fa4d2701a10,
> offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:181
> 8  0x7fa4d6822046 in revalidate_ukey (udpif=udpif@entry=0x55d90760b240, 
> ukey=ukey@entry=0x7fa4b0191660,
> stats=stats@entry=0x7fa4d2705118, 
> odp_actions=odp_actions@entry=0x7fa4d2701b50,
> reval_seq=reval_seq@entry=5655486242, 
> recircs=recircs@entry=0x7fa4d2701b40, offloaded=false)
> at ofproto/ofproto-dpif-upcall.c:2294
> 9  0x7fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at 
> ofproto/ofproto-dpif-upcall.c:2683
> 10 0x7fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at 
> ofproto/ofproto-dpif-upcall.c:936
> 11 0x7fa4d6259c9f in ovsthread_wrapper (aux_=) at 
> lib/ovs-thread.c:423
> 12 0x7fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0
> 13 0x7fa4d504b96d in clone () from /usr/lib64/libc.so.6
>
> At the time of crash, the involved ofproto was already deallocated:
>
> (gdb) print *ofproto
> $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {...,
> one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ...
>
> This patch fixes it.
>
> VMware-BZ: #2700626
> Signed-off-by: Yifeng Sun 
> ---
> v1->v2: Add check for ofmonitor_flush, thanks William.
>

LGTM, thanks.
Acked-by: William Tu < u9012...@gmail.com>

CC Ilya and Ben to see if any comments.

I feel this kind of RCU issue is hard to find out, and
existing tools such as addressSanitizer are usually
not helpful.

William

>  ofproto/connmgr.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c
> index 9c5c633b4171..fa8f6cd0e83a 100644
> --- a/ofproto/connmgr.c
> +++ b/ofproto/connmgr.c
> @@ -2140,7 +2140,7 @@ ofmonitor_report(struct connmgr *mgr, struct rule *rule,
>   const struct rule_actions *old_actions)
>  OVS_REQUIRES(ofproto_mutex)
>  {
> -if (rule_is_hidden(rule)) {
> +if (!mgr || rule_is_hidden(rule)) {
>  return;
>  }
>
> @@ -2244,6 +2244,10 @@ ofmonitor_flush(struct connmgr *mgr)
>  {
>  struct ofconn *ofconn;
>
> +if (!mgr) {
> +return;
> +}
> +
>  LIST_FOR_EACH (ofconn, connmgr_node, >conns) {
>  struct rconn_packet_counter *counter = ofconn->monitor_counter;
>
> --
> 2.7.4
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCHv2] Documentation: Fix DPDK qos example.

2021-02-18 Thread William Tu
Fix the example use case based on the decription.
EIR and CIR are measured in bytes/sec and considered 64-byte
IP packets size withtout 14-byte Ethernet header.
So fix the 1000pps example by: (64 - 14) * 1000 = 50,000
If the frame includes 4-byte FCS header, then it's
(64 - 14 - 4) * 1000 = 46,000

Fixes: e61bdffc2a98 ("netdev-dpdk: Add new DPDK RFC 4115 egress policer")
Signed-off-by: William Tu 
---
 Documentation/topics/dpdk/qos.rst | 14 --
 vswitchd/vswitch.xml  |  6 --
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/topics/dpdk/qos.rst 
b/Documentation/topics/dpdk/qos.rst
index 103495415a9c..a98ec672fcf5 100644
--- a/Documentation/topics/dpdk/qos.rst
+++ b/Documentation/topics/dpdk/qos.rst
@@ -69,22 +69,24 @@ to prioritize certain traffic over others at a port level.
 
 For example, the following configuration will limit the traffic rate at a
 port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
-100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
-Information Rate). High priority traffic is routed to queue 10, which marks
+1000pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
+Information Rate). CIR and EIR are measured in bytes without Ethernet header.
+As a result, 1000pps means (64-byte - 14-byte) * 1000 = 50,000 in the
+configuration below. High priority traffic is routed to queue 10, which marks
 all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
 marked as EIR, i.e. Yellow::
 
 $ ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
 --id=@myqos create qos type=trtcm-policer \
-other-config:cir=52000 other-config:cbs=2048 \
-other-config:eir=52000 other-config:ebs=2048  \
+other-config:cir=5 other-config:cbs=2048 \
+other-config:eir=5 other-config:ebs=2048  \
 queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
  --id=@dpdk1Q10 create queue \
-  other-config:cir=4160 other-config:cbs=2048 \
+  other-config:cir=10 other-config:cbs=2048 \
   other-config:eir=0 other-config:ebs=0 -- \
  --id=@dpdk1Q20 create queue \
other-config:cir=0 other-config:cbs=0 \
-   other-config:eir=4160 other-config:ebs=2048 \
+   other-config:eir=5 other-config:ebs=2048
 
 This configuration accomplishes that the high priority traffic has a
 guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index a2ad84edefa9..4597a215d936 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -4660,7 +4660,8 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch 
options:peer=p1 \
 packets per second the CIR would be set to to to 4600. This value
 can be broken into '1,000,000 x 46'. Where 1,000,000 is the policing
 rate for the number of packets per second and 46 represents the size
-of the packet data for a 64 byte ip packet.
+of the packet data for a 64 bytes IP packet without 14 bytes Ethernet
+and 4 bytes FCS header.
   
   
 The Committed Burst Size (CBS) is measured in bytes and represents a
@@ -4681,7 +4682,8 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch 
options:peer=p1 \
 packets per second the EIR would be set to to to 4600. This value
 can be broken into '1,000,000 x 46'. Where 1,000,000 is the policing
 rate for the number of packets per second and 46 represents the size
-of the packet data for a 64 byte ip packet.
+of the packet data for a 64 bytes IP packet without 14 bytes Ethernet
+and 4 bytes FCS header.
   
   
 The Excess Burst Size (EBS) is measured in bytes and represents a
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Documentation: Fix DPDK qos example.

2021-02-18 Thread William Tu
On Thu, Feb 18, 2021 at 6:55 AM Eelco Chaudron  wrote:
>
>
>
> On 18 Feb 2021, at 15:40, William Tu wrote:
>
> > On Thu, Feb 18, 2021 at 1:23 AM Eelco Chaudron 
> > wrote:
> >>
> >>
> >>
> >> On 17 Feb 2021, at 18:41, William Tu wrote:
> >>
> >>>>
> >>>>>>  Information Rate). High priority traffic is routed to queue 10,
> >>>>>> which marks
> >>>>>>  all traffic as CIR, i.e. Green. All low priority traffic, queue
> >>>>>> 20,
> >>>>>> is
> >>>>>>  marked as EIR, i.e. Yellow::
> >>>>>>
> >>>>>>  $ ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
> >>>>>>  --id=@myqos create qos type=trtcm-policer \
> >>>>>> -other-config:cir=52000 other-config:cbs=2048 \
> >>>>>> -other-config:eir=52000 other-config:ebs=2048  \
> >>>>
> >>>> 52000 is fine as our documentation states cir, eir are in bytes per
> >>>> second, minus the ethernet header.
> >>>> So (64-12) * 1000 = 52000
> >>>
> >>> How come it's not minus 14-byte ethernet header?
> >>
> >> You are right, it should be 14, miscounted the rte_ether_hdr
> >> structure size :(
> >>
> > Thank you!
> > at vswitchd/vswitch.xml, it uses "46", should I change it to "64- 14 =
> > 50"?
> > calculated by (pps x packet data size).  For example assuming
> > a user
> > wishes to limit a stream consisting of 64 byte packets to 1
> > million
> > packets per second the EIR would be set to to to 4600.
> > This value
> > can be broken into '1,000,000 x 46'. Where 1,000,000 is the
> > policing
> > rate for the number of packets per second and 46 represents
> > the size
> > of the packet data for a 64 byte ip packet.
> > William
>
> Guess they take of the FCS also, so the packets they sent are 14 bytes
> ethernet, + 46 bytes IP + 4 FCS.

I see, thanks!

>
> In my tests I sent 64 byte packets, i.e. 14 bytes ethernet + 50 bytes
> IP.
>
> Maybe we should clarify this in the text some how?
>
Sure, I will do it.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Documentation: Fix DPDK qos example.

2021-02-18 Thread William Tu
On Thu, Feb 18, 2021 at 1:23 AM Eelco Chaudron  wrote:
>
>
>
> On 17 Feb 2021, at 18:41, William Tu wrote:
>
> >>
> >>>>  Information Rate). High priority traffic is routed to queue 10,
> >>>> which marks
> >>>>  all traffic as CIR, i.e. Green. All low priority traffic, queue 20,
> >>>> is
> >>>>  marked as EIR, i.e. Yellow::
> >>>>
> >>>>  $ ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
> >>>>  --id=@myqos create qos type=trtcm-policer \
> >>>> -other-config:cir=52000 other-config:cbs=2048 \
> >>>> -other-config:eir=52000 other-config:ebs=2048  \
> >>
> >> 52000 is fine as our documentation states cir, eir are in bytes per
> >> second, minus the ethernet header.
> >> So (64-12) * 1000 = 52000
> >
> > How come it's not minus 14-byte ethernet header?
>
> You are right, it should be 14, miscounted the rte_ether_hdr structure size :(
>
Thank you!
at vswitchd/vswitch.xml, it uses "46", should I change it to "64- 14 = 50"?
calculated by (pps x packet data size).  For example assuming a user
wishes to limit a stream consisting of 64 byte packets to 1 million
packets per second the EIR would be set to to to 4600. This value
can be broken into '1,000,000 x 46'. Where 1,000,000 is the policing
rate for the number of packets per second and 46 represents the size
of the packet data for a 64 byte ip packet.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Fix ukey leak on udpif destroy.

2021-02-17 Thread William Tu
On Wed, Feb 17, 2021 at 8:40 AM Ilya Maximets  wrote:
>
> On 1/18/21 5:12 PM, Ilya Maximets wrote:
> > Since commit 79eadafeb1b4 udpif_stop_threads() doesn't delete datapath
> > flows while called from udpif_destroy().  This means that ukeys are
> > not cleaned up either.  So, hash maps in udpif->ukeys[] might still
> > contain valid pointers to ukeys that should be destroyed before
> > destroying the hash map itself:
> >
> >   ==2783089==ERROR: LeakSanitizer: detected memory leaks
> >
> >   Direct leak of 1560 byte(s) in 1 object(s) allocated from:
> > # 0 0x7f8a57eae667 in __interceptor_malloc (/lib64/libasan.so.6+0xb0667)
> > # 1 0x8411f6 in xmalloc lib/util.c:138
> > # 2 0x4d8a52 in ukey_create__ ofproto/ofproto-dpif-upcall.c:1682
> > # 3 0x4d99e3 in ukey_create_from_upcall 
> > ofproto/ofproto-dpif-upcall.c:1751
> > # 4 0x4d517d in upcall_xlate ofproto/ofproto-dpif-upcall.c:1242
> > # 5 0x4d63d2 in process_upcall ofproto/ofproto-dpif-upcall.c:1414
> > # 6 0x4d29f3 in recv_upcalls ofproto/ofproto-dpif-upcall.c:833
> > # 7 0x4d1ee1 in udpif_upcall_handler ofproto/ofproto-dpif-upcall.c:750
> > # 8 0x795aa2 in ovsthread_wrapper lib/ovs-thread.c:383
> > # 9 0x7f8a57a59431 in start_thread (/lib64/libpthread.so.0+0x9431)
> >
> > Fixes: 79eadafeb1b4 ("ofproto: Do not delete datapath flows on exit by 
> > default.")
> > Reported-by: Dumitru Ceara 
> > Signed-off-by: Ilya Maximets 
> > ---

LGTM. LeakSanitizer is pretty useful.
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Documentation: Fix DPDK qos example.

2021-02-17 Thread William Tu
>
> >>  Information Rate). High priority traffic is routed to queue 10,
> >> which marks
> >>  all traffic as CIR, i.e. Green. All low priority traffic, queue 20,
> >> is
> >>  marked as EIR, i.e. Yellow::
> >>
> >>  $ ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
> >>  --id=@myqos create qos type=trtcm-policer \
> >> -other-config:cir=52000 other-config:cbs=2048 \
> >> -other-config:eir=52000 other-config:ebs=2048  \
>
> 52000 is fine as our documentation states cir, eir are in bytes per
> second, minus the ethernet header.
> So (64-12) * 1000 = 52000

How come it's not minus 14-byte ethernet header?

the rest looks good to me. Thanks!
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Documentation: Fix DPDK qos example.

2021-02-17 Thread William Tu
On Wed, Feb 17, 2021 at 8:21 AM Eelco Chaudron  wrote:
>
>
>
> On 17 Feb 2021, at 9:23, Eelco Chaudron wrote:
>
> > On 17 Feb 2021, at 4:39, William Tu wrote:
> >
> >> RFC4115 says "The CIR and EIR are both measured in bits/s."
> >> Fix the example use case based on the decription.
> >> 64-Byte packet * 8 * 1000 pps = 512000
> >
> > Did you run some tests to verify the changes you made?
> >
> >> Fixes: e61bdffc2a98 ("netdev-dpdk: Add new DPDK RFC 4115 egress
> >> policer")
> >> Signed-off-by: William Tu 
> >> ---
> >>  Documentation/topics/dpdk/qos.rst | 12 ++--
> >>  1 file changed, 6 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/Documentation/topics/dpdk/qos.rst
> >> b/Documentation/topics/dpdk/qos.rst
> >> index 103495415a9c..e9a51ab3a3f0 100644
> >> --- a/Documentation/topics/dpdk/qos.rst
> >> +++ b/Documentation/topics/dpdk/qos.rst
> >> @@ -69,22 +69,22 @@ to prioritize certain traffic over others at a
> >> port level.
> >>
> >>  For example, the following configuration will limit the traffic rate
> >> at a
> >>  port level to a maximum of 2000 packets a second (64 bytes IPv4
> >> packets).
> >> -100pps as CIR (Committed Information Rate) and 1000pps as EIR
> >> (Excess
> >> +1000pps as CIR (Committed Information Rate) and 1000pps as EIR
> >> (Excess
>
> Ack this should be 1000
>
> >>  Information Rate). High priority traffic is routed to queue 10,
> >> which marks
> >>  all traffic as CIR, i.e. Green. All low priority traffic, queue 20,
> >> is
> >>  marked as EIR, i.e. Yellow::
> >>
> >>  $ ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
> >>  --id=@myqos create qos type=trtcm-policer \
> >> -other-config:cir=52000 other-config:cbs=2048 \
> >> -other-config:eir=52000 other-config:ebs=2048  \
>
> 52000 is fine as our documentation states cir, eir are in bytes per
> second, minus the ethernet header.
> So (64-12) * 1000 = 52000
>
> >> +other-config:cir=512000 other-config:cbs=2048 \
> >> +other-config:eir=512000 other-config:ebs=2048  \
> >>  queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
> >>   --id=@dpdk1Q10 create queue \
> >> -  other-config:cir=4160 other-config:cbs=2048 \
>
> This one should change to cir 2*52000 = 104000 also.
>
> >> -  other-config:eir=0 other-config:ebs=0 -- \
>
> This should change unaltered.
>
> >> +  other-config:cir=512000 other-config:cbs=2048 \
> >> +  other-config:eir=512000 other-config:ebs=0 -- \
> >
> > The eir should stay zero here
> >
> >>   --id=@dpdk1Q20 create queue \
> >> other-config:cir=0 other-config:cbs=0 \
> >> -   other-config:eir=4160 other-config:ebs=2048 \
> >> +   other-config:eir=512000 other-config:ebs=2048 \
>
> This should be 52000 also the trailing backslash can be removed.
> >>
> >>  This configuration accomplishes that the high priority traffic has a
> >>  guaranteed bandwidth egressing the ports at CIR (1000pps), but it
> >> can also
> >
> > I’ll re-run some of my tests if you have not done so with the new
> > config. Hopefully next week or the week after.
>
> I found some time to re-test this, and with the new values it works as
> expected.
> To be clear this is the correct set:
>
> $ ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
>  --id=@myqos create qos type=trtcm-policer \
>  other-config:cir=52000 other-config:cbs=2048 \
>  other-config:eir=52000 other-config:ebs=2048  \
>  queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
>   --id=@dpdk1Q10 create queue \
>other-config:cir=104000 other-config:cbs=2048 \
>other-config:eir=0 other-config:ebs=0 -- \
>   --id=@dpdk1Q20 create queue \
> other-config:cir=0 other-config:cbs=0 \
> other-config:eir=52000 other-config:ebs=2048

Hi Eelco,
Thanks!
One question, I thought for queue10, instead of "cir=104000", we should do
  --id=@dpdk1Q10 create queue \
   other-config:cir=52000 other-config:cbs=2048 \
   other-config:eir=52000 other-config:ebs=2048 -- \
because later on in the description, we said "
 This configuration accomplishes that the high priority traffic has a
  guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
  use the EIR, so a total of 2000pps at max. These additional 1000pps is
 shared with the low priority traffic. The low priority traffic can use at
 maximum 1000pps.
"
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] connmgr: Check nullptr inside ofmonitor_report()

2021-02-16 Thread William Tu
On Tue, Feb 16, 2021 at 1:40 PM Yi-Hung Wei  wrote:
>
> On Tue, Feb 16, 2021 at 1:06 PM Yifeng Sun  wrote:
> >
> > ovs-vswitchd could crash under these circumstances:
> > 1. When one bridge is being destroyed, ofproto_destroy() is called and
> > connmgr pointer of its ofproto struct is nullified. This ofproto struct is
> > deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'.
> > 2. Before RCU enters quiesce state to actually free this ofproto struct,
> > revalidator thread calls udpif_revalidator(), which could handle
> > a learn flow and calls ofproto_flow_mod_learn(), it later calls
> > ofmonitor_report() and ofproto struct's connmgr pointer is accessed.
> >

LGTM, thanks. I guess this is hard to reproduce or create a test.
Do we need to worry about other places that use 'ofproto->connmgr'?
ex: there are a couple of places calling ofmonitor_flush(ofproto->connmgr);

> > The crash stack trace is shown below:
> >
> > 0  ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, 
> > event=event@entry=NXFME_ADDED,
> > reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, 
> > abbrev_xid=0, old_actions=old_actions@entry=0x0)
> > at ofproto/connmgr.c:2160
> > 1  0x7fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, 
> > ofm=, req=req@entry=0x0)
> > at ofproto/ofproto.c:5221
> > 2  0x7fa4d68036af in modify_flows_finish (req=0x0, ofm=0x7fa4980753f0, 
> > ofproto=0x55d9075d4ab0)
> > at ofproto/ofproto.c:5823
> > 3  ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, 
> > ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0)
> > at ofproto/ofproto.c:8088
> > 4  0x7fa4d680372d in ofproto_flow_mod_learn_finish 
> > (ofm=ofm@entry=0x7fa4980753f0,
> > orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439
> > 5  0x7fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, 
> > keep_ref=keep_ref@entry=true,
> > limit=, below_limitp=below_limitp@entry=0x0) at 
> > ofproto/ofproto.c:5499
> > 6  0x7fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, 
> > stats=stats@entry=0x7fa4d2701a10,
> > offloaded=offloaded@entry=false) at 
> > ofproto/ofproto-dpif-xlate-cache.c:127
> > 7  0x7fa4d6835e3a in xlate_push_stats (xcache=, 
> > stats=stats@entry=0x7fa4d2701a10,
> > offloaded=offloaded@entry=false) at 
> > ofproto/ofproto-dpif-xlate-cache.c:181
> > 8  0x7fa4d6822046 in revalidate_ukey (udpif=udpif@entry=0x55d90760b240, 
> > ukey=ukey@entry=0x7fa4b0191660,
> > stats=stats@entry=0x7fa4d2705118, 
> > odp_actions=odp_actions@entry=0x7fa4d2701b50,
> > reval_seq=reval_seq@entry=5655486242, 
> > recircs=recircs@entry=0x7fa4d2701b40, offloaded=false)
> > at ofproto/ofproto-dpif-upcall.c:2294
> > 9  0x7fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at 
> > ofproto/ofproto-dpif-upcall.c:2683
> > 10 0x7fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at 
> > ofproto/ofproto-dpif-upcall.c:936
> > 11 0x7fa4d6259c9f in ovsthread_wrapper (aux_=) at 
> > lib/ovs-thread.c:423
> > 12 0x7fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0
> > 13 0x7fa4d504b96d in clone () from /usr/lib64/libc.so.6
> >
> > At the time of crash, the involved ofproto was already deallocated:
> >
> > (gdb) print *ofproto
> > $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {...,
> > one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ...
> >
> > This patch fixes it.
> >
> > VMware-BZ: #2700626
> > Signed-off-by: Yifeng Sun 
> > ---
>
> LGTM.
>
> Acked-by: Yi-Hung Wei 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] Documentation: Fix DPDK qos example.

2021-02-16 Thread William Tu
RFC4115 says "The CIR and EIR are both measured in bits/s."
Fix the example use case based on the decription.
64-Byte packet * 8 * 1000 pps = 512000

Fixes: e61bdffc2a98 ("netdev-dpdk: Add new DPDK RFC 4115 egress policer")
Signed-off-by: William Tu 
---
 Documentation/topics/dpdk/qos.rst | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/topics/dpdk/qos.rst 
b/Documentation/topics/dpdk/qos.rst
index 103495415a9c..e9a51ab3a3f0 100644
--- a/Documentation/topics/dpdk/qos.rst
+++ b/Documentation/topics/dpdk/qos.rst
@@ -69,22 +69,22 @@ to prioritize certain traffic over others at a port level.
 
 For example, the following configuration will limit the traffic rate at a
 port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
-100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
+1000pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
 Information Rate). High priority traffic is routed to queue 10, which marks
 all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
 marked as EIR, i.e. Yellow::
 
 $ ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
 --id=@myqos create qos type=trtcm-policer \
-other-config:cir=52000 other-config:cbs=2048 \
-other-config:eir=52000 other-config:ebs=2048  \
+other-config:cir=512000 other-config:cbs=2048 \
+other-config:eir=512000 other-config:ebs=2048  \
 queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
  --id=@dpdk1Q10 create queue \
-  other-config:cir=4160 other-config:cbs=2048 \
-  other-config:eir=0 other-config:ebs=0 -- \
+  other-config:cir=512000 other-config:cbs=2048 \
+  other-config:eir=512000 other-config:ebs=0 -- \
  --id=@dpdk1Q20 create queue \
other-config:cir=0 other-config:cbs=0 \
-   other-config:eir=4160 other-config:ebs=2048 \
+   other-config:eir=512000 other-config:ebs=2048 \
 
 This configuration accomplishes that the high priority traffic has a
 guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] ofp-actions: Fix use-after-free while decoding RAW_ENCAP.

2021-02-16 Thread William Tu
On Tue, Feb 16, 2021 at 2:27 PM Ilya Maximets  wrote:
>
> While decoding RAW_ENCAP action, decode_ed_prop() might re-allocate
> ofpbuf if there is no enough space left.  However, function
> 'decode_NXAST_RAW_ENCAP' continues to use old pointer to 'encap'
> structure leading to write-after-free and incorrect decoding.
>
>   ==3549105==ERROR: AddressSanitizer: heap-use-after-free on address
>   0x6060011a at pc 0x005f6cc6 bp 0x7ffc3a2d4410 sp 0x7ffc3a2d4408
>   WRITE of size 2 at 0x6060011a thread T0
> #0 0x5f6cc5 in decode_NXAST_RAW_ENCAP lib/ofp-actions.c:4461:20
> #1 0x5f0551 in ofpact_decode ./lib/ofp-actions.inc2:4777:16
> #2 0x5ed17c in ofpacts_decode lib/ofp-actions.c:7752:21
> #3 0x5eba9a in ofpacts_pull_openflow_actions__ lib/ofp-actions.c:7791:13
> #4 0x5eb9fc in ofpacts_pull_openflow_actions lib/ofp-actions.c:7835:12
> #5 0x64bb8b in ofputil_decode_packet_out lib/ofp-packet.c:1113:17
> #6 0x65b6f4 in ofp_print_packet_out lib/ofp-print.c:148:13
> #7 0x659e3f in ofp_to_string__ lib/ofp-print.c:1029:16
> #8 0x659b24 in ofp_to_string lib/ofp-print.c:1244:21
> #9 0x65a28c in ofp_print lib/ofp-print.c:1288:28
> #10 0x540d11 in ofctl_ofp_parse utilities/ovs-ofctl.c:2814:9
> #11 0x564228 in ovs_cmdl_run_command__ lib/command-line.c:247:17
> #12 0x56408a in ovs_cmdl_run_command lib/command-line.c:278:5
> #13 0x5391ae in main utilities/ovs-ofctl.c:179:9
> #14 0x7f6911ce9081 in __libc_start_main (/lib64/libc.so.6+0x27081)
> #15 0x461fed in _start (utilities/ovs-ofctl+0x461fed)
>
> Fix that by getting a new pointer before using.
>
> Credit to OSS-Fuzz.
>
> Fuzzer regression test will fail only with AddressSanitizer enabled.
>
> Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=27851
> Fixes: f839892a206a ("OF support and translation of generic encap and decap")
> Signed-off-by: Ilya Maximets 


LGTM.
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] conntrack: Add a test for IPv4 UDP zero checksum

2021-02-16 Thread William Tu
Hi Aaron,

Should we also consider the case where udp checksum is 0x?
I saw in netdev_tnl_calc_udp_csum, we set to 0x when a packet has
udp checksum = 0.

Thanks
William

On Wed, Feb 10, 2021 at 12:15 PM Flavio Leitner  wrote:
>
> On Wed, Feb 10, 2021 at 02:45:06PM -0500, Aaron Conole wrote:
> > Flavio Leitner  writes:
> >
> > > On Wed, Feb 10, 2021 at 11:49:33AM -0500, Aaron Conole wrote:
> > >> Recently, during some conntrack testing a bug was uncovered in a DPDK
> > >> PMD, which doesn't support an IPv4 packet with a zero checksum value.
> > >> In order to show that the connection tracking code in userspace
> > >> supports IPv4 UDP with a zero checksum, add a test case to enforce
> > >> this behavior.
> > >>
> > >> Reported-at: http://mails.dpdk.org/archives/dev/2021-January/198528.html
> > >> Reported-by: Paolo Valerio 
> > >> Signed-off-by: Aaron Conole 
> > >> ---
> > >>  tests/system-traffic.at | 43 +
> > >>  1 file changed, 43 insertions(+)
> > >>
> > >> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> > >> index fb5b9a36d2..4971ccc966 100644
> > >> --- a/tests/system-traffic.at
> > >> +++ b/tests/system-traffic.at
> > >> @@ -5927,6 +5927,48 @@ ovs-appctl dpif/dump-flows br0
> > >>  OVS_TRAFFIC_VSWITCHD_STOP
> > >>  AT_CLEANUP
> > >>
> > >> +
> > >> +AT_SETUP([conntrack - IPv4 UDP zero checksum])
> > >> +dnl This tracks sending zero checksum packets for udp over ipv4
> > >> +CHECK_CONNTRACK()
> > >> +OVS_TRAFFIC_VSWITCHD_START()
> > >> +OVS_CHECK_CT_CLEAR()
> > >> +
> > >> +ADD_NAMESPACES(at_ns0, at_ns1)
> > >> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24", "f0:00:00:01:01:01")
> > >> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24", "f0:00:00:01:01:02")
> > >> +dnl setup ct flows
> > >> +AT_DATA([flows.txt], [dnl
> > >> +table=0,priority=10  ip,udp,ct_state=-trk action=ct(zone=1,table=1)
> > >> +table=0,priority=0   action=drop
> > >> +table=1,priority=10  ct_state=-est+trk+new,ip,ct_zone=1,in_port=1 
> > >> action=ct(commit,table=2)
> > >> +table=1,priority=10  ct_state=+est-new+trk,ct_zone=1,in_port=1 
> > >> action=resubmit(,2)
> > >> +table=1,priority=0   action=drop
> > >> +table=2,priority=10  ct_state=+trk+new,in_port=1 action=2
> > >> +table=2,priority=10  ct_state=+trk+est action=2
> > >> +])
> > >> +
> > >> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> > >> +
> > >> +# sending udp pkt
> > >> +NS_CHECK_EXEC([at_ns0], [$PYTHON3 $srcdir/sendpkt.py p0 f0 00 00 01
> > >> 01 02 f0 00 00 01 01 01 08 00 45 00 00 28 00 01 00 00 40 11 64 c0 0a
> > >> 01 01 01 0a 01 01 02 04 d2 04 d2 00 14 00 00 aa aa aa aa aa aa aa aa
> > >> aa aa aa aa > /dev/null])
> > >
> > > That hex string translates to this packet which doesn't use cksum:
> > > 12:57:40.065353 IP (tos 0x0, ttl 64, id 1, offset 0, flags [none],
> > > proto UDP (17), length 40)
> > > 10.1.1.1.search-agent > 10.1.1.2.search-agent: [no cksum] UDP,
> > > length 12
> >
> > Yes, there should be no UDP checksum.  This is the point of this test.
>
> Sorry, I was confirming that the hex string does use a packet
> as this patch says. It's good.
>
>
> > >> +
> > >> +sleep 1
> > >
> > > Can we use OVS_WAIT_UNTIL() or OVS_WAIT_WHILE() ?
> >
> > Good idea.
> >
> > >> +
> > >> +dnl ensure CT picked up the packet
> > >> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1)], [0], 
> > >> [dnl
> > >> +udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=,dport=),reply=(src=10.1.1.2,dst=10.1.1.1,sport=,dport=)
> > >> +])
> > >> +
> > >> +AT_CHECK([ovs-ofctl dump-flows br0 | grep table=2, | 
> > >> OFPROTO_CLEAR_DURATION_IDLE],
> > >> + [0], [dnl
> > >> + cookie=0x0, duration=, table=2, n_packets=1, n_bytes=54,
> > >> idle_age=, priority=10,ct_state=+new+trk,in_port=1
> > >> actions=output:2
> > >> + cookie=0x0, duration=, table=2, n_packets=0, n_bytes=0,
> > >> idle_age=, priority=10,ct_state=+est+trk actions=output:2
> > >> +])
> > >> +
> > >> +OVS_TRAFFIC_VSWITCHD_STOP
> > >> +AT_CLEANUP
> > >> +
> > >>  AT_SETUP([conntrack - Multiple ICMP traverse])
> > >>  dnl This tracks sending ICMP packets via conntrack multiple times for 
> > >> the
> > >>  dnl same packet
> > >> @@ -5971,6 +6013,7 @@ AT_CHECK([ovs-ofctl dump-flows br0 | grep table=2, 
> > >> | OFPROTO_CLEAR_DURATION_IDLE
> > >>  OVS_TRAFFIC_VSWITCHD_STOP
> > >>  AT_CLEANUP
> > >>
> > >> +
> > >
> > > There mixed styles. Some use two lines, some use single line
> > > and some use two lines with a dnl - or ^L.
> > >
> > > I guess we should stick to single line, but I don't have a
> > > strong preference.
> >
> > Oops, this snuck in.  I will remove it, and we can do something there
> > with a different patch if someone thinks it is needed.
>
> Cool,
> thanks
> fbl
>
>
> >
> > > Otherwise the patch works for me.
> > >
> > > Thanks,
> > > fbl
> > >
> > >>  AT_BANNER([802.1ad])
> > >>
> > >>  AT_SETUP([802.1ad - vlan_limit])
> > >> --
> > >> 2.25.4
> > >>
> > >> ___
> > >> dev mailing 

[ovs-dev] [PATCH] netdev-linux: Fix indentation.

2021-02-16 Thread William Tu
Remove one extra space. No actual code logic changed.

Fixes: 2109841b79845 ("Use batch process recv for tap and raw socket in netdev 
datapath")
Signed-off-by: William Tu 
---
 lib/netdev-linux.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 6be23dbeed57..15b25084b33c 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1255,21 +1255,21 @@ netdev_linux_batch_rxq_recv_sock(struct 
netdev_rxq_linux *rx, int mtu,
  * aux_buf is allocated so that it can be prepended to TSO buffer. */
 std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
 for (i = 0; i < NETDEV_MAX_BURST; i++) {
- buffers[i] = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
- iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
- iovs[i][IOV_PACKET].iov_len = std_len;
- if (iovlen == IOV_TSO_SIZE) {
- iovs[i][IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
- iovs[i][IOV_AUXBUF].iov_len = dp_packet_tailroom(rx->aux_bufs[i]);
- }
+buffers[i] = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
+iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
+iovs[i][IOV_PACKET].iov_len = std_len;
+if (iovlen == IOV_TSO_SIZE) {
+iovs[i][IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
+iovs[i][IOV_AUXBUF].iov_len = dp_packet_tailroom(rx->aux_bufs[i]);
+}
 
- mmsgs[i].msg_hdr.msg_name = NULL;
- mmsgs[i].msg_hdr.msg_namelen = 0;
- mmsgs[i].msg_hdr.msg_iov = iovs[i];
- mmsgs[i].msg_hdr.msg_iovlen = iovlen;
- mmsgs[i].msg_hdr.msg_control = _buffers[i];
- mmsgs[i].msg_hdr.msg_controllen = sizeof cmsg_buffers[i];
- mmsgs[i].msg_hdr.msg_flags = 0;
+mmsgs[i].msg_hdr.msg_name = NULL;
+mmsgs[i].msg_hdr.msg_namelen = 0;
+mmsgs[i].msg_hdr.msg_iov = iovs[i];
+mmsgs[i].msg_hdr.msg_iovlen = iovlen;
+mmsgs[i].msg_hdr.msg_control = _buffers[i];
+mmsgs[i].msg_hdr.msg_controllen = sizeof cmsg_buffers[i];
+mmsgs[i].msg_hdr.msg_flags = 0;
 }
 
 do {
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] pcap-file: Fix calculation of TCP payload length in tcp_reader_run().

2021-02-16 Thread William Tu
I see, thanks.

On Tue, Feb 16, 2021 at 12:52 PM Ben Pfaff  wrote:
>
> On Fri, Feb 12, 2021 at 11:15:46AM -0800, William Tu wrote:
> > I'm confused with what l2_pad_size is.
> > I thought it's between L2 and L3 header, there is a 2-byte padding to
> > make it 16-byte alignment. But it doesn't look like that.
>
> Ethernet has a 64-byte minimum packet length.  l2_pad_size is the number
> of bytes of padding added to a packet to bring it up to that minimum
> length.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 0/5] XDP offload using flow API provider

2021-02-15 Thread William Tu
On Tue, Feb 9, 2021 at 1:39 AM Toshiaki Makita
 wrote:
>
> On 2021/02/05 2:36, William Tu wrote:
> > Hi Toshiaki,
> >
> > Thanks for the patch. I've been testing it for a couple days.
> > I liked it a lot! The compile and build process all work without any issues.
>
> Hi, thank you for reviewing!
> Sorry for taking time to reply. It took time to remember every detail of the 
> patch set...
>
> > On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
> >  wrote:
> >>
> >> This patch adds an XDP-based flow cache using the OVS netdev-offload
> >> flow API provider.  When an OVS device with XDP offload enabled,
> >> packets first are processed in the XDP flow cache (with parse, and
> >> table lookup implemented in eBPF) and if hits, the action processing
> >> are also done in the context of XDP, which has the minimum overhead.
> >>
> >> This provider is based on top of William's recently posted patch for
> >> custom XDP load.  When a custom XDP is loaded, the provider detects if
> >> the program supports classifier, and if supported it starts offloading
> >> flows to the XDP program.
> >>
> >> The patches are derived from xdp_flow[1], which is a mechanism similar to
> >> this but implemented in kernel.
> >>
> >>
> >> * Motivation
> >>
> >> While userspace datapath using netdev-afxdp or netdev-dpdk shows good
> >> performance, there are use cases where packets better to be processed in
> >> kernel, for example, TCP/IP connections, or container to container
> >> connections.  Current solution is to use tap device or af_packet with
> >> extra kernel-to/from-userspace overhead.  But with XDP, a better solution
> >> is to steer packets earlier in the XDP program, and decides to send to
> >> userspace datapath or stay in kernel.
> >>
> >> One problem with current netdev-afxdp is that it forwards all packets to
> >> userspace, The first patch from William (netdev-afxdp: Enable loading XDP
> >> program.) only provides the interface to load XDP program, howerver users
> >> usually don't know how to write their own XDP program.
> >>
> >> XDP also supports HW-offload so it may be possible to offload flows to
> >> HW through this provider in the future, although not currently.
> >> The reason is that map-in-map is required for our program to support
> >> classifier with subtables in XDP, but map-in-map is not offloadable.
> >> If map-in-map becomes offloadable, HW-offload of our program may also
> >> be possible.
> >
> > I think it's too far away for XDP to be offloaded into HW and meet OVS's
> > feature requirements.
>
> I don't know blockers other than map-in-map, but probably there are more.
> If you can provide explicit blockers I can add them in the cover letter.

It's hard to list them when we don't have a full OVS datapath
implemented in XDP.
Here are a couple things I can imagine. How does HW offloaded support:
- AF_XDP socket. The XSK map contains XSK fd, how does it exchange
  the fd to host kernel?
- how does offloaded XDP redirect to another netdev
- Helper functions such as adjust_head for pushing the outer header.

>
> > There is a research prototype here, FYI.
> > https://www.usenix.org/conference/osdi20/presentation/brunella
>
> This is a presentation about FPGA, not HW offload to SmartNIC, right?
>
Yes, that's for offloading to FPGA.

> >>
> >>
> >> * How to use
> >>
> >> 1. Install clang/llvm >= 9, libbpf >= 0.0.6 (included in kernel 5.5), and
> >> kernel >= 5.3.
> >>
> >> 2. make with --enable-afxdp --enable-xdp-offload
> >> --enable-bpf will generate XDP program "bpf/flowtable_afxdp.o".  Note that
> >
> > typo: I think you mean --enable-xdp-offload
>
> Thanks.
>
> >
> >> the BPF object will not be installed anywhere by "make install" at this 
> >> point.
> >>
> >> 3. Load custom XDP program
> >> E.g.
> >> $ ovs-vsctl add-port ovsbr0 veth0 -- set int veth0 options:xdp-mode=native 
> >> \
> >>options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
> >> $ ovs-vsctl add-port ovsbr0 veth1 -- set int veth1 options:xdp-mode=native 
> >> \
> >>options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
> >>
> >> 4. Enable XDP_REDIRECT
> >> If you use veth devices, make sure to load some (possibly dummy) programs
> >> on the peers of veth devices. This patch set includes a program which
> >

Re: [ovs-dev] [PATCH] pcap-file: Fix calculation of TCP payload length in tcp_reader_run().

2021-02-12 Thread William Tu
On Tue, Feb 2, 2021 at 10:00 AM Ben Pfaff  wrote:
>
> On Tue, Feb 02, 2021 at 05:11:09PM +0100, Ilya Maximets wrote:
> > On 1/21/21 11:33 PM, Ben Pfaff wrote:
> > > The calculation in tcp_reader_run() failed to account for L2 padding.
> > > This fixes the problem, by moving the existing function
> > > tcp_payload_length() from a conntrack private header file into
> > > dp-packet.h and renaming it to suit the dp_packet style.
> > >
> > > Signed-off-by: Ben Pfaff 
> > > ---
> >
> > LGTM,
> > Acked-by: Ilya Maximets 
>
> Thanks, applied to master.

Hi Ben and Ilya,

I'm confused with what l2_pad_size is.
I thought it's between L2 and L3 header, there is a 2-byte padding to
make it 16-byte alignment. But it doesn't look like that.

Then every time dp_packet API use dp_packet_tail, we have to
subtract the l2_pad_size. ex: dp_packet_l4_size()
So is thie l2_pad_size the extra byte at the end of buffer? (before the
memory pointed by dp_packet_tail(p)?

Thanks
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [RFC PATCH] dp-packet-gso: Add GSO support.

2021-02-09 Thread William Tu
This patch adds GSO support for IPv4 TCP, when userspace-tso is enabled.
Tested using veth sending a TSO packet to OVS, segments to smaller TCP
segment, and forward to netdev-afxdp port at another namespace.

Future work includes:
1. GSO for UDP, and IPv6 TCP/UDP GSO.
2. Tunnel GSO: VxLan GSO, Geneve GSO, GRE GSO...

Tested using
$ make check-afxdp TESTSUITEFLAGS='3'

Or script below:
  ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true
  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
  ip netns add at_ns0
  ip link add p0 type veth peer name afxdp-p0
  ip link set p0 netns at_ns0
  ip link set dev afxdp-p0 up
  ovs-vsctl add-port br0 afxdp-p0
  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
  ip addr add "10.1.1.1/24" dev p0
  ip link set dev p0 up
  NS_EXEC_HEREDOC

  ip netns add at_ns1
  ip link add p1 type veth peer name afxdp-p1
  ip link set p1 netns at_ns1
  ip link set dev afxdp-p1 up
  ovs-vsctl add-port br0 afxdp-p1 -- set int afxdp-p1 type=afxdp

  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
  ip addr add "10.1.1.2/24" dev p1
  ip link set dev p1 up
  NS_EXEC_HEREDOC

  ip netns exec at_ns0 ping -c 3 -i .2 10.1.1.2
  ip netns exec at_ns1 ethtool -K p1 tx off
  ip netns exec at_ns1 iperf -s
  ip netns exec at_ns0 iperf -c 10.1.1.2 -t1

Tested-at: https://github.com/williamtu/ovs-travis/actions/runs/553156643
Signed-off-by: William Tu 
---
 lib/automake.mk   |   2 +
 lib/dp-packet-gso.c   | 149 ++
 lib/dp-packet-gso.h   |  27 +
 lib/netdev-afxdp.c|   6 ++
 lib/netdev.c  |  88 +++--
 lib/packets.c |  35 
 lib/packets.h |   1 +
 tests/system-afxdp.at |  32 +++
 8 files changed, 324 insertions(+), 16 deletions(-)
 create mode 100644 lib/dp-packet-gso.c
 create mode 100644 lib/dp-packet-gso.h

diff --git a/lib/automake.mk b/lib/automake.mk
index 39afbff9d1a0..57f504d52f5c 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -104,6 +104,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/dpctl.h \
lib/dp-packet.h \
lib/dp-packet.c \
+   lib/dp-packet-gso.h \
+   lib/dp-packet-gso.c \
lib/dpdk.h \
lib/dpif-netdev-lookup.h \
lib/dpif-netdev-lookup.c \
diff --git a/lib/dp-packet-gso.c b/lib/dp-packet-gso.c
new file mode 100644
index ..5ae7c88298a5
--- /dev/null
+++ b/lib/dp-packet-gso.c
@@ -0,0 +1,149 @@
+/*
+ * Copyright (c) 2021 VMware, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include "coverage.h"
+#include "csum.h"
+#include "dp-packet.h"
+#include "dp-packet-gso.h"
+#include "dpif-netdev.h"
+#include "openvswitch/compiler.h"
+#include "openvswitch/dynamic-string.h"
+#include "openvswitch/vlog.h"
+#include "packets.h"
+#include "util.h"
+
+VLOG_DEFINE_THIS_MODULE(dp_packet_gso);
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+
+/* Update ip header's total len, and id and update tcp header's
+ * sent sequence number.  In the end, update ip and tcp csum.
+ */
+static void
+update_ipv4_tcp_headers(const struct dp_packet *src, struct dp_packet **pkts,
+uint16_t nb_segs)
+{
+struct tcp_header *tcp;
+struct ip_header *ip;
+struct dp_packet *p;
+uint32_t tcp_seq;
+uint16_t ipid;
+int i;
+
+ip = dp_packet_l3(src);
+ipid = ntohs(ip->ip_id);
+tcp = dp_packet_l4(src);
+tcp_seq = ntohl(get_16aligned_be32(>tcp_seq));
+
+for (i = 0; i < nb_segs; i++) {
+p = pkts[i];
+
+ip = dp_packet_l3(p);
+ip->ip_tot_len = htons(dp_packet_l3_size(p));
+ip->ip_id = htons(ipid);
+ip->ip_csum = 0;
+ip->ip_csum = csum(ip, sizeof *ip);
+
+tcp = dp_packet_l4(p);
+put_16aligned_be32(>tcp_seq, htonl(tcp_seq));
+packet_csum_tcpudp(p);
+
+ipid += 1;
+tcp_seq += (const char *) dp_packet_tail(p) -
+   (const char *) dp_packet_l4(p) -
+   TCP_OFFSET(tcp->tcp_ctl) * 4;
+}
+}
+
+static void
+hdr_segment_init(struct dp_packet *dst, const struct dp_packet *src)
+{
+/* Copy the following fields into the returned buffer: l2_pad_size,
+ * l2_5_of

Re: [ovs-dev] [PATCH] netdev-afxdp: Add start qid support.

2021-02-08 Thread William Tu
On Mon, Feb 8, 2021 at 4:58 AM Ilya Maximets  wrote:
>
> On 2/7/21 5:05 PM, Toshiaki Makita wrote:
> > On 2021/02/07 2:00, William Tu wrote:
> >> On Fri, Feb 5, 2021 at 1:08 PM Gregory Rose  wrote:
> >>> On 2/4/2021 7:08 PM, William Tu wrote:
> >>>> On Thu, Feb 4, 2021 at 3:17 PM Gregory Rose  wrote:
> >>>>> On 2/3/2021 1:21 PM, William Tu wrote:
> >>>>>> Mellanox card has different XSK design. It requires users to create
> >>>>>> dedicated queues for XSK. Unlike Intel's NIC which loads XDP program
> >>>>>> to all queues, Mellanox only loads XDP program to a subset of its 
> >>>>>> queue.
> >>>>>>
> >>>>>> When OVS uses AF_XDP with mlx5, it doesn't replace the existing RX and 
> >>>>>> TX
> >>>>>> queues in the channel with XSK RX and XSK TX queues, but it creates an
> >>>>>> additional pair of queues for XSK in that channel. To distinguish
> >>>>>> regular and XSK queues, mlx5 uses a different range of qids.
> >>>>>> That means, if the card has 24 queues, queues 0..11 correspond to
> >>>>>> regular queues, and queues 12..23 are XSK queues.
> >>>>>> In this case, we should attach the netdev-afxdp with 'start-qid=12'.
> >>>>>>
> >>>>>> I tested using Mellanox Connect-X 6Dx, by setting 'start-qid=1', and:
> >>>>>>  $ ethtool -L enp2s0f0np0 combined 1
> >>>>>>  # queue 0 is for non-XDP traffic, queue 1 is for XSK
> >>>>>>  $ ethtool -N enp2s0f0np0 flow-type udp4 action 1
> >>>>>> note: we need additionally add flow-redirect rule to queue 1
> >>>>>
> >>>>> Seems awfully hardware dependent.  Is this just for Mellanox or does
> >>>>> it have general usefulness?
> >>>>>
> >>>> It is just Mellanox's design which requires pre-configure the 
> >>>> flow-director.
> >>>> I only have cards from Intel and Mellanox so I don't know about other 
> >>>> vendors.
> >>>>
> >>>> Thanks,
> >>>> William
> >>>>
> >>>
> >>> I think we need to abstract the HW layer a little bit.  This start-qid
> >>> option is specific to a single piece of HW, at least at this point.
> >>> We should expect that further HW  specific requirements for
> >>> different NIC vendors will come up in the future.  I suggest
> >>> adding a hw_options:mellanox:start-qid type hierarchy  so that
> >>> as new HW requirements come up we can easily scale.  It will
> >>> also make adding new vendors easier in the future.
> >>>
> >>> Even with NIC vendors you can't always count on each new generation
> >>> design to always keep old requirements and methods for feature
> >>> enablement.
> >>>
> >>> What do you think?
> >>>
> >> Thanks for the feedback.
> >> So far I don't know whether other vendors will need this option or not.
> >
> > FWIU, this api "The lower half of the available amount of RX queues are 
> > regular queues, and the upper half are XSK RX queues." is the result of 
> > long discussion to support dedicated/isolated XSK rings, which is not meant 
> > for a mellanox-specific feature.
> >
> > https://patchwork.ozlabs.org/project/netdev/cover/20190524093431.20887-1-maxi...@mellanox.com/
> > https://patchwork.ozlabs.org/project/netdev/cover/20190612155605.22450-1-maxi...@mellanox.com/
> >
> > Toshiaki Makita
>
> Thanks for the links.  Very helpful.
>
> From what I understand lower half of queues should still work, i.e.
> it should still be possible to attach AF_XDP socket to them.  But
> they will not work in zero-copy mode ("generic" only?).
> William, could you check that?  Does it work and with which mode
> "best-effort" ends up with?  And what kind of errors libbpf returns
> if we're trying to enable zero-copy?

Thanks for your feedback.
Yes, only zero-copy mode needs to be aware of this, meaning zero-copy
mode has to use the upper half of the queues (the start-qid option here).
Native mode and SKB mode works OK on upper and lower queues.
When attaching zc XSK to lower half queue, libbpf returns EINVAL at
xsk_socket__create().

William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] 答复: 答复: [PATCH V3 2/4] Add GSO support for DPDK data path

2021-02-08 Thread William Tu
On Mon, Feb 8, 2021 at 6:36 AM Ilya Maximets  wrote:
>
> On 2/8/21 1:56 AM, Yi Yang (杨燚)-云服务集团 wrote:
> > Yes, GSO  is ok, but GRO may be have that issue, I didn't see that issue in 
> > my openstack environment, so maybe it will be great if we can have a test 
> > case to trigger that issue.
> >
> > -邮件原件-
> > 发件人: William Tu [mailto:u9012...@gmail.com]
> > 发送时间: 2021年2月7日 23:46
> > 收件人: Yi Yang (杨燚)-云服务集团 
> > 抄送: i.maxim...@ovn.org; yang_y...@163.com; ovs-dev@openvswitch.org; 
> > f...@sysclose.org
> > 主题: Re: [ovs-dev] 答复: [PATCH V3 2/4] Add GSO support for DPDK data path
> >
> > On Tue, Oct 27, 2020 at 6:02 PM Yi Yang (杨燚)-云服务集团  
> > wrote:
> >>
> >> -邮件原件-
> >> 发件人: dev [mailto:ovs-dev-boun...@openvswitch.org] 代表 Ilya Maximets
> >> 发送时间: 2020年10月27日 21:03
> >> 收件人: yang_y...@163.com; ovs-dev@openvswitch.org
> >> 抄送: f...@sysclose.org; i.maxim...@ovn.org
> >> 主题: Re: [ovs-dev] [PATCH V3 2/4] Add GSO support for DPDK data path
> >>
> >> On 8/7/20 12:56 PM, yang_y...@163.com wrote:
> >>> From: Yi Yang 
> >>>
> >>> GSO(Generic Segment Offload) can segment large UDP  and TCP packet
> >>> to small packets per MTU of destination , especially for the case
> >>> that physical NIC can't do hardware offload VXLAN TSO and VXLAN UFO,
> >>> GSO can make sure userspace TSO can still work but not drop.
> >>>
> >>> In addition, GSO can help improve UDP performane when UFO is enabled
> >>> in VM.
> >>>
> >>> GSO can support TCP, UDP, VXLAN TCP, VXLAN UDP, it is done in Tx
> >>> function of physical NIC.
> >>>
> >>> Signed-off-by: Yi Yang 
> >>> ---
> >>>  lib/dp-packet.h|  21 +++-
> >>>  lib/netdev-dpdk.c  | 358
> >>> +
> >>>  lib/netdev-linux.c |  17 ++-
> >>>  lib/netdev.c   |  67 +++---
> >>>  4 files changed, 417 insertions(+), 46 deletions(-)
> >
> > snip
> >
> >>>
> >>> @@ -2339,24 +2428,19 @@ netdev_dpdk_prep_hwol_batch(struct netdev_dpdk 
> >>> *dev, struct rte_mbuf **pkts,
> >>>  return cnt;
> >>>  }
> >>>
> >>> -/* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes
> >>> ownership of
> >>> - * 'pkts', even in case of failure.
> >>> - *
> >>> - * Returns the number of packets that weren't transmitted. */
> >>> static inline int -netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int 
> >>> qid,
> >>> - struct rte_mbuf **pkts, int cnt)
> >>> +__netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
> >>> +   struct rte_mbuf **pkts, int cnt)
> >>>  {
> >>>  uint32_t nb_tx = 0;
> >>> -uint16_t nb_tx_prep = cnt;
> >>> +uint32_t nb_tx_prep;
> >>>
> >>> -if (userspace_tso_enabled()) {
> >>> -nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> >>> -if (nb_tx_prep != cnt) {
> >>> -VLOG_WARN_RL(, "%s: Output batch contains invalid 
> >>> packets. "
> >>> - "Only %u/%u are valid: %s", dev->up.name, 
> >>> nb_tx_prep,
> >>> - cnt, rte_strerror(rte_errno));
> >>> -}
> >>> +nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> >>> +if (nb_tx_prep != cnt) {
> >>> +VLOG_WARN_RL(, "%s: Output batch contains invalid packets. "
> >>> +  "Only %u/%u are valid: %s",
> >>> + dev->up.name, nb_tx_prep,
> >>> + cnt, rte_strerror(rte_errno));
> >>>  }
> >>>
> >>>  while (nb_tx != nb_tx_prep) {
> >>> @@ -2384,6 +2468,200 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, 
> >>> int qid,
> >>>  return cnt - nb_tx;
> >>>  }
> >>>
> >>> +static inline void
> >>> +set_multiseg_udptcp_cksum(struct rte_mbuf *mbuf)
> >>
> >> I didn't review the patch, only had a quick glance, but this part bothers 
> >> me.  OVS doesn't support multi-segment mbufs, so it should not be possible 
> >> for such mbufs being transmitted by O

Re: [ovs-dev] 答复: 答复: [PATCH] netdev-dpdk: fix incorrect shinfo initialization

2021-02-08 Thread William Tu
On Mon, Feb 8, 2021 at 8:57 AM Ilya Maximets  wrote:
>
> On 2/6/21 5:15 PM, William Tu wrote:
> > On Mon, Feb 1, 2021 at 5:48 PM Yi Yang (杨燚)-云服务集团  
> > wrote:
> >>
> >> Thanks Ilya, net_tap PMD is handling tap device on host side, so it can 
> >> leverage vnet header to do TSO/GSO, maybe net_pmd authors don't know how 
> >> to do this, from source code, tap fd isn't enabled vnet header and TSO.
> >>
> > thanks, learned a lot from these discussions.
> >
> > I looked at the DPDK net_tap and indeed it doesn't support virtio net hdr.
> > Do you guys think it makes sense to add TSO at dpdk net_tap?
> > Or simply using the current OVS's userspace-enable-tso on tap/veth is
> > good enough?
> > (using type=system, not using dpdk port type on tap/veth.)
> >
> > Regards,
> > William
> >
>
> I didn't benchmark all types of interfaces, but I'd say that, if you
> need more or less high performance solution for userspace<->kernel
> communication, you should, probably, take a look at virtio-user
> ports with vhost kernel backend:
>   https://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html
> This should be the fastest and also feature-rich solution.
Thanks! I will give it a try.
>
> Tap devices are not designed for high performance in general,
> so I'd not suggest any of them for highly loaded ports.
> If it's only for some small management traffic, it should be fine
> to just use netdev-linux implementation.

That's what I thought until Flavio enables vnet header.
>
> netdev-afxdp with pmd or non-pmd modes on a veth devices is another
> (potentially high performance) solution.

When testing intra-host container to container performance,
Tap device becomes much faster than netdev-afxdp, especially with iperf TCP.
Mostly due to vnet header's TSO and csum offload feature.
It's a big limitation for XDP frame which couldn't carry large buffer or carry
the partial csum information.

I reach a conclusion that for intra-host container to container
TCP performance, from the fastest configuration to slowest (ns: namespace)
0) dpdk vhostuser in ns0 -> vhostuer - OVS userspace
(But requires TCP in userspace and application modification)
1) veth0 in ns0 -> veth with TSO - OVS kernel module - veth with TSO
-> veth1 in ns1
2) tap0 in ns0 -> virtio_user - OVS userspace - virtio_user -> tap1 in ns1
3) tap0 in ns0 -> recv_tap - OVS with userspace-tso - tap_batch_send
-> tap1 in ns1
4) veth0 in ns0 -> af_packet sock - OVS with userspace-tso -
af_packet_sock -> veth1 in ns1
5) veth0 in ns0 -> netdev-afxdp - OVS - netdev-afxdp -> veth1 in ns1

I also tested Toshiaki's XDP offload patch,
https://www.mail-archive.com/ovs-dev@openvswitch.org/msg45930.html
I would guess it's between 2 to 4.

Regards,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] 答复: [PATCH V3 2/4] Add GSO support for DPDK data path

2021-02-07 Thread William Tu
On Tue, Oct 27, 2020 at 6:02 PM Yi Yang (杨燚)-云服务集团  wrote:
>
> -邮件原件-
> 发件人: dev [mailto:ovs-dev-boun...@openvswitch.org] 代表 Ilya Maximets
> 发送时间: 2020年10月27日 21:03
> 收件人: yang_y...@163.com; ovs-dev@openvswitch.org
> 抄送: f...@sysclose.org; i.maxim...@ovn.org
> 主题: Re: [ovs-dev] [PATCH V3 2/4] Add GSO support for DPDK data path
>
> On 8/7/20 12:56 PM, yang_y...@163.com wrote:
> > From: Yi Yang 
> >
> > GSO(Generic Segment Offload) can segment large UDP  and TCP packet to
> > small packets per MTU of destination , especially for the case that
> > physical NIC can't do hardware offload VXLAN TSO and VXLAN UFO, GSO
> > can make sure userspace TSO can still work but not drop.
> >
> > In addition, GSO can help improve UDP performane when UFO is enabled
> > in VM.
> >
> > GSO can support TCP, UDP, VXLAN TCP, VXLAN UDP, it is done in Tx
> > function of physical NIC.
> >
> > Signed-off-by: Yi Yang 
> > ---
> >  lib/dp-packet.h|  21 +++-
> >  lib/netdev-dpdk.c  | 358
> > +
> >  lib/netdev-linux.c |  17 ++-
> >  lib/netdev.c   |  67 +++---
> >  4 files changed, 417 insertions(+), 46 deletions(-)

snip

> >
> > @@ -2339,24 +2428,19 @@ netdev_dpdk_prep_hwol_batch(struct netdev_dpdk 
> > *dev, struct rte_mbuf **pkts,
> >  return cnt;
> >  }
> >
> > -/* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes
> > ownership of
> > - * 'pkts', even in case of failure.
> > - *
> > - * Returns the number of packets that weren't transmitted. */  static
> > inline int -netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
> > - struct rte_mbuf **pkts, int cnt)
> > +__netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
> > +   struct rte_mbuf **pkts, int cnt)
> >  {
> >  uint32_t nb_tx = 0;
> > -uint16_t nb_tx_prep = cnt;
> > +uint32_t nb_tx_prep;
> >
> > -if (userspace_tso_enabled()) {
> > -nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> > -if (nb_tx_prep != cnt) {
> > -VLOG_WARN_RL(, "%s: Output batch contains invalid packets. "
> > - "Only %u/%u are valid: %s", dev->up.name, 
> > nb_tx_prep,
> > - cnt, rte_strerror(rte_errno));
> > -}
> > +nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> > +if (nb_tx_prep != cnt) {
> > +VLOG_WARN_RL(, "%s: Output batch contains invalid packets. "
> > +  "Only %u/%u are valid: %s",
> > + dev->up.name, nb_tx_prep,
> > + cnt, rte_strerror(rte_errno));
> >  }
> >
> >  while (nb_tx != nb_tx_prep) {
> > @@ -2384,6 +2468,200 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, 
> > int qid,
> >  return cnt - nb_tx;
> >  }
> >
> > +static inline void
> > +set_multiseg_udptcp_cksum(struct rte_mbuf *mbuf)
>
> I didn't review the patch, only had a quick glance, but this part bothers me. 
>  OVS doesn't support multi-segment mbufs, so it should not be possible for 
> such mbufs being transmitted by OVS.  So, I do not understand why this 
> function needs to work with such mbufs.
>
> [Yi Yang] Only DPDK driver/Tx function will use it, not OVS, 
> set_multiseg_udptcp_cksum is called in GSO part, it is last step before Tx 
> function, it is a big external mbuf before rte_gso_segment, that isn't a 
> multi-segmented mbuf.
>

Hi Ilya,

Now I understand Yi Yang's point better and I agree with him.
Looks like the patch does the GSO at the DPDK TX function.
It creates multi-seg mbuf after rte_gso_segment(), but will immediately
send out the multi-seg mbuf to DPDK port, without traversing inside other
part of OVS code. I guess this case it should work OK?

William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] netdev-afxdp: Add start qid support.

2021-02-06 Thread William Tu
On Fri, Feb 5, 2021 at 1:08 PM Gregory Rose  wrote:
>
>
>
> On 2/4/2021 7:08 PM, William Tu wrote:
> > On Thu, Feb 4, 2021 at 3:17 PM Gregory Rose  wrote:
> >>
> >>
> >>
> >> On 2/3/2021 1:21 PM, William Tu wrote:
> >>> Mellanox card has different XSK design. It requires users to create
> >>> dedicated queues for XSK. Unlike Intel's NIC which loads XDP program
> >>> to all queues, Mellanox only loads XDP program to a subset of its queue.
> >>>
> >>> When OVS uses AF_XDP with mlx5, it doesn't replace the existing RX and TX
> >>> queues in the channel with XSK RX and XSK TX queues, but it creates an
> >>> additional pair of queues for XSK in that channel. To distinguish
> >>> regular and XSK queues, mlx5 uses a different range of qids.
> >>> That means, if the card has 24 queues, queues 0..11 correspond to
> >>> regular queues, and queues 12..23 are XSK queues.
> >>> In this case, we should attach the netdev-afxdp with 'start-qid=12'.
> >>>
> >>> I tested using Mellanox Connect-X 6Dx, by setting 'start-qid=1', and:
> >>> $ ethtool -L enp2s0f0np0 combined 1
> >>> # queue 0 is for non-XDP traffic, queue 1 is for XSK
> >>> $ ethtool -N enp2s0f0np0 flow-type udp4 action 1
> >>> note: we need additionally add flow-redirect rule to queue 1
> >>
> >> Seems awfully hardware dependent.  Is this just for Mellanox or does
> >> it have general usefulness?
> >>
> > It is just Mellanox's design which requires pre-configure the flow-director.
> > I only have cards from Intel and Mellanox so I don't know about other 
> > vendors.
> >
> > Thanks,
> > William
> >
>
> I think we need to abstract the HW layer a little bit.  This start-qid
> option is specific to a single piece of HW, at least at this point.
> We should expect that further HW  specific requirements for
> different NIC vendors will come up in the future.  I suggest
> adding a hw_options:mellanox:start-qid type hierarchy  so that
> as new HW requirements come up we can easily scale.  It will
> also make adding new vendors easier in the future.
>
> Even with NIC vendors you can't always count on each new generation
> design to always keep old requirements and methods for feature
> enablement.
>
> What do you think?
>
Thanks for the feedback.
So far I don't know whether other vendors will need this option or not.
I think adding another "hw_options" is a little confusing because this
is already an option on the device.
Looking at AF_XDP driver at DPDK, it also has similar option:
see start_queue
https://doc.dpdk.org/guides/nics/af_xdp.html
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] 答复: 答复: [PATCH] netdev-dpdk: fix incorrect shinfo initialization

2021-02-06 Thread William Tu
On Mon, Feb 1, 2021 at 5:48 PM Yi Yang (杨燚)-云服务集团  wrote:
>
> Thanks Ilya, net_tap PMD is handling tap device on host side, so it can 
> leverage vnet header to do TSO/GSO, maybe net_pmd authors don't know how to 
> do this, from source code, tap fd isn't enabled vnet header and TSO.
>
thanks, learned a lot from these discussions.

I looked at the DPDK net_tap and indeed it doesn't support virtio net hdr.
Do you guys think it makes sense to add TSO at dpdk net_tap?
Or simply using the current OVS's userspace-enable-tso on tap/veth is
good enough?
(using type=system, not using dpdk port type on tap/veth.)

Regards,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] 答复: [PATCH V3 3/4] Add VXLAN TCP and UDP GRO support for DPDK data path

2021-02-06 Thread William Tu
On Tue, Oct 27, 2020 at 5:50 PM Yi Yang (杨燚)-云服务集团  wrote:
>
> -邮件原件-
> 发件人: dev [mailto:ovs-dev-boun...@openvswitch.org] 代表 Ilya Maximets
> 发送时间: 2020年10月27日 21:12
> 收件人: yang_y...@163.com; ovs-dev@openvswitch.org
> 抄送: f...@sysclose.org; i.maxim...@ovn.org
> 主题: Re: [ovs-dev] [PATCH V3 3/4] Add VXLAN TCP and UDP GRO support for DPDK 
> data path
>
> On 8/7/20 12:56 PM, yang_y...@163.com wrote:
> > From: Yi Yang 
> >
> > GRO(Generic Receive Offload) can help improve performance when TSO
> > (TCP Segment Offload) or VXLAN TSO is enabled on transmit side, this
> > can avoid overhead of ovs DPDK data path and enqueue vhost for VM by
> > merging many small packets to large packets (65535 bytes at most) once
> > it receives packets from physical NIC.
>
> IIUC, this patch allows multi-segment mbufs to float across different parts 
> of OVS.  This will definitely crash it somewhere.  Much more changes all over 
> the OVS required to make it safely work with such mbufs.  There were few 
> attempts to introduce this support, but all of them ended up being rejected.  
> As it is this patch is not acceptable as it doesn't cover almost anything 
> beside simple cases inside netdev implementation.
>
> Here is the latest attempt with multi-segment mbufs:
> https://patchwork.ozlabs.org/project/openvswitch/list/?series=130193=*

Thanks, that's very helpful. Looks like a huge amount of work to
introduce multi-seg mbuf.
>
> Best regards, Ilya Maximets.
>
> [Yi Yang] We have to support this because we have supported TSO for TCP, it 
> can't handle big UDP, this is why we must introduce GSO, the prerequisite of 
> GSO is multi-segment  must be enabled because GSOed mbufs are 
> multi-segmented, but it is just last  step before dpdk Tx, so I don't think 
> it is an issue, per my test in our openstack environment, I didn't encounter 
> any crash, this just enabled DPDK PMD driver to handle GSOed mbuf. For GRO, 
> reassembling also use chained multi-segment mbuf to avoid copy, per long time 
> test, it also didn't lead to any crash. We can fix some corner cases if they 
> aren't covered.
>
I just started to understand the problem. Sorry if I missed something.
So currently what do we do to prevent DPDK sending OVS using multi-seg mbuf?
Do we check it and linearize the mbuf?
Can we make GSO/GRO working using linearized mbuf?

Regards,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] netdev-afxdp: Add start qid support.

2021-02-04 Thread William Tu
On Thu, Feb 4, 2021 at 3:17 PM Gregory Rose  wrote:
>
>
>
> On 2/3/2021 1:21 PM, William Tu wrote:
> > Mellanox card has different XSK design. It requires users to create
> > dedicated queues for XSK. Unlike Intel's NIC which loads XDP program
> > to all queues, Mellanox only loads XDP program to a subset of its queue.
> >
> > When OVS uses AF_XDP with mlx5, it doesn't replace the existing RX and TX
> > queues in the channel with XSK RX and XSK TX queues, but it creates an
> > additional pair of queues for XSK in that channel. To distinguish
> > regular and XSK queues, mlx5 uses a different range of qids.
> > That means, if the card has 24 queues, queues 0..11 correspond to
> > regular queues, and queues 12..23 are XSK queues.
> > In this case, we should attach the netdev-afxdp with 'start-qid=12'.
> >
> > I tested using Mellanox Connect-X 6Dx, by setting 'start-qid=1', and:
> >$ ethtool -L enp2s0f0np0 combined 1
> ># queue 0 is for non-XDP traffic, queue 1 is for XSK
> >$ ethtool -N enp2s0f0np0 flow-type udp4 action 1
> > note: we need additionally add flow-redirect rule to queue 1
>
> Seems awfully hardware dependent.  Is this just for Mellanox or does
> it have general usefulness?
>
It is just Mellanox's design which requires pre-configure the flow-director.
I only have cards from Intel and Mellanox so I don't know about other vendors.

Thanks,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 3/5] netdev-offload: Add xdp flow api provider

2021-02-04 Thread William Tu
On Thu, Feb 4, 2021 at 5:15 PM William Tu  wrote:
>
> Hi Toshiaki,
>
> Thanks for the patch. I have some questions inline.
>
> On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
>  wrote:
> >
> > This provider offloads classifier to software XDP.
> >
> > It works only when a custom XDP object is loaded by afxdp netdev.
> > The BPF program needs to implement classifier with array-of-maps for
> > subtable hashmaps and arraymap for subtable masks. The flow api
> > provider detects classifier support in the custom XDP program when
> > loading it.
> >
> > The requirements for the BPF program is as below:
> >
> > - A struct named "meta_info" in ".ovs_meta" section
> >   inform vswitchd of supported keys, actions, and the max number of
> >   actions.
> >
> > - A map named "subtbl_masks"
> >   An array which implements a list. The entry contains struct
> >   xdp_subtables_mask provided by lib/netdev-offload-xdp.h followed by
> >   miniflow buf of any length for the mask.
> >
> > - A map named "subtbl_masks_hd"
> >   A one entry int array which contains the first entry index of the list
> >   implemented by subtbl_masks.
> >
> > - A map named "flow_table"
> >   An array-of-maps. Each entry points to subtable hash-map instantiated
> >   with the following "subtbl_template".
> >   The entry of each index of subtbl_masks has miniflow mask of subtable
> >   in the corresponding index of flow_table.
> >
> > - A map named "subtbl_template"
> >   A template for subtable, used for entries of "flow_table".
> >   The key must be miniflow buf (without leading map).
> >
> > - A map named "output_map"
> >   A devmap. Each entry represents odp port.
> >
> > - A map named "xsks_map"
> >   A xskmap. Used for upcall.
> >
> > For more details, refer to the reference BPF program in next commit.
> >
> > In the future it may be possible to offload classifier to SmartNIC
> > through XDP, but currently it's not because map-in-map is not supported
> > by XDP hw-offload.
> >
> > Signed-off-by: Toshiaki Makita 
> > ---
> >  acinclude.m4  |   25 +
> >  configure.ac  |1 +
> >  lib/automake.mk   |8 +
> >  lib/bpf-util.c|   38 ++
> >  lib/bpf-util.h|   22 +
> >  lib/netdev-afxdp.c|  218 +-
> >  lib/netdev-afxdp.h|3 +
> >  lib/netdev-linux-private.h|2 +
> >  lib/netdev-offload-provider.h |8 +-
> >  lib/netdev-offload-xdp.c  | 1213 +
> >  lib/netdev-offload-xdp.h  |   49 ++
> >  lib/netdev-offload.c  |3 +
> >  12 files changed, 1588 insertions(+), 2 deletions(-)
> >  create mode 100644 lib/bpf-util.c
> >  create mode 100644 lib/bpf-util.h
> >  create mode 100644 lib/netdev-offload-xdp.c
> >  create mode 100644 lib/netdev-offload-xdp.h
> >
> > diff --git a/acinclude.m4 b/acinclude.m4
> > index 4bac9dbdd..31ff0c013 100644
> > --- a/acinclude.m4
> > +++ b/acinclude.m4
> > @@ -329,6 +329,31 @@ AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> >AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> >  ])
> >
> > +dnl OVS_CHECK_LINUX_XDP_OFFLOAD
> > +dnl
> > +dnl Check both llvm and libbpf support
> > +AC_DEFUN([OVS_CHECK_LINUX_XDP_OFFLOAD], [
> > +  AC_ARG_ENABLE([xdp_offload],
> > +[AC_HELP_STRING([--enable-xdp-offload],
> > +[Compile XDP offload])],
> > +[], [enable_xdp_offload=no])
> > +  AC_MSG_CHECKING([whether XDP offload is enabled])
> > +  if test "$enable_xdp_offload" != yes; then
> > +AC_MSG_RESULT([no])
> > +XDP_OFFLOAD_ENABLE=false
> > +  else
> > +if test "$enable_afxdp" == no; then
> > +  AC_MSG_ERROR([XDP offload depends on afxdp. Please add 
> > --enable-afxdp.])
> > +fi
> > +AC_MSG_RESULT([yes])
> > +XDP_OFFLOAD_ENABLE=true
> > +
> > +AC_DEFINE([HAVE_XDP_OFFLOAD], [1],
> > +  [Define to 1 if XDP offload compilation is available and 
> > enabled.])
> > +  fi
> > +  AM_CONDITIONAL([HAVE_XDP_OFFLOAD], test "$XDP_OFFLOAD_ENABLE" = true)
> > +])
> > +
> >  dnl OVS_CHECK_DPDK
> >  dnl
> >  dnl Configure DPDK source tree
> > diff --git a/configure.ac b/configure.ac
> >

Re: [ovs-dev] [PATCH v4 3/5] netdev-offload: Add xdp flow api provider

2021-02-04 Thread William Tu
Hi Toshiaki,

Thanks for the patch. I have some questions inline.

On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
 wrote:
>
> This provider offloads classifier to software XDP.
>
> It works only when a custom XDP object is loaded by afxdp netdev.
> The BPF program needs to implement classifier with array-of-maps for
> subtable hashmaps and arraymap for subtable masks. The flow api
> provider detects classifier support in the custom XDP program when
> loading it.
>
> The requirements for the BPF program is as below:
>
> - A struct named "meta_info" in ".ovs_meta" section
>   inform vswitchd of supported keys, actions, and the max number of
>   actions.
>
> - A map named "subtbl_masks"
>   An array which implements a list. The entry contains struct
>   xdp_subtables_mask provided by lib/netdev-offload-xdp.h followed by
>   miniflow buf of any length for the mask.
>
> - A map named "subtbl_masks_hd"
>   A one entry int array which contains the first entry index of the list
>   implemented by subtbl_masks.
>
> - A map named "flow_table"
>   An array-of-maps. Each entry points to subtable hash-map instantiated
>   with the following "subtbl_template".
>   The entry of each index of subtbl_masks has miniflow mask of subtable
>   in the corresponding index of flow_table.
>
> - A map named "subtbl_template"
>   A template for subtable, used for entries of "flow_table".
>   The key must be miniflow buf (without leading map).
>
> - A map named "output_map"
>   A devmap. Each entry represents odp port.
>
> - A map named "xsks_map"
>   A xskmap. Used for upcall.
>
> For more details, refer to the reference BPF program in next commit.
>
> In the future it may be possible to offload classifier to SmartNIC
> through XDP, but currently it's not because map-in-map is not supported
> by XDP hw-offload.
>
> Signed-off-by: Toshiaki Makita 
> ---
>  acinclude.m4  |   25 +
>  configure.ac  |1 +
>  lib/automake.mk   |8 +
>  lib/bpf-util.c|   38 ++
>  lib/bpf-util.h|   22 +
>  lib/netdev-afxdp.c|  218 +-
>  lib/netdev-afxdp.h|3 +
>  lib/netdev-linux-private.h|2 +
>  lib/netdev-offload-provider.h |8 +-
>  lib/netdev-offload-xdp.c  | 1213 +
>  lib/netdev-offload-xdp.h  |   49 ++
>  lib/netdev-offload.c  |3 +
>  12 files changed, 1588 insertions(+), 2 deletions(-)
>  create mode 100644 lib/bpf-util.c
>  create mode 100644 lib/bpf-util.h
>  create mode 100644 lib/netdev-offload-xdp.c
>  create mode 100644 lib/netdev-offload-xdp.h
>
> diff --git a/acinclude.m4 b/acinclude.m4
> index 4bac9dbdd..31ff0c013 100644
> --- a/acinclude.m4
> +++ b/acinclude.m4
> @@ -329,6 +329,31 @@ AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
>AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
>  ])
>
> +dnl OVS_CHECK_LINUX_XDP_OFFLOAD
> +dnl
> +dnl Check both llvm and libbpf support
> +AC_DEFUN([OVS_CHECK_LINUX_XDP_OFFLOAD], [
> +  AC_ARG_ENABLE([xdp_offload],
> +[AC_HELP_STRING([--enable-xdp-offload],
> +[Compile XDP offload])],
> +[], [enable_xdp_offload=no])
> +  AC_MSG_CHECKING([whether XDP offload is enabled])
> +  if test "$enable_xdp_offload" != yes; then
> +AC_MSG_RESULT([no])
> +XDP_OFFLOAD_ENABLE=false
> +  else
> +if test "$enable_afxdp" == no; then
> +  AC_MSG_ERROR([XDP offload depends on afxdp. Please add 
> --enable-afxdp.])
> +fi
> +AC_MSG_RESULT([yes])
> +XDP_OFFLOAD_ENABLE=true
> +
> +AC_DEFINE([HAVE_XDP_OFFLOAD], [1],
> +  [Define to 1 if XDP offload compilation is available and 
> enabled.])
> +  fi
> +  AM_CONDITIONAL([HAVE_XDP_OFFLOAD], test "$XDP_OFFLOAD_ENABLE" = true)
> +])
> +
>  dnl OVS_CHECK_DPDK
>  dnl
>  dnl Configure DPDK source tree
> diff --git a/configure.ac b/configure.ac
> index 8d37af9db..530112c49 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -99,6 +99,7 @@ OVS_CHECK_DOT
>  OVS_CHECK_IF_DL
>  OVS_CHECK_STRTOK_R
>  OVS_CHECK_LINUX_AF_XDP
> +OVS_CHECK_LINUX_XDP_OFFLOAD
>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include ]])
>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
>[], [], [[#include ]])
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 920c958e3..9486b32e7 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -451,6 +451,14 @@ lib_libopenvswitch_la_SOURCES += \
> lib/netdev-afxdp.h
>  endif
>
> +if HAVE_XDP_OFFLOAD
> +lib_libopenvswitch_la_SOURCES += \
> +   lib/bpf-util.c \
> +   lib/bpf-util.h \
> +   lib/netdev-offload-xdp.c \
> +   lib/netdev-offload-xdp.h
> +endif
> +
>  if DPDK_NETDEV
>  lib_libopenvswitch_la_SOURCES += \
> lib/dpdk.c \
> diff --git a/lib/bpf-util.c b/lib/bpf-util.c
> new file mode 100644
> index 0..324cfbe1d
> --- /dev/null
> +++ b/lib/bpf-util.c
> @@ -0,0 +1,38 @@
> +/*
> + * Copyright (c) 2020 NTT Corp.
> + 

Re: [ovs-dev] [PATCH v4 2/5] netdev-offload: Add "offload-driver" other_config to specify offload driver

2021-02-04 Thread William Tu
On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
 wrote:
>
> The following commit will introduce another offload driver using XDP.
> When using afxdp netdev, both of TC and XDP will be supported, so let's
> add an other_config to specify which offload driver is preferable.
> When not specified and multiple offload drivers can be used, TC will be
> used if netdev supports it.
>
> Signed-off-by: Toshiaki Makita 
> ---
The implementation looks good to me. Please also add "offload-driver"
to the vswitch.xml.

About the interface, currently people are just setting
$ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
and assume tc hw offload for linux device and DPDK rte_flow for OVS-DPDK.

Instead of boolean, another way is to
$ ovs-vsctl set Open_vSwitch . other_config:hw-offload={tc, rte_flow, xdp}
or just
$ ovs-vsctl set Open_vSwitch . other_config:offload={tc, rte_flow, xdp}

Let's wait for others feedback.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 1/5] netdev-afxdp: Enable loading XDP program.

2021-02-04 Thread William Tu
On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
 wrote:
>
> From: William Tu 
>
> Now netdev-afxdp always forwards all packets to userspace because
> it is using libbpf's default XDP program, see 'xsk_load_xdp_prog'.
> There are some cases when users want to keep packets in kernel instead
> of sending to userspace, for example, management traffic such as SSH
> should be processed in kernel.
>
> The patch enables loading the user-provided XDP program by
>   $ovs-vsctl -- set int afxdp-p0 options:xdp-obj=
>
> So users can implement their filtering logic or traffic steering idea
> in their XDP program, and rest of the traffic passes to AF_XDP socket
> handled by OVS.
>
> Note: kernel in AF_XDP CI test is updated to 5.5 because libbpf from 5.3
> does not have newly used APIs like "bpf_program__get_type".

Thanks for figuring the right version.

>
> Signed-off-by: William Tu 
> Co-Authored-by: Toshiaki Makita 
> Signed-off-by: Toshiaki Makita 
> ---
>  .travis.yml   |   2 +-
>  Documentation/intro/install/afxdp.rst |  59 ++
>  NEWS  |   2 +
>  lib/netdev-afxdp.c| 155 --
>  lib/netdev-linux-private.h|   3 +
>  5 files changed, 213 insertions(+), 8 deletions(-)
>
> diff --git a/.travis.yml b/.travis.yml
> index 527240a67..26b55a3e6 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -40,7 +40,7 @@ env:
>- TESTSUITE=1 LIBS=-ljemalloc
>- KERNEL_LIST="5.5  4.20 4.19 4.18 4.17 4.16"
>- KERNEL_LIST="4.15 4.14 4.9  4.4  3.19 3.16"
> -  - AFXDP=1 KERNEL=5.3
> +  - AFXDP=1 KERNEL=5.5
>- M32=1 OPTS="--disable-ssl"
>- DPDK=1 OPTS="--enable-shared"
>- DPDK_SHARED=1

We'll need to rebase this patch to latest master.
Otherwise the patch looks good to me, tested OK
on my system.

William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] netdev-natvie-tnl: Fix geneve tunnel flag

2021-02-04 Thread William Tu
On Thu, Feb 4, 2021 at 10:08 AM Cpp Code  wrote:
>
> The patch is ready, I am writing a test for this. Maybe it was worth checking 
> this in and adding test later!?
>
I'd suggest having a test case together with the patch, since this
isn't a trivial issue.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 0/5] XDP offload using flow API provider

2021-02-04 Thread William Tu
Hi Toshiaki,

Thanks for the patch. I've been testing it for a couple days.
I liked it a lot! The compile and build process all work without any issues.

On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
 wrote:
>
> This patch adds an XDP-based flow cache using the OVS netdev-offload
> flow API provider.  When an OVS device with XDP offload enabled,
> packets first are processed in the XDP flow cache (with parse, and
> table lookup implemented in eBPF) and if hits, the action processing
> are also done in the context of XDP, which has the minimum overhead.
>
> This provider is based on top of William's recently posted patch for
> custom XDP load.  When a custom XDP is loaded, the provider detects if
> the program supports classifier, and if supported it starts offloading
> flows to the XDP program.
>
> The patches are derived from xdp_flow[1], which is a mechanism similar to
> this but implemented in kernel.
>
>
> * Motivation
>
> While userspace datapath using netdev-afxdp or netdev-dpdk shows good
> performance, there are use cases where packets better to be processed in
> kernel, for example, TCP/IP connections, or container to container
> connections.  Current solution is to use tap device or af_packet with
> extra kernel-to/from-userspace overhead.  But with XDP, a better solution
> is to steer packets earlier in the XDP program, and decides to send to
> userspace datapath or stay in kernel.
>
> One problem with current netdev-afxdp is that it forwards all packets to
> userspace, The first patch from William (netdev-afxdp: Enable loading XDP
> program.) only provides the interface to load XDP program, howerver users
> usually don't know how to write their own XDP program.
>
> XDP also supports HW-offload so it may be possible to offload flows to
> HW through this provider in the future, although not currently.
> The reason is that map-in-map is required for our program to support
> classifier with subtables in XDP, but map-in-map is not offloadable.
> If map-in-map becomes offloadable, HW-offload of our program may also
> be possible.

I think it's too far away for XDP to be offloaded into HW and meet OVS's
feature requirements.
There is a research prototype here, FYI.
https://www.usenix.org/conference/osdi20/presentation/brunella

>
>
> * How to use
>
> 1. Install clang/llvm >= 9, libbpf >= 0.0.6 (included in kernel 5.5), and
>kernel >= 5.3.
>
> 2. make with --enable-afxdp --enable-xdp-offload
> --enable-bpf will generate XDP program "bpf/flowtable_afxdp.o".  Note that

typo: I think you mean --enable-xdp-offload

> the BPF object will not be installed anywhere by "make install" at this point.
>
> 3. Load custom XDP program
> E.g.
> $ ovs-vsctl add-port ovsbr0 veth0 -- set int veth0 options:xdp-mode=native \
>   options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
> $ ovs-vsctl add-port ovsbr0 veth1 -- set int veth1 options:xdp-mode=native \
>   options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
>
> 4. Enable XDP_REDIRECT
> If you use veth devices, make sure to load some (possibly dummy) programs
> on the peers of veth devices. This patch set includes a program which
> does nothing but returns XDP_PASS. You can use it for the veth peer like
> this:
> $ ip link set veth1 xdpdrv object /path/to/ovs/bpf/xdp_noop.o section xdp

I'd suggest not using "veth1" as an example, because in (3) above, people
might think "veth1" is already attached to ovsbr0.
IIUC, here your "veth1" should be the device at the peer inside
another namespace.

>
> Some HW NIC drivers require as many queues as cores on its system. Tweak
> queues using "ethtool -L".
>
> 5. Enable hw-offload
> $ ovs-vsctl set Open_vSwitch . other_config:offload-driver=linux_xdp
> $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
> This will starts offloading flows to the XDP program.
>
> You should be able to see some maps installed, including "debug_stats".
> $ bpftool map
>
> If packets are successfully redirected by the XDP program,
> debug_stats[2] will be counted.
> $ bpftool map dump id 
>
> Currently only very limited keys and output actions are supported.
> For example NORMAL action entry and IP based matching work with current
> key support. VLAN actions used by port tag/trunks are also supported.
>

I don't know if this is too much to ask for.
I wonder if you, or we can work together, to add at least a tunnel
support, ex: vxlan?
The current version is a good prototype for people to test an L2/L3
XDP offload switch,
but without a good use case, it's hard to attract more people to
contribute or use it.

>From a maintenance point of view, can we add a test case to avoid regression?
For example, something like "make check-afxdp".
We can have "make check-xdp-offload". I can also help adding it.

>
> * Performance
>
> Tested 2 cases. 1) i40e to veth, 2) i40e to i40e.
> Test 1 Measured drop rate at veth interface with redirect action from
> physical interface (i40e 25G NIC, XXV 710) to veth. The CPU is Xeon
> Silver 4114 (2.20 GHz).

Re: [ovs-dev] [PATCH v2] rhel: Add option to enable AF_XDP on rpm package

2021-02-04 Thread William Tu
On Wed, Feb 3, 2021 at 4:33 PM Yi-Hung Wei  wrote:
>
> This patch adds an RPMBUILD_OPT so that user can enable
> AF_XDP support in the rpm package by:
>
> $ make rpm-fedora RPMBUILD_OPT="--with afxdp"
>
> Signed-off-by: Yi-Hung Wei 
> ---

LGTM
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] netdev-afxdp: Add start qid support.

2021-02-03 Thread William Tu
Mellanox card has different XSK design. It requires users to create
dedicated queues for XSK. Unlike Intel's NIC which loads XDP program
to all queues, Mellanox only loads XDP program to a subset of its queue.

When OVS uses AF_XDP with mlx5, it doesn't replace the existing RX and TX
queues in the channel with XSK RX and XSK TX queues, but it creates an
additional pair of queues for XSK in that channel. To distinguish
regular and XSK queues, mlx5 uses a different range of qids.
That means, if the card has 24 queues, queues 0..11 correspond to
regular queues, and queues 12..23 are XSK queues.
In this case, we should attach the netdev-afxdp with 'start-qid=12'.

I tested using Mellanox Connect-X 6Dx, by setting 'start-qid=1', and:
  $ ethtool -L enp2s0f0np0 combined 1
  # queue 0 is for non-XDP traffic, queue 1 is for XSK
  $ ethtool -N enp2s0f0np0 flow-type udp4 action 1
note: we need additionally add flow-redirect rule to queue 1

Tested-at: https://github.com/williamtu/ovs-travis/actions/runs/535141041
Signed-off-by: William Tu 
---
 Documentation/intro/install/afxdp.rst |  2 ++
 lib/netdev-afxdp.c| 23 ++-
 lib/netdev-linux-private.h|  1 +
 vswitchd/vswitch.xml  |  8 
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/Documentation/intro/install/afxdp.rst 
b/Documentation/intro/install/afxdp.rst
index f2643e0d41a1..eac298c52575 100644
--- a/Documentation/intro/install/afxdp.rst
+++ b/Documentation/intro/install/afxdp.rst
@@ -204,6 +204,8 @@ more details):
  * ``use-need-wakeup``: default ``true`` if libbpf supports it,
otherwise ``false``.
 
+* ``start-qid``: default ``0``.
+
 For example, to use 1 PMD (on core 4) on 1 queue (queue 0) device,
 configure these options: ``pmd-cpu-mask``, ``pmd-rxq-affinity``, and
 ``n_rxq``::
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index 482400d8d135..36f6a323b1bc 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -458,12 +458,13 @@ xsk_configure_queue(struct netdev_linux *dev, int 
ifindex, int queue_id,
 VLOG_DBG("%s: configuring queue: %d, mode: %s, use-need-wakeup: %s.",
  netdev_get_name(>up), queue_id, xdp_modes[mode].name,
  dev->use_need_wakeup ? "true" : "false");
-xsk_info = xsk_configure(ifindex, queue_id, mode, dev->use_need_wakeup,
- report_socket_failures);
+xsk_info = xsk_configure(ifindex, dev->startqid + queue_id, mode,
+ dev->use_need_wakeup, report_socket_failures);
 if (!xsk_info) {
 VLOG(report_socket_failures ? VLL_ERR : VLL_DBG,
- "%s: Failed to create AF_XDP socket on queue %d in %s mode.",
- netdev_get_name(>up), queue_id, xdp_modes[mode].name);
+ "%s: Failed to create AF_XDP socket on queue %d+%d in %s mode.",
+ netdev_get_name(>up), dev->startqid, queue_id,
+ xdp_modes[mode].name);
 dev->xsks[queue_id] = NULL;
 return -1;
 }
@@ -604,6 +605,7 @@ netdev_afxdp_set_config(struct netdev *netdev, const struct 
smap *args,
 enum afxdp_mode xdp_mode;
 bool need_wakeup;
 int new_n_rxq;
+int new_startqid;
 
 ovs_mutex_lock(>mutex);
 new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
@@ -637,12 +639,18 @@ netdev_afxdp_set_config(struct netdev *netdev, const 
struct smap *args,
 }
 #endif
 
+/* TODO: need to check
+ * new_startqid + new_n_rxq > total dev's queues. */
+new_startqid = smap_get_int(args, "start-qid", 0);
+
 if (dev->requested_n_rxq != new_n_rxq
 || dev->requested_xdp_mode != xdp_mode
-|| dev->requested_need_wakeup != need_wakeup) {
+|| dev->requested_need_wakeup != need_wakeup
+|| dev->requested_startqid != new_startqid) {
 dev->requested_n_rxq = new_n_rxq;
 dev->requested_xdp_mode = xdp_mode;
 dev->requested_need_wakeup = need_wakeup;
+dev->requested_startqid = new_startqid;
 netdev_request_reconfigure(netdev);
 }
 ovs_mutex_unlock(>mutex);
@@ -661,6 +669,7 @@ netdev_afxdp_get_config(const struct netdev *netdev, struct 
smap *args)
 xdp_modes[dev->xdp_mode_in_use].name);
 smap_add_format(args, "use-need-wakeup", "%s",
 dev->use_need_wakeup ? "true" : "false");
+smap_add_format(args, "start-qid", "%d", dev->startqid);
 ovs_mutex_unlock(>mutex);
 return 0;
 }
@@ -696,6 +705,7 @@ netdev_afxdp_reconfigure(struct netdev *netdev)
 if (netdev->n_rxq == dev->requested_n_rxq
 && dev->xdp_mode == dev->requested_xdp_mode
 && dev->use_need_wakeup == dev->requested_need_wakeup
+&& dev->startqid == dev-&

[ovs-dev] [RFC PATCH] userspace: Enable tunnel with TSO.

2021-02-03 Thread William Tu
Currently when setting 'userspace-tso-enable=true', tunnel test cases
fail due to incorrect checksum, at inner header and outer header.
The patch recalculates the checksum before packet is outputting to
a port (tunnel and tap), makes sure the receiver sees correct checksum.

Consider the following cases:
1) veth -> ovs -> veth, and 2) tap -> ovs -> tap
No need to recalc csum because vnet hdr carries the offload
information.

3) decap: vxlan tunnel -> br-underlay -> br-overlay
The inner packet is sent to br-overlay (which is a tap).
Need to fix the inner header's csum.

4) encap: br-overlay -> br-underlay -> vxlan tunnel
Fix the inner csum before pushing the outer header.

I added iperf and pass vxlan and geneve tests:
$ make check-system-tso TESTSUITEFLAGS="-k vxlan"
$ make check-system-tso TESTSUITEFLAGS="-k geneve"

While TCP works over tunnel, the TCP sender sending huge
packet size will fail. I have to segment the inner TCP
packet before pushing the outer tunnel header.

Signed-off-by: William Tu 
---
 lib/netdev-linux.c  |  2 +-
 lib/netdev-native-tnl.c | 11 ++-
 lib/netdev.c| 18 ++
 lib/packets.c   | 34 ++
 lib/packets.h   |  1 +
 tests/system-tap.at |  3 +++
 tests/system-traffic.at |  9 +
 7 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 6be23dbeed57..bb365b3b0da3 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1446,7 +1446,6 @@ netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux 
*rx, int mtu,
  netdev_get_name(netdev_));
 continue;
 }
-
 dp_packet_batch_add(batch, pkt);
 }
 
@@ -1604,6 +1603,7 @@ netdev_linux_tap_batch_send(struct netdev *netdev_, bool 
tso, int mtu,
 int error;
 
 if (tso) {
+packet_csum_tcpudp(packet);
 netdev_linux_prepend_vnet_hdr(packet, mtu);
 }
 
diff --git a/lib/netdev-native-tnl.c b/lib/netdev-native-tnl.c
index b89dfdd52a86..003c78a151f8 100644
--- a/lib/netdev-native-tnl.c
+++ b/lib/netdev-native-tnl.c
@@ -43,6 +43,7 @@
 #include "seq.h"
 #include "unaligned.h"
 #include "unixctl.h"
+#include "userspace-tso.h"
 #include "openvswitch/vlog.h"
 
 VLOG_DEFINE_THIS_MODULE(native_tnl);
@@ -153,6 +154,12 @@ netdev_tnl_push_ip_header(struct dp_packet *packet,
 struct ip_header *ip;
 struct ovs_16aligned_ip6_hdr *ip6;
 
+if (userspace_tso_enabled()) {
+/* Calculate inner header's checksum before pushing outer header.
+ * (Assume the device does not support tnl checksum) */
+packet_csum_tcpudp(packet);
+}
+
 eth = dp_packet_push_uninit(packet, size);
 *ip_tot_size = dp_packet_size(packet) - sizeof (struct eth_header);
 
@@ -189,7 +196,9 @@ udp_extract_tnl_md(struct dp_packet *packet, struct 
flow_tnl *tnl,
 return NULL;
 }
 
-if (udp->udp_csum) {
+/* 'udp->udp_csum' will be the pseudo header csum when when userspace
+ * TSO is enabled. Skip the validation. */
+if (udp->udp_csum && !userspace_tso_enabled()) {
 if (OVS_UNLIKELY(!dp_packet_l4_checksum_valid(packet))) {
 uint32_t csum;
 if (netdev_tnl_is_header_ipv6(dp_packet_data(packet))) {
diff --git a/lib/netdev.c b/lib/netdev.c
index 91e91955c09b..bdfc45e9 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -960,18 +960,12 @@ netdev_push_header(const struct netdev *netdev,
 size_t i, size = dp_packet_batch_size(batch);
 
 DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
-if (OVS_UNLIKELY(dp_packet_hwol_is_tso(packet)
- || dp_packet_hwol_l4_mask(packet))) {
-COVERAGE_INC(netdev_push_header_drops);
-dp_packet_delete(packet);
-VLOG_WARN_RL(, "%s: Tunneling packets with HW offload flags is "
- "not supported: packet dropped",
- netdev_get_name(netdev));
-} else {
-netdev->netdev_class->push_header(netdev, packet, data);
-pkt_metadata_init(>md, data->out_port);
-dp_packet_batch_refill(batch, packet, i);
-}
+/* Tunneling packet with HW offload flags is not supported. */
+*dp_packet_ol_flags_ptr(packet) = 0;
+
+netdev->netdev_class->push_header(netdev, packet, data);
+pkt_metadata_init(>md, data->out_port);
+dp_packet_batch_refill(batch, packet, i);
 }
 
 return 0;
diff --git a/lib/packets.c b/lib/packets.c
index 4a7643c5dd3a..b0bb283acdfa 100644
--- a/lib/packets.c
+++ b/lib/packets.c
@@ -1887,3 +1887,37 @@ IP_ECN_set_ce(struct dp_packet *pkt, bool is_ipv6)
 }
 }
 }
+
+void
+packet_csum_tcpudp(struct dp_packet *p)
+{
+struct eth_header *eth;
+ 

Re: [ovs-dev] [PATCH] netdev-natvie-tnl: Fix geneve tunnel flag

2020-12-15 Thread William Tu
ns=resubmit(,10)
> priority=0 actions=NORMAL
> priority=50 actions=resubmit(,40)
> table=10, priority=100,ip actions=ct(table=20,zone=65520)
> table=20, priority=200,ct_state=-new+trk,ip actions=resubmit(,30)
> table=20, priority=100,ip,nw_dst=10.10.0.2 actions=resubmit(,30)
> table=20, priority=0,ip actions=drop
> table=30, priority=100,ip actions=ct(commit,table=40,zone=65520)
> table=40, priority=100,in_port=veth1 
> actions=load:0xc0a84d65->NXM_NX_TUN_IPV4_DST[],output:tun0
> table=40, priority=100,in_port=tun0 actions=output:veth1
> table=40, priority=0 actions=drop
>
> Step 5 – Check that ping works:
> From the first Node: ip netns exec ns0 ping 10.10.0.2
>
> Step 6 – The actual issue:
> From the first Node: ip netns exec ns0 ping -c 3 10.10.0.2
> From the second Node: ovs-ofctl del-flows br-int 
> 'table=20,ip,nw_dst=10.10.0.2'
> From the first Node: ip netns exec ns0 ping -c 3 10.10.0.2

Thanks for the steps.
The bug should be reproducible without using conntrack.

>
> Execute these instructions in order, as soon as the previous one
> completes.  If you follow these steps, you should see the ping in
> the last step succeed.  This is not expected because of the deleted
> flow. It also does not happen with VXLAN.
Make sense, because it's the inconsistency in FLOW_TNL_F_UDPIF field
in megaflow matching.
VXLAN does not use it.

> Reported-by: Antonin Bas 
> Reported-at: https://github.com/vmware-tanzu/antrea/issues/897
> Fixes: 6b241d645291 ("netdev-vport: Factor-out tunnel Push-pop code into 
> separate module.")
> Signed-off-by: Yi-Hung Wei 

VMWare-BZ: #2623444

> ---
>  lib/netdev-native-tnl.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/lib/netdev-native-tnl.c b/lib/netdev-native-tnl.c
> index b89dfdd52a86..09cff3a1fc7d 100644
> --- a/lib/netdev-native-tnl.c
> +++ b/lib/netdev-native-tnl.c
> @@ -1015,9 +1015,11 @@ netdev_geneve_pop_header(struct dp_packet *packet)
>  tnl->tun_id = htonll(ntohl(get_16aligned_be32(>vni)) >> 8);
>      tnl->flags |= FLOW_TNL_F_KEY;
>
> -memcpy(tnl->metadata.opts.gnv, gnh->options, opts_len);
> -tnl->metadata.present.len = opts_len;
> -tnl->flags |= FLOW_TNL_F_UDPIF;
> +if (opts_len > 0) {
> +memcpy(tnl->metadata.opts.gnv, gnh->options, opts_len);
> +tnl->metadata.present.len = opts_len;
> +tnl->flags |= FLOW_TNL_F_UDPIF;
> +}
>
>  packet->packet_type = htonl(PT_ETH);
>  dp_packet_reset_packet(packet, hlen);
> --
LGTM.
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2] bugtool: Fix for Python3

2020-11-05 Thread William Tu
On Thu, Nov 5, 2020 at 4:48 AM Ilya Maximets  wrote:
>
> On 4/10/20 12:03 AM, William Tu wrote:
> > On Wed, Mar 25, 2020 at 05:33:51PM -0700, William Tu wrote:
> >> On Thu, Mar 19, 2020 at 12:20:19PM -0700, William Tu wrote:
> >>> On Thu, Mar 19, 2020 at 12:05 PM Timothy Redaelli  
> >>> wrote:
> >>>>
> >>>> Currently ovs-bugtool tool doesn't start on Python 3.
> >>>> This commit fixes ovs-bugtool to make it works on Python 3.
> >>>>
> >>>> Replaced StringIO.StringIO with io.BytesIO since the script is
> >>>> processing binary data.
> >>>>
> >>>> Reported-at: https://bugzilla.redhat.com/1809241
> >>>> Reported-by: Flavio Leitner 
> >>>> Signed-off-by: Timothy Redaelli 
> >>>> ---
> >>>> Changes since v1:
> >>>>   * Converted StringIO to BytesIO
> >>>>   * Fix some other string/bytes conversion
> >>>> ---
> >>>
> >>> Thanks for sending out v2. Hit an error below:
> >>> ~/ovs# python3
> >>> Python 3.5.2 (default, Oct  8 2019, 13:06:37)
> >>>
> >>> ~/ovs# ./utilities/bugtool/ovs-bugtool -y -s --output=tar.gz
> >>> --outfile=/tmp/t.tgz
> >>> Traceback (most recent call last):
> >>>   File "./utilities/bugtool/ovs-bugtool", line 1405, in 
> >>> sys.exit(main())
> >>>   File "./utilities/bugtool/ovs-bugtool", line 717, in main
> >>> collect_data()
> >>>   File "./utilities/bugtool/ovs-bugtool", line 388, in collect_data
> >>> v['output'] = BytesIOmtime(s)
> >>>   File "./utilities/bugtool/ovs-bugtool", line 1395, in __init__
> >>> BytesIO.__init__(self, buf)
> >>> TypeError: a bytes-like object is required, not 'str'
> >>>
> >>> I think sometimes 's' is bytes type, sometimes 's' is a str type...
> >>> William
> >>
> >> Hi Timothy,
> >>
> >> How about adding this to your patch?
> >> I tested it and works ok.
> >>
> >> diff --git a/utilities/bugtool/ovs-bugtool.in 
> >> b/utilities/bugtool/ovs-bugtool.in
> >> index c26c2be7a4eb..47f3c4629f70 100755
> >> --- a/utilities/bugtool/ovs-bugtool.in
> >> +++ b/utilities/bugtool/ovs-bugtool.in
> >> @@ -385,7 +385,10 @@ def collect_data():
> >>  except Exception as e:
> >>  s = str(e).encode()
> >>  if check_space(cap, k, len(s)):
> >> -v['output'] = BytesIOmtime(s)
> >> +if isinstance(s, str):
> >> +v['output'] = BytesIOmtime(s.encode())
> >> +else:
> >> +v['output'] = BytesIOmtime(s)
> >>
> >>
> >>  def main(argv=None):
> >>
> > I applied to master with the diff above.
> > Thanks!
> > William
>
> Looks like this patch didn't make it to branch-2.13, but it needed there since
> we do not support python2 staring from 2.13.
> I'll backport it.
>
> Best regards, Ilya Maximets.

Thanks!
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] ovs-bugtool: Fix crash when enable --ovs.

2020-11-04 Thread William Tu
When enabling '--ovs' or when not using '-y', ovs-bugtool crashes due to
Traceback (most recent call last):
  File "/usr/local/sbin/ovs-bugtool", line 1410, in 
sys.exit(main())
  File "/usr/local/sbin/ovs-bugtool", line 690, in main
for (k, v) in data.items():
RuntimeError: dictionary changed size during iteration

The patch fixes it by making a copy of the key and value.

VMware-BZ: #2663359
Signed-off-by: William Tu 
---
 utilities/bugtool/ovs-bugtool.in | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/utilities/bugtool/ovs-bugtool.in b/utilities/bugtool/ovs-bugtool.in
index ddb5bc8dc54c..3792502e7a0c 100755
--- a/utilities/bugtool/ovs-bugtool.in
+++ b/utilities/bugtool/ovs-bugtool.in
@@ -687,7 +687,7 @@ exclude those logs from the archive.
  CAP_OPENVSWITCH_LOGS, CAP_NETWORK_CONFIG]
 ovs_info_list = ['process-tree']
 # We cannot use iteritems, since we modify 'data' as we pass through
-for (k, v) in data.items():
+for (k, v) in list(data.items()):
 cap = v['cap']
 if 'filename' in v:
 info = k[0]
@@ -708,7 +708,7 @@ exclude those logs from the archive.
 
 # permit the user to filter out data
 # We cannot use iteritems, since we modify 'data' as we pass through
-for (k, v) in data.items():
+for (k, v) in list(data.items()):
 cap = v['cap']
 if 'filename' in v:
 key = k[0]
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] dpif-netdev: Add miniflow bits to dump-flows.

2020-10-21 Thread William Tu
On Wed, Oct 21, 2020 at 8:47 AM Stokes, Ian  wrote:
>
> > On 6/8/20 5:36 PM, William Tu wrote:
> > > On Mon, Jun 8, 2020 at 7:01 AM Ilya Maximets  wrote:
> > >>
> > >> On 5/14/20 4:11 PM, William Tu wrote:
> > >>> The 'dpctl/dump-flows -m' only shows the number of 1-bit in the
> > >>> miniflow map, the patch outputs additional miniflow bits after it.
> > >>> The format will be
> > >>>   dp-extra-info:miniflow_bits(count_1bit(unit0):unit0,
> > >>>   count_1bit(unit1):unit1)
> > >>> Example:
> > >>>   dp-extra-info:miniflow_bits(4:0x30c0,1:0x400)
> > >>>
> > >>> By searching the unique miniflow bits, we know the number of subtables,
> > >>> and for earch subtables, the fields it matches on.
> > >>
> > >> Hi.
> > >>
> > >> Beside the curiosity what is the purpose of printing this information?
> > >> How can it be used?
> > >>
> > > So from the bitmap we can know which field in the 'struct flow' this
> > > subtable is matching on. And collecting all the bitmaps from 
> > > dpctl/dump-flow,
> > > we can know which fields are used to match more frequently than others.
> >
> > Don't you have all this information from the flow match?
>
> Hi William,
>
> I haven't seen feedback to Ilyas point above. I guess from our side I was 
> think the same as Ilya, if this is available already then do we need to 
> duplicate here?
>
Hi Ian and Ilya,

Thanks for the feedback.
I agree with you that it's not worth duplicating the information here.
Please drop this patch. Thanks!
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.13 0/2] Release patches for v2.13.1.

2020-08-04 Thread William Tu
On Tue, Aug 4, 2020 at 7:24 AM Ilya Maximets  wrote:
>
> On 8/4/20 3:19 PM, William Tu wrote:
> > On Mon, Aug 3, 2020 at 2:23 PM Ilya Maximets  wrote:
> >>
> >> On 7/30/20 12:44 AM, Ilya Maximets wrote:
> >>>
> >>> Ilya Maximets (2):
> >>>   Set release date for 2.13.1.
> >>>   Prepare for 2.13.2.
> >>>
> >>>  NEWS | 14 +-
> >>>  configure.ac |  2 +-
> >>>  debian/changelog |  8 +++-
> >>>  3 files changed, 17 insertions(+), 7 deletions(-)
> >>>
> >>
> >> Kind reminder.
> >>
> >> These patches and patches for older branches still needs review.
> >> If anyone has a spare time slot, please take a look.
> >>
> > Thanks for working on this, this is a lot of work!
> >
> > I've reviewed all and acked.
>
> Thanks!
>
> > We also need to create a tarball for each release and update website.
> > https://www.openvswitch.org/download/
> > I can help if you haven't done it yet.
>
> I'm going to apply and backport following small patch first:
> https://patchwork.ozlabs.org/project/openvswitch/patch/20200804015456.4047-1-hepeng.0...@bytedance.com/
>
> Right after that I'll push release patches and tags.
> I also have a script to generate tarballs, so I'll prepare them and the
> website pull request once releases tagged.
>
> Maybe we need to return to the question about just having a link to
> tarballs generated by github at some point?

Yes, I think that would be easier.

>
> >
> >> I have run unit and system tests on all release branches (that wasn't
> >> simple, but more on this later).  2.13.1 also passed through ovn-k8s
> >> and OpenStack CI systems.
> >>
> >
> > Yes, I believe there are lots of issues when running system tests.
> > It would be great to have some public CI running system tests.
> > Previously I was using Github Actions, I'm curious about how you do it.
>
> For now, I just prepared a small script and a VM image to build and test.
> And actually system tests are not that bad, at least kernel ones.  Some
> tests are not skipped while should be on older branches, but it's not
> a big deal.  For now, I had to look at all the failures manually.
> Userspace testsuite is not that good.  Basically, we cleaned it up
> somewhere between 2.10-2.12.  So, on older branches it's a mess.
>
> Harder case is that older branches requires python2 to build and also
> requires /usr/bin/python to point to python2.  This is an issue if you're
> trying to build OVS on some modern distros that doesn't have python2, or
> have it but uses python3 by default or doesn't have simple 'python'
> symlink at all.  I had to switch from fedora to rhel just to build.
> The second build issue is that some older branches doesn't build with
> openssl >= 1.1.0.  This is tricky.  We will need to decide what to do
> with that later.  I will send a follow-up email about older branches
> and what options we have.
>
Thanks for sharing the experience.
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.13 0/2] Release patches for v2.13.1.

2020-08-04 Thread William Tu
On Mon, Aug 3, 2020 at 2:23 PM Ilya Maximets  wrote:
>
> On 7/30/20 12:44 AM, Ilya Maximets wrote:
> >
> > Ilya Maximets (2):
> >   Set release date for 2.13.1.
> >   Prepare for 2.13.2.
> >
> >  NEWS | 14 +-
> >  configure.ac |  2 +-
> >  debian/changelog |  8 +++-
> >  3 files changed, 17 insertions(+), 7 deletions(-)
> >
>
> Kind reminder.
>
> These patches and patches for older branches still needs review.
> If anyone has a spare time slot, please take a look.
>
Thanks for working on this, this is a lot of work!

I've reviewed all and acked.
We also need to create a tarball for each release and update website.
https://www.openvswitch.org/download/
I can help if you haven't done it yet.

> I have run unit and system tests on all release branches (that wasn't
> simple, but more on this later).  2.13.1 also passed through ovn-k8s
> and OpenStack CI systems.
>

Yes, I believe there are lots of issues when running system tests.
It would be great to have some public CI running system tests.
Previously I was using Github Actions, I'm curious about how you do it.

Thanks,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.5 2/2] Prepare for 2.5.11.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:42 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---
LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.5 1/2] Set release date for 2.5.10.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:42 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---

LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.6 2/2] Prepare for 2.6.9.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:43 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---
LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.6 1/2] Set release date for 2.6.8.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:43 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---

LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.7 2/2] Prepare for 2.7.12.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:43 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---

LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.7 1/2] Set release date for 2.7.11.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:43 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---
LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.8 2/2] Prepare for 2.8.10.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:44 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---
LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.8 1/2] Set release date for 2.8.9.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:43 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---
LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.9 2/2] Prepare for 2.9.8.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:44 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---
LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.9 1/2] Set release date for 2.9.7.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:43 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---
LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.10 2/2] Prepare for 2.10.6.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:44 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---

LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.10 1/2] Set release date for 2.10.5.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:45 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 

LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.11 2/2] Prepare for 2.11.5.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:45 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---

LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.11 1/2] Set release date for 2.11.4.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:45 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---
LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH branch-2.12 2/2] Prepare for 2.12.2.

2020-08-04 Thread William Tu
On Wed, Jul 29, 2020 at 3:45 PM Ilya Maximets  wrote:
>
> Signed-off-by: Ilya Maximets 
> ---

LGTM,
Acked-by: William Tu 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


  1   2   3   4   5   6   7   8   9   10   >