Re: [ovs-dev] [PATCH v2] netdev:clear out vlan flow fields while processing native tunnel

2022-04-07 Thread Thilak Raj Surendra Babu
Hi Ilya,
I addressed most of the review comments in my v3 patch.
I am not sure what I need to do for the fixes tag as I am not sure when this 
bug was introduced.
I noticed this in ovs 2.14.

Thanks
Thilak Raj S

-Original Message-
From: Thilak Raj Surendra Babu 
Sent: 06 April 2022 10:11
To: Rosemarie O'Riorden ; Ilya Maximets 

Cc: ovs-dev@openvswitch.org; Rosemarie O'Riorden 
Subject: RE: [ovs-dev] [PATCH v2] netdev:clear out vlan flow fields while 
processing native tunnel

Thank you, Rosemarie, for your patch, and thank you Ilya for reviewing my patch.
I will re-work this and provide an updated patch.

Thanks
Thilak Raj S

-Original Message-
From: Rosemarie O'Riorden 
Sent: 06 April 2022 07:30
To: Ilya Maximets 
Cc: Thilak Raj Surendra Babu ; 
ovs-dev@openvswitch.org; Rosemarie O'Riorden 
Subject: Re: [ovs-dev] [PATCH v2] netdev:clear out vlan flow fields while 
processing native tunnel

Hello,

Here is my unit test:

diff --git a/tests/tunnel-push-pop.at b/tests/tunnel-push-pop.at index 
57589758f..bfb0f440b 100644
--- a/tests/tunnel-push-pop.at
+++ b/tests/tunnel-push-pop.at
@@ -842,3 +842,52 @@ Datapath actions: 7

 OVS_VSWITCHD_STOP
 AT_CLEANUP
+
+AT_SETUP([tunnel_push_pop - VXLAN access port])
+
+dnl Create bridge that has a MAC address OVS_VSWITCHD_START([set bridge
+br0 dnl
+datapath_type=dummy -- set Interface dnl
+br0 other-config:hwaddr=aa:55:aa:55:00:00])
+AT_CHECK([ovs-vsctl add-port br0 p8 -- set Interface p8 dnl
+   type=dummy ofport_request=8])
+
+dnl Create another bridge
+AT_CHECK([ovs-vsctl add-br ovs-tun0 -- set bridge ovs-tun0
datapath_type=dummy])
+
+dnl Add VXLAN port to this bridge
+AT_CHECK([ovs-vsctl add-port ovs-tun0 tun0 -- set int tun0 dnl
+  type=vxlan options:remote_ip=10.0.0.11 -- add-port ovs-tun0 p7 dnl
+  -- set interface p7 type=dummy ofport_request=7])
+
+dnl Set VLAN tags, so that br0 and its port p8 have the same tag, dnl 
+but ovs-tun0's port p7 has a different tag AT_CHECK([ovs-vsctl set port
+p8 tag=42 -- set port br0 tag=42 dnl
+  -- set port p7 tag=200])
+
+dnl Set IP address and route for br0
+AT_CHECK([ovs-appctl netdev-dummy/ip4addr br0 10.0.0.2/24], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 10.0.0.11/24 br0], [0], [OK
+])
+
+dnl Send an ARP reply to port b8 on br0, so that packets will be 
+forwarded dnl to learned port AT_CHECK([ovs-ofctl add-flow br0
+action=normal]) AT_CHECK([ovs-appctl netdev-dummy/receive p8 
+'in_port(8),dnl
+   
+eth(src=aa:55:aa:66:00:00,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),dnl
+   
+arp(sip=10.0.0.11,tip=10.0.0.2,op=2,sha=aa:55:aa:66:00:00,tha=00:00:00:
+00:00:00)'])
+
+AT_CHECK([ovs-appctl ofproto/trace ovs-tun0 in_port=p7], [0], [stdout]) 
+AT_CHECK([tail -2 stdout], [0], [dnl
+Megaflow: recirc_id=0,eth,in_port=7,dl_src=00:00:00:00:00:00,dnl
+dl_dst=00:00:00:00:00:00,dl_type=0x
+Datapath actions: 
+push_vlan(vid=200,pcp=0),1,clone(tnl_push(tnl_port(4789),dnl
+header(size=50,type=4,eth(dst=aa:55:aa:66:00:00,src=aa:55:aa:55:00:00,d
+nl
+dl_type=0x0800),ipv4(src=10.0.0.2,dst=10.0.0.11,proto=17,tos=0,ttl=64,d
+nl
+frag=0x4000),udp(src=0,dst=4789,csum=0x0),vxlan(flags=0x800,vni=0x0
+)),dnl
+out_port(100)),8)
+])
+
+OVS_VSWITCHD_STOP
+AT_CLEANUP

On Tue, Apr 5, 2022 at 2:35 PM Ilya Maximets  wrote:
>
> Hi.  Thanks for the patch!
>
> One general comment for the subject line:
> 'netdev' doesn't seem to be a correct 'area' for the change.
> It should be 'ofproto-dpif-xlate' instead.  And, please, add the space 
> after the ':'.  'summary' should preferably start with a capital 
> letter and end with a period.
>
> More details on how to format the patch here:
>   Documentation/internals/contributing/submitting-patches.rst
>
> See some more comments inline.
>
> On 4/5/22 01:01, Thilak Raj Surendra Babu wrote:
> > when a packet is received over an access port that needs to be sent 
> > over a vxlan tunnel, the access port VLAN id is used in the lookup 
> > leading to a wrong packet being crafted and sent over the 
> > tunnel.Clear out the flow 's VLAN field as it should not be used 
> > while performing mac lookup for the outer tunnel and also at this point the 
> > VLAN action related to inner flow is already committed.
>
> Please, limit the line length for a commit message.
> 72 characters is the most common limit for it.
>
> >
>
> Please, add a 'Fixes' tag.
>
> > Signed-off-by: Thilak Raj Surendra Babu 
> > ---
>
> While sending new versions of a patch, please, add here a brief 
> explanation of what changed between versions.
> E.g.:
>
> Version 2:
>   - Fixed line length warning from checkpatch.
>   - Dropped unrelated changes.
>
> >  ofproto/ofproto-dpif-xlate.c |  3 ++
> >  tests/system-traffic.at  | 54 
>
> I see you added a system test.  That is OK, good to have one.
> However, the issue doesn't require any actual network to be 
> reproduced.  So, we should have a unit test for it.
>
> I 

Re: [ovs-dev] [PATCH] flow: Consistent VXLAN UDP src ports for fragmented packets

2022-04-07 Thread Peng He
got it. thanks!



Hemanth Aramadaka  于2022年4月8日周五 00:36写道:

> Hi,
>
>
>
> Here we are not fragmenting VXLAN packets in the outer IP header.The
> packets originating from VM with the larger size than the configured MTU
>
> will get fragmented and these inner packets are encapsulated with vxlan
> header in which we have the different source ports for UDP.
>
>
>
> Thanks,
>
> Hemanth.
>
>
>
> *From:* Peng He 
> *Sent:* 16 March 2022 11:30
> *To:* Hemanth Aramadaka 
> *Cc:* Sriharsha Basavapatna via dev ; Ilya
> Maximets 
> *Subject:* Re: [ovs-dev] [PATCH] flow: Consistent VXLAN UDP src ports for
> fragmented packets
>
>
>
> Normally, VXLAN packets will be set to DF( don't frag) in the outter IP
> header.
>
> I am trying to understand why fragmentation happens in the first place.
>
>
>
> Hemanth Aramadaka via dev  于2022年3月15日周二 22:38写道:
>
> Issue:
>
> The src-port for UDP is based on RSS hash in the packet metadata.
> In case of packets coming from VM it will be 5-tuple, if available,
> otherwise just IP addresses.If the VM fragments a large IP packet
> and sends the fragments to ovs, only the first fragment will contain
> the L4 header. Therefore, the first fragment and subsequent fragments
> get different UDP src ports in the outgoing VXLAN header.This can
> lead to fragment re-ordering in the fabric as packet will take
> different paths.
>
> Fix:
>
> Intention of this is to avoid fragment packets taking different paths.
> For example, due to presence of firewalls, fragment packets will take
> different paths and will get dropped.To avoid this we ignore the L4
> header during hash calculation only in the case of fragmented packets.
>
> P.S: There is already a review request raised for the same.
> https://mail.openvswitch.org/pipermail/ovs-dev/2021-January/379395.html.
> Re-iterating the same patch request on the latest master code.
>
> Signed-off-by: Hemanth Aramadaka 
> ---
>  lib/flow.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/lib/flow.c b/lib/flow.c
> index dd523c889..17bd47724 100644
> --- a/lib/flow.c
> +++ b/lib/flow.c
> @@ -2238,7 +2238,7 @@ miniflow_hash_5tuple(const struct miniflow *flow,
> uint32_t basis)
>
>  if (flow) {
>  ovs_be16 dl_type = MINIFLOW_GET_BE16(flow, dl_type);
> -uint8_t nw_proto;
> +uint8_t nw_proto,nw_frag;
>
>  if (dl_type == htons(ETH_TYPE_IPV6)) {
>  struct flowmap map = FLOWMAP_EMPTY_INITIALIZER;
> @@ -2260,6 +2260,14 @@ miniflow_hash_5tuple(const struct miniflow *flow,
> uint32_t basis)
>
>  nw_proto = MINIFLOW_GET_U8(flow, nw_proto);
>  hash = hash_add(hash, nw_proto);
> +/* Skip l4 header fields if IP packet is fragmented since
> + * only first fragment will carry l4 header.
> + */
> +nw_frag = MINIFLOW_GET_U8(flow, nw_frag);
> +if ((nw_frag)) {
> +goto out;
> +}
> +
>  if (nw_proto != IPPROTO_TCP && nw_proto != IPPROTO_UDP
>  && nw_proto != IPPROTO_SCTP && nw_proto != IPPROTO_ICMP
>  && nw_proto != IPPROTO_ICMPV6) {
> --
> 2.25.1
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
>
>
>
> --
>
> hepeng
>


-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 2/2] ofctrl: Support ovn-ofctrl-wait-before-clear to avoid down time during upgrade.

2022-04-07 Thread Han Zhou
I have some comment myself regarding the documentation:

On Wed, Apr 6, 2022 at 4:36 PM Han Zhou  wrote:
>
> Whenever OpenFlow connection between ovn-controller and OVS is
> connected/reconnected, typically during ovn-controller/OVS
> restart/upgrade, ovn-controller would:
> 1. clears all the existing flows after the initial hand-shaking
> 2. compute the new flows
> 3. install the new flows to OVS.
>
> In large scale environments when there are a big number of flows
> needs to be computed by ovn-controller, the step 2 above can take
> very long time (from seconds to minutes, depending on the scale and
> topology), which would cause a data plane down time.

I should make it more clear that both step 2 and 3 can take a very long
time. This patch reduces the down time of step 2 only. And I should change
the word "avoid" in the patch title to "reduce".
(To avoid the down time introduced by step 3, I will work on another patch
to address it separately.)

>
> The purpose of this patch is to avoid data plane down time during
> restart/upgrade.

I should make it clear that this patch helps for ovn-controller
restart/upgrade, but not for OVS restart/upgrade, because as soon as OVS
restarts, the OVS flows are gone, and postponing the flow clearing won't
help for this.

> It adds a new state S_WAIT_BEFORE_CLEAR to the ofctrl
> state machine, so that ovn-controller waits for a period of time before
> clearing the existing OVS flows while it is computing the new flows.
> Ideally, ovn-controller should clear the flows after it completes the
> new flows computing. However, it is difficult for ovn-controller to
> judge if it has generated all the new flows (or the flows that are
> sufficient to avoid down time), because flows computation depends on
> iterations of SB monitor data processing and condition changes. So,
> instead of try to determine by ovn-controller itself, this patch
> provides a configuration "ovn-ofctrl-wait-before-clear" for users to
> determine the feasible time to wait, according to the scale. As
> mentioned in the document, the output of
>
>   ovn-appctl -t ovn-controller stopwatch/show flow-generation
>
> can help users determining what value to set.

This command alone may be insufficient. It would be better to trigger a
recompute before this command:
ovn-appctl -t ovn-controller inc-engine/recompute

I will update the document in v2. I think the rest of the patch is still
valid for review. Please review and comment.

Thanks,
Han

>
> Signed-off-by: Han Zhou 
> ---
>  controller/ofctrl.c | 87 -
>  controller/ofctrl.h |  4 +-
>  controller/ovn-controller.8.xml | 27 ++
>  controller/ovn-controller.c |  6 +--
>  controller/pinctrl.c|  2 +
>  tests/ovn-controller.at | 52 
>  6 files changed, 161 insertions(+), 17 deletions(-)
>
> diff --git a/controller/ofctrl.c b/controller/ofctrl.c
> index a67dab024..8239c27f6 100644
> --- a/controller/ofctrl.c
> +++ b/controller/ofctrl.c
> @@ -47,6 +47,7 @@
>  #include "physical.h"
>  #include "openvswitch/rconn.h"
>  #include "socket-util.h"
> +#include "timeval.h"
>  #include "util.h"
>  #include "vswitch-idl.h"
>
> @@ -297,6 +298,7 @@ static unsigned int seqno;
>  STATE(S_NEW)\
>  STATE(S_TLV_TABLE_REQUESTED)\
>  STATE(S_TLV_TABLE_MOD_SENT) \
> +STATE(S_WAIT_BEFORE_CLEAR)  \
>  STATE(S_CLEAR_FLOWS)\
>  STATE(S_UPDATE_FLOWS)
>  enum ofctrl_state {
> @@ -339,6 +341,14 @@ static uint64_t cur_cfg;
>  /* Current state. */
>  static enum ofctrl_state state;
>
> +/* The time (ms) to stay in the state S_WAIT_BEFORE_CLEAR. Read from
> + * external_ids: ovn-ofctrl-wait-before-clear. */
> +static unsigned int wait_before_clear_time = 0;
> +
> +/* The time when the state S_WAIT_BEFORE_CLEAR should complete.
> + * If the timer is not started yet, it is set to 0. */
> +static long long int wait_before_clear_expire = 0;
> +
>  /* Transaction IDs for messages in flight to the switch. */
>  static ovs_be32 xid, xid2;
>
> @@ -444,18 +454,19 @@ recv_S_NEW(const struct ofp_header *oh OVS_UNUSED,
>   * If we receive an NXT_TLV_TABLE_REPLY:
>   *
>   * - If it contains our tunnel metadata option, assign its field ID
to
> - *   mff_ovn_geneve and transition to S_CLEAR_FLOWS.
> + *   mff_ovn_geneve and transition to S_WAIT_BEFORE_CLEAR.
>   *
>   * - Otherwise, if there is an unused tunnel metadata field ID, send
>   *   NXT_TLV_TABLE_MOD and OFPT_BARRIER_REQUEST, and transition to
>   *   S_TLV_TABLE_MOD_SENT.
>   *
>   * - Otherwise, log an error, disable Geneve, and transition to
> - *   S_CLEAR_FLOWS.
> + *   S_WAIT_BEFORE_CLEAR.
>   *
>   * If we receive an OFPT_ERROR:
>   *
> - * - Log an error, disable Geneve, and transition to S_CLEAR_FLOWS.
*/
> + * - Log an error, disable Geneve, and transition to

Re: [ovs-dev] [PATCH ovn v2] northd: Use separate SNAT for already-DNATted traffic.

2022-04-07 Thread Numan Siddique
On Fri, Apr 1, 2022 at 1:55 PM Mark Michelson  wrote:
>
> Commit 4deac4509abbedd6ffaecf27eed01ddefccea40a introduced functionality
> in ovn-northd to use a single zone for SNAT and DNAT when possible. It
> accounts for certain situations, such as hairpinned traffic, where we
> still need a separate SNAT and DNAT zone.
>
> However, one situation it did not account for was when traffic traverses
> a logical router and is DNATted as a result of a load balancer, then
> when the traffic egresses the router, it needs to be SNATted. In this
> situation, we try to use the same CT zone for the SNAT as for the load
> balancer DNAT, which does not work.
>
> This commit fixes the issue by setting the DNAT_LOCAL bit in the initial
> stage of the egress pipeline if the packet was dnatted during the
> ingress pipeline. This ensures that when the SNAT stage is reached, a
> separate CT zone is used for SNAT.
>
> Signed-off-by: Mark Michelson 

Hi Mark,


Thanks for addressing the comments in v2.  The patch LGTM.
However the system tests added in this patch are failing - both
locally and also in CI -
https://github.com/ovsrobot/ovn/runs/5792688717?check_suite_focus=true

Also there are a couple of checkpatch line length warnings.

Numan

> ---
> v1 -> v2
> * Switched from setting flows in SNAT stage to using CHK_DNAT_LOCAL
>   stage, as suggested by Numan. Updated the commit message to indicate
>   this.
> * Removed ovn-northd test in favor of moving the checks to the existing
>   NAT configuration check test. This also resulted in fixing these
>   tests, since they were failing with my first patch applied.
> * Added new flow documentation to ovn-northd.8.xml
> * Added notes to both northd.c and ovn-northd.8.xml that this relies on
>   the ct_label.natted bit to carry its status from the ingress to egress
>   pipelines.
> ---
>  northd/northd.c |  17 ++
>  northd/ovn-northd.8.xml |  16 ++
>  tests/ovn-northd.at |   7 ++-
>  tests/system-ovn.at | 117 
>  4 files changed, 156 insertions(+), 1 deletion(-)
>
> diff --git a/northd/northd.c b/northd/northd.c
> index b264fb850..637cb8534 100644
> --- a/northd/northd.c
> +++ b/northd/northd.c
> @@ -13388,6 +13388,23 @@ build_lrouter_nat_defrag_and_lb(struct ovn_datapath 
> *od, struct hmap *lflows,
>  }
>  }
>
> +if (od->nbr->n_nat) {
> +ds_clear(match);
> +ds_put_cstr(match, "ip && ct_label.natted == 1");
> +/* This flow is unique since it is in the egress pipeline but checks 
> the
> + * value of ct_label.natted, which would have been set in the ingress
> + * pipeline. If a change is ever introduced that clears or otherwise
> + * invalidates the ct_label between the ingress and egress 
> pipelines, then
> + * an alternative will need to be devised.
> + */
> +ds_clear(actions);
> +ds_put_cstr(actions, REGBIT_DST_NAT_IP_LOCAL" = 1; next;");
> +ovn_lflow_add_with_hint(lflows, od, S_ROUTER_OUT_CHECK_DNAT_LOCAL,
> +50, ds_cstr(match), ds_cstr(actions),
> +>nbr->header_);
> +
> +}
> +
>  /* Handle force SNAT options set in the gateway router. */
>  if (od->is_gw_router) {
>  if (dnat_force_snat_ip) {
> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> index 4784bff04..06ef74dcc 100644
> --- a/northd/ovn-northd.8.xml
> +++ b/northd/ovn-northd.8.xml
> @@ -4310,6 +4310,22 @@ nd_ns {
>
>  
>
> +
> +  This table also installs a priority-50 logical flow for each logical
> +  router that has NATs configured on it. The flow has match
> +  ip  ct_label.natted == 1 and action
> +  REGBIT_DST_NAT_IP_LOCAL = 1; next;. This is intended
> +  to ensure that traffic that was DNATted locally will use a separate
> +  conntrack zone for SNAT if SNAT is required later in the egress
> +  pipeline. Note that this flow checks the value of
> +  ct_label.natted, which is set in the ingress pipeline.
> +  This means that ovn-northd assumes that this value is carried over
> +  from the ingress pipeline to the egress pipeline and is not altered
> +  or cleared. If conntrack label values are ever changed to be cleared
> +  between the ingress and egress pipelines, then the match conditions
> +  of this flow will be updated accordingly.
> +
> +
>  Egress Table 1: UNDNAT
>
>  
> diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
> index 60d91a771..3b3866fa2 100644
> --- a/tests/ovn-northd.at
> +++ b/tests/ovn-northd.at
> @@ -4983,6 +4983,7 @@ AT_CHECK([grep "lr_in_dnat" lr0flows | sort], [0], [dnl
>
>  AT_CHECK([grep "lr_out_chk_dnat_local" lr0flows | sed 's/table=./table=?/' | 
> sort], [0], [dnl
>table=? (lr_out_chk_dnat_local), priority=0, match=(1), 
> action=(reg9[[4]] = 0; next;)
> +  table=? (lr_out_chk_dnat_local), priority=50   , match=(ip && 
> 

Re: [ovs-dev] [PATCH ovn v2] controller/pinctrl: avoid accessing invalid memory

2022-04-07 Thread Numan Siddique
On Tue, Apr 5, 2022 at 10:21 AM Mark Michelson  wrote:
>
> Thanks for the update Mohammad. I like this version much better.
>
> Acked-by: Mark Michelson 

Thanks for the fix.

I applied this patch to the main and backported upto branch-21.09 with
the below changes as it would reduce the code.

---
diff --git a/controller/binding.c b/controller/binding.c
index 36884b1eae..1259e6b3bd 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -483,16 +483,12 @@ remove_related_lport(const struct sbrec_port_binding *pb,

 static void
 delete_active_pb_ras_pd(const struct sbrec_port_binding *pb,
-struct shash *map)
+struct shash *ras_pd_map)
 {
-struct shash_node *iter = shash_find(map, pb->logical_port);
+struct pb_ld_binding *ras_pd =
+shash_find_and_delete(ras_pd_map, pb->logical_port);

-if (iter) {
-struct pb_ld_binding *ras_pd = iter->data;
-shash_delete(map, iter);
-free(ras_pd);
-return;
-}
+free(ras_pd);
 }

 static void
---


Numan

>
> On 4/3/22 07:26, Mohammad Heib wrote:
> > currently pinctrl uses some hash tables that were supplied
> > by ovn-controller to prepare and send IPv6 RAs, those
> > hash tables are not updated properly when a port_bindings record
> > deleted and in this case, some port_bindings records remain in the
> > hash table where they should be removed.
> >
> > This patch handles such changes and update those lists
> > to avoid invalid memory access.
> >
> > Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2052945
> > Signed-off-by: Mohammad Heib 
> > ---
> >   controller/binding.c | 17 +
> >   1 file changed, 17 insertions(+)
> >
> > diff --git a/controller/binding.c b/controller/binding.c
> > index 4d62b0858..36884b1ea 100644
> > --- a/controller/binding.c
> > +++ b/controller/binding.c
> > @@ -481,6 +481,20 @@ remove_related_lport(const struct sbrec_port_binding 
> > *pb,
> >   }
> >   }
> >
> > +static void
> > +delete_active_pb_ras_pd(const struct sbrec_port_binding *pb,
> > +struct shash *map)
> > +{
> > +struct shash_node *iter = shash_find(map, pb->logical_port);
> > +
> > +if (iter) {
> > +struct pb_ld_binding *ras_pd = iter->data;
> > +shash_delete(map, iter);
> > +free(ras_pd);
> > +return;
> > +}
> > +}
> > +
> >   static void
> >   update_active_pb_ras_pd(const struct sbrec_port_binding *pb,
> >   struct hmap *local_datapaths,
> > @@ -2251,6 +2265,9 @@ binding_handle_port_binding_changes(struct 
> > binding_ctx_in *b_ctx_in,
> >   continue;
> >   }
> >
> > +delete_active_pb_ras_pd(pb, b_ctx_out->local_active_ports_ipv6_pd);
> > +delete_active_pb_ras_pd(pb, b_ctx_out->local_active_ports_ras);
> > +
> >   enum en_lport_type lport_type = get_lport_type(pb);
> >
> >   struct binding_lport *b_lport =
> >
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v3] ofproto-dpif-xlate: Clear out vlan flow fields while processing native tunnel.

2022-04-07 Thread Thilak Raj Surendra Babu
When a packet is received over an access port that needs to be sent
over a vxlan tunnel,the access port VLAN id is used in the lookup
leading to a wrong packet being crafted and sent over the tunnel.
Clear out the flow 's VLAN field as it should not be used while
performing mac lookup for the outer tunnel and also at this point
the VLAN action related to inner flow is already committed.

Version 3:
  - Address review comments.
  - Add unit-test patch from Rosemarie O'Riorden.

Version 2:
  - Fixed line length warning from checkpatch.
  - Dropped unrelated changes.

Signed-off-by: Thilak Raj Surendra Babu 
Signed-off-by: Rosemarie O'Riorden 
Co-authored-by: Rosemarie O'Riorden 
---
 ofproto/ofproto-dpif-xlate.c |  3 ++
 tests/system-traffic.at  | 61 
 tests/tunnel-push-pop.at | 52 ++
 3 files changed, 116 insertions(+)

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 5a770f171..7ce61b176 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -3542,6 +3542,9 @@ propagate_tunnel_data_to_flow__(struct flow *dst_flow,
 dst_flow->dl_dst = dmac;
 dst_flow->dl_src = smac;
 
+/* Clear VLAN entries which do not apply for tunnel flows. */
+memset(dst_flow->vlans, 0, sizeof(dst_flow->vlans));
+
 dst_flow->packet_type = htonl(PT_ETH);
 dst_flow->nw_dst = src_flow->tunnel.ip_dst;
 dst_flow->nw_src = src_flow->tunnel.ip_src;
diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index 4a7fa49fc..f1a146b97 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -259,6 +259,67 @@ NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 
10.1.1.100 | FORMAT_PI
 OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
 
+AT_SETUP([datapath - ping vlan over vxlan tunnel])
+OVS_CHECK_TUNNEL_TSO()
+OVS_CHECK_VXLAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-vsctl -- add-port br0 patch0 -- set interface patch0 type=patch 
dnl
+options:peer=patch1 -- add-port br-underlay patch1 -- set interface patch1 dnl
+type=patch options:peer=patch0])
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "table=0,priority=10,actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Add a rule to forward arp packets towards tunnels remote endpoint
+AT_CHECK([ovs-ofctl add-flow br-underlay "in_port=br-underlay,priority=12,dnl
+dl_type=0x806,nw_dst=172.31.1.1,actions=output:ovs-p0"])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_NATIVE_TUNNEL([vxlan], [at_vxlan1], [at_ns0], [172.31.1.100], 
[10.1.1.1/24],
+  [id 0 dstport 4789])
+NS_EXEC([at_ns0], [ip link add link at_vxlan1 name at_vxlan1.100 type vlan id 
dnl
+100])
+NS_EXEC([at_ns0], [ip addr flush dev at_vxlan1])
+NS_EXEC([at_ns0], [ip addr add dev at_vxlan1.100 "10.1.1.30/24"])
+NS_EXEC([at_ns0], [ip link set dev at_vxlan1.100 up])
+
+dnl add OVS tunnel
+ADD_OVS_TUNNEL([vxlan], [br-underlay], [at_vxlan0], [172.31.1.1], 
[10.1.1.100/24])
+
+AT_CHECK([ovs-ofctl add-flow br-underlay "priority=12,dl_type=0x806,dnl
+nw_dst=10.1.1.30,actions=output:at_vxlan0"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "priority=12,dl_type=0x800,dnl
+nw_dst=10.1.1.30,actions=output:at_vxlan0"])
+AT_CHECK([ovs-ofctl mod-port br-underlay at_vxlan0 no-flood])
+
+ADD_NAMESPACES(at_ns1)
+ADD_VETH(p1, at_ns1, br0, "10.1.1.10/24")
+
+AT_CHECK([ovs-vsctl set port ovs-p1 tag=100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], 
[0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Check if overlay network is reachable
+NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.30 | FORMAT_PING], 
[0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
 AT_SETUP([datapath - ping over vxlan6 tunnel])
 OVS_CHECK_TUNNEL_TSO()
 OVS_CHECK_VXLAN_UDP6ZEROCSUM()
diff --git a/tests/tunnel-push-pop.at b/tests/tunnel-push-pop.at
index 57589758f..b1bc328c3 100644
--- a/tests/tunnel-push-pop.at
+++ b/tests/tunnel-push-pop.at
@@ -842,3 +842,55 @@ Datapath actions: 7
 
 OVS_VSWITCHD_STOP
 AT_CLEANUP
+
+AT_SETUP([tunnel_push_pop - VXLAN access port])
+
+dnl Create bridge that has a MAC address
+OVS_VSWITCHD_START([set bridge br0 dnl
+datapath_type=dummy -- set Interface dnl
+br0 other-config:hwaddr=aa:55:aa:55:00:00])
+AT_CHECK([ovs-vsctl add-port br0 p8 -- set Interface p8 dnl
+   type=dummy ofport_request=8])
+
+dnl Create another bridge
+AT_CHECK([ovs-vsctl add-br ovs-tun0 -- set bridge ovs-tun0 

Re: [ovs-dev] [PATCH ovn] northd: Fix check for NAT on LR with multiple gw ports.

2022-04-07 Thread Numan Siddique
On Wed, Mar 23, 2022 at 11:35 AM Dumitru Ceara  wrote:
>
> Add check to see if the LR actually has NAT entries configured.
> Otherwise we spam the log unnecessarily.
>
> Fixes: b8194738c99e ("northd: Properly warn for NAT on LR with multiple gw 
> ports.")
> Signed-off-by: Dumitru Ceara 

Hi Dumitru,

I suppose we can drop this patch now since a recent commit added the
support for NAT on LR with multiple gw ports.

Numan

> ---
>  northd/northd.c | 2 +-
>  tests/ovn-northd.at | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/northd/northd.c b/northd/northd.c
> index a2cf8d6fc7..87d9b2867a 100644
> --- a/northd/northd.c
> +++ b/northd/northd.c
> @@ -13242,7 +13242,7 @@ build_lrouter_nat_defrag_and_lb(struct ovn_datapath 
> *od, struct hmap *lflows,
>
>  /* NAT rules are not currently supported on logical routers with multiple
>   * distributed gateway ports. */
> -if (od->n_l3dgw_ports > 1) {
> +if (od->nbr->n_nat && od->n_l3dgw_ports > 1) {
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>  VLOG_WARN_RL(, "NAT is configured on logical router %s, which has 
> %"
>   PRIuSIZE" distributed gateway ports. NAT is not 
> supported"
> diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
> index 17d4f31b39..4883360c97 100644
> --- a/tests/ovn-northd.at
> +++ b/tests/ovn-northd.at
> @@ -5807,8 +5807,8 @@ AT_CHECK([grep lr_in_gw_redirect lrflows | grep cr-DR | 
> sed 's/table=../table=??
>  # Check that ovn-northd logs a warning when trying to configure NAT
>  # on the router with multiple distributed gw ports.  Such configurations are
>  # not supported yet.
> -check ovn-nbctl lr-nat-add DR dnat_and_snat 42.42.42.1 20.0.0.2
> -AT_CHECK([grep -q 'NAT is configured on logical router DR, which has 2 
> distributed gateway ports. NAT is not supported yet when there is more than 
> one distributed gateway port on the router.' northd/ovn-northd.log], [0])
> +check ovn-nbctl --wait=sb lr-nat-add DR dnat_and_snat 42.42.42.1 20.0.0.2
> +AT_CHECK([grep -q 'NAT is configured on logical router DR, which has 3 
> distributed gateway ports. NAT is not supported yet when there is more than 
> one distributed gateway port on the router.' northd/ovn-northd.log], [0])
>
>  AT_CLEANUP
>  ])
> --
> 2.27.0
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next v2] net: openvswitch: fix uAPI incompatibility with existing user space

2022-04-07 Thread Vlad Buslov via dev
On Mon 14 Mar 2022 at 20:40, Ilya Maximets  wrote:
> On 3/14/22 19:33, Roi Dayan wrote:
>> 
>> 
>> On 2022-03-10 8:44 PM, Aaron Conole wrote:
>>> Ilya Maximets  writes:
>>>
 Few years ago OVS user space made a strange choice in the commit [1]
 to define types only valid for the user space inside the copy of a
 kernel uAPI header.  '#ifndef __KERNEL__' and another attribute was
 added later.

 This leads to the inevitable clash between user space and kernel types
 when the kernel uAPI is extended.  The issue was unveiled with the
 addition of a new type for IPv6 extension header in kernel uAPI.

 When kernel provides the OVS_KEY_ATTR_IPV6_EXTHDRS attribute to the
 older user space application, application tries to parse it as
 OVS_KEY_ATTR_PACKET_TYPE and discards the whole netlink message as
 malformed.  Since OVS_KEY_ATTR_IPV6_EXTHDRS is supplied along with
 every IPv6 packet that goes to the user space, IPv6 support is fully
 broken.

 Fixing that by bringing these user space attributes to the kernel
 uAPI to avoid the clash.  Strictly speaking this is not the problem
 of the kernel uAPI, but changing it is the only way to avoid breakage
 of the older user space applications at this point.

 These 2 types are explicitly rejected now since they should not be
 passed to the kernel.  Additionally, OVS_KEY_ATTR_TUNNEL_INFO moved
 out from the '#ifdef __KERNEL__' as there is no good reason to hide
 it from the userspace.  And it's also explicitly rejected now, because
 it's for in-kernel use only.

 Comments with warnings were added to avoid the problem coming back.

 (1 << type) converted to (1ULL << type) to avoid integer overflow on
 OVS_KEY_ATTR_IPV6_EXTHDRS, since it equals 32 now.

   [1] beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")

 Fixes: 28a3f0601727 ("net: openvswitch: IPv6: Add IPv6 extension header 
 support")
 Link: 
 https://lore.kernel.org/netdev/3adf00c7-fe65-3ef4-b6d7-6d8a0cad8...@nvidia.com
 Link: 
 https://github.com/openvswitch/ovs/commit/beb75a40fdc295bfd6521b0068b4cd12f6de507c
 Reported-by: Roi Dayan 
 Signed-off-by: Ilya Maximets 
 ---
>>>
>>> Acked-by: Aaron Conole 
>>>
>> 
>> 
>> 
>> I got to check traffic with the fix and I do get some traffic
>> but something is broken. I didn't investigate much but the quick
>> test shows me rules are not offloaded and dumping ovs rules gives
>> error like this
>> 
>> recirc_id(0),in_port(enp8s0f0_1),ct_state(-trk),eth(),eth_type(0x86dd),ipv6(frag=no)(bad
>> key length 2, expected -1)(00 00/(bad mask length 2, expected -1)(00 00),
>> packets:2453, bytes:211594, used:0.004s, flags:S., actions:ct,recirc(0x2)
>
> Such a dump is expected, because kernel parses fields that current
> userspace doesn't understand, and at the same time OVS by design is
> using kernel provided key/mask while installing datapath rules, IIRC.
> It should be possible to make these dumps a bit more friendly though.
>
> For the offloading not working, see my comment in the v2 patch email
> I sent (top email of this thread).  In short, it's a problem in user
> space and it can not be fixed from the kernel side, unless we revert
> IPv6 extension header support and never add any new types, which is
> unreasonable.  I didn't test any actual offloading, but I had a
> successful run of 'make check-offloads' with my quick'n'dirty fix from
> the top email.

Hi Ilya,

I can confirm that with latest OvS master IPv6 rules offload still fails
without your pastebin code applied.

>
> Since we're here:
>
> Toms, do you plan to submit user space patches for this feature?

I see there is a patch from you that is supposed to fix compatibility
issues caused by this change in OvS d96d14b14733 ("openvswitch.h: Align
uAPI definition with the kernel."), but it doesn't fix offload for me
without pastebin patch. Do you plan to merge that code into OvS or you
require some help from our side?

Regards,
Vlad

[...]


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] GRO for VXLAN and GENEVE doesn't work with OOT kernel module on CentOS 7.x

2022-04-07 Thread 0-day Robot
Bleep bloop.  Greetings Vladislav Odintsov, I am a robot and I have tried out 
your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
ERROR: Author Vladislav Odintsov  needs to sign off.
Lines checked: 78, Warnings: 0, Errors: 1


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] GRO for VXLAN and GENEVE doesn't work with OOT kernel module on CentOS 7.x

2022-04-07 Thread Vladislav Odintsov
Hi,

I’ve faced an issue where GENEVE and VXLAN offload on RX path doesn’t work with 
openvswitch OOT kernel module.
The problem reproduced at least on CentOS 7.5, 7.8 (but I think more recent 
versions are affected too).
Initially the problem was observed with OVS 2.13, but it reproduces with 2.17 
and master branches as well.

The problem is shown with tcpdump while iperf3 on top of VXLAN of GENEVE is run 
(RX traffic; length 1448):

19:19:00.223575 IP 10.1.0.105.60228 > 10.1.0.106.6081: Geneve, Flags 
[none], vni 0x0: IP 1.1.1.5.targus-getdata1 > 1.1.1.6.44652: Flags [.], seq 
40722105:40723553, ack 38, win 64, options [nop,nop,TS val 2491722176 ecr 
151867193], length 1448

The correct RX handling for GENEVE and VXLAN was broken in v2.10.0 release with 
[1] commit (if I understand correctly).
With OVS kmod v2.9.x GRO for VXLAN and GENEVE packets works as expected.
In CentOS 7.x in kernel-devel packages there is no include/net/erspan.h header 
file, therefore USE_UPSTREAM_TUNNEL macro is unset.
So I’ve tried to apply next patch:

diff --git a/acinclude.m4 b/acinclude.m4
index 914e27b93..cde232966 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -671,9 +671,10 @@ AC_DEFUN([OVS_CHECK_LINUX_COMPAT], [
 [OVS_GREP_IFELSE([$KSRC/include/net/ip_tunnels.h],
  [iptunnel_pull_offloads],
 [OVS_GREP_IFELSE([$KSRC/include/net/dst_cache.h], 
[dst_cache],
-[OVS_GREP_IFELSE([$KSRC/include/net/erspan.h], 
[erspan_md2],
- 
[OVS_DEFINE([USE_UPSTREAM_TUNNEL])])])])])
+ 
[OVS_DEFINE([USE_UPSTREAM_TUNNEL])])])])

+  OVS_GREP_IFELSE([$KSRC/include/net/erspan.h], [erspan_md2],
+  [OVS_DEFINE([USE_BUILTIN_ERSPAN_MD2])])
   OVS_GREP_IFELSE([$KSRC/include/net/dst_cache.h], [dst_cache],
   [OVS_DEFINE([USE_BUILTIN_DST_CACHE])])
   OVS_GREP_IFELSE([$KSRC/include/net/mpls.h], [mpls_hdr],
diff --git a/datapath/linux/compat/include/net/erspan.h 
b/datapath/linux/compat/include/net/erspan.h
index 4a6a8f240..a36ce727b 100644
--- a/datapath/linux/compat/include/net/erspan.h
+++ b/datapath/linux/compat/include/net/erspan.h
@@ -1,4 +1,4 @@
-#ifndef USE_UPSTREAM_TUNNEL
+#ifndef USE_BUILTIN_ERSPAN_MD2
 #ifndef __LINUX_ERSPAN_H
 #define __LINUX_ERSPAN_H

After the next patch the problem was resolved (length 65k):

19:05:13.500646 IP 10.1.0.105.51432 > 10.1.0.106.6081: Geneve, Flags 
[none], vni 0x0: IP 1.1.1.5.targus-getdata1 > 1.1.1.6.54874: Flags [.], seq 
185753153:185818313, ack 38, win 64, options [nop,nop,TS val 2490895458 ecr 
151040475], length 65160

The problem is that with this change erspan functionality in OVS is broken and 
I couldn’t understand where to see the root cause for this:

# ovs-vsctl add-port test at_erspan0 -- set int at_erspan0 type=erspan 
options:key=1 options:remote_ip=10.1.0.101 options:erspan_ver=1 
options:erspan_idx=1
ovs-vsctl: Error detected while setting up 'at_erspan0': could not add network 
device at_erspan0 to ofproto (Address family not supported by protocol).  See 
ovs-vswitchd log for details.
ovs-vsctl: The default log directory is "/var/log/openvswitch".

In ovs logs:
2022-04-07T16:17:33.175Z|00051|dpif|WARN|system@ovs-system: failed to add 
at_erspan0 as port: Address family not supported by protocol
2022-04-07T16:17:33.175Z|00052|bridge|WARN|could not add network device 
at_erspan0 to ofproto (Address family not supported by protocol)

Can anybody help understand the best way to fix the initial (GRO for udp_tnl) 
issue and not to break things around this? :)
Thanks.

1: 
https://github.com/openvswitch/ovs/commit/e1ededf45f072c41295f1b441a6f106159ff191b

Regards,
Vladislav Odintsov

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v3 2/2] northd: Use ct_(snat/dnat)_in_czone action for distributed routers.

2022-04-07 Thread Numan Siddique
On Wed, Apr 6, 2022 at 12:11 PM Numan Siddique  wrote:
>
> On Wed, Apr 6, 2022 at 4:58 AM Frode Nordahl
>  wrote:
> >
> > On Wed, Mar 30, 2022 at 4:54 PM Numan Siddique  wrote:
> > >
> > > On Wed, Mar 30, 2022 at 6:21 AM Frode Nordahl
> > >  wrote:
> > > >
> > > > On Wed, Mar 30, 2022 at 11:45 AM Frode Nordahl
> > > >  wrote:
> > > > >
> > > > > Hello Numan,
> > > > >
> > > > > This patch does unfortunately break gateway routers in some
> > > > > circumstances, (but not all!), at least for the way OpenStack consumes
> > > > > them.
> > > > >
> > > > > Will dig more and try to figure out what is going on, but sending this
> > > > > e-mail proactively in case you have any ideas from the top of your
> > > > > head or if anyone else has run into the same issue.
> > > > >
> > > > > The gateway appears to handle conntrack as it should:
> > > > > $ sudo conntrack -E --dst 194.169.254.178
> > > > > [NEW] tcp  6 120 SYN_SENT src=10.11.2.11 dst=194.169.254.178
> > > > > sport=60234 dport=22 [UNREPLIED] src=10.42.3.34 dst=10.11.2.11
> > > > > sport=22 dport=60234 zone=52
> > > > >  [UPDATE] tcp  6 60 SYN_RECV src=10.11.2.11 dst=194.169.254.178
> > > > > sport=60234 dport=22 src=10.42.3.34 dst=10.11.2.11 sport=22
> > > > > dport=60234 zone=52
> > > > >  [UPDATE] tcp  6 432000 ESTABLISHED src=10.11.2.11
> > > > > dst=194.169.254.178 sport=60234 dport=22 src=10.42.3.34 dst=10.11.2.11
> > > > > sport=22 dport=60234 [ASSURED] zone=52
> > > > >
> > > > > However as you can see traffic going to the instance suddenly gets its
> > > > > source address mangled/set to 0.0.0.0, and the connection can never
> > > > > establish.
> > > > > $ sudo tcpdump -nevvi tape5c1862d-b4
> > > > > tcpdump: listening on tape5c1862d-b4, link-type EN10MB (Ethernet),
> > > > > capture size 262144 bytes
> > > > > 09:40:34.920187 fa:16:3e:c8:19:af > fa:16:3e:fc:82:be, ethertype IPv4
> > > > > (0x0800), length 74: (tos 0x0, ttl 62, id 24824, offset 0, flags [DF],
> > > > > proto TCP (6), length 60)
> > > > > 10.11.2.11.60820 > 10.42.3.34.22: Flags [S], cksum 0x9f1b
> > > > > (correct), seq 2926328730, win 64240, options [mss 1460,sackOK,TS val
> > > > > 870680827 ecr 0,nop,wscale 7], length 0
> > > > > 09:40:34.920537 fa:16:3e:fc:82:be > fa:16:3e:c8:19:af, ethertype IPv4
> > > > > (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> > > > > proto TCP (6), length 60)
> > > > > 10.42.3.34.22 > 10.11.2.11.60820: Flags [S.], cksum 0x1990
> > > > > (incorrect -> 0x284c), seq 1596962125, ack 2926328731, win 62230,
> > > > > options [mss 8902,sackOK,TS val 3490675961 ecr 870680827,nop,wscale
> > > > > 7], length 0
> > > > > 09:40:34.968033 fa:16:3e:c8:19:af > fa:16:3e:fc:82:be, ethertype IPv4
> > > > > (0x0800), length 107: (tos 0x0, ttl 62, id 24826, offset 0, flags
> > > > > [DF], proto TCP (6), length 93)
> > > > > 10.11.2.11.60820 > 10.42.3.34.22: Flags [P.], cksum 0x4293
> > > > > (correct), seq 1:42, ack 1, win 502, options [nop,nop,TS val 870680922
> > > > > ecr 3490675961], length 41
> > > > > 09:40:34.968243 fa:16:3e:fc:82:be > fa:16:3e:c8:19:af, ethertype IPv4
> > > > > (0x0800), length 66: (tos 0x0, ttl 64, id 30244, offset 0, flags [DF],
> > > > > proto TCP (6), length 52)
> > > > > 10.42.3.34.22 > 10.11.2.11.60820: Flags [.], cksum 0x1988
> > > > > (incorrect -> 0x64a4), seq 1, ack 42, win 486, options [nop,nop,TS val
> > > > > 3490676008 ecr 870680922], length 0
> > > > > 09:40:34.968857 fa:16:3e:c8:19:af > fa:16:3e:fc:82:be, ethertype IPv4
> > > > > (0x0800), length 66: (tos 0x0, ttl 62, id 24825, offset 0, flags [DF],
> > > > > proto TCP (6), length 52)
> > > > > 10.11.2.11.60820 > 10.42.3.34.22: Flags [.], cksum 0x64ec
> > > > > (correct), seq 1, ack 1, win 502, options [nop,nop,TS val 870680922
> > > > > ecr 3490675961], length 0
> > > > > 09:40:34.968932 fa:16:3e:fc:82:be > fa:16:3e:c8:19:af, ethertype IPv4
> > > > > (0x0800), length 66: (tos 0x0, ttl 64, id 30245, offset 0, flags [DF],
> > > > > proto TCP (6), length 52)
> > > > > 10.42.3.34.22 > 10.11.2.11.60820: Flags [.], cksum 0x1988
> > > > > (incorrect -> 0x64a3), seq 1, ack 42, win 486, options [nop,nop,TS val
> > > > > 3490676009 ecr 870680922], length 0
> > > > > 09:40:34.977991 fa:16:3e:fc:82:be > fa:16:3e:c8:19:af, ethertype IPv4
> > > > > (0x0800), length 107: (tos 0x0, ttl 64, id 30246, offset 0, flags
> > > > > [DF], proto TCP (6), length 93)
> > > > > 10.42.3.34.22 > 10.11.2.11.60820: Flags [P.], cksum 0x19b1
> > > > > (incorrect -> 0x4241), seq 1:42, ack 42, win 486, options [nop,nop,TS
> > > > > val 3490676018 ecr 870680922], length 41
> > > > > 09:40:34.978323 fa:16:3e:c8:19:af > fa:16:3e:fc:82:be, ethertype IPv4
> > > > > (0x0800), length 66: (tos 0x0, ttl 62, id 24827, offset 0, flags [DF],
> > > > > proto TCP (6), length 52)
> > > > > 0.0.0.0.60820 > 10.42.3.34.22: Flags [.], cksum 0x706d (correct),
> > > > > seq 2926328772, ack 1596962167, win 502, options [nop,nop,TS val
> > > > > 870680932 ecr 

Re: [ovs-dev] [PATCH] flow: Consistent VXLAN UDP src ports for fragmented packets

2022-04-07 Thread Hemanth Aramadaka via dev
Hi,

 

Here we are not fragmenting VXLAN packets in the outer IP header.The packets 
originating from VM with the larger size than the configured MTU 

will get fragmented and these inner packets are encapsulated with vxlan header 
in which we have the different source ports for UDP.

 

Thanks,

Hemanth.

 

From: Peng He  
Sent: 16 March 2022 11:30
To: Hemanth Aramadaka 
Cc: Sriharsha Basavapatna via dev ; Ilya Maximets 

Subject: Re: [ovs-dev] [PATCH] flow: Consistent VXLAN UDP src ports for 
fragmented packets

 

Normally, VXLAN packets will be set to DF( don't frag) in the outter IP header.

I am trying to understand why fragmentation happens in the first place.

 

Hemanth Aramadaka via dev mailto:ovs-dev@openvswitch.org> > 于2022年3月15日周二 22:38写道:

Issue:

The src-port for UDP is based on RSS hash in the packet metadata.
In case of packets coming from VM it will be 5-tuple, if available,
otherwise just IP addresses.If the VM fragments a large IP packet
and sends the fragments to ovs, only the first fragment will contain
the L4 header. Therefore, the first fragment and subsequent fragments
get different UDP src ports in the outgoing VXLAN header.This can
lead to fragment re-ordering in the fabric as packet will take
different paths.

Fix:

Intention of this is to avoid fragment packets taking different paths.
For example, due to presence of firewalls, fragment packets will take
different paths and will get dropped.To avoid this we ignore the L4
header during hash calculation only in the case of fragmented packets.

P.S: There is already a review request raised for the same.
https://mail.openvswitch.org/pipermail/ovs-dev/2021-January/379395.html. 

 
Re-iterating the same patch request on the latest master code.

Signed-off-by: Hemanth Aramadaka mailto:hemant...@acldigital.com> >
---
 lib/flow.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/flow.c b/lib/flow.c
index dd523c889..17bd47724 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -2238,7 +2238,7 @@ miniflow_hash_5tuple(const struct miniflow *flow, 
uint32_t basis)

 if (flow) {
 ovs_be16 dl_type = MINIFLOW_GET_BE16(flow, dl_type);
-uint8_t nw_proto;
+uint8_t nw_proto,nw_frag;

 if (dl_type == htons(ETH_TYPE_IPV6)) {
 struct flowmap map = FLOWMAP_EMPTY_INITIALIZER;
@@ -2260,6 +2260,14 @@ miniflow_hash_5tuple(const struct miniflow *flow, 
uint32_t basis)

 nw_proto = MINIFLOW_GET_U8(flow, nw_proto);
 hash = hash_add(hash, nw_proto);
+/* Skip l4 header fields if IP packet is fragmented since
+ * only first fragment will carry l4 header.
+ */
+nw_frag = MINIFLOW_GET_U8(flow, nw_frag);
+if ((nw_frag)) {
+goto out;
+}
+
 if (nw_proto != IPPROTO_TCP && nw_proto != IPPROTO_UDP
 && nw_proto != IPPROTO_SCTP && nw_proto != IPPROTO_ICMP
 && nw_proto != IPPROTO_ICMPV6) {
-- 
2.25.1

___
dev mailing list
d...@openvswitch.org  
https://mail.openvswitch.org/mailman/listinfo/ovs-dev




 

-- 

hepeng

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] OVS DPDK DMA-Dev library/Design Discussion

2022-04-07 Thread Bruce Richardson
On Thu, Apr 07, 2022 at 05:46:32PM +0200, Maxime Coquelin wrote:
> 
> 
> On 4/7/22 17:01, Ilya Maximets wrote:
> > On 4/7/22 16:42, Van Haaren, Harry wrote:
> > > > -Original Message-
> > > > From: Ilya Maximets 
> > > > Sent: Thursday, April 7, 2022 3:40 PM
> > > > To: Maxime Coquelin ; Van Haaren, Harry
> > > > ; Morten Brørup 
> > > > ;
> > > > Richardson, Bruce 
> > > > Cc: i.maxim...@ovn.org; Pai G, Sunil ; Stokes, 
> > > > Ian
> > > > ; Hu, Jiayu ; Ferriter, Cian
> > > > ; ovs-dev@openvswitch.org; d...@dpdk.org; 
> > > > Mcnamara,
> > > > John ; O'Driscoll, Tim 
> > > > ;
> > > > Finn, Emma 
> > > > Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > > > 
> > > > On 4/7/22 16:25, Maxime Coquelin wrote:
> > > > > Hi Harry,
> > > > > 
> > > > > On 4/7/22 16:04, Van Haaren, Harry wrote:
> > > > > > Hi OVS & DPDK, Maintainers & Community,
> > > > > > 
> > > > > > Top posting overview of discussion as replies to thread become 
> > > > > > slower:
> > > > > > perhaps it is a good time to review and plan for next steps?
> > > > > > 
> > > > > >   From my perspective, it those most vocal in the thread seem to be 
> > > > > > in favour
> > > > of the clean
> > > > > > rx/tx split ("defer work"), with the tradeoff that the application 
> > > > > > must be
> > > > aware of handling
> > > > > > the async DMA completions. If there are any concerns opposing 
> > > > > > upstreaming
> > > > of this method,
> > > > > > please indicate this promptly, and we can continue technical 
> > > > > > discussions here
> > > > now.
> > > > > 
> > > > > Wasn't there some discussions about handling the Virtio completions 
> > > > > with
> > > > > the DMA engine? With that, we wouldn't need the deferral of work.
> > > > 
> > > > +1
> > > 
> > > Yes there was, the DMA/virtq completions thread here for reference;
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html
> > > 
> > > I do not believe that there is a viable path to actually implementing it, 
> > > and particularly
> > > not in the more complex cases; e.g. virtio with guest-interrupt enabled.
> > > 
> > > The thread above mentions additional threads and various other options; 
> > > none of which
> > > I believe to be a clean or workable solution. I'd like input from other 
> > > folks more familiar
> > > with the exact implementations of VHost/vrings, as well as those with DMA 
> > > engine expertise.
> > 
> > I tend to trust Maxime as a vhost maintainer in such questions. :)
> > 
> > In my own opinion though, the implementation is possible and concerns 
> > doesn't
> > sound deal-breaking as solutions for them might work well enough.  So I 
> > think
> > the viability should be tested out before solution is disregarded.  
> > Especially
> > because the decision will form the API of the vhost library.
> 
> I agree, we need a PoC adding interrupt support to dmadev API using
> eventfd, and adding a thread in Vhost library that polls for DMA
> interrupts and calls vhost_vring_call if needed.
>
Hi Maxime,

couple of questions, perhaps you can clarify. Firstly, why would an eventfd
be needed for the interrupts, can they not just use the regular interrupt
handling like other devices in DPDK, e.g. read on /dev/node.

In terms of the new thread - what is this thread going to handle? Is it
going to take interrupts from all dma operations and handle the
completion/cleanup of all jobs for all queues once the DMA engine is
finished? Or is it just going to periodically be woken up to check for the
edge case if there are any virtio queues sleeping with interrupts enabled
where there are completed - but unsignalled to the VM - packets?

Regards,
/Bruce
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1] ofproto-dpif: Validate action size and recursion depth

2022-04-07 Thread Mike Pattrick
On Thu, Apr 7, 2022 at 7:26 AM Adrian Moreno  wrote:
>
> Hi Mike,
>
> On 4/6/22 18:48, Mike Pattrick wrote:
> > Verify that a flow isn't adding more than 64kB in actions when an action
> > uses multiple check_pkt_larger statements. Failure to do so can cause
> > ovs-vswitchd to throw signal 6. Also add a depth check to clone and
> > check_pkt_larger.
> >
>
> I think I'm lacking a bit of context. For my understanding, why do we need to
> add depth check to clone and check_pkt_larger?
> AFAICS the number of clone actions is bound to the length of the ofpact which 
> is
> finite, so it'll not be a problem of endless recursion (which is, IIUC, the
> rationale behind depth).
>
> Besides, it seems to me that nested cloning is equivalent to a list of clone
> actions, why should we treat them differently?
>

My rationale is twofold. First, ovs-ofctl already validates depth in
ofpacts_parse, but flow rules are not guaranteed to be programmed by
ovs-ofctl. We have seen in bugs in the past like bz2020770 where flows
programmed externally can cause vswitchd to crash.

For example, if you disable the depth check in ofpacts_parse(), the
following snippet will cause vsiwtchd to segfault:

ovs-ofctl del-flows br0
fr="normal"
for i in $(seq 1 2500); do
fr="clone($fr)"
done
ovs-ofctl add-flow br0 "actions=$fr"

But this one will not:

ovs-ofctl del-flows br0
fr="normal"
for i in $(seq 1 2500); do
fr="$fr,clone(drop)"
done
ovs-ofctl add-flow br0 "actions=$fr"

The second is that a lot of the netlink functions will call abort()
instead of returning errors. This can cause some unfortunate
landmines, I ran into the check_pkt_larger size explosion completely
by accident. I think in these cases having some reasonable limits on
recursion and buffer size to produce an error is a reasonable
solution.


Cheers,
M

>
> > Signed-off-by: Mike Pattrick 
> > ---
> >   ofproto/ofproto-dpif-xlate.c | 26 ++
> >   tests/ofproto-dpif.at| 52 
> >   2 files changed, 78 insertions(+)
> >
> > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> > index 5a770f171..50357e6bb 100644
> > --- a/ofproto/ofproto-dpif-xlate.c
> > +++ b/ofproto/ofproto-dpif-xlate.c
> > @@ -5806,7 +5806,9 @@ clone_xlate_actions(const struct ofpact *actions, 
> > size_t actions_len,
> >
> >   if (reversible_actions(actions, actions_len) || is_last_action) {
> >   old_flow = ctx->xin->flow;
> > +ctx->depth++;
> >   do_xlate_actions(actions, actions_len, ctx, is_last_action, 
> > false);
> > +ctx->depth--;
> >   if (!ctx->freezing) {
> >   xlate_action_set(ctx);
> >   }
> > @@ -5830,7 +5832,9 @@ clone_xlate_actions(const struct ofpact *actions, 
> > size_t actions_len,
> >   if (ctx->xbridge->support.clone) { /* Use clone action */
> >   /* Use clone action as datapath clone. */
> >   offset = nl_msg_start_nested(ctx->odp_actions, 
> > OVS_ACTION_ATTR_CLONE);
> > +ctx->depth++;
> >   do_xlate_actions(actions, actions_len, ctx, true, false);
> > +ctx->depth--;
> >   if (!ctx->freezing) {
> >   xlate_action_set(ctx);
> >   }
> > @@ -5846,7 +5850,9 @@ clone_xlate_actions(const struct ofpact *actions, 
> > size_t actions_len,
> >   offset = nl_msg_start_nested(ctx->odp_actions, 
> > OVS_ACTION_ATTR_SAMPLE);
> >   ac_offset = nl_msg_start_nested(ctx->odp_actions,
> >   OVS_SAMPLE_ATTR_ACTIONS);
> > +ctx->depth++;
> >   do_xlate_actions(actions, actions_len, ctx, true, false);
> > +ctx->depth--;
> >   if (!ctx->freezing) {
> >   xlate_action_set(ctx);
> >   }
> > @@ -6412,7 +6418,18 @@ xlate_check_pkt_larger(struct xlate_ctx *ctx,
> >   ctx->odp_actions, OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER);
> >   value.u8_val = 1;
> >   mf_write_subfield_flow(_pkt_larger->dst, , 
> > >xin->flow);
> > +ctx->depth++;
> >   do_xlate_actions(remaining_acts, remaining_acts_len, ctx, true, 
> > false);
> > +ctx->depth--;
> > +
> > +if (!xlate_resubmit_resource_check(ctx)) {
> > +if (!ctx->error) {
> > +/* If we don't set an error here, we end up with corrupted
> > + * odp actions in the slow path. */
> > +ctx->error = XLATE_RECURSION_TOO_DEEP;
> > +}
> > +return;
> > +}
> >   if (!ctx->freezing) {
> >   xlate_action_set(ctx);
> >   }
> > @@ -6437,7 +6454,16 @@ xlate_check_pkt_larger(struct xlate_ctx *ctx,
> >   ctx->odp_actions, OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL);
> >   value.u8_val = 0;
> >   mf_write_subfield_flow(_pkt_larger->dst, , 
> > >xin->flow);
> > +ctx->depth++;
> >   do_xlate_actions(remaining_acts, remaining_acts_len, ctx, true, 
> > false);
> > +ctx->depth--;
> > +
> > +if (!xlate_resubmit_resource_check(ctx)) {

Re: [ovs-dev] [PATCH ovn v2] northd: avoid snat on reply packets

2022-04-07 Thread Numan Siddique
On Tue, Apr 5, 2022 at 2:18 PM Mark Michelson  wrote:
>
> Thanks Xavier. It's amazing this issue has been around so long.
>

The  main reason this issue has not been an actual issue is because
generally a gateway router is connected to a join switch
and E-W traffic doesn't reach the gateway router as it will be handled
in the distributed router itself.

Thanks for fixing this issue.

I applied this patch to the main branch and backported to branch-22.03
with the below changes

--
diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index 85a65703db..db4f4d267c 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -4505,7 +4505,8 @@ nd_ns {
   to change the source IP address of a packet from an IP address of
   A or to change the source IP address of a packet that
   belongs to network A to B, a flow matches
-  ip  ip4.src == A with an action
+  ip  ip4.src == A 
+  (!ct.trk || !ct.rpl) with an action
   ct_snat(B);.  The priority of the flow
   is calculated based on the mask of A, with matches
   having larger masks getting higher priorities. If the NAT rule is
-

I couldn't apply the patch cleanly to branch-21.12.  If you think this
needs to be backported further please
submit a backport patch.

Numan


> Acked-by: Mark Michelson 
>
> On 4/1/22 12:34, Xavier Simonart wrote:
> > On gateway routers egress packets might be both:
> > - unDNATted
> > - SNATted
> > Reply packets should not be SNATted (they must of course be UnDNATted
> > if DNAT was applied).
> >
> > Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2061593
> >
> > Signed-off-by: Xavier Simonart 
> >
> > ---
> > v2: replaced "<<<" in tests as failing on github
> > ---
> >   northd/northd.c |   1 +
> >   tests/ovn-northd.at |  33 ++--
> >   tests/system-ovn.at | 119 
> >   3 files changed, 137 insertions(+), 16 deletions(-)
> >
> > diff --git a/northd/northd.c b/northd/northd.c
> > index 2fb0a93c2..511bd3331 100644
> > --- a/northd/northd.c
> > +++ b/northd/northd.c
> > @@ -12919,6 +12919,7 @@ build_lrouter_out_snat_flow(struct hmap *lflows, 
> > struct ovn_datapath *od,
> >   ds_put_format(actions, "ip%s.src=%s; next;",
> > is_v6 ? "6" : "4", nat->external_ip);
> >   } else {
> > +ds_put_format(match, " && (!ct.trk || !ct.rpl)");
> >   ds_put_format(actions, "ct_snat(%s", nat->external_ip);
> >
> >   if (nat->external_port_range[0]) {
> > diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
> > index daa664982..f462fca86 100644
> > --- a/tests/ovn-northd.at
> > +++ b/tests/ovn-northd.at
> > @@ -1030,7 +1030,7 @@ AT_CHECK([grep -e "lr_out_snat" drflows | sed 
> > 's/table=../table=??/' | sort], [0
> >   AT_CHECK([grep -e "lr_out_snat" crflows | sed 's/table=../table=??/' | 
> > sort], [0], [dnl
> > table=??(lr_out_snat), priority=0, match=(1), action=(next;)
> > table=??(lr_out_snat), priority=120  , match=(nd_ns), 
> > action=(next;)
> > -  table=??(lr_out_snat), priority=33   , match=(ip && ip4.src == 
> > 50.0.0.11 && ip4.dst == $allowed_range), action=(ct_snat(172.16.1.1);)
> > +  table=??(lr_out_snat), priority=33   , match=(ip && ip4.src == 
> > 50.0.0.11 && ip4.dst == $allowed_range && (!ct.trk || !ct.rpl)), 
> > action=(ct_snat(172.16.1.1);)
> >   ])
> >
> >
> > @@ -1062,7 +1062,7 @@ AT_CHECK([grep -e "lr_out_snat" drflows2 | sed 
> > 's/table=../table=??/' | sort], [
> >   AT_CHECK([grep -e "lr_out_snat" crflows2 | sed 's/table=../table=??/' | 
> > sort], [0], [dnl
> > table=??(lr_out_snat), priority=0, match=(1), action=(next;)
> > table=??(lr_out_snat), priority=120  , match=(nd_ns), 
> > action=(next;)
> > -  table=??(lr_out_snat), priority=33   , match=(ip && ip4.src == 
> > 50.0.0.11), action=(ct_snat(172.16.1.1);)
> > +  table=??(lr_out_snat), priority=33   , match=(ip && ip4.src == 
> > 50.0.0.11 && (!ct.trk || !ct.rpl)), action=(ct_snat(172.16.1.1);)
> > table=??(lr_out_snat), priority=35   , match=(ip && ip4.src == 
> > 50.0.0.11 && ip4.dst == $disallowed_range), action=(next;)
> >   ])
> >
> > @@ -1091,7 +1091,7 @@ AT_CHECK([grep -e "lr_out_snat" drflows3 | sed 
> > 's/table=../table=??/' | sort], [
> >   AT_CHECK([grep -e "lr_out_snat" crflows3 | sed 's/table=../table=??/' | 
> > sort], [0], [dnl
> > table=??(lr_out_snat), priority=0, match=(1), action=(next;)
> > table=??(lr_out_snat), priority=120  , match=(nd_ns), 
> > action=(next;)
> > -  table=??(lr_out_snat), priority=33   , match=(ip && ip4.src == 
> > 50.0.0.11 && ip4.dst == $allowed_range), action=(ct_snat(172.16.1.2);)
> > +  table=??(lr_out_snat), priority=33   , match=(ip && ip4.src == 
> > 50.0.0.11 && ip4.dst == $allowed_range && 

Re: [ovs-dev] OVS DPDK DMA-Dev library/Design Discussion

2022-04-07 Thread Maxime Coquelin



On 4/7/22 17:01, Ilya Maximets wrote:

On 4/7/22 16:42, Van Haaren, Harry wrote:

-Original Message-
From: Ilya Maximets 
Sent: Thursday, April 7, 2022 3:40 PM
To: Maxime Coquelin ; Van Haaren, Harry
; Morten Brørup ;
Richardson, Bruce 
Cc: i.maxim...@ovn.org; Pai G, Sunil ; Stokes, Ian
; Hu, Jiayu ; Ferriter, Cian
; ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara,
John ; O'Driscoll, Tim ;
Finn, Emma 
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion

On 4/7/22 16:25, Maxime Coquelin wrote:

Hi Harry,

On 4/7/22 16:04, Van Haaren, Harry wrote:

Hi OVS & DPDK, Maintainers & Community,

Top posting overview of discussion as replies to thread become slower:
perhaps it is a good time to review and plan for next steps?

  From my perspective, it those most vocal in the thread seem to be in favour

of the clean

rx/tx split ("defer work"), with the tradeoff that the application must be

aware of handling

the async DMA completions. If there are any concerns opposing upstreaming

of this method,

please indicate this promptly, and we can continue technical discussions here

now.


Wasn't there some discussions about handling the Virtio completions with
the DMA engine? With that, we wouldn't need the deferral of work.


+1


Yes there was, the DMA/virtq completions thread here for reference;
https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html

I do not believe that there is a viable path to actually implementing it, and 
particularly
not in the more complex cases; e.g. virtio with guest-interrupt enabled.

The thread above mentions additional threads and various other options; none of 
which
I believe to be a clean or workable solution. I'd like input from other folks 
more familiar
with the exact implementations of VHost/vrings, as well as those with DMA 
engine expertise.


I tend to trust Maxime as a vhost maintainer in such questions. :)

In my own opinion though, the implementation is possible and concerns doesn't
sound deal-breaking as solutions for them might work well enough.  So I think
the viability should be tested out before solution is disregarded.  Especially
because the decision will form the API of the vhost library.


I agree, we need a PoC adding interrupt support to dmadev API using
eventfd, and adding a thread in Vhost library that polls for DMA
interrupts and calls vhost_vring_call if needed.





With the virtio completions handled by DMA itself, the vhost port
turns almost into a real HW NIC.  With that we will not need any
extra manipulations from the OVS side, i.e. no need to defer any
work while maintaining clear split between rx and tx operations.

I'd vote for that.



Thanks,
Maxime


Thanks for the prompt responses, and lets understand if there is a viable 
workable way
to totally hide DMA-completions from the application.

Regards,  -Harry



In absence of continued technical discussion here, I suggest Sunil and Ian

collaborate on getting

the OVS Defer-work approach, and DPDK VHost Async patchsets available on

GitHub for easier

consumption and future development (as suggested in slides presented on

last call).


Regards, -Harry

No inline-replies below; message just for context.


-Original Message-
From: Van Haaren, Harry
Sent: Wednesday, March 30, 2022 10:02 AM
To: Morten Brørup ; Richardson, Bruce

Cc: Maxime Coquelin ; Pai G, Sunil
; Stokes, Ian ; Hu, Jiayu
; Ferriter, Cian ; Ilya

Maximets

; ovs-dev@openvswitch.org; d...@dpdk.org;

Mcnamara,

John ; O'Driscoll, Tim

;

Finn, Emma 
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion


-Original Message-
From: Morten Brørup 
Sent: Tuesday, March 29, 2022 8:59 PM
To: Van Haaren, Harry ; Richardson, Bruce

Cc: Maxime Coquelin ; Pai G, Sunil
; Stokes, Ian ; Hu, Jiayu
; Ferriter, Cian ; Ilya

Maximets

; ovs-dev@openvswitch.org; d...@dpdk.org;

Mcnamara,

John

; O'Driscoll, Tim ;

Finn,

Emma 
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion


From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
Sent: Tuesday, 29 March 2022 19.46


From: Morten Brørup 
Sent: Tuesday, March 29, 2022 6:14 PM


From: Bruce Richardson [mailto:bruce.richard...@intel.com]
Sent: Tuesday, 29 March 2022 19.03

On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:

From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
Sent: Tuesday, 29 March 2022 18.24

Hi Morten,

On 3/29/22 16:44, Morten Brørup wrote:

From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
Sent: Tuesday, 29 March 2022 15.02


From: Morten Brørup 
Sent: Tuesday, March 29, 2022 1:51 PM

Having thought more about it, I think that a completely

different

architectural approach is required:


Many of the DPDK Ethernet PMDs implement a variety of RX

and TX

packet burst functions, each optimized for different CPU vector
instruction sets. The availability of a DMA engine should be

treated

the same way. So I suggest that PMDs copying packet contents,

e.g.

memif, pcap, vmxnet3, should implement DMA 

Re: [ovs-dev] OVS DPDK DMA-Dev library/Design Discussion

2022-04-07 Thread Ilya Maximets
On 4/7/22 16:42, Van Haaren, Harry wrote:
>> -Original Message-
>> From: Ilya Maximets 
>> Sent: Thursday, April 7, 2022 3:40 PM
>> To: Maxime Coquelin ; Van Haaren, Harry
>> ; Morten Brørup ;
>> Richardson, Bruce 
>> Cc: i.maxim...@ovn.org; Pai G, Sunil ; Stokes, Ian
>> ; Hu, Jiayu ; Ferriter, Cian
>> ; ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara,
>> John ; O'Driscoll, Tim ;
>> Finn, Emma 
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On 4/7/22 16:25, Maxime Coquelin wrote:
>>> Hi Harry,
>>>
>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
 Hi OVS & DPDK, Maintainers & Community,

 Top posting overview of discussion as replies to thread become slower:
 perhaps it is a good time to review and plan for next steps?

  From my perspective, it those most vocal in the thread seem to be in 
 favour
>> of the clean
 rx/tx split ("defer work"), with the tradeoff that the application must be
>> aware of handling
 the async DMA completions. If there are any concerns opposing upstreaming
>> of this method,
 please indicate this promptly, and we can continue technical discussions 
 here
>> now.
>>>
>>> Wasn't there some discussions about handling the Virtio completions with
>>> the DMA engine? With that, we wouldn't need the deferral of work.
>>
>> +1
> 
> Yes there was, the DMA/virtq completions thread here for reference;
> https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html
> 
> I do not believe that there is a viable path to actually implementing it, and 
> particularly
> not in the more complex cases; e.g. virtio with guest-interrupt enabled.
> 
> The thread above mentions additional threads and various other options; none 
> of which
> I believe to be a clean or workable solution. I'd like input from other folks 
> more familiar
> with the exact implementations of VHost/vrings, as well as those with DMA 
> engine expertise.

I tend to trust Maxime as a vhost maintainer in such questions. :)

In my own opinion though, the implementation is possible and concerns doesn't
sound deal-breaking as solutions for them might work well enough.  So I think
the viability should be tested out before solution is disregarded.  Especially
because the decision will form the API of the vhost library.

> 
> 
>> With the virtio completions handled by DMA itself, the vhost port
>> turns almost into a real HW NIC.  With that we will not need any
>> extra manipulations from the OVS side, i.e. no need to defer any
>> work while maintaining clear split between rx and tx operations.
>>
>> I'd vote for that.
>>
>>>
>>> Thanks,
>>> Maxime
> 
> Thanks for the prompt responses, and lets understand if there is a viable 
> workable way
> to totally hide DMA-completions from the application.
> 
> Regards,  -Harry
> 
> 
 In absence of continued technical discussion here, I suggest Sunil and Ian
>> collaborate on getting
 the OVS Defer-work approach, and DPDK VHost Async patchsets available on
>> GitHub for easier
 consumption and future development (as suggested in slides presented on
>> last call).

 Regards, -Harry

 No inline-replies below; message just for context.

> -Original Message-
> From: Van Haaren, Harry
> Sent: Wednesday, March 30, 2022 10:02 AM
> To: Morten Brørup ; Richardson, Bruce
> 
> Cc: Maxime Coquelin ; Pai G, Sunil
> ; Stokes, Ian ; Hu, Jiayu
> ; Ferriter, Cian ; Ilya
>> Maximets
> ; ovs-dev@openvswitch.org; d...@dpdk.org;
>> Mcnamara,
> John ; O'Driscoll, Tim
>> ;
> Finn, Emma 
> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>
>> -Original Message-
>> From: Morten Brørup 
>> Sent: Tuesday, March 29, 2022 8:59 PM
>> To: Van Haaren, Harry ; Richardson, Bruce
>> 
>> Cc: Maxime Coquelin ; Pai G, Sunil
>> ; Stokes, Ian ; Hu, Jiayu
>> ; Ferriter, Cian ; Ilya
>> Maximets
>> ; ovs-dev@openvswitch.org; d...@dpdk.org;
>> Mcnamara,
> John
>> ; O'Driscoll, Tim ;
>> Finn,
>> Emma 
>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>
>>> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
>>> Sent: Tuesday, 29 March 2022 19.46
>>>
 From: Morten Brørup 
 Sent: Tuesday, March 29, 2022 6:14 PM

> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> Sent: Tuesday, 29 March 2022 19.03
>
> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
>>> Sent: Tuesday, 29 March 2022 18.24
>>>
>>> Hi Morten,
>>>
>>> On 3/29/22 16:44, Morten Brørup wrote:
> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
> Sent: Tuesday, 29 March 2022 15.02
>
>> From: Morten Brørup 
>> Sent: Tuesday, March 29, 2022 1:51 PM
>>

Re: [ovs-dev] OVS DPDK DMA-Dev library/Design Discussion

2022-04-07 Thread Van Haaren, Harry
> -Original Message-
> From: Ilya Maximets 
> Sent: Thursday, April 7, 2022 3:40 PM
> To: Maxime Coquelin ; Van Haaren, Harry
> ; Morten Brørup ;
> Richardson, Bruce 
> Cc: i.maxim...@ovn.org; Pai G, Sunil ; Stokes, Ian
> ; Hu, Jiayu ; Ferriter, Cian
> ; ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara,
> John ; O'Driscoll, Tim ;
> Finn, Emma 
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> On 4/7/22 16:25, Maxime Coquelin wrote:
> > Hi Harry,
> >
> > On 4/7/22 16:04, Van Haaren, Harry wrote:
> >> Hi OVS & DPDK, Maintainers & Community,
> >>
> >> Top posting overview of discussion as replies to thread become slower:
> >> perhaps it is a good time to review and plan for next steps?
> >>
> >>  From my perspective, it those most vocal in the thread seem to be in 
> >> favour
> of the clean
> >> rx/tx split ("defer work"), with the tradeoff that the application must be
> aware of handling
> >> the async DMA completions. If there are any concerns opposing upstreaming
> of this method,
> >> please indicate this promptly, and we can continue technical discussions 
> >> here
> now.
> >
> > Wasn't there some discussions about handling the Virtio completions with
> > the DMA engine? With that, we wouldn't need the deferral of work.
> 
> +1

Yes there was, the DMA/virtq completions thread here for reference;
https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html

I do not believe that there is a viable path to actually implementing it, and 
particularly
not in the more complex cases; e.g. virtio with guest-interrupt enabled.

The thread above mentions additional threads and various other options; none of 
which
I believe to be a clean or workable solution. I'd like input from other folks 
more familiar
with the exact implementations of VHost/vrings, as well as those with DMA 
engine expertise.


> With the virtio completions handled by DMA itself, the vhost port
> turns almost into a real HW NIC.  With that we will not need any
> extra manipulations from the OVS side, i.e. no need to defer any
> work while maintaining clear split between rx and tx operations.
> 
> I'd vote for that.
> 
> >
> > Thanks,
> > Maxime

Thanks for the prompt responses, and lets understand if there is a viable 
workable way
to totally hide DMA-completions from the application.

Regards,  -Harry


> >> In absence of continued technical discussion here, I suggest Sunil and Ian
> collaborate on getting
> >> the OVS Defer-work approach, and DPDK VHost Async patchsets available on
> GitHub for easier
> >> consumption and future development (as suggested in slides presented on
> last call).
> >>
> >> Regards, -Harry
> >>
> >> No inline-replies below; message just for context.
> >>
> >>> -Original Message-
> >>> From: Van Haaren, Harry
> >>> Sent: Wednesday, March 30, 2022 10:02 AM
> >>> To: Morten Brørup ; Richardson, Bruce
> >>> 
> >>> Cc: Maxime Coquelin ; Pai G, Sunil
> >>> ; Stokes, Ian ; Hu, Jiayu
> >>> ; Ferriter, Cian ; Ilya
> Maximets
> >>> ; ovs-dev@openvswitch.org; d...@dpdk.org;
> Mcnamara,
> >>> John ; O'Driscoll, Tim
> ;
> >>> Finn, Emma 
> >>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >>>
>  -Original Message-
>  From: Morten Brørup 
>  Sent: Tuesday, March 29, 2022 8:59 PM
>  To: Van Haaren, Harry ; Richardson, Bruce
>  
>  Cc: Maxime Coquelin ; Pai G, Sunil
>  ; Stokes, Ian ; Hu, Jiayu
>  ; Ferriter, Cian ; Ilya
> Maximets
>  ; ovs-dev@openvswitch.org; d...@dpdk.org;
> Mcnamara,
> >>> John
>  ; O'Driscoll, Tim ;
> Finn,
>  Emma 
>  Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> 
> > From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
> > Sent: Tuesday, 29 March 2022 19.46
> >
> >> From: Morten Brørup 
> >> Sent: Tuesday, March 29, 2022 6:14 PM
> >>
> >>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >>> Sent: Tuesday, 29 March 2022 19.03
> >>>
> >>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> > Sent: Tuesday, 29 March 2022 18.24
> >
> > Hi Morten,
> >
> > On 3/29/22 16:44, Morten Brørup wrote:
> >>> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
> >>> Sent: Tuesday, 29 March 2022 15.02
> >>>
>  From: Morten Brørup 
>  Sent: Tuesday, March 29, 2022 1:51 PM
> 
>  Having thought more about it, I think that a completely
> >>> different
> > architectural approach is required:
> 
>  Many of the DPDK Ethernet PMDs implement a variety of RX
> > and TX
> > packet burst functions, each optimized for different CPU vector
> > instruction sets. The availability of a DMA engine should be
> >>> treated
> > the same way. So I suggest that PMDs copying packet contents,
> > 

Re: [ovs-dev] OVS DPDK DMA-Dev library/Design Discussion

2022-04-07 Thread Ilya Maximets
On 4/7/22 16:25, Maxime Coquelin wrote:
> Hi Harry,
> 
> On 4/7/22 16:04, Van Haaren, Harry wrote:
>> Hi OVS & DPDK, Maintainers & Community,
>>
>> Top posting overview of discussion as replies to thread become slower:
>> perhaps it is a good time to review and plan for next steps?
>>
>>  From my perspective, it those most vocal in the thread seem to be in favour 
>> of the clean
>> rx/tx split ("defer work"), with the tradeoff that the application must be 
>> aware of handling
>> the async DMA completions. If there are any concerns opposing upstreaming of 
>> this method,
>> please indicate this promptly, and we can continue technical discussions 
>> here now.
> 
> Wasn't there some discussions about handling the Virtio completions with
> the DMA engine? With that, we wouldn't need the deferral of work.

+1

With the virtio completions handled by DMA itself, the vhost port
turns almost into a real HW NIC.  With that we will not need any
extra manipulations from the OVS side, i.e. no need to defer any
work while maintaining clear split between rx and tx operations.

I'd vote for that.

> 
> Thanks,
> Maxime
> 
>> In absence of continued technical discussion here, I suggest Sunil and Ian 
>> collaborate on getting
>> the OVS Defer-work approach, and DPDK VHost Async patchsets available on 
>> GitHub for easier
>> consumption and future development (as suggested in slides presented on last 
>> call).
>>
>> Regards, -Harry
>>
>> No inline-replies below; message just for context.
>>
>>> -Original Message-
>>> From: Van Haaren, Harry
>>> Sent: Wednesday, March 30, 2022 10:02 AM
>>> To: Morten Brørup ; Richardson, Bruce
>>> 
>>> Cc: Maxime Coquelin ; Pai G, Sunil
>>> ; Stokes, Ian ; Hu, Jiayu
>>> ; Ferriter, Cian ; Ilya 
>>> Maximets
>>> ; ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara,
>>> John ; O'Driscoll, Tim ;
>>> Finn, Emma 
>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>>>
 -Original Message-
 From: Morten Brørup 
 Sent: Tuesday, March 29, 2022 8:59 PM
 To: Van Haaren, Harry ; Richardson, Bruce
 
 Cc: Maxime Coquelin ; Pai G, Sunil
 ; Stokes, Ian ; Hu, Jiayu
 ; Ferriter, Cian ; Ilya 
 Maximets
 ; ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara,
>>> John
 ; O'Driscoll, Tim ; Finn,
 Emma 
 Subject: RE: OVS DPDK DMA-Dev library/Design Discussion

> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
> Sent: Tuesday, 29 March 2022 19.46
>
>> From: Morten Brørup 
>> Sent: Tuesday, March 29, 2022 6:14 PM
>>
>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
>>> Sent: Tuesday, 29 March 2022 19.03
>>>
>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Tuesday, 29 March 2022 18.24
>
> Hi Morten,
>
> On 3/29/22 16:44, Morten Brørup wrote:
>>> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
>>> Sent: Tuesday, 29 March 2022 15.02
>>>
 From: Morten Brørup 
 Sent: Tuesday, March 29, 2022 1:51 PM

 Having thought more about it, I think that a completely
>>> different
> architectural approach is required:

 Many of the DPDK Ethernet PMDs implement a variety of RX
> and TX
> packet burst functions, each optimized for different CPU vector
> instruction sets. The availability of a DMA engine should be
>>> treated
> the same way. So I suggest that PMDs copying packet contents,
> e.g.
> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
>>> packet
> burst functions.

 Similarly for the DPDK vhost library.

 In such an architecture, it would be the application's job
> to
> allocate DMA channels and assign them to the specific PMDs that
>>> should
> use them. But the actual use of the DMA channels would move
> down
>>> below
> the application and into the DPDK PMDs and libraries.


 Med venlig hilsen / Kind regards,
 -Morten Brørup
>>>
>>> Hi Morten,
>>>
>>> That's *exactly* how this architecture is designed &
>>> implemented.
>>> 1.    The DMA configuration and initialization is up to the
>>> application
> (OVS).
>>> 2.    The VHost library is passed the DMA-dev ID, and its
> new
>>> async
> rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>
>>> Looking forward to talking on the call that just started.
>>> Regards, -
> Harry
>>>
>>
>> OK, thanks - as I said on the call, I haven't looked at the
>>> patches.
>>
>> Then, I suppose that the TX completions can be handled in 

Re: [ovs-dev] OVS DPDK DMA-Dev library/Design Discussion

2022-04-07 Thread Maxime Coquelin

Hi Harry,

On 4/7/22 16:04, Van Haaren, Harry wrote:

Hi OVS & DPDK, Maintainers & Community,

Top posting overview of discussion as replies to thread become slower:
perhaps it is a good time to review and plan for next steps?

 From my perspective, it those most vocal in the thread seem to be in favour of 
the clean
rx/tx split ("defer work"), with the tradeoff that the application must be 
aware of handling
the async DMA completions. If there are any concerns opposing upstreaming of 
this method,
please indicate this promptly, and we can continue technical discussions here 
now.


Wasn't there some discussions about handling the Virtio completions with
the DMA engine? With that, we wouldn't need the deferral of work.

Thanks,
Maxime


In absence of continued technical discussion here, I suggest Sunil and Ian 
collaborate on getting
the OVS Defer-work approach, and DPDK VHost Async patchsets available on GitHub 
for easier
consumption and future development (as suggested in slides presented on last 
call).

Regards, -Harry

No inline-replies below; message just for context.


-Original Message-
From: Van Haaren, Harry
Sent: Wednesday, March 30, 2022 10:02 AM
To: Morten Brørup ; Richardson, Bruce

Cc: Maxime Coquelin ; Pai G, Sunil
; Stokes, Ian ; Hu, Jiayu
; Ferriter, Cian ; Ilya Maximets
; ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara,
John ; O'Driscoll, Tim ;
Finn, Emma 
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion


-Original Message-
From: Morten Brørup 
Sent: Tuesday, March 29, 2022 8:59 PM
To: Van Haaren, Harry ; Richardson, Bruce

Cc: Maxime Coquelin ; Pai G, Sunil
; Stokes, Ian ; Hu, Jiayu
; Ferriter, Cian ; Ilya Maximets
; ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara,

John

; O'Driscoll, Tim ; Finn,
Emma 
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion


From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
Sent: Tuesday, 29 March 2022 19.46


From: Morten Brørup 
Sent: Tuesday, March 29, 2022 6:14 PM


From: Bruce Richardson [mailto:bruce.richard...@intel.com]
Sent: Tuesday, 29 March 2022 19.03

On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:

From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
Sent: Tuesday, 29 March 2022 18.24

Hi Morten,

On 3/29/22 16:44, Morten Brørup wrote:

From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
Sent: Tuesday, 29 March 2022 15.02


From: Morten Brørup 
Sent: Tuesday, March 29, 2022 1:51 PM

Having thought more about it, I think that a completely

different

architectural approach is required:


Many of the DPDK Ethernet PMDs implement a variety of RX

and TX

packet burst functions, each optimized for different CPU vector
instruction sets. The availability of a DMA engine should be

treated

the same way. So I suggest that PMDs copying packet contents,

e.g.

memif, pcap, vmxnet3, should implement DMA optimized RX and TX

packet

burst functions.


Similarly for the DPDK vhost library.

In such an architecture, it would be the application's job

to

allocate DMA channels and assign them to the specific PMDs that

should

use them. But the actual use of the DMA channels would move

down

below

the application and into the DPDK PMDs and libraries.



Med venlig hilsen / Kind regards,
-Morten Brørup


Hi Morten,

That's *exactly* how this architecture is designed &

implemented.

1.  The DMA configuration and initialization is up to the

application

(OVS).

2.  The VHost library is passed the DMA-dev ID, and its

new

async

rx/tx APIs, and uses the DMA device to accelerate the copy.


Looking forward to talking on the call that just started.

Regards, -

Harry




OK, thanks - as I said on the call, I haven't looked at the

patches.


Then, I suppose that the TX completions can be handled in the

TX

function, and the RX completions can be handled in the RX

function,

just like the Ethdev PMDs handle packet descriptors:


TX_Burst(tx_packet_array):
1.  Clean up descriptors processed by the NIC chip. -->

Process

TX

DMA channel completions. (Effectively, the 2nd pipeline stage.)

2.  Pass on the tx_packet_array to the NIC chip

descriptors. --

Pass

on the tx_packet_array to the TX DMA channel. (Effectively, the

1st

pipeline stage.)

The problem is Tx function might not be called again, so

enqueued

packets in 2. may never be completed from a Virtio point of

view.

IOW,

the packets will be copied to the Virtio descriptors buffers,

but

the

descriptors will not be made available to the Virtio driver.


In that case, the application needs to call TX_Burst()

periodically

with an empty array, for completion purposes.


This is what the "defer work" does at the OVS thread-level, but instead
of
"brute-forcing" and *always* making the call, the defer work concept
tracks
*when* there is outstanding work (DMA copies) to be completed
("deferred work")
and calls the generic completion function at that point.

So "defer work" is generic infrastructure at the OVS 

Re: [ovs-dev] OVS DPDK DMA-Dev library/Design Discussion

2022-04-07 Thread Van Haaren, Harry
Hi OVS & DPDK, Maintainers & Community,

Top posting overview of discussion as replies to thread become slower:
perhaps it is a good time to review and plan for next steps?

>From my perspective, it those most vocal in the thread seem to be in favour of 
>the clean
rx/tx split ("defer work"), with the tradeoff that the application must be 
aware of handling
the async DMA completions. If there are any concerns opposing upstreaming of 
this method,
please indicate this promptly, and we can continue technical discussions here 
now.

In absence of continued technical discussion here, I suggest Sunil and Ian 
collaborate on getting
the OVS Defer-work approach, and DPDK VHost Async patchsets available on GitHub 
for easier
consumption and future development (as suggested in slides presented on last 
call).

Regards, -Harry

No inline-replies below; message just for context.

> -Original Message-
> From: Van Haaren, Harry
> Sent: Wednesday, March 30, 2022 10:02 AM
> To: Morten Brørup ; Richardson, Bruce
> 
> Cc: Maxime Coquelin ; Pai G, Sunil
> ; Stokes, Ian ; Hu, Jiayu
> ; Ferriter, Cian ; Ilya Maximets
> ; ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara,
> John ; O'Driscoll, Tim ;
> Finn, Emma 
> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> 
> > -Original Message-
> > From: Morten Brørup 
> > Sent: Tuesday, March 29, 2022 8:59 PM
> > To: Van Haaren, Harry ; Richardson, Bruce
> > 
> > Cc: Maxime Coquelin ; Pai G, Sunil
> > ; Stokes, Ian ; Hu, Jiayu
> > ; Ferriter, Cian ; Ilya 
> > Maximets
> > ; ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara,
> John
> > ; O'Driscoll, Tim ; Finn,
> > Emma 
> > Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >
> > > From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
> > > Sent: Tuesday, 29 March 2022 19.46
> > >
> > > > From: Morten Brørup 
> > > > Sent: Tuesday, March 29, 2022 6:14 PM
> > > >
> > > > > From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> > > > > Sent: Tuesday, 29 March 2022 19.03
> > > > >
> > > > > On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > > > > > > From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> > > > > > > Sent: Tuesday, 29 March 2022 18.24
> > > > > > >
> > > > > > > Hi Morten,
> > > > > > >
> > > > > > > On 3/29/22 16:44, Morten Brørup wrote:
> > > > > > > >> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
> > > > > > > >> Sent: Tuesday, 29 March 2022 15.02
> > > > > > > >>
> > > > > > > >>> From: Morten Brørup 
> > > > > > > >>> Sent: Tuesday, March 29, 2022 1:51 PM
> > > > > > > >>>
> > > > > > > >>> Having thought more about it, I think that a completely
> > > > > different
> > > > > > > architectural approach is required:
> > > > > > > >>>
> > > > > > > >>> Many of the DPDK Ethernet PMDs implement a variety of RX
> > > and TX
> > > > > > > packet burst functions, each optimized for different CPU vector
> > > > > > > instruction sets. The availability of a DMA engine should be
> > > > > treated
> > > > > > > the same way. So I suggest that PMDs copying packet contents,
> > > e.g.
> > > > > > > memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> > > > > packet
> > > > > > > burst functions.
> > > > > > > >>>
> > > > > > > >>> Similarly for the DPDK vhost library.
> > > > > > > >>>
> > > > > > > >>> In such an architecture, it would be the application's job
> > > to
> > > > > > > allocate DMA channels and assign them to the specific PMDs that
> > > > > should
> > > > > > > use them. But the actual use of the DMA channels would move
> > > down
> > > > > below
> > > > > > > the application and into the DPDK PMDs and libraries.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Med venlig hilsen / Kind regards,
> > > > > > > >>> -Morten Brørup
> > > > > > > >>
> > > > > > > >> Hi Morten,
> > > > > > > >>
> > > > > > > >> That's *exactly* how this architecture is designed &
> > > > > implemented.
> > > > > > > >> 1. The DMA configuration and initialization is up to the
> > > > > application
> > > > > > > (OVS).
> > > > > > > >> 2. The VHost library is passed the DMA-dev ID, and its
> > > new
> > > > > async
> > > > > > > rx/tx APIs, and uses the DMA device to accelerate the copy.
> > > > > > > >>
> > > > > > > >> Looking forward to talking on the call that just started.
> > > > > Regards, -
> > > > > > > Harry
> > > > > > > >>
> > > > > > > >
> > > > > > > > OK, thanks - as I said on the call, I haven't looked at the
> > > > > patches.
> > > > > > > >
> > > > > > > > Then, I suppose that the TX completions can be handled in the
> > > TX
> > > > > > > function, and the RX completions can be handled in the RX
> > > function,
> > > > > > > just like the Ethdev PMDs handle packet descriptors:
> > > > > > > >
> > > > > > > > TX_Burst(tx_packet_array):
> > > > > > > > 1.  Clean up descriptors processed by the NIC chip. -->
> > > Process
> > > > > TX
> > > > > > > DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > > 

[ovs-dev] [PATCH] ofproto-dpif-xlate: remove mirror assert

2022-04-07 Thread lic121
During the revalidation, mirror could be removed. Instead of crash
the process, we can simply skip the deleted mirror.

Fixes: ec7ceaed4f3e ("ofproto-dpif: Modularize mirror code.")
Signed-off-by: lic121 
---
 ofproto/ofproto-dpif-xlate.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 5a770f1..33e56c2 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -2142,9 +2142,13 @@ mirror_packet(struct xlate_ctx *ctx, struct xbundle 
*xbundle,
 int snaplen;
 
 /* Get the details of the mirror represented by the rightmost 1-bit. */
-ovs_assert(mirror_get(xbridge->mbridge, raw_ctz(mirrors),
-  , _mirrors,
-  , , _vlan));
+if (!mirror_get(xbridge->mbridge, raw_ctz(mirrors),
+   , _mirrors,
+   , , _vlan)) {
+/* Mirror could be deleted during revalidation */
+mirrors = zero_rightmost_1bit(mirrors);
+continue;
+}
 
 
 /* If this mirror selects on the basis of VLAN, and it does not select
-- 
1.8.3.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1] ofproto-dpif: Validate action size and recursion depth

2022-04-07 Thread Adrian Moreno

Hi Mike,

On 4/6/22 18:48, Mike Pattrick wrote:

Verify that a flow isn't adding more than 64kB in actions when an action
uses multiple check_pkt_larger statements. Failure to do so can cause
ovs-vswitchd to throw signal 6. Also add a depth check to clone and
check_pkt_larger.



I think I'm lacking a bit of context. For my understanding, why do we need to 
add depth check to clone and check_pkt_larger?
AFAICS the number of clone actions is bound to the length of the ofpact which is 
finite, so it'll not be a problem of endless recursion (which is, IIUC, the 
rationale behind depth).


Besides, it seems to me that nested cloning is equivalent to a list of clone 
actions, why should we treat them differently?




Signed-off-by: Mike Pattrick 
---
  ofproto/ofproto-dpif-xlate.c | 26 ++
  tests/ofproto-dpif.at| 52 
  2 files changed, 78 insertions(+)

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 5a770f171..50357e6bb 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -5806,7 +5806,9 @@ clone_xlate_actions(const struct ofpact *actions, size_t 
actions_len,
  
  if (reversible_actions(actions, actions_len) || is_last_action) {

  old_flow = ctx->xin->flow;
+ctx->depth++;
  do_xlate_actions(actions, actions_len, ctx, is_last_action, false);
+ctx->depth--;
  if (!ctx->freezing) {
  xlate_action_set(ctx);
  }
@@ -5830,7 +5832,9 @@ clone_xlate_actions(const struct ofpact *actions, size_t 
actions_len,
  if (ctx->xbridge->support.clone) { /* Use clone action */
  /* Use clone action as datapath clone. */
  offset = nl_msg_start_nested(ctx->odp_actions, OVS_ACTION_ATTR_CLONE);
+ctx->depth++;
  do_xlate_actions(actions, actions_len, ctx, true, false);
+ctx->depth--;
  if (!ctx->freezing) {
  xlate_action_set(ctx);
  }
@@ -5846,7 +5850,9 @@ clone_xlate_actions(const struct ofpact *actions, size_t 
actions_len,
  offset = nl_msg_start_nested(ctx->odp_actions, 
OVS_ACTION_ATTR_SAMPLE);
  ac_offset = nl_msg_start_nested(ctx->odp_actions,
  OVS_SAMPLE_ATTR_ACTIONS);
+ctx->depth++;
  do_xlate_actions(actions, actions_len, ctx, true, false);
+ctx->depth--;
  if (!ctx->freezing) {
  xlate_action_set(ctx);
  }
@@ -6412,7 +6418,18 @@ xlate_check_pkt_larger(struct xlate_ctx *ctx,
  ctx->odp_actions, OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER);
  value.u8_val = 1;
  mf_write_subfield_flow(_pkt_larger->dst, , >xin->flow);
+ctx->depth++;
  do_xlate_actions(remaining_acts, remaining_acts_len, ctx, true, false);
+ctx->depth--;
+
+if (!xlate_resubmit_resource_check(ctx)) {
+if (!ctx->error) {
+/* If we don't set an error here, we end up with corrupted
+ * odp actions in the slow path. */
+ctx->error = XLATE_RECURSION_TOO_DEEP;
+}
+return;
+}
  if (!ctx->freezing) {
  xlate_action_set(ctx);
  }
@@ -6437,7 +6454,16 @@ xlate_check_pkt_larger(struct xlate_ctx *ctx,
  ctx->odp_actions, OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL);
  value.u8_val = 0;
  mf_write_subfield_flow(_pkt_larger->dst, , >xin->flow);
+ctx->depth++;
  do_xlate_actions(remaining_acts, remaining_acts_len, ctx, true, false);
+ctx->depth--;
+
+if (!xlate_resubmit_resource_check(ctx)) {
+if (!ctx->error) {
+ctx->error = XLATE_RECURSION_TOO_DEEP;
+}
+return;
+}
  if (!ctx->freezing) {
  xlate_action_set(ctx);
  }
diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
index dbb3b6dda..3f70e732a 100644
--- a/tests/ofproto-dpif.at
+++ b/tests/ofproto-dpif.at
@@ -11640,3 +11640,55 @@ AT_CHECK([test 1 = `ovs-ofctl parse-pcap p2-tx.pcap | 
wc -l`])
  
  OVS_VSWITCHD_STOP

  AT_CLEANUP
+
+AT_SETUP([ofproto-dpif - discard large flows])
+OVS_VSWITCHD_START
+add_of_ports --pcap br0 `seq 1 2`
+
+AT_DATA([flows.txt], [dnl
+table=0,in_port=1 
actions=clone(clone(clone(clone(clone(clone(clone(clone(clone( dnl Over 64 
recursions
+  
clone(clone(clone(clone(clone(clone(clone(clone(clone( dnl
+  
clone(clone(clone(clone(clone(clone(clone(clone(clone( dnl
+  
clone(clone(clone(clone(clone(clone(clone(clone(clone( dnl
+  
clone(clone(clone(clone(clone(clone(clone(clone(clone( dnl
+  
clone(clone(clone(clone(clone(clone(clone(clone(clone( dnl
+  
clone(clone(clone(clone(clone(clone(clone(clone(clone( dnl
+  clone(resubmit(,2))  
  dnl
+  )
  dnl
+  

Re: [ovs-dev] [PATCH net-next v2] net: openvswitch: fix uAPI incompatibility with existing user space

2022-04-07 Thread Ilya Maximets
On 4/7/22 10:02, Vlad Buslov wrote:
> On Mon 14 Mar 2022 at 20:40, Ilya Maximets  wrote:
>> On 3/14/22 19:33, Roi Dayan wrote:
>>>
>>>
>>> On 2022-03-10 8:44 PM, Aaron Conole wrote:
 Ilya Maximets  writes:

> Few years ago OVS user space made a strange choice in the commit [1]
> to define types only valid for the user space inside the copy of a
> kernel uAPI header.  '#ifndef __KERNEL__' and another attribute was
> added later.
>
> This leads to the inevitable clash between user space and kernel types
> when the kernel uAPI is extended.  The issue was unveiled with the
> addition of a new type for IPv6 extension header in kernel uAPI.
>
> When kernel provides the OVS_KEY_ATTR_IPV6_EXTHDRS attribute to the
> older user space application, application tries to parse it as
> OVS_KEY_ATTR_PACKET_TYPE and discards the whole netlink message as
> malformed.  Since OVS_KEY_ATTR_IPV6_EXTHDRS is supplied along with
> every IPv6 packet that goes to the user space, IPv6 support is fully
> broken.
>
> Fixing that by bringing these user space attributes to the kernel
> uAPI to avoid the clash.  Strictly speaking this is not the problem
> of the kernel uAPI, but changing it is the only way to avoid breakage
> of the older user space applications at this point.
>
> These 2 types are explicitly rejected now since they should not be
> passed to the kernel.  Additionally, OVS_KEY_ATTR_TUNNEL_INFO moved
> out from the '#ifdef __KERNEL__' as there is no good reason to hide
> it from the userspace.  And it's also explicitly rejected now, because
> it's for in-kernel use only.
>
> Comments with warnings were added to avoid the problem coming back.
>
> (1 << type) converted to (1ULL << type) to avoid integer overflow on
> OVS_KEY_ATTR_IPV6_EXTHDRS, since it equals 32 now.
>
>   [1] beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
>
> Fixes: 28a3f0601727 ("net: openvswitch: IPv6: Add IPv6 extension header 
> support")
> Link: 
> https://lore.kernel.org/netdev/3adf00c7-fe65-3ef4-b6d7-6d8a0cad8...@nvidia.com
> Link: 
> https://github.com/openvswitch/ovs/commit/beb75a40fdc295bfd6521b0068b4cd12f6de507c
> Reported-by: Roi Dayan 
> Signed-off-by: Ilya Maximets 
> ---

 Acked-by: Aaron Conole 

>>>
>>>
>>>
>>> I got to check traffic with the fix and I do get some traffic
>>> but something is broken. I didn't investigate much but the quick
>>> test shows me rules are not offloaded and dumping ovs rules gives
>>> error like this
>>>
>>> recirc_id(0),in_port(enp8s0f0_1),ct_state(-trk),eth(),eth_type(0x86dd),ipv6(frag=no)(bad
>>> key length 2, expected -1)(00 00/(bad mask length 2, expected -1)(00 00),
>>> packets:2453, bytes:211594, used:0.004s, flags:S., actions:ct,recirc(0x2)
>>
>> Such a dump is expected, because kernel parses fields that current
>> userspace doesn't understand, and at the same time OVS by design is
>> using kernel provided key/mask while installing datapath rules, IIRC.
>> It should be possible to make these dumps a bit more friendly though.
>>
>> For the offloading not working, see my comment in the v2 patch email
>> I sent (top email of this thread).  In short, it's a problem in user
>> space and it can not be fixed from the kernel side, unless we revert
>> IPv6 extension header support and never add any new types, which is
>> unreasonable.  I didn't test any actual offloading, but I had a
>> successful run of 'make check-offloads' with my quick'n'dirty fix from
>> the top email.
> 
> Hi Ilya,
> 
> I can confirm that with latest OvS master IPv6 rules offload still fails
> without your pastebin code applied.
> 
>>
>> Since we're here:
>>
>> Toms, do you plan to submit user space patches for this feature?
> 
> I see there is a patch from you that is supposed to fix compatibility
> issues caused by this change in OvS d96d14b14733 ("openvswitch.h: Align
> uAPI definition with the kernel."), but it doesn't fix offload for me
> without pastebin patch.

Yes.  OVS commit d96d14b14733 is intended to only fix the uAPI.
Issue with offload is an OVS bug that should be fixed separately.
The fix will also need to be backported to OVS stable branches.

> Do you plan to merge that code into OvS or you
> require some help from our side?

I could do that, but I don't really have enough time.  So, if you
can work on that fix, it would be great.  Note that comments inside
the OVS's lib/odp-util.c:parse_key_and_mask_to_match() was blindly
copied from the userspace datapath and are incorrect for the general
case, so has to be fixed alongside the logic of that function.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [Patch RFC ovn 1/3] OVN: Add support for Port Mirroring

2022-04-07 Thread Abhiram R N
Changes done in ovn-nbctl to support creation of mirror
and attach a source port for the same.

Example commands:
ovn-nbctl mirror-add mirror1 gre 0 to-lport sw0-port2
ovn-nbctl lsp-attach-mirror sw0-port1 mirror1

Changes are needed in ovn-sbctl and ovn-controller as well
to make it functional end to end. That needs to be done in
future patches as incremental change.

This patch prepares for the next step.

Co-authored-by: Veda Barrenkala 
Signed-off-by: Veda Barrenkala 
Signed-off-by: Abhiram R N 
---
 ovn-nb.ovsschema  |  30 +++-
 ovn-nb.xml|  53 ++
 tests/ovn-nbctl.at|  78 +
 utilities/ovn-nbctl.c | 399 ++
 4 files changed, 558 insertions(+), 2 deletions(-)

diff --git a/ovn-nb.ovsschema b/ovn-nb.ovsschema
index 6c69732e9..02e5374ab 100644
--- a/ovn-nb.ovsschema
+++ b/ovn-nb.ovsschema
@@ -1,7 +1,7 @@
 {
 "name": "OVN_Northbound",
-"version": "6.2.0",
-"cksum": "2862883635 31533",
+"version": "6.3.0",
+"cksum": "484817458 32957",
 "tables": {
 "NB_Global": {
 "columns": {
@@ -132,6 +132,11 @@
 "refType": "weak"},
  "min": 0,
  "max": 1}},
+"mirror_rules": {"type": {"key": {"type": "uuid",
+  "refTable": "Mirror",
+  "refType": "weak"},
+  "min": 0,
+  "max": "unlimited"}},
 "ha_chassis_group": {
 "type": {"key": {"type": "uuid",
  "refTable": "HA_Chassis_Group",
@@ -301,6 +306,27 @@
 "type": {"key": "string", "value": "string",
  "min": 0, "max": "unlimited"}}},
 "isRoot": false},
+"Mirror": {
+"columns": {
+"name": {"type": "string"},
+"filter": {"type": {"key": {"type": "string",
+"enum": ["set",
+ ["from-lport",
+  "to-lport","both"]]}}},
+"sink":{"type": "string"},
+"type": {"type": {"key": {"type": "string",
+"enum": ["set",
+ ["gre", "erspan"]]}}},
+"index": {"type": "integer"},
+"src": {"type": {"key": {"type": "uuid",
+   "refTable": "Logical_Switch_Port",
+   "refType": "weak"},
+   "min": 0,
+   "max": "unlimited"}},
+"external_ids": {
+"type": {"key": "string", "value": "string",
+ "min": 0, "max": "unlimited"}}},
+"isRoot": true},
 "Meter": {
 "columns": {
 "name": {"type": "string"},
diff --git a/ovn-nb.xml b/ovn-nb.xml
index 0e3ad2b9b..3163dc2bf 100644
--- a/ovn-nb.xml
+++ b/ovn-nb.xml
@@ -1503,6 +1503,10 @@
   
 
 
+
+  Mirror rules that apply to logical switch port which is the source
+
+
 
   References a row in the OVN Northbound database's
table.
@@ -2424,6 +2428,55 @@
 
   
 
+  
+
+  Each row in this table represents one Mirror that can be used for
+  port mirroring
+
+
+
+  
+Represents the name of the mirror.
+  
+
+
+
+  
+The value of this field represents selection criteria of the mirror.
+  
+
+
+
+  
+The value of this field represents the destination/sink of the mirror.
+  
+
+
+
+  
+The value of this field represents the type of the tunnel used for
+sending the mirrored packets
+  
+
+
+
+  
+The value of this field represents the key/idx depending on the
+tunnel type configured
+  
+
+
+
+  
+The value of this field represents the source port for the mirror.
+  
+
+
+
+  See External IDs at the beginning of this document.
+
+  
+
   
 
   Each row in this table represents a meter that can be used for QoS or
diff --git a/tests/ovn-nbctl.at b/tests/ovn-nbctl.at
index db0d23099..1cb01d456 100644
--- a/tests/ovn-nbctl.at
+++ b/tests/ovn-nbctl.at
@@ -435,6 +435,84 @@ AT_CHECK([ovn-nbctl meter-list], [0], [dnl
 
 dnl -
 
+OVN_NBCTL_TEST([ovn_nbctl_mirrors], [mirrors], [
+AT_CHECK([ovn-nbctl mirror-add mirror1 gre 0 from-lport 10.10.10.1])
+AT_CHECK([ovn-nbctl mirror-add mirror2 erspan 1 both 10.10.10.2])
+AT_CHECK([ovn-nbctl ls-add sw0])
+AT_CHECK([ovn-nbctl lsp-add 

[ovs-dev] [ovs-dev, v3, 6/8] netdev-offload-tc: Cleanup police actions with reserved indexes on startup

2022-04-07 Thread Jianbo Liu via dev
As the police actions with indexes of 0x1000-0x1fff are
reserved for meter offload, to provide a clean environment for OVS,
these reserved police actions should be deleted on startup. So dump
all the police actions, delete those actions with indexes in this
range.

Signed-off-by: Jianbo Liu 
---
 lib/netdev-offload-tc.c | 75 +
 lib/tc.c| 61 +
 lib/tc.h|  2 ++
 3 files changed, 138 insertions(+)

diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index 9364cb530..d6ef5159c 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -2511,12 +2511,87 @@ meter_tc_del_policer(ofproto_meter_id meter_id,
 return err;
 }
 
+struct policer_node {
+struct hmap_node node;
+uint32_t police_idx;
+};
+
+static int
+tc_get_policer_action_ids(struct hmap *map)
+{
+uint32_t police_idx[TCA_ACT_MAX_PRIO] = {};
+struct policer_node *policer_node;
+struct netdev_flow_dump *dump;
+struct ofpbuf rbuffer, reply;
+size_t hash;
+int i, err;
+
+dump = xzalloc(sizeof *dump);
+dump->nl_dump = xzalloc(sizeof *dump->nl_dump);
+
+ofpbuf_init(, NL_DUMP_BUFSIZE);
+tc_dump_tc_action_start("police", dump->nl_dump);
+
+while (nl_dump_next(dump->nl_dump, , )) {
+if (parse_netlink_to_tc_policer(, police_idx)) {
+continue;
+}
+
+for (i = 0; i < TCA_ACT_MAX_PRIO; i++) {
+if (!police_idx[i]) {
+break;
+}
+policer_node = xzalloc(sizeof *policer_node);
+policer_node->police_idx = police_idx[i];
+hash = hash_int(police_idx[i], 0);
+hmap_insert(map, _node->node, hash);
+}
+memset(police_idx, 0, TCA_ACT_MAX_PRIO * sizeof(uint32_t));
+}
+
+err = nl_dump_done(dump->nl_dump);
+ofpbuf_uninit();
+free(dump->nl_dump);
+free(dump);
+
+return err;
+}
+
+static void
+tc_cleanup_policer_action(struct id_pool *police_ids,
+  uint32_t id_min, uint32_t id_max)
+{
+struct policer_node *policer_node;
+uint32_t police_idx;
+struct hmap map;
+int err;
+
+hmap_init();
+tc_get_policer_action_ids();
+
+HMAP_FOR_EACH_POP (policer_node, node, ) {
+police_idx = policer_node->police_idx;
+if (police_idx >= id_min && police_idx <= id_max) {
+err = tc_del_policer_action(police_idx, NULL);
+if (err) {
+/* don't use this police any more */
+id_pool_add(police_ids, police_idx);
+}
+}
+free(policer_node);
+}
+
+hmap_destroy();
+}
+
 static int
 meter_tc_init_policer(void)
 {
 meter_police_ids = id_pool_create(METER_POLICE_IDS_BASE,
 METER_POLICE_IDS_MAX - METER_POLICE_IDS_BASE + 1);
 
+tc_cleanup_policer_action(meter_police_ids, METER_POLICE_IDS_BASE,
+  METER_POLICE_IDS_MAX);
 return 0;
 }
 
diff --git a/lib/tc.c b/lib/tc.c
index ee16364ea..16d4ae3da 100644
--- a/lib/tc.c
+++ b/lib/tc.c
@@ -2067,6 +2067,67 @@ tc_dump_tc_chain_start(struct tcf_id *id, struct nl_dump 
*dump)
 return 0;
 }
 
+int
+tc_dump_tc_action_start(char *name, struct nl_dump *dump)
+{
+size_t offset, root_offset;
+struct ofpbuf request;
+uint32_t prio = 0;
+
+tc_make_action_request(RTM_GETACTION, NLM_F_DUMP, );
+root_offset = nl_msg_start_nested(, TCA_ACT_TAB);
+offset = nl_msg_start_nested(, ++prio);
+nl_msg_put_string(, TCA_ACT_KIND, name);
+nl_msg_end_nested(, offset);
+nl_msg_end_nested(, root_offset);
+
+nl_dump_start(dump, NETLINK_ROUTE, );
+ofpbuf_uninit();
+
+return 0;
+}
+
+int
+parse_netlink_to_tc_policer(struct ofpbuf *reply, uint32_t police_idx[])
+{
+static struct nl_policy actions_orders_policy[TCA_ACT_MAX_PRIO] = {};
+struct nlattr *actions_orders[ARRAY_SIZE(actions_orders_policy)];
+const int max_size = ARRAY_SIZE(actions_orders_policy);
+const struct nlattr *actions;
+struct tc_flower flower;
+struct tcamsg *tca;
+int i, cnt = 0;
+int err;
+
+for (i = 0; i < max_size; i++) {
+actions_orders_policy[i].type = NL_A_NESTED;
+actions_orders_policy[i].optional = true;
+}
+
+tca = ofpbuf_at_assert(reply, NLMSG_HDRLEN, sizeof *tca);
+actions = nl_attr_find(reply, NLMSG_HDRLEN + sizeof *tca, TCA_ACT_TAB);
+if (!actions || !nl_parse_nested(actions, actions_orders_policy,
+ actions_orders, max_size)) {
+VLOG_ERR_RL(_rl, "failed to parse police actions");
+return EPROTO;
+}
+
+for (i = 0; i < TCA_ACT_MAX_PRIO; i++) {
+if (actions_orders[i]) {
+memset(, 0, sizeof(struct tc_flower));
+err = tc_parse_single_action(actions_orders[i], , false);
+if (err) {
+continue;
+}
+if 

[ovs-dev] [ovs-dev, v3, 8/8] dpif-netlink: Offloading meter to tc police action

2022-04-07 Thread Jianbo Liu via dev
OVS meters are created in advance and openflow rules refer to them by
their unique ID. New tc_police API is used to offload them. By calling
the API, police actions are created and meters are mapped to them.
These actions then can be used in tc filter rules by the index.

Signed-off-by: Jianbo Liu 
---
 lib/dpif-netlink.c | 47 --
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 71e35ccdd..7c089b465 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -350,6 +350,16 @@ dpif_netlink_enumerate(struct sset *all_dps,
 return nl_dump_done();
 }
 
+static int
+dpif_netlink_initialize(void)
+{
+if (netdev_is_flow_api_enabled()) {
+meter_offload_init("tc_police");
+}
+
+return 0;
+}
+
 static int
 dpif_netlink_open(const struct dpif_class *class OVS_UNUSED, const char *name,
   bool create, struct dpif **dpifp)
@@ -733,6 +743,10 @@ dpif_netlink_destroy(struct dpif *dpif_)
 struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
 struct dpif_netlink_dp dp;
 
+if (netdev_is_flow_api_enabled()) {
+meter_offload_destroy("tc_police");
+}
+
 dpif_netlink_dp_init();
 dp.cmd = OVS_DP_CMD_DEL;
 dp.dp_ifindex = dpif->dp_ifindex;
@@ -4163,11 +4177,18 @@ static int
 dpif_netlink_meter_set(struct dpif *dpif_, ofproto_meter_id meter_id,
struct ofputil_meter_config *config)
 {
+int err;
+
 if (probe_broken_meters(dpif_)) {
 return ENOMEM;
 }
 
-return dpif_netlink_meter_set__(dpif_, meter_id, config);
+err = dpif_netlink_meter_set__(dpif_, meter_id, config);
+if (!err && netdev_is_flow_api_enabled()) {
+meter_offload_set("tc_police", meter_id, config);
+}
+
+return err;
 }
 
 /* Retrieve statistics and/or delete meter 'meter_id'.  Statistics are
@@ -4258,16 +4279,30 @@ static int
 dpif_netlink_meter_get(const struct dpif *dpif, ofproto_meter_id meter_id,
struct ofputil_meter_stats *stats, uint16_t max_bands)
 {
-return dpif_netlink_meter_get_stats(dpif, meter_id, stats, max_bands,
-OVS_METER_CMD_GET);
+int err;
+
+err = dpif_netlink_meter_get_stats(dpif, meter_id, stats, max_bands,
+   OVS_METER_CMD_GET);
+if (!err && netdev_is_flow_api_enabled()) {
+meter_offload_get("tc_police", meter_id, stats, max_bands);
+}
+
+return err;
 }
 
 static int
 dpif_netlink_meter_del(struct dpif *dpif, ofproto_meter_id meter_id,
struct ofputil_meter_stats *stats, uint16_t max_bands)
 {
-return dpif_netlink_meter_get_stats(dpif, meter_id, stats, max_bands,
-OVS_METER_CMD_DEL);
+int err;
+
+err  = dpif_netlink_meter_get_stats(dpif, meter_id, stats,
+max_bands, OVS_METER_CMD_DEL);
+if (!err && netdev_is_flow_api_enabled()) {
+meter_offload_del("tc_police", meter_id, stats, max_bands);
+}
+
+return err;
 }
 
 static bool
@@ -4416,7 +4451,7 @@ dpif_netlink_cache_set_size(struct dpif *dpif_, uint32_t 
level, uint32_t size)
 const struct dpif_class dpif_netlink_class = {
 "system",
 false,  /* cleanup_required */
-NULL,   /* init */
+dpif_netlink_initialize,/* init */
 dpif_netlink_enumerate,
 NULL,
 dpif_netlink_open,
-- 
2.26.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [ovs-dev, v3, 7/8] netdev-offload-tc: Offloading rules with police actions

2022-04-07 Thread Jianbo Liu via dev
When offloading rule, tc should be filled with police index, instead
of meter id. As meter is mapped to police action, and the mapping is
inserted into meter_id_to_police_idx hmap, this hmap is used to find
the police index. Besides, an action cookie is added to save meter id
on rule creation, so meter id can be retrieved from the cookie, and
pass to dpif while dumping rules.

Signed-off-by: Jianbo Liu 
---
 lib/netdev-offload-tc.c | 16 +++-
 lib/tc.c| 25 -
 2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index d6ef5159c..afe566800 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -150,6 +150,8 @@ static struct netlink_field set_flower_map[][4] = {
 },
 };
 
+static int meter_id_lookup(uint32_t meter_id, uint32_t *police_idx);
+
 static struct ovs_mutex ufid_lock = OVS_MUTEX_INITIALIZER;
 
 /**
@@ -1019,7 +1021,8 @@ parse_tc_flower_to_match(struct tc_flower *flower,
 }
 break;
 case TC_ACT_POLICE: {
-/* Not supported yet */
+nl_msg_put_u32(buf, OVS_ACTION_ATTR_METER,
+   action->police.meter_id);
 }
 break;
 }
@@ -1936,6 +1939,17 @@ netdev_tc_flow_put(struct netdev *netdev, struct match 
*match,
 action->type = TC_ACT_GOTO;
 action->chain = 0;  /* 0 is reserved and not used by recirc. */
 flower.action_count++;
+} else if (nl_attr_type(nla) == OVS_ACTION_ATTR_METER) {
+uint32_t police_index;
+
+action->type = TC_ACT_POLICE;
+action->police.meter_id = nl_attr_get_u32(nla);
+if (meter_id_lookup(action->police.meter_id, _index)) {
+return EOPNOTSUPP;
+}
+
+action->police.index = police_index;
+flower.action_count++;
 } else {
 VLOG_DBG_RL(, "unsupported put action type: %d",
 nl_attr_type(nla));
diff --git a/lib/tc.c b/lib/tc.c
index 16d4ae3da..ecdcb676d 100644
--- a/lib/tc.c
+++ b/lib/tc.c
@@ -2421,6 +2421,22 @@ nl_msg_put_act_gact(struct ofpbuf *request, uint32_t 
chain)
 nl_msg_end_nested(request, offset);
 }
 
+static void
+nl_msg_put_act_police_index(struct ofpbuf *request, uint32_t police_idx)
+{
+struct tc_police police;
+size_t offset;
+
+memset(, 0, sizeof police);
+police.index = police_idx;
+
+nl_msg_put_string(request, TCA_ACT_KIND, "police");
+offset = nl_msg_start_nested(request, TCA_ACT_OPTIONS);
+nl_msg_put_unspec(request, TCA_POLICE_TBF, , sizeof police);
+nl_msg_put_u32(request, TCA_POLICE_RESULT, TC_ACT_PIPE);
+nl_msg_end_nested(request, offset);
+}
+
 static void
 nl_msg_put_act_ct(struct ofpbuf *request, struct tc_action *action)
 {
@@ -2911,7 +2927,14 @@ nl_msg_put_flower_acts(struct ofpbuf *request, struct 
tc_flower *flower)
 }
 break;
 case TC_ACT_POLICE: {
-/* Not supported yet */
+struct tc_cookie act_cookie;
+
+act_offset = nl_msg_start_nested(request, act_index++);
+nl_msg_put_act_police_index(request, action->police.index);
+act_cookie.data = >police.meter_id;
+act_cookie.len = sizeof(action->police.meter_id);
+nl_msg_put_act_cookie(request, _cookie);
+nl_msg_end_nested(request, act_offset);
 }
 break;
 }
-- 
2.26.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [ovs-dev, v3, 2/8] tc: Add support parsing tc police action

2022-04-07 Thread Jianbo Liu via dev
Add function to parse police action from netlink message, and meter id
can be retrieved from action cockie as it will be saved there in later
patch.

Signed-off-by: Jianbo Liu 
---
 lib/netdev-offload-tc.c |  4 +++
 lib/tc.c| 59 +
 lib/tc.h|  6 +
 3 files changed, 69 insertions(+)

diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index bc03fe8c9..cff889c82 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -1017,6 +1017,10 @@ parse_tc_flower_to_match(struct tc_flower *flower,
 nl_msg_put_u32(buf, OVS_ACTION_ATTR_RECIRC, action->chain);
 }
 break;
+case TC_ACT_POLICE: {
+/* Not supported yet */
+}
+break;
 }
 }
 }
diff --git a/lib/tc.c b/lib/tc.c
index df73a43d4..af7a7bc6d 100644
--- a/lib/tc.c
+++ b/lib/tc.c
@@ -1339,6 +1339,59 @@ nl_parse_act_gact(struct nlattr *options, struct 
tc_flower *flower)
 return 0;
 }
 
+static const struct nl_policy police_policy[] = {
+[TCA_POLICE_TBF] = { .type = NL_A_UNSPEC,
+ .min_len = sizeof(struct tc_police),
+ .optional = false, },
+[TCA_POLICE_RATE] = { .type = NL_A_UNSPEC,
+  .min_len = 1024,
+  .optional = true, },
+[TCA_POLICE_PEAKRATE] = { .type = NL_A_UNSPEC,
+  .min_len = 1024,
+  .optional = true, },
+[TCA_POLICE_AVRATE] = { .type = NL_A_U32,
+.optional = true, },
+[TCA_POLICE_RESULT] = { .type = NL_A_U32,
+.optional = true, },
+[TCA_POLICE_TM] = { .type = NL_A_UNSPEC,
+.min_len = sizeof(struct tcf_t),
+.optional = true, },
+};
+
+static int
+nl_parse_act_police(const struct nlattr *options, struct tc_flower *flower,
+struct nlattr *act_cookie)
+{
+struct nlattr *police_attrs[ARRAY_SIZE(police_policy)] = {};
+struct tc_action *action;
+const struct tc_police *police;
+struct nlattr *police_tm;
+const struct tcf_t *tm;
+
+if (!nl_parse_nested(options, police_policy, police_attrs,
+ ARRAY_SIZE(police_policy))) {
+VLOG_ERR_RL(_rl, "failed to parse police action options");
+return EPROTO;
+}
+
+police = nl_attr_get(police_attrs[TCA_POLICE_TBF]);
+action = >actions[flower->action_count++];
+action->type = TC_ACT_POLICE;
+action->police.index = police->index;
+
+if (act_cookie) {
+action->police.meter_id = nl_attr_get_u32(act_cookie);
+}
+
+police_tm = police_attrs[TCA_POLICE_TM];
+if (police_tm) {
+tm = nl_attr_get_unspec(police_tm, sizeof *tm);
+nl_parse_tcf(tm, flower);
+}
+
+return 0;
+}
+
 static const struct nl_policy mirred_policy[] = {
 [TCA_MIRRED_PARMS] = { .type = NL_A_UNSPEC,
.min_len = sizeof(struct tc_mirred),
@@ -1761,6 +1814,8 @@ nl_parse_single_action(struct nlattr *action, struct 
tc_flower *flower,
 /* Added for TC rule only (not in OvS rule) so ignore. */
 } else if (!strcmp(act_kind, "ct")) {
 nl_parse_act_ct(act_options, flower);
+} else if (!strcmp(act_kind, "police")) {
+nl_parse_act_police(act_options, flower, act_cookie);
 } else {
 VLOG_ERR_RL(_rl, "unknown tc action kind: %s", act_kind);
 err = EINVAL;
@@ -2773,6 +2828,10 @@ nl_msg_put_flower_acts(struct ofpbuf *request, struct 
tc_flower *flower)
 nl_msg_end_nested(request, act_offset);
 }
 break;
+case TC_ACT_POLICE: {
+/* Not supported yet */
+}
+break;
 }
 }
 }
diff --git a/lib/tc.h b/lib/tc.h
index d6c16..201345672 100644
--- a/lib/tc.h
+++ b/lib/tc.h
@@ -174,6 +174,7 @@ enum tc_action_type {
 TC_ACT_MPLS_SET,
 TC_ACT_GOTO,
 TC_ACT_CT,
+TC_ACT_POLICE,
 };
 
 enum nat_type {
@@ -261,6 +262,11 @@ struct tc_action {
 struct tc_flower_key key;
 struct tc_flower_key mask;
 } rewrite;
+
+struct {
+uint32_t index;
+uint32_t meter_id;
+} police;
  };
 
  enum tc_action_type type;
-- 
2.26.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [ovs-dev, v3, 5/8] netdev-offload-tc: Implement and register meter offload API for tc

2022-04-07 Thread Jianbo Liu via dev
For dpif-netlink, meters are mapped by tc to police actions with
one-to-one relationship. Implement meter offload API to set/get/del
the police action, and a hmap is used to save the mappings.
An id-pool is used to manage all the available police indexes, which
are 0x1000-0x1fff, reserved only for OVS.

Signed-off-by: Jianbo Liu 
---
 lib/netdev-offload-provider.h |   1 +
 lib/netdev-offload-tc.c   | 228 ++
 lib/netdev.c  |   1 +
 3 files changed, 230 insertions(+)

diff --git a/lib/netdev-offload-provider.h b/lib/netdev-offload-provider.h
index 4c34d3f7f..aac746f53 100644
--- a/lib/netdev-offload-provider.h
+++ b/lib/netdev-offload-provider.h
@@ -124,6 +124,7 @@ int meter_register_offload_api_provider(const struct 
meter_offload_api *);
 
 #ifdef __linux__
 extern const struct netdev_flow_api netdev_offload_tc;
+extern const struct meter_offload_api meter_offload_tc;
 #endif
 
 #ifdef DPDK_NETDEV
diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index cff889c82..9364cb530 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -21,6 +21,7 @@
 
 #include "dpif.h"
 #include "hash.h"
+#include "id-pool.h"
 #include "openvswitch/hmap.h"
 #include "openvswitch/match.h"
 #include "openvswitch/ofpbuf.h"
@@ -2308,3 +2309,230 @@ const struct netdev_flow_api netdev_offload_tc = {
.flow_get_n_flows = netdev_tc_get_n_flows,
.init_flow_api = netdev_tc_init_flow_api,
 };
+
+#define METER_POLICE_IDS_BASE 0x1000
+#define METER_POLICE_IDS_MAX  0x1FFF
+/* Protects below meter ids pool and hashmaps. */
+static struct ovs_mutex meter_mutex = OVS_MUTEX_INITIALIZER;
+static struct id_pool *meter_police_ids;
+static struct hmap meter_id_to_police_idx OVS_GUARDED_BY(meter_mutex)
+= HMAP_INITIALIZER(_id_to_police_idx);
+
+struct meter_id_to_police_idx_data {
+struct hmap_node meter_id_node;
+uint32_t meter_id;
+uint32_t police_idx;
+};
+
+static struct meter_id_to_police_idx_data *
+meter_id_find_locked(uint32_t meter_id)
+OVS_REQUIRES(meter_mutex)
+{
+struct meter_id_to_police_idx_data *data;
+size_t hash = hash_int(meter_id, 0);
+
+HMAP_FOR_EACH_WITH_HASH (data, meter_id_node, hash,
+ _id_to_police_idx) {
+if (data->meter_id == meter_id) {
+return data;
+}
+}
+
+return NULL;
+}
+
+static int
+meter_id_lookup(uint32_t meter_id, uint32_t *police_idx)
+{
+struct meter_id_to_police_idx_data *data;
+int ret = 0;
+
+ovs_mutex_lock(_mutex);
+data = meter_id_find_locked(meter_id);
+if (data) {
+*police_idx = data->police_idx;
+} else {
+ret = ENOENT;
+}
+ovs_mutex_unlock(_mutex);
+
+return ret;
+}
+
+static void
+meter_id_insert(uint32_t meter_id, uint32_t police_idx)
+{
+struct meter_id_to_police_idx_data *data;
+
+ovs_mutex_lock(_mutex);
+data = meter_id_find_locked(meter_id);
+if (!data) {
+data = xzalloc(sizeof *data);
+data->meter_id = meter_id;
+data->police_idx = police_idx;
+hmap_insert(_id_to_police_idx, >meter_id_node,
+hash_int(meter_id, 0));
+} else {
+VLOG_WARN_RL(_rl,
+ "try to insert meter %u (%u) with different police (%u)",
+ meter_id, data->police_idx, police_idx);
+}
+ovs_mutex_unlock(_mutex);
+}
+
+static void
+meter_id_remove(uint32_t meter_id)
+{
+struct meter_id_to_police_idx_data *data;
+
+ovs_mutex_lock(_mutex);
+data = meter_id_find_locked(meter_id);
+if (data) {
+hmap_remove(_id_to_police_idx, >meter_id_node);
+free(data);
+}
+ovs_mutex_unlock(_mutex);
+}
+
+static bool
+meter_alloc_police_index(uint32_t *police_index)
+{
+bool ret;
+
+ovs_mutex_lock(_mutex);
+ret = id_pool_alloc_id(meter_police_ids, police_index);
+ovs_mutex_unlock(_mutex);
+
+return ret;
+}
+
+static void
+meter_free_police_index(uint32_t police_index)
+{
+ovs_mutex_lock(_mutex);
+id_pool_free_id(meter_police_ids, police_index);
+ovs_mutex_unlock(_mutex);
+}
+
+static int
+meter_tc_set_policer(ofproto_meter_id meter_id,
+ struct ofputil_meter_config *config)
+{
+uint32_t police_index;
+uint32_t rate, burst;
+bool add_policer;
+int err;
+
+ovs_assert(config->bands != NULL);
+
+rate = config->bands[0].rate;
+if (config->flags & OFPMF13_BURST) {
+burst = config->bands[0].burst_size;
+} else {
+burst = config->bands[0].rate;
+}
+
+add_policer = (meter_id_lookup(meter_id.uint32, _index) == ENOENT);
+if (add_policer) {
+if (!meter_alloc_police_index(_index)) {
+VLOG_WARN_RL(_rl, "no free police index for meter id %u",
+ meter_id.uint32);
+return ENOENT;
+}
+}
+
+err = tc_add_policer_action(police_index,
+(config->flags & 

[ovs-dev] [ovs-dev, v3, 4/8] netdev-linux: Add functions to manipulate tc police action

2022-04-07 Thread Jianbo Liu via dev
Add helpers to add, delete and get stats of police action with
the specified index.

Signed-off-by: Jianbo Liu 
---
 lib/netdev-linux.c | 133 +
 lib/netdev-linux.h |   6 ++
 lib/tc.c   |  21 +++
 lib/tc.h   |   6 ++
 4 files changed, 166 insertions(+)

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index eb05153c0..ef6c7312f 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -5664,6 +5664,139 @@ tc_add_policer(struct netdev *netdev, uint32_t 
kbits_rate,
 return 0;
 }
 
+int
+tc_add_policer_action(uint32_t index, uint32_t kbits_rate,
+  uint32_t kbits_burst, uint32_t pkts_rate,
+  uint32_t pkts_burst, bool update)
+{
+struct tc_police tc_police;
+struct ofpbuf request;
+struct tcamsg *tcamsg;
+size_t offset;
+int flags;
+
+tc_policer_init(_police, kbits_rate, kbits_burst);
+tc_police.index = index;
+
+flags = (update ? NLM_F_REPLACE : NLM_F_EXCL) | NLM_F_CREATE;
+tcamsg = tc_make_action_request(RTM_NEWACTION, flags, );
+if (!tcamsg) {
+return ENODEV;
+}
+
+offset = nl_msg_start_nested(, TCA_ACT_TAB);
+nl_msg_put_act_police(, _police, pkts_rate, pkts_burst);
+nl_msg_end_nested(, offset);
+
+return tc_transact(, NULL);
+}
+
+static int
+tc_update_policer_action_stats(struct ofpbuf *msg,
+   struct ofputil_meter_stats *stats)
+{
+const struct nlattr *act = NULL;
+struct tc_flower flower;
+struct nlattr *prio;
+struct tcamsg *tca;
+int error;
+
+if (NLMSG_HDRLEN + sizeof *tca > msg->size) {
+return EPROTO;
+}
+
+tca = ofpbuf_at_assert(msg, NLMSG_HDRLEN, sizeof *tca);
+
+act = nl_attr_find(msg, NLMSG_HDRLEN + sizeof *tca, TCA_ACT_TAB);
+if (!act) {
+return EPROTO;
+}
+
+prio = (struct nlattr *) act + 1;
+memset(, 0, sizeof(struct tc_flower));
+error = tc_parse_single_action(prio, , false);
+if (!error) {
+stats->packet_in_count +=
+get_32aligned_u64(_sw.n_packets);
+stats->byte_in_count += get_32aligned_u64(_sw.n_bytes);
+stats->packet_in_count +=
+get_32aligned_u64(_hw.n_packets);
+stats->byte_in_count += get_32aligned_u64(_hw.n_bytes);
+}
+
+return error;
+}
+
+int
+tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
+{
+struct ofpbuf *replyp = NULL;
+struct ofpbuf request;
+struct tcamsg *tcamsg;
+size_t root_offset;
+size_t prio_offset;
+int prio = 0;
+int error;
+
+tcamsg = tc_make_action_request(RTM_GETACTION, 0, );
+if (!tcamsg) {
+return ENODEV;
+}
+
+root_offset = nl_msg_start_nested(, TCA_ACT_TAB);
+prio_offset = nl_msg_start_nested(, ++prio);
+nl_msg_put_string(, TCA_ACT_KIND, "police");
+nl_msg_put_u32(, TCA_ACT_INDEX, index);
+nl_msg_end_nested(, prio_offset);
+nl_msg_end_nested(, root_offset);
+
+error = tc_transact(, );
+if (error) {
+VLOG_ERR_RL(, "failed to dump police action (index: %u), err=%d",
+index, error);
+return error;
+}
+
+error = tc_update_policer_action_stats(replyp, stats);
+if (error) {
+VLOG_ERR_RL(, "failed to update police stats (index: %u), err=%d",
+index, error);
+}
+
+return error;
+}
+
+int
+tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
+{
+struct ofpbuf *replyp = NULL;
+struct ofpbuf request;
+struct tcamsg *tcamsg;
+size_t root_offset;
+size_t prio_offset;
+int prio = 0;
+int error;
+
+tcamsg = tc_make_action_request(RTM_DELACTION, NLM_F_ACK, );
+if (!tcamsg) {
+return ENODEV;
+}
+
+root_offset = nl_msg_start_nested(, TCA_ACT_TAB);
+prio_offset = nl_msg_start_nested(, ++prio);
+nl_msg_put_string(, TCA_ACT_KIND, "police");
+nl_msg_put_u32(, TCA_ACT_INDEX, index);
+nl_msg_end_nested(, prio_offset);
+nl_msg_end_nested(, root_offset);
+
+error = tc_transact(, );
+if (!error && stats) {
+error = tc_update_policer_action_stats(replyp, stats);
+}
+
+return error;
+}
+
 static void
 read_psched(void)
 {
diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
index e1e30f806..9a416ce50 100644
--- a/lib/netdev-linux.h
+++ b/lib/netdev-linux.h
@@ -19,6 +19,7 @@
 
 #include 
 #include 
+#include "openvswitch/ofp-meter.h"
 
 /* These functions are Linux specific, so they should be used directly only by
  * Linux-specific code. */
@@ -28,5 +29,10 @@ struct netdev;
 int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
   const char *flag_name, bool enable);
 int linux_get_ifindex(const char *netdev_name);
+int tc_add_policer_action(uint32_t index, uint32_t kbits_rate,
+  uint32_t kbits_burst, uint32_t pkts_rate,
+  uint32_t pkts_burst, 

[ovs-dev] [ovs-dev, v3, 3/8] netdev-linux: Refactor put police action netlink message

2022-04-07 Thread Jianbo Liu via dev
To reuse the code for manipulating police action, move the common
initialization code to a function, and change PPS parameters as meter
pktps is in unit of packet per second.

null_police is redundant because either BPS or PPS, not both, can be
configured in one message. So the police passed in to
nl_msg_put_act_police can be reused as its rate is zero for PPS, and
it also provides the index for police action to be created.

Signed-off-by: Jianbo Liu 
---
 lib/netdev-linux.c | 82 --
 1 file changed, 42 insertions(+), 40 deletions(-)

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 9d125029d..eb05153c0 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -2629,37 +2629,27 @@ nl_msg_act_police_end_nest(struct ofpbuf *request, 
size_t offset,
 }
 
 static void
-nl_msg_put_act_police(struct ofpbuf *request, struct tc_police police,
-  uint32_t kpkts_rate, uint32_t kpkts_burst)
+nl_msg_put_act_police(struct ofpbuf *request, struct tc_police *police,
+  uint32_t pkts_rate, uint32_t pkts_burst)
 {
 size_t offset, act_offset;
 uint32_t prio = 0;
-/* used for PPS, set rate as 0 to act as a single action */
-struct tc_police null_police;
-
-memset(_police, 0, sizeof null_police);
-
-if (police.rate.rate) {
-nl_msg_act_police_start_nest(request, ++prio, , _offset);
-tc_put_rtab(request, TCA_POLICE_RATE, );
-nl_msg_put_unspec(request, TCA_POLICE_TBF, , sizeof police);
-nl_msg_act_police_end_nest(request, offset, act_offset);
-}
-if (kpkts_rate) {
-unsigned int pkt_burst_ticks, pps_rate, size;
-nl_msg_act_police_start_nest(request, ++prio, , _offset);
-pps_rate = kpkts_rate * 1000;
-size = MIN(UINT32_MAX / 1000, kpkts_burst) * 1000;
+
+nl_msg_act_police_start_nest(request, ++prio, , _offset);
+if (police->rate.rate) {
+tc_put_rtab(request, TCA_POLICE_RATE, >rate);
+}
+if (pkts_rate) {
+unsigned int pkt_burst_ticks;
 /* Here tc_bytes_to_ticks is used to convert packets rather than bytes
to ticks. */
-pkt_burst_ticks = tc_bytes_to_ticks(pps_rate, size);
-nl_msg_put_u64(request, TCA_POLICE_PKTRATE64, (uint64_t) pps_rate);
+pkt_burst_ticks = tc_bytes_to_ticks(pkts_rate, pkts_burst);
+nl_msg_put_u64(request, TCA_POLICE_PKTRATE64, (uint64_t) pkts_rate);
 nl_msg_put_u64(request, TCA_POLICE_PKTBURST64,
(uint64_t) pkt_burst_ticks);
-nl_msg_put_unspec(request, TCA_POLICE_TBF, _police,
-  sizeof null_police);
-nl_msg_act_police_end_nest(request, offset, act_offset);
 }
+nl_msg_put_unspec(request, TCA_POLICE_TBF, police, sizeof *police);
+nl_msg_act_police_end_nest(request, offset, act_offset);
 }
 
 static int
@@ -2692,7 +2682,8 @@ tc_add_matchall_policer(struct netdev *netdev, uint32_t 
kbits_rate,
 nl_msg_put_string(, TCA_KIND, "matchall");
 basic_offset = nl_msg_start_nested(, TCA_OPTIONS);
 action_offset = nl_msg_start_nested(, TCA_MATCHALL_ACT);
-nl_msg_put_act_police(, pol_act, kpkts_rate, kpkts_burst);
+nl_msg_put_act_police(, _act, kpkts_rate * 1000,
+  kpkts_burst * 1000);
 nl_msg_end_nested(, action_offset);
 nl_msg_end_nested(, basic_offset);
 
@@ -5599,6 +5590,29 @@ netdev_linux_tc_make_request(const struct netdev 
*netdev, int type,
 return tc_make_request(ifindex, type, flags, request);
 }
 
+static void
+tc_policer_init(struct tc_police *tc_police, uint32_t kbits_rate,
+uint32_t kbits_burst)
+{
+int mtu = 65535;
+
+memset(tc_police, 0, sizeof *tc_police);
+
+tc_police->action = TC_POLICE_SHOT;
+tc_police->mtu = mtu;
+tc_fill_rate(_police->rate, ((uint64_t) kbits_rate * 1000) / 8, mtu);
+
+/* The following appears wrong in one way: In networking a kilobit is
+ * usually 1000 bits but this uses 1024 bits.
+ *
+ * However if you "fix" those problems then "tc filter show ..." shows
+ * "125000b", meaning 125,000 bits, when OVS configures it for 1000 kbit ==
+ * 1,000,000 bits, whereas this actually ends up doing the right thing from
+ * tc's point of view.  Whatever. */
+tc_police->burst = tc_bytes_to_ticks(
+tc_police->rate.rate, MIN(UINT32_MAX / 1024, kbits_burst) * 1024 / 8);
+}
+
 /* Adds a policer to 'netdev' with a rate of 'kbits_rate' and a burst size
  * of 'kbits_burst', with a rate of 'kpkts_rate' and a burst size of
  * 'kpkts_burst'.
@@ -5623,22 +5637,8 @@ tc_add_policer(struct netdev *netdev, uint32_t 
kbits_rate,
 struct ofpbuf request;
 struct tcmsg *tcmsg;
 int error;
-int mtu = 65535;
 
-memset(_police, 0, sizeof tc_police);
-tc_police.action = TC_POLICE_SHOT;
-tc_police.mtu = mtu;
-tc_fill_rate(_police.rate, ((uint64_t) kbits_rate * 1000)/8, mtu);
-
-/* The following appears 

[ovs-dev] [ovs-dev, v3, 1/8] netdev-offload: Add meter offload provider

2022-04-07 Thread Jianbo Liu via dev
Add API to offload meter to HW, and the corresponding functions to be
called from dpif by the type.
The interfaces are like those related to meter in dpif_class, in order
to pass necessary info to HW.

Signed-off-by: Jianbo Liu 
---
 lib/netdev-offload-provider.h |  16 +
 lib/netdev-offload.c  | 119 ++
 lib/netdev-offload.h  |  10 +++
 3 files changed, 145 insertions(+)

diff --git a/lib/netdev-offload-provider.h b/lib/netdev-offload-provider.h
index 8ff2de983..4c34d3f7f 100644
--- a/lib/netdev-offload-provider.h
+++ b/lib/netdev-offload-provider.h
@@ -102,10 +102,26 @@ struct netdev_flow_api {
 void (*uninit_flow_api)(struct netdev *);
 };
 
+struct meter_offload_api {
+char *type;
+int (*meter_offload_init)(void);
+int (*meter_offload_destroy)(void);
+int (*meter_offload_set)(ofproto_meter_id meter_id,
+ struct ofputil_meter_config *config);
+int (*meter_offload_get)(ofproto_meter_id meter_id,
+ struct ofputil_meter_stats *stats,
+ uint16_t max_bands);
+int (*meter_offload_del)(ofproto_meter_id meter_id,
+ struct ofputil_meter_stats *stats,
+ uint16_t max_bands);
+};
+
 int netdev_register_flow_api_provider(const struct netdev_flow_api *);
 int netdev_unregister_flow_api_provider(const char *type);
 bool netdev_flow_api_equals(const struct netdev *, const struct netdev *);
 
+int meter_register_offload_api_provider(const struct meter_offload_api *);
+
 #ifdef __linux__
 extern const struct netdev_flow_api netdev_offload_tc;
 #endif
diff --git a/lib/netdev-offload.c b/lib/netdev-offload.c
index fb108c0d5..9787c16c0 100644
--- a/lib/netdev-offload.c
+++ b/lib/netdev-offload.c
@@ -195,6 +195,125 @@ netdev_assign_flow_api(struct netdev *netdev)
 return -1;
 }
 
+/* Protects 'meter_offload_apis'.  */
+static struct ovs_mutex meter_offload_api_provider_mtx = OVS_MUTEX_INITIALIZER;
+
+/* Contains 'struct meter_registered_offload_api's. */
+static struct cmap meter_offload_apis = CMAP_INITIALIZER;
+
+struct meter_registered_offload_api {
+/* In 'meter_offload_apis', by meter_offload_api->type. */
+struct cmap_node cmap_node;
+const struct meter_offload_api *meter_api;
+};
+
+static struct meter_registered_offload_api *
+meter_lookup_offload_api(const char *type)
+{
+struct meter_registered_offload_api *r;
+CMAP_FOR_EACH_WITH_HASH (r, cmap_node, hash_string(type, 0),
+ _offload_apis) {
+if (!strcmp(type, r->meter_api->type)) {
+return r;
+}
+}
+return NULL;
+}
+
+int
+meter_register_offload_api_provider(const struct meter_offload_api *new_api)
+OVS_EXCLUDED(meter_offload_api_provider_mtx)
+{
+int error = 0;
+
+ovs_mutex_lock(_offload_api_provider_mtx);
+if (meter_lookup_offload_api(new_api->type)) {
+VLOG_WARN("fail to register duplicate meter offload api provider: %s",
+   new_api->type);
+error = EEXIST;
+} else {
+struct meter_registered_offload_api *r;
+
+r = xmalloc(sizeof *r);
+cmap_insert(_offload_apis, >cmap_node,
+hash_string(new_api->type, 0));
+r->meter_api = new_api;
+VLOG_DBG("meter offload API '%s' registered.", new_api->type);
+}
+ovs_mutex_unlock(_offload_api_provider_mtx);
+
+return error;
+}
+
+static const struct meter_offload_api *
+meter_offload_get_api(const char *type)
+OVS_EXCLUDED(meter_offload_api_provider_mtx)
+{
+struct meter_registered_offload_api *r;
+
+ovs_mutex_lock(_offload_api_provider_mtx);
+r = meter_lookup_offload_api(type);
+ovs_mutex_unlock(_offload_api_provider_mtx);
+
+return r ? r->meter_api : NULL;
+}
+
+int
+meter_offload_set(const char *type, ofproto_meter_id meter_id,
+  struct ofputil_meter_config *config)
+{
+const struct meter_offload_api *meter_api;
+
+meter_api = meter_offload_get_api(type);
+
+return meter_api ? meter_api->meter_offload_set(meter_id, config)
+ : EOPNOTSUPP;
+}
+
+int
+meter_offload_get(const char *type, ofproto_meter_id meter_id,
+  struct ofputil_meter_stats *stats, uint16_t max_bands)
+{
+const struct meter_offload_api *meter_api;
+
+meter_api = meter_offload_get_api(type);
+
+return meter_api ? meter_api->meter_offload_get(meter_id, stats, max_bands)
+ : EOPNOTSUPP;
+}
+
+int
+meter_offload_del(const char *type, ofproto_meter_id meter_id,
+  struct ofputil_meter_stats *stats, uint16_t max_bands)
+{
+const struct meter_offload_api *meter_api;
+
+meter_api = meter_offload_get_api(type);
+
+return meter_api ? meter_api->meter_offload_del(meter_id, stats, max_bands)
+ : EOPNOTSUPP;
+}
+
+int
+meter_offload_init(const char *type)
+{
+const struct 

[ovs-dev] [ovs-dev, v3, 0/8] Add support for ovs metering with tc offload

2022-04-07 Thread Jianbo Liu via dev
This series is to add support for tc offloading of ovs metering, and
enhance OVS to use new kernel feature which offload tc police action to
hardware.
To do the offloading, new APIs for meter are added in netdev-offload,
and OVS meters are mapped to tc police actions with one-to-one
relationship for dpif-netlink.

Notes:
v3
- Add netdev-offload APIs for meter.
- Move the implementation to lower netdev-offload-tc layer.

v2
- Move tc police parse call from last patch to 2nd patch.
  2nd patch is adding the parse lib func and add the call to it in
that patch.
  Last patch is the put of tc police act.
  In 2nd patch also add empty switch case for tc police act put as
the impl.
  is done in the last patch when support for put is being added.

Jianbo Liu (8):
  netdev-offload: Add meter offload provider
  tc: Add support parsing tc police action
  netdev-linux: Refactor put police action netlink message
  netdev-linux: Add functions to manipulate tc police action
  netdev-offload-tc: Implement and register meter offload API for tc
  netdev-offload-tc: Cleanup police actions with reserved indexes on
startup
  netdev-offload-tc: Offloading rules with police actions
  dpif-netlink: Offloading meter to tc police action

 lib/dpif-netlink.c|  47 -
 lib/netdev-linux.c| 215 ++-
 lib/netdev-linux.h|   6 +
 lib/netdev-offload-provider.h |  17 ++
 lib/netdev-offload-tc.c   | 321 ++
 lib/netdev-offload.c  | 119 +
 lib/netdev-offload.h  |  10 ++
 lib/netdev.c  |   1 +
 lib/tc.c  | 164 +
 lib/tc.h  |  14 ++
 10 files changed, 868 insertions(+), 46 deletions(-)

-- 
2.26.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v5 5/6] ofp-actions: Use aligned structures when decoding ofp actions.

2022-04-07 Thread Adrian Moreno




On 4/5/22 14:43, Dumitru Ceara wrote:

Some openflow actions can be misaligned, e.g., actions whithin OF 1.0
replies to statistics reply messages which have a header of 12 bytes
and no additional padding.

Also, buggy controllers might incorrectly encode actions.

When decoding multiple actions in ofpacts_decode(), make sure that
when advancing to the next action it will be properly aligned
(multiple of OFPACT_ALIGNTO).

Detected by UB Sanitizer when running one of the fuzz tests:
   lib/ofp-actions.c:5347:12: runtime error: member access within misaligned 
address 0x016ba274 for type 'const struct nx_action_learn', which requires 
8 byte alignment
   0x016ba274: note: pointer points here
 20 20 20 20 ff ff 00 38  00 00 23 20 00 10 20 20  20 20 20 20 20 20 20 20  
20 20 20 20 00 03 20 00
 ^
   #0 0x52cece in decode_LEARN_common lib/ofp-actions.c:5347
   #1 0x52dcf6 in decode_NXAST_RAW_LEARN lib/ofp-actions.c:5463
   #2 0x548604 in ofpact_decode lib/ofp-actions.inc2:4723
   #3 0x53ee43 in ofpacts_decode lib/ofp-actions.c:7781
   #4 0x53efc1 in ofpacts_pull_openflow_actions__ lib/ofp-actions.c:7820
   #5 0x5409e1 in ofpacts_pull_openflow_instructions lib/ofp-actions.c:8396
   #6 0x5608a8 in ofputil_decode_flow_stats_reply lib/ofp-flow.c:1100

Signed-off-by: Dumitru Ceara 
---
v5: Addressed Adrian's comments:
- Correctly compute decoded_len.
- Fix ofpbuf_align() comment.
v4: Rebased.
v3: Split out from old patch 07/11.
---
  include/openvswitch/ofp-actions.h |1
  include/openvswitch/ofpbuf.h  |2 +
  lib/ofp-actions.c |   81 +
  lib/ofpbuf.c  |   39 ++
  4 files changed, 105 insertions(+), 18 deletions(-)

diff --git a/include/openvswitch/ofp-actions.h 
b/include/openvswitch/ofp-actions.h
index 711e7c773d6c..7b57e49ad657 100644
--- a/include/openvswitch/ofp-actions.h
+++ b/include/openvswitch/ofp-actions.h
@@ -203,6 +203,7 @@ BUILD_ASSERT_DECL(sizeof(struct ofpact) == 4);
  /* Alignment. */
  #define OFPACT_ALIGNTO 8
  #define OFPACT_ALIGN(SIZE) ROUND_UP(SIZE, OFPACT_ALIGNTO)
+#define OFPACT_IS_ALIGNED(ADDR) ((uintptr_t) (ADDR) % OFPACT_ALIGNTO == 0)
  #define OFPACT_PADDED_MEMBERS(MEMBERS) PADDED_MEMBERS(OFPACT_ALIGNTO, MEMBERS)
  
  /* Returns the ofpact following 'ofpact'. */

diff --git a/include/openvswitch/ofpbuf.h b/include/openvswitch/ofpbuf.h
index 7b6aba9dc29c..50630c9f9baf 100644
--- a/include/openvswitch/ofpbuf.h
+++ b/include/openvswitch/ofpbuf.h
@@ -111,6 +111,7 @@ void ofpbuf_use_ds(struct ofpbuf *, const struct ds *);
  void ofpbuf_use_stack(struct ofpbuf *, void *, size_t);
  void ofpbuf_use_stub(struct ofpbuf *, void *, size_t);
  void ofpbuf_use_const(struct ofpbuf *, const void *, size_t);
+void ofpbuf_use_data(struct ofpbuf *, const void *, size_t);
  
  void ofpbuf_init(struct ofpbuf *, size_t);

  void ofpbuf_uninit(struct ofpbuf *);
@@ -149,6 +150,7 @@ static inline size_t ofpbuf_msgsize(const struct ofpbuf *);
  void ofpbuf_prealloc_headroom(struct ofpbuf *, size_t);
  void ofpbuf_prealloc_tailroom(struct ofpbuf *, size_t);
  void ofpbuf_trim(struct ofpbuf *);
+void ofpbuf_align(struct ofpbuf *);
  void ofpbuf_padto(struct ofpbuf *, size_t);
  void ofpbuf_shift(struct ofpbuf *, int);
  
diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c

index 783af84439e2..d31b26583759 100644
--- a/lib/ofp-actions.c
+++ b/lib/ofp-actions.c
@@ -424,7 +424,8 @@ static void put_reg_load(struct ofpbuf *openflow,
   const struct mf_subfield *, uint64_t value);
  
  static enum ofperr ofpact_pull_raw(struct ofpbuf *, enum ofp_version,

-   enum ofp_raw_action_type *, uint64_t *arg);
+   enum ofp_raw_action_type *, uint64_t *arg,
+   size_t *raw_len);
  static void *ofpact_put_raw(struct ofpbuf *, enum ofp_version,
  enum ofp_raw_action_type, uint64_t arg);
  
@@ -7768,45 +7769,87 @@ check_GOTO_TABLE(const struct ofpact_goto_table *a,

  
  static void
  log_bad_action(const struct ofp_action_header *actions, size_t actions_len,
-   const struct ofp_action_header *bad_action, enum ofperr error)
+   size_t action_offset, enum ofperr error)
  {
  if (!VLOG_DROP_WARN()) {
  struct ds s;
  
  ds_init();

  ds_put_hex_dump(, actions, actions_len, 0, false);
-VLOG_WARN("bad action at offset %#"PRIxPTR" (%s):\n%s",
-  (char *)bad_action - (char *)actions,
+VLOG_WARN("bad action at offset %"PRIuSIZE" (%s):\n%s", action_offset,
ofperr_get_name(error), ds_cstr());
  ds_destroy();
  }
  }
  
  static enum ofperr

-ofpacts_decode(const void *actions, size_t actions_len,
-   enum ofp_version ofp_version,
-   const struct vl_mff_map *vl_mff_map,
-   uint64_t