Re: [ovs-dev] [PATCH] dpif-netdev: Use unmasked key when adding datapath flows.

2022-10-18 Thread Jan Scheurich via dev
Hi guys,

I am afraid that commit is too long ago for me to remember any details of what 
caused us to change the code in beb75a40fdc2 ("userspace: Switching of L3 
packets in L2 pipeline"). What I vaguely remember is that I couldn't 
comprehend the original code and that it was not working correctly in some of 
the cases we needed/tested. But perhaps the changes we introduced also had 
corner cases we didn't consider. 

Question though: 

> > The datapath supports installing wider flows, and OVS relies on this
> > behavior. For example if ipv4(src=1.1.1.1/192.0.0.0,
> > dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of
> > ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed to be
> > added.

That sounds strange to me. I always believed the datapath only supports 
non-overlapping flows, i.e. a packet can match at most one datapath flow. Only 
with that prerequisite can the dpcls classifier work without priorities. Have 
I been wrong about this? What would be the semantics of adding a wider flow to 
the datapath? To my knowledge there is no guarantee that the dpcls subtables are 
visited in any specific order that would honor the mask width, and the first 
match will win.
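
To illustrate my concern with a toy sketch (plain C, not the actual dpcls code;
the rules and values are made up): when two overlapping masked rules coexist and
there are no priorities, which rule a packet hits depends only on the order in
which the tables happen to be searched:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy rule: a field matches if (field & mask) == key. */
    struct rule {
        uint32_t mask;
        uint32_t key;
        const char *name;
    };

    /* First match wins -- no priorities, as in dpcls. */
    static const struct rule *
    lookup(const struct rule *rules, size_t n, uint32_t field)
    {
        for (size_t i = 0; i < n; i++) {
            if ((field & rules[i].mask) == rules[i].key) {
                return &rules[i];
            }
        }
        return NULL;
    }

    int
    main(void)
    {
        /* Two overlapping rules: src=1.1.1.1/32 and src=1.1.1.0/24.
         * A packet with src 1.1.1.1 matches both. */
        struct rule exact  = { 0xffffffff, 0x01010101, "src=1.1.1.1/32" };
        struct rule subnet = { 0xffffff00, 0x01010100, "src=1.1.1.0/24" };
        uint32_t pkt_src = 0x01010101;               /* 1.1.1.1 */

        struct rule order_a[] = { exact, subnet };
        struct rule order_b[] = { subnet, exact };

        printf("order A hits %s\n", lookup(order_a, 2, pkt_src)->name);
        printf("order B hits %s\n", lookup(order_b, 2, pkt_src)->name);
        return 0;
    }

That order dependence is exactly why I assumed the datapath requires
non-overlapping flows.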

Please clarify this. And in which sense does OVS rely on this behavior?

BR, Jan

> -Original Message-
> From: Ilya Maximets 
> Sent: Tuesday, 18 October 2022 21:40
> To: Eelco Chaudron ; d...@openvswitch.org
> Cc: i.maxim...@ovn.org; Jan Scheurich 
> Subject: Re: [ovs-dev] [PATCH] dpif-netdev: Use unmasked key when adding
> datapath flows.
> 
> On 10/18/22 18:42, Eelco Chaudron wrote:
> > The datapath supports installing wider flows, and OVS relies on this
> > behavior. For example if ipv4(src=1.1.1.1/192.0.0.0,
> > dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of
> > ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed to be
> > added.
> >
> > However, if we try to add a wildcard rule, the installation fails:
> >
> > # ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
> >   ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
> > # ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
> >   ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
> > ovs-vswitchd: updating flow table (File exists)
> >
> > The reason is that the key used to determine if the flow is already
> > present in the system uses the original key ANDed with the mask.
> > This results in the IP address not being part of the (miniflow) key,
> > i.e., being substituted with an all-zero value. When doing the actual
> > lookup, this results in the key wrongfully matching the first flow,
> > and therefore the flow does not get installed. The solution is to use
> > the unmasked key for the existence check, the same way this is handled
> > in the userspace datapath.
> >
> > Signed-off-by: Eelco Chaudron 
> > ---
> >  lib/dpif-netdev.c|   33 +
> >  tests/dpif-netdev.at |   14 ++
> >  2 files changed, 43 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index a45b46014..daa00aa2f 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -3321,6 +3321,28 @@ netdev_flow_key_init_masked(struct netdev_flow_key *dst,
> >                              (dst_u64 - miniflow_get_values(&dst->mf)) * 8);
> >  }
> >
> > +/* Initializes 'dst' as a copy of 'flow'. */
> > +static inline void
> > +netdev_flow_key_init(struct netdev_flow_key *key,
> > +                     const struct flow *flow)
> > +{
> > +    uint64_t *dst = miniflow_values(&key->mf);
> > +    uint32_t hash = 0;
> > +    uint64_t value;
> > +
> > +    miniflow_map_init(&key->mf, flow);
> > +    miniflow_init(&key->mf, flow);
> > +
> > +    size_t n = dst - miniflow_get_values(&key->mf);
> > +
> > +    FLOW_FOR_EACH_IN_MAPS (value, flow, key->mf.map) {
> > +        hash = hash_add64(hash, value);
> > +    }
> > +
> > +    key->hash = hash_finish(hash, n * 8);
> > +    key->len = netdev_flow_key_size(n);
> > +}
> > +
> >  static inline void
> >  emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
> >                   const struct netdev_flow_key *key)
> > @@ -4195,7 +4217,7 @@ static int
> >  dpif_netdev_flow_put(struct dpif *dpif, const struct dpif_flow_put *put)
> >  {
> >  struct dp_netdev *dp = get_dp_netdev(dpif);
> > -struct netdev_flow_key key, mask;
> > +struct netdev_flow_key key;
> >  struct dp_netdev_pmd_thread *pmd;
> >  struct match match;
> >  ovs_u128 ufid;
> > @@ -4244,9 +4266,12 @@ dpif_netdev_flow_put(struct dpif *dpif, const struct dpif_flow_put *put)
> >
> >      /* Must produce a netdev_flow_key for lookup.
> >       * Use the same method as employed to create the key when adding
> > -     * the flow to the dplcs to make sure they match. */
> > -    netdev_flow_mask_init(&mask, &match);
> > -    netdev_flow_key_init_masked(&key, &match.flow, &mask);
> > +     * the flow to the dplcs to make sure they match.
> > +     * We need to put in 

Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

2022-07-21 Thread Jan Scheurich via dev
Hi,

True, the cost of polling a packet from a physical port on a remote NUMA node 
is slightly higher than from local NUMA node. Hence the cross-NUMA polling of 
rx queues has some overhead. However, the packet processing cost is much more 
influenced by the location of the target vhostuser ports. If the majority of the 
rx queue traffic is going to a VM on the other NUMA node, it is actually 
*better* to poll the packets in a PMD on the VM's NUMA node.

Long story short, OVS doesn't have sufficient data to be able to correctly 
predict the actual rxq load when assigned to another PMD in a different queue 
configuration. The rxq processing cycles measured on the current PMD are the 
best estimate we have for balancing the overall load on the PMDs. We need to 
live with the inevitable inaccuracies.

My main point is: these inaccuracies don't matter. The purpose of balancing the 
load over PMDs is *not* to minimize the total cycles spent by PMDs on 
processing packets. The PMDs run in a busy loop anyhow and burn all cycles of 
the CPU. The purpose is to prevent some PMD from unnecessarily getting congested 
(i.e. load > 95%) while others have a lot of spare capacity and could take over 
some rxqs.
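
To make that concrete with toy numbers (purely illustrative): two assignments can
carry exactly the same total load but have a completely different spread, and it
is the spread (which the ALB dry run measures as the per-PMD load variance) that
matters:

    #include <stdio.h>

    /* Per-PMD load variance for two hypothetical 4-PMD assignments with
     * identical total load but a different spread. */
    static double
    variance(const double *load, int n)
    {
        double mean = 0.0, var = 0.0;

        for (int i = 0; i < n; i++) {
            mean += load[i] / n;
        }
        for (int i = 0; i < n; i++) {
            var += (load[i] - mean) * (load[i] - mean) / n;
        }
        return var;
    }

    int
    main(void)
    {
        double congested[] = { 97.0, 96.0, 40.0, 39.0 };  /* % busy cycles */
        double balanced[]  = { 69.0, 68.0, 68.0, 67.0 };  /* same total    */

        printf("variance congested: %.1f\n", variance(congested, 4));
        printf("variance balanced:  %.1f\n", variance(balanced, 4));
        return 0;
    }

The first assignment has two congested PMDs and two nearly idle ones; the second
carries the same total load with headroom everywhere.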

Cross-NUMA polling of physical port rxqs has proven to be an extremely valuable 
tool to help OVS's cycle-based rxq-balancing algorithm to do its job, and I 
strongly suggest we allow the proposed per-port opt-in option.

BR, Jan

From: Anurag Agarwal 
Sent: Thursday, 21 July 2022 07:15
To: lic...@chinatelecom.cn
Cc: Jan Scheurich 
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports

Hello Cheng,
With cross-numa enabled, we flatten the PMD list across NUMAs and select 
the least loaded PMD. Thus I would not like to consider the case below.

Regards,
Anurag

From: lic...@chinatelecom.cn
Sent: Thursday, July 21, 2022 8:19 AM
To: Anurag Agarwal <anurag.agar...@ericsson.com>
Cc: Jan Scheurich <jan.scheur...@ericsson.com>
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports

Hi Anurag,

"If local numa has bandwidth for rxq, we are not supposed to assign a rxq to 
remote pmd."
Would you like to consider this case? If not, I think we don't have to resolve 
the cycles measurement issue for cross numa case.


李成

From: Anurag Agarwal
Date: 2022-07-21 10:21
To: lic...@chinatelecom.cn
CC: Jan Scheurich
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports
+ Jan

Hello Cheng,
  Thanks for your insightful comments. Please find my inputs inline.

Regards,
Anurag

From: lic...@chinatelecom.cn
Sent: Monday, July 11, 2022 7:51 AM
To: Anurag Agarwal <anurag.agar...@ericsson.com>
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports

Hi Anurag,

Sorry for late reply, I was busy on a task last two weeks.

I think your proposal can cover the case I reported. It looks good to me.
>> Thanks for your review and positive feedback

However, to enable cross-numa rxq polling, we may have another problem to 
address.

From my test, cross-numa polling has worse performance than numa-affinity 
polling (at least 10% worse).
So if the local numa has bandwidth for an rxq, we are not supposed to assign the 
rxq to a remote pmd.
Unfortunately, we can't tell whether a pmd is out of bandwidth from its assigned 
rxq cycles, because the rx batch size impacts the measured rxq cycles a lot in 
my test:
rx batch    cycles per pkt
   1.00           5738
   5.00           2353
  12.15           1770
  32.00           1533

The faster packets arrive, the larger the rx batch size. And the more rxqs a pmd 
is assigned, the larger the rx batch size as well.
Imagine that pmd pA has only one rxq assigned. Packets arrive at 1 packet per 
5738 cycles, so the rxq's rx batch size is 1.00.
Now pA has 2 rxqs assigned, and each rxq receives packets at 1 packet per 5738 
cycles. The pmd spends 5738 cycles processing the first rxq, and then the second.
When it comes back to the first rxq, that rxq has 2 packets ready (because 
2*5738 cycles have passed).
The rxq batch size becomes 2.
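
Just to make the effect concrete, here is a rough back-of-the-envelope model
(the fixed per-poll cost and the per-packet cost below are back-fitted from the
table above, so they are assumptions rather than measurements) that reproduces
the numbers fairly well:

    #include <stdio.h>

    int
    main(void)
    {
        /* Assumed model: cycles per poll = F + P * batch, so
         * cycles per packet = F / batch + P.  F and P are illustrative,
         * back-fitted values.  Measured: 5738, 2353, 1770, 1533. */
        const double F = 4340.0;   /* fixed cost per rxq poll */
        const double P = 1400.0;   /* cost per packet */
        const double batches[] = { 1.00, 5.00, 12.15, 32.00 };

        for (int i = 0; i < 4; i++) {
            printf("batch %6.2f -> ~%5.0f cycles/pkt\n",
                   batches[i], F / batches[i] + P);
        }
        return 0;
    }

So the very same rxq can report very different cycles per packet depending only
on how busy its pmd already is.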
>> Ok. Do you think it is a more generic problem with cycles measurement and 
>> PMD utilization? Not specific to cross-numa feature..

So it's hard to say if a pmd is overloaded from the rxq cycles.
In any case, I think the cross-numa feature is very nice. I will put effort into 
this as well to cover cases in our company.
Let's keep in sync on progress :)
>> Thanks



李成

From: Anurag Agarwal
Date: 2022-06-29 14:12
To: lic...@chinatelecom.cn
Subject: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 

Re: [ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels

2022-04-04 Thread Jan Scheurich via dev
> I found this one though:
>   https://patchwork.ozlabs.org/project/openvswitch/patch/20220403222617.31688-1-jan.scheurich@web.de/
> 
> It was held back by the mail list and appears to be better formatted.
> Let's see if it works.
> 
> P.S. I marked that email address for acceptance, so the mail list
>  should no longer block it.

Thanks Ilya, that is a good work-around for me until I find a solution for SMTP 
with my Ericsson mail.

Regards, Jan



Re: [ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels

2022-04-04 Thread Jan Scheurich via dev
Hi Ilya,

Sorry for spamming, but I am again having problems sending correctly formatted 
patches to the ovs-dev list. My previous SMTP server for git-send-email no 
longer works, and patches I send through my private SMTP provider do not reach 
the mailing list. Resending from Outlook obviously still screws up the patch ☹.

Any chance that you could pick the PATCH v2 manually and feed it through the 
process? I doubt that I will find a solution to my mailing issues soon.

Thanks, Jan

> -Original Message-
> From: 0-day Robot 
> Sent: Monday, 4 April, 2022 09:56
> To: Jan Scheurich 
> Cc: d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding
> packet between legacy_l3 tunnels
> 
> Bleep bloop.  Greetings Jan Scheurich, I am a robot and I have tried out your
> patch.
> Thanks for your contribution.
> 
> I encountered some error that I wasn't expecting.  See the details below.
> 
> 
> git-am:
> error: corrupt patch at line 85
> error: could not build fake ancestor
> hint: Use 'git am --show-current-patch' to see the failed patch
> Patch failed at 0001 ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".
> 
> 
> Patch skipped due to previous failure.
> 
> Please check this out.  If you feel there has been an error, please email
> acon...@redhat.com
> 
> Thanks,
> 0-day Robot


[ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels

2022-04-04 Thread Jan Scheurich via dev
From: Jan Scheurich 

A packet received from a tunnel port with legacy_l3 packet-type (e.g.
lisp, L3 gre, gtpu) is conceptually wrapped in a dummy Ethernet header
for processing in an OF pipeline that is not packet-type-aware. Before
transmission of the packet to another legacy_l3 tunnel port, the dummy
Ethernet header is stripped again.

In ofproto-xlate, wrapping in the dummy Ethernet header is done by 
simply changing the packet_type to PT_ETH. The generation of the 
push_eth datapath action is deferred until the packet's flow changes 
need to be committed, for example at output to a normal port. The 
deferred Ethernet encapsulation is marked in the pending_encap flag.

This patch fixes a bug in the translation of the output action to a
legacy_l3 tunnel port, where the packet_type of the flow is reverted 
from PT_ETH to PT_IPV4 or PT_IPV6 (depending on the dl_type) to 
remove its Ethernet header without clearing the pending_encap flag 
if it was set. At the subsequent commit of the flow changes, the 
unexpected combination of pending_encap == true with a PT_IPV4 
or PT_IPV6 packet_type hits the OVS_NOT_REACHED() abort.

The pending_encap flag is now cleared in this situation.

Reported-By: Dincer Beken 
Signed-off-By: Jan Scheurich 
Signed-off-By: Dincer Beken 

v1->v2:
  A specific test has been added to tunnel-push-pop.at to verify the
  correct behavior.
---
 ofproto/ofproto-dpif-xlate.c |  4 
 tests/tunnel-push-pop.at | 22 ++
 2 files changed, 26 insertions(+)

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index 
cc9c1c628..1843a5d66 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4195,6 +4195,10 @@ compose_output_action__(struct xlate_ctx *ctx, 
ofp_port_t ofp_port,
 if (xport->pt_mode == NETDEV_PT_LEGACY_L3) {
 flow->packet_type = PACKET_TYPE_BE(OFPHTN_ETHERTYPE,
ntohs(flow->dl_type));
+if (ctx->pending_encap) {
+/* The Ethernet header was not actually added yet. */
+ctx->pending_encap = false;
+}
 }
 }

diff --git a/tests/tunnel-push-pop.at b/tests/tunnel-push-pop.at index 
57589758f..70462d905 100644
--- a/tests/tunnel-push-pop.at
+++ b/tests/tunnel-push-pop.at
@@ -546,6 +546,28 @@ AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  
[[37]]' | sort], [0], [dnl
   port  7: rx pkts=5, bytes=434, drop=?, errs=?, frame=?, over=?, crc=?
 ])

+dnl Send out packets received from L3GRE tunnel back to L3GRE tunnel 
+AT_CHECK([ovs-ofctl del-flows int-br]) AT_CHECK([ovs-ofctl add-flow 
+int-br "in_port=7,actions=set_field:3->in_port,7"])
+AT_CHECK([ovs-vsctl -- set Interface br0 options:pcap=br0.pcap])
+
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
+'aa55aa55001b213cab640800457079464000402fba630101025c0101025820
+00080001c84554ba20400184861e011e024227e75400030
+af31955f2650100101112131415161718191a1b1c1d1e1f20212223
+2425262728292a2b2c2d2e2f3031323334353637'])
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
+'aa55aa55001b213cab640800457079464000402fba630101025c0101025820
+00080001c84554ba20400184861e011e024227e75400030
+af31955f2650100101112131415161718191a1b1c1d1e1f20212223
+2425262728292a2b2c2d2e2f3031323334353637'])
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
+'aa55aa55001b213cab640800457079464000402fba630101025c0101025820
+00080001c84554ba20400184861e011e024227e75400030
+af31955f2650100101112131415161718191a1b1c1d1e1f20212223
+2425262728292a2b2c2d2e2f3031323334353637'])
+
+ovs-appctl time/warp 1000
+
+AT_CHECK([ovs-pcap p0.pcap > p0.pcap.txt 2>&1]) AT_CHECK([tail -6 
+p0.pcap.txt], [0], [dnl
+aa55aa55001b213cab640800457079464000402fba630101025c01010258200
+0080001c84554ba20400184861e011e024227e75400030a
+f31955f2650100101112131415161718191a1b1c1d1e1f202122232
+425262728292a2b2c2d2e2f3031323334353637
+001b213cab64aa55aa55080045704000402f33aa010102580101025c200
+0080001c84554ba20400184861e011e024227e75400030a
+f31955f2650100101112131415161718191a1b1c1d1e1f202122232
+425262728292a2b2c2d2e2f3031323334353637
+aa55aa55001b213cab640800457079464000402fba630101025c01010258200
+0080001c84554ba20400184861e011e024227e75400030a
+f31955f2650100101112131415161718191a1b1c1d1e1f202122232
+425262728292a2b2c2d2e2f3031323334353637
+001b213cab64aa55aa55080045704000402f33aa010102580101025c200
+0080001c84554ba20400184861e011e024227e75400030a
+f31955f2650100101112131415161718191a1b1c1d1e1f202122232
+425262728292a2b2c2d2e2f3031323334353637
+aa55aa55001b213cab640800457079464000402fba630101025c01010258200

Re: [ovs-dev] [PATCH v2] dpif-netdev: Allow cross-NUMA polling on selected ports

2022-03-24 Thread Jan Scheurich via dev
Hi Kevin,

This was a bit of a misunderstanding. We didn't check your RFC patch carefully 
enough to realize that you had meant to encompass our cross-numa-polling 
function in that RFC patch. Sorry for the confusion.

I wouldn't say we are particularly keen on upstreaming exactly our 
implementation of cross-numa-polling for ALB, as long as we get the 
functionality with a per-interface configuration option (preferably as in out 
patch, so that we can maintain backward compatibility with our downstream 
solution).

I suggest we have a closer look at your RFC and come back with comments on that.

> 
> 'roundrobin' and 'cycles', are dependent on RR of cores per numa, so I agree 
> it
> makes sense in those cases to still RR the numas. Otherwise a mix of cross-
> numa enabled and cross-numa disabled interfaces would conflict with each
> other when selecting a core. So even though the user has selected to ignore
> numa for this interface, we don't have a choice but to still RR the numa.
> 
> For 'group' it is about finding the lowest loaded pmd core. In that case we 
> don't
> need to RR numa. We can just find the lowest loaded pmd core from any numa.
> 
> This is better because the user is choosing to remove numa based selection for
> the interface and as the rxq scheduling algorthim is not dependent on it, we
> can fully remove it too by checking pmds from all numas. I have done this in 
> my
> RFC.
> 
> It is also better to do this where possible because there may not be same
> amount of pmd cores on each numa, or one numa could already be more
> heavily loaded than the other.
> 
> Another difference is with above for 'group' I added tiebreaker for a 
> local-to-
> interface numa pmd to be selected if multiple pmds from different numas were
> available with same load. This is most likely to be helpful for initial 
> selection
> when there is no load on any pmds.

At first glance I agree with your reasoning. Choosing the least-loaded PMD from 
all NUMAs using NUMA-locality as a tie-breaker makes sense in 'group' 
algorithm. How do you select between equally loaded PMDs on the same NUMA node? 
Just pick any?
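
To make sure we are talking about the same selection logic, here is a minimal
sketch of how I picture it (illustrative C only, not the actual dpif-netdev
code; all names are made up): flatten the PMDs across NUMAs, pick the
least-loaded one, and use NUMA-locality only as a tie-breaker.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct pmd {
        int core_id;
        int numa_id;
        uint64_t assigned_cycles;   /* load already assigned to this pmd */
    };

    /* Pick the least-loaded PMD from a flat list spanning all NUMA nodes.
     * On equal load, prefer a PMD local to the rxq's NUMA node; among
     * equally loaded PMDs on the same node the first one found wins. */
    static struct pmd *
    select_pmd(struct pmd *pmds, size_t n_pmds, int rxq_numa)
    {
        struct pmd *best = NULL;

        for (size_t i = 0; i < n_pmds; i++) {
            struct pmd *p = &pmds[i];

            if (!best
                || p->assigned_cycles < best->assigned_cycles
                || (p->assigned_cycles == best->assigned_cycles
                    && p->numa_id == rxq_numa && best->numa_id != rxq_numa)) {
                best = p;
            }
        }
        return best;
    }

    int
    main(void)
    {
        struct pmd pmds[] = {
            { .core_id = 2, .numa_id = 0, .assigned_cycles = 100 },
            { .core_id = 4, .numa_id = 1, .assigned_cycles = 100 },
            { .core_id = 6, .numa_id = 1, .assigned_cycles = 300 },
        };

        /* The rxq lives on NUMA 1: cores 2 and 4 are equally loaded, so
         * the NUMA-local core 4 wins the tie. */
        printf("selected core %d\n", select_pmd(pmds, 3, 1)->core_id);
        return 0;
    }

In this sketch, equally loaded PMDs on the same NUMA node are resolved
arbitrarily (first one found wins), hence my question above.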

BR, Jan



Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-03-09 Thread Jan Scheurich via dev
> Thanks for sharing your experience with it. My fear with the proposal is that
> someone turns this on and then tells us performance is worse and/or OVS
> assignments/ALB are broken, because it has an impact on their case.
> 
> In terms of limiting possible negative effects,
> - it can be opt-in and recommended only for phy ports
> - could print a warning when it is enabled
> - ALB is currently disabled with cross-numa polling (except a limited
> case) but it's clear you want to remove that restriction too
> - for ALB, a user could increase the improvement threshold to account for any
> reassignments triggered by inaccuracies

[Jan] Yes, we want to enable cross-NUMA polling of selected (typically phy) 
ports in ALB "group" mode as an opt-in config option (default off). Based on 
our observations we are not too much concerned with the loss of ALB prediction 
accuracy but increasing the threshold may be a way of taking that into account, 
if wanted.

> 
> There is also some improvements that can be made to the proposed method
> when used with group assignment,
> - we can prefer local numa where there is no difference between pmd cores.
> (e.g. two unused cores available, pick the local numa one)
> - we can flatten the list of pmds, so best pmd can be selected. This will 
> remove
> issues with RR numa when there are different num of pmd cores or loads per
> numa.
> - I wrote an RFC that does these two items, I can post when(/if!) consensus is
> reached on the broader topic

[Jan] In our alternative version of the current upstream "group" ALB [1] we 
already maintained a flat list of PMDs. So we would support that feature. Using 
NUMA-locality as a tie-breaker makes sense.

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384546.html

> 
> In summary, it's a trade-off,
> 
> With no cross-numa polling (current):
> - won't have any impact to OVS assignment or ALB accuracy
> - there could be a bottleneck on one numa pmds while other numa pmd cores
> are idle and unused
> 
> With cross-numa rx pinning (current):
> - will have access to pmd cores on all numas
> - may require more cycles for some traffic paths
> - won't have any impact to OVS assignment or ALB accuracy
> - >1 pinned rxqs per core may cause a bottleneck depending on traffic
> 
> With cross-numa interface setting (proposed):
> - will have access to all pmd cores on all numas (i.e. no unused pmd cores
> during highest load)
> - will require more cycles for some traffic paths
> - will impact on OVS assignment and ALB accuracy
> 
> Anything missing above, or is it a reasonable summary?

I think that is a reasonable summary, albeit I would have characterized the 
third option a bit more positively:
- Gives ALB maximum freedom to balance load of PMDs on all NUMA nodes (in the 
likely scenario of uneven VM load on the NUMAs)
- Accepts an increase of cycles on cross-NUMA paths for a better utilization of 
free PMD cycles
- Mostly suitable for phy ports due to limited cycle increase for cross-NUMA 
polling of phy rx queues
- Could negatively impact the ALB prediction accuracy in certain scenarios

We will post a new version of our patch [2] for cross-numa polling on selected 
ports adapted to the current OVS master shortly.

[2] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html

Thanks, Jan




Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-17 Thread Jan Scheurich via dev
Hi Kevin,

> > We have done extensive benchmarking and found that we get better overall
> PMD load balance and resulting OVS performance when we do not statically
> pin any rx queues and instead let the auto-load-balancing find the optimal
> distribution of phy rx queues over both NUMA nodes to balance an asymmetric
> load of vhu rx queues (polled only on the local NUMA node).
> >
> > Cross-NUMA polling of vhu rx queues comes with a very high latency cost due
> to cross-NUMA access to volatile virtio ring pointers in every iteration (not 
> only
> when actually copying packets). Cross-NUMA polling of phy rx queues doesn't
> have a similar issue.
> >
> 
> I agree that for vhost rxq polling, it always causes a performance penalty 
> when
> there is cross-numa polling.
> 
> For polling phy rxq, when phy and vhost are in different numas, I don't see 
> any
> additional penalty for cross-numa polling the phy rxq.
> 
> For the case where phy and vhost are both in the same numa, if I change to 
> poll
> the phy rxq cross-numa, then I see about a >20% tput drop for traffic from 
> phy -> vhost. Are you seeing that too?

Yes, but the performance drop is mostly due to the extra cost of copying the 
packets across the UPI bus to the virtio buffers on the other NUMA, not because 
of polling the phy rxq on the other NUMA.

> 
> Also, the fact that a different numa can poll the phy rxq after every 
> rebalance
> means that the ability of the auto-load-balancer to estimate and trigger a
> rebalance is impacted.

Agree, there is some inaccuracy in the estimation of the load a phy rx queue 
creates when it is moved to another NUMA node. So far we have not seen that as 
a practical problem.

> 
> It seems like simple pinning some phy rxqs cross-numa would avoid all the
> issues above and give most of the benefit of cross-numa polling for phy rxqs.

That is what we have done in the past (for lack of alternatives). But any 
static pinning reduces the ability of the auto-load balancer to do its job. 
Consider the following scenarios:

1. The phy ingress traffic is not evenly distributed by RSS due to lack of 
entropy (Examples for this are IP-IP encapsulated traffic, e.g. Calico, or 
MPLSoGRE encapsulated traffic).

2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu 
ports are all on NUMA 0.

In all such scenarios, static pinning of phy rxqs may lead to unnecessarily 
uneven PMD load and loss of overall capacity.

> 
> With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS could
> still assign other rxqs to those cores which have with pinned phy rxqs and
> properly adjust the assignments based on the load from the pinned rxqs.

Yes, sometimes the vhu rxq load is distributed such that it can be used to 
balance the PMD, but not always. Sometimes the balance is just better when phy 
rxqs are not pinned.

> 
> New assignments or auto-load-balance would not change the numa polling
> those rxqs, so it it would have no impact to ALB or ability to assign based on
> load.

In our practical experience the new "group" algorithm for load-based rxq 
distribution is able to balance the PMD load best when none of the rxqs are 
pinned and cross-NUMA polling of phy rxqs is enabled. So the effect of the 
prediction error when doing auto-lb dry-runs cannot be significant.

In our experience we consistently get the best PMD balance and OVS throughput 
when we give the auto-lb free rein (no cross-NUMA polling of vhu rxqs, 
though).

BR, Jan


Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-14 Thread Jan Scheurich via dev
> > We do acknowledge the benefit of non-pinned polling of phy rx queues by
> PMD threads on all NUMA nodes. It gives the auto-load balancer much better
> options to utilize spare capacity on PMDs on all NUMA nodes.
> >
> > Our patch proposed in
> > https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
> > indeed covers the difference between phy and vhu ports.
> > One has to explicitly enable cross-NUMA-polling for individual interfaces
> with:
> >
> >    ovs-vsctl set interface <iface> other_config:cross-numa-polling=true
> >
> > This would typically only be done by static configuration for the fixed set 
> > of
> physical ports. There is no code in the OpenStack's os-vif handler to apply 
> such
> configuration for dynamically created vhu ports.
> >
> > I would strongly suggest that cross-num-polling be introduced as a per-
> interface option as in our patch rather than as a per-datapath option as in 
> your
> patch. Why not adapt our original patch to the latest OVS code base? We can
> help you with that.
> >
> > BR, Jan
> >
> 
> Hi, Jan Scheurich
> 
> We can achieve the static setting of pinning a phy port by combining pmd-rxq-
> isolate and pmd-rxq-affinity.  This setting can get the same result. And we 
> have
> seen the benefits.
> The new issue is the polling of vhu on one numa. Under heavy traffic, polling
> vhu + phy will make the pmds reach 100% usage. While other pmds on the
> other numa with only phy port reaches 70% usage. Enabling cross-numa polling
> for a vhu port would give us more benefits in this case. Overloads of 
> different
> pmds on both numa would be balanced.
> As you have mentioned, there is no code to apply this config for vhu while
> creating them. A global setting would save us from dynamically detecting the
> vhu name or any new creation.

Hi Wan Junjie,

We have done extensive benchmarking and found that we get better overall PMD 
load balance and resulting OVS performance when we do not statically pin any rx 
queues and instead let the auto-load-balancing find the optimal distribution of 
phy rx queues over both NUMA nodes to balance an asymmetric load of vhu rx 
queues (polled only on the local NUMA node).

Cross-NUMA polling of vhu rx queues comes with a very high latency cost due to 
cross-NUMA access to volatile virtio ring pointers in every iteration (not only 
when actually copying packets). Cross-NUMA polling of phy rx queues doesn't 
have a similar issue.

BR, Jan



Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-14 Thread Jan Scheurich via dev
> >
> > Btw, this patch is similar in functionality to the one posted by
> > Anurag [0] and there was also some discussion about this approach here [1].
> >
> 
> Thanks for pointing this out.
> IMO, setting interface cross-numa would be good for phy port but not good for
> vhu.  Since vhu can be destroyed and created relatively frequently.
> But yes the main idea is the same.
> 

We do acknowledge the benefit of non-pinned polling of phy rx queues by PMD 
threads on all NUMA nodes. It gives the auto-load balancer much better options 
to utilize spare capacity on PMDs on all NUMA nodes.

Our patch proposed in 
https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
indeed covers the difference between phy and vhu ports. One has to explicitly 
enable cross-NUMA-polling for individual interfaces with:

   ovs-vsctl set interface <iface> other_config:cross-numa-polling=true

This would typically only be done by static configuration for the fixed set of 
physical ports. There is no code in the OpenStack's os-vif handler to apply 
such configuration for dynamically created vhu ports.

I would strongly suggest that cross-numa-polling be introduced as a 
per-interface option as in our patch rather than as a per-datapath option as in 
your patch. Why not adapt our original patch to the latest OVS code base? We 
can help you with that.

BR, Jan



Re: [ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance use common rxq scheduling.

2021-07-14 Thread Jan Scheurich via dev
> > In our patch series we decided to skip the check on cross-numa polling 
> > during
> auto-load balancing. The rationale is as follows:
> >
> > If the estimated PMD-rxq distribution includes cross-NUMA rxq assignments,
> the same must apply for the current distribution, as none of the scheduling
> algorithms would voluntarily assign rxqs across NUMA nodes. So, current and
> estimated rxq assignments are comparable and it makes sense to consider
> rebalancing when the variance improves.
> >
> > Please consider removing this check.
> >
> 
> The first thing is that this patch is not changing any behaviour, just re-
> implementing to reuse the common code, so it would not be the place to
> change this functionality.

Fair enough. We should address this in a separate patch.

> About the proposed change itself, just to be clear what is allowed currently. 
> It
> will allow rebalance when there are local pmds, OR there are no local pmds
> and there is one other NUMA node with pmds available for cross-numa polling.
> 
> The rationale of not doing a rebalance when there are no local pmds but
> multiple other NUMAs available for cross-NUMA polling is that the estimate
> may be incorrect due a different cross-NUMA being choosen for an Rxq than is
> currently used.
> 
> I thought about some things like making an Rxq sticky with a particular cross-
> NUMA etc for this case but that brings a whole new set of problems, e.g. what
> happens if that NUMA gets overloaded, reduced cores, how can it ever be reset
> etc. so I decided not to pursue it as I think it is probably a corner case 
> (at least
> for now).

We currently don't see any scenarios with more than two NUMA nodes, but 
different CPU/server architectures may perhaps have more NUMA nodes than CPU 
sockets. 

> I know the case of no local pmd and one NUMA with pmds is not a corner case
> as I'm aware of users doing that.

Agree such configurations are a must to support with auto-lb.

> We can discuss further about the multiple non-local NUMA case and maybe
> there's some improvements we can think of, or maybe I've made some wrong
> assumptions but it would be a follow on from the current patchset.

Our main use case for cross-NUMA balancing comes with the additional freedom to 
allow cross-NUMA polling for selected ports that we introduce with fourth patch:

dpif-netdev: Allow cross-NUMA polling on selected ports

Today dpif-netdev considers PMD threads on a non-local NUMA node for
automatic assignment of the rxqs of a port only if there are no local,
non-isolated PMDs.

On typical servers with both physical ports on one NUMA node, this often
leaves the PMDs on the other NUMA node under-utilized, wasting CPU
resources. The alternative, to manually pin the rxqs to PMDs on remote
NUMA nodes, also has drawbacks as it limits OVS' ability to auto
load-balance the rxqs.

This patch introduces a new interface configuration option to allow
ports to be automatically polled by PMDs on any NUMA node:

ovs-vsctl set interface <iface> other_config:cross-numa-polling=true

If this option is not present or set to false, legacy behaviour applies.

We indeed use this for our physical ports to be polled by non-isolated PMDs on 
both NUMAs. The observed capacity improvement is very substantial, so we plan 
to port this feature on top of your patches once they are merged. 

This can only fly if the auto-load balancing is allowed to activate rxq 
assignments with cross-numa polling also in the case there are local 
non-isolated PMDs.

Anyway, we can take this up later in our upcoming patch that introduces this 
option.

BR, Jan


Re: [ovs-dev] [PATCH v4 6/7] dpif-netdev: Allow pin rxq and non-isolate PMD.

2021-07-14 Thread Jan Scheurich via dev
> >> +If using ``pmd-rxq-assign=group`` PMD threads with *pinned* Rxqs can
> >> +be
> >> +*non-isolated* by setting::
> >> +
> >> +  $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
> >
> > Is there any specific reason why the new pmd-rxq-isolate option should be
> limited to the "group" scheduling algorithm? In my view it would make more
> sense and simplify documentation and code if these aspects of scheduling were
> kept orthogonal.
> >
> 
> David had a similar comment on an earlier version. I will add the fuller reply
> below. In summary, pinning and the other algorithms (particularly if there was
> multiple pinnings) conflict in how they operate because they are based on RR
> pmd's to add equal number of Rxqs/PMD.
> 
> ---
> Yes, the main issue is that the other algorithms are based on a pmd order on
> the assumption that they start from empty. For 'roundrobin' it is to equally
> distribute num of rxqs on pmd RR - if we pin several rxqs on one pmds, it is 
> not
> clear what to do etc. Even worse for 'cycles' it is based on placing rxqs in 
> order
> of busy cycles on the pmds RR. If we pin an rxq initially we are conflicting
> against how the algorithm operates.
> 
> Because 'group' algorithm is not tied to a defined pmd order, it is better 
> suited
> to be able to place a pinned rxq initially and be able to consider that pmd on
> it's merits along with the others later.
> ---

Makes sense. The legacy round-robin and cycles algorithms are not really 
prepared for pinned rxqs.

In our patch set we had replaced the original round-robin and cycles algorithms 
with a single new algorithm that simply distributed the rxqs to the 
"least-loaded" non-isolated PMD (based on #rxqs or cycles assigned so far). 
With pinning=isolation, this exactly reproduced the round-robin behavior and 
created at least as good balance as the original cycles algorithm. In that new 
algorithm any pinned rxqs were already considered and didn't lead to distorted 
results.

BR, Jan


Re: [ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance use common rxq scheduling.

2021-07-13 Thread Jan Scheurich via dev
> -Original Message-
> From: dev  On Behalf Of Kevin Traynor
> Sent: Thursday, 8 July, 2021 15:54
> To: d...@openvswitch.org
> Cc: david.march...@redhat.com
> Subject: [ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance
> use common rxq scheduling.
> 
> PMD auto load balance had its own separate implementation of the rxq
> scheduling that it used for dry runs. This was done because previously the rxq
> scheduling was not made reusable for a dry run.
> 
> Apart from the code duplication (which is a good enough reason to replace it
> alone) this meant that if any further rxq scheduling changes or assignment
> types were added they would also have to be duplicated in the auto load
> balance code too.
> 
> This patch replaces the current PMD auto load balance rxq scheduling code to
> reuse the common rxq scheduling code.
> 
> The behaviour does not change from a user perspective, except the logs are
> updated to be more consistent.
> 
> As the dry run will compare the pmd load variances for current and estimated
> assignments, new functions are added to populate the current assignments and
> use the rxq scheduling data structs for variance calculations.
> 
> Now that the new rxq scheduling data structures are being used in PMD auto
> load balance, the older rr_* data structs and associated functions can be
> removed.
> 
> Signed-off-by: Kevin Traynor 
> ---
>  lib/dpif-netdev.c | 508 +++---
>  1 file changed, 161 insertions(+), 347 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index beafa00a0..338ffd971
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4903,138 +4903,4 @@ port_reconfigure(struct dp_netdev_port *port)
>  }
> 
> -struct rr_numa_list {
> -    struct hmap numas;  /* Contains 'struct rr_numa' */
> -};
> -
> -struct rr_numa {
> -    struct hmap_node node;
> -
> -    int numa_id;
> -
> -    /* Non isolated pmds on numa node 'numa_id' */
> -    struct dp_netdev_pmd_thread **pmds;
> -    int n_pmds;
> -
> -    int cur_index;
> -    bool idx_inc;
> -};
> -
> -static size_t
> -rr_numa_list_count(struct rr_numa_list *rr)
> -{
> -    return hmap_count(&rr->numas);
> -}
> -
> -static struct rr_numa *
> -rr_numa_list_lookup(struct rr_numa_list *rr, int numa_id)
> -{
> -    struct rr_numa *numa;
> -
> -    HMAP_FOR_EACH_WITH_HASH (numa, node, hash_int(numa_id, 0), &rr->numas) {
> -        if (numa->numa_id == numa_id) {
> -            return numa;
> -        }
> -    }
> -
> -    return NULL;
> -}
> -
> -/* Returns the next node in numa list following 'numa' in round-robin fashion.
> - * Returns first node if 'numa' is a null pointer or the last node in 'rr'.
> - * Returns NULL if 'rr' numa list is empty. */
> -static struct rr_numa *
> -rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa)
> -{
> -    struct hmap_node *node = NULL;
> -
> -    if (numa) {
> -        node = hmap_next(&rr->numas, &numa->node);
> -    }
> -    if (!node) {
> -        node = hmap_first(&rr->numas);
> -    }
> -
> -    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
> -}
> -
> -static void
> -rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
> -{
> -    struct dp_netdev_pmd_thread *pmd;
> -    struct rr_numa *numa;
> -
> -    hmap_init(&rr->numas);
> -
> -    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> -        if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) {
> -            continue;
> -        }
> -
> -        numa = rr_numa_list_lookup(rr, pmd->numa_id);
> -        if (!numa) {
> -            numa = xzalloc(sizeof *numa);
> -            numa->numa_id = pmd->numa_id;
> -            hmap_insert(&rr->numas, &numa->node, hash_int(pmd->numa_id, 0));
> -        }
> -        numa->n_pmds++;
> -        numa->pmds = xrealloc(numa->pmds, numa->n_pmds * sizeof *numa->pmds);
> -        numa->pmds[numa->n_pmds - 1] = pmd;
> -        /* At least one pmd so initialise curr_idx and idx_inc. */
> -        numa->cur_index = 0;
> -        numa->idx_inc = true;
> -    }
> -}
> -
> -/*
> - * Returns the next pmd from the numa node.
> - *
> - * If 'updown' is 'true' it will alternate between selecting the next pmd in
> - * either an up or down walk, switching between up/down when the first or last
> - * core is reached. e.g. 1,2,3,3,2,1,1,2...
> - *
> - * If 'updown' is 'false' it will select the next pmd wrapping around when last
> - * core reached. e.g. 1,2,3,1,2,3,1,2...
> - */
> -static struct dp_netdev_pmd_thread *
> -rr_numa_get_pmd(struct rr_numa *numa, bool updown)
> -{
> -    int numa_idx = numa->cur_index;
> -
> -    if (numa->idx_inc == true) {
> -        /* Incrementing through list of pmds. */
> -        if (numa->cur_index == numa->n_pmds-1) {
> -            /* Reached the last pmd. */
> -            if (updown) {
> -                numa->idx_inc = false;
> -            } else {
> -                numa->cur_index = 0;
> -            }
> -        } else {
> -            numa->cur_index++;
> -        }
> -    }

Re: [ovs-dev] [PATCH v4 6/7] dpif-netdev: Allow pin rxq and non-isolate PMD.

2021-07-13 Thread Jan Scheurich via dev


> -Original Message-
> From: dev  On Behalf Of Kevin Traynor
> Sent: Thursday, 8 July, 2021 15:54
> To: d...@openvswitch.org
> Cc: david.march...@redhat.com
> Subject: [ovs-dev] [PATCH v4 6/7] dpif-netdev: Allow pin rxq and non-isolate
> PMD.
> 
> Pinning an rxq to a PMD with pmd-rxq-affinity may be done for various reasons
> such as reserving a full PMD for an rxq, or to ensure that multiple rxqs from 
> a
> port are handled on different PMDs.
> 
> Previously pmd-rxq-affinity always isolated the PMD so no other rxqs could be
> assigned to it by OVS. There may be cases where there is unused cycles on
> those pmds and the user would like other rxqs to also be able to be assigned 
> to
> it by OVS.
> 
> Add an option to pin the rxq and non-isolate the PMD. The default behaviour is
> unchanged, which is pin and isolate the PMD.
> 
> In order to pin and non-isolate:
> ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
> 
> Note this is available only with group assignment type, as pinning conflicts 
> with
> the operation of the other rxq assignment algorithms.
> 
> Signed-off-by: Kevin Traynor 
> ---
>  Documentation/topics/dpdk/pmd.rst |   9 ++-
>  NEWS  |   3 +
>  lib/dpif-netdev.c |  34 --
>  tests/pmd.at  | 105 ++
>  vswitchd/vswitch.xml  |  19 ++
>  5 files changed, 162 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst
> b/Documentation/topics/dpdk/pmd.rst
> index 29ba53954..30040d703 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -102,6 +102,11 @@ like so:
>  - Queue #3 pinned to core 8
> 
> -PMD threads on cores where Rx queues are *pinned* will become *isolated*.
> This -means that this thread will only poll the *pinned* Rx queues.
> +PMD threads on cores where Rx queues are *pinned* will become
> +*isolated* by default. This means that this thread will only poll the 
> *pinned*
> Rx queues.
> +
> +If using ``pmd-rxq-assign=group`` PMD threads with *pinned* Rxqs can be
> +*non-isolated* by setting::
> +
> +  $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false

Is there any specific reason why the new pmd-rxq-isolate option should be 
limited to the "group" scheduling algorithm? In my view it would make more 
sense and simplify documentation and code if these aspects of scheduling were 
kept orthogonal.

Regards, Jan


Re: [ovs-dev] [PATCH] tests: Fixed L3 over patch port tests

2021-06-09 Thread Jan Scheurich via dev
LGTM.
Acked-by: Jan Scheurich 

> -Original Message-
> From: Martin Varghese 
> Sent: Wednesday, 9 June, 2021 15:36
> To: d...@openvswitch.org; i.maxim...@ovn.org; Jan Scheurich
> 
> Cc: Martin Varghese 
> Subject: [PATCH] tests: Fixed L3 over patch port tests
> 
> From: Martin Varghese 
> 
> Normal action is replaced with output to GRE port for sending
> l3 packets over GRE tunnel. Normal action cannot be used with
> l3 packets.
> 
> Fixes: d03d0cf2b71b ("tests: Extend PTAP unit tests with decap action")
> Signed-off-by: Martin Varghese 
> ---
>  tests/packet-type-aware.at | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/tests/packet-type-aware.at b/tests/packet-type-aware.at index
> 540cf98f3..73aa14cea 100644
> --- a/tests/packet-type-aware.at
> +++ b/tests/packet-type-aware.at
> @@ -697,7 +697,7 @@ AT_CHECK([
>  ovs-ofctl del-flows br1 &&
>  ovs-ofctl del-flows br2 &&
>  ovs-ofctl add-flow br0 in_port=n0,actions=decap,output=p0 -OOpenFlow13
> &&
> -ovs-ofctl add-flow br1 in_port=p1,actions=NORMAL &&
> +ovs-ofctl add-flow br1 in_port=p1,actions=output=gre1 &&
>  ovs-ofctl add-flow br2 in_port=LOCAL,actions=output=n2  ], [0])
> 
> @@ -708,7 +708,7 @@ AT_CHECK([ovs-ofctl -OOpenFlow13 dump-flows br0 |
> ofctl_strip | grep actions],
> 
>  AT_CHECK([ovs-ofctl -OOpenFlow13 dump-flows br1 | ofctl_strip | grep
> actions],  [0], [dnl
> - reset_counts in_port=20 actions=NORMAL
> + reset_counts in_port=20 actions=output:100
>  ])
> 
>  AT_CHECK([ovs-ofctl -OOpenFlow13 dump-flows br2 | ofctl_strip | grep
> actions], @@ -726,7 +726,7 @@ ovs-appctl time/warp 1000  AT_CHECK([
>  ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used |
> grep -v ipv6 | sort  ], [0], [flow-dump from the main thread:
> -
> recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth(src=3a:6d:d2:09:9c:ab,dst=1
> e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(tos=0/0x3,frag=no), packets:1,
> bytes:98, used:0.0s,
> actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst
> =de:af:be:ef:ba:be,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(src=10.0.0.1,dst
> =10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x800))),out
> _port(br2)),n2)
> +recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(t
> +os=0/0x3,frag=no), packets:1, bytes:98, used:0.0s,
> +actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,
> +eth(dst=de:af:be:ef:ba:be,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(sr
> +c=10.0.0.1,dst=10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0
> +x0,proto=0x800))),out_port(br2)),n2)
>  ])
> 
>  AT_CHECK([
> --
> 2.18.4



Re: [ovs-dev] [PATCH v2] Fix redundant datapath set ethernet action with NSH Decap

2021-06-07 Thread Jan Scheurich via dev
> -Original Message-
> From: Martin Varghese 
> Sent: Monday, 7 June, 2021 16:47
> To: Ilya Maximets 
> Cc: d...@openvswitch.org; echau...@redhat.com; Jan Scheurich
> ; Martin Varghese
> 
> Subject: Re: [ovs-dev] [PATCH v2] Fix redundant datapath set ethernet action
> with NSH Decap
> 
> On Wed, May 19, 2021 at 12:26:40PM +0200, Ilya Maximets wrote:
> > On 5/19/21 5:26 AM, Martin Varghese wrote:
> > > On Tue, May 18, 2021 at 10:03:39PM +0200, Ilya Maximets wrote:
> > >> On 5/17/21 3:45 PM, Martin Varghese wrote:
> > >>> From: Martin Varghese 
> > >>>
> > >>> When a decap action is applied on NSH header encapsulatiing a
> > >>> ethernet packet a redundant set mac address action is programmed
> > >>> to the datapath.
> > >>>
> > >>> Fixes: f839892a206a ("OF support and translation of generic encap
> > >>> and decap")
> > >>> Signed-off-by: Martin Varghese 
> > >>> Acked-by: Jan Scheurich 
> > >>> Acked-by: Eelco Chaudron 
> > >>> ---
> > >>> Changes in v2:
> > >>>   - Fixed code styling
> > >>>   - Added Ack from jan.scheur...@ericsson.com
> > >>>   - Added Ack from echau...@redhat.com
> > >>>
> > >>
> > >> Hi, Martin.
> > >> For some reason this patch triggers frequent failures of the
> > >> following unit test:
> > >>
> > >> 2314. packet-type-aware.at:619: testing ptap - L3 over patch port
> > >> ...
> 
> The test is failing as, during revalidation, NORMAL action is dropping 
> packets.
> With these changes, the mac address in flow structures get cleared with decap
> action. Hence the NORMAL action drops the packet assuming a loop (SRC and
> DST mac address are zero). I assume NORMAL action handling in
> xlate_push_stats_entry is not adapted for l3 packet. The timing at which
> revalidator gets triggered explains the sporadicity of the issue. The issue is
> never seen as the MAC addresses in flow structure were not cleared with decap
> before.
> 
> So can we use NORMAL action with a L3 packet ?  Does OVS handle all the L3
> use cases with Normal action ? If not, shouldn't we not use NORMAL action in
> this test case
> 
> Comments?
> 

Good catch! Normal flow L2 bridging is of course nonsense for the use case of 
forwarding an L3 packet. I am surprised that the packet was forwarded at all in 
the first place. That in itself can be considered a bug. Correctly, a Normal 
flow should drop non-Ethernet packets, I would say.

To fix the test case I suggest replacing the Normal action in br1 with 
"output:gre1" in line 700.

> 
> > >> stdout:
> > >> warped
> > >> ./packet-type-aware.at:726:
> > >> ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy |
> > >> strip_used | grep -v ipv6 | sort
> > >>
> > >> --- -   2021-05-18 21:57:56.810513366 +0200
> > >> +++ /home/i.maximets/work/git/ovs/tests/testsuite.dir/at-
> groups/2314/stdout 2021-05-18 21:57:56.806609814 +0200
> > >> @@ -1,3 +1,3 @@
> > >>  flow-dump from the main thread:
> > >> -recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth(src=3a:6d:d2:0
> > >> 9:9c:ab,dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(tos=0/0x3,frag
> > >> =no), packets:1, bytes:98, used:0.0s,
> > >> actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,typ
> > >> e=3,eth(dst=de:af:be:ef:ba:be,src=aa:55:00:00:00:02,dl_type=0x0800)
> > >> ,ipv4(src=10.0.0.1,dst=10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),
> > >> gre((flags=0x0,proto=0x800))),out_port(br2)),n2)
> > >> +recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth(src=3a:6d:d2:0
> > >> +9:9c:ab,dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(tos=0/0x3,fra
> > >> +g=no), packets:1, bytes:98, used:0.0s, actions:drop
> > >>
> > >>
> > >> It fails very frequently in GitHub Actions, but it's harder to make
> > >> it fail on my local machine.  Following change to the test allows
> > >> to reproduce the failure almost always on my local machine:
> > >>
> > >> diff --git a/tests/packet-type-aware.at
> > >> b/tests/packet-type-aware.at index 540cf98f3..01dbc8030 100644
> > >> --- a/tests/packet-type-aware.at
> > >> +++ b/tests/packet-type-aware.at
> > >> @@ -721,7 +721,7 @@ AT_CHECK([
> > >>  ovs-appctl netdev-dummy/receive n0
> > >>
> 1e2ce92a669e3a6dd2099cab080045548a83400040011aadc0a80a0ac0a80
> a1
> > >>
> e0800b7170a4d0002fd509a58de1c02001011121314151617
> 18
> > >>
> 191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637
> > >>  ], [0], [ignore])
> > >>
> > >> -ovs-appctl time/warp 1000
> > >> +ovs-appctl time/warp 1000 100
> > >>
> > >>  AT_CHECK([
> > >>  ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy |
> > >> strip_used | grep -v ipv6 | sort
> > >> --
> > >>
> > >> Without your patch I can not make it fail locally even with above
> > >> wrapping change applied.
> > >>
> > >> Could you, please, take a look?
> > >>
> > >
> > > Hi Ilya
> > >
> > > Travis CI was good. i will rebase & try again
> > > https://protect2.fireeye.com/v1/url?k=8ec813e4-d1532ae4-8ec8537f-86f
> > > c6812c361-3273f254661f9c75=1=c83c3bb8-d3c9-439a-8031-
> 72634d0fecc
> > > 2=https%3A%2F%2Ftravis-
> 

Re: [ovs-dev] [PATCH] Fix redundant datapath set ethernet action with NSH Decap

2021-05-07 Thread Jan Scheurich via dev
Hi Martin,

Somehow I didn’t receive this patch email via the ovs-dev mailing list, perhaps 
one of the many spam filters on the way interfered. Don't know if this response 
email will be recognized by ovs patchwork.

The nsh.at lines are wrapped incorrectly in this email, but they look OK in 
patchwork.
Otherwise, LGTM.

Acked-by: Jan Scheurich 

/Jan

> -Original Message-
> From: Varghese, Martin (Nokia - IN/Bangalore)
> 
> Sent: Monday, 3 May, 2021 15:25
> To: Jan Scheurich 
> Subject: RE: [PATCH] Fix redundant datapath set ethernet action with NSH
> Decap
> 
> Hi Jan,
> 
> Could you please review this patch.
> 
> Regards,
> Martin
> 
> -Original Message-
> From: Martin Varghese 
> Sent: Tuesday, April 27, 2021 6:13 PM
> To: d...@openvswitch.org; echau...@redhat.com;
> jan.scheur...@ericsson.com
> Cc: Varghese, Martin (Nokia - IN/Bangalore) 
> Subject: [PATCH] Fix redundant datapath set ethernet action with NSH Decap
> 
> From: Martin Varghese 
> 
> When a decap action is applied on NSH header encapsulatiing a ethernet
> packet a redundant set mac address action is programmed to the datapath.
> 
> Fixes: f839892a206a ("OF support and translation of generic encap and
> decap")
> Signed-off-by: Martin Varghese 
> ---
>  lib/odp-util.c   | 3 ++-
>  ofproto/ofproto-dpif-xlate.c | 2 ++
>  tests/nsh.at | 8 
>  3 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/odp-util.c b/lib/odp-util.c index e1199d1da..9d558082f 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> @@ -7830,7 +7830,8 @@ commit_set_ether_action(const struct flow *flow,
> struct flow *base_flow,
>  struct offsetof_sizeof ovs_key_ethernet_offsetof_sizeof_arr[] =
>  OVS_KEY_ETHERNET_OFFSETOF_SIZEOF_ARR;
> 
> -if (flow->packet_type != htonl(PT_ETH)) {
> +if ((flow->packet_type != htonl(PT_ETH)) ||
> +(base_flow->packet_type != htonl(PT_ETH))) {
>  return;
>  }
> 
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index
> 7108c8a30..a6f4ea334 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -6549,6 +6549,8 @@ xlate_generic_decap_action(struct xlate_ctx *ctx,
>   * Delay generating pop_eth to the next commit. */
>  flow->packet_type = htonl(PACKET_TYPE(OFPHTN_ETHERTYPE,
>ntohs(flow->dl_type)));
> +flow->dl_src = eth_addr_zero;
> +flow->dl_dst = eth_addr_zero;
>  ctx->wc->masks.dl_type = OVS_BE16_MAX;
>  }
>  return false;
> diff --git a/tests/nsh.at b/tests/nsh.at index d5c772ff0..e84134e42 100644
> --- a/tests/nsh.at
> +++ b/tests/nsh.at
> @@ -105,7 +105,7 @@ bridge("br0")
> 
>  Final flow:
> in_port=1,vlan_tci=0x,dl_src=00:00:00:00:00:00,dl_dst=11:22:33:44:55:6
> 6,dl_type=0x894f,nsh_flags=0,nsh_ttl=63,nsh_mdtype=1,nsh_np=3,nsh_spi=0
> x1234,nsh_si=255,nsh_c1=0x11223344,nsh_c2=0x0,nsh_c3=0x0,nsh_c4=0x0,
> nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0
>  Megaflow:
> recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no
> -Datapath actions:
> push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c
> 2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:6
> 6),pop_eth,pop_nsh(),set(eth(dst=11:22:33:44:55:66)),recirc(0x1)
> +Datapath actions:
> +push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,
> c
> +2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:
> +66),pop_eth,pop_nsh(),recirc(0x1)
>  ])
> 
>  AT_CHECK([
> @@ -139,7 +139,7 @@ ovs-appctl time/warp 1000  AT_CHECK([
>  ovs-appctl dpctl/dump-flows dummy@ovs-dummy | strip_used | grep -v
> ipv6 | sort  ], [0], [flow-dump from the main thread:
> -
> recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth
> _type(0x0800),ipv4(frag=no), packets:1, bytes:98, used:0.0s,
> actions:push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x112
> 23344,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:
> 44:55:66),pop_eth,pop_nsh(),set(eth(dst=11:22:33:44:55:66)),recirc(0x3)
> +recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9
> +e),eth_type(0x0800),ipv4(frag=no), packets:1, bytes:98, used:0.0s,
> +actions:push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11
> +223344,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:
> 3
> +3:44:55:66),pop_eth,pop_nsh(),recirc(0x3)
> 
> recirc_id(0x3),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=
> no), packets:1, bytes:98, used:0.0s, actions:2
>  ])
> 
> @@ -232,7 +232,7 @@ bridge("br0")
> 
>  Final flow: in_port=1,vlan_tci=0x,dl_src=00:00:00:00:00:00,dl_dst=11:22:33:44:55:66,dl_type=0x894f,nsh_flags=0,nsh_ttl=63,nsh_mdtype=2,nsh_np=3,nsh_spi=0x1234,nsh_si=255,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0
>  Megaflow:
> 

Re: [ovs-dev] [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.

2021-04-17 Thread Jan Scheurich via dev


> -Original Message-
> From: Martin Varghese 
> Sent: Tuesday, 13 April, 2021 16:20
> To: Jan Scheurich 
> Cc: Eelco Chaudron ; d...@openvswitch.org;
> pshe...@ovn.org; martin.vargh...@nokia.com
> Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
> 
> On Wed, Apr 07, 2021 at 03:49:07PM +, Jan Scheurich wrote:
> > Hi Martin,
> >
> > I guess you are aware of the original design document we wrote for generic
> encap/decap and NSH support:
> > https://docs.google.com/document/d/1oWMYUH8sjZJzWa72o2q9kU0N6pNE-rwZcLH3-kbbDR8/edit#
> >
> > It is no longer 100% aligned with the final implementation in OVS but still 
> > a
> good reference for understanding the design principles behind the
> implementation and some specifics for Ethernet and NSH encap/decap use
> cases.
> >
> > Please find some more answers/comments below.
> >
> > BR, Jan
> >
> > > -Original Message-
> > > From: Martin Varghese 
> > > Sent: Wednesday, 7 April, 2021 10:43
> > > To: Jan Scheurich 
> > > Cc: Eelco Chaudron ; d...@openvswitch.org;
> > > pshe...@ovn.org; martin.vargh...@nokia.com
> > > Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
> > >
> > > On Tue, Apr 06, 2021 at 09:00:16AM +, Jan Scheurich wrote:
> > > > Hi,
> > > >
> > > > Thanks for the heads up. The interaction with MPLS push/pop is a
> > > > use case
> > > that was likely not tested during the NSH and generic encap/decap
> > > design. It's complex code and a long time ago. I'm willing to help,
> > > but I will need some time to go back and have a look.
> > > >
> > > > It would definitely help, if you could provide a minimal example
> > > > for
> > > reproducing the problem.
> > > >
> > >
> > > Hi Jan ,
> > >
> > > Thanks for your help.
> > >
> > > I was trying to implement ENCAP/DECAP support for MPLS.
> > >
> > > The programming of datapath flow for the below  userspace rule fails
> > > as there is set(eth() action between pop_mpls and recirc ovs-ofctl
> > > -O OpenFlow13 add- flow br_mpls2
> > > "in_port=$egress_port,dl_type=0x8847
> > > actions=decap(),decap(packet_type(ns=0,type=0),goto_table:1
> > >
> > > 2021-04-05T05:46:49.192Z|00068|dpif(handler51)|WARN|system@ovs-
> > > system: failed to put[create] (Invalid argument)
> > > ufid:1dddb0ba-27fe-44ea- 9a99-5815764b4b9c
> > > recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(6),skb_mark(0/0)
> > > ,ct_state
> > > (0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:01/00:
> > > 00:00:00:00:00,dst=00:00:00:00:00:02/00:00:00:00:00:00),eth_type(0x8
> > > 847) ,mpls(label=2/0x0,tc=0/0,ttl=64/0x0,bos=1/1),
> > > actions:pop_eth,pop_mpls(eth_type=0x6558),set(eth()),recirc(0x45)
> > >
> >
> > Conceptually, what should happen in this scenario is that, after the second
> decap(packet_type(ns=0,type=0) action, OVS processes the unchanged inner
> packet as packet type PT_ETH, i.e. as L2 Ethernet frame. Overwriting the
> existing Ethernet header with zero values  through set(eth()) is clearly
> incorrect. That is a logical error inside the ofproto-dpif-xlate module (see
> below).
> >
> > I believe the netdev userspace datapath would still have accepted the
> incorrect datapath flow. I have too little experience with the kernel 
> datapath to
> explain why that rejects the datapath flow as invalid.
> >
> > Unlike in the Ethernet and NSH cases, the MPLS header does not contain any
> indication about the inner packet type. That is why the packet_type must be
> provided by the SDN controller as part of the decap() action.  And the 
> ofproto-
> dpif-xlate module must consider the specified inner packet type when
> continuing the translation. In the general case, a decap() action should 
> trigger
> recirculation for reparsing of the inner packet, so the new packet type must 
> be
> set before recirculation. (Exceptions to the general recirculation rule are 
> those
> where OVS has already parsed further into the packet and ofproto can modify
> the flow on the fly: decap(Ethernet) and possibly decap(MPLS) for all but the
> last bottom of stack label).
> >
> > I have had a look at your new code for encap/decap of MPLS headers, but I
> must admit I cannot fully judge in how far re-using the existing translation
> functions for MPLS label stacks written for the legacy push/pop_mpls case 
> (i.e.
> manipulating a label stack between the L2 and the L3 headers of a PT_ETH
> Packet) are possible to re-use in the new context.
> >
> > BTW: Do you support multiple MPLS label encap or decap actions with your
> patch? Have you tested that?
> >
> > I am uncertain about the handling of the ethertype of the decapsulated inner
> packet. In the design base, the ethertype that is set in the existing L2 
> header of
> the packet after pop_mpls of the last label is coming from the pop_mpls 
> action,

Re: [ovs-dev] [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.

2021-04-07 Thread Jan Scheurich via dev
Hi Martin,

I guess you are aware of the original design document we wrote for generic 
encap/decap and NSH support:
https://docs.google.com/document/d/1oWMYUH8sjZJzWa72o2q9kU0N6pNE-rwZcLH3-kbbDR8/edit#

It is no longer 100% aligned with the final implementation in OVS but still a 
good reference for understanding the design principles behind the 
implementation and some specifics for Ethernet and NSH encap/decap use cases.

Please find some more answers/comments below.

BR, Jan 

> -Original Message-
> From: Martin Varghese 
> Sent: Wednesday, 7 April, 2021 10:43
> To: Jan Scheurich 
> Cc: Eelco Chaudron ; d...@openvswitch.org;
> pshe...@ovn.org; martin.vargh...@nokia.com
> Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
> 
> On Tue, Apr 06, 2021 at 09:00:16AM +, Jan Scheurich wrote:
> > Hi,
> >
> > Thanks for the heads up. The interaction with MPLS push/pop is a use case
> that was likely not tested during the NSH and generic encap/decap design. It's
> complex code and a long time ago. I'm willing to help, but I will need some
> time to go back and have a look.
> >
> > It would definitely help, if you could provide a minimal example for
> reproducing the problem.
> >
> 
> Hi Jan ,
> 
> Thanks for your help.
> 
> I was trying to implement ENCAP/DECAP support for MPLS.
> 
> The programming of datapath flow for the below  userspace rule fails as there
> is set(eth() action between pop_mpls and recirc ovs-ofctl -O OpenFlow13 add-
> flow br_mpls2 "in_port=$egress_port,dl_type=0x8847
> actions=decap(),decap(packet_type(ns=0,type=0),goto_table:1
> 
> 2021-04-05T05:46:49.192Z|00068|dpif(handler51)|WARN|system@ovs-
> system: failed to put[create] (Invalid argument) ufid:1dddb0ba-27fe-44ea-
> 9a99-5815764b4b9c
> recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(6),skb_mark(0/0),ct_state
> (0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:01/00:
> 00:00:00:00:00,dst=00:00:00:00:00:02/00:00:00:00:00:00),eth_type(0x8847)
> ,mpls(label=2/0x0,tc=0/0,ttl=64/0x0,bos=1/1),
> actions:pop_eth,pop_mpls(eth_type=0x6558),set(eth()),recirc(0x45)
> 

Conceptually, what should happen in this scenario is that, after the second 
decap(packet_type(ns=0,type=0)) action, OVS processes the unchanged inner packet 
as packet type PT_ETH, i.e. as an L2 Ethernet frame. Overwriting the existing 
Ethernet header with zero values through set(eth()) is clearly incorrect. That 
is a logical error inside the ofproto-dpif-xlate module (see below).

I believe the netdev userspace datapath would still have accepted the incorrect 
datapath flow. I have too little experience with the kernel datapath to explain 
why that rejects the datapath flow as invalid.

Unlike in the Ethernet and NSH cases, the MPLS header does not contain any 
indication of the inner packet type. That is why the packet_type must be 
provided by the SDN controller as part of the decap() action, and the 
ofproto-dpif-xlate module must honor the specified inner packet type when 
continuing the translation. In the general case, a decap() action should 
trigger recirculation so that the inner packet is re-parsed, which means the 
new packet type must be set before recirculation. (Exceptions to this rule are 
cases where OVS has already parsed further into the packet and ofproto can 
modify the flow on the fly: decap(Ethernet) and possibly decap(MPLS) for all 
but the last, bottom-of-stack label.)
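
As a concrete illustration (this is simply the reproduction rule quoted above 
with the missing closing parenthesis added; bridge and port names are taken 
from that example, not prescribed here):

    ovs-ofctl -O OpenFlow13 add-flow br_mpls2 \
        "in_port=$egress_port,dl_type=0x8847 \
         actions=decap(),decap(packet_type(ns=0,type=0)),goto_table:1"

Here the second decap() tells OVS that what remains after removing the MPLS 
header is a complete Ethernet frame (packet_type(ns=0,type=0), i.e. PT_ETH), 
and the translation has to set exactly that packet type before recirculating.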

I have had a look at your new code for encap/decap of MPLS headers, but I must 
admit I cannot fully judge to what extent the existing translation functions 
for MPLS label stacks, written for the legacy push/pop_mpls case (i.e. 
manipulating a label stack between the L2 and L3 headers of a PT_ETH packet), 
can be re-used in the new context.

BTW: Do you support multiple MPLS label encap or decap actions with your patch? 
Have you tested that? 

I am uncertain about the handling of the ethertype of the decapsulated inner 
packet. In the existing design, the ethertype that is set in the packet's L2 
header after pop_mpls of the last label comes from the pop_mpls action, while 
in the decap(packet_type(0,0)) case the entire inner packet should be 
recirculated as is with packet_type PT_ETH.

    case PT_MPLS: {
        int n;
        ovs_be16 ethertype;

        flow->packet_type = decap->new_pkt_type;
        ethertype = pt_ns_type_be(flow->packet_type);

        n = flow_count_mpls_labels(flow, ctx->wc);
        flow_pop_mpls(flow, n, ethertype, ctx->wc);
        if (!ctx->xbridge->support.add_mpls) {
            ctx->xout->slow |= SLOW_ACTION;
        }
        ctx->pending_decap = true;
        return true;

In the example scenario the new_pkt_type is PT_ETH, so the ethertype and hence 
flow->dl_type will also be set to zero. Is that intentional? For target packet 
types of the form PACKET_TYPE(OFPHTN_ETHERTYPE, ) you would set 
the

Re: [ovs-dev] [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.

2021-04-06 Thread Jan Scheurich via dev
Hi,

Thanks for the heads up. The interaction with MPLS push/pop is a use case that 
was likely not tested during the NSH and generic encap/decap design. It's 
complex code and it was a long time ago. I'm willing to help, but I will need 
some time to go back and have a look.

It would definitely help if you could provide a minimal example for 
reproducing the problem.

BR, Jan

> -Original Message-
> From: Eelco Chaudron 
> Sent: Tuesday, 6 April, 2021 10:55
> To: Martin Varghese ; Jan Scheurich
> 
> Cc: d...@openvswitch.org; pshe...@ovn.org; martin.vargh...@nokia.com
> Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
> 
> 
> 
> On 6 Apr 2021, at 10:27, Martin Varghese wrote:
> 
> > On Thu, Apr 01, 2021 at 11:32:06AM +0200, Eelco Chaudron wrote:
> >>
> >>
> >> On 1 Apr 2021, at 11:28, Martin Varghese wrote:
> >>
> >>> On Thu, Apr 01, 2021 at 11:17:14AM +0200, Eelco Chaudron wrote:
> 
> 
>  On 1 Apr 2021, at 11:09, Martin Varghese wrote:
> 
> > On Thu, Apr 01, 2021 at 10:54:42AM +0200, Eelco Chaudron wrote:
> >>
> >>
> >> On 1 Apr 2021, at 10:35, Martin Varghese wrote:
> >>
> >>> On Thu, Apr 01, 2021 at 08:59:27AM +0200, Eelco Chaudron wrote:
> 
> 
>  On 1 Apr 2021, at 6:10, Martin Varghese wrote:
> 
> > On Wed, Mar 31, 2021 at 03:59:40PM +0200, Eelco Chaudron
> > wrote:
> >>
> >>
> >> On 26 Mar 2021, at 7:21, Martin Varghese wrote:
> >>
> >>> From: Martin Varghese 
> >>>
> >>> The encap & decap actions are extended to support MPLS
> >>> packet type.
> >>> Encap & decap actions adds and removes MPLS header at start
> >>> of the packet.
> >>
> >> Hi Martin,
> >>
> >> I’m trying to do some real-life testing, and I’m running into
> >> issues. This might be me setting it up wrongly but just
> >> wanting to confirm…
> >>
> >> I’m sending an MPLS packet that contains an ARP packet into a
> >> physical port.
> >> This is the packet:
> >>
> >> Frame 4: 64 bytes on wire (512 bits), 64 bytes captured (512
> >> bits)
> >> Encapsulation type: Ethernet (1)
> >> [Protocols in frame: eth:ethertype:mpls:data] Ethernet
> >> II, Src: 00:00:00_00:00:01 (00:00:00:00:00:01), Dst:
> >> 00:00:00_00:00:02 (00:00:00:00:00:02)
> >> Destination: 00:00:00_00:00:02 (00:00:00:00:00:02)
> >> Address: 00:00:00_00:00:02 (00:00:00:00:00:02)
> >>  ..0.     = LG bit: Globally
> >> unique address (factory default)
> >>  ...0     = IG bit:
> >> Individual address
> >> (unicast)
> >> Source: 00:00:00_00:00:01 (00:00:00:00:00:01)
> >> Address: 00:00:00_00:00:01 (00:00:00:00:00:01)
> >>  ..0.     = LG bit: Globally
> >> unique address (factory default)
> >>  ...0     = IG bit:
> >> Individual address
> >> (unicast)
> >> Type: MPLS label switched packet (0x8847) MultiProtocol
> >> Label Switching Header, Label: 100, Exp: 0, S:
> >> 1, TTL:
> >> 64
> >>    0110 0100    = MPLS Label: 100
> >>      000.   = MPLS
> >> Experimental
> >> Bits: 0
> >>      ...1   = MPLS Bottom Of
> >> Label
> >> Stack: 1
> >>       0100  = MPLS TTL: 64
> >> Data (46 bytes)
> >>
> >>   ff ff ff ff ff ff 52 54 00 88 51 38 08 06 00 01
> >> ..RT..Q8
> >> 0010  08 00 06 04 00 01 52 54 00 88 51 38 01 01 01 65
> >> ..RT..Q8...e
> >> 0020  00 00 00 00 00 00 01 01 01 64 27 98 a0 47
> >> .d'..G
> >> Data:
> >>
> 52540088513808060001080006040001525400885138010101650
> 000?
> >>
> >>
> >> I’m trying to use the following rules:
> >>
> >>   ovs-ofctl del-flows ovs_pvp_br0
> >>   ovs-ofctl add-flow -O OpenFlow13 ovs_pvp_br0
> >> "priority=100,dl_type=0x8847,mpls_label=100
> >>
> actions=decap(),decap(packet_type(ns=0,type=0x806)),resubmit(,3)"
> >>   ovs-ofctl add-flow -O OpenFlow13 ovs_pvp_br0
> >> "table=3,priority=10
> >> actions=normal"
> >>
> >> With these, I expect the packet to be sent to vnet0, but it’s
> >> not.
> >> Actually,
> >> the datapath rule looks odd, while the userspace rules seem
> >> to match:
> >>
> >>   $ ovs-dpctl dump-flows
> >>
> >> 

Re: [ovs-dev] [PATCH v2] ofp-ed-props: Fix using uninitialized padding for NSH encap actions.

2020-10-14 Thread Jan Scheurich via dev
LGTM. Please back-port to stable branches.

Acked-by: Jan Scheurich 

/Jan

> -Original Message-
> From: Ilya Maximets 
> Sent: Wednesday, 14 October, 2020 18:14
> To: ovs-dev@openvswitch.org; Jan Scheurich 
> Cc: Ben Pfaff ; Ilya Maximets 
> Subject: [PATCH v2] ofp-ed-props: Fix using uninitialized padding for NSH
> encap actions.
> 
> OVS uses memcmp to compare actions of existing and new flows, but 'struct
> ofp_ed_prop_nsh_md_type' and the corresponding ofpact structure have
> 3 bytes of padding that are never initialized and are passed around within
> OF data structures and messages.
> 
>   Uninitialized bytes in MemcmpInterceptorCommon
> at offset 21 inside [0x709003f8, 136)
>   WARNING: MemorySanitizer: use-of-uninitialized-value
> #0 0x4a184e in bcmp (vswitchd/ovs-vswitchd+0x4a184e)
> #1 0x896c8a in ofpacts_equal lib/ofp-actions.c:9121:31
> #2 0x564403 in replace_rule_finish ofproto/ofproto.c:5650:37
> #3 0x563462 in add_flow_finish ofproto/ofproto.c:5218:13
> #4 0x54a1ff in ofproto_flow_mod_finish ofproto/ofproto.c:8091:17
> #5 0x5433b2 in handle_flow_mod__ ofproto/ofproto.c:6216:17
> #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17
> #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16
> #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21
> #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13
> #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9
> #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5
> #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9
> #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5
> #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9
> #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6)
> #16 0x46b92d in _start (vswitchd/ovs-vswitchd+0x46b92d)
> 
>   Uninitialized value was stored to memory at
> #0 0x4745aa in __msan_memcpy.part.0 (vswitchd/ovs-vswitchd)
> #1 0x54529f in rule_actions_create ofproto/ofproto.c:3134:5
> #2 0x54915e in ofproto_rule_create ofproto/ofproto.c:5284:11
> #3 0x55d419 in add_flow_init ofproto/ofproto.c:5123:17
> #4 0x54841f in ofproto_flow_mod_init ofproto/ofproto.c:7987:17
> #5 0x543250 in handle_flow_mod__ ofproto/ofproto.c:6206:13
> #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17
> #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16
> #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21
> #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13
> #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9
> #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5
> #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9
> #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5
> #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9
> #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6)
> 
>   Uninitialized value was created by an allocation of 'ofpacts_stub'
>   in the stack frame of function 'handle_flow_mod'
> #0 0x569e80 in handle_flow_mod ofproto/ofproto.c:6170
> 
> This could cause issues with flow modifications or other operations.
> 
> To reproduce, some NSH tests could be run under valgrind or clang
> MemorySanitizer. Ex. "nsh - md1 encap over a veth link" test.
> 
> Fix that by clearing padding bytes while encoding and decoding.
> OVS will still accept OF messages with non-zero padding from controllers.
> 
> New tests added to tests/ofp-actions.at.
> 
> Fixes: 1fc11c5948cf ("Generic encap and decap support for NSH")
> Signed-off-by: Ilya Maximets 
> ---
>  lib/ofp-ed-props.c   |  3 ++-
>  tests/ofp-actions.at | 11 +++
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/ofp-ed-props.c b/lib/ofp-ed-props.c index
> 28382e012..02a9235d5 100644
> --- a/lib/ofp-ed-props.c
> +++ b/lib/ofp-ed-props.c
> @@ -49,7 +49,7 @@ decode_ed_prop(const struct ofp_ed_prop_header
> **ofp_prop,
>  return OFPERR_NXBAC_BAD_ED_PROP;
>  }
>  struct ofpact_ed_prop_nsh_md_type *pnmt =
> -ofpbuf_put_uninit(out, sizeof(*pnmt));
> +ofpbuf_put_zeros(out, sizeof *pnmt);
>  pnmt->header.prop_class = prop_class;
>  pnmt->header.type = prop_type;
>  pnmt->header.len = len;
> @@ -108,6 +108,7 @@ encode_ed_prop(const struct ofpact_ed_prop
> **prop,
>  opnmt->header.len =
>  offsetof(struct ofp_ed_prop_nsh_md_type, pad);
>  opnmt->md_type = pnmt->md_type;
> +memset(opnmt->pad, 0, sizeof opnmt->pad);
>  prop_len = sizeof(*pnmt);
>  break;
>  }
> diff --git a/tests/ofp-actions.at b/tests/ofp-actions.at index
> 28b2099a0..c79d7d0e2 100644
> --- a/tests/ofp-actions.at
> +++ b/tests/ofp-actions.at
> @@ -769,6 +769,17 @@ dnl Check OpenFlow v1.3.4 Conformance Test:
> 430.510.
>  & 0010  00 00 00 10 00 00 00 01-
>  0019 0010 8807 000102030405 0010 0001
> 

Re: [ovs-dev] [PATCH] ofp-ed-props: Fix using uninitialized padding for NSH encap actions.

2020-10-14 Thread Jan Scheurich via dev
> >> Fix that by clearing padding bytes while encoding, and checking that
> >> these bytes are all zeros on decoding.
> >
> > Is the latter strictly necessary? It may break existing controllers that do 
> > not
> initialize the padding bytes to zero.
> > Wouldn't it be sufficient to just zero the padding bytes at reception?
> 
> I do not have a strong opinion.  I guess, we could not fail OF request if
> padding is not all zeroes for backward compatibility.
> Anyway, it seems like I missed one part of this change (see inline).
> 
> On the other hand, AFAIU, NXOXM_NSH_ is not standardized, so, technically,
> we could change the rules here.  As an option, we could apply the patch
> without checking for all-zeroes padding and backport it this way to stable
> branches.  Afterwards, we could introduce the 'is_all_zeros' check and
> mention this change in release notes for the new version.  Anyway OpenFlow
> usually requires paddings to be all-zeroes for most of matches and actions, so
> this should be a sane requirement for controllers.
> What do you think?
> 

I think there is little to gain by enforcing strict rules on zeroed padding 
bytes in a future release. It just creates grief for users of OVS by 
unnecessarily breaking backward compatibility without any benefit for OVS, no 
matter whether OVS has the right to do so or not.
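
For illustration, a minimal sketch of the "zero at reception" alternative 
(field and function names follow the decode path quoted below; this is just 
the idea, not the committed fix):

    /* Sketch: rather than returning OFPERR_NXBRC_MUST_BE_ZERO when the
     * controller left the padding uninitialized, the decoder can simply
     * ignore it and rebuild the ofpact from the meaningful fields only.
     * ofpbuf_put_zeros() also clears the ofpact's own padding, so the
     * later memcmp()-based ofpacts_equal() comparison stays reliable. */
    struct ofpact_ed_prop_nsh_md_type *pnmt =
        ofpbuf_put_zeros(out, sizeof *pnmt);
    pnmt->header.prop_class = prop_class;
    pnmt->header.type = prop_type;
    pnmt->header.len = len;
    pnmt->md_type = opnmt->md_type;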

> >> diff --git a/lib/ofp-ed-props.c b/lib/ofp-ed-props.c index
> >> 28382e012..5a4b12d9f 100644
> >> --- a/lib/ofp-ed-props.c
> >> +++ b/lib/ofp-ed-props.c
> >> @@ -48,6 +48,9 @@ decode_ed_prop(const struct ofp_ed_prop_header
> >> **ofp_prop,
> >>  if (len > sizeof(*opnmt) || len > *remaining) {
> >>  return OFPERR_NXBAC_BAD_ED_PROP;
> >>  }
> >> +if (!is_all_zeros(opnmt->pad, sizeof opnmt->pad)) {
> >> +return OFPERR_NXBRC_MUST_BE_ZERO;
> >> +}
> >>  struct ofpact_ed_prop_nsh_md_type *pnmt =
> >>  ofpbuf_put_uninit(out, sizeof(*pnmt));
> 
> This should be 'ofpbuf_put_zeroes' because 'struct
> ofpact_ed_prop_nsh_md_type'
> contains padding too that must be cleared while constructing ofpacts.
> Since OVS compares decoded ofpacts' and not the original OF messages, this
> should do the trick.

Agree.

> 
> I'll send v2 with this change and will remove 'is_all_zeros' check for this 
> fix.

Thanks, Jan



Re: [ovs-dev] [PATCH] ofp-ed-props: Fix using uninitialized padding for NSH encap actions.

2020-10-14 Thread Jan Scheurich via dev
Hi Ilya,

Good catch. One comment below.

/Jan

> -Original Message-
> From: Ilya Maximets 
> Sent: Tuesday, 13 October, 2020 21:02
> To: ovs-dev@openvswitch.org; Jan Scheurich 
> Cc: Ben Pfaff ; Yi Yang ; Ilya Maximets
> 
> Subject: [PATCH] ofp-ed-props: Fix using uninitialized padding for NSH encap
> actions.
> 
> OVS uses memcmp to compare actions of existing and new flows, but 'struct
> ofp_ed_prop_nsh_md_type' has 3 bytes of padding that are never initialized
> and are passed around within OF data structures and messages.
> 
>   Uninitialized bytes in MemcmpInterceptorCommon
> at offset 21 inside [0x709003f8, 136)
>   WARNING: MemorySanitizer: use-of-uninitialized-value
> #0 0x4a184e in bcmp (vswitchd/ovs-vswitchd+0x4a184e)
> #1 0x896c8a in ofpacts_equal lib/ofp-actions.c:9121:31
> #2 0x564403 in replace_rule_finish ofproto/ofproto.c:5650:37
> #3 0x563462 in add_flow_finish ofproto/ofproto.c:5218:13
> #4 0x54a1ff in ofproto_flow_mod_finish ofproto/ofproto.c:8091:17
> #5 0x5433b2 in handle_flow_mod__ ofproto/ofproto.c:6216:17
> #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17
> #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16
> #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21
> #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13
> #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9
> #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5
> #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9
> #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5
> #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9
> #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6)
> #16 0x46b92d in _start (vswitchd/ovs-vswitchd+0x46b92d)
> 
>   Uninitialized value was stored to memory at
> #0 0x4745aa in __msan_memcpy.part.0 (vswitchd/ovs-vswitchd)
> #1 0x54529f in rule_actions_create ofproto/ofproto.c:3134:5
> #2 0x54915e in ofproto_rule_create ofproto/ofproto.c:5284:11
> #3 0x55d419 in add_flow_init ofproto/ofproto.c:5123:17
> #4 0x54841f in ofproto_flow_mod_init ofproto/ofproto.c:7987:17
> #5 0x543250 in handle_flow_mod__ ofproto/ofproto.c:6206:13
> #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17
> #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16
> #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21
> #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13
> #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9
> #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5
> #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9
> #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5
> #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9
> #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6)
> 
>   Uninitialized value was created by an allocation of 'ofpacts_stub'
>   in the stack frame of function 'handle_flow_mod'
> #0 0x569e80 in handle_flow_mod ofproto/ofproto.c:6170
> 
> This could cause issues with flow modifications or other operations.
> 
> To reproduce, some NSH tests could be run under valgrind or clang
> MemorySanitizer. Ex. "nsh - md1 encap over a veth link" test.
> 
> Fix that by clearing padding bytes while encoding, and checking that these
> bytes are all zeros on decoding.

Is the latter strictly necessary? It may break existing controllers that do not 
initialize the padding bytes to zero.
Wouldn't it be sufficient to just zero the padding bytes at reception?

> 
> New tests added to tests/ofp-actions.at.
> 
> Fixes: 1fc11c5948cf ("Generic encap and decap support for NSH")
> Signed-off-by: Ilya Maximets 
> ---
>  lib/ofp-ed-props.c   |  4 
>  tests/ofp-actions.at | 11 +++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/lib/ofp-ed-props.c b/lib/ofp-ed-props.c index
> 28382e012..5a4b12d9f 100644
> --- a/lib/ofp-ed-props.c
> +++ b/lib/ofp-ed-props.c
> @@ -48,6 +48,9 @@ decode_ed_prop(const struct ofp_ed_prop_header
> **ofp_prop,
>  if (len > sizeof(*opnmt) || len > *remaining) {
>  return OFPERR_NXBAC_BAD_ED_PROP;
>  }
> +if (!is_all_zeros(opnmt->pad, sizeof opnmt->pad)) {
> +return OFPERR_NXBRC_MUST_BE_ZERO;
> +}
>  struct ofpact_ed_prop_nsh_md_type *pnmt =
>  ofpbuf_put_uninit(out, sizeof(*pnmt));
>  pnmt->header.prop_class = prop_class; @@ -108,6 +111,7 @@
> encode_ed_prop(const struct ofpact_ed_prop **prop,
>  opnmt->header.len =
>  offsetof(struct ofp_ed_prop_nsh_md_type, pad);
>  opnmt->md_type = pnmt->md_type;
> +memset(opnmt->pad, 0, sizeof opnmt->pad);
>  prop_len = sizeof(*pnmt);
>  break;
>  }
> diff --git a/tests/ofp-actions.at b/tests/ofp-actions.at index
> 28b2099a0..18ba8206f 100644
> --- a/tests/ofp-actions.at
> +++ b/tests/ofp-actions.at
> @@ -769,6 +769,17 @@ dnl Check 

Re: [ovs-dev] [PATCH] userspace: Switch default cache from EMC to SMC.

2020-09-22 Thread Jan Scheurich via dev
> -Original Message-
> From: Flavio Leitner 
> On Tue, Sep 22, 2020 at 01:22:58PM +0200, Ilya Maximets wrote:
> > On 9/19/20 3:07 PM, Flavio Leitner wrote:
> > > The EMC is not large enough for current production cases and they
> > > are scaling up, so this change switches over from EMC to SMC by
> > > default, which provides better results.
> > >
> > > The EMC is still available and could be used when only a few number
> > > of flows is used.


> 
> I am curious to find out what others think about this change, so going to 
> wait a
> bit before following up with the next version if that sounds OK.
> 

For production deployments of OVS-DPDK in NFVI we also recommend switching SMC 
on and EMC off.
No problem with making that configuration the default in the future.
As is to be expected, SMC only provides acceleration over DPCLS if the avg. 
number of sub-table lookups is > 1.
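
For reference, a minimal sketch of that configuration (option names as 
documented for the userspace datapath; adapt to the actual deployment):

    # Enable the signature match cache.
    ovs-vsctl set Open_vSwitch . other_config:smc-enable=true
    # Effectively disable the EMC by never inserting new entries.
    ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0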

BR, Jan


Re: [ovs-dev] [PATCH] dpif-netdev: Do not mix recirculation depth into RSS hash itself.

2019-10-24 Thread Jan Scheurich via dev
An even simpler solution to the problem.
Acked-by: Jan Scheurich 

BR, Jan

> -Original Message-
> From: Ilya Maximets 
> Sent: Thursday, 24 October, 2019 14:32
> To: ovs-dev@openvswitch.org
> Cc: Ian Stokes ; Kevin Traynor ;
> Jan Scheurich ; ychen103...@163.com; Ilya
> Maximets 
> Subject: [PATCH] dpif-netdev: Do not mix recirculation depth into RSS hash
> itself.
> 
> Mixing the RSS hash with the recirculation depth is useful for flow lookup
> because the same packet after recirculation should match a different datapath
> rule.  Setting the mixed value back on the packet is completely unnecessary,
> because the recirculation depth differs on each recirculation, i.e. we will
> have a different packet hash for flow lookup anyway.
> 
> This should fix the issue that packets from the same flow could be directed to
> different buckets based on a dp_hash, or to different ports of a balanced
> bonding, in case they were recirculated a different number of times (e.g. due
> to conntrack rules).
> With this change, the original RSS hash will remain the same, making it
> possible to calculate equal dp_hash values for such packets.
> 
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-September/363127.html
> Fixes: 048963aa8507 ("dpif-netdev: Reset RSS hash when recirculating.")
> Signed-off-by: Ilya Maximets 
> ---
>  lib/dpif-netdev.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 4546b55e8..c09b8fd95
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -6288,7 +6288,6 @@ dpif_netdev_packet_get_rss_hash(struct dp_packet
> *packet,
>  recirc_depth = *recirc_depth_get_unsafe();
>  if (OVS_UNLIKELY(recirc_depth)) {
>  hash = hash_finish(hash, recirc_depth);
> -dp_packet_set_rss_hash(packet, hash);
>  }
>  return hash;
>  }
> --
> 2.17.1



Re: [ovs-dev] group dp_hash method works incorrectly when using snat

2019-09-30 Thread Jan Scheurich via dev
Hi,

You have pointed out an interesting issue in the netdev datapath implementation 
(I am not sure to what extent the same applies to the kernel datapath).

Conceptually, the dp_hash of a packet should be based on the current packet's 
flow. It should not change if the headers remain unchanged.

For performance reasons, the actual implementation of the dp_hash action in 
odp_execute.c bases the dp_hash value on the current RSS hash of the packet, if 
it has one, and only computes it from the actual packet content if not.

However, the RSS hash of the packet is updated on every recirculation in order 
to improve the EMC lookup success rate. So even if the RSS hash was initially a 
suitable base for dp_hash (which is itself uncertain, as the implementation of 
the RSS hash depends on the NIC hardware and might not satisfy the algorithm 
specified as part of the dp_hash action), its volatility across recirculations 
destroys the required property of the dp_hash.

What we could do is something like the following (not even compiler-tested):

diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 563ad1da8..1937bb1e6 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -820,16 +820,22 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
             uint32_t hash;

             DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
-                /* RSS hash can be used here instead of 5tuple for
-                 * performance reasons. */
-                if (dp_packet_rss_valid(packet)) {
-                    hash = dp_packet_get_rss_hash(packet);
-                    hash = hash_int(hash, hash_act->hash_basis);
-                } else {
-                    flow_extract(packet, &flow);
-                    hash = flow_hash_5tuple(&flow, hash_act->hash_basis);
+                if (packet->md.dp_hash == 0) {
+                    if (packet->md.recirc_id == 0 &&
+                        dp_packet_rss_valid(packet)) {
+                        /* RSS hash is used here instead of 5tuple for
+                         * performance reasons. */
+                        hash = dp_packet_get_rss_hash(packet);
+                        hash = hash_int(hash, hash_act->hash_basis);
+                    } else {
+                        flow_extract(packet, &flow);
+                        hash = flow_hash_5tuple(&flow, hash_act->hash_basis);
+                    }
+                    if (unlikely(hash == 0)) {
+                        hash = 1;
+                    }
+                    packet->md.dp_hash = hash;
                 }
-                packet->md.dp_hash = hash;
             }
             break;
         }
@@ -842,6 +848,9 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
                 hash = flow_hash_symmetric_l3l4(&flow,
                                                 hash_act->hash_basis,
                                                 false);
+                if (unlikely(hash == 0)) {
+                    hash = 1;
+                }
                 packet->md.dp_hash = hash;
             }
             break;
diff --git a/lib/packets.c b/lib/packets.c
index ab0b1a36d..a03a3ab61 100644
--- a/lib/packets.c
+++ b/lib/packets.c
@@ -391,6 +391,8 @@ push_mpls(struct dp_packet *packet, ovs_be16 ethtype, ovs_be32 lse)
     header = dp_packet_resize_l2_5(packet, MPLS_HLEN);
     memmove(header, header + MPLS_HLEN, len);
     memcpy(header + len, &lse, sizeof lse);
+    /* Invalidate dp_hash */
+    packet->md.dp_hash = 0;
 }

 /* If 'packet' is an MPLS packet, removes its outermost MPLS label stack entry.
@@ -411,6 +413,8 @@ pop_mpls(struct dp_packet *packet, ovs_be16 ethtype)
         /* Shift the l2 header forward. */
         memmove((char*)dp_packet_data(packet) + MPLS_HLEN, dp_packet_data(packet), len);
         dp_packet_resize_l2_5(packet, -MPLS_HLEN);
+        /* Invalidate dp_hash */
+        packet->md.dp_hash = 0;
     }
 }

@@ -444,6 +448,8 @@ push_nsh(struct dp_packet *packet, const struct nsh_hdr *nsh_hdr_src)
     packet->packet_type = htonl(PT_NSH);
     dp_packet_reset_offsets(packet);
     packet->l3_ofs = 0;
+    /* Invalidate dp_hash */
+    packet->md.dp_hash = 0;
 }

 bool
@@ -474,6 +480,8 @@ pop_nsh(struct dp_packet *packet)

     length = nsh_hdr_len(nsh);
     dp_packet_reset_packet(packet, length);
+    /* Invalidate dp_hash */
+    packet->md.dp_hash = 0;
     packet->packet_type = htonl(next_pt);
     /* Packet must be recirculated for further processing. */
 }
diff --git a/lib/packets.h b/lib/packets.h
index a4bee3819..8691fa0c2 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -98,8 +98,7 @@ PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
 uint32_t recirc_id; /* Recirculation id carried with the
recirculating