from:". 贺鹏"

Re: [ovs-dev] [External] Re: [PATCH v1] ofproto-dpif-xlate: ovs-tcpdump cannot capture incomming vxlan packets

2023-06-28 Thread . 贺鹏 via dev

On Mon, Jun 26, 2023 at 10:57 PM Simon Horman  wrote:
>
> On Fri, Oct 09, 2020 at 08:15:03PM +0800, hepeng.0320 wrote:
> > when running ovs-tcpdump -i ethX and the port is used as the incomming port 
> > for a vxlan port.
> >
> > The callstack for the upcall:
> >
> > mirror_ingress_packet
> > mirror_packet
> > output_normal
> > compose_output_action
> > compose_output_action__
> > terminate_native_tunnel
> >
> > will xlate the action into a tnl_pop action, not an output action to the
> > mirror port. So eventually the translated actions will be 'tnl_pop(x), 
> > tnl_pop(x)'.
> > However, the right action should be '(mirror port), tnl_pop(x)'
> >
> > This patch adds a flag in xlate_ctx indicating the current output_normal
> > is used by mirroring. Note that we cannot use ctx->mirrors as the
> > indicator as in the mirror code, the ctx->mirrors will not be cleared
> > after mirror action finished.
> >
> > Signed-off-by: hepeng.0320 
>
> Hi all,
>
> this patch appears to have gone stale - no response in the best part of
> three years.
>
> Accordingly, I am marking it as Changes Requested in patchwork
> (I'm still searching for the best status for such cases).

IIRC, there is another patch which has already been merged. It solves
the same problem through another approach.
Although I am not sure its approach works in every situation, the
problem stated in the comments should have
been solved.

I think you can drop the patch.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [来自外部的邮件]Re: [External] Re:[ovs-dev,ovs-dev,v2,4/4] dpif-netdev: fix inconsistent processing between ukey and megaflow

2022-09-23 Thread . 贺鹏

Hi, Zhike,

First I am trying to explain why this fix is not fixing it.

This lock-and-lookup code is actually for the case when all megaflows are
not per-PMD stored according to the ovs commit history.
So it's necessary to check if the megaflow is installed by another PMD.

Now megaflows are per-PMD stored and so the reason for megaflow insertion
can only be upcalls from this PMD,
except you are doing dpctl/flow-add to insert a megaflow manually, but I
guess this is not the case.

So, this code is mostly dead code, meaning that it's mostly impossible for
the code to run into this place in the real environment.

Revalidators only modify and delete megaflows. it will not add one.

Upcalls only happends when the megaflow do not exist in the datapath. We
can not expect a megaflow with drop action already resides in the cmap,
while upcall still happens, brings the correct actions, however does not
install the megaflow but reuse the old one.

it's just impossible.

Then my new suspect is that perhaps the ukey is now with drop action and
megaflow is also with drop action.

And now the revalidator tries to modify the actions, but somehow the
modifying action fails.
However, this failure is ignored. So the inconsistency now exists

So my first patch actually fixes it...
"[ovs-dev,ovs-dev,v2,1/4] ofproto-dpif-upcall: fix push_dp_ops"

On Fri, Sep 23, 2022 at 9:39 PM 王志克  wrote:

> Hi Peng,
>
>
>
> Right, I also met this issue, and wondering the sequence for this
> inconsistence, and would like to hear your “new cause”.
>
>
>
> Anyway I believe your below patch should fix this issue.
>
> [ovs-dev,ovs-dev,v2,1/4] ofproto-dpif-upcall: fix push_dp_ops
>
>
>
> Br,
>
> Zhike
>
>
>
> *From: *".贺鹏" 
> *Date: *Friday, September 23, 2022 at 8:59 PM
> *To: *王志克 
> *Cc: *"ovs-dev@openvswitch.org" , "
> d...@openvswitch.org" 
> *Subject: *[来自外部的邮件]Re: [External] Re:[ovs-dev,ovs-dev,v2,4/4]
> dpif-netdev: fix inconsistent processing between ukey and megaflow
>
>
>
> 京东安全提示：此封邮件来自公司外部，除非您能判断发件人和知道邮件内容安全，否则请勿打开链接或者附件。
> * JD Security Tips: Please do not click on links or open attachments
> unless you trust the sender and know the content is safe.*
>
>
>
> Hi, Zhike,
>
>
>
> After receiving your email, I was becoming curious about this code and did
> more investigation on it.
>
>
>
> and I found some problems with the code and now I believe this
> inconsistent processing is NOT the root cause for the inconsistent actions
> between ukey and datapath.
>
> and I found a new cause for that, but due to this complex race between PMD
> and revalidator, I wish this time I am right.
>
>
>
> But before that, why are you interested in this patch? Have you found the
> same issue in your environment?
>
>
>
>
>
>
>
>
>
> On Thu, Sep 22, 2022 at 6:54 PM .贺鹏  wrote:
>
> Hi, zhike,
>
>
>
> It's difficult to give a very clear sequences about how this inconsistency
> happens, but I can give you more details.
>
>
>
> This is observed in our production environment. The correct megaflow
> should encap packets with vxlan header and send out, but the action is drop.
>
> This is usually because the neigh info is not available at the moment when
> the upcall happens.
>
>
>
> Normally, the drop action is ephemeral, and reavalidator will later modify
> the megaflow's action into the tnl_push.
>
>
>
> But there are a few of cases, only happened 1~2 times in a year, where the
> drop actions will never be replaced by tnl_push.
>
>
>
> just like in the commits mentioned,
>
>
>
> "The coverage command shows revalidators have dumped several times,
>
> however the correct actions are not set. This implies that the ukey's
>
> action does not equal to the meagaflow's, i.e. revalidators think the
> underlying
>
> megaflow's actions are correct however they are not."
>
>
>
> I do not know how this happened, but I do think this inconsistent processing 
> could be one of the reasons.
>
> Even there is no such bug, I think keeping processing inconsistent is 
> necessary.
>
>
>
>
>
>
>
>
>
> On Wed, Sep 21, 2022 at 5:57 PM 王志克  wrote:
>
> Hi Hepeng,
>
>
>
> Can you please explain the sequence that how this inconsistence could
> happen? Why you believe the current actions in existing netdev_flow is old?
>
>
>
> Thanks.
>
>
>
> Br,
>
> wangzhike
>
>
>
>
>
>
>
>
>
>
> *
>
> [ovs-dev,ovs-dev,v2,4/4] dpif-netdev: fix inconsistent proce

Re: [ovs-dev] [External] Re:[ovs-dev, ovs-dev, v2, 4/4] dpif-netdev: fix inconsistent processing between ukey and megaflow

2022-09-23 Thread . 贺鹏

Hi, Zhike,

After receiving your email, I was becoming curious about this code and did
more investigation on it.

and I found some problems with the code and now I believe this inconsistent
processing is NOT the root cause for the inconsistent actions between ukey
and datapath.
and I found a new cause for that, but due to this complex race between PMD
and revalidator, I wish this time I am right.

But before that, why are you interested in this patch? Have you found the
same issue in your environment?




On Thu, Sep 22, 2022 at 6:54 PM .贺鹏  wrote:

> Hi, zhike,
>
> It's difficult to give a very clear sequences about how this inconsistency
> happens, but I can give you more details.
>
> This is observed in our production environment. The correct megaflow
> should encap packets with vxlan header and send out, but the action is drop.
> This is usually because the neigh info is not available at the moment when
> the upcall happens.
>
> Normally, the drop action is ephemeral, and reavalidator will later modify
> the megaflow's action into the tnl_push.
>
> But there are a few of cases, only happened 1~2 times in a year, where the
> drop actions will never be replaced by tnl_push.
>
> just like in the commits mentioned,
>
> "The coverage command shows revalidators have dumped several times,
> however the correct actions are not set. This implies that the ukey's
> action does not equal to the meagaflow's, i.e. revalidators think the
> underlying
>
> megaflow's actions are correct however they are not."
>
>
> I do not know how this happened, but I do think this inconsistent processing 
> could be one of the reasons.
>
> Even there is no such bug, I think keeping processing inconsistent is 
> necessary.
>
>
>
>
>
> On Wed, Sep 21, 2022 at 5:57 PM 王志克  wrote:
>
>> Hi Hepeng,
>>
>>
>>
>> Can you please explain the sequence that how this inconsistence could
>> happen? Why you believe the current actions in existing netdev_flow is old?
>>
>>
>>
>> Thanks.
>>
>>
>>
>> Br,
>>
>> wangzhike
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *
>>
>> [ovs-dev,ovs-dev,v2,4/4] dpif-netdev: fix inconsistent processing between
>> ukey and megaflow
>>
>> Message ID
>>
>> 20220604151857.66550-4-hepeng.0...@bytedance.com
>>
>> State
>>
>> New
>>
>> Headers
>>
>> show
>>
>> Series
>>
>> [ovs-dev,ovs-dev,v2,1/4] ofproto-dpif-upcall: fix push_dp_ops
>> <http://patchwork.ozlabs.org/project/openvswitch/list/?series=303324>|
>> expand
>> Checks
>>
>> Context
>>
>> Check
>>
>> Description
>>
>> ovsrobot/apply-robot
>>
>> *warning*
>>
>> apply and check: warning
>> <https://mail.openvswitch.org/pipermail/ovs-build/2022-June/022431.html>
>>
>> ovsrobot/github-robot-_Build_and_Test
>>
>> *success*
>>
>> github build: passed
>> <https://mail.openvswitch.org/pipermail/ovs-build/2022-June/022436.html>
>>
>> ovsrobot/intel-ovs-compilation
>>
>> *success*
>>
>> test: success
>> <https://mail.openvswitch.org/pipermail/ovs-build/2022-June/022439.html>
>> Commit Message
>>
>> Peng He
>> <http://patchwork.ozlabs.org/project/openvswitch/list/?submitter=78087>June
>> 4, 2022, 3:18 p.m. UTC
>>
>> When PMDs perform upcalls, the newly generated ukey will replace
>>
>> the old, however, the newly generated mageflow will be discard
>>
>> to reuse the old one without checking if the actions of new and
>>
>> old are equal.
>>
>>
>>
>> We observe in the production environment that sometimes a megaflow
>>
>> with wrong actions keep staying in datapath. The coverage command shows
>>
>> revalidators have dumped serveral times, however the correct
>>
>> actions are not set. This implies that the ukey's action does not
>>
>> equal to the meagaflow's, i.e. revalidators think the underlying
>>
>> megaflow's actions are correct however they are not.
>>
>>
>>
>> We also check the megaflow using the ofproto/trace command, and the
>>
>> actions are not matched with the ones in the actual magaflow. By
>>
>> performing a revalidator/purge command, the right actions are set.
>>
>>
>>
>> *Signed-off-by: Peng He > >*
>>

Re: [ovs-dev] [External] Re:[ovs-dev, ovs-dev, v2, 4/4] dpif-netdev: fix inconsistent processing between ukey and megaflow

2022-09-22 Thread . 贺鹏

Hi, zhike,

It's difficult to give a very clear sequences about how this inconsistency
happens, but I can give you more details.

This is observed in our production environment. The correct megaflow should
encap packets with vxlan header and send out, but the action is drop.
This is usually because the neigh info is not available at the moment when
the upcall happens.

Normally, the drop action is ephemeral, and reavalidator will later modify
the megaflow's action into the tnl_push.

But there are a few of cases, only happened 1~2 times in a year, where the
drop actions will never be replaced by tnl_push.

just like in the commits mentioned,

"The coverage command shows revalidators have dumped several times,
however the correct actions are not set. This implies that the ukey's
action does not equal to the meagaflow's, i.e. revalidators think the
underlying

megaflow's actions are correct however they are not."


I do not know how this happened, but I do think this inconsistent
processing could be one of the reasons.

Even there is no such bug, I think keeping processing inconsistent is necessary.





On Wed, Sep 21, 2022 at 5:57 PM 王志克  wrote:

> Hi Hepeng,
>
>
>
> Can you please explain the sequence that how this inconsistence could
> happen? Why you believe the current actions in existing netdev_flow is old?
>
>
>
> Thanks.
>
>
>
> Br,
>
> wangzhike
>
>
>
>
>
>
>
>
>
>
> *
>
> [ovs-dev,ovs-dev,v2,4/4] dpif-netdev: fix inconsistent processing between
> ukey and megaflow
>
> Message ID
>
> 20220604151857.66550-4-hepeng.0...@bytedance.com
>
> State
>
> New
>
> Headers
>
> show
>
> Series
>
> [ovs-dev,ovs-dev,v2,1/4] ofproto-dpif-upcall: fix push_dp_ops
> |
> expand
> Checks
>
> Context
>
> Check
>
> Description
>
> ovsrobot/apply-robot
>
> *warning*
>
> apply and check: warning
> 
>
> ovsrobot/github-robot-_Build_and_Test
>
> *success*
>
> github build: passed
> 
>
> ovsrobot/intel-ovs-compilation
>
> *success*
>
> test: success
> 
> Commit Message
>
> Peng He
> June
> 4, 2022, 3:18 p.m. UTC
>
> When PMDs perform upcalls, the newly generated ukey will replace
>
> the old, however, the newly generated mageflow will be discard
>
> to reuse the old one without checking if the actions of new and
>
> old are equal.
>
>
>
> We observe in the production environment that sometimes a megaflow
>
> with wrong actions keep staying in datapath. The coverage command shows
>
> revalidators have dumped serveral times, however the correct
>
> actions are not set. This implies that the ukey's action does not
>
> equal to the meagaflow's, i.e. revalidators think the underlying
>
> megaflow's actions are correct however they are not.
>
>
>
> We also check the megaflow using the ofproto/trace command, and the
>
> actions are not matched with the ones in the actual magaflow. By
>
> performing a revalidator/purge command, the right actions are set.
>
>
>
> *Signed-off-by: Peng He  >*
>
> ---
>
>  lib/dpif-netdev.c | 17 -
>
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> Comments
>
> 0-day Robot
> June
> 4, 2022, 3:44 p.m. UTC | #1 
>
> Bleep bloop.  Greetings Peng He, I am a robot and I have tried out your patch.
>
> Thanks for your contribution.
>
>
>
> I encountered some error that I wasn't expecting.  See the details below.
>
>
>
>
>
> checkpatch:
>
> ERROR: Author Peng He  needs to sign off.
>
> WARNING: Unexpected sign-offs from developers who are not authors or 
> co-authors or committers: Peng He 
>
> Lines checked: 58, Warnings: 1, Errors: 1
>
>
>
>
>
> Please check this out.  If you feel there has been an error, please email 
> acon...@redhat.com
>
>
>
> Thanks,
>
> 0-day Robot
>
> 1638948diff
> 
> mbox
> 
> series 
> Patch
>
> *diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c*
>
> *index ff57b3961..985c25c58 100644*
>
> *--- a/lib/dpif-netdev.c*
>
> *+++ b/lib/dpif-netdev.c*
>
> *@@ -8305,7 +8305,22 @@*  handle_packet_upcall(struct dp_netdev_pmd_thread 
> *pmd,
>
>   * to be locking revalidators out of making flow modifications. */
>
>  ovs_mutex_lock(>flow_mutex);
>
>  netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key,

Re: [ovs-dev] [External] Re: [PATCH 2/5] ipf: add ipf context

2022-02-14 Thread . 贺鹏

On Tue, Feb 15, 2022 at 2:10 AM Ilya Maximets  wrote:
>
> On 1/28/22 17:14, Aaron Conole wrote:
> > From: Peng He 
> >
> > ipf_postprocess will emit packets into the datapath pipeline ignoring
> > the conntrack context, this might casuse weird issues when a packet
> > batch has less space to hold all the fragments belonging to single
> > packet.
> >
> > Given the below ruleest and consider sending a 64K ICMP packet which
> > is splitted into 64 fragments.
> >
> > priority=1,action=drop
> > priority=10,arp,action=normal
> > priority=100,in_port=1,ct_state=-trk,icmp,action=ct(zone=9,table=0)
> > priority=100,in_port=1,ct_state=+new+trk,icmp,action=ct(zone=9,commit),2
> > priority=100,in_port=1,ct_state=-new+est+trk,icmp,action=2
> > priority=100,in_port=2,ct_state=-trk,icmp,action=ct(table=0,zone=9)
> > priority=100,in_port=2,ct_state=+trk+est-new,icmp,action=1
> >
> > Batch 1:
> > the first 32 packets will be buffered in the ipf preprocessing, nothing
> > more proceeds.
> >
> > Batch 2:
> > the second 32 packets succeed the fragment reassembly and goes to ct
> > and ipf_post will emits the first 32 packets due to the limit of batch
> > size.
> >
> > the first 32 packets goes to the datapath again due to the
> > recirculation, and again buffered at ipf preprocessing before ct commit,
> > then the ovs tries to call ct commit as well as ipf_postprocessing which 
> > emits
> > the last 32 packets, in this case the last 32 packets will follow
> > the current actions which will be sent to port 2 directly without
> > recirculation and going to ipf preprocssing and ct commit again.
> >
> > This will cause the first 32 packets never get the chance to
> > reassemble and evevntually this large ICMP packets fail to transmit.
> >
> > this patch fixes this issue by adding firstly ipf context to avoid
> > ipf_postprocessing emits packets in the wrong context. Then by
> > re-executing the action list again to emit the last 32 packets
> > in the right context to correctly transmitting multiple fragments.
> >
> > This might be one of the root cause why the OVS log warns:
> >
> > "
> > skipping output to input port on bridge br-int while processing
> > "
> >
> > As a pkt might enter into different ipf context.
> >
> > One implementation trick of implementing ipf_ctx_eq, is to use ipf_list's
> > key instead of the first frag's metadata, this can reduce quite a lot of
> > cache misses as at the time of calling ipf_ctx_eq, ipf_list is cache-hot.
> > On x86_64, this trick saves 2% of CPU usage of ipf processing.
> >
> > Signed-off-by: Peng He 
> > Co-authored-by: Mike Pattrick 
> > Signed-off-by: Aaron Conole 
> > ---
> >  lib/conntrack.c |  9 ---
> >  lib/conntrack.h | 15 ++--
> >  lib/dpif-netdev.c   | 52 +
> >  lib/ipf.c   | 39 ---
> >  lib/ipf.h   | 11 +++--
> >  tests/system-traffic.at | 33 ++
> >  tests/test-conntrack.c  |  6 ++---
> >  7 files changed, 135 insertions(+), 30 deletions(-)
> >
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index 33a1a92953..72999f1aed 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -1434,14 +1434,14 @@ process_one(struct conntrack *ct, struct dp_packet 
> > *pkt,
> >   * connection tables.  'setmark', if not NULL, should point to a two
> >   * elements array containing a value and a mask to set the connection mark.
> >   * 'setlabel' behaves similarly for the connection label.*/
> > -int
> > +bool
> >  conntrack_execute(struct conntrack *ct, struct dp_packet_batch *pkt_batch,
> >ovs_be16 dl_type, bool force, bool commit, uint16_t zone,
> >const uint32_t *setmark,
> >const struct ovs_key_ct_labels *setlabel,
> >ovs_be16 tp_src, ovs_be16 tp_dst, const char *helper,
> >const struct nat_action_info_t *nat_action_info,
> > -  long long now, uint32_t tp_id)
> > +  long long now, uint32_t tp_id, struct ipf_ctx *ipf_ctx)
> >  {
> >  ipf_preprocess_conntrack(ct->ipf, pkt_batch, now, dl_type, zone,
> >   ct->hash_basis);
> > @@ -1468,9 +1468,8 @@ conntrack_execute(struct conntrack *ct, struct 
> > dp_packet_batch *pkt_batch,
> >  }
> >  }
> >
> > -ipf_postprocess_conntrack(ct->ipf, pkt_batch, now, dl_type);
> > -
> > -return 0;
> > +return ipf_postprocess_conntrack(ct->ipf, pkt_batch, ipf_ctx, \
> > + now, dl_type);
> >  }
> >
> >  void
> > diff --git a/lib/conntrack.h b/lib/conntrack.h
> > index 9553b188a4..211efad3f3 100644
> > --- a/lib/conntrack.h
> > +++ b/lib/conntrack.h
> > @@ -64,6 +64,7 @@
> >  struct dp_packet_batch;
> >
> >  struct conntrack;
> > +struct ipf_ctx;
> >
> >  union ct_addr {
> >  ovs_be32 ipv4;
> > @@ -88,13 +89,13 @@ struct nat_action_info_t {
> >  struct conntrack

Re: [ovs-dev] [ovs-dev v3] ofproto: add refcount to ofproto to fix ofproto use-after-free

2022-01-18 Thread 贺鹏

Ah,
found a bug, will submit a v4.

Peng He  于2022年1月18日周二 21:08写道：

> From hepeng:
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0...@bytedance.com/#2487473
>
> also from guohongzhi :
>
> http://patchwork.ozlabs.org/project/openvswitch/patch/20200306130555.19884-1-guohongz...@huawei.com/
>
> also from a discussion about the mixing use of RCU and refcount in the mail
> list with Ilya Maximets, William Tu, Ben Pfaf, and Gaëtan Rivet.
>
> A summary, as quoted from Ilya:
>
> "
> RCU for ofproto was introduced for one
> and only one reason - to avoid freeing ofproto while rules are still
> alive.  This was done in commit f416c8d61601 ("ofproto: RCU postpone
> rule destruction.").  The goal was to allow using rules without
> refcounting them within a single grace period.  And that forced us
> to postpone destruction of the ofproto for a single grace period.
> Later commit 39c9459355b6 ("Use classifier versioning.") made it
> possible for rules to be alive for more than one grace period, so
> the commit made ofproto wait for 2 grace periods by double postponing.
> As we can see now, that wasn't enough and we have to wait for more
> than 2 grace periods in certain cases.
> "
>
> In a short, the ofproto should have a longer life time than rule, if
> the rule lasts for more than 2 grace periods, the ofproto should live
> longer to ensure rule->ofproto is valid. It's hard to predict how long
> a ofproto should live, thus we need to use refcount on ofproto to make
> things easy. The controversial part is that we have already used RCU
> postpone
> to delay ofproto destrution, if we have to add refcount, is it simpler to
> use just refcount without RCU postpone?
>
> IMO, I think going back to the pure refcount solution is more
> complicated than mixing using both.
>
> Gaëtan Rive asks some questions on guohongzhi's v2 patch:
>
> during ofproto_rule_create, should we use ofproto_ref
> or ofproto_try_ref? how can we make sure the ofproto is alive?
>
> by using RCU, I think the answer is that, when you see a pointer to
> ofproto, it's alive at least in this grace peroid, so it is ok to use
> ofproto_ref instead of ofproto_try_ref. This shows that by mixing use
> of RCU and refcount we can save a lot of work to query if ofproto is
> alive.
>
> In short, the RCU part makes sure the ofproto is alive when we use it,
> and the refcount part makes sure it lives longer enough.
>
> Also regarding a new patch filed recently, people are now making use
> of RCU to protect ofproto:
>
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/1638530715-44436-1-git-send-email-wangyunj...@huawei.com/
>
> In this patch, I have merged guohongzhi's patch and mine, and fixes
> accoring to the previous comments.
>
> Signed-off-by: Peng He 
> Signed-off-by: guohongzhi 
> ---
>  ofproto/ofproto-dpif-xlate-cache.c |  1 +
>  ofproto/ofproto-dpif-xlate.c   |  1 +
>  ofproto/ofproto-dpif.c |  4 +--
>  ofproto/ofproto-provider.h |  2 ++
>  ofproto/ofproto.c  | 45 ++
>  ofproto/ofproto.h  |  3 ++
>  6 files changed, 49 insertions(+), 7 deletions(-)
>
> diff --git a/ofproto/ofproto-dpif-xlate-cache.c
> b/ofproto/ofproto-dpif-xlate-cache.c
> index dcc91cb38..0deee365d 100644
> --- a/ofproto/ofproto-dpif-xlate-cache.c
> +++ b/ofproto/ofproto-dpif-xlate-cache.c
> @@ -209,6 +209,7 @@ xlate_cache_clear_entry(struct xc_entry *entry)
>  {
>  switch (entry->type) {
>  case XC_TABLE:
> +ofproto_unref(&(entry->table.ofproto->up));
>  break;
>  case XC_RULE:
>  ofproto_rule_unref(>rule->up);
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 6fb59e170..0380928a9 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -3025,6 +3025,7 @@ xlate_normal(struct xlate_ctx *ctx)
>
>  /* Save just enough info to update mac learning table later. */
>  entry = xlate_cache_add_entry(ctx->xin->xcache, XC_NORMAL);
> +ofproto_ref(>xbridge->ofproto->up);
>  entry->normal.ofproto = ctx->xbridge->ofproto;
>  entry->normal.in_port = flow->in_port.ofp_port;
>  entry->normal.dl_src = flow->dl_src;
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index 8143dd965..6816896dd 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -4472,8 +4472,8 @@ rule_dpif_lookup_from_table(struct ofproto_dpif
> *ofproto,
>  }
>  if (xcache) {
>  struct xc_entry *entry;
> -
>  entry = xlate_cache_add_entry(xcache, XC_TABLE);
> +ofproto_ref(>up);
>  entry->table.ofproto = ofproto;
>  entry->table.id = *table_id;
>  entry->table.match = true;
> @@ -4508,8 +4508,8 @@ rule_dpif_lookup_from_table(struct ofproto_dpif
> *ofproto,
>  }
>  if (xcache) {
>

Re: [ovs-dev] [PATCH] ofproto: fix use-after-free for "ofproto".

2022-01-17 Thread 贺鹏

Hi， yunjian and Ilya,

this patch reminds me of the discussion we have in March last year, and
during the discussion, you have spotted this
thread-safety issue of uuid map. Unfortunately, in that email, you did not
reply to the mailist, so I cannot give a link to
the email. I attach it as a reference.

I quote it here:

"
That is an interesting example here...
I can't help but notice that this function is typically called
from different handler or pmd threads and modified by the main thread
while upcalls enabled.  And hmap is not a thread-safe structure.
I guess, we have another possible problem here.  We need to protect
at least this hmap and the other one with rw lock or something...
Am I right or am I missing something?  What else we need to protect?
"

And this patch fixes exactly the problem you stated.



wangyunjian via dev  于2022年1月8日周六 17:18写道：

> Friendly ping.
>
> > -Original Message-
> > From: wangyunjian
> > Sent: Friday, December 3, 2021 7:25 PM
> > To: d...@openvswitch.org; i.maxim...@ovn.org
> > Cc: dingxiaoxiong ; xudingke
> > ; wangyunjian ; Justin
> > Pettit 
> > Subject: [ovs-dev] [PATCH] ofproto: fix use-after-free for "ofproto".
> >
> > When handler threads lookup a "ofproto" and use it, main thread maybe
> > remove and free the "ofproto" at the same time. The "ofproto" has not
> > been protected well, which can lead to an OVS crash.
> >
> > This patch fixes this by making the "ofproto" lookup RCU-safe by using
> > cmap instead of hmap and moving remove "ofproto" call before
> > xlate_txn_commit().
> >
> > (gdb) bt
> >   hmap_next (hmap.h:398)
> >   seq_wake_waiters (seq.c:326)
> >   seq_change_protected (seq.c:134)
> >   seq_change (seq.c:144)
> >   ofproto_dpif_send_async_msg (ofproto_dpif.c:263)
> >   process_upcall (ofproto_dpif_upcall.c:1782)
> >   recv_upcalls (ofproto_dpif_upcall.c:1026)
> >   udpif_upcall_handler (ofproto/ofproto_dpif_upcall.c:945)
> >   ovsthread_wrapper (ovs_thread.c:734)
> >
> > Cc: Justin Pettit 
> > Fixes: fcb9579be3c7 ("ofproto: Add 'ofproto_uuid' and 'ofp_in_port' to
> user
> > action cookie.")
> > Signed-off-by: Yunjian Wang 
> > ---
> >  ofproto/ofproto-dpif.c | 12 ++--
> >  ofproto/ofproto-dpif.h |  2 +-
> >  2 files changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> > index cba49a99e..aa8d32f81 100644
> > --- a/ofproto/ofproto-dpif.c
> > +++ b/ofproto/ofproto-dpif.c
> > @@ -216,8 +216,7 @@ static struct hmap all_ofproto_dpifs_by_name =
> >
> > HMAP_INITIALIZER(_ofproto_dpifs_by_name);
> >
> >  /* All existing ofproto_dpif instances, indexed by ->uuid. */
> > -static struct hmap all_ofproto_dpifs_by_uuid =
> > -
> > HMAP_INITIALIZER(_ofproto_dpifs_by_uuid);
> > +static struct cmap all_ofproto_dpifs_by_uuid = CMAP_INITIALIZER;
> >
> >  static bool ofproto_use_tnl_push_pop = true;
> >  static void ofproto_unixctl_init(void);
> > @@ -1682,7 +1681,7 @@ construct(struct ofproto *ofproto_)
> >  hmap_insert(_ofproto_dpifs_by_name,
> >  >all_ofproto_dpifs_by_name_node,
> >  hash_string(ofproto->up.name, 0));
> > -hmap_insert(_ofproto_dpifs_by_uuid,
> > +cmap_insert(_ofproto_dpifs_by_uuid,
> >  >all_ofproto_dpifs_by_uuid_node,
> >  uuid_hash(>uuid));
> >  memset(>stats, 0, sizeof ofproto->stats);
> > @@ -1778,12 +1777,13 @@ destruct(struct ofproto *ofproto_, bool del)
> >  ofproto->backer->need_revalidate = REV_RECONFIGURE;
> >  xlate_txn_start();
> >  xlate_remove_ofproto(ofproto);
> > +cmap_remove(_ofproto_dpifs_by_uuid,
> > +>all_ofproto_dpifs_by_uuid_node,
> > +uuid_hash(>uuid));
>

I guess some comments are needed here, you actually make use of
the rcu_synchronize in the xlate_txn_commit to avoid access to the
ofproto from other thread through uuid map.


> >  xlate_txn_commit();
> >
> >  hmap_remove(_ofproto_dpifs_by_name,
> >  >all_ofproto_dpifs_by_name_node);
> > -hmap_remove(_ofproto_dpifs_by_uuid,
> > ->all_ofproto_dpifs_by_uuid_node);
> >
> >  OFPROTO_FOR_EACH_TABLE (table, >up) {
> >  CLS_FOR_EACH (rule, up.cr, >cls) {
> > @@ -5781,7 +5781,7 @@ ofproto_dpif_lookup_by_uuid(const struct uuid
> > *uuid)
> >  {
> >  struct ofproto_dpif *ofproto;
> >
> > -HMAP_FOR_EACH_WITH_HASH (ofproto,
> > all_ofproto_dpifs_by_uuid_node,
> > +CMAP_FOR_EACH_WITH_HASH (ofproto,
> > all_ofproto_dpifs_by_uuid_node,
> >   uuid_hash(uuid),
> > _ofproto_dpifs_by_uuid) {
> >  if (uuid_equals(>uuid, uuid)) {
> >  return ofproto;
> > diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
> > index 191cfcb0d..fba03f2cc 100644
> > --- a/ofproto/ofproto-dpif.h
> > +++ b/ofproto/ofproto-dpif.h
> > @@ -295,7 +295,7 @@ struct ofproto_dpif {
> >  struct hmap_node all_ofproto_dpifs_by_name_node;
> >
> >  /* In 'all_ofproto_dpifs_by_uuid'. */
> > -

Re: [ovs-dev] [PATCH] netdev-dpdk: prepare for tso offload for tx copy packet

2022-01-17 Thread 贺鹏

Hi, Harold,

I am curious how this bug is found as I have an unsolved bug that I believe
it's related to
some memory issues and may be related to NIC's problem.


Harold Huang  于2022年1月13日周四 16:24写道：

> From: Harold Huang 
>
> When one flow is output to multiple egress ports, OVS copy the packets
> and send the copy packets to the intermediate ports. The original packets
> is sent to the last port. If the intermediate port is a dpdk port, the copy
> packets should also be prepared for tso offload.
>
> Fixes: 29cf9c1b3b ("userspace: Add TCP Segmentation Offload support")
> Signed-off-by: Harold Huang 
> ---
>  lib/netdev-dpdk.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 6782d3e8f..83029405e 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -2737,10 +2737,11 @@ dpdk_pktmbuf_alloc(struct rte_mempool *mp,
> uint32_t data_len)
>  }
>
>  static struct dp_packet *
> -dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet
> *pkt_orig)
> +dpdk_copy_dp_packet_to_mbuf(struct netdev_dpdk *dev, struct dp_packet
> *pkt_orig)
>  {
>  struct rte_mbuf *mbuf_dest;
>  struct dp_packet *pkt_dest;
> +struct rte_mempool *mp = dev->dpdk_mp->mp;
>  uint32_t pkt_len;
>
>  pkt_len = dp_packet_size(pkt_orig);
> @@ -2761,11 +2762,9 @@ dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp,
> struct dp_packet *pkt_orig)
>  memcpy(_dest->l2_pad_size, _orig->l2_pad_size,
> sizeof(struct dp_packet) - offsetof(struct dp_packet,
> l2_pad_size));
>
> -if (mbuf_dest->ol_flags & RTE_MBUF_F_TX_L4_MASK) {
> -mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
> -- (char *)dp_packet_eth(pkt_dest);
> -mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
> -- (char *) dp_packet_l3(pkt_dest);
> +if (!netdev_dpdk_prep_hwol_packet(dev, mbuf_dest)) {
>

perhaps check userspace_tso_enabled()?


> +rte_pktmbuf_free(mbuf_dest);
> +return NULL;
>  }
>
>  return pkt_dest;
> @@ -2813,7 +2812,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
> struct dp_packet_batch *batch)
>  continue;
>  }
>
> -pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp,
> packet);
> +pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev, packet);
>  if (OVS_UNLIKELY(!pkts[txcnt])) {
>  dropped = cnt - i;
>  break;
> --
> 2.27.0
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>


-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [ovs-dev v8 1/2] ipf: add ipf context

2022-01-17 Thread 贺鹏

Hi,

this patch can pass all the system-userspace testsuite if one of my another
patch
gets merged.

So maybe review this patch "[ovs-dev] ofproto-dpif: fix
packet_execute_prepare" first.


Peng He  于2022年1月18日周二 11:28写道：

> From: Peng He 
>
> ipf_postprocess will emit packets into the datapath pipeline ignoring
> the conntrack context, this might casuse weird issues when a packet
> batch has less space to hold all the fragments belonging to single
> packet.
>
> Given the below ruleest and consider sending a 64K ICMP packet which
> is splitted into 64 fragments.
>
> priority=1,action=drop
> priority=10,arp,action=normal
> priority=100,in_port=1,ct_state=-trk,icmp,action=ct(zone=9,table=0)
> priority=100,in_port=1,ct_state=+new+trk,icmp,action=ct(zone=9,commit),2
> priority=100,in_port=1,ct_state=-new+est+trk,icmp,action=2
> priority=100,in_port=2,ct_state=-trk,icmp,action=ct(table=0,zone=9)
> priority=100,in_port=2,ct_state=+trk+est-new,icmp,action=1
>
> Batch 1:
> the first 32 packets will be buffered in the ipf preprocessing, nothing
> more proceeds.
>
> Batch 2:
> the second 32 packets succeed the fragment reassembly and goes to ct
> and ipf_post will emits the first 32 packets due to the limit of batch
> size.
>
> the first 32 packets goes to the datapath again due to the
> recirculation, and again buffered at ipf preprocessing before ct commit,
> then the ovs tries to call ct commit as well as ipf_postprocessing which
> emits
> the last 32 packets, in this case the last 32 packets will follow
> the current actions which will be sent to port 2 directly without
> recirculation and going to ipf preprocssing and ct commit again.
>
> This will cause the first 32 packets never get the chance to
> reassemble and evevntually this large ICMP packets fail to transmit.
>
> this patch fixes this issue by adding firstly ipf context to avoid
> ipf_postprocessing emits packets in the wrong context. Then by
> re-executing the action list again to emit the last 32 packets
> in the right context to correctly transmitting multiple fragments.
>
> This might be one of the root cause why the OVS log warns:
>
> "
> skipping output to input port on bridge br-int while processing
> "
>
> As a pkt might enter into different ipf context.
>
> One implementation trick of implementing ipf_ctx_eq, is to use ipf_list's
> key instead of the first frag's metadata, this can reduce quite a lot of
> cache misses as at the time of calling ipf_ctx_eq, ipf_list is cache-hot.
> On x86_64, this trick saves 2% of CPU usage of ipf processing.
>
> Signed-off-by: Peng He 
> Co-authored-by: Mike Pattrick 
> ---
>  lib/conntrack.c |  9 ---
>  lib/conntrack.h | 15 ++--
>  lib/dpif-netdev.c   | 52 +
>  lib/ipf.c   | 39 ---
>  lib/ipf.h   | 11 +++--
>  tests/system-traffic.at | 33 ++
>  tests/test-conntrack.c  |  6 ++---
>  7 files changed, 135 insertions(+), 30 deletions(-)
>
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 33a1a9295..72999f1ae 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -1434,14 +1434,14 @@ process_one(struct conntrack *ct, struct dp_packet
> *pkt,
>   * connection tables.  'setmark', if not NULL, should point to a two
>   * elements array containing a value and a mask to set the connection
> mark.
>   * 'setlabel' behaves similarly for the connection label.*/
> -int
> +bool
>  conntrack_execute(struct conntrack *ct, struct dp_packet_batch *pkt_batch,
>ovs_be16 dl_type, bool force, bool commit, uint16_t
> zone,
>const uint32_t *setmark,
>const struct ovs_key_ct_labels *setlabel,
>ovs_be16 tp_src, ovs_be16 tp_dst, const char *helper,
>const struct nat_action_info_t *nat_action_info,
> -  long long now, uint32_t tp_id)
> +  long long now, uint32_t tp_id, struct ipf_ctx *ipf_ctx)
>  {
>  ipf_preprocess_conntrack(ct->ipf, pkt_batch, now, dl_type, zone,
>   ct->hash_basis);
> @@ -1468,9 +1468,8 @@ conntrack_execute(struct conntrack *ct, struct
> dp_packet_batch *pkt_batch,
>  }
>  }
>
> -ipf_postprocess_conntrack(ct->ipf, pkt_batch, now, dl_type);
> -
> -return 0;
> +return ipf_postprocess_conntrack(ct->ipf, pkt_batch, ipf_ctx, \
> + now, dl_type);
>  }
>
>  void
> diff --git a/lib/conntrack.h b/lib/conntrack.h
> index 9553b188a..211efad3f 100644
> --- a/lib/conntrack.h
> +++ b/lib/conntrack.h
> @@ -64,6 +64,7 @@
>  struct dp_packet_batch;
>
>  struct conntrack;
> +struct ipf_ctx;
>
>  union ct_addr {
>  ovs_be32 ipv4;
> @@ -88,13 +89,13 @@ struct nat_action_info_t {
>  struct conntrack *conntrack_init(void);
>  void conntrack_destroy(struct conntrack *);
>
> -int conntrack_execute(struct conntrack *ct, struct dp_packet_batch
> *pkt_batch,

Re: [ovs-dev] [ovs-dev v7 2/3] ipf: micro-optimize ipf_ctx_eq

2022-01-12 Thread 贺鹏

Hi,

Thanks for the reviewing.

Aaron Conole  于2022年1月12日周三 05:37写道：

> Peng He  writes:
>
> > by using ipf_list's key instead of first frags' metadata can reduce
> > quite a lot of cache access as by the time calling ipf_ctx_eq, ipf_list
> > is cache-hot.
> >
> > Signed-off-by: Peng He 
> > ---
>
> Is there a reason not to just fold this into 1/3?  It's strange to
> introduce something and immediately re-write it in the same series.
>
> I don't see a reason not to fold this into 1/3.  The performance
> optimization isn't difficult to understand, and doesn't seem awkward.
>

will merge it into the first patch.


>
> >  lib/ipf.c | 18 +++---
> >  1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/lib/ipf.c b/lib/ipf.c
> > index 1fd3d8d30..26de7bbcf 100644
> > --- a/lib/ipf.c
> > +++ b/lib/ipf.c
> > @@ -1047,14 +1047,17 @@ ipf_send_frags_in_list(struct ipf *ipf, struct
> ipf_list *ipf_list,
> >  }
> >
> >  static bool
> > -ipf_ctx_eq(struct ipf_list *ipf_list, struct ipf_ctx *ctx)
> > +ipf_ctx_eq(struct ipf_list *ipf_list, struct ipf_ctx *ctx,
> > +   struct dp_packet *pkt)
> >  {
> > -struct dp_packet *pkt =
> > -ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
> > -
> > -if (pkt->md.recirc_id != ctx->recirc_id ||
> > +/* NOTE: not using the first pkt's metadata, use ipf_list instead,
> > + * using pkt's metadata causes too much cache miss,
> > + * using ipf_list instead drops ipf_postprocessing cpu usage
> > + * from 4 percent to 2 percent. ^_^.
> > + */
>
> Maybe this note is better in the commit log.  We can't actually be sure
> about each system's underlying memory architecture (for example, does
> this hold true even on ARM?).
>


Ok, will put it into the commit log.


>
> > +if (ipf_list->key.recirc_id != ctx->recirc_id ||
> >  pkt->md.in_port.odp_port != ctx->in_port.odp_port ||
> > -pkt->md.ct_zone != ctx->zone) {
> > +ipf_list->key.zone != ctx->zone) {
> >  return false;
> >  }
> >  return true;
> > @@ -1077,7 +1080,8 @@ ipf_send_completed_frags(struct ipf *ipf, struct
> dp_packet_batch *pb,
> >
> >  LIST_FOR_EACH_SAFE (ipf_list, next, list_node,
> >frag_complete_list) {
> >
> > -if (ctx && !ipf_ctx_eq(ipf_list, ctx)) {
> > +if (ctx && !ipf_ctx_eq(ipf_list, ctx, \
> > +ipf_list->frag_list[ipf_list->last_sent_idx +
> 1].pkt)) {
> >  continue;
> >  }
>
>

-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [ovs-dev v6 3/3] ipf: ipf_preprocess_conntrack should also consider ipf_ctx

2022-01-09 Thread 贺鹏

Hi,

after I applied this patch, all the failed tests now passed.


diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index cba49a99e..a7df7db6f 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -4972,6 +4972,7 @@ packet_execute_prepare(struct ofproto *ofproto_,
 /* Fix up in_port. */
 ofproto_dpif_set_packet_odp_port(ofproto, opo->flow->in_port.ofp_port,
  opo->packet);
+opo->flow->in_port = opo->packet->md.in_port;

 ofproto_dpif_packet_out_delete(aux);
 opo->aux = execute;

and I found a bug in the current patch, I've changed back the pkt pointer
that is used in ipf_ctx_eq

-if (ctx && !ipf_ctx_eq(ipf_list, ctx)) {
+if (ctx && !ipf_ctx_eq(ipf_list, ctx, \
+ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt))
{
 continue;
 }

in the v6 patch, it uses ipf_list->reass_execute_ctx, however, this pointer
is freed. will post a v7 patch and fixes this.

about the packet_execute_prepare's bug, I am not sure the flow->execute
should be changed into odp port.

Mike Pattrick  于2022年1月8日周六 06:03写道：

> On Thu, 2021-12-30 at 08:03 +, Peng He wrote:
> > considering a multi-thread PMD setting, when the frags are
> > reassembled
> > in one PMD, another thread might call *ipf_execute_reass_pkts* and
> > 'steal'
> > the reassembled packets into its ipf ctx, then this reassembled
> > packet will enter into another ipf context and causes errors.
> >
> > This happends when there are multiple CT zones, and frags are
> > reassembled in ct(zone=X) might be 'stealed' into the ct(zone=Y).
> >
> > Signed-off-by: Peng He 
> > ---
> >  lib/conntrack.c |  2 +-
> >  lib/ipf.c   | 10 --
> >  lib/ipf.h   |  1 +
> >  3 files changed, 10 insertions(+), 3 deletions(-)
> >
>
> Hello Peng,
>
> This patch appears to make a number of tests fail. For example:
>
> make check-system-userspace TESTSUITEFLAGS="-v 65 66 74 75 76 77 78 79
> 122"
>
>
> -M
>
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index 72999f1ae..aafa6b536 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -1443,7 +1443,7 @@ conntrack_execute(struct conntrack *ct, struct
> > dp_packet_batch *pkt_batch,
> >const struct nat_action_info_t *nat_action_info,
> >long long now, uint32_t tp_id, struct ipf_ctx
> > *ipf_ctx)
> >  {
> > -ipf_preprocess_conntrack(ct->ipf, pkt_batch, now, dl_type, zone,
> > +ipf_preprocess_conntrack(ct->ipf, pkt_batch, ipf_ctx, now,
> > dl_type, zone,
> >   ct->hash_basis);
> >
> >  struct dp_packet *packet;
> > diff --git a/lib/ipf.c b/lib/ipf.c
> > index ef302e59c..bca34f59c 100644
> > --- a/lib/ipf.c
> > +++ b/lib/ipf.c
> > @@ -1139,7 +1139,8 @@ ipf_send_expired_frags(struct ipf *ipf, struct
> > dp_packet_batch *pb,
> >  /* Adds a reassmebled packet to a packet batch to be processed by
> > the caller.
> >   */
> >  static void
> > -ipf_execute_reass_pkts(struct ipf *ipf, struct dp_packet_batch *pb)
> > +ipf_execute_reass_pkts(struct ipf *ipf, struct dp_packet_batch *pb,
> > +   struct ipf_ctx *ctx)
> >  {
> >  if (ovs_list_is_empty(>reassembled_pkt_list)) {
> >  return;
> > @@ -1149,6 +1150,10 @@ ipf_execute_reass_pkts(struct ipf *ipf, struct
> > dp_packet_batch *pb)
> >  struct reassembled_pkt *rp, *next;
> >
> >  LIST_FOR_EACH_SAFE (rp, next, rp_list_node, 
> > >reassembled_pkt_list) {
> > +if (ctx && !ipf_ctx_eq(rp->list, ctx, rp->pkt)) {
> > +continue;
> > +}
> > +
> >  if (!rp->list->reass_execute_ctx &&
> >  ipf_dp_packet_batch_add(pb, rp->pkt, false)) {
> >  rp->list->reass_execute_ctx = rp->pkt;
> > @@ -1250,6 +1255,7 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
> >   * be added to the batch to be sent through conntrack. */
> >  void
> >  ipf_preprocess_conntrack(struct ipf *ipf, struct dp_packet_batch
> > *pb,
> > + struct ipf_ctx *ipf_ctx,
> >   long long now, ovs_be16 dl_type,
> >   uint16_t zone, uint32_t hash_basis)
> >  {
> > @@ -1258,7 +1264,7 @@ ipf_preprocess_conntrack(struct ipf *ipf,
> > struct dp_packet_batch *pb,
> >  }
> >
> >  if (ipf_get_enabled(ipf) || atomic_count_get(>nfrag)) {
> > -ipf_execute_reass_pkts(ipf, pb);
> > +ipf_execute_reass_pkts(ipf, pb, ipf_ctx);
> >  }
> >  }
> >
> > diff --git a/lib/ipf.h b/lib/ipf.h
> > index 7bbf453af..8974f15ae 100644
> > --- a/lib/ipf.h
> > +++ b/lib/ipf.h
> > @@ -49,6 +49,7 @@ struct ipf_ctx {
> >  struct ipf *ipf_init(void);
> >  void ipf_destroy(struct ipf *ipf);
> >  void ipf_preprocess_conntrack(struct ipf *ipf, struct
> > dp_packet_batch *pb,
> > +  struct ipf_ctx *ctx,
> >long long now, ovs_be16 dl_type,
> > uint16_t zone,
> >

[ovs-dev] Fwd: [ovs-dev v6 3/3] ipf: ipf_preprocess_conntrack should also consider ipf_ctx

2022-01-08 Thread 贺鹏

just forget to reply all.

-- Forwarded message -
发件人： 贺鹏 
Date: 2022年1月9日周日 10:59
Subject: Re: [ovs-dev v6 3/3] ipf: ipf_preprocess_conntrack should also
consider ipf_ctx
To: Mike Pattrick 


Hi, Mike,

thank you for the review and testing.

it seems that all these broken tests are related to a bug in packet-out
action execution in OVS.
it seems that the packet-out uses the OFP to fill the dp_packet metadata,
and this in_port value has
not been changed during the execution in the datapath, where the value
should be translated into the ODP.

in the test the ofp values equals to 1, while the corresponding odp port
equals to 2, which makes ipf_ctx_eq test failed.

I am trying to think about a solution.
any idaes?



Mike Pattrick  于2022年1月8日周六 06:03写道：

> On Thu, 2021-12-30 at 08:03 +, Peng He wrote:
> > considering a multi-thread PMD setting, when the frags are
> > reassembled
> > in one PMD, another thread might call *ipf_execute_reass_pkts* and
> > 'steal'
> > the reassembled packets into its ipf ctx, then this reassembled
> > packet will enter into another ipf context and causes errors.
> >
> > This happends when there are multiple CT zones, and frags are
> > reassembled in ct(zone=X) might be 'stealed' into the ct(zone=Y).
> >
> > Signed-off-by: Peng He 
> > ---
> >  lib/conntrack.c |  2 +-
> >  lib/ipf.c   | 10 --
> >  lib/ipf.h   |  1 +
> >  3 files changed, 10 insertions(+), 3 deletions(-)
> >
>
> Hello Peng,
>
> This patch appears to make a number of tests fail. For example:
>
> make check-system-userspace TESTSUITEFLAGS="-v 65 66 74 75 76 77 78 79
> 122"
>
>
> -M
>
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index 72999f1ae..aafa6b536 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -1443,7 +1443,7 @@ conntrack_execute(struct conntrack *ct, struct
> > dp_packet_batch *pkt_batch,
> >const struct nat_action_info_t *nat_action_info,
> >long long now, uint32_t tp_id, struct ipf_ctx
> > *ipf_ctx)
> >  {
> > -ipf_preprocess_conntrack(ct->ipf, pkt_batch, now, dl_type, zone,
> > +ipf_preprocess_conntrack(ct->ipf, pkt_batch, ipf_ctx, now,
> > dl_type, zone,
> >   ct->hash_basis);
> >
> >  struct dp_packet *packet;
> > diff --git a/lib/ipf.c b/lib/ipf.c
> > index ef302e59c..bca34f59c 100644
> > --- a/lib/ipf.c
> > +++ b/lib/ipf.c
> > @@ -1139,7 +1139,8 @@ ipf_send_expired_frags(struct ipf *ipf, struct
> > dp_packet_batch *pb,
> >  /* Adds a reassmebled packet to a packet batch to be processed by
> > the caller.
> >   */
> >  static void
> > -ipf_execute_reass_pkts(struct ipf *ipf, struct dp_packet_batch *pb)
> > +ipf_execute_reass_pkts(struct ipf *ipf, struct dp_packet_batch *pb,
> > +   struct ipf_ctx *ctx)
> >  {
> >  if (ovs_list_is_empty(>reassembled_pkt_list)) {
> >  return;
> > @@ -1149,6 +1150,10 @@ ipf_execute_reass_pkts(struct ipf *ipf, struct
> > dp_packet_batch *pb)
> >  struct reassembled_pkt *rp, *next;
> >
> >  LIST_FOR_EACH_SAFE (rp, next, rp_list_node, 
> > >reassembled_pkt_list) {
> > +if (ctx && !ipf_ctx_eq(rp->list, ctx, rp->pkt)) {
> > +continue;
> > +}
> > +
> >  if (!rp->list->reass_execute_ctx &&
> >  ipf_dp_packet_batch_add(pb, rp->pkt, false)) {
> >  rp->list->reass_execute_ctx = rp->pkt;
> > @@ -1250,6 +1255,7 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
> >   * be added to the batch to be sent through conntrack. */
> >  void
> >  ipf_preprocess_conntrack(struct ipf *ipf, struct dp_packet_batch
> > *pb,
> > + struct ipf_ctx *ipf_ctx,
> >   long long now, ovs_be16 dl_type,
> >   uint16_t zone, uint32_t hash_basis)
> >  {
> > @@ -1258,7 +1264,7 @@ ipf_preprocess_conntrack(struct ipf *ipf,
> > struct dp_packet_batch *pb,
> >  }
> >
> >  if (ipf_get_enabled(ipf) || atomic_count_get(>nfrag)) {
> > -ipf_execute_reass_pkts(ipf, pb);
> > +ipf_execute_reass_pkts(ipf, pb, ipf_ctx);
> >  }
> >  }
> >
> > diff --git a/lib/ipf.h b/lib/ipf.h
> > index 7bbf453af..8974f15ae 100644
> > --- a/lib/ipf.h
> > +++ b/lib/ipf.h
> > @@ -49,6 +49,7 @@ struct ipf_ctx {
> >  struct ipf *ipf_init(void);
> >  void ipf_destroy(struct ipf *ipf);
> >  void ipf_preprocess_conntrack(struct ipf *ipf, struct
> > dp_packet_batch *pb,
> > +  struct ipf_ctx *ctx,
> >long long now, ovs_be16 dl_type,
> > uint16_t zone,
> >uint32_t hash_basis);
> >
>
>

-- 
hepeng


-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [ovs-dev v6 3/3] ipf: ipf_preprocess_conntrack should also consider ipf_ctx

2022-01-08 Thread 贺鹏

Hi, Mike,

The bug in the last email is not correct.

in packet_execute_prepare functions,
it has changed dp_packet md's port into the odp port,

but in the execute->flow, the value is still the ofp port.

 static void
 packet_execute_prepare(struct ofproto *ofproto_,
struct ofproto_packet_out *opo)
 OVS_REQUIRES(ofproto_mutex)
{

 pkt_metadata_from_flow(>packet->md, opo->flow);
 execute->packet = opo->packet;
 execute->flow = opo->flow;

...
 /* Fix up in_port. */
 ofproto_dpif_set_packet_odp_port(ofproto, opo->flow->in_port.ofp_port,
  opo->packet);


}

but ipf_ctx is formed using the flow's value instead of packet's value,
this causes the test suite to fail.

the reason of ipf_ctx using the flow's value is at the second time of
calling conntrack_execute, there are no packets in the batch, the batch is
empty and only used as a container for the ipf to inject packets.

I guess maybe we should also change the execute->flow's in_port into the
odp value?

any ideas?

Mike Pattrick  于2022年1月8日周六 06:03写道：

> On Thu, 2021-12-30 at 08:03 +, Peng He wrote:
> > considering a multi-thread PMD setting, when the frags are
> > reassembled
> > in one PMD, another thread might call *ipf_execute_reass_pkts* and
> > 'steal'
> > the reassembled packets into its ipf ctx, then this reassembled
> > packet will enter into another ipf context and causes errors.
> >
> > This happends when there are multiple CT zones, and frags are
> > reassembled in ct(zone=X) might be 'stealed' into the ct(zone=Y).
> >
> > Signed-off-by: Peng He 
> > ---
> >  lib/conntrack.c |  2 +-
> >  lib/ipf.c   | 10 --
> >  lib/ipf.h   |  1 +
> >  3 files changed, 10 insertions(+), 3 deletions(-)
> >
>
> Hello Peng,
>
> This patch appears to make a number of tests fail. For example:
>
> make check-system-userspace TESTSUITEFLAGS="-v 65 66 74 75 76 77 78 79
> 122"
>
>
> -M
>
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index 72999f1ae..aafa6b536 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -1443,7 +1443,7 @@ conntrack_execute(struct conntrack *ct, struct
> > dp_packet_batch *pkt_batch,
> >const struct nat_action_info_t *nat_action_info,
> >long long now, uint32_t tp_id, struct ipf_ctx
> > *ipf_ctx)
> >  {
> > -ipf_preprocess_conntrack(ct->ipf, pkt_batch, now, dl_type, zone,
> > +ipf_preprocess_conntrack(ct->ipf, pkt_batch, ipf_ctx, now,
> > dl_type, zone,
> >   ct->hash_basis);
> >
> >  struct dp_packet *packet;
> > diff --git a/lib/ipf.c b/lib/ipf.c
> > index ef302e59c..bca34f59c 100644
> > --- a/lib/ipf.c
> > +++ b/lib/ipf.c
> > @@ -1139,7 +1139,8 @@ ipf_send_expired_frags(struct ipf *ipf, struct
> > dp_packet_batch *pb,
> >  /* Adds a reassmebled packet to a packet batch to be processed by
> > the caller.
> >   */
> >  static void
> > -ipf_execute_reass_pkts(struct ipf *ipf, struct dp_packet_batch *pb)
> > +ipf_execute_reass_pkts(struct ipf *ipf, struct dp_packet_batch *pb,
> > +   struct ipf_ctx *ctx)
> >  {
> >  if (ovs_list_is_empty(>reassembled_pkt_list)) {
> >  return;
> > @@ -1149,6 +1150,10 @@ ipf_execute_reass_pkts(struct ipf *ipf, struct
> > dp_packet_batch *pb)
> >  struct reassembled_pkt *rp, *next;
> >
> >  LIST_FOR_EACH_SAFE (rp, next, rp_list_node, 
> > >reassembled_pkt_list) {
> > +if (ctx && !ipf_ctx_eq(rp->list, ctx, rp->pkt)) {
> > +continue;
> > +}
> > +
> >  if (!rp->list->reass_execute_ctx &&
> >  ipf_dp_packet_batch_add(pb, rp->pkt, false)) {
> >  rp->list->reass_execute_ctx = rp->pkt;
> > @@ -1250,6 +1255,7 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
> >   * be added to the batch to be sent through conntrack. */
> >  void
> >  ipf_preprocess_conntrack(struct ipf *ipf, struct dp_packet_batch
> > *pb,
> > + struct ipf_ctx *ipf_ctx,
> >   long long now, ovs_be16 dl_type,
> >   uint16_t zone, uint32_t hash_basis)
> >  {
> > @@ -1258,7 +1264,7 @@ ipf_preprocess_conntrack(struct ipf *ipf,
> > struct dp_packet_batch *pb,
> >  }
> >
> >  if (ipf_get_enabled(ipf) || atomic_count_get(>nfrag)) {
> > -ipf_execute_reass_pkts(ipf, pb);
> > +ipf_execute_reass_pkts(ipf, pb, ipf_ctx);
> >  }
> >  }
> >
> > diff --git a/lib/ipf.h b/lib/ipf.h
> > index 7bbf453af..8974f15ae 100644
> > --- a/lib/ipf.h
> > +++ b/lib/ipf.h
> > @@ -49,6 +49,7 @@ struct ipf_ctx {
> >  struct ipf *ipf_init(void);
> >  void ipf_destroy(struct ipf *ipf);
> >  void ipf_preprocess_conntrack(struct ipf *ipf, struct
> > dp_packet_batch *pb,
> > +  struct ipf_ctx *ctx,
> >long long now, ovs_be16 dl_type,
> > uint16_t zone,
> >uint32_t hash_basis);
> >
>
>

-- 
hepeng

Re: [ovs-dev] [ovs-dev v5] ipf: add ipf context

2021-12-18 Thread 贺鹏

I have found another problem in the ipf code which is related to the ipf.
will send a new version consists of two patches.

Mike Pattrick  于2021年11月30日周二 05:08写道：

> Looks good to me!
>
> Acked-by: Mike Pattrick 
>
>
> On Sat, 2021-11-27 at 03:39 +, Peng He wrote:
> > From: Peng He 
> >
> > ipf_postprocess will emit packets into the datapath pipeline ignoring
> > the conntrack context, this might casuse weird issues when a packet
> > batch has less space to hold all the fragments belonging to single
> > packet.
> >
> > Given the below ruleest and consider sending a 64K ICMP packet which
> > is splitted into 64 fragments.
> >
> > priority=1,action=drop
> > priority=10,arp,action=normal
> > priority=100,in_port=1,ct_state=-trk,icmp,action=ct(zone=9,table=0)
> > priority=100,in_port=1,ct_state=+new+trk,icmp,action=ct(zone=9,commit),2
> > priority=100,in_port=1,ct_state=-new+est+trk,icmp,action=2
> > priority=100,in_port=2,ct_state=-trk,icmp,action=ct(table=0,zone=9)
> > priority=100,in_port=2,ct_state=+trk+est-new,icmp,action=1
> >
> > Batch 1:
> > the first 32 packets will be buffered in the ipf preprocessing, nothing
> > more proceeds.
> >
> > Batch 2:
> > the second 32 packets succeed the fragment reassembly and goes to ct
> > and ipf_post will emits the first 32 packets due to the limit of batch
> > size.
> >
> > the first 32 packets goes to the datapath again due to the
> > recirculation, and again buffered at ipf preprocessing before ct commit,
> > then the ovs tries to call ct commit as well as ipf_postprocessing which
> emits
> > the last 32 packets, in this case the last 32 packets will follow
> > the current actions which will be sent to port 2 directly without
> > recirculation and going to ipf preprocssing and ct commit again.
> >
> > This will cause the first 32 packets never get the chance to
> > reassemble and evevntually this large ICMP packets fail to transmit.
> >
> > this patch fixes this issue by adding firstly ipf context to avoid
> > ipf_postprocessing emits packets in the wrong context. Then by
> > re-executing the action list again to emit the last 32 packets
> > in the right context to correctly transmitting multiple fragments.
> >
> > This might be one of the root cause why the OVS log warns:
> >
> > "
> > skipping output to input port on bridge br-int while processing
> > "
> >
> > As a pkt might enter into different ipf context.
> >
> > Signed-off-by: Peng He 
> > Co-authored-by: Mike Pattrick 
> > ---
> >  lib/conntrack.c |  9 ---
> >  lib/conntrack.h | 15 ++--
> >  lib/dpif-netdev.c   | 52 +
> >  lib/ipf.c   | 40 ---
> >  lib/ipf.h   | 11 +++--
> >  tests/system-traffic.at | 33 ++
> >  tests/test-conntrack.c  |  6 ++---
> >  7 files changed, 136 insertions(+), 30 deletions(-)
> >
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index 33a1a9295..72999f1ae 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -1434,14 +1434,14 @@ process_one(struct conntrack *ct, struct
> dp_packet *pkt,
> >   * connection tables.  'setmark', if not NULL, should point to a two
> >   * elements array containing a value and a mask to set the connection
> mark.
> >   * 'setlabel' behaves similarly for the connection label.*/
> > -int
> > +bool
> >  conntrack_execute(struct conntrack *ct, struct dp_packet_batch
> *pkt_batch,
> >ovs_be16 dl_type, bool force, bool commit, uint16_t
> zone,
> >const uint32_t *setmark,
> >const struct ovs_key_ct_labels *setlabel,
> >ovs_be16 tp_src, ovs_be16 tp_dst, const char *helper,
> >const struct nat_action_info_t *nat_action_info,
> > -  long long now, uint32_t tp_id)
> > +  long long now, uint32_t tp_id, struct ipf_ctx
> *ipf_ctx)
> >  {
> >  ipf_preprocess_conntrack(ct->ipf, pkt_batch, now, dl_type, zone,
> >   ct->hash_basis);
> > @@ -1468,9 +1468,8 @@ conntrack_execute(struct conntrack *ct, struct
> dp_packet_batch *pkt_batch,
> >  }
> >  }
> >
> > -ipf_postprocess_conntrack(ct->ipf, pkt_batch, now, dl_type);
> > -
> > -return 0;
> > +return ipf_postprocess_conntrack(ct->ipf, pkt_batch, ipf_ctx, \
> > + now, dl_type);
> >  }
> >
> >  void
> > diff --git a/lib/conntrack.h b/lib/conntrack.h
> > index 9553b188a..211efad3f 100644
> > --- a/lib/conntrack.h
> > +++ b/lib/conntrack.h
> > @@ -64,6 +64,7 @@
> >  struct dp_packet_batch;
> >
> >  struct conntrack;
> > +struct ipf_ctx;
> >
> >  union ct_addr {
> >  ovs_be32 ipv4;
> > @@ -88,13 +89,13 @@ struct nat_action_info_t {
> >  struct conntrack *conntrack_init(void);
> >  void conntrack_destroy(struct conntrack *);
> >
> > -int conntrack_execute(struct conntrack *ct, struct dp_packet_batch
> *pkt_batch,
> > -

Re: [ovs-dev] [PATCH RFC 3/5] conntrack: Replaces nat_conn introducing key directionality.

2021-12-11 Thread 贺鹏

Hi, Paolo

well done.

the patch looks good to me.
and I am going through your other patches.
I wonder maybe finally we need a *conn_flags* to store all the informations
containing NAT, cleaned flag, etc.




Paolo Valerio  于2021年11月30日周二 02:06写道：

> From: Peng He 
>
> Currently, when doing NAT, the userspace conntrack will use an extra
> conn for the two directions in a flow. However, each conn has actually
> the two keys for both orig and rev directions. This patch introduces a
> key_node[CT_DIR_MAX] member in the conn which consists of key and a
> cmap_node for hash lookup. Both keys can now be accessed in the
> following way:
>
> conn->key_node[CT_DIR_{FWD,REV}].key
>
> similarly to what Aaron Conole suggested.
>
> This patch avoids the extra allocation for nat_conn, and makes
> userspace code cleaner.
>
> Signed-off-by: Peng He 
> Co-authored-by: Paolo Valerio 
> Signed-off-by: Paolo Valerio 
> ---
> Initial patch:
>
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/20201129033255.64647-2-hepeng.0...@bytedance.com/
> ---
>  lib/conntrack-private.h |   20 +-
>  lib/conntrack-tp.c  |6 -
>  lib/conntrack.c |  499
> ---
>  3 files changed, 229 insertions(+), 296 deletions(-)
>
> diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
> index 34c688821..ea5ba3d9e 100644
> --- a/lib/conntrack-private.h
> +++ b/lib/conntrack-private.h
> @@ -48,9 +48,16 @@ struct ct_endpoint {
>   * hashing in ct_endpoint_hash_add(). */
>  BUILD_ASSERT_DECL(sizeof(struct ct_endpoint) == sizeof(union ct_addr) +
> 4);
>
> +enum key_dir {
> +CT_DIR_FWD = 0,
> +CT_DIR_REV,
> +CT_DIR_MAX,
> +};
> +
>  /* Changes to this structure need to be reflected in conn_key_hash()
>   * and conn_key_cmp(). */
>  struct conn_key {
> +enum key_dir dir;
>  struct ct_endpoint src;
>  struct ct_endpoint dst;
>
> @@ -86,21 +93,19 @@ struct alg_exp_node {
>  bool nat_rpl_dst;
>  };
>
> -enum OVS_PACKED_ENUM ct_conn_type {
> -CT_CONN_TYPE_DEFAULT,
> -CT_CONN_TYPE_UN_NAT,
> +struct conn_key_node {
> +struct conn_key key;
> +struct cmap_node cm_node;
>  };
>
>  struct conn {
>  /* Immutable data. */
> -struct conn_key key;
> -struct conn_key rev_key;
> +struct conn_key_node key_node[CT_DIR_MAX];
>  struct conn_key parent_key; /* Only used for orig_tuple support. */
>  struct ovs_list exp_node;
> -struct cmap_node cm_node;
> +
>  uint16_t nat_action;
>  char *alg;
> -struct conn *nat_conn; /* The NAT 'conn' context, if there is one. */
>
>  /* Mutable data. */
>  struct ovs_mutex lock; /* Guards all mutable fields. */
> @@ -120,7 +125,6 @@ struct conn {
>
>  /* Immutable data. */
>  bool alg_related; /* True if alg data connection. */
> -enum ct_conn_type conn_type;
>
>  uint32_t tp_id; /* Timeout policy ID. */
>  };
> diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
> index c2245038b..9ecb06978 100644
> --- a/lib/conntrack-tp.c
> +++ b/lib/conntrack-tp.c
> @@ -282,7 +282,8 @@ conn_update_expiration(struct conntrack *ct, struct
> conn *conn,
>  ovs_mutex_lock(>lock);
>  VLOG_DBG_RL(, "Update timeout %s zone=%u with policy id=%d "
>  "val=%u sec.",
> -ct_timeout_str[tm], conn->key.zone, conn->tp_id, val);
> +ct_timeout_str[tm], conn->key_node[CT_DIR_FWD].key.zone,
> +conn->tp_id, val);
>
>  conn_update_expiration__(ct, conn, tm, now, val);
>  }
> @@ -313,7 +314,8 @@ conn_init_expiration(struct conntrack *ct, struct conn
> *conn,
>  }
>
>  VLOG_DBG_RL(, "Init timeout %s zone=%u with policy id=%d val=%u
> sec.",
> -ct_timeout_str[tm], conn->key.zone, conn->tp_id, val);
> +ct_timeout_str[tm], conn->key_node[CT_DIR_FWD].key.zone,
> +conn->tp_id, val);
>
>  conn_init_expiration__(ct, conn, tm, now, val);
>  }
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 983f824d2..a284c57c0 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -96,7 +96,6 @@ static struct conn *new_conn(struct conntrack *ct,
> struct dp_packet *pkt,
>   uint32_t tp_id);
>  static void delete_conn_cmn(struct conn *);
>  static void delete_conn(struct conn *);
> -static void delete_conn_one(struct conn *conn);
>  static enum ct_update_res conn_update(struct conntrack *ct, struct conn
> *conn,
>struct dp_packet *pkt,
>struct conn_lookup_ctx *ctx,
> @@ -110,8 +109,7 @@ static void set_label(struct dp_packet *, struct conn
> *,
>  static void *clean_thread_main(void *f_);
>
>  static bool
> -nat_get_unique_tuple(struct conntrack *ct, const struct conn *conn,
> - struct conn *nat_conn,
> +nat_get_unique_tuple(struct conntrack *ct, struct conn *conn,
>   const struct nat_action_info_t *nat_info);
>
>  static uint8_t
>

Re: [ovs-dev] [ovs-dev v4] ipf: add ipf context

2021-11-21 Thread 贺鹏

Ah, in the commits, I forget to delete the introduction of recirc_depth.
will send a v5.

0-day Robot  于2021年11月20日周六 下午6:20写道：

> Bleep bloop.  Greetings Peng He, I am a robot and I have tried out your
> patch.
> Thanks for your contribution.
>
> I encountered some error that I wasn't expecting.  See the details below.
>
>
> checkpatch:
> ERROR: Author Peng He  needs to sign off.
> ERROR: Co-author Mike Pattrick  needs to sign off.
> WARNING: Unexpected sign-offs from developers who are not authors or
> co-authors or committers: Peng He 
> Lines checked: 411, Warnings: 1, Errors: 2
>
>
> Please check this out.  If you feel there has been an error, please email
> acon...@redhat.com
>
> Thanks,
> 0-day Robot
>


-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [ovs-dev v3] ipf: add ipf context

2021-11-17 Thread 贺鹏

Hi, Michael,

thanks for the suggestion!
the patch looks good to me, did you do the test in the patch?

Mike Pattrick  于2021年11月18日周四 上午6:33写道：

> Hello Peng,
>
> Great patch! How do you feel about a patch like this to get away from
> recursion?
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index fa56d86e0..935e416ea 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -8303,6 +8303,8 @@ dp_execute_cb(void *aux_, struct dp_packet_batch
> *packets_,
>pmd->ctx.now / 1000, tp_id,
> _ctx);
>  if (more_pkts) {
>  aux->redo_actions = a;
> +} else {
> +aux->redo_actions = NULL;
>  }
>  break;
>  }
> @@ -8358,28 +8360,23 @@ dp_netdev_execute_actions(struct
> dp_netdev_pmd_thread *pmd,
>bool should_steal, const struct flow *flow,
>const struct nlattr *actions, size_t
> actions_len)
>  {
> +int max_reexec = 2;
>  struct dp_netdev_execute_aux aux = { pmd, flow, NULL };
>
>  odp_execute_actions(, packets, should_steal, actions,
>  actions_len, dp_execute_cb);
>
> -if (OVS_UNLIKELY(aux.redo_actions)) {
> -uint32_t *depth = recirc_depth_get();
> -/* reuse depth to limit re-execution times to avoid too
> - * large fragments.
> - */
> -if (*depth < MAX_RECIRC_DEPTH) {
> -struct dp_packet_batch batch;
> -dp_packet_batch_init();
> -size_t redo_actions_len = \
> -  dp_netdev_find_actions_left(acti
> ons,
> -  actions_len,
> -  aux.redo_actions);
> -(*depth) ++;
> -dp_netdev_execute_actions(pmd, , should_steal, flow,
> -aux.redo_actions, redo_actions_len);
> -(*depth) --;
> -}
> +while (OVS_UNLIKELY(aux.redo_actions && max_reexec--)) {
> +struct dp_packet_batch batch;
> +dp_packet_batch_init();
> +size_t redo_actions_len = \
> +  dp_netdev_find_actions_left(actions,
> +  actions_len,
> +  aux.redo_actions);
> +
> +odp_execute_actions(, , should_steal,
> aux.redo_actions,
> +redo_actions_len, dp_execute_cb);
> +
>  }
>  }
>
>
> Cheers,
> Michael
>
> On Wed, 2021-11-17 at 09:03 -0500, Aaron Conole wrote:
> > Hi Peng,
> >
> > Peng He  writes:
> >
> > > From: Peng He 
> > >
> > > ipf_postprocess will emit packets into the datapath pipeline
> > > ignoring
> > > the conntrack context, this might casuse weird issues when a packet
> > > batch has less space to hold all the fragments belonging to single
> > > packet.
> > >
> > > Given the below ruleest and consider sending a 64K ICMP packet
> > > which
> > > is splitted into 64 fragments.
> > >
> > > priority=1,action=drop
> > > priority=10,arp,action=normal
> > > priority=100,in_port=1,ct_state=-trk,icmp,action=ct(zone=9,table=0)
> > > priority=100,in_port=1,ct_state=+new+trk,icmp,action=ct(zone=9,comm
> > > it),2
> > > priority=100,in_port=1,ct_state=-new+est+trk,icmp,action=2
> > > priority=100,in_port=2,ct_state=-trk,icmp,action=ct(table=0,zone=9)
> > > priority=100,in_port=2,ct_state=+trk+est-new,icmp,action=1
> > >
> > > Batch 1:
> > > the first 32 packets will be buffered in the ipf preprocessing,
> > > nothing
> > > more proceeds.
> > >
> > > Batch 2:
> > > the second 32 packets succeed the fragment reassembly and goes to
> > > ct
> > > and ipf_post will emits the first 32 packets due to the limit of
> > > batch
> > > size.
> > >
> > > the first 32 packets goes to the datapath again due to the
> > > recirculation, and again buffered at ipf preprocessing before ct
> > > commit,
> > > then the ovs tries to call ct commit as well as ipf_postprocessing
> > > which emits
> > > the last 32 packets, in this case the last 32 packets will follow
> > > the current actions which will be sent to port 2 directly without
> > > recirculation and going to ipf preprocssing and ct commit again.
> > >
> > > This will cause the first 32 packets never get the chance to
> > > reassemble and evevntually this large ICMP packets fail to
> > > transmit.
> > >
> > > this patch fixes this issue by adding firstly ipf context to avoid
> > > ipf_postprocessing emits packets in the wrong context. Then by
> > > re-executing the action list again to emit the last 32 packets
> > > in the right context to correctly transmitting multiple fragments.
> > >
> > > Last, we reuse the recirc_depth to limit the times of re-execution
> > > as re-execution is implemented by recursive function call.
> > >
> > > Signed-off-by: Peng He 
> > > ---
> >
> > Thanks for looking into this problem - it's a current problem with
> > the
> > userspace ipf implementation.
> >
> > >

[ovs-dev] tunnel info mask can unmasked when megaflow is with a recirc_id

2021-08-15 Thread 贺鹏

Hi,

One recirc_id can be viewed as a frozen state in the "xlate" process
and this frozen state stores tunnel info in it. So this recirc_id can also
be viewed as a packet tag, with this tag it means the packet has all
the tunnel metadata.

So I was wondering if the "xlate" process sees a packets with recirc id,
it can clear all the masks of tunnel info, as the recric_id itself represents
the tunnel info already.

With this optimization, the recircled packets do not have to matched against
tunnel info in the datapath, just recirc_id itself is enough, except
the openflow
rules clearly specify the tun_* fields.

Am I correct?


-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v3] ofproto-dpif-xlate: fix zone set from non-frozen-metadata fields

2021-07-26 Thread 贺鹏

Sure,
will send a v2. Thanks for the review.


Ilya Maximets  于2021年7月26日周一 上午7:22写道：

>
> Re-adding the list and people from CC.
>
> On 6/27/21 8:16 AM, 贺鹏 wrote:
> > Hi, Ilya,
> >
> > sorry for replying so late, being really busy this month.
>
> Busy time for everyone.
> BTW, I'm taking some time off next week.  Maybe someone else will be able
> to take a look at the changes while I'm not here.  My replies inline.

Have a good rest.

>
> > I've read through your comments. I personally prefer to keep the
> > number of megaflows as few as possible, as megaflow is becoming
> > expensive as the size of flow struct is large.
> > below are my experiments.
> >
> > Ilya Maximets  于2021年5月22日周六 上午3:50写道：
> >>
> >> On 5/21/21 5:00 AM, 贺鹏 wrote:
> >>> Hi, Ilya
> >>>
> >>>
> >>>
> >>> Ilya Maximets  于2021年5月19日周三 下午8:50写道：
> >>>>
> >>>> On 2/27/21 10:34 AM, Peng He wrote:
> >>>>> CT zone could be set from a field that is not included in frozen
> >>>>> metadata. Consider the example rules which are typically seen in
> >>>>> OpenStack security group rules:
> >>>>>
> >>>>> priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
> >>>>> priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
> >>>>>
> >>>>> The zone is set from the first rule's ct action. These two rules will
> >>>>> generate two megaflows: the first one uses zone=5 to query the CT 
> >>>>> module,
> >>>>> the second one sets the zone-id from the first megaflow and commit to 
> >>>>> CT.
> >>>>>
> >>>>> The current implementation will generate a megaflow that does not use
> >>>>> ct_zone=5 as a match, but directly commit into the ct using zone=5, as 
> >>>>> zone is
> >>>>> set by an Imm not a field.
> >>>>>
> >>>>> Consider a situation that one changes the zone id (for example to 15)
> >>>>> in the first rule, however, still keep the second rule unchanged. During
> >>>>> this change, there is traffic hitting the two generated megaflows, the
> >>>>> revaldiator would revalidate all megaflows, however, the revalidator 
> >>>>> will
> >>>>> not change the second megaflow, because zone=5 is recorded in the
> >>>>> megaflow, so the xlate will still translate the commit action into 
> >>>>> zone=5,
> >>>>> and the new traffic will still commit to CT as zone=5, not zone=15,
> >>>>> resulting in taffic drops and other issues.
> >>>>>
> >>>>> Just like OVS set-field convention, if a field X is set by Y
> >>>>> (Y is a variable not an Imm), we should also mask Y as a match
> >>>>> in the generated megaflow. An exception is that if the zone-id is
> >>>>> set by the field that is included in the frozen state (i.e. regs) and 
> >>>>> this
> >>>>> upcall is a resume of a thawed xlate, the un-wildcarding can be skipped,
> >>>>> as the recirc_id is a hash of the values in these fields, and it will 
> >>>>> change
> >>>>> following the changes of these fields. When the recirc_id changes,
> >>>>> all megaflows with the old recirc id will be invalid later.
> >>>>>
> >>>>> Fixes: 07659514c3 ("Add support for connection tracking.")
> >>>>> Reported-by: Sai Su 
> >>>>> Signed-off-by: Peng He 
> >>>>> ---
> >>>>>  include/openvswitch/meta-flow.h |  1 +
> >>>>>  lib/meta-flow.c | 13 ++
> >>>>>  ofproto/ofproto-dpif-xlate.c| 12 +
> >>>>>  tests/system-traffic.at | 45 +
> >>>>>  4 files changed, 71 insertions(+)
> >>>>>
> >>>>> diff --git a/include/openvswitch/meta-flow.h 
> >>>>> b/include/openvswitch/meta-flow.h
> >>>>> index 95e52e358..045dce8f5 100644
> >>>>> --- a/include/openvswitch/meta-flow.h
> >>>>> +++ b/include/openvswitch/meta-flow.h
> >>>>> @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct 
> >>>>> mf_field *,
> >>>&

Re: [ovs-dev] [PATCH v3] ofproto-dpif-xlate: fix zone set from non-frozen-metadata fields

2021-07-24 Thread 贺鹏

Hi,
friendly ping for this patch.

Ilya Maximets  于2021年5月22日周六 上午3:50写道：
>
> On 5/21/21 5:00 AM, 贺鹏 wrote:
> > Hi, Ilya
> >
> >
> >
> > Ilya Maximets  于2021年5月19日周三 下午8:50写道：
> >>
> >> On 2/27/21 10:34 AM, Peng He wrote:
> >>> CT zone could be set from a field that is not included in frozen
> >>> metadata. Consider the example rules which are typically seen in
> >>> OpenStack security group rules:
> >>>
> >>> priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
> >>> priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
> >>>
> >>> The zone is set from the first rule's ct action. These two rules will
> >>> generate two megaflows: the first one uses zone=5 to query the CT module,
> >>> the second one sets the zone-id from the first megaflow and commit to CT.
> >>>
> >>> The current implementation will generate a megaflow that does not use
> >>> ct_zone=5 as a match, but directly commit into the ct using zone=5, as 
> >>> zone is
> >>> set by an Imm not a field.
> >>>
> >>> Consider a situation that one changes the zone id (for example to 15)
> >>> in the first rule, however, still keep the second rule unchanged. During
> >>> this change, there is traffic hitting the two generated megaflows, the
> >>> revaldiator would revalidate all megaflows, however, the revalidator will
> >>> not change the second megaflow, because zone=5 is recorded in the
> >>> megaflow, so the xlate will still translate the commit action into zone=5,
> >>> and the new traffic will still commit to CT as zone=5, not zone=15,
> >>> resulting in taffic drops and other issues.
> >>>
> >>> Just like OVS set-field convention, if a field X is set by Y
> >>> (Y is a variable not an Imm), we should also mask Y as a match
> >>> in the generated megaflow. An exception is that if the zone-id is
> >>> set by the field that is included in the frozen state (i.e. regs) and this
> >>> upcall is a resume of a thawed xlate, the un-wildcarding can be skipped,
> >>> as the recirc_id is a hash of the values in these fields, and it will 
> >>> change
> >>> following the changes of these fields. When the recirc_id changes,
> >>> all megaflows with the old recirc id will be invalid later.
> >>>
> >>> Fixes: 07659514c3 ("Add support for connection tracking.")
> >>> Reported-by: Sai Su 
> >>> Signed-off-by: Peng He 
> >>> ---
> >>>  include/openvswitch/meta-flow.h |  1 +
> >>>  lib/meta-flow.c | 13 ++
> >>>  ofproto/ofproto-dpif-xlate.c| 12 +
> >>>  tests/system-traffic.at | 45 +
> >>>  4 files changed, 71 insertions(+)
> >>>
> >>> diff --git a/include/openvswitch/meta-flow.h 
> >>> b/include/openvswitch/meta-flow.h
> >>> index 95e52e358..045dce8f5 100644
> >>> --- a/include/openvswitch/meta-flow.h
> >>> +++ b/include/openvswitch/meta-flow.h
> >>> @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct mf_field 
> >>> *,
> >>>const union mf_value *mask,
> >>>struct flow *);
> >>>  bool mf_is_tun_metadata(const struct mf_field *);
> >>> +bool mf_is_frozen_metadata(const struct mf_field *);
> >>>  bool mf_is_pipeline_field(const struct mf_field *);
> >>>  bool mf_is_set(const struct mf_field *, const struct flow *);
> >>>  void mf_mask_field(const struct mf_field *, struct flow_wildcards *);
> >>> diff --git a/lib/meta-flow.c b/lib/meta-flow.c
> >>> index c808d205d..e03cd8d0c 100644
> >>> --- a/lib/meta-flow.c
> >>> +++ b/lib/meta-flow.c
> >>> @@ -1788,6 +1788,19 @@ mf_is_tun_metadata(const struct mf_field *mf)
> >>> mf->id < MFF_TUN_METADATA0 + TUN_METADATA_NUM_OPTS;
> >>>  }
> >>>
> >>> +bool
> >>> +mf_is_frozen_metadata(const struct mf_field *mf)
> >>> +{
> >>> +if (mf->id >= MFF_TUN_ID && mf->id <= MFF_IN_PORT_OXM) {
> >>> +return true;
> >>> +}
> >>> +
> >>> +if (mf->id >= MFF_REG0 && mf->id < MFF_ETH_SRC) {
> >>> +ret

Re: [ovs-dev] "Why perthread->mutex is needed?"

2021-07-09 Thread 贺鹏

Great, thanks!

Ben Pfaff  于2021年6月29日周二 下午11:36写道：
>
> On Tue, Jun 29, 2021 at 09:49:39PM +0800, 贺鹏 wrote:
> > I am investigating the OVS RCU code, and feel confused about the
> > perthread->mutex, what's the usage of this mutex? it seems in the code
> > there are only codes that inits and destroys the mutex, but there is
> > no code that locks and unlocks it.
>
> Thanks for spotting that.  I sent out a patch to remove it:
> https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384582.html



-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

[ovs-dev] Why save a flow structure in dp_netdev_key?

2021-07-09 Thread 贺鹏

Hi,

I am investigating the code of revalidator and the DPDK netdev
datapath for OVS.
I have some questions:

1) why save a whole flow in the dp_nedev_key? Currently, the size of
flow is around 670 bytes, and the full size of dp_netdev_key is around
1500 (meaning that this flow takes 50% of the full dp_netdev_key
size). This large size makes the dataplane hash key a little bit
expensive for the hash lookup.

Meanwhile, I find that in fact, the revalidator only uses ufid to
index the dp_netdev_flow to the corresponding ukey in the udpif.
Nornally it will not touch the full flow structure in the
dp_netdev_key at all. The only exception is that when revaldiator
cannot find the ukey accroding to the ufid, it will instead try to use
the full flow dump and install ukey for the flow, this seems the only
usage of this flow structure.

This scenario seems purely for kernel datapath, since in DPDK, the pmd
and revalidator are just two kinds of threads, they share the same
fate, and thus it's not possible that a dp_netdev_flow exists while
the ukey does not, unless you manually insert a dp_netdev_flow in the
dataplane for testing and debuging.

thanks.


-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

[ovs-dev] Fwd: "Why perthread->mutex is needed?"

2021-06-29 Thread 贺鹏

-- 转发的邮件 --
发件人： *贺鹏* 
日期： 2021年6月29日星期二
主题： "Why perthread->mutex is needed?"
收件人： Sriharsha Basavapatna via dev , Ben Pfaff <
b...@ovn.org>


Hi，Ben,

I am investigating the OVS RCU code, and feel confused about the
perthread->mutex, what's the usage of this mutex? it seems in the code
there are only codes that inits and destroys the mutex, but there is
no code that locks and unlocks it.

Thanks.

-- 
hepeng



-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

[ovs-dev] "Why perthread->mutex is needed?"

2021-06-29 Thread 贺鹏

Hi，Ben,

I am investigating the OVS RCU code, and feel confused about the
perthread->mutex, what's the usage of this mutex? it seems in the code
there are only codes that inits and destroys the mutex, but there is
no code that locks and unlocks it.

Thanks.

-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v3] ofproto-dpif-xlate: fix zone set from non-frozen-metadata fields

2021-05-20 Thread 贺鹏

Hi, Ilya



Ilya Maximets  于2021年5月19日周三 下午8:50写道：
>
> On 2/27/21 10:34 AM, Peng He wrote:
> > CT zone could be set from a field that is not included in frozen
> > metadata. Consider the example rules which are typically seen in
> > OpenStack security group rules:
> >
> > priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
> > priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
> >
> > The zone is set from the first rule's ct action. These two rules will
> > generate two megaflows: the first one uses zone=5 to query the CT module,
> > the second one sets the zone-id from the first megaflow and commit to CT.
> >
> > The current implementation will generate a megaflow that does not use
> > ct_zone=5 as a match, but directly commit into the ct using zone=5, as zone 
> > is
> > set by an Imm not a field.
> >
> > Consider a situation that one changes the zone id (for example to 15)
> > in the first rule, however, still keep the second rule unchanged. During
> > this change, there is traffic hitting the two generated megaflows, the
> > revaldiator would revalidate all megaflows, however, the revalidator will
> > not change the second megaflow, because zone=5 is recorded in the
> > megaflow, so the xlate will still translate the commit action into zone=5,
> > and the new traffic will still commit to CT as zone=5, not zone=15,
> > resulting in taffic drops and other issues.
> >
> > Just like OVS set-field convention, if a field X is set by Y
> > (Y is a variable not an Imm), we should also mask Y as a match
> > in the generated megaflow. An exception is that if the zone-id is
> > set by the field that is included in the frozen state (i.e. regs) and this
> > upcall is a resume of a thawed xlate, the un-wildcarding can be skipped,
> > as the recirc_id is a hash of the values in these fields, and it will change
> > following the changes of these fields. When the recirc_id changes,
> > all megaflows with the old recirc id will be invalid later.
> >
> > Fixes: 07659514c3 ("Add support for connection tracking.")
> > Reported-by: Sai Su 
> > Signed-off-by: Peng He 
> > ---
> >  include/openvswitch/meta-flow.h |  1 +
> >  lib/meta-flow.c | 13 ++
> >  ofproto/ofproto-dpif-xlate.c| 12 +
> >  tests/system-traffic.at | 45 +
> >  4 files changed, 71 insertions(+)
> >
> > diff --git a/include/openvswitch/meta-flow.h 
> > b/include/openvswitch/meta-flow.h
> > index 95e52e358..045dce8f5 100644
> > --- a/include/openvswitch/meta-flow.h
> > +++ b/include/openvswitch/meta-flow.h
> > @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct mf_field *,
> >const union mf_value *mask,
> >struct flow *);
> >  bool mf_is_tun_metadata(const struct mf_field *);
> > +bool mf_is_frozen_metadata(const struct mf_field *);
> >  bool mf_is_pipeline_field(const struct mf_field *);
> >  bool mf_is_set(const struct mf_field *, const struct flow *);
> >  void mf_mask_field(const struct mf_field *, struct flow_wildcards *);
> > diff --git a/lib/meta-flow.c b/lib/meta-flow.c
> > index c808d205d..e03cd8d0c 100644
> > --- a/lib/meta-flow.c
> > +++ b/lib/meta-flow.c
> > @@ -1788,6 +1788,19 @@ mf_is_tun_metadata(const struct mf_field *mf)
> > mf->id < MFF_TUN_METADATA0 + TUN_METADATA_NUM_OPTS;
> >  }
> >
> > +bool
> > +mf_is_frozen_metadata(const struct mf_field *mf)
> > +{
> > +if (mf->id >= MFF_TUN_ID && mf->id <= MFF_IN_PORT_OXM) {
> > +return true;
> > +}
> > +
> > +if (mf->id >= MFF_REG0 && mf->id < MFF_ETH_SRC) {
> > +return true;
> > +}
> > +return false;
> > +}
> > +
> >  bool
> >  mf_is_pipeline_field(const struct mf_field *mf)
> >  {
> > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> > index 7108c8a30..14d00db1e 100644
> > --- a/ofproto/ofproto-dpif-xlate.c
> > +++ b/ofproto/ofproto-dpif-xlate.c
> > @@ -6195,6 +6195,18 @@ compose_conntrack_action(struct xlate_ctx *ctx, 
> > struct ofpact_conntrack *ofc,
> >
> >  if (ofc->zone_src.field) {
> >  zone = mf_get_subfield(>zone_src, >xin->flow);
> > +if (ctx->xin->frozen_state) {
> > +/* If the upcall is a resume of a recirculation, we only need 
> > to
> > + * unwildcard the fields that are not in the frozen_metadata, 
> > as
> > + * when the rules update, OVS will generate a new recirc_id,
> > + * which will invalidate the megaflow with old the recirc_id.
> > + */
> > +if (!mf_is_frozen_metadata(ofc->zone_src.field)) {
> > +mf_mask_field(ofc->zone_src.field, ctx->wc);
> > +}
> > +} else {
> > +mf_mask_field(ofc->zone_src.field, ctx->wc);
> > +}
>
> IIUC, in current code above block is equal to just a single line:
>
> mf_mask_field(ofc->zone_src.field, ctx->wc);
>
> That is because zone

Re: [ovs-dev] conntrack: code refactoring userspace ct.

2021-05-07 Thread 贺鹏

Hi,

If you have time to rework it then it's great. You can put me as one
of the authors using the bytedance email address (as an obligation of
the company policy).

Thanks.


Gaëtan Rivet  于2021年5月7日周五 下午5:26写道：
>
> On Sat, Mar 6, 2021, at 03:44, 贺鹏 wrote:
> > Gaëtan Rivet  于2021年3月2日周二 下午10:47写道：
> > >
> > > On Sun, Nov 29, 2020, at 03:32, hepeng.0320 wrote:
> > > > This series of patch refactor the code of userspace CT, which makes
> > > > the code more simple and more structured. We mainly refactor the code
> > > > path for NAT and ALG, using conn_flags instead of bool value and use a
> > > > light weight lock to reduce the size of the conn. The size of conn has
> > > > been reduced from 312 to 240 bytes.
> > > >
> > > >
> > >
> > > Hello,
> > >
> > > This series is interesting, but it needs some work.
> > >
> > > Commits should be divided into smaller elements (as Aaron pointed out), 
> > > and some changes are
> > > barely described, or bunched with other changes without enough 
> > > justification.
> > >
> > > Could you rebase and rework the series to have a more organized patchset?
> > > That would make the review easier.
> > >
> > > Regards,
> > > --
> > > Gaetan
> > > ___
> > > dev mailing list
> > > d...@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
> > Hi,
> >
> > I've read all your suggestions and appreciate that.
> >
> > Just have been through a very busy week, and probably will have a
> > very busy month, so the work and reply might be slow.
> >
> > Will do a rebase in the next version.
> >
> > I've seen some of your patches related to CT. do you have some plans
> > for CT optimization? is it a part of plans related to CT offload?
> >
> >
> > --
> > hepeng
> >
>
> Hello 鹏,
>
> Were you able to rework this series after all?
> If you cannot find the time I can do it.
> I think reorganizing the code and trying to reduce memory footprint would be 
> nice.
>
> Otherwise indeed there are plans to improve the code of the connection 
> tracker,
> but first the current standing series should be taken care of :) .
>
> Best regards,
> --
> Gaetan Rivet



-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v3] ofproto-dpif-xlate: fix zone set from non-frozen-metadata fields

2021-04-26 Thread 贺鹏

Hi,

it has been two months now, can anyone review this patch?

贺鹏  于2021年3月19日周五 下午4:42写道：
>
> Hi,
>
> Ping for this patch.
>
> 贺鹏  于2021年3月6日周六 上午10:37写道：
> >
> > Mark Gray  于2021年3月5日周五 下午7:54写道：
> > >
> > > On 27/02/2021 09:34, Peng He wrote:
> > > > CT zone could be set from a field that is not included in frozen
> > > > metadata. Consider the example rules which are typically seen in
> > > > OpenStack security group rules:
> > > >
> > > > priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
> > > > priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
> > > >
> > > > The zone is set from the first rule's ct action. These two rules will
> > > > generate two megaflows: the first one uses zone=5 to query the CT 
> > > > module,
> > > > the second one sets the zone-id from the first megaflow and commit to 
> > > > CT.
> > > >
> > > > The current implementation will generate a megaflow that does not use
> > > > ct_zone=5 as a match, but directly commit into the ct using zone=5, as 
> > > > zone is
> > > > set by an Imm not a field.
> > > >
> > > > Consider a situation that one changes the zone id (for example to 15)
> > > > in the first rule, however, still keep the second rule unchanged. During
> > > > this change, there is traffic hitting the two generated megaflows, the
> > > > revaldiator would revalidate all megaflows, however, the revalidator 
> > > > will
> > > > not change the second megaflow, because zone=5 is recorded in the
> > > > megaflow, so the xlate will still translate the commit action into 
> > > > zone=5,
> > > > and the new traffic will still commit to CT as zone=5, not zone=15,
> > > > resulting in taffic drops and other issues.
> > > >
> > > > Just like OVS set-field convention, if a field X is set by Y
> > > > (Y is a variable not an Imm), we should also mask Y as a match
> > > > in the generated megaflow. An exception is that if the zone-id is
> > > > set by the field that is included in the frozen state (i.e. regs) and 
> > > > this
> > > > upcall is a resume of a thawed xlate, the un-wildcarding can be skipped,
> > > > as the recirc_id is a hash of the values in these fields, and it will 
> > > > change
> > > > following the changes of these fields. When the recirc_id changes,
> > > > all megaflows with the old recirc id will be invalid later.
> > >
> > > This looks good to me and all the unit-tests pass.
> > >
> > > There is some trailing whitespace. You can run
> > > "./utilities/checkpatch.py" when submitting to catch them before the
> > > 0-day robot does. Its a bit of a nit and I don't know if this won't get
> > > committed because of it. That's a decision for a maintainer.
> >
> > thanks for your kind suggestion
> > .
> > better send a new version, but before that maybe there are
> > some more suggestions from maintainers, and it's better
> > to wait and then resend.
> >
> > > >
> > > > Fixes: 07659514c3 ("Add support for connection tracking.")
> > > > Reported-by: Sai Su 
> > > > Signed-off-by: Peng He 
> > > > ---
> > > >  include/openvswitch/meta-flow.h |  1 +
> > > >  lib/meta-flow.c | 13 ++
> > > >  ofproto/ofproto-dpif-xlate.c| 12 +
> > > >  tests/system-traffic.at | 45 +
> > > >  4 files changed, 71 insertions(+)
> > > >
> > > > diff --git a/include/openvswitch/meta-flow.h 
> > > > b/include/openvswitch/meta-flow.h
> > > > index 95e52e358..045dce8f5 100644
> > > > --- a/include/openvswitch/meta-flow.h
> > > > +++ b/include/openvswitch/meta-flow.h
> > > > @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct 
> > > > mf_field *,
> > > >const union mf_value *mask,
> > > >struct flow *);
> > > >  bool mf_is_tun_metadata(const struct mf_field *);
> > > > +bool mf_is_frozen_metadata(const struct mf_field *);
> > > >  bool mf_is_pipeline_field(const struct mf_field *);
> > > >  bool mf_is_set(const struct mf_field *, const struct flow *);
> > > >  void

Re: [ovs-dev] [PATCH v3] ofproto-dpif-xlate: fix zone set from non-frozen-metadata fields

2021-03-19 Thread 贺鹏

Hi,

Ping for this patch.

贺鹏  于2021年3月6日周六 上午10:37写道：
>
> Mark Gray  于2021年3月5日周五 下午7:54写道：
> >
> > On 27/02/2021 09:34, Peng He wrote:
> > > CT zone could be set from a field that is not included in frozen
> > > metadata. Consider the example rules which are typically seen in
> > > OpenStack security group rules:
> > >
> > > priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
> > > priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
> > >
> > > The zone is set from the first rule's ct action. These two rules will
> > > generate two megaflows: the first one uses zone=5 to query the CT module,
> > > the second one sets the zone-id from the first megaflow and commit to CT.
> > >
> > > The current implementation will generate a megaflow that does not use
> > > ct_zone=5 as a match, but directly commit into the ct using zone=5, as 
> > > zone is
> > > set by an Imm not a field.
> > >
> > > Consider a situation that one changes the zone id (for example to 15)
> > > in the first rule, however, still keep the second rule unchanged. During
> > > this change, there is traffic hitting the two generated megaflows, the
> > > revaldiator would revalidate all megaflows, however, the revalidator will
> > > not change the second megaflow, because zone=5 is recorded in the
> > > megaflow, so the xlate will still translate the commit action into zone=5,
> > > and the new traffic will still commit to CT as zone=5, not zone=15,
> > > resulting in taffic drops and other issues.
> > >
> > > Just like OVS set-field convention, if a field X is set by Y
> > > (Y is a variable not an Imm), we should also mask Y as a match
> > > in the generated megaflow. An exception is that if the zone-id is
> > > set by the field that is included in the frozen state (i.e. regs) and this
> > > upcall is a resume of a thawed xlate, the un-wildcarding can be skipped,
> > > as the recirc_id is a hash of the values in these fields, and it will 
> > > change
> > > following the changes of these fields. When the recirc_id changes,
> > > all megaflows with the old recirc id will be invalid later.
> >
> > This looks good to me and all the unit-tests pass.
> >
> > There is some trailing whitespace. You can run
> > "./utilities/checkpatch.py" when submitting to catch them before the
> > 0-day robot does. Its a bit of a nit and I don't know if this won't get
> > committed because of it. That's a decision for a maintainer.
>
> thanks for your kind suggestion
> .
> better send a new version, but before that maybe there are
> some more suggestions from maintainers, and it's better
> to wait and then resend.
>
> > >
> > > Fixes: 07659514c3 ("Add support for connection tracking.")
> > > Reported-by: Sai Su 
> > > Signed-off-by: Peng He 
> > > ---
> > >  include/openvswitch/meta-flow.h |  1 +
> > >  lib/meta-flow.c | 13 ++
> > >  ofproto/ofproto-dpif-xlate.c| 12 +
> > >  tests/system-traffic.at | 45 +
> > >  4 files changed, 71 insertions(+)
> > >
> > > diff --git a/include/openvswitch/meta-flow.h 
> > > b/include/openvswitch/meta-flow.h
> > > index 95e52e358..045dce8f5 100644
> > > --- a/include/openvswitch/meta-flow.h
> > > +++ b/include/openvswitch/meta-flow.h
> > > @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct mf_field 
> > > *,
> > >const union mf_value *mask,
> > >struct flow *);
> > >  bool mf_is_tun_metadata(const struct mf_field *);
> > > +bool mf_is_frozen_metadata(const struct mf_field *);
> > >  bool mf_is_pipeline_field(const struct mf_field *);
> > >  bool mf_is_set(const struct mf_field *, const struct flow *);
> > >  void mf_mask_field(const struct mf_field *, struct flow_wildcards *);
> > > diff --git a/lib/meta-flow.c b/lib/meta-flow.c
> > > index c808d205d..e03cd8d0c 100644
> > > --- a/lib/meta-flow.c
> > > +++ b/lib/meta-flow.c
> > > @@ -1788,6 +1788,19 @@ mf_is_tun_metadata(const struct mf_field *mf)
> > > mf->id < MFF_TUN_METADATA0 + TUN_METADATA_NUM_OPTS;
> > >  }
> > >
> > > +bool
> > > +mf_is_frozen_metadata(const struct mf_field *mf)
> > > +{
> > > +if (mf->id >= MFF_TUN_

Re: [ovs-dev] [PATCH] [ovs-dev v2]ofproto:fix use-after-free of ofproto

2021-03-09 Thread 贺鹏

Is this a version that mixes using refcount and RCU?
do we have reached an agreement?

BTW, please use my company email address: hepeng.0...@bytedance.com

thanks.

Hongzhi Guo  于2021年3月8日周一 上午11:34写道：
>
> ASAN report use-after-free of ofproto when destroy ofproto_rule.
> The rule uses both RCU and refcount, while the ofproto uses only RCU,
> and the rule retains the pointer of the proto.
> More importantly, ofproto cannot guarantee a longer grace period than the 
> rule.
> So when the rule is deleted, it is possible that ofproto has been released,
> resulting in use-after-free of ofproto.
> This patch add ref_count for ofproto to avoid use-after-free.
>
> ===
> ASAN report as following:
> ==10399==ERROR: AddressSanitizer: heap-use-after-free on address 
> 0x61e1e420 at pc 0xdcc29d1c bp 0x6c5fde40 sp 0x6c5fde60
>  READ of size 8 at 0x61e1e420 thread T12 (urcu2)
> #0 0xdcc29d1b 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/ofproto/ofproto.c:2916)
> #1 0xdcf76f5f 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/lib/ovs-rcu.c:348 
> (discriminator 3))
> #2 0xdcf770c7 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/lib/ovs-rcu.c:363)
> #3 0xdcf7fa9b 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/lib/ovs-thread.c:708)
> #4 0x80fde8bb in start_thread (/lib64/libpthread.so.0+0x78bb)
> #5 0x808fa5cb in thread_start (/lib64/libc.so.6+0xd55cb)
> 0x61e1e420 is located 32 bytes inside of 34496-byte region 
> [0x61e1e400,0x61e26ac0)
>  freed by thread T12 (urcu2) here:
> #0 0x8214fe33 in free (/lib64/libasan.so.4+0xd2e33)
> #1 0xdcc576df 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/ofproto/ofproto-dpif.c:734)
> #2 0xdcc21acb 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/ofproto/ofproto.c:1687)
> #3 0xdcf76f5f 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/lib/ovs-rcu.c:348 
> (discriminator 3))
> #4 0xdcf770c7 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/lib/ovs-rcu.c:363)
> #5 0xdcf7fa9b 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/lib/ovs-thread.c:708)
> #6 0x80fde8bb in start_thread (/lib64/libpthread.so.0+0x78bb)
> #7 0x808fa5cb in thread_start (/lib64/libc.so.6+0xd55cb)
> previously allocated by thread T0 here:
> #0 0x821503c3 in __interceptor_calloc (/lib64/libasan.so.4+0xd33c3)
> #1 0xdd034717 in xcalloc 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/lib/util.c:99 (discriminator 3))
> #2 0xdd034767 in xzalloc 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/lib/util.c:110)
> #3 0xdcc576ab 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/ofproto/ofproto-dpif.c:726)
> #4 0xdcc1be93 in ofproto_create 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/ofproto/ofproto.c:505)
> #5 0xdcbd793f 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/vswitchd/bridge.c:1208)
> #6 0xdcbeefb7 in bridge_run 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/vswitchd/bridge.c:3944 
> (discriminator 4))
> #7 0xdcbfdb83 in main 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/vswitchd/ovs-vswitchd.c:240)
> #8 0x80845adf in __libc_start_main (/lib64/libc.so.6+0x20adf)
> #9 0xdcbd19b3 (/usr/sbin/ovs-vswitchd-2.12.asan+0x26f9b3)
> SUMMARY: AddressSanitizer: heap-use-after-free 
> (/usr/src/debug/openvswitch-2.12.asan.aarch64/ofproto/ofproto.c:2916)
>
> ===
> TEST:
> 1.ASAN test for three days
>
> CC: Jarno Rajahalme 
> Fixes: 39c9459355b6 ("Use classifier versioning.")
>
> Signed-off-by: Hongzhi Guo 
> Signed-off-by: hepeng 
> ---
>  ofproto/ofproto-dpif-xlate-cache.c |  1 +
>  ofproto/ofproto-dpif.c | 20 ++---
>  ofproto/ofproto-provider.h |  5 +
>  ofproto/ofproto.c  | 35 ++
>  4 files changed, 49 insertions(+), 12 deletions(-)
>
> diff --git a/ofproto/ofproto-dpif-xlate-cache.c 
> b/ofproto/ofproto-dpif-xlate-cache.c
> index dcc91cb38..0deee365d 100644
> --- a/ofproto/ofproto-dpif-xlate-cache.c
> +++ b/ofproto/ofproto-dpif-xlate-cache.c
> @@ -209,6 +209,7 @@ xlate_cache_clear_entry(struct xc_entry *entry)
>  {
>  switch (entry->type) {
>  case XC_TABLE:
> +ofproto_unref(&(entry->table.ofproto->up));
>  break;
>  case XC_RULE:
>  ofproto_rule_unref(>rule->up);
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index fd0b2fdea..8b22eda5a 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -4416,10 +4416,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif 
> *ofproto,
>  if (xcache) {
>  struct xc_entry *entry;
>
> -entry = xlate_cache_add_entry(xcache, XC_TABLE);
> -entry->table.ofproto = ofproto;
> -entry->table.id = *table_id;
> -entry->table.match = true;
> +if (ofproto_try_ref(>up)) {
> +

Re: [ovs-dev] [PATCH 4/4] conntrack: compact the size of conn structure.

2021-03-05 Thread 贺鹏

Gaëtan Rivet  于2021年3月3日周三 上午4:01写道：
>
> On Sun, Nov 29, 2020, at 03:32, hepeng.0320 wrote:
> > From: hepeng 
> >
> > This patch tries to use ovs_spin_lock as an alternative to reduce the
> > size of each conn. The ovs_mutex_lock consumes 48 bytes while the
> > ovs_spin_lock only uses 16 bytes.
>
> Using a spin-lock means that the thread won't yield. It might be fine for 
> userland datapath threads, but those are not the only one locking the 
> connections. The ct_clean thread will do, and will execute on randomly 
> assigned CPUs, potentially blocking other threads from being scheduled there. 
> As many conn critical sections are not short, this could impair other modules 
> beyond conntrack.

I guess one way to check if you should use spinlock is to check if
your datapath is polling base or not, the spinlock might be harmful
for AF_XDP userspace datapath, I guess.

since PMD are normally tied into some cores, so the execute time of
the critical section is limited and predictable (as normally some
special optimization like putting isolcpu in the GRUB), so we can
expect that the CT clean thread will not be blocked that long?
I see in DPDK code the spinlock is used in many places, I guess in
certain cases, that we can switch to use spinlock.

Maybe uses pthread_mutex as a default setting, but provide compile
macros as in this patch.

>
> Gaining some bytes is good, but it needs more thought and opinions. There 
> might be other venues available, less risky, to reduce the size.
>
> >Also, we remove 
> > the alg info into an
> > extra space as alg is not common in the real world userspace ct use case.
> > (mostly used as security group and nat in the cloud).
>
> Can you submit this as a separate commit?
>
> >
> > The size of conn_tcp: 312 -> 240.
> >
> > Signed-off-by: Peng He 
>
> Regards,
> --
> Gaetan Rivet
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH 3/4] conntrack: refactoring alg handle code path

2021-03-05 Thread 贺鹏

Gaëtan Rivet  于2021年3月3日周三 上午3:49写道：
>
> On Sun, Nov 29, 2020, at 03:32, hepeng.0320 wrote:
> > From: hepeng 
> >
> > Just like the previous patch, split the handle_ftp_ctl into
> > handle_ftp_interest and handle_ftp_other, since we can infer
> > which one to call from the called positions.
> >
> > We also introduce a simple framework which unifies the normal
> > conn handle path and alg conn handle path by using two hooks:
> > *before_conn_update_hook* and *after_conn_update_hook*.
> >
>
> Reading this 'also', I thought initially the commit should be split and the 
> two changes, dividing the ALG CTL in two and introducing the framework were 
> unrelated. Reading further, it's clearer now that this is one logical change.
> I think the link between the two should be expanded in the commit log.
>
> The motivation I get for this commit is to "unify the normal and ALG conn 
> handle path". However, only the FTP ALG benefits from the hooks introduced. 
> The code is streamlined and a single-purpose function is removed, but it 
> seems flimsy as a motivation to introduce a new callback system.

The userspace CT only supports very few ALGs, and even for FTP ALG,
the code can just work for the normal traffic,
any retrans TCP will confuse FTP ALG since the seq adjust account both
normal and retrans packets.

the mini framework is just to unify the code handle path, as for alg
process, you have to do something before conn_udpate
and after, it's better to have a hook framework better than hard
coded. I am not sure if future ALG handlers can benefit from
this framework, but it seems it's better than the current implementation.

The other part I agree with you.

>
> Can you expand on the goal of the framework, and which ALG it could help to 
> implement?
> It is hard to judge whether the framework is useful and properly defined 
> without having a full picture.
>
> > Signed-off-by: Peng He 
> > ---
> >  lib/conntrack-private.h |  16 +-
> >  lib/conntrack.c | 477 
> >  2 files changed, 307 insertions(+), 186 deletions(-)
> >
> > diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
> > index 6bec43d3f..2aa828674 100644
> > --- a/lib/conntrack-private.h
> > +++ b/lib/conntrack-private.h
> > @@ -83,7 +83,22 @@ struct alg_exp_node {
> >  bool nat_rpl_dst;
> >  };
> >
> > +enum ct_alg_ctl_type {
> > +CT_ALG_CTL_NONE,
> > +CT_ALG_CTL_FTP,
> > +CT_ALG_CTL_TFTP,
> > +/* SIP is not enabled through Openflow and presently only used as
> > + * an example of an alg that allows a wildcard src ip. */
> > +CT_ALG_CTL_SIP,
> > +CT_ALG_CTL_MAX,
>
> I don't see this CT_ALG_CTL_MAX used afterward in this commit?
> It is introduced by this change.
>
> > +};
> > +
> >  #define CONN_FLAG_NAT_MASK 0xf
> > +#define CONN_FLAG_CTL_FTP  (CT_ALG_CTL_FTP  << 4)
> > +#define CONN_FLAG_CTL_TFTP (CT_ALG_CTL_TFTP << 4)
> > +#define CONN_FLAG_CTL_SIP  (CT_ALG_CTL_SIP  << 4)
> > +/* currently only 3 algs supported */
> > +#define CONN_FLAG_ALG_MASK 0x70
>
> The enum values are not flags.
> I guess they should instead be defined as
>
> #define CONN_FLAG_CTL_FTP (UINT64_C(1) << (4 + CT_ALG_CTL_FTP))
>
> In the code this should have no effect as SIP is not implemented, so only ALG 
> were 1 and 2, which are valid flags.
>
> >
> >  struct conn_dir {
> >  struct cmap_node cm_node;
> > @@ -99,7 +114,6 @@ struct conn {
> >  struct conn_key parent_key; /* Only used for orig_tuple support. */
> >  struct ovs_list exp_node;
> >  uint64_t conn_flags;
> > -char *alg;
> >
> >  /* Mutable data. */
> >  struct ovs_mutex lock; /* Guards all mutable fields. */
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index a22252a63..fffc617fb 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -69,15 +69,6 @@ enum ct_alg_mode {
> >  CT_TFTP_MODE,
> >  };
> >
> > -enum ct_alg_ctl_type {
> > -CT_ALG_CTL_NONE,
> > -CT_ALG_CTL_FTP,
> > -CT_ALG_CTL_TFTP,
> > -/* SIP is not enabled through Openflow and presently only used as
> > - * an example of an alg that allows a wildcard src ip. */
> > -CT_ALG_CTL_SIP,
> > -};
> > -
> >  struct zone_limit {
> >  struct hmap_node node;
> >  struct conntrack_zone_limit czl;
> > @@ -153,30 +144,145 @@ static struct ct_l4_proto *l4_protos[] = {
> >  };
> >
> >  static void
> > -handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> > -   struct dp_packet *pkt, struct conn *ec, long long now,
> > -   enum ftp_ctl_pkt ftp_ctl, bool nat);
> > -
> > +handle_ftp_ctl_other(struct conntrack *ct, const struct
> > conn_lookup_ctx *ctx,
> > + struct dp_packet *pkt, struct conn *ec, long long
> > now,
> > + bool nat);
> > +static void
> > +handle_ftp_ctl_interest(struct conntrack *ct, const struct
> > conn_lookup_ctx *ctx,
> > +struct dp_packet *pkt, struct conn *ec, long
> > long

Re: [ovs-dev] [PATCH 1/4] conntrack: remove nat_conn

2021-03-05 Thread 贺鹏

Gaëtan Rivet  于2021年3月3日周三 上午1:46写道：
>
> On Sat, Feb 27, 2021, at 09:49, 贺鹏 wrote:
> > Hi,
> >
> > Thanks for the detailed reviews.
> > These patches are like RFC to see if the work is interesting enough.
> >
> > Aaron Conole  于2021年2月25日周四 上午1:37写道：
> > >
> > > "hepeng.0320"  writes:
> > >
> > > > From: hepeng 
> > > >
> > > > Currently, when doing NAT, the userspace conntrack will use an extra
> > > > conn for the two directions in a flow. However, each conn has actually
> > > > the two keys for both orig and rev flow directions. This patch
> > > > introduces a conn_dir member in the conn and it consists of both rev and
> > > > orig cmap_node for hash lookup. This saves extra allocation for
> > > > nat_conn, and makes userspace code much cleaner.
> > >
> > > If I'm understanding this correctly, you still re-insert the conn into
> > > the conn list?
> >
> >
> > Yes. This is to insert the rev key into the cmap also. So for the nat 
> > traffic,
> > both orig and rev key are in the cmap.
> >
> > >
> > > > We also introduces a conn_flags member to reduce the memory footprint 
> > > > of a
> > > > conn.
> > >
> > > We should split this out to a separate patch.
> > >
> > > > Signed-off-by: Peng He 
> > > > ---
> > > >  lib/conntrack-private.h |  20 +--
> > > >  lib/conntrack.c | 264 
> > > >  2 files changed, 121 insertions(+), 163 deletions(-)
> > > >
> > > > diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
> > > > index 789af82ff..6bec43d3f 100644
> > > > --- a/lib/conntrack-private.h
> > > > +++ b/lib/conntrack-private.h
> > > > @@ -83,21 +83,23 @@ struct alg_exp_node {
> > > >  bool nat_rpl_dst;
> > > >  };
> > > >
> > > > -enum OVS_PACKED_ENUM ct_conn_type {
> > > > -CT_CONN_TYPE_DEFAULT,
> > > > -CT_CONN_TYPE_UN_NAT,
> > > > +#define CONN_FLAG_NAT_MASK 0xf
> > > > +
> > > > +struct conn_dir {
> > > > +struct cmap_node cm_node;
> > > > +bool orig;
> > >
> > > This naming is confusing.  We can have 'conn->orig->orig'.  Consider
> > > renaming one of these fields to distinguish what is going on.  I would
> > > prefer seeing 'fwd' used (since it's the forward direction).
> >
> > Yes, agree.
> >
> > >
> > > >  };
> > > >
> > > >  struct conn {
> > > >  /* Immutable data. */
> > > >  struct conn_key key;
> > > > +struct conn_dir orig;
> > > >  struct conn_key rev_key;
> > > > +struct conn_dir rev;
> > >
> > > I think this might be better if setup like:
> > > +enum ct_direction {
> > > +CT_DIRECTION_FWD,
> > > +CT_DIRECTION_REV,
> > > +CT_DIRECTIONS
> > > +}
> > > +
> > > +struct conn_data {
> > > +struct conn_key key;
> > > +struct cmap_node cm_node;
> > >  };
> > >
> > >  struct conn {
> > > -struct conn_key key;
> > > -struct conn_key rev_key;
> > > +struct conn_data cn_data[CT_DIRECTIONS];
> > > ...
> > >
> > > Then in the code, we can always get orig and rev information:
> > >
> > >   conn->cn_data[CT_DIRECTION_FWD]
> > >   conn->cn_data[CT_DIRECTION_REV]
> > >
> > > Did I miss something?
> >
> > Since both origin and rev are in the cmap, when you lookup the hash table,
> > you get a pointer to the  'middle data structure', in the patch, it is
> > the conn_dir.
> >
> > However, you are not sure what you get is the origin dir or the rev dir. 
> > The key
> > is to use a variable stored in the conn_dir, like 'fwd' or 'orig'.
> > Then you know the dir,
> > by knowing the dir, you know the offset between your pointer and the 
> > pointer to
> > the conn, just like the belowing code:
> >
> > static inline struct conn * conn_from_conndir(struct conn_dir *conndir) {
> > return conndir->orig ? CONTAINER_OF(conndir, struct conn, orig) : \
> >CONTAINER_OF(conndir, struct conn, rev);
> > }
> >
> > In your case:
> >
> > static inline struct conn * conn_from_conndir(struct co

Re: [ovs-dev] conntrack: code refactoring userspace ct.

2021-03-05 Thread 贺鹏

Hi,

I've read all your suggestions and appreciate that.

Just have been through a very busy week, and probably will have a
very busy month, so the work and reply might be slow.

Will do a rebase in the next version.

I've seen some of your patches related to CT. do you have some plans
for CT optimization? is it a part of plans related to CT offload?


Gaëtan Rivet  于2021年3月2日周二 下午10:47写道：
>
> On Sun, Nov 29, 2020, at 03:32, hepeng.0320 wrote:
> > This series of patch refactor the code of userspace CT, which makes
> > the code more simple and more structured. We mainly refactor the code
> > path for NAT and ALG, using conn_flags instead of bool value and use a
> > light weight lock to reduce the size of the conn. The size of conn has
> > been reduced from 312 to 240 bytes.
> >
> >
>
> Hello,
>
> This series is interesting, but it needs some work.
>
> Commits should be divided into smaller elements (as Aaron pointed out), and 
> some changes are
> barely described, or bunched with other changes without enough justification.
>
> Could you rebase and rework the series to have a more organized patchset?
> That would make the review easier.
>
> Regards,
> --
> Gaetan
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev



-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v3] ofproto-dpif-xlate: fix zone set from non-frozen-metadata fields

2021-03-05 Thread 贺鹏

Mark Gray  于2021年3月5日周五 下午7:54写道：
>
> On 27/02/2021 09:34, Peng He wrote:
> > CT zone could be set from a field that is not included in frozen
> > metadata. Consider the example rules which are typically seen in
> > OpenStack security group rules:
> >
> > priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
> > priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
> >
> > The zone is set from the first rule's ct action. These two rules will
> > generate two megaflows: the first one uses zone=5 to query the CT module,
> > the second one sets the zone-id from the first megaflow and commit to CT.
> >
> > The current implementation will generate a megaflow that does not use
> > ct_zone=5 as a match, but directly commit into the ct using zone=5, as zone 
> > is
> > set by an Imm not a field.
> >
> > Consider a situation that one changes the zone id (for example to 15)
> > in the first rule, however, still keep the second rule unchanged. During
> > this change, there is traffic hitting the two generated megaflows, the
> > revaldiator would revalidate all megaflows, however, the revalidator will
> > not change the second megaflow, because zone=5 is recorded in the
> > megaflow, so the xlate will still translate the commit action into zone=5,
> > and the new traffic will still commit to CT as zone=5, not zone=15,
> > resulting in taffic drops and other issues.
> >
> > Just like OVS set-field convention, if a field X is set by Y
> > (Y is a variable not an Imm), we should also mask Y as a match
> > in the generated megaflow. An exception is that if the zone-id is
> > set by the field that is included in the frozen state (i.e. regs) and this
> > upcall is a resume of a thawed xlate, the un-wildcarding can be skipped,
> > as the recirc_id is a hash of the values in these fields, and it will change
> > following the changes of these fields. When the recirc_id changes,
> > all megaflows with the old recirc id will be invalid later.
>
> This looks good to me and all the unit-tests pass.
>
> There is some trailing whitespace. You can run
> "./utilities/checkpatch.py" when submitting to catch them before the
> 0-day robot does. Its a bit of a nit and I don't know if this won't get
> committed because of it. That's a decision for a maintainer.

thanks for your kind suggestion
.
better send a new version, but before that maybe there are
some more suggestions from maintainers, and it's better
to wait and then resend.

> >
> > Fixes: 07659514c3 ("Add support for connection tracking.")
> > Reported-by: Sai Su 
> > Signed-off-by: Peng He 
> > ---
> >  include/openvswitch/meta-flow.h |  1 +
> >  lib/meta-flow.c | 13 ++
> >  ofproto/ofproto-dpif-xlate.c| 12 +
> >  tests/system-traffic.at | 45 +
> >  4 files changed, 71 insertions(+)
> >
> > diff --git a/include/openvswitch/meta-flow.h 
> > b/include/openvswitch/meta-flow.h
> > index 95e52e358..045dce8f5 100644
> > --- a/include/openvswitch/meta-flow.h
> > +++ b/include/openvswitch/meta-flow.h
> > @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct mf_field *,
> >const union mf_value *mask,
> >struct flow *);
> >  bool mf_is_tun_metadata(const struct mf_field *);
> > +bool mf_is_frozen_metadata(const struct mf_field *);
> >  bool mf_is_pipeline_field(const struct mf_field *);
> >  bool mf_is_set(const struct mf_field *, const struct flow *);
> >  void mf_mask_field(const struct mf_field *, struct flow_wildcards *);
> > diff --git a/lib/meta-flow.c b/lib/meta-flow.c
> > index c808d205d..e03cd8d0c 100644
> > --- a/lib/meta-flow.c
> > +++ b/lib/meta-flow.c
> > @@ -1788,6 +1788,19 @@ mf_is_tun_metadata(const struct mf_field *mf)
> > mf->id < MFF_TUN_METADATA0 + TUN_METADATA_NUM_OPTS;
> >  }
> >
> > +bool
> > +mf_is_frozen_metadata(const struct mf_field *mf)
> > +{
> > +if (mf->id >= MFF_TUN_ID && mf->id <= MFF_IN_PORT_OXM) {
> > +return true;
> > +}
> > +
> > +if (mf->id >= MFF_REG0 && mf->id < MFF_ETH_SRC) {
> > +return true;
> > +}
> > +return false;
> > +}
> > +
> >  bool
> >  mf_is_pipeline_field(const struct mf_field *mf)
> >  {
> > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> > index 7108c8a30..14d00db1e 100644
> > --- a/ofproto/ofproto-dpif-xlate.c
> > +++ b/ofproto/ofproto-dpif-xlate.c
> > @@ -6195,6 +6195,18 @@ compose_conntrack_action(struct xlate_ctx *ctx, 
> > struct ofpact_conntrack *ofc,
> >
> >  if (ofc->zone_src.field) {
> >  zone = mf_get_subfield(>zone_src, >xin->flow);
> > +if (ctx->xin->frozen_state) {
> > +/* If the upcall is a resume of a recirculation, we only need 
> > to
> > + * unwildcard the fields that are not in the frozen_metadata, 
> > as
> > + * when the rules update, OVS will generate a new recirc_id,
> > + * which

Re: [ovs-dev] [External] Re: [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-03-01 Thread . 贺鹏

On Tue, Mar 2, 2021 at 4:15 AM Ilya Maximets  wrote:
>
> On 2/27/21 9:53 AM, 贺鹏 wrote:
> > Hi,
> >
> > Thanks William and Ilya for a detailed revisiting of the origin of the
> > problem. I learned a lot.
> >
> > I now understand that the mix using of RCU and refcounts is not
> > intended in the first place.
> > But my point is that mix using RCU and refcounts now gives you more
> > choices, and actually eases the code changes.
> >
> > For example, the code for * ofproto_dpif_lookup_by_name* or other
> > ofproto lookup function,
> > when only using refcounts, you need to change it to:
> >
> >  struct ofproto_dpif *
> >  ofproto_dpif_lookup_by_uuid(const struct uuid *uuid)
> >  {
> >  struct ofproto_dpif *ofproto;
> >
> >  HMAP_FOR_EACH_WITH_HASH (ofproto, all_ofproto_dpifs_by_uuid_node,
> >   uuid_hash(uuid), _ofproto_dpifs_by_uuid) {
> >  if (uuid_equals(>uuid, uuid)) {
> >
> >  ---> if  ovs_refcount_try_ref(ofproto)
> >
> >  return ofproto;
> >  }
> >  }
> >  return NULL;
> >  }
>
> That is an interesting example here...
> I can't help but notice that this function is typically called
> from different handler or pmd threads and modified by the main thread
> while upcalls enabled.  And hmap is not a thread-safe structure.
> I guess, we have another possible problem here.  We need to protect
> at least this hmap and the other one with rw lock or something...
> Am I right or am I missing something?  What else we need to protect?

I guess the function needs to be changed into using cmap 

But fortunately, ofproto_dpif_lookup_by_uuid is called by
upcall_receive, and for most upcalls
(with the type MISS_UPCALL), it will not be called.

...
 } else if (upcall->type == MISS_UPCALL) {
 error = xlate_lookup(backer, flow, >ofproto, >ipfix,
  >sflow, NULL, >ofp_in_port);
 if (error) {
 return error;
 }
 } else {
 struct ofproto_dpif *ofproto
 = ofproto_dpif_lookup_by_uuid(>cookie.ofproto_uuid);
 if (!ofproto) {
 VLOG_INFO_RL(, "upcall could not find ofproto");
 return ENODEV;
 }
 upcall->ofproto = ofproto;
 upcall->ipfix = ofproto->ipfix;
 upcall->sflow = ofproto->sflow;
 upcall->ofp_in_port = upcall->cookie.ofp_in_port;
 }
...

So for the people who use sflow, I guess there could be a risk, but
this is also rare,
as the bridges rarely change.


another case ofproto_dpif_lookup_by_name looks fine, it will only be called
in the main thread.


>
> >
> > and after finish its usage, you have to do ofproto_unref the ofproto.
> >
> > This is why I said, most accessing to ofproto is ephemeral. If you
> > change to use the pure refcounts solution, you have
> > to add refcount and release it every time you access the ofproto. We
> > should be more careful and remember to unref
> > the ofproto.
> >
> > However, if using RCU and refcounts, in the above case, you do not
> > need to change the code, since the RCU ensures that
> > these ephemeral accesses are safe.
> >
> > you only need to add refcount, when you find that the pointer to
> > ofproto lives longer than one grace period.
> >
> >
> > This is why in my patch, I do not add ref to ofproto after its
> > creation. I agree the patch is not complete and has issues,
> > and understand it could confuse people if changes into mix RCU and
> > refcounts version.
> >
> >
> > William Tu  于2021年2月26日周五 上午2:32写道：
> >>
> >> On Thu, Feb 25, 2021 at 4:54 AM Ilya Maximets  wrote:
> >>>
> >>> On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote:
> >>>> Refcount and RCU are not mutually exclusive.
> >>>> IMO, the main reason for this problem is that the rule uses both 
> >>>> refcount and rcu, while the ofproto uses only rcu, and the rule retains 
> >>>> the pointer of the ofproto. More importantly, ofproto cannot guarantee a 
> >>>> longer grace period than the rule.
> >>>>
> >>>
> >>> I understand that refcount and RCU are not mutually exclusive.
> >>> However, in this particular case RCU for ofproto was introduced for one
> >>> and only one reason - to avoid freeing ofproto while rules are still
> >>> alive.  This was done in commit f416c8d61601 ("ofproto: RCU postpone
> >>> rule destruction.").  The goal was to allow using

Re: [ovs-dev] [PATCH 1/4] conntrack: remove nat_conn

2021-02-27 Thread 贺鹏

Hi,

Thanks for the detailed reviews.
These patches are like RFC to see if the work is interesting enough.

Aaron Conole  于2021年2月25日周四 上午1:37写道：
>
> "hepeng.0320"  writes:
>
> > From: hepeng 
> >
> > Currently, when doing NAT, the userspace conntrack will use an extra
> > conn for the two directions in a flow. However, each conn has actually
> > the two keys for both orig and rev flow directions. This patch
> > introduces a conn_dir member in the conn and it consists of both rev and
> > orig cmap_node for hash lookup. This saves extra allocation for
> > nat_conn, and makes userspace code much cleaner.
>
> If I'm understanding this correctly, you still re-insert the conn into
> the conn list?


Yes. This is to insert the rev key into the cmap also. So for the nat traffic,
both orig and rev key are in the cmap.

>
> > We also introduces a conn_flags member to reduce the memory footprint of a
> > conn.
>
> We should split this out to a separate patch.
>
> > Signed-off-by: Peng He 
> > ---
> >  lib/conntrack-private.h |  20 +--
> >  lib/conntrack.c | 264 
> >  2 files changed, 121 insertions(+), 163 deletions(-)
> >
> > diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
> > index 789af82ff..6bec43d3f 100644
> > --- a/lib/conntrack-private.h
> > +++ b/lib/conntrack-private.h
> > @@ -83,21 +83,23 @@ struct alg_exp_node {
> >  bool nat_rpl_dst;
> >  };
> >
> > -enum OVS_PACKED_ENUM ct_conn_type {
> > -CT_CONN_TYPE_DEFAULT,
> > -CT_CONN_TYPE_UN_NAT,
> > +#define CONN_FLAG_NAT_MASK 0xf
> > +
> > +struct conn_dir {
> > +struct cmap_node cm_node;
> > +bool orig;
>
> This naming is confusing.  We can have 'conn->orig->orig'.  Consider
> renaming one of these fields to distinguish what is going on.  I would
> prefer seeing 'fwd' used (since it's the forward direction).

Yes, agree.

>
> >  };
> >
> >  struct conn {
> >  /* Immutable data. */
> >  struct conn_key key;
> > +struct conn_dir orig;
> >  struct conn_key rev_key;
> > +struct conn_dir rev;
>
> I think this might be better if setup like:
> +enum ct_direction {
> +CT_DIRECTION_FWD,
> +CT_DIRECTION_REV,
> +CT_DIRECTIONS
> +}
> +
> +struct conn_data {
> +struct conn_key key;
> +struct cmap_node cm_node;
>  };
>
>  struct conn {
> -struct conn_key key;
> -struct conn_key rev_key;
> +struct conn_data cn_data[CT_DIRECTIONS];
> ...
>
> Then in the code, we can always get orig and rev information:
>
>   conn->cn_data[CT_DIRECTION_FWD]
>   conn->cn_data[CT_DIRECTION_REV]
>
> Did I miss something?

Since both origin and rev are in the cmap, when you lookup the hash table,
you get a pointer to the  'middle data structure', in the patch, it is
the conn_dir.

However, you are not sure what you get is the origin dir or the rev dir. The key
is to use a variable stored in the conn_dir, like 'fwd' or 'orig'.
Then you know the dir,
by knowing the dir, you know the offset between your pointer and the pointer to
the conn, just like the belowing code:

static inline struct conn * conn_from_conndir(struct conn_dir *conndir) {
return conndir->orig ? CONTAINER_OF(conndir, struct conn, orig) : \
   CONTAINER_OF(conndir, struct conn, rev);
}

In your case:

static inline struct conn * conn_from_conndir(struct conn_dir *conndir) {
return conndir->orig ? CONTAINER_OF(conndir, struct conn, cn_data[FWD]) : \
   CONTAINER_OF(conndir, struct conn, cn_data[REV]);
}

In the following patches I've changed the code of conn_dir as a part
of conn_key



>
> >  struct conn_key parent_key; /* Only used for orig_tuple support. */
> >  struct ovs_list exp_node;
> > -struct cmap_node cm_node;
> > -struct nat_action_info_t *nat_info;
> > +uint64_t conn_flags;
>
> As mentioned, please separate this as a different patch.  This is new
> piece of functionality.
>
> >  char *alg;
> > -struct conn *nat_conn; /* The NAT 'conn' context, if there is one. */
> >
> >  /* Mutable data. */
> >  struct ovs_mutex lock; /* Guards all mutable fields. */
> > @@ -117,11 +119,15 @@ struct conn {
> >
> >  /* Immutable data. */
> >  bool alg_related; /* True if alg data connection. */
> > -enum ct_conn_type conn_type;
> >
> >  uint32_t tp_id; /* Timeout policy ID. */
> >  };
> >
> > +static inline struct conn * conn_from_conndir(struct conn_dir *conndir) {
> > +return conndir->orig ? CONTAINER_OF(conndir, struct conn, orig) : \
> > +   CONTAINER_OF(conndir, struct conn, rev);
> > +}
> > +
> >  enum ct_update_res {
> >  CT_UPDATE_INVALID,
> >  CT_UPDATE_VALID,
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index 930ed0be6..73d3a2fb2 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -94,7 +94,6 @@ static struct conn *new_conn(struct conntrack *ct, struct 
> > dp_packet *pkt,
> >   uint32_t tp_id);
>

Re: [ovs-dev] [PATCH v2] ofproto-dpif-xlate: fix zone set from non-frozen-metadata fields

2021-02-27 Thread 贺鹏

Hi,

Sent a new version of the patch.

贺鹏  于2021年2月20日周六 上午10:21写道：
>
> Hi, Mark,
>
> thanks for the review!
>
> Mark Gray  于2021年2月19日周五 下午9:07写道：
> >
> > On 18/02/2021 09:43, Peng He wrote:
> >
> > One small comment below. By the way, good catch .. i found it difficult
> > enough to review it, let alone find it!
>
> Thanks.
> This is found by exploring our controller's bug, which changes zone-id
> every time it restarts.
>
> >
> > > CT zone could be set from a field that is not included in frozen
> > > metadata. Consider the example rules which are typically seen in
> > > OpenStack security group rules:
> > >
> > > priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
> > > priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
> > >
> > > The zone is set from the first rule's ct action. These two rules will
> > > generate two megaflows: the first one uses zone=5 to query the CT module,
> > > the second one sets the zone-id from the first megaflow and commit to CT.
> > >
> > > The current implementation will generate a megaflow that does not use
> > > ct_zone=5 as a match, but directly commit into the ct using zone=5, as 
> > > zone is
> > > set by an Imm not a field.
> > >
> > > Consider a situation that one changes the zone id (for example to 15)
> > > in the first rule, however, still keep the second rule unchanged. During
> > > this change, there is traffic hitting the two generated megaflows, the
> > > revaldiator would revalidate all megaflows, however, the revalidator will
> > > not change the second megaflow, because zone=5 is recorded in the
> > > megaflow, so the xlate will still translate the commit action into zone=5,
> > > and the new traffic will still commit to CT as zone=5, not zone=15,
> > > resulting in taffic drops and other issues.
> > >
> > > Just like OVS set-field convention, if a field X is set by Y
> > > (Y is a variable not an Imm), we should also mask Y as a match
> > > in the generated megaflow. An exception is that if the zone-id is
> > > set by the field that is included in the frozen state (i.e. regs) and this
> > > upcall is a resume of a thawed xlate, the un-wildcarding can be skipped,
> > > as the recirc_id is a hash of the values in these fields, and it will 
> > > change
> > > following the changes of these fields. When the recirc_id changes,
> > > all megaflows with the old recirc id will be invalid later.
> > >
> > > Fixes: 07659514c3 ("Add support for connection tracking.")
> > > Reported-by: Sai Su 
> > > Signed-off-by: Peng He 
> > > ---
> > >  include/openvswitch/meta-flow.h |  1 +
> > >  lib/meta-flow.c | 13 ++
> > >  ofproto/ofproto-dpif-xlate.c| 12 ++
> > >  tests/system-traffic.at | 42 +
> > >  4 files changed, 68 insertions(+)
> > >
> > > diff --git a/include/openvswitch/meta-flow.h 
> > > b/include/openvswitch/meta-flow.h
> > > index 95e52e358..045dce8f5 100644
> > > --- a/include/openvswitch/meta-flow.h
> > > +++ b/include/openvswitch/meta-flow.h
> > > @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct mf_field 
> > > *,
> > >const union mf_value *mask,
> > >struct flow *);
> > >  bool mf_is_tun_metadata(const struct mf_field *);
> > > +bool mf_is_frozen_metadata(const struct mf_field *);
> > >  bool mf_is_pipeline_field(const struct mf_field *);
> > >  bool mf_is_set(const struct mf_field *, const struct flow *);
> > >  void mf_mask_field(const struct mf_field *, struct flow_wildcards *);
> > > diff --git a/lib/meta-flow.c b/lib/meta-flow.c
> > > index c808d205d..e03cd8d0c 100644
> > > --- a/lib/meta-flow.c
> > > +++ b/lib/meta-flow.c
> > > @@ -1788,6 +1788,19 @@ mf_is_tun_metadata(const struct mf_field *mf)
> > > mf->id < MFF_TUN_METADATA0 + TUN_METADATA_NUM_OPTS;
> > >  }
> > >
> > > +bool
> > > +mf_is_frozen_metadata(const struct mf_field *mf)
> > > +{
> > > +if (mf->id >= MFF_TUN_ID && mf->id <= MFF_IN_PORT_OXM) {
> > > +return true;
> > > +}
> > > +
> > > +if (mf->id >= MFF_REG0 && mf->id < MFF_ETH_SRC) {
> > > +return tr

Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-27 Thread 贺鹏

Hi,

Thanks William and Ilya for a detailed revisiting of the origin of the
problem. I learned a lot.

I now understand that the mix using of RCU and refcounts is not
intended in the first place.
But my point is that mix using RCU and refcounts now gives you more
choices, and actually eases the code changes.

For example, the code for * ofproto_dpif_lookup_by_name* or other
ofproto lookup function,
when only using refcounts, you need to change it to:

 struct ofproto_dpif *
 ofproto_dpif_lookup_by_uuid(const struct uuid *uuid)
 {
 struct ofproto_dpif *ofproto;

 HMAP_FOR_EACH_WITH_HASH (ofproto, all_ofproto_dpifs_by_uuid_node,
  uuid_hash(uuid), _ofproto_dpifs_by_uuid) {
 if (uuid_equals(>uuid, uuid)) {

 ---> if  ovs_refcount_try_ref(ofproto)

 return ofproto;
 }
 }
 return NULL;
 }

and after finish its usage, you have to do ofproto_unref the ofproto.

This is why I said, most accessing to ofproto is ephemeral. If you
change to use the pure refcounts solution, you have
to add refcount and release it every time you access the ofproto. We
should be more careful and remember to unref
the ofproto.

However, if using RCU and refcounts, in the above case, you do not
need to change the code, since the RCU ensures that
these ephemeral accesses are safe.

you only need to add refcount, when you find that the pointer to
ofproto lives longer than one grace period.


This is why in my patch, I do not add ref to ofproto after its
creation. I agree the patch is not complete and has issues,
and understand it could confuse people if changes into mix RCU and
refcounts version.


William Tu  于2021年2月26日周五 上午2:32写道：
>
> On Thu, Feb 25, 2021 at 4:54 AM Ilya Maximets  wrote:
> >
> > On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote:
> > > Refcount and RCU are not mutually exclusive.
> > > IMO, the main reason for this problem is that the rule uses both refcount 
> > > and rcu, while the ofproto uses only rcu, and the rule retains the 
> > > pointer of the ofproto. More importantly, ofproto cannot guarantee a 
> > > longer grace period than the rule.
> > >
> >
> > I understand that refcount and RCU are not mutually exclusive.
> > However, in this particular case RCU for ofproto was introduced for one
> > and only one reason - to avoid freeing ofproto while rules are still
> > alive.  This was done in commit f416c8d61601 ("ofproto: RCU postpone
> > rule destruction.").  The goal was to allow using rules without
> > refcounting them within a single grace period.  And that forced us
> > to postpone destruction of the ofproto for a single grace period.
> > Later commit 39c9459355b6 ("Use classifier versioning.") made it
> > possible for rules to be alive for more than one grace period, so
> > the commit made ofproto wait for 2 grace periods by double postponing.
> > As we can see now, that wasn't enough and we have to wait for more
> > than 2 grace periods in certain cases.
> >
> > Now we're introducing refcounts to wait for all rules to be destroyed
> > before destroying the ofproto.  But what is the point of having
> > RCU if we already know that if refcount is zero than all rules are
> > already destroyed and we don't need to wait anything and could just
> > destroy the ofproto?
> >
> > RCU doesn't protect ofproto itself here, it actually protects rules,
> > i.e. keeps ofproto alive while rules alive, but we can fully replace
> > this with refcounts and I see no value in having RCU additionally.
> >
> > To have a full picture: right now we also have groups protected by
> > RCU, so we need to refcount ofproto for them too.  But that is almost
> > exactly same situation as we have with rules.
> >
> Thanks Guohongzhi, Hepeng, and Ilya, I learned a lot from your discussions.
>
> I think refcount and RCU are mutually exclusive. They are two different ways 
> of
> doing synchronization and somehow we mix them together by using RCU to
> optimize refcount.
>
> Before the commit f416c8d61601 ("ofproto: RCU postpone rule destruction."),
> we are using refcount to protect rules, and as a result every time OVS
> references
> a rule, we have to take refcount. It works ok but this probably has
> high performance
> overhead because it's doing atomic operations.
>
> So the commit f416c8d61601 optimizes it by doing
> 1) not taking refcount of rule "within a grace period"
> 2) introduce RCU for rule
> The assumption is that a grace period is like refcount == 0, so it's
> valid to do so.
> A side effect is that ofproto_destroy__()  needs to be postponed.
> Note that if a rule is alive across grace period, it has to take refcount.
>
> However, later commit 39c9459355b6 ("Use classifier versioning.")
> makes rule alive across grace period. It breaks the first commit's assumption
> and it has to introduce double postponing for ofproto, which we found
> out is the problem now.
>
> Since my point is RCU and refcount are mutually exclusive (A grace period
> is like

Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-24 Thread 贺鹏

Ilya Maximets  于2021年2月25日周四 上午12:01写道：
>
> On 2/24/21 3:29 PM, Ilya Maximets wrote:
> > On 7/17/20 3:50 AM, hepeng.0320 wrote:
> >> From: Peng He 
> >>
> >> The call stack of rule_destroy_cb:
> >>
> >> remove_rules_postponed (one grace period)
> >> -> remove_rule_rcu
> >> -> remove_rule_rcu__
> >> -> ofproto_rule_unref -> ref count != 1
> >> -> ... more grace periods.
> >> -> rule_destroy_cb (> 2 grace periods)
> >> -> free
> >>
> >> Currently ofproto_destory waits only two grace periods, this means when
> >> rule_destroy_cb is called, the ofproto might have been already destroyed.
> >>
> >> This patch adds a refcount for ofproto to ensure ofproto exists when it
> >> is needed to call free functions.
> >>
> >> This patch fixes the crashes found:
> >> Program terminated with signal SIGSEGV, Segmentation fault.
> >> 0  0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at 
> >> ofproto/ofproto.c:2956
> >> 1  0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348
> >> 2  0x562a55262504 in ovsrcu_postpone_thread (arg=) at 
> >> lib/ovs-rcu.c:364
> >> 3  0x562a55264aef in ovsthread_wrapper (aux_=) at 
> >> lib/ovs-thread.c:383
> >> 4  0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at 
> >> pthread_create.c:456
> >> 5  0x7febe6990d0f in clone () at 
> >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> >> (gdb) p rule->ofproto->ofproto_class
> >> $3 = (const struct ofproto_class *) 0x0
> >>
> >> Also:
> >> 0  ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 
> >> '\000', n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310
> >> 1  0x558354c68514 in xlate_push_stats_entry 
> >> (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at 
> >> ofproto/ofproto-dpif-xlate-cache.c:99
> >> 2  0x558354c686f3 in xlate_push_stats (xcache=, 
> >> stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181
> >> 3  0x558354c5411a in revalidate_ukey 
> >> (udpif=udpif@entry=0x5583583baba0, ukey=ukey@entry=0x7ff47809f770, 
> >> stats=stats@entry=0x7ff49af32128, 
> >> odp_actions=odp_actions@entry=0x7ff49af30c50, 
> >> reval_seq=reval_seq@entry=275670456,
> >>recircs=recircs@entry=0x7ff49af30cd0) at 
> >> ofproto/ofproto-dpif-upcall.c:2292
> >> 4  0x558354c57cbc in revalidate (revalidator=) at 
> >> ofproto/ofproto-dpif-upcall.c:2681
> >> 5  0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at 
> >> ofproto/ofproto-dpif-upcall.c:934
> >> 6  0x558354d24aef in ovsthread_wrapper (aux_=) at 
> >> lib/ovs-thread.c:383
> >> 7  0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at 
> >> pthread_create.c:456
> >> 8  0x7ff4a3fd3d0f in clone () at 
> >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> >> (gdb) p ofproto->up.ofproto_class
> >> $3 = (const struct ofproto_class *) 0x0
> >>
> >> Signed-off-by: Peng He 
> >> ---
>
> CC: Hongzhi Guo
>
> And I just remembered that there is a very similar patch that
> was submitted to the mail-list several months betore this one.
> Ben started review, but didn't finish:
>
> http://patchwork.ozlabs.org/project/openvswitch/patch/20200306130555.19884-1-guohongz...@huawei.com/
>
> Both patches has smilar issues, i.e. mixing RCU with refcounters
> and not covering all the cases.
>
> There are at least 2 major places where ofproto needs to be refcounted:
> 1. rules
>1.a xlate caches?
> 2. ofproto groups
>
> IMO, mixing RCU with refcounts basically means that we do not have a
> clear understanding on how it works and using RCU as a safety net.
> We shuld stop using RCU for orproto and just have refcounts.  Once
> ofproto refcount reaches zero it should be immediately destroyed.
> All the places where we need to increase refcount needs to be tracked
> down.

IMO, I think RCU and refcounts are not mutually exclusive. In kernel
connntack system such as IPVS, sessions are used with both RCU and
refcounts.

Essentially, the RCU implies a refcount if there are no places holding
a pointer to the object other than hash tables (or other RCU enabled
data structure), thus by removing it from hash tables, and wait a
grace period,
we can safely free the object because we are sure that no one holds a
pointer to the object. Removing it from hash tables is like reducing
the refcount by 1, and since hash tables are
the only place that holds the pointer and this is the only reference
that lives longer than a grace period, so by removing it, we are sure
that no other ref can live longer than a grace period, and after the
period, we are safe to free it.

If we use only refcounts, every time we access the object, we have to
add refcounts.
But if use RCU combined with refcount, we only need to add ref, if
there are other places than the hashtables that live longer than one
grace period.

So I think using RCU with refcounts actually makes code changes less,
as mostly, the access to object is ephemeral, you lookup

Re: [ovs-dev] [PATCHv2] connmgr: Check nullptr inside ofmonitor_report()

2021-02-22 Thread 贺鹏

Hi, Ilya and William,

Ilya Maximets  于2021年2月22日周一 下午8:12写道：
>
> On 2/21/21 5:11 PM, William Tu wrote:
> > On Fri, Feb 19, 2021 at 6:29 PM 贺鹏  wrote:
> >>
> >> Hi, Ilya
> >>
> >> Ilya Maximets  于2021年2月19日周五 下午7:19写道：
> >>>
> >>> On 2/19/21 3:12 AM, 贺鹏 wrote:
> >>>> Hi,
> >>>>
> >>>> Looks like this bug is caused by violating the fact that if a rule is
> >>>> referenced, the related ofproto should not be destroyed.
> >>>>
> >>>> If so, I have a patch that also fixes the problem, not sure if this 
> >>>> helps.
> >>>>
> >>>> http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0...@bytedance.com/
> >>>
> >>> There is at least one more problem that is not strictly related but
> >>> in more or less the same part of the code:
> >>>   https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html
> >>
> >> So maybe before add it into *ovsrcu_postpone*, we should add refcount
> >> of ofproto also?
> >>
> > Yes, I think that will fix the issue. But are we able to find out all the
> > places that we need to add refcount of ofproto?

Yes, this could be difficult.

>
> destruct() is called for ofproto_dpif unconditionally without any deferring,
> so this will not help unless we're delaying the the destruct() itself, and
> I'm not sure if we can do that.
>
> >
> > Looks like we might have multiple rcu postponed function that might
> > access the 'ofproto'. And 'ofproto' might be freed already and cause 
> > segfault.
> >
> > Hepeng's patch fixes two places.
> >   
> > http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0...@bytedance.com/
> > Ilya pointed out another place
> >   https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html
> > yifeng's case is a little different (not due to ofproto = NULL, but
> > due to setting
> > the p->connmgr = NULL before postponed)
>
> In my case ofproto is alive too.  The problem is destroyed ofproto->baker.
> And in case of meter_id we don't really need to refcount the ofproto itself.
> 'baker' already has a refcount, so we could increase it and pass 'baker'
> pointer to free_meter_id instead of 'ofproto'.  This function doesn't
> actually need an 'ofproto' pointer.
>
> Regarding this connmgr related patch... It's still valid even with refcounts
> because destruction of connmgr is explicitly not deferred for a reason.  It's
> to free the listening socket that users might want to reuse.  Refcounts will
> not help here.
> Also for the meters related issue.. destruct() is called for ofproto_dpif
> without any postponing and I'm not sure if we need or actually able to
> postpone it without consequences.  So, baker->refcount might be a better
> solution keeping the ofproto->refcount only for rules.  This will reduce
> the scope of changes, otherwise we will have to do a lot more work tracking
> down all the users that holds 'ofproto' pointer and will miss some of them
> with high probability.

Agree.
I recheck the patch. It's a better idea to use ofproto->backer's refcount.
But in the rule case, looks like the rule->ofproto needs to be alive
to access both ofproto->vl_mff_map and ofproto->ofproto_class.

 static void
 rule_destroy_cb(struct rule *rule)
 OVS_NO_THREAD_SAFETY_ANALYSIS
 {
 /* Send rule removed if needed. */
 if (rule->flags & OFPUTIL_FF_SEND_FLOW_REM
 && rule->removed_reason != OVS_OFPRR_NONE
 && !rule_is_hidden(rule)) {
 ofproto_rule_send_removed(rule);
 }
 rule->ofproto->ofproto_class->rule_destruct(rule);
 mf_vl_mff_unref(>ofproto->vl_mff_map, rule->match_tlv_bitmap);
 mf_vl_mff_unref(>ofproto->vl_mff_map, rule->ofpacts_tlv_bitmap);
 ofproto_rule_destroy__(rule);
 }

Maybe add refcount into vl_mff_map and ofproto_class?

>
> I'll recheck this (connmgr) patch and, likely, apply it.   Will also review
> ofproto refcount patch.
>
> Best regards, Ilya Maximets.



-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCHv2] connmgr: Check nullptr inside ofmonitor_report()

2021-02-19 Thread 贺鹏

Hi, Ilya

Ilya Maximets  于2021年2月19日周五 下午7:19写道：
>
> On 2/19/21 3:12 AM, 贺鹏 wrote:
> > Hi,
> >
> > Looks like this bug is caused by violating the fact that if a rule is
> > referenced, the related ofproto should not be destroyed.
> >
> > If so, I have a patch that also fixes the problem, not sure if this helps.
> >
> > http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0...@bytedance.com/
>
> There is at least one more problem that is not strictly related but
> in more or less the same part of the code:
>   https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html

So maybe before add it into *ovsrcu_postpone*, we should add refcount
of ofproto also?


>
> >
> > William Tu  于2021年2月19日周五 上午8:10写道：
> >>
> >> On Wed, Feb 17, 2021 at 1:09 PM Yifeng Sun  wrote:
> >>>
> >>> ovs-vswitchd could crash under these circumstances:
> >>> 1. When one bridge is being destroyed, ofproto_destroy() is called and
> >>> connmgr pointer of its ofproto struct is nullified. This ofproto struct is
> >>> deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'.
> >>> 2. Before RCU enters quiesce state to actually free this ofproto struct,
> >>> revalidator thread calls udpif_revalidator(), which could handle
> >>> a learn flow and calls ofproto_flow_mod_learn(), it later calls
> >>> ofmonitor_report() and ofproto struct's connmgr pointer is accessed.
> >>>
> >>> The crash stack trace is shown below:
> >>>
> >>> 0  ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, 
> >>> event=event@entry=NXFME_ADDED,
> >>> reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, 
> >>> abbrev_xid=0, old_actions=old_actions@entry=0x0)
> >>> at ofproto/connmgr.c:2160
> >>> 1  0x7fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, 
> >>> ofm=, req=req@entry=0x0)
> >>> at ofproto/ofproto.c:5221
> >>> 2  0x7fa4d68036af in modify_flows_finish (req=0x0, 
> >>> ofm=0x7fa4980753f0, ofproto=0x55d9075d4ab0)
> >>> at ofproto/ofproto.c:5823
> >>> 3  ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, 
> >>> ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0)
> >>> at ofproto/ofproto.c:8088
> >>> 4  0x7fa4d680372d in ofproto_flow_mod_learn_finish 
> >>> (ofm=ofm@entry=0x7fa4980753f0,
> >>> orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439
> >>> 5  0x7fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, 
> >>> keep_ref=keep_ref@entry=true,
> >>> limit=, below_limitp=below_limitp@entry=0x0) at 
> >>> ofproto/ofproto.c:5499
> >>> 6  0x7fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, 
> >>> stats=stats@entry=0x7fa4d2701a10,
> >>> offloaded=offloaded@entry=false) at 
> >>> ofproto/ofproto-dpif-xlate-cache.c:127
> >>> 7  0x7fa4d6835e3a in xlate_push_stats (xcache=, 
> >>> stats=stats@entry=0x7fa4d2701a10,
> >>> offloaded=offloaded@entry=false) at 
> >>> ofproto/ofproto-dpif-xlate-cache.c:181
> >>> 8  0x7fa4d6822046 in revalidate_ukey 
> >>> (udpif=udpif@entry=0x55d90760b240, ukey=ukey@entry=0x7fa4b0191660,
> >>> stats=stats@entry=0x7fa4d2705118, 
> >>> odp_actions=odp_actions@entry=0x7fa4d2701b50,
> >>> reval_seq=reval_seq@entry=5655486242, 
> >>> recircs=recircs@entry=0x7fa4d2701b40, offloaded=false)
> >>> at ofproto/ofproto-dpif-upcall.c:2294
> >>> 9  0x7fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at 
> >>> ofproto/ofproto-dpif-upcall.c:2683
> >>> 10 0x7fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at 
> >>> ofproto/ofproto-dpif-upcall.c:936
> >>> 11 0x7fa4d6259c9f in ovsthread_wrapper (aux_=) at 
> >>> lib/ovs-thread.c:423
> >>> 12 0x7fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0
> >>> 13 0x7fa4d504b96d in clone () from /usr/lib64/libc.so.6
> >>>
> >>> At the time of crash, the involved ofproto was already deallocated:
> >>>
> >>> (gdb) print *ofproto
> >>> $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {...,
> >>> one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ...
> >>>
> >>> This patch fixes it.
> >>>
> >>> VMware-BZ:

Re: [ovs-dev] [PATCH v2] ofproto-dpif-xlate: fix zone set from non-frozen-metadata fields

2021-02-19 Thread 贺鹏

Hi, Mark,

thanks for the review!

Mark Gray  于2021年2月19日周五 下午9:07写道：
>
> On 18/02/2021 09:43, Peng He wrote:
>
> One small comment below. By the way, good catch .. i found it difficult
> enough to review it, let alone find it!

Thanks.
This is found by exploring our controller's bug, which changes zone-id
every time it restarts.

>
> > CT zone could be set from a field that is not included in frozen
> > metadata. Consider the example rules which are typically seen in
> > OpenStack security group rules:
> >
> > priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
> > priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
> >
> > The zone is set from the first rule's ct action. These two rules will
> > generate two megaflows: the first one uses zone=5 to query the CT module,
> > the second one sets the zone-id from the first megaflow and commit to CT.
> >
> > The current implementation will generate a megaflow that does not use
> > ct_zone=5 as a match, but directly commit into the ct using zone=5, as zone 
> > is
> > set by an Imm not a field.
> >
> > Consider a situation that one changes the zone id (for example to 15)
> > in the first rule, however, still keep the second rule unchanged. During
> > this change, there is traffic hitting the two generated megaflows, the
> > revaldiator would revalidate all megaflows, however, the revalidator will
> > not change the second megaflow, because zone=5 is recorded in the
> > megaflow, so the xlate will still translate the commit action into zone=5,
> > and the new traffic will still commit to CT as zone=5, not zone=15,
> > resulting in taffic drops and other issues.
> >
> > Just like OVS set-field convention, if a field X is set by Y
> > (Y is a variable not an Imm), we should also mask Y as a match
> > in the generated megaflow. An exception is that if the zone-id is
> > set by the field that is included in the frozen state (i.e. regs) and this
> > upcall is a resume of a thawed xlate, the un-wildcarding can be skipped,
> > as the recirc_id is a hash of the values in these fields, and it will change
> > following the changes of these fields. When the recirc_id changes,
> > all megaflows with the old recirc id will be invalid later.
> >
> > Fixes: 07659514c3 ("Add support for connection tracking.")
> > Reported-by: Sai Su 
> > Signed-off-by: Peng He 
> > ---
> >  include/openvswitch/meta-flow.h |  1 +
> >  lib/meta-flow.c | 13 ++
> >  ofproto/ofproto-dpif-xlate.c| 12 ++
> >  tests/system-traffic.at | 42 +
> >  4 files changed, 68 insertions(+)
> >
> > diff --git a/include/openvswitch/meta-flow.h 
> > b/include/openvswitch/meta-flow.h
> > index 95e52e358..045dce8f5 100644
> > --- a/include/openvswitch/meta-flow.h
> > +++ b/include/openvswitch/meta-flow.h
> > @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct mf_field *,
> >const union mf_value *mask,
> >struct flow *);
> >  bool mf_is_tun_metadata(const struct mf_field *);
> > +bool mf_is_frozen_metadata(const struct mf_field *);
> >  bool mf_is_pipeline_field(const struct mf_field *);
> >  bool mf_is_set(const struct mf_field *, const struct flow *);
> >  void mf_mask_field(const struct mf_field *, struct flow_wildcards *);
> > diff --git a/lib/meta-flow.c b/lib/meta-flow.c
> > index c808d205d..e03cd8d0c 100644
> > --- a/lib/meta-flow.c
> > +++ b/lib/meta-flow.c
> > @@ -1788,6 +1788,19 @@ mf_is_tun_metadata(const struct mf_field *mf)
> > mf->id < MFF_TUN_METADATA0 + TUN_METADATA_NUM_OPTS;
> >  }
> >
> > +bool
> > +mf_is_frozen_metadata(const struct mf_field *mf)
> > +{
> > +if (mf->id >= MFF_TUN_ID && mf->id <= MFF_IN_PORT_OXM) {
> > +return true;
> > +}
> > +
> > +if (mf->id >= MFF_REG0 && mf->id < MFF_ETH_SRC) {
> > +return true;
> > +}
> > +return false;
> > +}
> > +
> >  bool
> >  mf_is_pipeline_field(const struct mf_field *mf)
> >  {
> > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> > index 7108c8a30..14d00db1e 100644
> > --- a/ofproto/ofproto-dpif-xlate.c
> > +++ b/ofproto/ofproto-dpif-xlate.c
> > @@ -6195,6 +6195,18 @@ compose_conntrack_action(struct xlate_ctx *ctx, 
> > struct ofpact_conntrack *ofc,
> >
> >  if (ofc->zone_src.field) {
> >  zone = mf_get_subfield(>zone_src, >xin->flow);
> > +if (ctx->xin->frozen_state) {
> > +/* If the upcall is a resume of a recirculation, we only need 
> > to
> > + * unwildcard the fields that are not in the frozen_metadata, 
> > as
> > + * when the rules update, OVS will generate a new recirc_id,
> > + * which will invalidate the megaflow with old the recirc_id.
> > + */
> > +if (!mf_is_frozen_metadata(ofc->zone_src.field)) {
> > +mf_mask_field(ofc->zone_src.field, ctx->wc);
> > +}
>

Re: [ovs-dev] [PATCHv2] connmgr: Check nullptr inside ofmonitor_report()

2021-02-18 Thread 贺鹏

Hi,

Looks like this bug is caused by violating the fact that if a rule is
referenced, the related ofproto should not be destroyed.

If so, I have a patch that also fixes the problem, not sure if this helps.

http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0...@bytedance.com/

William Tu  于2021年2月19日周五 上午8:10写道：
>
> On Wed, Feb 17, 2021 at 1:09 PM Yifeng Sun  wrote:
> >
> > ovs-vswitchd could crash under these circumstances:
> > 1. When one bridge is being destroyed, ofproto_destroy() is called and
> > connmgr pointer of its ofproto struct is nullified. This ofproto struct is
> > deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'.
> > 2. Before RCU enters quiesce state to actually free this ofproto struct,
> > revalidator thread calls udpif_revalidator(), which could handle
> > a learn flow and calls ofproto_flow_mod_learn(), it later calls
> > ofmonitor_report() and ofproto struct's connmgr pointer is accessed.
> >
> > The crash stack trace is shown below:
> >
> > 0  ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, 
> > event=event@entry=NXFME_ADDED,
> > reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, 
> > abbrev_xid=0, old_actions=old_actions@entry=0x0)
> > at ofproto/connmgr.c:2160
> > 1  0x7fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, 
> > ofm=, req=req@entry=0x0)
> > at ofproto/ofproto.c:5221
> > 2  0x7fa4d68036af in modify_flows_finish (req=0x0, ofm=0x7fa4980753f0, 
> > ofproto=0x55d9075d4ab0)
> > at ofproto/ofproto.c:5823
> > 3  ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, 
> > ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0)
> > at ofproto/ofproto.c:8088
> > 4  0x7fa4d680372d in ofproto_flow_mod_learn_finish 
> > (ofm=ofm@entry=0x7fa4980753f0,
> > orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439
> > 5  0x7fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, 
> > keep_ref=keep_ref@entry=true,
> > limit=, below_limitp=below_limitp@entry=0x0) at 
> > ofproto/ofproto.c:5499
> > 6  0x7fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, 
> > stats=stats@entry=0x7fa4d2701a10,
> > offloaded=offloaded@entry=false) at 
> > ofproto/ofproto-dpif-xlate-cache.c:127
> > 7  0x7fa4d6835e3a in xlate_push_stats (xcache=, 
> > stats=stats@entry=0x7fa4d2701a10,
> > offloaded=offloaded@entry=false) at 
> > ofproto/ofproto-dpif-xlate-cache.c:181
> > 8  0x7fa4d6822046 in revalidate_ukey (udpif=udpif@entry=0x55d90760b240, 
> > ukey=ukey@entry=0x7fa4b0191660,
> > stats=stats@entry=0x7fa4d2705118, 
> > odp_actions=odp_actions@entry=0x7fa4d2701b50,
> > reval_seq=reval_seq@entry=5655486242, 
> > recircs=recircs@entry=0x7fa4d2701b40, offloaded=false)
> > at ofproto/ofproto-dpif-upcall.c:2294
> > 9  0x7fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at 
> > ofproto/ofproto-dpif-upcall.c:2683
> > 10 0x7fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at 
> > ofproto/ofproto-dpif-upcall.c:936
> > 11 0x7fa4d6259c9f in ovsthread_wrapper (aux_=) at 
> > lib/ovs-thread.c:423
> > 12 0x7fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0
> > 13 0x7fa4d504b96d in clone () from /usr/lib64/libc.so.6
> >
> > At the time of crash, the involved ofproto was already deallocated:
> >
> > (gdb) print *ofproto
> > $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {...,
> > one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ...
> >
> > This patch fixes it.
> >
> > VMware-BZ: #2700626
> > Signed-off-by: Yifeng Sun 
> > ---
> > v1->v2: Add check for ofmonitor_flush, thanks William.
> >
>
> LGTM, thanks.
> Acked-by: William Tu < u9012...@gmail.com>
>
> CC Ilya and Ben to see if any comments.
>
> I feel this kind of RCU issue is hard to find out, and
> existing tools such as addressSanitizer are usually
> not helpful.
>
> William
>
> >  ofproto/connmgr.c | 6 +-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c
> > index 9c5c633b4171..fa8f6cd0e83a 100644
> > --- a/ofproto/connmgr.c
> > +++ b/ofproto/connmgr.c
> > @@ -2140,7 +2140,7 @@ ofmonitor_report(struct connmgr *mgr, struct rule 
> > *rule,
> >   const struct rule_actions *old_actions)
> >  OVS_REQUIRES(ofproto_mutex)
> >  {
> > -if (rule_is_hidden(rule)) {
> > +if (!mgr || rule_is_hidden(rule)) {
> >  return;
> >  }
> >
> > @@ -2244,6 +2244,10 @@ ofmonitor_flush(struct connmgr *mgr)
> >  {
> >  struct ofconn *ofconn;
> >
> > +if (!mgr) {
> > +return;
> > +}
> > +
> >  LIST_FOR_EACH (ofconn, connmgr_node, >conns) {
> >  struct rconn_packet_counter *counter = ofconn->monitor_counter;
> >
> > --
> > 2.7.4
> >
> > ___
> > dev mailing list
> > d...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> ___
> dev

Re: [ovs-dev] [PATCH v1] ofproto-dpif-xlate: fix zone set from non frozen metadata field

2021-02-18 Thread 贺鹏

Hi, I've already sent a v2 patch.

Mark Gray  于2021年2月17日周三 下午6:44写道：
>
> On 17/02/2021 10:40, 贺鹏 wrote:
> > Hi,
> >
> > Thanks for the review.
> >
> > Mark Gray  于2021年2月17日周三 下午6:13写道：
> >>
> >> I'm not too familiar with this code but I have some comments.
> >>
> >> On 15/02/2021 09:50, Peng He wrote:
> >>> CT zone could be set from a field that is not included in frozen
> >>> metedata. Consider the belowing cases which is normally used in
> >>
> >> Nits:
> >>
> >> s/metedata/metadata
> >> s/belowing cases which is/cases below which are
> >
> > sorry for the typo, will fix it in the next version
> >
> >>
> >>> OpenStack security group rules:
> >>>
> >>> priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
> >>> priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
> >>>
> >>> The zone is set from the first rule's ct action. These two rules will
> >>> generate two megaflows: the first one uses zone=5 to query the CT module,
> >>> the second one set zone from the first megaflow and commit to CT.
> >>>
> >>> The current implementation will generate a megaflow which does not use
> >>> ct_zone=5 as a match, but directly commit into the ct using zone=5, as 
> >>> zone is
> >>> set by an Imm not a field.
> >>>
> >>> Consider a situation that one changes the zone id (for example to 15)
> >>> in the first rule however still keep the second rule unchanged. During
> >>> this change, there is traffic hitting the two generated megaflows, the
> >>> revaldiator would revalidate all megaflows, however, the revalidator will
> >>> not change the second megaflow, because zone=5 is recorded in the
> >>> megaflow, so the xlate will still translate the commit action into zone=5,
> >>> and the new traffic will still commit to CT as zone=5, not zone=15,
> >>> resulting in taffic drops and other issues.
> >>>
> >>> Just like OVS set-field action, if the field X is set by Y, we should also
> >>> mask Y as a match in the generated megaflow. An exception is that, if the 
> >>> zone
> >>> is set by the field that is included in the frozen state and this upcall 
> >>> is
> >>> a resume of a thawed xlate, the masking can be skipped, as if one changes
> >>> the previous rule, the generated recirc_id will be changed, and all 
> >>> megaflows
> >>> with the old recirc id will be invalid later, i.e. the revalidator will
> >>> not reuse the old megaflows at all.
> >>>
> >>> Fixes: 07659514c3 ("Add support for connection tracking.")
> >>> Reported-by: Sai Su 
> >>> Signed-off-by: Peng He 
> >>> ---
> >>>  include/openvswitch/meta-flow.h |  1 +
> >>>  lib/meta-flow.c | 13 +++
> >>>  ofproto/ofproto-dpif-xlate.c| 15 +
> >>>  tests/system-traffic.at | 39 +
> >>>  4 files changed, 68 insertions(+)
> >>>
> >>> diff --git a/include/openvswitch/meta-flow.h 
> >>> b/include/openvswitch/meta-flow.h
> >>> index 95e52e358..045dce8f5 100644
> >>> --- a/include/openvswitch/meta-flow.h
> >>> +++ b/include/openvswitch/meta-flow.h
> >>> @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct mf_field 
> >>> *,
> >>>const union mf_value *mask,
> >>>struct flow *);
> >>>  bool mf_is_tun_metadata(const struct mf_field *);
> >>> +bool mf_is_frozen_metadata(const struct mf_field *);
> >>>  bool mf_is_pipeline_field(const struct mf_field *);
> >>>  bool mf_is_set(const struct mf_field *, const struct flow *);
> >>>  void mf_mask_field(const struct mf_field *, struct flow_wildcards *);
> >>> diff --git a/lib/meta-flow.c b/lib/meta-flow.c
> >>> index c808d205d..e03cd8d0c 100644
> >>> --- a/lib/meta-flow.c
> >>> +++ b/lib/meta-flow.c
> >>> @@ -1788,6 +1788,19 @@ mf_is_tun_metadata(const struct mf_field *mf)
> >>> mf->id < MFF_TUN_METADATA0 + TUN_METADATA_NUM_OPTS;
> >>>  }
> >>>
> >>> +bool
> >>> +mf_is_frozen_metadata(const struct mf_field *mf)
> >>> +{

Re: [ovs-dev] [PATCH v1] ofproto-dpif-xlate: fix zone set from non frozen metadata field

2021-02-17 Thread 贺鹏

Hi,

Thanks for the review.

Mark Gray  于2021年2月17日周三 下午6:13写道：
>
> I'm not too familiar with this code but I have some comments.
>
> On 15/02/2021 09:50, Peng He wrote:
> > CT zone could be set from a field that is not included in frozen
> > metedata. Consider the belowing cases which is normally used in
>
> Nits:
>
> s/metedata/metadata
> s/belowing cases which is/cases below which are

sorry for the typo, will fix it in the next version

>
> > OpenStack security group rules:
> >
> > priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
> > priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
> >
> > The zone is set from the first rule's ct action. These two rules will
> > generate two megaflows: the first one uses zone=5 to query the CT module,
> > the second one set zone from the first megaflow and commit to CT.
> >
> > The current implementation will generate a megaflow which does not use
> > ct_zone=5 as a match, but directly commit into the ct using zone=5, as zone 
> > is
> > set by an Imm not a field.
> >
> > Consider a situation that one changes the zone id (for example to 15)
> > in the first rule however still keep the second rule unchanged. During
> > this change, there is traffic hitting the two generated megaflows, the
> > revaldiator would revalidate all megaflows, however, the revalidator will
> > not change the second megaflow, because zone=5 is recorded in the
> > megaflow, so the xlate will still translate the commit action into zone=5,
> > and the new traffic will still commit to CT as zone=5, not zone=15,
> > resulting in taffic drops and other issues.
> >
> > Just like OVS set-field action, if the field X is set by Y, we should also
> > mask Y as a match in the generated megaflow. An exception is that, if the 
> > zone
> > is set by the field that is included in the frozen state and this upcall is
> > a resume of a thawed xlate, the masking can be skipped, as if one changes
> > the previous rule, the generated recirc_id will be changed, and all 
> > megaflows
> > with the old recirc id will be invalid later, i.e. the revalidator will
> > not reuse the old megaflows at all.
> >
> > Fixes: 07659514c3 ("Add support for connection tracking.")
> > Reported-by: Sai Su 
> > Signed-off-by: Peng He 
> > ---
> >  include/openvswitch/meta-flow.h |  1 +
> >  lib/meta-flow.c | 13 +++
> >  ofproto/ofproto-dpif-xlate.c| 15 +
> >  tests/system-traffic.at | 39 +
> >  4 files changed, 68 insertions(+)
> >
> > diff --git a/include/openvswitch/meta-flow.h 
> > b/include/openvswitch/meta-flow.h
> > index 95e52e358..045dce8f5 100644
> > --- a/include/openvswitch/meta-flow.h
> > +++ b/include/openvswitch/meta-flow.h
> > @@ -2305,6 +2305,7 @@ void mf_set_flow_value_masked(const struct mf_field *,
> >const union mf_value *mask,
> >struct flow *);
> >  bool mf_is_tun_metadata(const struct mf_field *);
> > +bool mf_is_frozen_metadata(const struct mf_field *);
> >  bool mf_is_pipeline_field(const struct mf_field *);
> >  bool mf_is_set(const struct mf_field *, const struct flow *);
> >  void mf_mask_field(const struct mf_field *, struct flow_wildcards *);
> > diff --git a/lib/meta-flow.c b/lib/meta-flow.c
> > index c808d205d..e03cd8d0c 100644
> > --- a/lib/meta-flow.c
> > +++ b/lib/meta-flow.c
> > @@ -1788,6 +1788,19 @@ mf_is_tun_metadata(const struct mf_field *mf)
> > mf->id < MFF_TUN_METADATA0 + TUN_METADATA_NUM_OPTS;
> >  }
> >
> > +bool
> > +mf_is_frozen_metadata(const struct mf_field *mf)
> > +{
> > +if (mf->id >= MFF_TUN_ID && mf->id <= MFF_IN_PORT_OXM) {
> > +return true;
> > +}
> > +
> > +if (mf->id >= MFF_REG0 && mf->id < MFF_ETH_SRC) {
> > +return true;
> > +}
> > +return false;
> > +}
> > +
> >  bool
> >  mf_is_pipeline_field(const struct mf_field *mf)
> >  {
> > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> > index 7108c8a30..5d1f029fd 100644
> > --- a/ofproto/ofproto-dpif-xlate.c
> > +++ b/ofproto/ofproto-dpif-xlate.c
> > @@ -6212,6 +6212,21 @@ compose_conntrack_action(struct xlate_ctx *ctx, 
> > struct ofpact_conntrack *ofc,
> > >xin->flow, ctx->wc, zone);
> >  }
> >  }
> > +
> > +if (ofc->zone_src.field) {
> > +if (ctx->xin->frozen_state) {
> > +/* If the upcall is a resume of a recirculation, we only need 
> > to
> > + * unwildcard the fields that are not in the frozen_metadata, 
> > as
> > + * when the rules update, OVS will generate a new recirc_id,
> > + * which will invalidate the megaflow with old the recirc_id.
> > + */
> > +if (!mf_is_frozen_metadata(ofc->zone_src.field)) {
> > +mf_mask_field(ofc->zone_src.field, ctx->wc);
> Is this the only field we should check and un-wildcard here. This

[ovs-dev] 回复：[ovs-dev v1] ipf: avoid accessing to a freed rp.

2020-11-17 Thread 贺鹏

ping

于2020年10月14日星期三 11:15  写道:

if there are multiple pkts in the batch, the loop will access a freed rp,
which cause ovs crash. --- lib/ipf.c | 3 ++- 1 file changed, 2
insertions(+), 1 deletion(-) diff --git a/lib/ipf.c b/lib/ipf.c index
446e89d13..c20bcc0b3 100644 --- a/lib/ipf.c +++ b/lib/ipf.c @@ -1153,7
+1153,7 @@ ipf_post_execute_reass_pkts(struct ipf *ipf, /* Inner batch loop
is constant time since batch size is <= * NETDEV_MAX_BURST. */
DP_PACKET_BATCH_REFILL_FOR_EACH (pb_idx, pb_cnt, pkt, pb) { - if (pkt ==
rp->list->reass_execute_ctx) { + if (rp && pkt ==
rp->list->reass_execute_ctx) { for (int i = 0; i <=
rp->list->last_inuse_idx; i++) { rp->list->frag_list[i].pkt->md.ct_label =
pkt->md.ct_label; rp->list->frag_list[i].pkt->md.ct_mark = pkt->md.ct_mark;
@@ -1206,6 +1206,7 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
ipf_reassembled_list_remove(rp); dp_packet_delete(rp->pkt); free(rp); + rp
= NULL; } else { dp_packet_batch_refill(pb, pkt, pb_idx); } -- 2.20.1
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v1] ofproto-dpif-upcall: fix push_dp_ops

2020-10-09 Thread 贺鹏

sorry, please ignore this patch.

hepeng.0320  于2020年10月9日周五 下午8:36写道：
>
> From: Peng He 
>
> if for some reason, the dpif ops in push_dp_ops failed with op->dop.type
> not equal to DPIF_OP_FLOW_DEL, the generated ukey is installed however
> the corresponding datapath flow is not. The consequent upcalls will
> always fail to install datapath flows as:
>
> try_ukey_replace
>   -> return false
>  -> upcall return NOSPC error, however the ukey still exist.
>
> note that, the revalidator_sweep__ will not certainly clean such ukeys.
> The current code would try firstly to call revalidate_ukey and it might
> result in keeping such ukeys in the end.
>
> In this case, a large amount of warning logs are observed:
>
> ofproto_dpif_upcall(pmd-c22/id:13)|WARN|Dropped 3898 log messages in last 87 
> seconds (most recently, 29 seconds ago) due to excessive rate
> ofproto_dpif_upcall(pmd-c22/id:13)|WARN|upcall_cb failure: ukey installation 
> fails
>
> and it will not recovery unless you manually run revalidator/purge.
>
> this path checks if ops succeeded, if not, will changed
> ukey->state into EVICTED state.
>
> Signed-off-by: Peng He 
> ---
>  ofproto/ofproto-dpif-upcall.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index 72a5b4d73..c4c20b30c 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -2367,11 +2367,6 @@ push_dp_ops(struct udpif *udpif, struct ukey_op *ops, 
> size_t n_ops)
>  stats = op->dop.flow_del.stats;
>  push = _buf;
>
> -if (op->dop.type != DPIF_OP_FLOW_DEL) {
> -/* Only deleted flows need their stats pushed. */
> -continue;
> -}
> -
>  if (op->dop.error) {
>  /* flow_del error, 'stats' is unusable. */
>  if (op->ukey) {
> @@ -2382,6 +2377,11 @@ push_dp_ops(struct udpif *udpif, struct ukey_op *ops, 
> size_t n_ops)
>  continue;
>  }
>
> +if (op->dop.type != DPIF_OP_FLOW_DEL) {
> +/* Only deleted flows need their stats pushed. */
> +continue;
> +}
> +
>  if (op->ukey) {
>  ovs_mutex_lock(>ukey->mutex);
>  transition_ukey(op->ukey, UKEY_EVICTED);
> --
> 2.20.1
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev



-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] Re： Re: Re: [PATCH v4 0/3] Add support for TSO with DPDK

2020-06-02 Thread 贺鹏

yes，check dpdk gso lib……

在 2020年3月10日星期二，txfh2007 via dev  写道：

> Hi Flavio and all:
>
>  Is there a way to support software TSO for DPDK tunnel network ? I
> have tried userspace TSO function, and running on tunnel network, I have
> got the following error:
>  "Tunneling packets with HW offload flags is not supported: packet
> dropped"
>  So is there a way to work around if we would support both vlan and
> tunnel network on the same compute node ?
>
> Thanks
> Timo
>
>
>
> --
>
>
>
>
> On Fri, Feb 28, 2020 at 9:56 AM Flavio Leitner  wrote:
> >
> >
> > Hi Yi Yang,
> >
> > This is the bug fix required to make veth TSO work in OvS:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
> linux.git/commit/?id=9d2f67e43b73e8af7438be219b66a5de0cfa8bd9
> >
> > commit 9d2f67e43b73e8af7438be219b66a5de0cfa8bd9
> > Author: Jianfeng Tan 
> > Date:   Sat Sep 29 15:41:27 2018 +
> >
> > net/packet: fix packet drop as of virtio gso
> >
> > When we use raw socket as the vhost backend, a packet from virito
> with
> > gso offloading information, cannot be sent out in later validaton at
> > xmit path, as we did not set correct skb->protocol which is further
> used
> > for looking up the gso function.
> >
> > To fix this, we set this field according to virito hdr information.
> >
> > Fixes: e858fae2b0b8f4 ("virtio_net: use common code for
> virtio_net_hdr and skb GSO conversion")
> > Signed-off-by: Jianfeng Tan 
> > Signed-off-by: David S. Miller 
> >
> >
> > So, the minimum kernel version is 4.19.
> >
> Thanks,
> I sent a patch to update the documentation. Please take a look.
> William
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>


-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH] ofproto-dpif-xlate: ovs-tcpdump cannot capture incomming vxlan packets

2020-05-22 Thread 贺鹏

ignore this patch, forget to sign-off the patch.

hepeng.0320  于2020年5月22日周五 下午5:02写道：
>
> when running ovs-tcpdump -i ethX and the port is used as the incomming port 
> for a vxlan port.
>
> The callstack for the upcall:
>
> mirror_ingress_packet
> mirror_packet
> output_normal
> compose_output_action
> compose_output_action__
> terminate_native_tunnel
>
> will xlate the action into a tnl_pop action, not an output action to the
> mirror port. So eventually the translated actions will be 'tnl_pop(x), 
> tnl_pop(x)'.
> However, the right action should be '(mirror port), tnl_pop(x)'
>
> This patch adds a flag in xlate_ctx indicating the current output_normal
> is used by mirroring. Note that we cannot use ctx->mirrors as the
> indicator as in the mirror code, the ctx->mirrors will not be cleared
> after mirror action finished.
> ---
>  ofproto/ofproto-dpif-xlate.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 80fba84cb..03f370a0a 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -269,6 +269,7 @@ struct xlate_ctx {
>  bool exit;  /* No further actions should be processed. */
>  mirror_mask_t mirrors;  /* Bitmap of associated mirrors. */
>  int mirror_snaplen; /* Max size of a mirror packet in byte. */
> +bool in_mirror_output;
>
> /* Freezing Translation
>  * 
> @@ -2154,7 +2155,9 @@ mirror_packet(struct xlate_ctx *ctx, struct xbundle 
> *xbundle,
>  if (out) {
>  struct xbundle *out_xbundle = xbundle_lookup(ctx->xcfg, out);
>  if (out_xbundle) {
> +ctx->in_mirror_output = true;
>  output_normal(ctx, out_xbundle, );
> +ctx->in_mirror_output = false;
>  }
>  } else if (xvlan.v[0].vid != out_vlan
> && !eth_addr_is_reserved(ctx->xin->flow.dl_dst)) {
> @@ -2165,7 +2168,9 @@ mirror_packet(struct xlate_ctx *ctx, struct xbundle 
> *xbundle,
>  LIST_FOR_EACH (xb, list_node, >xbundles) {
>  if (xbundle_includes_vlan(xb, )
>  && !xbundle_mirror_out(xbridge, xb)) {
> +ctx->in_mirror_output = true;
>  output_normal(ctx, xb, );
> +ctx->in_mirror_output = false;
>  }
>  }
>  xvlan.v[0].vid = old_vid;
> @@ -4231,7 +4236,7 @@ compose_output_action__(struct xlate_ctx *ctx, 
> ofp_port_t ofp_port,
>  native_tunnel_output(ctx, xport, flow, odp_port, truncate);
>  flow->tunnel = flow_tnl; /* Restore tunnel metadata */
>
> -} else if (terminate_native_tunnel(ctx, flow, wc,
> +} else if (!ctx->in_mirror_output && terminate_native_tunnel(ctx, 
> flow, wc,
> _tnl_port)) {
>  /* Intercept packet to be received on native tunnel port. */
>  nl_msg_put_odp_port(ctx->odp_actions, OVS_ACTION_ATTR_TUNNEL_POP,
> @@ -7492,6 +7497,7 @@ xlate_actions(struct xlate_in *xin, struct xlate_out 
> *xout)
>  .exit = false,
>  .error = XLATE_OK,
>  .mirrors = 0,
> +.in_mirror_output = false,
>
>  .freezing = false,
>  .recirc_update_dp_hash = false,
> --
> 2.20.1
>


-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v2] ofproto-dpif-xlate: do not unwildcard source address if not needed

2020-03-26 Thread 贺鹏

rc=1.1.2.92,in_port=3,action=drop'])
+
+AT_CHECK([ovs-appctl ofproto/trace int-br
'tun_src=1.1.2.92,tun_dst=1.1.2.88,tun_id=456,in_port=3'], [0],
[stdout])
+AT_CHECK([tail -2 stdout], [0],
+  [Megaflow: 
recirc_id=0,eth,tun_id=0x1c8,tun_src=1.1.2.92,tun_dst=1.1.2.88,tun_tos=0,tun_flags=-df-csum-key,in_port=3,dl_type=0x
+Datapath actions: drop
+])
+
+OVS_VSWITCHD_STOP
+AT_CLEANUP
--
2.20.1

William Tu  于2020年3月26日周四 下午11:05写道：
>
> On Sun, Mar 22, 2020 at 10:42:16AM +0800, 贺鹏 wrote:
> > if the tunnel is specified as "remote_ip=flow", we can try to generate
> > a megaflow
> > which ignores all source address, such as source mac, source IP address
> > in the outter header of the packet. This can reduce the number of
> > megaflows and avoid expensive upcall and improve the performance.
> >
> > Signed-off-by: hepeng 
>
> Hi Hepeng,
> I'm not able to apply your patch.
> Can you check?
>
> > ---
> >  include/openvswitch/flow.h|  2 ++
> >  ofproto/ofproto-dpif-xlate.c  | 16 +++-
> >  ofproto/tunnel.c  | 17 +
> >  ofproto/tunnel.h  |  3 ++-
> >  tests/packet-type-aware.at|  2 +-
> >  tests/tunnel-push-pop-ipv6.at |  2 +-
> >  tests/tunnel-push-pop.at  | 46 ++-
> >  7 files changed, 83 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/openvswitch/flow.h b/include/openvswitch/flow.h
> > index 57b6c925c..f4b9a6d48 100644
> > --- a/include/openvswitch/flow.h
> > +++ b/include/openvswitch/flow.h
> > @@ -204,6 +204,8 @@ struct flow_wildcards {
> >  ((WC)->masks.FIELD |= (MASK))
> >  #define WC_UNMASK_FIELD(WC, FIELD) \
> >  memset(&(WC)->masks.FIELD, 0, sizeof (WC)->masks.FIELD)
> > +#define WC_FIELD_MASKED(WC, FIELD) \
> > +((WC)->masks.FIELD != 0)
> >
> >  void flow_wildcards_init_catchall(struct flow_wildcards *);
> >
> > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> > index 28dcc67dd..8f438fad3 100644
> > --- a/ofproto/ofproto-dpif-xlate.c
> > +++ b/ofproto/ofproto-dpif-xlate.c
> > @@ -4087,7 +4087,17 @@ terminate_native_tunnel(struct xlate_ctx *ctx,
> > struct flow *flow,
> >  /* XXX: Write better Filter for tunnel port. We can use in_port
> >   * in tunnel-port flow to avoid these checks completely. */
> >  if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
> > -*tnl_port = tnl_port_map_lookup(flow, wc);
> > +/* to check if the wc contains src info, if not, erase all
> > + * source info including smac, and sip
> > + */
> > +if (!WC_FIELD_MASKED(wc, nw_src)) {
> > +*tnl_port = tnl_port_map_lookup(flow, wc);
> > +if (*tnl_port != ODPP_NONE && !WC_FIELD_MASKED(wc, nw_src)) {
> > +memset(>masks.dl_src, 0, sizeof wc->masks.dl_src);
> > +}
> > +} else {
> > +*tnl_port = tnl_port_map_lookup(flow, wc);
> > +}
> this below can move out of if.. else ..
> *tnl_port = tnl_port_map_lookup(flow, wc);
>
> Thanks
> William
> >
> >  /* If no tunnel port was found and it's about an ARP or ICMPv6 
> > packet,
> >   * do tunnel neighbor snooping. */
> > @@ -7616,6 +7626,10 @@ xlate_actions(struct xlate_in *xin, struct
> > xlate_out *xout)
> >  ctx.xin->xport_uuid = in_port->uuid;
> >  }
> >
> > +if (in_port && in_port->is_tunnel) {
> > +tnl_wc_init_by_port(in_port->ofport, ctx.wc);
> > +}
> > +
> >  if (flow->packet_type != htonl(PT_ETH) && in_port &&
> >  in_port->pt_mode == NETDEV_PT_LEGACY_L3 && ctx.table_id == 0) {
> >  /* Add dummy Ethernet header to non-L2 packet if it's coming from a
> > diff --git a/ofproto/tunnel.c b/ofproto/tunnel.c
> > index 03f0ab765..892f3aa5f 100644
> > --- a/ofproto/tunnel.c
> > +++ b/ofproto/tunnel.c
> > @@ -300,6 +300,23 @@ tnl_port_del(const struct ofport_dpif *ofport,
> > odp_port_t odp_port)
> >  fat_rwlock_unlock();
> >  }
> >
> > +void
> > +tnl_wc_init_by_port(const struct ofport_dpif *ofport,
> > +struct flow_wildcards *wc)
> > +{
> > +struct tnl_port *tnl_port;
> > +
> > +fat_rwlock_rdlock();
> > +tnl_port = tnl_find_ofport(ofport);
> > +if (tnl_port) {
> > +if (tnl_port->match.ip_dst_flow) {
> > +WC_UNMASK_FIELD(wc, tunnel.ip_src);
> > +W

[ovs-dev] [PATCH v2] ofproto-dpif-xlate: do not unwildcard source address if not needed

2020-03-21 Thread 贺鹏

if the tunnel is specified as "remote_ip=flow", we can try to generate
a megaflow
which ignores all source address, such as source mac, source IP address
in the outter header of the packet. This can reduce the number of
megaflows and avoid expensive upcall and improve the performance.

Signed-off-by: hepeng 
---
 include/openvswitch/flow.h|  2 ++
 ofproto/ofproto-dpif-xlate.c  | 16 +++-
 ofproto/tunnel.c  | 17 +
 ofproto/tunnel.h  |  3 ++-
 tests/packet-type-aware.at|  2 +-
 tests/tunnel-push-pop-ipv6.at |  2 +-
 tests/tunnel-push-pop.at  | 46 ++-
 7 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/include/openvswitch/flow.h b/include/openvswitch/flow.h
index 57b6c925c..f4b9a6d48 100644
--- a/include/openvswitch/flow.h
+++ b/include/openvswitch/flow.h
@@ -204,6 +204,8 @@ struct flow_wildcards {
 ((WC)->masks.FIELD |= (MASK))
 #define WC_UNMASK_FIELD(WC, FIELD) \
 memset(&(WC)->masks.FIELD, 0, sizeof (WC)->masks.FIELD)
+#define WC_FIELD_MASKED(WC, FIELD) \
+((WC)->masks.FIELD != 0)

 void flow_wildcards_init_catchall(struct flow_wildcards *);

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 28dcc67dd..8f438fad3 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4087,7 +4087,17 @@ terminate_native_tunnel(struct xlate_ctx *ctx,
struct flow *flow,
 /* XXX: Write better Filter for tunnel port. We can use in_port
  * in tunnel-port flow to avoid these checks completely. */
 if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
-*tnl_port = tnl_port_map_lookup(flow, wc);
+/* to check if the wc contains src info, if not, erase all
+ * source info including smac, and sip
+ */
+if (!WC_FIELD_MASKED(wc, nw_src)) {
+*tnl_port = tnl_port_map_lookup(flow, wc);
+if (*tnl_port != ODPP_NONE && !WC_FIELD_MASKED(wc, nw_src)) {
+memset(>masks.dl_src, 0, sizeof wc->masks.dl_src);
+}
+} else {
+*tnl_port = tnl_port_map_lookup(flow, wc);
+}

 /* If no tunnel port was found and it's about an ARP or ICMPv6 packet,
  * do tunnel neighbor snooping. */
@@ -7616,6 +7626,10 @@ xlate_actions(struct xlate_in *xin, struct
xlate_out *xout)
 ctx.xin->xport_uuid = in_port->uuid;
 }

+if (in_port && in_port->is_tunnel) {
+tnl_wc_init_by_port(in_port->ofport, ctx.wc);
+}
+
 if (flow->packet_type != htonl(PT_ETH) && in_port &&
 in_port->pt_mode == NETDEV_PT_LEGACY_L3 && ctx.table_id == 0) {
 /* Add dummy Ethernet header to non-L2 packet if it's coming from a
diff --git a/ofproto/tunnel.c b/ofproto/tunnel.c
index 03f0ab765..892f3aa5f 100644
--- a/ofproto/tunnel.c
+++ b/ofproto/tunnel.c
@@ -300,6 +300,23 @@ tnl_port_del(const struct ofport_dpif *ofport,
odp_port_t odp_port)
 fat_rwlock_unlock();
 }

+void
+tnl_wc_init_by_port(const struct ofport_dpif *ofport,
+struct flow_wildcards *wc)
+{
+struct tnl_port *tnl_port;
+
+fat_rwlock_rdlock();
+tnl_port = tnl_find_ofport(ofport);
+if (tnl_port) {
+if (tnl_port->match.ip_dst_flow) {
+WC_UNMASK_FIELD(wc, tunnel.ip_src);
+WC_UNMASK_FIELD(wc, tunnel.ipv6_src);
+}
+}
+fat_rwlock_unlock();
+}
+
 /* Looks in the table of tunnels for a tunnel matching the metadata in 'flow'.
  * Returns the 'ofport' corresponding to the new in_port, or a null pointer if
  * none is found.
diff --git a/ofproto/tunnel.h b/ofproto/tunnel.h
index 323f3fa03..166190875 100644
--- a/ofproto/tunnel.h
+++ b/ofproto/tunnel.h
@@ -45,7 +45,8 @@ bool tnl_process_ecn(struct flow *);
 odp_port_t tnl_port_send(const struct ofport_dpif *, struct flow *,
  struct flow_wildcards *wc);
 const char *tnl_port_get_type(const struct ofport_dpif *tnl_port);
-
+void tnl_wc_init_by_port(const struct ofport_dpif *ofport,
+ struct flow_wildcards *wc);
 /* Returns true if 'flow' should be submitted to tnl_port_receive(). */
 static inline bool
 tnl_port_should_receive(const struct flow *flow)
diff --git a/tests/packet-type-aware.at b/tests/packet-type-aware.at
index 540cf98f3..9c16d1289 100644
--- a/tests/packet-type-aware.at
+++ b/tests/packet-type-aware.at
@@ -1019,7 +1019,7 @@ ovs-appctl time/warp 1000
 AT_CHECK([
 ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used
| grep -v ipv6 | sort
 ], [0], [flow-dump from the main thread:
-recirc_id(0),in_port(p0),packet_type(ns=0,id=0),eth(src=aa:bb:cc:00:00:02,dst=aa:bb:cc:00:00:01),eth_type(0x0800),ipv4(dst=20.0.0.1,proto=47,frag=no),
packets:3, bytes:378, used:0.0s, actions:tnl_pop(gre_sys)
+recirc_id(0),in_port(p0),packet_type(ns=0,id=0),eth(dst=aa:bb:cc:00:00:01),eth_type(0x0800),ipv4(dst=20.0.0.1,proto=47,frag=no),
packets:3, bytes:378, used:0.0s, actions:tnl_pop(gre_sys)

[ovs-dev] [PATCH] ofproto-dpif-xlate: do not unwildcard source address if not needed

2020-03-14 Thread 贺鹏

if the tunnel is specified as "ip_remote=flow", we can try to generate
a megaflow
which ignores all source address, such as source mac, source IP address
in the outter header of the packet. This can reduce the number of
megaflows and avoid expensive upcall and improve the performance.

Signed-off-by: hepeng 
---
 include/openvswitch/flow.h|  2 ++
 ofproto/ofproto-dpif-xlate.c  | 16 +++-
 ofproto/tunnel.c  | 17 +
 ofproto/tunnel.h  |  3 ++-
 tests/packet-type-aware.at|  2 +-
 tests/tunnel-push-pop-ipv6.at |  2 +-
 tests/tunnel-push-pop.at  |  2 +-
 7 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/include/openvswitch/flow.h b/include/openvswitch/flow.h
index 57b6c925c..f4b9a6d48 100644
--- a/include/openvswitch/flow.h
+++ b/include/openvswitch/flow.h
@@ -204,6 +204,8 @@ struct flow_wildcards {
 ((WC)->masks.FIELD |= (MASK))
 #define WC_UNMASK_FIELD(WC, FIELD) \
 memset(&(WC)->masks.FIELD, 0, sizeof (WC)->masks.FIELD)
+#define WC_FIELD_MASKED(WC, FIELD) \
+((WC)->masks.FIELD != 0)

 void flow_wildcards_init_catchall(struct flow_wildcards *);

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index adf57a5e8..9b5c3ddad 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4070,7 +4070,17 @@ terminate_native_tunnel(struct xlate_ctx *ctx,
struct flow *flow,
 /* XXX: Write better Filter for tunnel port. We can use in_port
  * in tunnel-port flow to avoid these checks completely. */
 if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
-*tnl_port = tnl_port_map_lookup(flow, wc);
+/* to check if the wc contains src info, if not, erase all
+ * source info including smac, and sip
+ */
+if (!WC_FIELD_MASKED(wc, nw_src)) {
+*tnl_port = tnl_port_map_lookup(flow, wc);
+if (*tnl_port != ODPP_NONE && !WC_FIELD_MASKED(wc, nw_src)) {
+memset(>masks.dl_src, 0, sizeof wc->masks.dl_src);
+}
+} else {
+*tnl_port = tnl_port_map_lookup(flow, wc);
+}

 /* If no tunnel port was found and it's about an ARP or ICMPv6 packet,
  * do tunnel neighbor snooping. */
@@ -7599,6 +7609,10 @@ xlate_actions(struct xlate_in *xin, struct
xlate_out *xout)
 ctx.xin->xport_uuid = in_port->uuid;
 }

+if (in_port && in_port->is_tunnel) {
+tnl_wc_init_by_port(in_port->ofport, ctx.wc);
+}
+
 if (flow->packet_type != htonl(PT_ETH) && in_port &&
 in_port->pt_mode == NETDEV_PT_LEGACY_L3 && ctx.table_id == 0) {
 /* Add dummy Ethernet header to non-L2 packet if it's coming from a
diff --git a/ofproto/tunnel.c b/ofproto/tunnel.c
index 03f0ab765..892f3aa5f 100644
--- a/ofproto/tunnel.c
+++ b/ofproto/tunnel.c
@@ -300,6 +300,23 @@ tnl_port_del(const struct ofport_dpif *ofport,
odp_port_t odp_port)
 fat_rwlock_unlock();
 }

+void
+tnl_wc_init_by_port(const struct ofport_dpif *ofport,
+struct flow_wildcards *wc)
+{
+struct tnl_port *tnl_port;
+
+fat_rwlock_rdlock();
+tnl_port = tnl_find_ofport(ofport);
+if (tnl_port) {
+if (tnl_port->match.ip_dst_flow) {
+WC_UNMASK_FIELD(wc, tunnel.ip_src);
+WC_UNMASK_FIELD(wc, tunnel.ipv6_src);
+}
+}
+fat_rwlock_unlock();
+}
+
 /* Looks in the table of tunnels for a tunnel matching the metadata in 'flow'.
  * Returns the 'ofport' corresponding to the new in_port, or a null pointer if
  * none is found.
diff --git a/ofproto/tunnel.h b/ofproto/tunnel.h
index 323f3fa03..166190875 100644
--- a/ofproto/tunnel.h
+++ b/ofproto/tunnel.h
@@ -45,7 +45,8 @@ bool tnl_process_ecn(struct flow *);
 odp_port_t tnl_port_send(const struct ofport_dpif *, struct flow *,
  struct flow_wildcards *wc);
 const char *tnl_port_get_type(const struct ofport_dpif *tnl_port);
-
+void tnl_wc_init_by_port(const struct ofport_dpif *ofport,
+ struct flow_wildcards *wc);
 /* Returns true if 'flow' should be submitted to tnl_port_receive(). */
 static inline bool
 tnl_port_should_receive(const struct flow *flow)
diff --git a/tests/packet-type-aware.at b/tests/packet-type-aware.at
index 540cf98f3..9c16d1289 100644
--- a/tests/packet-type-aware.at
+++ b/tests/packet-type-aware.at
@@ -1019,7 +1019,7 @@ ovs-appctl time/warp 1000
 AT_CHECK([
 ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used
| grep -v ipv6 | sort
 ], [0], [flow-dump from the main thread:
-recirc_id(0),in_port(p0),packet_type(ns=0,id=0),eth(src=aa:bb:cc:00:00:02,dst=aa:bb:cc:00:00:01),eth_type(0x0800),ipv4(dst=20.0.0.1,proto=47,frag=no),
packets:3, bytes:378, used:0.0s, actions:tnl_pop(gre_sys)
+recirc_id(0),in_port(p0),packet_type(ns=0,id=0),eth(dst=aa:bb:cc:00:00:01),eth_type(0x0800),ipv4(dst=20.0.0.1,proto=47,frag=no),
packets:3, bytes:378, used:0.0s, actions:tnl_pop(gre_sys)

[ovs-dev] [PATCH] ofproto-dpif-xlate: do not unwildcard source address if not needed]

2020-03-14 Thread 贺鹏

if the tunnel is specified as "ip_remote=flow", we can try to generate
a megaflow
which ignores all source address, such as source mac, source IP address
in the outter header of the packet. This can reduce the number of
megaflows and avoid expensive upcalls and improve the performance.

Signed-off-by: hepeng 
---
 include/openvswitch/flow.h|  2 ++
 ofproto/ofproto-dpif-xlate.c  | 16 +++-
 ofproto/tunnel.c  | 17 +
 ofproto/tunnel.h  |  3 ++-
 tests/packet-type-aware.at|  2 +-
 tests/tunnel-push-pop-ipv6.at |  2 +-
 tests/tunnel-push-pop.at  |  2 +-
 7 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/include/openvswitch/flow.h b/include/openvswitch/flow.h
index 57b6c925c..f4b9a6d48 100644
--- a/include/openvswitch/flow.h
+++ b/include/openvswitch/flow.h
@@ -204,6 +204,8 @@ struct flow_wildcards {
 ((WC)->masks.FIELD |= (MASK))
 #define WC_UNMASK_FIELD(WC, FIELD) \
 memset(&(WC)->masks.FIELD, 0, sizeof (WC)->masks.FIELD)
+#define WC_FIELD_MASKED(WC, FIELD) \
+((WC)->masks.FIELD != 0)

 void flow_wildcards_init_catchall(struct flow_wildcards *);

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index adf57a5e8..9b5c3ddad 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4070,7 +4070,17 @@ terminate_native_tunnel(struct xlate_ctx *ctx,
struct flow *flow,
 /* XXX: Write better Filter for tunnel port. We can use in_port
  * in tunnel-port flow to avoid these checks completely. */
 if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
-*tnl_port = tnl_port_map_lookup(flow, wc);
+/* to check if the wc contains src info, if not, erase all
+ * source info including smac, and sip
+ */
+if (!WC_FIELD_MASKED(wc, nw_src)) {
+*tnl_port = tnl_port_map_lookup(flow, wc);
+if (*tnl_port != ODPP_NONE && !WC_FIELD_MASKED(wc, nw_src)) {
+memset(>masks.dl_src, 0, sizeof wc->masks.dl_src);
+}
+} else {
+*tnl_port = tnl_port_map_lookup(flow, wc);
+}

 /* If no tunnel port was found and it's about an ARP or ICMPv6 packet,
  * do tunnel neighbor snooping. */
@@ -7599,6 +7609,10 @@ xlate_actions(struct xlate_in *xin, struct
xlate_out *xout)
 ctx.xin->xport_uuid = in_port->uuid;
 }

+if (in_port && in_port->is_tunnel) {
+tnl_wc_init_by_port(in_port->ofport, ctx.wc);
+}
+
 if (flow->packet_type != htonl(PT_ETH) && in_port &&
 in_port->pt_mode == NETDEV_PT_LEGACY_L3 && ctx.table_id == 0) {
 /* Add dummy Ethernet header to non-L2 packet if it's coming from a
diff --git a/ofproto/tunnel.c b/ofproto/tunnel.c
index 03f0ab765..892f3aa5f 100644
--- a/ofproto/tunnel.c
+++ b/ofproto/tunnel.c
@@ -300,6 +300,23 @@ tnl_port_del(const struct ofport_dpif *ofport,
odp_port_t odp_port)
 fat_rwlock_unlock();
 }

+void
+tnl_wc_init_by_port(const struct ofport_dpif *ofport,
+struct flow_wildcards *wc)
+{
+struct tnl_port *tnl_port;
+
+fat_rwlock_rdlock();
+tnl_port = tnl_find_ofport(ofport);
+if (tnl_port) {
+if (tnl_port->match.ip_dst_flow) {
+WC_UNMASK_FIELD(wc, tunnel.ip_src);
+WC_UNMASK_FIELD(wc, tunnel.ipv6_src);
+}
+}
+fat_rwlock_unlock();
+}
+
 /* Looks in the table of tunnels for a tunnel matching the metadata in 'flow'.
  * Returns the 'ofport' corresponding to the new in_port, or a null pointer if
  * none is found.
diff --git a/ofproto/tunnel.h b/ofproto/tunnel.h
index 323f3fa03..166190875 100644
--- a/ofproto/tunnel.h
+++ b/ofproto/tunnel.h
@@ -45,7 +45,8 @@ bool tnl_process_ecn(struct flow *);
 odp_port_t tnl_port_send(const struct ofport_dpif *, struct flow *,
  struct flow_wildcards *wc);
 const char *tnl_port_get_type(const struct ofport_dpif *tnl_port);
-
+void tnl_wc_init_by_port(const struct ofport_dpif *ofport,
+ struct flow_wildcards *wc);
 /* Returns true if 'flow' should be submitted to tnl_port_receive(). */
 static inline bool
 tnl_port_should_receive(const struct flow *flow)
diff --git a/tests/packet-type-aware.at b/tests/packet-type-aware.at
index 540cf98f3..9c16d1289 100644
--- a/tests/packet-type-aware.at
+++ b/tests/packet-type-aware.at
@@ -1019,7 +1019,7 @@ ovs-appctl time/warp 1000
 AT_CHECK([
 ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used
| grep -v ipv6 | sort
 ], [0], [flow-dump from the main thread:
-recirc_id(0),in_port(p0),packet_type(ns=0,id=0),eth(src=aa:bb:cc:00:00:02,dst=aa:bb:cc:00:00:01),eth_type(0x0800),ipv4(dst=20.0.0.1,proto=47,frag=no),
packets:3, bytes:378, used:0.0s, actions:tnl_pop(gre_sys)
+recirc_id(0),in_port(p0),packet_type(ns=0,id=0),eth(dst=aa:bb:cc:00:00:01),eth_type(0x0800),ipv4(dst=20.0.0.1,proto=47,frag=no),
packets:3, bytes:378, used:0.0s, actions:tnl_pop(gre_sys)

Re: [ovs-dev] [PATCH V6 05/18] netdev-dpdk: Introduce rte flow query function

2019-12-28 Thread 贺鹏

Hi,

there is a potential memory leak in ovs-dpdk hw offload, when remove a
dpdk port in datapath, the dpif will call dpif_port_del and will
eventually remove the netdev
from the map *port_to_netdev* (in netdev-offload). Meanwhile, a
resulted datapath reconfiguration will flush all the marked flows by
calling *flow_mark_flush*, however, this function only put the offload
requests on the queue, it does not wait for the flow deletion
finishes. So there could be a case that the offload thread tries to
delete a hw flow, however, since the netdev is removed, it will fail
at *netdev_ports_get* function, thus, will not call *netdev_flow_del*
thus not free the flow entry (ufid_to_rte_flow_data), which will cause
memory leak.

In fact, since any port removal will hold dp->port_mutex, when calling
*flow_mark_flush*, no flow deletion will really happen since the
offload thread cannot get the *dp->port_mutex* to call
*netdev_flow_del*.

To fix this, I guess we'd better design a wait-done mechanism to wait the
offload process to finish all the deletion before calling *netdev_ports_remove*.
But, first, the holding of dp->port_mutex in offload thread which
prevents the real removal of the offloaded
flows should also be considered.

Also, this patch includes querying the hw stats also
depends on the dp->port_mutex to be thread safe, these also needs to be fixed.

Eli Britstein  于2019年12月19日周四 下午7:54写道：
>
> Introduce a rte flow query function as a pre-step towards reading HW
> statistics of fully offloaded flows.
>
> Signed-off-by: Eli Britstein 
> Reviewed-by: Oz Shlomo 
> ---
>  lib/netdev-dpdk.c | 30 ++
>  lib/netdev-dpdk.h |  6 ++
>  2 files changed, 36 insertions(+)
>
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 663eef5b8..74fc9a99c 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -4527,6 +4527,36 @@ netdev_dpdk_rte_flow_create(struct netdev *netdev,
>  return flow;
>  }
>
> +int
> +netdev_dpdk_rte_flow_query_count(struct netdev *netdev,
> + struct rte_flow *rte_flow,
> + struct rte_flow_query_count *query,
> + struct rte_flow_error *error)
> +{
> +struct rte_flow_action_count count = { .shared = 0, .id = 0 };
> +const struct rte_flow_action actions[] = {
> +{
> +.type = RTE_FLOW_ACTION_TYPE_COUNT,
> +.conf = ,
> +},
> +{
> +.type = RTE_FLOW_ACTION_TYPE_END,
> +},
> +};
> +struct netdev_dpdk *dev;
> +int ret;
> +
> +if (!is_dpdk_class(netdev->netdev_class)) {
> +return -1;
> +}
> +
> +dev = netdev_dpdk_cast(netdev);
> +ovs_mutex_lock(>mutex);
> +ret = rte_flow_query(dev->port_id, rte_flow, actions, query, error);
> +ovs_mutex_unlock(>mutex);
> +return ret;
> +}
> +
>  #define NETDEV_DPDK_CLASS_COMMON\
>  .is_pmd = true, \
>  .alloc = netdev_dpdk_alloc, \
> diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
> index 60631c4f0..59919a89a 100644
> --- a/lib/netdev-dpdk.h
> +++ b/lib/netdev-dpdk.h
> @@ -31,6 +31,7 @@ struct rte_flow_error;
>  struct rte_flow_attr;
>  struct rte_flow_item;
>  struct rte_flow_action;
> +struct rte_flow_query_count;
>
>  void netdev_dpdk_register(void);
>  void free_dpdk_buf(struct dp_packet *);
> @@ -47,6 +48,11 @@ netdev_dpdk_rte_flow_create(struct netdev *netdev,
>  const struct rte_flow_item *items,
>  const struct rte_flow_action *actions,
>  struct rte_flow_error *error);
> +int
> +netdev_dpdk_rte_flow_query_count(struct netdev *netdev,
> + struct rte_flow *rte_flow,
> + struct rte_flow_query_count *query,
> + struct rte_flow_error *error);
>
>  #else
>
> --
> 2.14.5
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev



-- 
hepeng
Bytedance
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

[ovs-dev] [PATCH RFC] potential memory leak in ovs-dpdk hw offload

2019-12-25 Thread 贺鹏

Hi,

I found a potential memory leak in ovs-dpdk hw offload, when remove a
dpdk port in datapath,
the dpif will call dpif_port_del and will eventually remove the netdev
from the map *port_to_netdev* (in netdev-offload). Meanwhile, a
resulted datapath reconfiguration will flush all the marked flows by
calling *flow_mark_flush*, however, this function only put the offload
requests on the queue, it does not wait for the flow deletion
finishes. So there could be a case that the offload thread tries to
delete a hw flow, however, since the netdev is removed, it will fail
at *netdev_ports_get* function, thus, will not call *netdev_flow_del*
thus not free the flow entry (ufid_to_rte_flow_data), which will cause
memory leak.

In fact, since any port removal will hold dp->port_mutex, when calling
*flow_mark_flush*, no flow deletion will really happen since the
offload thread cannot get the *dp->port_mutex* to call
*netdev_flow_del*.

To fix this, we'd better design a wait-done mechanism to wait the
offload process to finish all the deletion. I guess we also need to
remove holding dp->port_mutex in offload thread, instead, by holding
netdev->mutex in the offload layer.

I remember that some patch including querying the hw stats also
depends on the dp->port_mutex, these also needs to be fixed.

-- 
hepeng
Bytedance
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

[ovs-dev] [RFC] dpif-netdev:Remove holding the dp_netdev_mutex lock in dpif_netdev_wait

2019-12-25 Thread 贺鹏

Hi,

I think the holding of dp_netdev_mutex lock can be removed in the
dpif_netdev_wait, as in dpif_netdev_run, the code does not hold the
dp_netdev_mutex lock, just the dp->port_mutex lock.

So I am curious why the code here holds the dp_netdev_mutex lock. If
any one knows it, I will appreciate an explanation. Thanks.

Regards
-- 
hepeng
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH V3 08/19] netdev-offload-dpdk: Protect UFID map by mutex

2019-12-09 Thread 贺鹏

Got it.
You're correct.

Eli Britstein  于2019年12月9日周一 下午3:21写道：
>
>
> On 12/9/2019 9:15 AM, 贺鹏 wrote:
> > Hi,
> >
> >
> > cmap support multiple readers and one writer.
> > since all write operations are performed in the offload thread,
> > why need a mutex here? Maybe I miss some patches.
> Collecting statistics using netdev_flow_get in patch 10/20 is another
> thread. There, it gets the rte_flow, but if not protected it might get
> destroyed by the offload thread before it uses it.
> >
> >
> >
> > Eli Britstein  于2019年12月8日周日 下午9:23写道：
> >
> >> Flow deletion and dumping for statistics collection are called from
> >> different threads. As a pre-step towards collecting HW statistics,
> >> protect the UFID map by mutex to make it thread safe.
> >>
> >> Signed-off-by: Eli Britstein 
> >> ---
> >>   lib/netdev-offload-dpdk.c | 7 ++-
> >>   1 file changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
> >> index b2ec05cec..5568400b6 100644
> >> --- a/lib/netdev-offload-dpdk.c
> >> +++ b/lib/netdev-offload-dpdk.c
> >> @@ -51,6 +51,7 @@ static struct vlog_rate_limit error_rl = 
> >> VLOG_RATE_LIMIT_INIT(100, 5);
> >>* A mapping from ufid to dpdk rte_flow.
> >>*/
> >>   static struct cmap ufid_to_rte_flow = CMAP_INITIALIZER;
> >> +static struct ovs_mutex ufid_map_mutex = OVS_MUTEX_INITIALIZER;
> >>
> >>   struct ufid_to_rte_flow_data {
> >>   struct cmap_node node;
> >> @@ -630,8 +631,11 @@ netdev_offload_dpdk_destroy_flow(struct netdev 
> >> *netdev,
> >>struct rte_flow *rte_flow)
> >>   {
> >>   struct rte_flow_error error;
> >> -int ret = netdev_dpdk_rte_flow_destroy(netdev, rte_flow, );
> >> +int ret;
> >> +
> >> +ovs_mutex_lock(_map_mutex);
> >>
> >> +ret = netdev_dpdk_rte_flow_destroy(netdev, rte_flow, );
> >>   if (ret == 0) {
> >>   ufid_to_rte_flow_disassociate(ufid);
> >>   VLOG_DBG("%s: removed rte flow %p associated with ufid " 
> >> UUID_FMT "\n",
> >> @@ -642,6 +646,7 @@ netdev_offload_dpdk_destroy_flow(struct netdev *netdev,
> >>netdev_get_name(netdev), error.message, error.type);
> >>   }
> >>
> >> +ovs_mutex_unlock(_map_mutex);
> >>   return ret;
> >>   }
> >>
> >> --
> >> 2.14.5
> >>
> >> ___
> >> dev mailing list
> >> d...@openvswitch.org
> >> https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.openvswitch.org%2Fmailman%2Flistinfo%2Fovs-devdata=02%7C01%7Celibr%40mellanox.com%7C9af2f119de544768711408d77c779a66%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C637114725449215533sdata=VyrLUh6%2BP0ne9dXUSDjME31NBj0tHTV%2FEciNZY%2Baugo%3Dreserved=0
> >
> >
> > --
> > hepeng
> > ICT



-- 
hepeng
Bytedance
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH V3 08/19] netdev-offload-dpdk: Protect UFID map by mutex

2019-12-08 Thread 贺鹏

Hi,


cmap support multiple readers and one writer.
since all write operations are performed in the offload thread,
why need a mutex here? Maybe I miss some patches.



Eli Britstein  于2019年12月8日周日 下午9:23写道：

>
> Flow deletion and dumping for statistics collection are called from
> different threads. As a pre-step towards collecting HW statistics,
> protect the UFID map by mutex to make it thread safe.
>
> Signed-off-by: Eli Britstein 
> ---
>  lib/netdev-offload-dpdk.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
> index b2ec05cec..5568400b6 100644
> --- a/lib/netdev-offload-dpdk.c
> +++ b/lib/netdev-offload-dpdk.c
> @@ -51,6 +51,7 @@ static struct vlog_rate_limit error_rl = 
> VLOG_RATE_LIMIT_INIT(100, 5);
>   * A mapping from ufid to dpdk rte_flow.
>   */
>  static struct cmap ufid_to_rte_flow = CMAP_INITIALIZER;
> +static struct ovs_mutex ufid_map_mutex = OVS_MUTEX_INITIALIZER;
>
>  struct ufid_to_rte_flow_data {
>  struct cmap_node node;
> @@ -630,8 +631,11 @@ netdev_offload_dpdk_destroy_flow(struct netdev *netdev,
>   struct rte_flow *rte_flow)
>  {
>  struct rte_flow_error error;
> -int ret = netdev_dpdk_rte_flow_destroy(netdev, rte_flow, );
> +int ret;
> +
> +ovs_mutex_lock(_map_mutex);
>
> +ret = netdev_dpdk_rte_flow_destroy(netdev, rte_flow, );
>  if (ret == 0) {
>  ufid_to_rte_flow_disassociate(ufid);
>  VLOG_DBG("%s: removed rte flow %p associated with ufid " UUID_FMT 
> "\n",
> @@ -642,6 +646,7 @@ netdev_offload_dpdk_destroy_flow(struct netdev *netdev,
>   netdev_get_name(netdev), error.message, error.type);
>  }
>
> +ovs_mutex_unlock(_map_mutex);
>  return ret;
>  }
>
> --
> 2.14.5
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev



--
hepeng
ICT
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

58 matches

Mail list logo