Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-03-01 Thread Ben Pfaff
On Thu, Feb 25, 2021 at 10:31:35AM -0800, William Tu wrote:
> I think refcount and RCU are mutually exclusive. They are two
> different ways of doing synchronization and somehow we mix them
> together by using RCU to optimize refcount.

I think that you are correct in your description of the possibilities.
A data structure can be protected by RCU, or by a refcount, or by a
refcount that is only needed if access to a data structure needs to span
a grace period.  But to me, "mutually exclusive" means two things that
cannot exist together, and I think that the third possibility is in fact
a case where RCU and refcount exist together, so I would not call this
"mutually exclusive".
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-03-01 Thread Ilya Maximets
On 2/27/21 9:53 AM, 贺鹏 wrote:
> Hi,
> 
> Thanks William and Ilya for a detailed revisiting of the origin of the
> problem. I learned a lot.
> 
> I now understand that the mix using of RCU and refcounts is not
> intended in the first place.
> But my point is that mix using RCU and refcounts now gives you more
> choices, and actually eases the code changes.
> 
> For example, the code for * ofproto_dpif_lookup_by_name* or other
> ofproto lookup function,
> when only using refcounts, you need to change it to:
> 
>  struct ofproto_dpif *
>  ofproto_dpif_lookup_by_uuid(const struct uuid *uuid)
>  {
>  struct ofproto_dpif *ofproto;
> 
>  HMAP_FOR_EACH_WITH_HASH (ofproto, all_ofproto_dpifs_by_uuid_node,
>   uuid_hash(uuid), _ofproto_dpifs_by_uuid) {
>  if (uuid_equals(>uuid, uuid)) {
> 
>  ---> if  ovs_refcount_try_ref(ofproto)
> 
>  return ofproto;
>  }
>  }
>  return NULL;
>  }

That is an interesting example here...
I can't help but notice that this function is typically called
from different handler or pmd threads and modified by the main thread
while upcalls enabled.  And hmap is not a thread-safe structure.
I guess, we have another possible problem here.  We need to protect
at least this hmap and the other one with rw lock or something...
Am I right or am I missing something?  What else we need to protect?

> 
> and after finish its usage, you have to do ofproto_unref the ofproto.
> 
> This is why I said, most accessing to ofproto is ephemeral. If you
> change to use the pure refcounts solution, you have
> to add refcount and release it every time you access the ofproto. We
> should be more careful and remember to unref
> the ofproto.
> 
> However, if using RCU and refcounts, in the above case, you do not
> need to change the code, since the RCU ensures that
> these ephemeral accesses are safe.
> 
> you only need to add refcount, when you find that the pointer to
> ofproto lives longer than one grace period.
> 
> 
> This is why in my patch, I do not add ref to ofproto after its
> creation. I agree the patch is not complete and has issues,
> and understand it could confuse people if changes into mix RCU and
> refcounts version.
> 
> 
> William Tu  于2021年2月26日周五 上午2:32写道:
>>
>> On Thu, Feb 25, 2021 at 4:54 AM Ilya Maximets  wrote:
>>>
>>> On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote:
 Refcount and RCU are not mutually exclusive.
 IMO, the main reason for this problem is that the rule uses both refcount 
 and rcu, while the ofproto uses only rcu, and the rule retains the pointer 
 of the ofproto. More importantly, ofproto cannot guarantee a longer grace 
 period than the rule.

>>>
>>> I understand that refcount and RCU are not mutually exclusive.
>>> However, in this particular case RCU for ofproto was introduced for one
>>> and only one reason - to avoid freeing ofproto while rules are still
>>> alive.  This was done in commit f416c8d61601 ("ofproto: RCU postpone
>>> rule destruction.").  The goal was to allow using rules without
>>> refcounting them within a single grace period.  And that forced us
>>> to postpone destruction of the ofproto for a single grace period.
>>> Later commit 39c9459355b6 ("Use classifier versioning.") made it
>>> possible for rules to be alive for more than one grace period, so
>>> the commit made ofproto wait for 2 grace periods by double postponing.
>>> As we can see now, that wasn't enough and we have to wait for more
>>> than 2 grace periods in certain cases.
>>>
>>> Now we're introducing refcounts to wait for all rules to be destroyed
>>> before destroying the ofproto.  But what is the point of having
>>> RCU if we already know that if refcount is zero than all rules are
>>> already destroyed and we don't need to wait anything and could just
>>> destroy the ofproto?
>>>
>>> RCU doesn't protect ofproto itself here, it actually protects rules,
>>> i.e. keeps ofproto alive while rules alive, but we can fully replace
>>> this with refcounts and I see no value in having RCU additionally.
>>>
>>> To have a full picture: right now we also have groups protected by
>>> RCU, so we need to refcount ofproto for them too.  But that is almost
>>> exactly same situation as we have with rules.
>>>
>> Thanks Guohongzhi, Hepeng, and Ilya, I learned a lot from your discussions.
>>
>> I think refcount and RCU are mutually exclusive. They are two different ways 
>> of
>> doing synchronization and somehow we mix them together by using RCU to
>> optimize refcount.
>>
>> Before the commit f416c8d61601 ("ofproto: RCU postpone rule destruction."),
>> we are using refcount to protect rules, and as a result every time OVS
>> references
>> a rule, we have to take refcount. It works ok but this probably has
>> high performance
>> overhead because it's doing atomic operations.
>>
>> So the commit f416c8d61601 optimizes it by doing
>> 1) not taking refcount of rule "within a grace period"
>> 2) 

Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-27 Thread William Tu
Hi Hepeng,
Thanks, now I understand.

On Sat, Feb 27, 2021 at 12:54 AM 贺鹏  wrote:
>
> Hi,
>
> Thanks William and Ilya for a detailed revisiting of the origin of the
> problem. I learned a lot.
>
> I now understand that the mix using of RCU and refcounts is not
> intended in the first place.
> But my point is that mix using RCU and refcounts now gives you more
> choices, and actually eases the code changes.
>
> For example, the code for * ofproto_dpif_lookup_by_name* or other
> ofproto lookup function,
> when only using refcounts, you need to change it to:
>
>  struct ofproto_dpif *
>  ofproto_dpif_lookup_by_uuid(const struct uuid *uuid)
>  {
>  struct ofproto_dpif *ofproto;
>
>  HMAP_FOR_EACH_WITH_HASH (ofproto, all_ofproto_dpifs_by_uuid_node,
>   uuid_hash(uuid), _ofproto_dpifs_by_uuid) {
>  if (uuid_equals(>uuid, uuid)) {
>
>  ---> if  ovs_refcount_try_ref(ofproto)
>
>  return ofproto;
>  }
>  }
>  return NULL;
>  }
>
> and after finish its usage, you have to do ofproto_unref the ofproto.
>
> This is why I said, most accessing to ofproto is ephemeral. If you
> change to use the pure refcounts solution, you have
> to add refcount and release it every time you access the ofproto. We
> should be more careful and remember to unref
> the ofproto.
>
> However, if using RCU and refcounts, in the above case, you do not
> need to change the code, since the RCU ensures that
> these ephemeral accesses are safe.

How do we know which access is ephemeral, so no refcount is needed,
and which access is not, so we have to add refcount?

>
> you only need to add refcount, when you find that the pointer to
> ofproto lives longer than one grace period.
>
>
> This is why in my patch, I do not add ref to ofproto after its
> creation. I agree the patch is not complete and has issues,
> and understand it could confuse people if changes into mix RCU and
> refcounts version.
>
I see, thanks!
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-27 Thread 贺鹏
Hi,

Thanks William and Ilya for a detailed revisiting of the origin of the
problem. I learned a lot.

I now understand that the mix using of RCU and refcounts is not
intended in the first place.
But my point is that mix using RCU and refcounts now gives you more
choices, and actually eases the code changes.

For example, the code for * ofproto_dpif_lookup_by_name* or other
ofproto lookup function,
when only using refcounts, you need to change it to:

 struct ofproto_dpif *
 ofproto_dpif_lookup_by_uuid(const struct uuid *uuid)
 {
 struct ofproto_dpif *ofproto;

 HMAP_FOR_EACH_WITH_HASH (ofproto, all_ofproto_dpifs_by_uuid_node,
  uuid_hash(uuid), _ofproto_dpifs_by_uuid) {
 if (uuid_equals(>uuid, uuid)) {

 ---> if  ovs_refcount_try_ref(ofproto)

 return ofproto;
 }
 }
 return NULL;
 }

and after finish its usage, you have to do ofproto_unref the ofproto.

This is why I said, most accessing to ofproto is ephemeral. If you
change to use the pure refcounts solution, you have
to add refcount and release it every time you access the ofproto. We
should be more careful and remember to unref
the ofproto.

However, if using RCU and refcounts, in the above case, you do not
need to change the code, since the RCU ensures that
these ephemeral accesses are safe.

you only need to add refcount, when you find that the pointer to
ofproto lives longer than one grace period.


This is why in my patch, I do not add ref to ofproto after its
creation. I agree the patch is not complete and has issues,
and understand it could confuse people if changes into mix RCU and
refcounts version.


William Tu  于2021年2月26日周五 上午2:32写道:
>
> On Thu, Feb 25, 2021 at 4:54 AM Ilya Maximets  wrote:
> >
> > On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote:
> > > Refcount and RCU are not mutually exclusive.
> > > IMO, the main reason for this problem is that the rule uses both refcount 
> > > and rcu, while the ofproto uses only rcu, and the rule retains the 
> > > pointer of the ofproto. More importantly, ofproto cannot guarantee a 
> > > longer grace period than the rule.
> > >
> >
> > I understand that refcount and RCU are not mutually exclusive.
> > However, in this particular case RCU for ofproto was introduced for one
> > and only one reason - to avoid freeing ofproto while rules are still
> > alive.  This was done in commit f416c8d61601 ("ofproto: RCU postpone
> > rule destruction.").  The goal was to allow using rules without
> > refcounting them within a single grace period.  And that forced us
> > to postpone destruction of the ofproto for a single grace period.
> > Later commit 39c9459355b6 ("Use classifier versioning.") made it
> > possible for rules to be alive for more than one grace period, so
> > the commit made ofproto wait for 2 grace periods by double postponing.
> > As we can see now, that wasn't enough and we have to wait for more
> > than 2 grace periods in certain cases.
> >
> > Now we're introducing refcounts to wait for all rules to be destroyed
> > before destroying the ofproto.  But what is the point of having
> > RCU if we already know that if refcount is zero than all rules are
> > already destroyed and we don't need to wait anything and could just
> > destroy the ofproto?
> >
> > RCU doesn't protect ofproto itself here, it actually protects rules,
> > i.e. keeps ofproto alive while rules alive, but we can fully replace
> > this with refcounts and I see no value in having RCU additionally.
> >
> > To have a full picture: right now we also have groups protected by
> > RCU, so we need to refcount ofproto for them too.  But that is almost
> > exactly same situation as we have with rules.
> >
> Thanks Guohongzhi, Hepeng, and Ilya, I learned a lot from your discussions.
>
> I think refcount and RCU are mutually exclusive. They are two different ways 
> of
> doing synchronization and somehow we mix them together by using RCU to
> optimize refcount.
>
> Before the commit f416c8d61601 ("ofproto: RCU postpone rule destruction."),
> we are using refcount to protect rules, and as a result every time OVS
> references
> a rule, we have to take refcount. It works ok but this probably has
> high performance
> overhead because it's doing atomic operations.
>
> So the commit f416c8d61601 optimizes it by doing
> 1) not taking refcount of rule "within a grace period"
> 2) introduce RCU for rule
> The assumption is that a grace period is like refcount == 0, so it's
> valid to do so.
> A side effect is that ofproto_destroy__()  needs to be postponed.
> Note that if a rule is alive across grace period, it has to take refcount.
>
> However, later commit 39c9459355b6 ("Use classifier versioning.")
> makes rule alive across grace period. It breaks the first commit's assumption
> and it has to introduce double postponing for ofproto, which we found
> out is the problem now.
>
> Since my point is RCU and refcount are mutually exclusive (A grace period
> is like 

Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-25 Thread William Tu
On Thu, Feb 25, 2021 at 4:54 AM Ilya Maximets  wrote:
>
> On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote:
> > Refcount and RCU are not mutually exclusive.
> > IMO, the main reason for this problem is that the rule uses both refcount 
> > and rcu, while the ofproto uses only rcu, and the rule retains the pointer 
> > of the ofproto. More importantly, ofproto cannot guarantee a longer grace 
> > period than the rule.
> >
>
> I understand that refcount and RCU are not mutually exclusive.
> However, in this particular case RCU for ofproto was introduced for one
> and only one reason - to avoid freeing ofproto while rules are still
> alive.  This was done in commit f416c8d61601 ("ofproto: RCU postpone
> rule destruction.").  The goal was to allow using rules without
> refcounting them within a single grace period.  And that forced us
> to postpone destruction of the ofproto for a single grace period.
> Later commit 39c9459355b6 ("Use classifier versioning.") made it
> possible for rules to be alive for more than one grace period, so
> the commit made ofproto wait for 2 grace periods by double postponing.
> As we can see now, that wasn't enough and we have to wait for more
> than 2 grace periods in certain cases.
>
> Now we're introducing refcounts to wait for all rules to be destroyed
> before destroying the ofproto.  But what is the point of having
> RCU if we already know that if refcount is zero than all rules are
> already destroyed and we don't need to wait anything and could just
> destroy the ofproto?
>
> RCU doesn't protect ofproto itself here, it actually protects rules,
> i.e. keeps ofproto alive while rules alive, but we can fully replace
> this with refcounts and I see no value in having RCU additionally.
>
> To have a full picture: right now we also have groups protected by
> RCU, so we need to refcount ofproto for them too.  But that is almost
> exactly same situation as we have with rules.
>
Thanks Guohongzhi, Hepeng, and Ilya, I learned a lot from your discussions.

I think refcount and RCU are mutually exclusive. They are two different ways of
doing synchronization and somehow we mix them together by using RCU to
optimize refcount.

Before the commit f416c8d61601 ("ofproto: RCU postpone rule destruction."),
we are using refcount to protect rules, and as a result every time OVS
references
a rule, we have to take refcount. It works ok but this probably has
high performance
overhead because it's doing atomic operations.

So the commit f416c8d61601 optimizes it by doing
1) not taking refcount of rule "within a grace period"
2) introduce RCU for rule
The assumption is that a grace period is like refcount == 0, so it's
valid to do so.
A side effect is that ofproto_destroy__()  needs to be postponed.
Note that if a rule is alive across grace period, it has to take refcount.

However, later commit 39c9459355b6 ("Use classifier versioning.")
makes rule alive across grace period. It breaks the first commit's assumption
and it has to introduce double postponing for ofproto, which we found
out is the problem now.

Since my point is RCU and refcount are mutually exclusive (A grace period
is like refcount ==0) , I agree with Ilya that we can just use refcount to fix
ofproto issue.

Regards,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-25 Thread Ilya Maximets
On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote:
> Refcount and RCU are not mutually exclusive.
> IMO, the main reason for this problem is that the rule uses both refcount and 
> rcu, while the ofproto uses only rcu, and the rule retains the pointer of the 
> ofproto. More importantly, ofproto cannot guarantee a longer grace period 
> than the rule.
> 

I understand that refcount and RCU are not mutually exclusive.
However, in this particular case RCU for ofproto was introduced for one
and only one reason - to avoid freeing ofproto while rules are still
alive.  This was done in commit f416c8d61601 ("ofproto: RCU postpone
rule destruction.").  The goal was to allow using rules without
refcounting them within a single grace period.  And that forced us
to postpone destruction of the ofproto for a single grace period.
Later commit 39c9459355b6 ("Use classifier versioning.") made it
possible for rules to be alive for more than one grace period, so
the commit made ofproto wait for 2 grace periods by double postponing.
As we can see now, that wasn't enough and we have to wait for more
than 2 grace periods in certain cases.

Now we're introducing refcounts to wait for all rules to be destroyed
before destroying the ofproto.  But what is the point of having
RCU if we already know that if refcount is zero than all rules are
already destroyed and we don't need to wait anything and could just
destroy the ofproto?

RCU doesn't protect ofproto itself here, it actually protects rules,
i.e. keeps ofproto alive while rules alive, but we can fully replace
this with refcounts and I see no value in having RCU additionally.

To have a full picture: right now we also have groups protected by
RCU, so we need to refcount ofproto for them too.  But that is almost
exactly same situation as we have with rules.

> 
> -Original Message-
> From: 贺鹏 [mailto:xnhp0...@gmail.com] 
> Sent: 25/2/2021 11:14
> To: Ilya Maximets 
> Cc: hepeng.0320 ;  
> ; Guohongzhi (Russell Lab) 
> Subject: Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix 
> crash at ofproto_destroy
> 
> Ilya Maximets  于2021年2月25日周四 上午12:01写道:
>>
>> On 2/24/21 3:29 PM, Ilya Maximets wrote:
>>> On 7/17/20 3:50 AM, hepeng.0320 wrote:
>>>> From: Peng He 
>>>>
>>>> The call stack of rule_destroy_cb:
>>>>
>>>> remove_rules_postponed (one grace period)
>>>> -> remove_rule_rcu
>>>> -> remove_rule_rcu__
>>>> -> ofproto_rule_unref -> ref count != 1
>>>> -> ... more grace periods.
>>>> -> rule_destroy_cb (> 2 grace periods)
>>>> -> free
>>>>
>>>> Currently ofproto_destory waits only two grace periods, this means 
>>>> when rule_destroy_cb is called, the ofproto might have been already 
>>>> destroyed.
>>>>
>>>> This patch adds a refcount for ofproto to ensure ofproto exists 
>>>> when it is needed to call free functions.
>>>>
>>>> This patch fixes the crashes found:
>>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>>> 0  0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at 
>>>> ofproto/ofproto.c:2956
>>>> 1  0x562a552623d6 in ovsrcu_call_postponed () at 
>>>> lib/ovs-rcu.c:348
>>>> 2  0x562a55262504 in ovsrcu_postpone_thread (arg=>>> out>) at lib/ovs-rcu.c:364
>>>> 3  0x562a55264aef in ovsthread_wrapper (aux_=) 
>>>> at lib/ovs-thread.c:383
>>>> 4  0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at 
>>>> pthread_create.c:456
>>>> 5  0x7febe6990d0f in clone () at 
>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
>>>> (gdb) p rule->ofproto->ofproto_class
>>>> $3 = (const struct ofproto_class *) 0x0
>>>>
>>>> Also:
>>>> 0  ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, 
>>>> table_id=0 '\000', n_matches=5, n_misses=0) at 
>>>> ofproto/ofproto-dpif.c:4310
>>>> 1  0x558354c68514 in xlate_push_stats_entry 
>>>> (entry=entry@entry=0x7ff488031398, 
>>>> stats=stats@entry=0x7ff49af30b60) at 
>>>> ofproto/ofproto-dpif-xlate-cache.c:99
>>>> 2  0x558354c686f3 in xlate_push_stats (xcache=, 
>>>> stats=stats@entry=0x7ff49af30b60) at 
>>>> ofproto/ofproto-dpif-xlate-cache.c:181
>>>> 3  0x558354c5411a in revalidate_ukey 
>>>> (udpif=udpif@entry=0x5583583baba0, ukey=ukey@entry=0x7ff47809f770, 
>>>> st

Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-24 Thread Guohongzhi (Russell Lab)
Refcount and RCU are not mutually exclusive.
IMO, the main reason for this problem is that the rule uses both refcount and 
rcu, while the ofproto uses only rcu, and the rule retains the pointer of the 
ofproto. More importantly, ofproto cannot guarantee a longer grace period than 
the rule.


-Original Message-
From: 贺鹏 [mailto:xnhp0...@gmail.com] 
Sent: 25/2/2021 11:14
To: Ilya Maximets 
Cc: hepeng.0320 ;  
; Guohongzhi (Russell Lab) 
Subject: Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix 
crash at ofproto_destroy

Ilya Maximets  于2021年2月25日周四 上午12:01写道:
>
> On 2/24/21 3:29 PM, Ilya Maximets wrote:
> > On 7/17/20 3:50 AM, hepeng.0320 wrote:
> >> From: Peng He 
> >>
> >> The call stack of rule_destroy_cb:
> >>
> >> remove_rules_postponed (one grace period)
> >> -> remove_rule_rcu
> >> -> remove_rule_rcu__
> >> -> ofproto_rule_unref -> ref count != 1
> >> -> ... more grace periods.
> >> -> rule_destroy_cb (> 2 grace periods)
> >> -> free
> >>
> >> Currently ofproto_destory waits only two grace periods, this means 
> >> when rule_destroy_cb is called, the ofproto might have been already 
> >> destroyed.
> >>
> >> This patch adds a refcount for ofproto to ensure ofproto exists 
> >> when it is needed to call free functions.
> >>
> >> This patch fixes the crashes found:
> >> Program terminated with signal SIGSEGV, Segmentation fault.
> >> 0  0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at 
> >> ofproto/ofproto.c:2956
> >> 1  0x562a552623d6 in ovsrcu_call_postponed () at 
> >> lib/ovs-rcu.c:348
> >> 2  0x562a55262504 in ovsrcu_postpone_thread (arg= >> out>) at lib/ovs-rcu.c:364
> >> 3  0x562a55264aef in ovsthread_wrapper (aux_=) 
> >> at lib/ovs-thread.c:383
> >> 4  0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at 
> >> pthread_create.c:456
> >> 5  0x7febe6990d0f in clone () at 
> >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> >> (gdb) p rule->ofproto->ofproto_class
> >> $3 = (const struct ofproto_class *) 0x0
> >>
> >> Also:
> >> 0  ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, 
> >> table_id=0 '\000', n_matches=5, n_misses=0) at 
> >> ofproto/ofproto-dpif.c:4310
> >> 1  0x558354c68514 in xlate_push_stats_entry 
> >> (entry=entry@entry=0x7ff488031398, 
> >> stats=stats@entry=0x7ff49af30b60) at 
> >> ofproto/ofproto-dpif-xlate-cache.c:99
> >> 2  0x558354c686f3 in xlate_push_stats (xcache=, 
> >> stats=stats@entry=0x7ff49af30b60) at 
> >> ofproto/ofproto-dpif-xlate-cache.c:181
> >> 3  0x558354c5411a in revalidate_ukey 
> >> (udpif=udpif@entry=0x5583583baba0, ukey=ukey@entry=0x7ff47809f770, 
> >> stats=stats@entry=0x7ff49af32128, 
> >> odp_actions=odp_actions@entry=0x7ff49af30c50, 
> >> reval_seq=reval_seq@entry=275670456,
> >>recircs=recircs@entry=0x7ff49af30cd0) at 
> >> ofproto/ofproto-dpif-upcall.c:2292
> >> 4  0x558354c57cbc in revalidate (revalidator=) 
> >> at ofproto/ofproto-dpif-upcall.c:2681
> >> 5  0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at 
> >> ofproto/ofproto-dpif-upcall.c:934
> >> 6  0x558354d24aef in ovsthread_wrapper (aux_=) 
> >> at lib/ovs-thread.c:383
> >> 7  0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at 
> >> pthread_create.c:456
> >> 8  0x7ff4a3fd3d0f in clone () at 
> >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> >> (gdb) p ofproto->up.ofproto_class
> >> $3 = (const struct ofproto_class *) 0x0
> >>
> >> Signed-off-by: Peng He 
> >> ---
>
> CC: Hongzhi Guo
>
> And I just remembered that there is a very similar patch that was 
> submitted to the mail-list several months betore this one.
> Ben started review, but didn't finish:
>
> http://patchwork.ozlabs.org/project/openvswitch/patch/20200306130555.1
> 9884-1-guohongz...@huawei.com/
>
> Both patches has smilar issues, i.e. mixing RCU with refcounters and 
> not covering all the cases.
>
> There are at least 2 major places where ofproto needs to be refcounted:
> 1. rules
>1.a xlate caches?
> 2. ofproto groups
>
> IMO, mixing RCU with refcounts basically means that we do not have a 
> clear understanding on how it works and using RCU as a safety net.
> We shuld stop using RCU for orproto and ju

Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-24 Thread 贺鹏
Ilya Maximets  于2021年2月25日周四 上午12:01写道:
>
> On 2/24/21 3:29 PM, Ilya Maximets wrote:
> > On 7/17/20 3:50 AM, hepeng.0320 wrote:
> >> From: Peng He 
> >>
> >> The call stack of rule_destroy_cb:
> >>
> >> remove_rules_postponed (one grace period)
> >> -> remove_rule_rcu
> >> -> remove_rule_rcu__
> >> -> ofproto_rule_unref -> ref count != 1
> >> -> ... more grace periods.
> >> -> rule_destroy_cb (> 2 grace periods)
> >> -> free
> >>
> >> Currently ofproto_destory waits only two grace periods, this means when
> >> rule_destroy_cb is called, the ofproto might have been already destroyed.
> >>
> >> This patch adds a refcount for ofproto to ensure ofproto exists when it
> >> is needed to call free functions.
> >>
> >> This patch fixes the crashes found:
> >> Program terminated with signal SIGSEGV, Segmentation fault.
> >> 0  0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at 
> >> ofproto/ofproto.c:2956
> >> 1  0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348
> >> 2  0x562a55262504 in ovsrcu_postpone_thread (arg=) at 
> >> lib/ovs-rcu.c:364
> >> 3  0x562a55264aef in ovsthread_wrapper (aux_=) at 
> >> lib/ovs-thread.c:383
> >> 4  0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at 
> >> pthread_create.c:456
> >> 5  0x7febe6990d0f in clone () at 
> >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> >> (gdb) p rule->ofproto->ofproto_class
> >> $3 = (const struct ofproto_class *) 0x0
> >>
> >> Also:
> >> 0  ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 
> >> '\000', n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310
> >> 1  0x558354c68514 in xlate_push_stats_entry 
> >> (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at 
> >> ofproto/ofproto-dpif-xlate-cache.c:99
> >> 2  0x558354c686f3 in xlate_push_stats (xcache=, 
> >> stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181
> >> 3  0x558354c5411a in revalidate_ukey 
> >> (udpif=udpif@entry=0x5583583baba0, ukey=ukey@entry=0x7ff47809f770, 
> >> stats=stats@entry=0x7ff49af32128, 
> >> odp_actions=odp_actions@entry=0x7ff49af30c50, 
> >> reval_seq=reval_seq@entry=275670456,
> >>recircs=recircs@entry=0x7ff49af30cd0) at 
> >> ofproto/ofproto-dpif-upcall.c:2292
> >> 4  0x558354c57cbc in revalidate (revalidator=) at 
> >> ofproto/ofproto-dpif-upcall.c:2681
> >> 5  0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at 
> >> ofproto/ofproto-dpif-upcall.c:934
> >> 6  0x558354d24aef in ovsthread_wrapper (aux_=) at 
> >> lib/ovs-thread.c:383
> >> 7  0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at 
> >> pthread_create.c:456
> >> 8  0x7ff4a3fd3d0f in clone () at 
> >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> >> (gdb) p ofproto->up.ofproto_class
> >> $3 = (const struct ofproto_class *) 0x0
> >>
> >> Signed-off-by: Peng He 
> >> ---
>
> CC: Hongzhi Guo
>
> And I just remembered that there is a very similar patch that
> was submitted to the mail-list several months betore this one.
> Ben started review, but didn't finish:
>
> http://patchwork.ozlabs.org/project/openvswitch/patch/20200306130555.19884-1-guohongz...@huawei.com/
>
> Both patches has smilar issues, i.e. mixing RCU with refcounters
> and not covering all the cases.
>
> There are at least 2 major places where ofproto needs to be refcounted:
> 1. rules
>1.a xlate caches?
> 2. ofproto groups
>
> IMO, mixing RCU with refcounts basically means that we do not have a
> clear understanding on how it works and using RCU as a safety net.
> We shuld stop using RCU for orproto and just have refcounts.  Once
> ofproto refcount reaches zero it should be immediately destroyed.
> All the places where we need to increase refcount needs to be tracked
> down.

IMO, I think RCU and refcounts are not mutually exclusive. In kernel
connntack system such as IPVS, sessions are used with both RCU and
refcounts.

Essentially, the RCU implies a refcount if there are no places holding
a pointer to the object other than hash tables (or other RCU enabled
data structure), thus by removing it from hash tables, and wait a
grace period,
we can safely free the object because we are sure that no one holds a
pointer to the object. Removing it from hash tables is like reducing
the refcount by 1, and since hash tables are
the only place that holds the pointer and this is the only reference
that lives longer than a grace period, so by removing it, we are sure
that no other ref can live longer than a grace period, and after the
period, we are safe to free it.

If we use only refcounts, every time we access the object, we have to
add refcounts.
But if use RCU combined with refcount, we only need to add ref, if
there are other places than the hashtables that live longer than one
grace period.

So I think using RCU with refcounts actually makes code changes less,
as mostly, the access to object is ephemeral, you lookup 

Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-24 Thread Guohongzhi (Russell Lab)
It's true that the patch has been reviewed by ben and a question has been 
raised.I have replied the question, but I did not get a follow-up reply.

In short, I agree with you very much and would be happy to prepare a complete 
solution for this issue.

-Original Message-
From: Ilya Maximets [mailto:i.maxim...@ovn.org] 
Sent: 2/25/2021 0:02 AM
To: Ilya Maximets ; hepeng.0320 
; d...@openvswitch.org
Cc: William Tu ; Yifeng Sun ; 
Guohongzhi (Russell Lab) 
Subject: Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix 
crash at ofproto_destroy

On 2/24/21 3:29 PM, Ilya Maximets wrote:
> On 7/17/20 3:50 AM, hepeng.0320 wrote:
>> From: Peng He 
>>
>> The call stack of rule_destroy_cb:
>>
>> remove_rules_postponed (one grace period)
>> -> remove_rule_rcu
>> -> remove_rule_rcu__
>> -> ofproto_rule_unref -> ref count != 1
>> -> ... more grace periods.
>> -> rule_destroy_cb (> 2 grace periods)
>> -> free
>>
>> Currently ofproto_destory waits only two grace periods, this means 
>> when rule_destroy_cb is called, the ofproto might have been already 
>> destroyed.
>>
>> This patch adds a refcount for ofproto to ensure ofproto exists when 
>> it is needed to call free functions.
>>
>> This patch fixes the crashes found:
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> 0  0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at 
>> ofproto/ofproto.c:2956
>> 1  0x562a552623d6 in ovsrcu_call_postponed () at 
>> lib/ovs-rcu.c:348
>> 2  0x562a55262504 in ovsrcu_postpone_thread (arg=) 
>> at lib/ovs-rcu.c:364
>> 3  0x562a55264aef in ovsthread_wrapper (aux_=) at 
>> lib/ovs-thread.c:383
>> 4  0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at 
>> pthread_create.c:456
>> 5  0x7febe6990d0f in clone () at 
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
>> (gdb) p rule->ofproto->ofproto_class
>> $3 = (const struct ofproto_class *) 0x0
>>
>> Also:
>> 0  ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, 
>> table_id=0 '\000', n_matches=5, n_misses=0) at 
>> ofproto/ofproto-dpif.c:4310
>> 1  0x558354c68514 in xlate_push_stats_entry 
>> (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) 
>> at ofproto/ofproto-dpif-xlate-cache.c:99
>> 2  0x558354c686f3 in xlate_push_stats (xcache=, 
>> stats=stats@entry=0x7ff49af30b60) at 
>> ofproto/ofproto-dpif-xlate-cache.c:181
>> 3  0x558354c5411a in revalidate_ukey (udpif=udpif@entry=0x5583583baba0, 
>> ukey=ukey@entry=0x7ff47809f770, stats=stats@entry=0x7ff49af32128, 
>> odp_actions=odp_actions@entry=0x7ff49af30c50, 
>> reval_seq=reval_seq@entry=275670456,
>>recircs=recircs@entry=0x7ff49af30cd0) at 
>> ofproto/ofproto-dpif-upcall.c:2292
>> 4  0x558354c57cbc in revalidate (revalidator=) at 
>> ofproto/ofproto-dpif-upcall.c:2681
>> 5  0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at 
>> ofproto/ofproto-dpif-upcall.c:934
>> 6  0x558354d24aef in ovsthread_wrapper (aux_=) at 
>> lib/ovs-thread.c:383
>> 7  0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at 
>> pthread_create.c:456
>> 8  0x7ff4a3fd3d0f in clone () at 
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
>> (gdb) p ofproto->up.ofproto_class
>> $3 = (const struct ofproto_class *) 0x0
>>
>> Signed-off-by: Peng He 
>> ---

CC: Hongzhi Guo

And I just remembered that there is a very similar patch that was submitted to 
the mail-list several months betore this one.
Ben started review, but didn't finish:

http://patchwork.ozlabs.org/project/openvswitch/patch/20200306130555.19884-1-guohongz...@huawei.com/

Both patches has smilar issues, i.e. mixing RCU with refcounters and not 
covering all the cases.

There are at least 2 major places where ofproto needs to be refcounted:
1. rules
   1.a xlate caches?
2. ofproto groups

IMO, mixing RCU with refcounts basically means that we do not have a clear 
understanding on how it works and using RCU as a safety net.
We shuld stop using RCU for orproto and just have refcounts.  Once ofproto 
refcount reaches zero it should be immediately destroyed.
All the places where we need to increase refcount needs to be tracked down.

It'll be great if you can cooperate to prepare a complete solution for this 
issue.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-24 Thread Ilya Maximets
On 2/24/21 3:29 PM, Ilya Maximets wrote:
> On 7/17/20 3:50 AM, hepeng.0320 wrote:
>> From: Peng He 
>>
>> The call stack of rule_destroy_cb:
>>
>> remove_rules_postponed (one grace period)
>> -> remove_rule_rcu
>> -> remove_rule_rcu__
>> -> ofproto_rule_unref -> ref count != 1
>> -> ... more grace periods.
>> -> rule_destroy_cb (> 2 grace periods)
>> -> free
>>
>> Currently ofproto_destory waits only two grace periods, this means when
>> rule_destroy_cb is called, the ofproto might have been already destroyed.
>>
>> This patch adds a refcount for ofproto to ensure ofproto exists when it
>> is needed to call free functions.
>>
>> This patch fixes the crashes found:
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> 0  0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at 
>> ofproto/ofproto.c:2956
>> 1  0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348
>> 2  0x562a55262504 in ovsrcu_postpone_thread (arg=) at 
>> lib/ovs-rcu.c:364
>> 3  0x562a55264aef in ovsthread_wrapper (aux_=) at 
>> lib/ovs-thread.c:383
>> 4  0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at 
>> pthread_create.c:456
>> 5  0x7febe6990d0f in clone () at 
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
>> (gdb) p rule->ofproto->ofproto_class
>> $3 = (const struct ofproto_class *) 0x0
>>
>> Also:
>> 0  ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 
>> '\000', n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310
>> 1  0x558354c68514 in xlate_push_stats_entry 
>> (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at 
>> ofproto/ofproto-dpif-xlate-cache.c:99
>> 2  0x558354c686f3 in xlate_push_stats (xcache=, 
>> stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181
>> 3  0x558354c5411a in revalidate_ukey (udpif=udpif@entry=0x5583583baba0, 
>> ukey=ukey@entry=0x7ff47809f770, stats=stats@entry=0x7ff49af32128, 
>> odp_actions=odp_actions@entry=0x7ff49af30c50, 
>> reval_seq=reval_seq@entry=275670456,
>>recircs=recircs@entry=0x7ff49af30cd0) at 
>> ofproto/ofproto-dpif-upcall.c:2292
>> 4  0x558354c57cbc in revalidate (revalidator=) at 
>> ofproto/ofproto-dpif-upcall.c:2681
>> 5  0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at 
>> ofproto/ofproto-dpif-upcall.c:934
>> 6  0x558354d24aef in ovsthread_wrapper (aux_=) at 
>> lib/ovs-thread.c:383
>> 7  0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at 
>> pthread_create.c:456
>> 8  0x7ff4a3fd3d0f in clone () at 
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
>> (gdb) p ofproto->up.ofproto_class
>> $3 = (const struct ofproto_class *) 0x0
>>
>> Signed-off-by: Peng He 
>> ---

CC: Hongzhi Guo

And I just remembered that there is a very similar patch that
was submitted to the mail-list several months betore this one.
Ben started review, but didn't finish:

http://patchwork.ozlabs.org/project/openvswitch/patch/20200306130555.19884-1-guohongz...@huawei.com/

Both patches has smilar issues, i.e. mixing RCU with refcounters
and not covering all the cases.

There are at least 2 major places where ofproto needs to be refcounted:
1. rules
   1.a xlate caches?
2. ofproto groups

IMO, mixing RCU with refcounts basically means that we do not have a
clear understanding on how it works and using RCU as a safety net.
We shuld stop using RCU for orproto and just have refcounts.  Once
ofproto refcount reaches zero it should be immediately destroyed.
All the places where we need to increase refcount needs to be tracked
down.

It'll be great if you can cooperate to prepare a complete solution
for this issue.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2021-02-24 Thread Ilya Maximets
On 7/17/20 3:50 AM, hepeng.0320 wrote:
> From: Peng He 
> 
> The call stack of rule_destroy_cb:
> 
> remove_rules_postponed (one grace period)
> -> remove_rule_rcu
> -> remove_rule_rcu__
> -> ofproto_rule_unref -> ref count != 1
> -> ... more grace periods.
> -> rule_destroy_cb (> 2 grace periods)
> -> free
> 
> Currently ofproto_destory waits only two grace periods, this means when
> rule_destroy_cb is called, the ofproto might have been already destroyed.
> 
> This patch adds a refcount for ofproto to ensure ofproto exists when it
> is needed to call free functions.
> 
> This patch fixes the crashes found:
> Program terminated with signal SIGSEGV, Segmentation fault.
> 0  0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at 
> ofproto/ofproto.c:2956
> 1  0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348
> 2  0x562a55262504 in ovsrcu_postpone_thread (arg=) at 
> lib/ovs-rcu.c:364
> 3  0x562a55264aef in ovsthread_wrapper (aux_=) at 
> lib/ovs-thread.c:383
> 4  0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at 
> pthread_create.c:456
> 5  0x7febe6990d0f in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> (gdb) p rule->ofproto->ofproto_class
> $3 = (const struct ofproto_class *) 0x0
> 
> Also:
> 0  ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 
> '\000', n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310
> 1  0x558354c68514 in xlate_push_stats_entry 
> (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at 
> ofproto/ofproto-dpif-xlate-cache.c:99
> 2  0x558354c686f3 in xlate_push_stats (xcache=, 
> stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181
> 3  0x558354c5411a in revalidate_ukey (udpif=udpif@entry=0x5583583baba0, 
> ukey=ukey@entry=0x7ff47809f770, stats=stats@entry=0x7ff49af32128, 
> odp_actions=odp_actions@entry=0x7ff49af30c50, 
> reval_seq=reval_seq@entry=275670456,
>recircs=recircs@entry=0x7ff49af30cd0) at ofproto/ofproto-dpif-upcall.c:2292
> 4  0x558354c57cbc in revalidate (revalidator=) at 
> ofproto/ofproto-dpif-upcall.c:2681
> 5  0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at 
> ofproto/ofproto-dpif-upcall.c:934
> 6  0x558354d24aef in ovsthread_wrapper (aux_=) at 
> lib/ovs-thread.c:383
> 7  0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at 
> pthread_create.c:456
> 8  0x7ff4a3fd3d0f in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> (gdb) p ofproto->up.ofproto_class
> $3 = (const struct ofproto_class *) 0x0
> 
> Signed-off-by: Peng He 
> ---
>  ofproto/ofproto-dpif-xlate-cache.c |  1 +
>  ofproto/ofproto-dpif.c | 22 
>  ofproto/ofproto-provider.h |  2 ++
>  ofproto/ofproto.c  | 41 +++---
>  ofproto/ofproto.h  |  4 +++
>  5 files changed, 56 insertions(+), 14 deletions(-)
> 
> diff --git a/ofproto/ofproto-dpif-xlate-cache.c 
> b/ofproto/ofproto-dpif-xlate-cache.c
> index dcc91cb38..0deee365d 100644
> --- a/ofproto/ofproto-dpif-xlate-cache.c
> +++ b/ofproto/ofproto-dpif-xlate-cache.c
> @@ -209,6 +209,7 @@ xlate_cache_clear_entry(struct xc_entry *entry)
>  {
>  switch (entry->type) {
>  case XC_TABLE:
> +ofproto_unref(&(entry->table.ofproto->up));
>  break;
>  case XC_RULE:
>  ofproto_rule_unref(>rule->up);
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index 4f0638f23..6480b3c78 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -4414,11 +4414,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif 
> *ofproto,
>  }
>  if (xcache) {
>  struct xc_entry *entry;
> -
> -entry = xlate_cache_add_entry(xcache, XC_TABLE);
> -entry->table.ofproto = ofproto;
> -entry->table.id = *table_id;
> -entry->table.match = true;
> +if (ofproto_try_ref(>up)) {
> +entry = xlate_cache_add_entry(xcache, XC_TABLE);

You're handling only XC_TABLE cache entries.  Do we need to ref ofproto
for other types of cache?  XC_NORMAL holds the pointer too.

> +entry->table.ofproto = ofproto;
> +entry->table.id = *table_id;
> +entry->table.match = true;
> +}
>  }
>  return rule;
>  }
> @@ -4450,11 +4451,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif 
> *ofproto,
>  }
>  if (xcache) {
>  struct xc_entry *entry;
> -
> -entry = xlate_cache_add_entry(xcache, XC_TABLE);
> -entry->table.ofproto = ofproto;
> -entry->table.id = next_id;
> -entry->table.match = (rule != NULL);
> +if (ofproto_try_ref(>up)) {
> +entry = xlate_cache_add_entry(xcache, 

Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2020-07-20 Thread Gregory Rose



On 7/16/2020 6:50 PM, hepeng.0320 wrote:

From: Peng He 

The call stack of rule_destroy_cb:

remove_rules_postponed (one grace period)
 -> remove_rule_rcu
 -> remove_rule_rcu__
 -> ofproto_rule_unref -> ref count != 1
 -> ... more grace periods.
 -> rule_destroy_cb (> 2 grace periods)
 -> free

Currently ofproto_destory waits only two grace periods, this means when
rule_destroy_cb is called, the ofproto might have been already destroyed.

This patch adds a refcount for ofproto to ensure ofproto exists when it
is needed to call free functions.

This patch fixes the crashes found:
Program terminated with signal SIGSEGV, Segmentation fault.
0  0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at 
ofproto/ofproto.c:2956
1  0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348
2  0x562a55262504 in ovsrcu_postpone_thread (arg=) at 
lib/ovs-rcu.c:364
3  0x562a55264aef in ovsthread_wrapper (aux_=) at 
lib/ovs-thread.c:383
4  0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at 
pthread_create.c:456
5  0x7febe6990d0f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:97
(gdb) p rule->ofproto->ofproto_class
$3 = (const struct ofproto_class *) 0x0

Also:
0  ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 '\000', 
n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310
1  0x558354c68514 in xlate_push_stats_entry 
(entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at 
ofproto/ofproto-dpif-xlate-cache.c:99
2  0x558354c686f3 in xlate_push_stats (xcache=, 
stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181
3  0x558354c5411a in revalidate_ukey (udpif=udpif@entry=0x5583583baba0, 
ukey=ukey@entry=0x7ff47809f770, stats=stats@entry=0x7ff49af32128, 
odp_actions=odp_actions@entry=0x7ff49af30c50, 
reval_seq=reval_seq@entry=275670456,
recircs=recircs@entry=0x7ff49af30cd0) at ofproto/ofproto-dpif-upcall.c:2292
4  0x558354c57cbc in revalidate (revalidator=) at 
ofproto/ofproto-dpif-upcall.c:2681
5  0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at 
ofproto/ofproto-dpif-upcall.c:934
6  0x558354d24aef in ovsthread_wrapper (aux_=) at 
lib/ovs-thread.c:383
7  0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at 
pthread_create.c:456
8  0x7ff4a3fd3d0f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:97
(gdb) p ofproto->up.ofproto_class
$3 = (const struct ofproto_class *) 0x0

Signed-off-by: Peng He 


Hi Peng,

thank you for working on this.  I was wondering if you have a reliable
way to reproduce this bug?

Besides the checkpatch errors, overall the code looks fine to me but
I'd like to be able to test it.
Have you considered adding a test for it in tests/ofproto.at?

Couple other comments below.

Thanks,

- Greg


---
  ofproto/ofproto-dpif-xlate-cache.c |  1 +
  ofproto/ofproto-dpif.c | 22 
  ofproto/ofproto-provider.h |  2 ++
  ofproto/ofproto.c  | 41 +++---
  ofproto/ofproto.h  |  4 +++
  5 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/ofproto/ofproto-dpif-xlate-cache.c 
b/ofproto/ofproto-dpif-xlate-cache.c
index dcc91cb38..0deee365d 100644
--- a/ofproto/ofproto-dpif-xlate-cache.c
+++ b/ofproto/ofproto-dpif-xlate-cache.c
@@ -209,6 +209,7 @@ xlate_cache_clear_entry(struct xc_entry *entry)
  {
  switch (entry->type) {
  case XC_TABLE:
+ofproto_unref(&(entry->table.ofproto->up));
  break;
  case XC_RULE:
  ofproto_rule_unref(>rule->up);
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 4f0638f23..6480b3c78 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -4414,11 +4414,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif 
*ofproto,
  }
  if (xcache) {
  struct xc_entry *entry;
-
-entry = xlate_cache_add_entry(xcache, XC_TABLE);
-entry->table.ofproto = ofproto;
-entry->table.id = *table_id;
-entry->table.match = true;
+if (ofproto_try_ref(>up)) {
+entry = xlate_cache_add_entry(xcache, XC_TABLE);
+entry->table.ofproto = ofproto;
+entry->table.id = *table_id;
+entry->table.match = true;
+}
  }
  return rule;
  }
@@ -4450,11 +4451,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif 
*ofproto,
  }
  if (xcache) {
  struct xc_entry *entry;
-
-entry = xlate_cache_add_entry(xcache, XC_TABLE);
-entry->table.ofproto = ofproto;
-entry->table.id = next_id;
-entry->table.match = (rule != NULL);
+if (ofproto_try_ref(>up)) {
+entry = xlate_cache_add_entry(xcache, XC_TABLE);
+ 

Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2020-07-16 Thread 0-day Robot
Bleep bloop.  Greetings hepeng.0320, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
ERROR: Inappropriate bracing around statement
#182 FILE: ofproto/ofproto.c:1734:
if (ovs_refcount_read(>refcount) != 1)

ERROR: Inappropriate bracing around statement
#184 FILE: ofproto/ofproto.c:1736:
else

Lines checked: 222, Warnings: 0, Errors: 2


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy

2020-07-16 Thread hepeng.0320
From: Peng He 

The call stack of rule_destroy_cb:

remove_rules_postponed (one grace period)
-> remove_rule_rcu
-> remove_rule_rcu__
-> ofproto_rule_unref -> ref count != 1
-> ... more grace periods.
-> rule_destroy_cb (> 2 grace periods)
-> free

Currently ofproto_destory waits only two grace periods, this means when
rule_destroy_cb is called, the ofproto might have been already destroyed.

This patch adds a refcount for ofproto to ensure ofproto exists when it
is needed to call free functions.

This patch fixes the crashes found:
Program terminated with signal SIGSEGV, Segmentation fault.
0  0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at 
ofproto/ofproto.c:2956
1  0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348
2  0x562a55262504 in ovsrcu_postpone_thread (arg=) at 
lib/ovs-rcu.c:364
3  0x562a55264aef in ovsthread_wrapper (aux_=) at 
lib/ovs-thread.c:383
4  0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at 
pthread_create.c:456
5  0x7febe6990d0f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:97
(gdb) p rule->ofproto->ofproto_class
$3 = (const struct ofproto_class *) 0x0

Also:
0  ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 '\000', 
n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310
1  0x558354c68514 in xlate_push_stats_entry 
(entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at 
ofproto/ofproto-dpif-xlate-cache.c:99
2  0x558354c686f3 in xlate_push_stats (xcache=, 
stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181
3  0x558354c5411a in revalidate_ukey (udpif=udpif@entry=0x5583583baba0, 
ukey=ukey@entry=0x7ff47809f770, stats=stats@entry=0x7ff49af32128, 
odp_actions=odp_actions@entry=0x7ff49af30c50, 
reval_seq=reval_seq@entry=275670456,
   recircs=recircs@entry=0x7ff49af30cd0) at ofproto/ofproto-dpif-upcall.c:2292
4  0x558354c57cbc in revalidate (revalidator=) at 
ofproto/ofproto-dpif-upcall.c:2681
5  0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at 
ofproto/ofproto-dpif-upcall.c:934
6  0x558354d24aef in ovsthread_wrapper (aux_=) at 
lib/ovs-thread.c:383
7  0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at 
pthread_create.c:456
8  0x7ff4a3fd3d0f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:97
(gdb) p ofproto->up.ofproto_class
$3 = (const struct ofproto_class *) 0x0

Signed-off-by: Peng He 
---
 ofproto/ofproto-dpif-xlate-cache.c |  1 +
 ofproto/ofproto-dpif.c | 22 
 ofproto/ofproto-provider.h |  2 ++
 ofproto/ofproto.c  | 41 +++---
 ofproto/ofproto.h  |  4 +++
 5 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/ofproto/ofproto-dpif-xlate-cache.c 
b/ofproto/ofproto-dpif-xlate-cache.c
index dcc91cb38..0deee365d 100644
--- a/ofproto/ofproto-dpif-xlate-cache.c
+++ b/ofproto/ofproto-dpif-xlate-cache.c
@@ -209,6 +209,7 @@ xlate_cache_clear_entry(struct xc_entry *entry)
 {
 switch (entry->type) {
 case XC_TABLE:
+ofproto_unref(&(entry->table.ofproto->up));
 break;
 case XC_RULE:
 ofproto_rule_unref(>rule->up);
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 4f0638f23..6480b3c78 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -4414,11 +4414,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif 
*ofproto,
 }
 if (xcache) {
 struct xc_entry *entry;
-
-entry = xlate_cache_add_entry(xcache, XC_TABLE);
-entry->table.ofproto = ofproto;
-entry->table.id = *table_id;
-entry->table.match = true;
+if (ofproto_try_ref(>up)) {
+entry = xlate_cache_add_entry(xcache, XC_TABLE);
+entry->table.ofproto = ofproto;
+entry->table.id = *table_id;
+entry->table.match = true;
+}
 }
 return rule;
 }
@@ -4450,11 +4451,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif 
*ofproto,
 }
 if (xcache) {
 struct xc_entry *entry;
-
-entry = xlate_cache_add_entry(xcache, XC_TABLE);
-entry->table.ofproto = ofproto;
-entry->table.id = next_id;
-entry->table.match = (rule != NULL);
+if (ofproto_try_ref(>up)) {
+entry = xlate_cache_add_entry(xcache, XC_TABLE);
+entry->table.ofproto = ofproto;
+entry->table.id = next_id;
+entry->table.match = (rule != NULL);
+}
 }
 if (rule) {
 goto out;   /* Match. */
diff --git a/ofproto/ofproto-provider.h b/ofproto/ofproto-provider.h
index afecb24cb..34074fd62 100644
--- a/ofproto/ofproto-provider.h
+++ b/ofproto/ofproto-provider.h