Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
On Thu, Feb 25, 2021 at 10:31:35AM -0800, William Tu wrote: > I think refcount and RCU are mutually exclusive. They are two > different ways of doing synchronization and somehow we mix them > together by using RCU to optimize refcount. I think that you are correct in your description of the possibilities. A data structure can be protected by RCU, or by a refcount, or by a refcount that is only needed if access to a data structure needs to span a grace period. But to me, "mutually exclusive" means two things that cannot exist together, and I think that the third possibility is in fact a case where RCU and refcount exist together, so I would not call this "mutually exclusive". ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
On 2/27/21 9:53 AM, 贺鹏 wrote: > Hi, > > Thanks William and Ilya for a detailed revisiting of the origin of the > problem. I learned a lot. > > I now understand that the mix using of RCU and refcounts is not > intended in the first place. > But my point is that mix using RCU and refcounts now gives you more > choices, and actually eases the code changes. > > For example, the code for * ofproto_dpif_lookup_by_name* or other > ofproto lookup function, > when only using refcounts, you need to change it to: > > struct ofproto_dpif * > ofproto_dpif_lookup_by_uuid(const struct uuid *uuid) > { > struct ofproto_dpif *ofproto; > > HMAP_FOR_EACH_WITH_HASH (ofproto, all_ofproto_dpifs_by_uuid_node, > uuid_hash(uuid), _ofproto_dpifs_by_uuid) { > if (uuid_equals(>uuid, uuid)) { > > ---> if ovs_refcount_try_ref(ofproto) > > return ofproto; > } > } > return NULL; > } That is an interesting example here... I can't help but notice that this function is typically called from different handler or pmd threads and modified by the main thread while upcalls enabled. And hmap is not a thread-safe structure. I guess, we have another possible problem here. We need to protect at least this hmap and the other one with rw lock or something... Am I right or am I missing something? What else we need to protect? > > and after finish its usage, you have to do ofproto_unref the ofproto. > > This is why I said, most accessing to ofproto is ephemeral. If you > change to use the pure refcounts solution, you have > to add refcount and release it every time you access the ofproto. We > should be more careful and remember to unref > the ofproto. > > However, if using RCU and refcounts, in the above case, you do not > need to change the code, since the RCU ensures that > these ephemeral accesses are safe. > > you only need to add refcount, when you find that the pointer to > ofproto lives longer than one grace period. > > > This is why in my patch, I do not add ref to ofproto after its > creation. I agree the patch is not complete and has issues, > and understand it could confuse people if changes into mix RCU and > refcounts version. > > > William Tu 于2021年2月26日周五 上午2:32写道: >> >> On Thu, Feb 25, 2021 at 4:54 AM Ilya Maximets wrote: >>> >>> On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote: Refcount and RCU are not mutually exclusive. IMO, the main reason for this problem is that the rule uses both refcount and rcu, while the ofproto uses only rcu, and the rule retains the pointer of the ofproto. More importantly, ofproto cannot guarantee a longer grace period than the rule. >>> >>> I understand that refcount and RCU are not mutually exclusive. >>> However, in this particular case RCU for ofproto was introduced for one >>> and only one reason - to avoid freeing ofproto while rules are still >>> alive. This was done in commit f416c8d61601 ("ofproto: RCU postpone >>> rule destruction."). The goal was to allow using rules without >>> refcounting them within a single grace period. And that forced us >>> to postpone destruction of the ofproto for a single grace period. >>> Later commit 39c9459355b6 ("Use classifier versioning.") made it >>> possible for rules to be alive for more than one grace period, so >>> the commit made ofproto wait for 2 grace periods by double postponing. >>> As we can see now, that wasn't enough and we have to wait for more >>> than 2 grace periods in certain cases. >>> >>> Now we're introducing refcounts to wait for all rules to be destroyed >>> before destroying the ofproto. But what is the point of having >>> RCU if we already know that if refcount is zero than all rules are >>> already destroyed and we don't need to wait anything and could just >>> destroy the ofproto? >>> >>> RCU doesn't protect ofproto itself here, it actually protects rules, >>> i.e. keeps ofproto alive while rules alive, but we can fully replace >>> this with refcounts and I see no value in having RCU additionally. >>> >>> To have a full picture: right now we also have groups protected by >>> RCU, so we need to refcount ofproto for them too. But that is almost >>> exactly same situation as we have with rules. >>> >> Thanks Guohongzhi, Hepeng, and Ilya, I learned a lot from your discussions. >> >> I think refcount and RCU are mutually exclusive. They are two different ways >> of >> doing synchronization and somehow we mix them together by using RCU to >> optimize refcount. >> >> Before the commit f416c8d61601 ("ofproto: RCU postpone rule destruction."), >> we are using refcount to protect rules, and as a result every time OVS >> references >> a rule, we have to take refcount. It works ok but this probably has >> high performance >> overhead because it's doing atomic operations. >> >> So the commit f416c8d61601 optimizes it by doing >> 1) not taking refcount of rule "within a grace period" >> 2)
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
Hi Hepeng, Thanks, now I understand. On Sat, Feb 27, 2021 at 12:54 AM 贺鹏 wrote: > > Hi, > > Thanks William and Ilya for a detailed revisiting of the origin of the > problem. I learned a lot. > > I now understand that the mix using of RCU and refcounts is not > intended in the first place. > But my point is that mix using RCU and refcounts now gives you more > choices, and actually eases the code changes. > > For example, the code for * ofproto_dpif_lookup_by_name* or other > ofproto lookup function, > when only using refcounts, you need to change it to: > > struct ofproto_dpif * > ofproto_dpif_lookup_by_uuid(const struct uuid *uuid) > { > struct ofproto_dpif *ofproto; > > HMAP_FOR_EACH_WITH_HASH (ofproto, all_ofproto_dpifs_by_uuid_node, > uuid_hash(uuid), _ofproto_dpifs_by_uuid) { > if (uuid_equals(>uuid, uuid)) { > > ---> if ovs_refcount_try_ref(ofproto) > > return ofproto; > } > } > return NULL; > } > > and after finish its usage, you have to do ofproto_unref the ofproto. > > This is why I said, most accessing to ofproto is ephemeral. If you > change to use the pure refcounts solution, you have > to add refcount and release it every time you access the ofproto. We > should be more careful and remember to unref > the ofproto. > > However, if using RCU and refcounts, in the above case, you do not > need to change the code, since the RCU ensures that > these ephemeral accesses are safe. How do we know which access is ephemeral, so no refcount is needed, and which access is not, so we have to add refcount? > > you only need to add refcount, when you find that the pointer to > ofproto lives longer than one grace period. > > > This is why in my patch, I do not add ref to ofproto after its > creation. I agree the patch is not complete and has issues, > and understand it could confuse people if changes into mix RCU and > refcounts version. > I see, thanks! William ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
Hi, Thanks William and Ilya for a detailed revisiting of the origin of the problem. I learned a lot. I now understand that the mix using of RCU and refcounts is not intended in the first place. But my point is that mix using RCU and refcounts now gives you more choices, and actually eases the code changes. For example, the code for * ofproto_dpif_lookup_by_name* or other ofproto lookup function, when only using refcounts, you need to change it to: struct ofproto_dpif * ofproto_dpif_lookup_by_uuid(const struct uuid *uuid) { struct ofproto_dpif *ofproto; HMAP_FOR_EACH_WITH_HASH (ofproto, all_ofproto_dpifs_by_uuid_node, uuid_hash(uuid), _ofproto_dpifs_by_uuid) { if (uuid_equals(>uuid, uuid)) { ---> if ovs_refcount_try_ref(ofproto) return ofproto; } } return NULL; } and after finish its usage, you have to do ofproto_unref the ofproto. This is why I said, most accessing to ofproto is ephemeral. If you change to use the pure refcounts solution, you have to add refcount and release it every time you access the ofproto. We should be more careful and remember to unref the ofproto. However, if using RCU and refcounts, in the above case, you do not need to change the code, since the RCU ensures that these ephemeral accesses are safe. you only need to add refcount, when you find that the pointer to ofproto lives longer than one grace period. This is why in my patch, I do not add ref to ofproto after its creation. I agree the patch is not complete and has issues, and understand it could confuse people if changes into mix RCU and refcounts version. William Tu 于2021年2月26日周五 上午2:32写道: > > On Thu, Feb 25, 2021 at 4:54 AM Ilya Maximets wrote: > > > > On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote: > > > Refcount and RCU are not mutually exclusive. > > > IMO, the main reason for this problem is that the rule uses both refcount > > > and rcu, while the ofproto uses only rcu, and the rule retains the > > > pointer of the ofproto. More importantly, ofproto cannot guarantee a > > > longer grace period than the rule. > > > > > > > I understand that refcount and RCU are not mutually exclusive. > > However, in this particular case RCU for ofproto was introduced for one > > and only one reason - to avoid freeing ofproto while rules are still > > alive. This was done in commit f416c8d61601 ("ofproto: RCU postpone > > rule destruction."). The goal was to allow using rules without > > refcounting them within a single grace period. And that forced us > > to postpone destruction of the ofproto for a single grace period. > > Later commit 39c9459355b6 ("Use classifier versioning.") made it > > possible for rules to be alive for more than one grace period, so > > the commit made ofproto wait for 2 grace periods by double postponing. > > As we can see now, that wasn't enough and we have to wait for more > > than 2 grace periods in certain cases. > > > > Now we're introducing refcounts to wait for all rules to be destroyed > > before destroying the ofproto. But what is the point of having > > RCU if we already know that if refcount is zero than all rules are > > already destroyed and we don't need to wait anything and could just > > destroy the ofproto? > > > > RCU doesn't protect ofproto itself here, it actually protects rules, > > i.e. keeps ofproto alive while rules alive, but we can fully replace > > this with refcounts and I see no value in having RCU additionally. > > > > To have a full picture: right now we also have groups protected by > > RCU, so we need to refcount ofproto for them too. But that is almost > > exactly same situation as we have with rules. > > > Thanks Guohongzhi, Hepeng, and Ilya, I learned a lot from your discussions. > > I think refcount and RCU are mutually exclusive. They are two different ways > of > doing synchronization and somehow we mix them together by using RCU to > optimize refcount. > > Before the commit f416c8d61601 ("ofproto: RCU postpone rule destruction."), > we are using refcount to protect rules, and as a result every time OVS > references > a rule, we have to take refcount. It works ok but this probably has > high performance > overhead because it's doing atomic operations. > > So the commit f416c8d61601 optimizes it by doing > 1) not taking refcount of rule "within a grace period" > 2) introduce RCU for rule > The assumption is that a grace period is like refcount == 0, so it's > valid to do so. > A side effect is that ofproto_destroy__() needs to be postponed. > Note that if a rule is alive across grace period, it has to take refcount. > > However, later commit 39c9459355b6 ("Use classifier versioning.") > makes rule alive across grace period. It breaks the first commit's assumption > and it has to introduce double postponing for ofproto, which we found > out is the problem now. > > Since my point is RCU and refcount are mutually exclusive (A grace period > is like
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
On Thu, Feb 25, 2021 at 4:54 AM Ilya Maximets wrote: > > On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote: > > Refcount and RCU are not mutually exclusive. > > IMO, the main reason for this problem is that the rule uses both refcount > > and rcu, while the ofproto uses only rcu, and the rule retains the pointer > > of the ofproto. More importantly, ofproto cannot guarantee a longer grace > > period than the rule. > > > > I understand that refcount and RCU are not mutually exclusive. > However, in this particular case RCU for ofproto was introduced for one > and only one reason - to avoid freeing ofproto while rules are still > alive. This was done in commit f416c8d61601 ("ofproto: RCU postpone > rule destruction."). The goal was to allow using rules without > refcounting them within a single grace period. And that forced us > to postpone destruction of the ofproto for a single grace period. > Later commit 39c9459355b6 ("Use classifier versioning.") made it > possible for rules to be alive for more than one grace period, so > the commit made ofproto wait for 2 grace periods by double postponing. > As we can see now, that wasn't enough and we have to wait for more > than 2 grace periods in certain cases. > > Now we're introducing refcounts to wait for all rules to be destroyed > before destroying the ofproto. But what is the point of having > RCU if we already know that if refcount is zero than all rules are > already destroyed and we don't need to wait anything and could just > destroy the ofproto? > > RCU doesn't protect ofproto itself here, it actually protects rules, > i.e. keeps ofproto alive while rules alive, but we can fully replace > this with refcounts and I see no value in having RCU additionally. > > To have a full picture: right now we also have groups protected by > RCU, so we need to refcount ofproto for them too. But that is almost > exactly same situation as we have with rules. > Thanks Guohongzhi, Hepeng, and Ilya, I learned a lot from your discussions. I think refcount and RCU are mutually exclusive. They are two different ways of doing synchronization and somehow we mix them together by using RCU to optimize refcount. Before the commit f416c8d61601 ("ofproto: RCU postpone rule destruction."), we are using refcount to protect rules, and as a result every time OVS references a rule, we have to take refcount. It works ok but this probably has high performance overhead because it's doing atomic operations. So the commit f416c8d61601 optimizes it by doing 1) not taking refcount of rule "within a grace period" 2) introduce RCU for rule The assumption is that a grace period is like refcount == 0, so it's valid to do so. A side effect is that ofproto_destroy__() needs to be postponed. Note that if a rule is alive across grace period, it has to take refcount. However, later commit 39c9459355b6 ("Use classifier versioning.") makes rule alive across grace period. It breaks the first commit's assumption and it has to introduce double postponing for ofproto, which we found out is the problem now. Since my point is RCU and refcount are mutually exclusive (A grace period is like refcount ==0) , I agree with Ilya that we can just use refcount to fix ofproto issue. Regards, William ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
On 2/25/21 4:32 AM, Guohongzhi (Russell Lab) wrote: > Refcount and RCU are not mutually exclusive. > IMO, the main reason for this problem is that the rule uses both refcount and > rcu, while the ofproto uses only rcu, and the rule retains the pointer of the > ofproto. More importantly, ofproto cannot guarantee a longer grace period > than the rule. > I understand that refcount and RCU are not mutually exclusive. However, in this particular case RCU for ofproto was introduced for one and only one reason - to avoid freeing ofproto while rules are still alive. This was done in commit f416c8d61601 ("ofproto: RCU postpone rule destruction."). The goal was to allow using rules without refcounting them within a single grace period. And that forced us to postpone destruction of the ofproto for a single grace period. Later commit 39c9459355b6 ("Use classifier versioning.") made it possible for rules to be alive for more than one grace period, so the commit made ofproto wait for 2 grace periods by double postponing. As we can see now, that wasn't enough and we have to wait for more than 2 grace periods in certain cases. Now we're introducing refcounts to wait for all rules to be destroyed before destroying the ofproto. But what is the point of having RCU if we already know that if refcount is zero than all rules are already destroyed and we don't need to wait anything and could just destroy the ofproto? RCU doesn't protect ofproto itself here, it actually protects rules, i.e. keeps ofproto alive while rules alive, but we can fully replace this with refcounts and I see no value in having RCU additionally. To have a full picture: right now we also have groups protected by RCU, so we need to refcount ofproto for them too. But that is almost exactly same situation as we have with rules. > > -Original Message- > From: 贺鹏 [mailto:xnhp0...@gmail.com] > Sent: 25/2/2021 11:14 > To: Ilya Maximets > Cc: hepeng.0320 ; > ; Guohongzhi (Russell Lab) > Subject: Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix > crash at ofproto_destroy > > Ilya Maximets 于2021年2月25日周四 上午12:01写道: >> >> On 2/24/21 3:29 PM, Ilya Maximets wrote: >>> On 7/17/20 3:50 AM, hepeng.0320 wrote: >>>> From: Peng He >>>> >>>> The call stack of rule_destroy_cb: >>>> >>>> remove_rules_postponed (one grace period) >>>> -> remove_rule_rcu >>>> -> remove_rule_rcu__ >>>> -> ofproto_rule_unref -> ref count != 1 >>>> -> ... more grace periods. >>>> -> rule_destroy_cb (> 2 grace periods) >>>> -> free >>>> >>>> Currently ofproto_destory waits only two grace periods, this means >>>> when rule_destroy_cb is called, the ofproto might have been already >>>> destroyed. >>>> >>>> This patch adds a refcount for ofproto to ensure ofproto exists >>>> when it is needed to call free functions. >>>> >>>> This patch fixes the crashes found: >>>> Program terminated with signal SIGSEGV, Segmentation fault. >>>> 0 0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at >>>> ofproto/ofproto.c:2956 >>>> 1 0x562a552623d6 in ovsrcu_call_postponed () at >>>> lib/ovs-rcu.c:348 >>>> 2 0x562a55262504 in ovsrcu_postpone_thread (arg=>>> out>) at lib/ovs-rcu.c:364 >>>> 3 0x562a55264aef in ovsthread_wrapper (aux_=) >>>> at lib/ovs-thread.c:383 >>>> 4 0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at >>>> pthread_create.c:456 >>>> 5 0x7febe6990d0f in clone () at >>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 >>>> (gdb) p rule->ofproto->ofproto_class >>>> $3 = (const struct ofproto_class *) 0x0 >>>> >>>> Also: >>>> 0 ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, >>>> table_id=0 '\000', n_matches=5, n_misses=0) at >>>> ofproto/ofproto-dpif.c:4310 >>>> 1 0x558354c68514 in xlate_push_stats_entry >>>> (entry=entry@entry=0x7ff488031398, >>>> stats=stats@entry=0x7ff49af30b60) at >>>> ofproto/ofproto-dpif-xlate-cache.c:99 >>>> 2 0x558354c686f3 in xlate_push_stats (xcache=, >>>> stats=stats@entry=0x7ff49af30b60) at >>>> ofproto/ofproto-dpif-xlate-cache.c:181 >>>> 3 0x558354c5411a in revalidate_ukey >>>> (udpif=udpif@entry=0x5583583baba0, ukey=ukey@entry=0x7ff47809f770, >>>> st
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
Refcount and RCU are not mutually exclusive. IMO, the main reason for this problem is that the rule uses both refcount and rcu, while the ofproto uses only rcu, and the rule retains the pointer of the ofproto. More importantly, ofproto cannot guarantee a longer grace period than the rule. -Original Message- From: 贺鹏 [mailto:xnhp0...@gmail.com] Sent: 25/2/2021 11:14 To: Ilya Maximets Cc: hepeng.0320 ; ; Guohongzhi (Russell Lab) Subject: Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy Ilya Maximets 于2021年2月25日周四 上午12:01写道: > > On 2/24/21 3:29 PM, Ilya Maximets wrote: > > On 7/17/20 3:50 AM, hepeng.0320 wrote: > >> From: Peng He > >> > >> The call stack of rule_destroy_cb: > >> > >> remove_rules_postponed (one grace period) > >> -> remove_rule_rcu > >> -> remove_rule_rcu__ > >> -> ofproto_rule_unref -> ref count != 1 > >> -> ... more grace periods. > >> -> rule_destroy_cb (> 2 grace periods) > >> -> free > >> > >> Currently ofproto_destory waits only two grace periods, this means > >> when rule_destroy_cb is called, the ofproto might have been already > >> destroyed. > >> > >> This patch adds a refcount for ofproto to ensure ofproto exists > >> when it is needed to call free functions. > >> > >> This patch fixes the crashes found: > >> Program terminated with signal SIGSEGV, Segmentation fault. > >> 0 0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at > >> ofproto/ofproto.c:2956 > >> 1 0x562a552623d6 in ovsrcu_call_postponed () at > >> lib/ovs-rcu.c:348 > >> 2 0x562a55262504 in ovsrcu_postpone_thread (arg= >> out>) at lib/ovs-rcu.c:364 > >> 3 0x562a55264aef in ovsthread_wrapper (aux_=) > >> at lib/ovs-thread.c:383 > >> 4 0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at > >> pthread_create.c:456 > >> 5 0x7febe6990d0f in clone () at > >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 > >> (gdb) p rule->ofproto->ofproto_class > >> $3 = (const struct ofproto_class *) 0x0 > >> > >> Also: > >> 0 ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, > >> table_id=0 '\000', n_matches=5, n_misses=0) at > >> ofproto/ofproto-dpif.c:4310 > >> 1 0x558354c68514 in xlate_push_stats_entry > >> (entry=entry@entry=0x7ff488031398, > >> stats=stats@entry=0x7ff49af30b60) at > >> ofproto/ofproto-dpif-xlate-cache.c:99 > >> 2 0x558354c686f3 in xlate_push_stats (xcache=, > >> stats=stats@entry=0x7ff49af30b60) at > >> ofproto/ofproto-dpif-xlate-cache.c:181 > >> 3 0x558354c5411a in revalidate_ukey > >> (udpif=udpif@entry=0x5583583baba0, ukey=ukey@entry=0x7ff47809f770, > >> stats=stats@entry=0x7ff49af32128, > >> odp_actions=odp_actions@entry=0x7ff49af30c50, > >> reval_seq=reval_seq@entry=275670456, > >>recircs=recircs@entry=0x7ff49af30cd0) at > >> ofproto/ofproto-dpif-upcall.c:2292 > >> 4 0x558354c57cbc in revalidate (revalidator=) > >> at ofproto/ofproto-dpif-upcall.c:2681 > >> 5 0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at > >> ofproto/ofproto-dpif-upcall.c:934 > >> 6 0x558354d24aef in ovsthread_wrapper (aux_=) > >> at lib/ovs-thread.c:383 > >> 7 0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at > >> pthread_create.c:456 > >> 8 0x7ff4a3fd3d0f in clone () at > >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 > >> (gdb) p ofproto->up.ofproto_class > >> $3 = (const struct ofproto_class *) 0x0 > >> > >> Signed-off-by: Peng He > >> --- > > CC: Hongzhi Guo > > And I just remembered that there is a very similar patch that was > submitted to the mail-list several months betore this one. > Ben started review, but didn't finish: > > http://patchwork.ozlabs.org/project/openvswitch/patch/20200306130555.1 > 9884-1-guohongz...@huawei.com/ > > Both patches has smilar issues, i.e. mixing RCU with refcounters and > not covering all the cases. > > There are at least 2 major places where ofproto needs to be refcounted: > 1. rules >1.a xlate caches? > 2. ofproto groups > > IMO, mixing RCU with refcounts basically means that we do not have a > clear understanding on how it works and using RCU as a safety net. > We shuld stop using RCU for orproto and ju
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
Ilya Maximets 于2021年2月25日周四 上午12:01写道: > > On 2/24/21 3:29 PM, Ilya Maximets wrote: > > On 7/17/20 3:50 AM, hepeng.0320 wrote: > >> From: Peng He > >> > >> The call stack of rule_destroy_cb: > >> > >> remove_rules_postponed (one grace period) > >> -> remove_rule_rcu > >> -> remove_rule_rcu__ > >> -> ofproto_rule_unref -> ref count != 1 > >> -> ... more grace periods. > >> -> rule_destroy_cb (> 2 grace periods) > >> -> free > >> > >> Currently ofproto_destory waits only two grace periods, this means when > >> rule_destroy_cb is called, the ofproto might have been already destroyed. > >> > >> This patch adds a refcount for ofproto to ensure ofproto exists when it > >> is needed to call free functions. > >> > >> This patch fixes the crashes found: > >> Program terminated with signal SIGSEGV, Segmentation fault. > >> 0 0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at > >> ofproto/ofproto.c:2956 > >> 1 0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348 > >> 2 0x562a55262504 in ovsrcu_postpone_thread (arg=) at > >> lib/ovs-rcu.c:364 > >> 3 0x562a55264aef in ovsthread_wrapper (aux_=) at > >> lib/ovs-thread.c:383 > >> 4 0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at > >> pthread_create.c:456 > >> 5 0x7febe6990d0f in clone () at > >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 > >> (gdb) p rule->ofproto->ofproto_class > >> $3 = (const struct ofproto_class *) 0x0 > >> > >> Also: > >> 0 ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 > >> '\000', n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310 > >> 1 0x558354c68514 in xlate_push_stats_entry > >> (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at > >> ofproto/ofproto-dpif-xlate-cache.c:99 > >> 2 0x558354c686f3 in xlate_push_stats (xcache=, > >> stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181 > >> 3 0x558354c5411a in revalidate_ukey > >> (udpif=udpif@entry=0x5583583baba0, ukey=ukey@entry=0x7ff47809f770, > >> stats=stats@entry=0x7ff49af32128, > >> odp_actions=odp_actions@entry=0x7ff49af30c50, > >> reval_seq=reval_seq@entry=275670456, > >>recircs=recircs@entry=0x7ff49af30cd0) at > >> ofproto/ofproto-dpif-upcall.c:2292 > >> 4 0x558354c57cbc in revalidate (revalidator=) at > >> ofproto/ofproto-dpif-upcall.c:2681 > >> 5 0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at > >> ofproto/ofproto-dpif-upcall.c:934 > >> 6 0x558354d24aef in ovsthread_wrapper (aux_=) at > >> lib/ovs-thread.c:383 > >> 7 0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at > >> pthread_create.c:456 > >> 8 0x7ff4a3fd3d0f in clone () at > >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 > >> (gdb) p ofproto->up.ofproto_class > >> $3 = (const struct ofproto_class *) 0x0 > >> > >> Signed-off-by: Peng He > >> --- > > CC: Hongzhi Guo > > And I just remembered that there is a very similar patch that > was submitted to the mail-list several months betore this one. > Ben started review, but didn't finish: > > http://patchwork.ozlabs.org/project/openvswitch/patch/20200306130555.19884-1-guohongz...@huawei.com/ > > Both patches has smilar issues, i.e. mixing RCU with refcounters > and not covering all the cases. > > There are at least 2 major places where ofproto needs to be refcounted: > 1. rules >1.a xlate caches? > 2. ofproto groups > > IMO, mixing RCU with refcounts basically means that we do not have a > clear understanding on how it works and using RCU as a safety net. > We shuld stop using RCU for orproto and just have refcounts. Once > ofproto refcount reaches zero it should be immediately destroyed. > All the places where we need to increase refcount needs to be tracked > down. IMO, I think RCU and refcounts are not mutually exclusive. In kernel connntack system such as IPVS, sessions are used with both RCU and refcounts. Essentially, the RCU implies a refcount if there are no places holding a pointer to the object other than hash tables (or other RCU enabled data structure), thus by removing it from hash tables, and wait a grace period, we can safely free the object because we are sure that no one holds a pointer to the object. Removing it from hash tables is like reducing the refcount by 1, and since hash tables are the only place that holds the pointer and this is the only reference that lives longer than a grace period, so by removing it, we are sure that no other ref can live longer than a grace period, and after the period, we are safe to free it. If we use only refcounts, every time we access the object, we have to add refcounts. But if use RCU combined with refcount, we only need to add ref, if there are other places than the hashtables that live longer than one grace period. So I think using RCU with refcounts actually makes code changes less, as mostly, the access to object is ephemeral, you lookup
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
It's true that the patch has been reviewed by ben and a question has been raised.I have replied the question, but I did not get a follow-up reply. In short, I agree with you very much and would be happy to prepare a complete solution for this issue. -Original Message- From: Ilya Maximets [mailto:i.maxim...@ovn.org] Sent: 2/25/2021 0:02 AM To: Ilya Maximets ; hepeng.0320 ; d...@openvswitch.org Cc: William Tu ; Yifeng Sun ; Guohongzhi (Russell Lab) Subject: Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy On 2/24/21 3:29 PM, Ilya Maximets wrote: > On 7/17/20 3:50 AM, hepeng.0320 wrote: >> From: Peng He >> >> The call stack of rule_destroy_cb: >> >> remove_rules_postponed (one grace period) >> -> remove_rule_rcu >> -> remove_rule_rcu__ >> -> ofproto_rule_unref -> ref count != 1 >> -> ... more grace periods. >> -> rule_destroy_cb (> 2 grace periods) >> -> free >> >> Currently ofproto_destory waits only two grace periods, this means >> when rule_destroy_cb is called, the ofproto might have been already >> destroyed. >> >> This patch adds a refcount for ofproto to ensure ofproto exists when >> it is needed to call free functions. >> >> This patch fixes the crashes found: >> Program terminated with signal SIGSEGV, Segmentation fault. >> 0 0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at >> ofproto/ofproto.c:2956 >> 1 0x562a552623d6 in ovsrcu_call_postponed () at >> lib/ovs-rcu.c:348 >> 2 0x562a55262504 in ovsrcu_postpone_thread (arg=) >> at lib/ovs-rcu.c:364 >> 3 0x562a55264aef in ovsthread_wrapper (aux_=) at >> lib/ovs-thread.c:383 >> 4 0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at >> pthread_create.c:456 >> 5 0x7febe6990d0f in clone () at >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 >> (gdb) p rule->ofproto->ofproto_class >> $3 = (const struct ofproto_class *) 0x0 >> >> Also: >> 0 ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, >> table_id=0 '\000', n_matches=5, n_misses=0) at >> ofproto/ofproto-dpif.c:4310 >> 1 0x558354c68514 in xlate_push_stats_entry >> (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) >> at ofproto/ofproto-dpif-xlate-cache.c:99 >> 2 0x558354c686f3 in xlate_push_stats (xcache=, >> stats=stats@entry=0x7ff49af30b60) at >> ofproto/ofproto-dpif-xlate-cache.c:181 >> 3 0x558354c5411a in revalidate_ukey (udpif=udpif@entry=0x5583583baba0, >> ukey=ukey@entry=0x7ff47809f770, stats=stats@entry=0x7ff49af32128, >> odp_actions=odp_actions@entry=0x7ff49af30c50, >> reval_seq=reval_seq@entry=275670456, >>recircs=recircs@entry=0x7ff49af30cd0) at >> ofproto/ofproto-dpif-upcall.c:2292 >> 4 0x558354c57cbc in revalidate (revalidator=) at >> ofproto/ofproto-dpif-upcall.c:2681 >> 5 0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at >> ofproto/ofproto-dpif-upcall.c:934 >> 6 0x558354d24aef in ovsthread_wrapper (aux_=) at >> lib/ovs-thread.c:383 >> 7 0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at >> pthread_create.c:456 >> 8 0x7ff4a3fd3d0f in clone () at >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 >> (gdb) p ofproto->up.ofproto_class >> $3 = (const struct ofproto_class *) 0x0 >> >> Signed-off-by: Peng He >> --- CC: Hongzhi Guo And I just remembered that there is a very similar patch that was submitted to the mail-list several months betore this one. Ben started review, but didn't finish: http://patchwork.ozlabs.org/project/openvswitch/patch/20200306130555.19884-1-guohongz...@huawei.com/ Both patches has smilar issues, i.e. mixing RCU with refcounters and not covering all the cases. There are at least 2 major places where ofproto needs to be refcounted: 1. rules 1.a xlate caches? 2. ofproto groups IMO, mixing RCU with refcounts basically means that we do not have a clear understanding on how it works and using RCU as a safety net. We shuld stop using RCU for orproto and just have refcounts. Once ofproto refcount reaches zero it should be immediately destroyed. All the places where we need to increase refcount needs to be tracked down. It'll be great if you can cooperate to prepare a complete solution for this issue. Best regards, Ilya Maximets. ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
On 2/24/21 3:29 PM, Ilya Maximets wrote: > On 7/17/20 3:50 AM, hepeng.0320 wrote: >> From: Peng He >> >> The call stack of rule_destroy_cb: >> >> remove_rules_postponed (one grace period) >> -> remove_rule_rcu >> -> remove_rule_rcu__ >> -> ofproto_rule_unref -> ref count != 1 >> -> ... more grace periods. >> -> rule_destroy_cb (> 2 grace periods) >> -> free >> >> Currently ofproto_destory waits only two grace periods, this means when >> rule_destroy_cb is called, the ofproto might have been already destroyed. >> >> This patch adds a refcount for ofproto to ensure ofproto exists when it >> is needed to call free functions. >> >> This patch fixes the crashes found: >> Program terminated with signal SIGSEGV, Segmentation fault. >> 0 0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at >> ofproto/ofproto.c:2956 >> 1 0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348 >> 2 0x562a55262504 in ovsrcu_postpone_thread (arg=) at >> lib/ovs-rcu.c:364 >> 3 0x562a55264aef in ovsthread_wrapper (aux_=) at >> lib/ovs-thread.c:383 >> 4 0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at >> pthread_create.c:456 >> 5 0x7febe6990d0f in clone () at >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 >> (gdb) p rule->ofproto->ofproto_class >> $3 = (const struct ofproto_class *) 0x0 >> >> Also: >> 0 ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 >> '\000', n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310 >> 1 0x558354c68514 in xlate_push_stats_entry >> (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at >> ofproto/ofproto-dpif-xlate-cache.c:99 >> 2 0x558354c686f3 in xlate_push_stats (xcache=, >> stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181 >> 3 0x558354c5411a in revalidate_ukey (udpif=udpif@entry=0x5583583baba0, >> ukey=ukey@entry=0x7ff47809f770, stats=stats@entry=0x7ff49af32128, >> odp_actions=odp_actions@entry=0x7ff49af30c50, >> reval_seq=reval_seq@entry=275670456, >>recircs=recircs@entry=0x7ff49af30cd0) at >> ofproto/ofproto-dpif-upcall.c:2292 >> 4 0x558354c57cbc in revalidate (revalidator=) at >> ofproto/ofproto-dpif-upcall.c:2681 >> 5 0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at >> ofproto/ofproto-dpif-upcall.c:934 >> 6 0x558354d24aef in ovsthread_wrapper (aux_=) at >> lib/ovs-thread.c:383 >> 7 0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at >> pthread_create.c:456 >> 8 0x7ff4a3fd3d0f in clone () at >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 >> (gdb) p ofproto->up.ofproto_class >> $3 = (const struct ofproto_class *) 0x0 >> >> Signed-off-by: Peng He >> --- CC: Hongzhi Guo And I just remembered that there is a very similar patch that was submitted to the mail-list several months betore this one. Ben started review, but didn't finish: http://patchwork.ozlabs.org/project/openvswitch/patch/20200306130555.19884-1-guohongz...@huawei.com/ Both patches has smilar issues, i.e. mixing RCU with refcounters and not covering all the cases. There are at least 2 major places where ofproto needs to be refcounted: 1. rules 1.a xlate caches? 2. ofproto groups IMO, mixing RCU with refcounts basically means that we do not have a clear understanding on how it works and using RCU as a safety net. We shuld stop using RCU for orproto and just have refcounts. Once ofproto refcount reaches zero it should be immediately destroyed. All the places where we need to increase refcount needs to be tracked down. It'll be great if you can cooperate to prepare a complete solution for this issue. Best regards, Ilya Maximets. ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
On 7/17/20 3:50 AM, hepeng.0320 wrote: > From: Peng He > > The call stack of rule_destroy_cb: > > remove_rules_postponed (one grace period) > -> remove_rule_rcu > -> remove_rule_rcu__ > -> ofproto_rule_unref -> ref count != 1 > -> ... more grace periods. > -> rule_destroy_cb (> 2 grace periods) > -> free > > Currently ofproto_destory waits only two grace periods, this means when > rule_destroy_cb is called, the ofproto might have been already destroyed. > > This patch adds a refcount for ofproto to ensure ofproto exists when it > is needed to call free functions. > > This patch fixes the crashes found: > Program terminated with signal SIGSEGV, Segmentation fault. > 0 0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at > ofproto/ofproto.c:2956 > 1 0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348 > 2 0x562a55262504 in ovsrcu_postpone_thread (arg=) at > lib/ovs-rcu.c:364 > 3 0x562a55264aef in ovsthread_wrapper (aux_=) at > lib/ovs-thread.c:383 > 4 0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at > pthread_create.c:456 > 5 0x7febe6990d0f in clone () at > ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 > (gdb) p rule->ofproto->ofproto_class > $3 = (const struct ofproto_class *) 0x0 > > Also: > 0 ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 > '\000', n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310 > 1 0x558354c68514 in xlate_push_stats_entry > (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at > ofproto/ofproto-dpif-xlate-cache.c:99 > 2 0x558354c686f3 in xlate_push_stats (xcache=, > stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181 > 3 0x558354c5411a in revalidate_ukey (udpif=udpif@entry=0x5583583baba0, > ukey=ukey@entry=0x7ff47809f770, stats=stats@entry=0x7ff49af32128, > odp_actions=odp_actions@entry=0x7ff49af30c50, > reval_seq=reval_seq@entry=275670456, >recircs=recircs@entry=0x7ff49af30cd0) at ofproto/ofproto-dpif-upcall.c:2292 > 4 0x558354c57cbc in revalidate (revalidator=) at > ofproto/ofproto-dpif-upcall.c:2681 > 5 0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at > ofproto/ofproto-dpif-upcall.c:934 > 6 0x558354d24aef in ovsthread_wrapper (aux_=) at > lib/ovs-thread.c:383 > 7 0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at > pthread_create.c:456 > 8 0x7ff4a3fd3d0f in clone () at > ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 > (gdb) p ofproto->up.ofproto_class > $3 = (const struct ofproto_class *) 0x0 > > Signed-off-by: Peng He > --- > ofproto/ofproto-dpif-xlate-cache.c | 1 + > ofproto/ofproto-dpif.c | 22 > ofproto/ofproto-provider.h | 2 ++ > ofproto/ofproto.c | 41 +++--- > ofproto/ofproto.h | 4 +++ > 5 files changed, 56 insertions(+), 14 deletions(-) > > diff --git a/ofproto/ofproto-dpif-xlate-cache.c > b/ofproto/ofproto-dpif-xlate-cache.c > index dcc91cb38..0deee365d 100644 > --- a/ofproto/ofproto-dpif-xlate-cache.c > +++ b/ofproto/ofproto-dpif-xlate-cache.c > @@ -209,6 +209,7 @@ xlate_cache_clear_entry(struct xc_entry *entry) > { > switch (entry->type) { > case XC_TABLE: > +ofproto_unref(&(entry->table.ofproto->up)); > break; > case XC_RULE: > ofproto_rule_unref(>rule->up); > diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c > index 4f0638f23..6480b3c78 100644 > --- a/ofproto/ofproto-dpif.c > +++ b/ofproto/ofproto-dpif.c > @@ -4414,11 +4414,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif > *ofproto, > } > if (xcache) { > struct xc_entry *entry; > - > -entry = xlate_cache_add_entry(xcache, XC_TABLE); > -entry->table.ofproto = ofproto; > -entry->table.id = *table_id; > -entry->table.match = true; > +if (ofproto_try_ref(>up)) { > +entry = xlate_cache_add_entry(xcache, XC_TABLE); You're handling only XC_TABLE cache entries. Do we need to ref ofproto for other types of cache? XC_NORMAL holds the pointer too. > +entry->table.ofproto = ofproto; > +entry->table.id = *table_id; > +entry->table.match = true; > +} > } > return rule; > } > @@ -4450,11 +4451,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif > *ofproto, > } > if (xcache) { > struct xc_entry *entry; > - > -entry = xlate_cache_add_entry(xcache, XC_TABLE); > -entry->table.ofproto = ofproto; > -entry->table.id = next_id; > -entry->table.match = (rule != NULL); > +if (ofproto_try_ref(>up)) { > +entry = xlate_cache_add_entry(xcache,
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
On 7/16/2020 6:50 PM, hepeng.0320 wrote: From: Peng He The call stack of rule_destroy_cb: remove_rules_postponed (one grace period) -> remove_rule_rcu -> remove_rule_rcu__ -> ofproto_rule_unref -> ref count != 1 -> ... more grace periods. -> rule_destroy_cb (> 2 grace periods) -> free Currently ofproto_destory waits only two grace periods, this means when rule_destroy_cb is called, the ofproto might have been already destroyed. This patch adds a refcount for ofproto to ensure ofproto exists when it is needed to call free functions. This patch fixes the crashes found: Program terminated with signal SIGSEGV, Segmentation fault. 0 0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at ofproto/ofproto.c:2956 1 0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348 2 0x562a55262504 in ovsrcu_postpone_thread (arg=) at lib/ovs-rcu.c:364 3 0x562a55264aef in ovsthread_wrapper (aux_=) at lib/ovs-thread.c:383 4 0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at pthread_create.c:456 5 0x7febe6990d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 (gdb) p rule->ofproto->ofproto_class $3 = (const struct ofproto_class *) 0x0 Also: 0 ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 '\000', n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310 1 0x558354c68514 in xlate_push_stats_entry (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:99 2 0x558354c686f3 in xlate_push_stats (xcache=, stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181 3 0x558354c5411a in revalidate_ukey (udpif=udpif@entry=0x5583583baba0, ukey=ukey@entry=0x7ff47809f770, stats=stats@entry=0x7ff49af32128, odp_actions=odp_actions@entry=0x7ff49af30c50, reval_seq=reval_seq@entry=275670456, recircs=recircs@entry=0x7ff49af30cd0) at ofproto/ofproto-dpif-upcall.c:2292 4 0x558354c57cbc in revalidate (revalidator=) at ofproto/ofproto-dpif-upcall.c:2681 5 0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at ofproto/ofproto-dpif-upcall.c:934 6 0x558354d24aef in ovsthread_wrapper (aux_=) at lib/ovs-thread.c:383 7 0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at pthread_create.c:456 8 0x7ff4a3fd3d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 (gdb) p ofproto->up.ofproto_class $3 = (const struct ofproto_class *) 0x0 Signed-off-by: Peng He Hi Peng, thank you for working on this. I was wondering if you have a reliable way to reproduce this bug? Besides the checkpatch errors, overall the code looks fine to me but I'd like to be able to test it. Have you considered adding a test for it in tests/ofproto.at? Couple other comments below. Thanks, - Greg --- ofproto/ofproto-dpif-xlate-cache.c | 1 + ofproto/ofproto-dpif.c | 22 ofproto/ofproto-provider.h | 2 ++ ofproto/ofproto.c | 41 +++--- ofproto/ofproto.h | 4 +++ 5 files changed, 56 insertions(+), 14 deletions(-) diff --git a/ofproto/ofproto-dpif-xlate-cache.c b/ofproto/ofproto-dpif-xlate-cache.c index dcc91cb38..0deee365d 100644 --- a/ofproto/ofproto-dpif-xlate-cache.c +++ b/ofproto/ofproto-dpif-xlate-cache.c @@ -209,6 +209,7 @@ xlate_cache_clear_entry(struct xc_entry *entry) { switch (entry->type) { case XC_TABLE: +ofproto_unref(&(entry->table.ofproto->up)); break; case XC_RULE: ofproto_rule_unref(>rule->up); diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c index 4f0638f23..6480b3c78 100644 --- a/ofproto/ofproto-dpif.c +++ b/ofproto/ofproto-dpif.c @@ -4414,11 +4414,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif *ofproto, } if (xcache) { struct xc_entry *entry; - -entry = xlate_cache_add_entry(xcache, XC_TABLE); -entry->table.ofproto = ofproto; -entry->table.id = *table_id; -entry->table.match = true; +if (ofproto_try_ref(>up)) { +entry = xlate_cache_add_entry(xcache, XC_TABLE); +entry->table.ofproto = ofproto; +entry->table.id = *table_id; +entry->table.match = true; +} } return rule; } @@ -4450,11 +4451,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif *ofproto, } if (xcache) { struct xc_entry *entry; - -entry = xlate_cache_add_entry(xcache, XC_TABLE); -entry->table.ofproto = ofproto; -entry->table.id = next_id; -entry->table.match = (rule != NULL); +if (ofproto_try_ref(>up)) { +entry = xlate_cache_add_entry(xcache, XC_TABLE); +
Re: [ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
Bleep bloop. Greetings hepeng.0320, I am a robot and I have tried out your patch. Thanks for your contribution. I encountered some error that I wasn't expecting. See the details below. checkpatch: ERROR: Inappropriate bracing around statement #182 FILE: ofproto/ofproto.c:1734: if (ovs_refcount_read(>refcount) != 1) ERROR: Inappropriate bracing around statement #184 FILE: ofproto/ofproto.c:1736: else Lines checked: 222, Warnings: 0, Errors: 2 Please check this out. If you feel there has been an error, please email acon...@redhat.com Thanks, 0-day Robot ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [ovs-dev v2] ofproto: add refcount to ofproto to fix crash at ofproto_destroy
From: Peng He The call stack of rule_destroy_cb: remove_rules_postponed (one grace period) -> remove_rule_rcu -> remove_rule_rcu__ -> ofproto_rule_unref -> ref count != 1 -> ... more grace periods. -> rule_destroy_cb (> 2 grace periods) -> free Currently ofproto_destory waits only two grace periods, this means when rule_destroy_cb is called, the ofproto might have been already destroyed. This patch adds a refcount for ofproto to ensure ofproto exists when it is needed to call free functions. This patch fixes the crashes found: Program terminated with signal SIGSEGV, Segmentation fault. 0 0x562a55169e49 in rule_destroy_cb (rule=0x562a598be2c0) at ofproto/ofproto.c:2956 1 0x562a552623d6 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348 2 0x562a55262504 in ovsrcu_postpone_thread (arg=) at lib/ovs-rcu.c:364 3 0x562a55264aef in ovsthread_wrapper (aux_=) at lib/ovs-thread.c:383 4 0x7febe715a4a4 in start_thread (arg=0x7febe4c1c700) at pthread_create.c:456 5 0x7febe6990d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 (gdb) p rule->ofproto->ofproto_class $3 = (const struct ofproto_class *) 0x0 Also: 0 ofproto_dpif_credit_table_stats (ofproto=0x5583583332b0, table_id=0 '\000', n_matches=5, n_misses=0) at ofproto/ofproto-dpif.c:4310 1 0x558354c68514 in xlate_push_stats_entry (entry=entry@entry=0x7ff488031398, stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:99 2 0x558354c686f3 in xlate_push_stats (xcache=, stats=stats@entry=0x7ff49af30b60) at ofproto/ofproto-dpif-xlate-cache.c:181 3 0x558354c5411a in revalidate_ukey (udpif=udpif@entry=0x5583583baba0, ukey=ukey@entry=0x7ff47809f770, stats=stats@entry=0x7ff49af32128, odp_actions=odp_actions@entry=0x7ff49af30c50, reval_seq=reval_seq@entry=275670456, recircs=recircs@entry=0x7ff49af30cd0) at ofproto/ofproto-dpif-upcall.c:2292 4 0x558354c57cbc in revalidate (revalidator=) at ofproto/ofproto-dpif-upcall.c:2681 5 0x558354c57f8e in udpif_revalidator (arg=0x5583583d2a90) at ofproto/ofproto-dpif-upcall.c:934 6 0x558354d24aef in ovsthread_wrapper (aux_=) at lib/ovs-thread.c:383 7 0x7ff4a479d4a4 in start_thread (arg=0x7ff49af35700) at pthread_create.c:456 8 0x7ff4a3fd3d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 (gdb) p ofproto->up.ofproto_class $3 = (const struct ofproto_class *) 0x0 Signed-off-by: Peng He --- ofproto/ofproto-dpif-xlate-cache.c | 1 + ofproto/ofproto-dpif.c | 22 ofproto/ofproto-provider.h | 2 ++ ofproto/ofproto.c | 41 +++--- ofproto/ofproto.h | 4 +++ 5 files changed, 56 insertions(+), 14 deletions(-) diff --git a/ofproto/ofproto-dpif-xlate-cache.c b/ofproto/ofproto-dpif-xlate-cache.c index dcc91cb38..0deee365d 100644 --- a/ofproto/ofproto-dpif-xlate-cache.c +++ b/ofproto/ofproto-dpif-xlate-cache.c @@ -209,6 +209,7 @@ xlate_cache_clear_entry(struct xc_entry *entry) { switch (entry->type) { case XC_TABLE: +ofproto_unref(&(entry->table.ofproto->up)); break; case XC_RULE: ofproto_rule_unref(>rule->up); diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c index 4f0638f23..6480b3c78 100644 --- a/ofproto/ofproto-dpif.c +++ b/ofproto/ofproto-dpif.c @@ -4414,11 +4414,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif *ofproto, } if (xcache) { struct xc_entry *entry; - -entry = xlate_cache_add_entry(xcache, XC_TABLE); -entry->table.ofproto = ofproto; -entry->table.id = *table_id; -entry->table.match = true; +if (ofproto_try_ref(>up)) { +entry = xlate_cache_add_entry(xcache, XC_TABLE); +entry->table.ofproto = ofproto; +entry->table.id = *table_id; +entry->table.match = true; +} } return rule; } @@ -4450,11 +4451,12 @@ rule_dpif_lookup_from_table(struct ofproto_dpif *ofproto, } if (xcache) { struct xc_entry *entry; - -entry = xlate_cache_add_entry(xcache, XC_TABLE); -entry->table.ofproto = ofproto; -entry->table.id = next_id; -entry->table.match = (rule != NULL); +if (ofproto_try_ref(>up)) { +entry = xlate_cache_add_entry(xcache, XC_TABLE); +entry->table.ofproto = ofproto; +entry->table.id = next_id; +entry->table.match = (rule != NULL); +} } if (rule) { goto out; /* Match. */ diff --git a/ofproto/ofproto-provider.h b/ofproto/ofproto-provider.h index afecb24cb..34074fd62 100644 --- a/ofproto/ofproto-provider.h +++ b/ofproto/ofproto-provider.h