Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
On Thu, Oct 26, 2017 at 6:58 AM, Paul E. McKenney wrote:
>
> So when removing an entire chain, you flush any queued workqueue handlers
> to make sure that any operations using elements on that chain have also
> completed, correct?  This might also motivate the rcu_barrier() calls.

Yes, we can only free it after all pending RCU callbacks and works finish,
because they are in the same chain and we are freeing the head of the chain
in tcf_block_put().

>
> Seems to me that your proposed patch is at least worth a try, then.
>

Thank you! Then it makes some basic sense. I did a quick test and don't see
any crash or lockdep warning.
Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
On Wed, Oct 25, 2017 at 09:49:09PM -0700, Cong Wang wrote:
> On Wed, Oct 25, 2017 at 5:19 PM, Paul E. McKenney wrote:
> > On Wed, Oct 25, 2017 at 03:37:40PM -0700, Cong Wang wrote:
> >> My solution is introducing a workqueue for tc filters
> >> and let each RCU callback defer the work to this
> >> workqueue. I solve the flush_workqueue() deadlock
> >> by queuing another work in the same workqueue
> >> at the end, so the execution order should be as same
> >> as it is now. The ugly part is now tcf_block_put() which
> >> looks like below:
> >>
> >>
> >> static void tcf_block_put_final(struct work_struct *work)
> >> {
> >>         struct tcf_block *block = container_of(work, struct tcf_block, work);
> >>         struct tcf_chain *chain, *tmp;
> >>
> >>         /* At this point, all the chains should have refcnt == 1. */
> >>         rtnl_lock();
> >>         list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
> >>                 tcf_chain_put(chain);
> >>         rtnl_unlock();
> >>         kfree(block);
> >> }
> >
> > I am guessing that tcf_chain_put() sometimes does a call_rcu(),
> > and the callback function in turn calls schedule_work(), and that
> > tcf_block_put_deferred() is the workqueue handler function.
>
> Yes, tcf_chain_put() triggers call_rcu() indirectly during flush,
> this is why we have rcu_barrier()'s in current code base.

OK, got it.

> >> static void tcf_block_put_deferred(struct work_struct *work)
> >> {
> >>         struct tcf_block *block = container_of(work, struct tcf_block, work);
> >>         struct tcf_chain *chain;
> >>
> >>         rtnl_lock();
> >>         /* Hold a refcnt for all chains, except 0, in case they are gone. */
> >>         list_for_each_entry(chain, &block->chain_list, list)
> >>                 if (chain->index)
> >>                         tcf_chain_hold(chain);
> >>
> >>         /* No race on the list, because no chain could be destroyed. */
> >>         list_for_each_entry(chain, &block->chain_list, list)
> >>                 tcf_chain_flush(chain);
> >>
> >>         INIT_WORK(&block->work, tcf_block_put_final);
> >>         /* Wait for RCU callbacks to release the reference count and make
> >>          * sure their works have been queued before this.
> >>          */
> >>         rcu_barrier();
> >
> > This one can take awhile...  Though in recent kernels it will often
> > be a bit faster than synchronize_rcu().
>
> It is already in current code base, so it is not introduced here.

Very good, then no problems with added overhead from rcu_barrier().  ;-)

> > Note that rcu_barrier() only waits for callbacks posted via call_rcu()
> > before the rcu_barrier() starts, if that matters.
>
> Yes, this is exactly what I expect.

Good.

> >>         tcf_queue_work(&block->work);
> >>         rtnl_unlock();
> >> }
> >
> > And it looks like tcf_block_put_deferred() queues itself as work as well.
> > Or maybe instead?
>
> Yes, it queues itself after all the works queued via call_rcu(),
> to ensure it is the last.

OK.

> >> void tcf_block_put(struct tcf_block *block)
> >> {
> >>         if (!block)
> >>                 return;
> >>
> >>         INIT_WORK(&block->work, tcf_block_put_deferred);
> >>         /* Wait for existing RCU callbacks to cool down, make sure their works
> >>          * have been queued before this. We can not flush pending works here
> >>          * because we are holding the RTNL lock.
> >>          */
> >>         rcu_barrier();
> >>         tcf_queue_work(&block->work);
> >> }
> >>
> >>
> >> Paul, does this make any sense to you? ;)
> >
> > I would be surprised if I fully understand the problem to be solved,
> > but my current guess is that the constraints are as follows:
> >
> > 1.      Things removed must be processed in order.
>
> Sort of, RCU callbacks themselves don't have any order, they only
> need to be serialized with RTNL lock, so we have to defer it to a
> workqueue.

OK, got it.

> What needs a strict order is tcf_block_put() vs. flying works. Before
> tcf_block_put() finishes, all the flying works must be done otherwise
> use-after-free. :-/

So when removing an entire chain, you flush any queued workqueue handlers
to make sure that any operations using elements on that chain have also
completed, correct?  This might also motivate the rcu_barrier() calls.

> > 2.      Things removed must not be processed until a grace period
> >         has elapsed.
>
> For these RCU callbacks yes, but for tcf_block_put() no, it uses
> refcnt not RCU.

OK, got it, give or take.

> > 3.      Things being processed after a grace period should not be
> >         processed concurrently with each other or with subsequent
> >         removals.
>
> Yeah, this is the cause of the problem. They have to be serialized
> with each other and with other tc action paths (holding RTNL) too.

OK, makes sense.

> > 4.      A given removal is not finalized until its reference count
> >         reaches zero.
>
> There are two kinds of
Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
On Wed, Oct 25, 2017 at 5:19 PM, Paul E. McKenney wrote:
> On Wed, Oct 25, 2017 at 03:37:40PM -0700, Cong Wang wrote:
>> My solution is introducing a workqueue for tc filters
>> and let each RCU callback defer the work to this
>> workqueue. I solve the flush_workqueue() deadlock
>> by queuing another work in the same workqueue
>> at the end, so the execution order should be as same
>> as it is now. The ugly part is now tcf_block_put() which
>> looks like below:
>>
>>
>> static void tcf_block_put_final(struct work_struct *work)
>> {
>>         struct tcf_block *block = container_of(work, struct tcf_block, work);
>>         struct tcf_chain *chain, *tmp;
>>
>>         /* At this point, all the chains should have refcnt == 1. */
>>         rtnl_lock();
>>         list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
>>                 tcf_chain_put(chain);
>>         rtnl_unlock();
>>         kfree(block);
>> }
>
> I am guessing that tcf_chain_put() sometimes does a call_rcu(),
> and the callback function in turn calls schedule_work(), and that
> tcf_block_put_deferred() is the workqueue handler function.

Yes, tcf_chain_put() triggers call_rcu() indirectly during flush,
this is why we have rcu_barrier()'s in current code base.

>
>> static void tcf_block_put_deferred(struct work_struct *work)
>> {
>>         struct tcf_block *block = container_of(work, struct tcf_block, work);
>>         struct tcf_chain *chain;
>>
>>         rtnl_lock();
>>         /* Hold a refcnt for all chains, except 0, in case they are gone. */
>>         list_for_each_entry(chain, &block->chain_list, list)
>>                 if (chain->index)
>>                         tcf_chain_hold(chain);
>>
>>         /* No race on the list, because no chain could be destroyed. */
>>         list_for_each_entry(chain, &block->chain_list, list)
>>                 tcf_chain_flush(chain);
>>
>>         INIT_WORK(&block->work, tcf_block_put_final);
>>         /* Wait for RCU callbacks to release the reference count and make
>>          * sure their works have been queued before this.
>>          */
>>         rcu_barrier();
>
> This one can take awhile...  Though in recent kernels it will often
> be a bit faster than synchronize_rcu().
>

It is already in current code base, so it is not introduced here.

> Note that rcu_barrier() only waits for callbacks posted via call_rcu()
> before the rcu_barrier() starts, if that matters.
>

Yes, this is exactly what I expect.

>>         tcf_queue_work(&block->work);
>>         rtnl_unlock();
>> }
>
> And it looks like tcf_block_put_deferred() queues itself as work as well.
> Or maybe instead?

Yes, it queues itself after all the works queued via call_rcu(),
to ensure it is the last.

>
>> void tcf_block_put(struct tcf_block *block)
>> {
>>         if (!block)
>>                 return;
>>
>>         INIT_WORK(&block->work, tcf_block_put_deferred);
>>         /* Wait for existing RCU callbacks to cool down, make sure their works
>>          * have been queued before this. We can not flush pending works here
>>          * because we are holding the RTNL lock.
>>          */
>>         rcu_barrier();
>>         tcf_queue_work(&block->work);
>> }
>>
>>
>> Paul, does this make any sense to you? ;)
>
> I would be surprised if I fully understand the problem to be solved,
> but my current guess is that the constraints are as follows:
>
> 1.      Things removed must be processed in order.
>

Sort of, RCU callbacks themselves don't have any order, they only
need to be serialized with RTNL lock, so we have to defer it to a
workqueue.

What needs a strict order is tcf_block_put() vs. flying works. Before
tcf_block_put() finishes, all the flying works must be done otherwise
use-after-free. :-/

> 2.      Things removed must not be processed until a grace period
>         has elapsed.
>

For these RCU callbacks yes, but for tcf_block_put() no, it uses
refcnt not RCU.

> 3.      Things being processed after a grace period should not be
>         processed concurrently with each other or with subsequent
>         removals.

Yeah, this is the cause of the problem. They have to be serialized
with each other and with other tc action paths (holding RTNL) too.

>
> 4.      A given removal is not finalized until its reference count
>         reaches zero.

There are two kinds of removal here:

1) Filter removal. This is not reference counted, it needs RCU grace
period.

2) Filter chain removal. This is refcount'ed and filters hold its refcnt,
the actions attached to filters could hold its refcnt too.

So when we remove a filter chain, it flushes all the filters and actions
attached, so RCU callbacks will be flying, tcf_block_put() needs to wait
for refcnt hits zero, especially by those actions, before freeing it.

>
> 5.      RTNL might not be held when the reference count reaches zero.
>

Without my patch, no, RTNL is not held in RCU callbacks where actions
are destroyed and releasing refcnt.

With my patch, RTNL will be always
Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
On Wed, Oct 25, 2017 at 03:37:40PM -0700, Cong Wang wrote:
> On Tue, Oct 24, 2017 at 6:43 PM, David Miller wrote:
> > From: Cong Wang
> > Date: Mon, 23 Oct 2017 15:02:49 -0700
> >
> >> Recently, the RCU callbacks used in TC filters and TC actions keep
> >> drawing my attention, they introduce at least 4 race condition bugs:
> >
> > Like Eric, I think doing a full RCU sync on every delete is too big
> > a pill to swallow.  This is a major control plane performance
> > regression.
> >
> > Please find another reasonable way to fix this.
> >
>
> Alright... I finally find a way to make everyone happy.
>
> My solution is introducing a workqueue for tc filters
> and let each RCU callback defer the work to this
> workqueue. I solve the flush_workqueue() deadlock
> by queuing another work in the same workqueue
> at the end, so the execution order should be as same
> as it is now. The ugly part is now tcf_block_put() which
> looks like below:
>
>
> static void tcf_block_put_final(struct work_struct *work)
> {
>         struct tcf_block *block = container_of(work, struct tcf_block, work);
>         struct tcf_chain *chain, *tmp;
>
>         /* At this point, all the chains should have refcnt == 1. */
>         rtnl_lock();
>         list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
>                 tcf_chain_put(chain);
>         rtnl_unlock();
>         kfree(block);
> }

I am guessing that tcf_chain_put() sometimes does a call_rcu(),
and the callback function in turn calls schedule_work(), and that
tcf_block_put_deferred() is the workqueue handler function.

> static void tcf_block_put_deferred(struct work_struct *work)
> {
>         struct tcf_block *block = container_of(work, struct tcf_block, work);
>         struct tcf_chain *chain;
>
>         rtnl_lock();
>         /* Hold a refcnt for all chains, except 0, in case they are gone. */
>         list_for_each_entry(chain, &block->chain_list, list)
>                 if (chain->index)
>                         tcf_chain_hold(chain);
>
>         /* No race on the list, because no chain could be destroyed. */
>         list_for_each_entry(chain, &block->chain_list, list)
>                 tcf_chain_flush(chain);
>
>         INIT_WORK(&block->work, tcf_block_put_final);
>         /* Wait for RCU callbacks to release the reference count and make
>          * sure their works have been queued before this.
>          */
>         rcu_barrier();

This one can take awhile...  Though in recent kernels it will often
be a bit faster than synchronize_rcu().

Note that rcu_barrier() only waits for callbacks posted via call_rcu()
before the rcu_barrier() starts, if that matters.

>         tcf_queue_work(&block->work);
>         rtnl_unlock();
> }

And it looks like tcf_block_put_deferred() queues itself as work as well.
Or maybe instead?

> void tcf_block_put(struct tcf_block *block)
> {
>         if (!block)
>                 return;
>
>         INIT_WORK(&block->work, tcf_block_put_deferred);
>         /* Wait for existing RCU callbacks to cool down, make sure their works
>          * have been queued before this. We can not flush pending works here
>          * because we are holding the RTNL lock.
>          */
>         rcu_barrier();
>         tcf_queue_work(&block->work);
> }
>
>
> Paul, does this make any sense to you? ;)

I would be surprised if I fully understand the problem to be solved,
but my current guess is that the constraints are as follows:

1.      Things removed must be processed in order.

2.      Things removed must not be processed until a grace period
        has elapsed.

3.      Things being processed after a grace period should not be
        processed concurrently with each other or with subsequent
        removals.

4.      A given removal is not finalized until its reference count
        reaches zero.

5.      RTNL might not be held when the reference count reaches zero.

Or did I lose the thread somewhere?

                                                        Thanx, Paul
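Constraint 4 in the list above (a removal is finalized only when its
reference count reaches zero) can be illustrated with a small userspace
model. This is only a sketch under stated assumptions, not the kernel code:
struct chain_model, chain_hold(), chain_put() and demo_refcnt() are
hypothetical names, and a boolean flag stands in for kfree(). The extra
reference taken before the flush loosely mirrors the "Hold a refcnt for all
chains" step in the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified chain: only the fields this sketch needs. */
struct chain_model {
	int refcnt;
	bool freed;
};

static void chain_hold(struct chain_model *c)
{
	c->refcnt++;
}

/* Drop one reference; only the last put finalizes the removal. */
static void chain_put(struct chain_model *c)
{
	assert(c->refcnt > 0);
	if (--c->refcnt == 0)
		c->freed = true;	/* stands in for kfree() */
}

/* Returns true if the chain stayed alive until the final put. */
static bool demo_refcnt(void)
{
	struct chain_model c = { .refcnt = 1, .freed = false };	/* initial reference */
	bool alive_before_final;

	chain_hold(&c);		/* reference held by a filter */
	chain_hold(&c);		/* extra reference taken before the flush */

	chain_put(&c);		/* initial reference dropped during the flush */
	chain_put(&c);		/* filter destroyed later, from deferred work */

	/* Exactly one reference is left, so the chain must still exist. */
	alive_before_final = !c.freed && c.refcnt == 1;

	chain_put(&c);		/* final put frees the chain */
	return alive_before_final && c.freed;
}
```

Calling demo_refcnt() returns true: the chain survives both earlier puts and
is released only by the last one, which is the property the final work item
relies on.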
Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
On Tue, Oct 24, 2017 at 6:43 PM, David Miller wrote:
> From: Cong Wang
> Date: Mon, 23 Oct 2017 15:02:49 -0700
>
>> Recently, the RCU callbacks used in TC filters and TC actions keep
>> drawing my attention, they introduce at least 4 race condition bugs:
>
> Like Eric, I think doing a full RCU sync on every delete is too big
> a pill to swallow.  This is a major control plane performance
> regression.
>
> Please find another reasonable way to fix this.
>

Alright... I finally find a way to make everyone happy.

My solution is introducing a workqueue for tc filters
and let each RCU callback defer the work to this
workqueue. I solve the flush_workqueue() deadlock
by queuing another work in the same workqueue
at the end, so the execution order should be as same
as it is now. The ugly part is now tcf_block_put() which
looks like below:


static void tcf_block_put_final(struct work_struct *work)
{
        struct tcf_block *block = container_of(work, struct tcf_block, work);
        struct tcf_chain *chain, *tmp;

        /* At this point, all the chains should have refcnt == 1. */
        rtnl_lock();
        list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
                tcf_chain_put(chain);
        rtnl_unlock();
        kfree(block);
}

static void tcf_block_put_deferred(struct work_struct *work)
{
        struct tcf_block *block = container_of(work, struct tcf_block, work);
        struct tcf_chain *chain;

        rtnl_lock();
        /* Hold a refcnt for all chains, except 0, in case they are gone. */
        list_for_each_entry(chain, &block->chain_list, list)
                if (chain->index)
                        tcf_chain_hold(chain);

        /* No race on the list, because no chain could be destroyed. */
        list_for_each_entry(chain, &block->chain_list, list)
                tcf_chain_flush(chain);

        INIT_WORK(&block->work, tcf_block_put_final);
        /* Wait for RCU callbacks to release the reference count and make
         * sure their works have been queued before this.
         */
        rcu_barrier();
        tcf_queue_work(&block->work);
        rtnl_unlock();
}

void tcf_block_put(struct tcf_block *block)
{
        if (!block)
                return;

        INIT_WORK(&block->work, tcf_block_put_deferred);
        /* Wait for existing RCU callbacks to cool down, make sure their works
         * have been queued before this. We can not flush pending works here
         * because we are holding the RTNL lock.
         */
        rcu_barrier();
        tcf_queue_work(&block->work);
}


Paul, does this make any sense to you? ;)
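The ordering argument in this proposal (queue one more work item at the end
of the same workqueue instead of flushing it) can be modelled in userspace.
The following is a minimal sketch, not the kernel implementation: it assumes
the workqueue behind tcf_queue_work() executes items strictly in FIFO order,
one at a time, which the message implies but does not state. All names
(ordered_wq, queue_work_model(), run_all(), demo()) are hypothetical, and
rcu_barrier() appears only as a comment marking where it would sit.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORK 16

struct work {
	void (*fn)(struct work *w);
	int id;
};

/* Hypothetical stand-in for an ordered workqueue: strict FIFO. */
struct ordered_wq {
	struct work *items[MAX_WORK];
	size_t head, tail;
};

static struct ordered_wq wq;
static int run_log[MAX_WORK];	/* work ids in execution order */
static size_t run_count;

static void queue_work_model(struct work *w)
{
	assert(wq.tail < MAX_WORK);
	wq.items[wq.tail++] = w;
}

/* Drain the queue; a handler may queue more work while this runs. */
static void run_all(void)
{
	while (wq.head < wq.tail) {
		struct work *w = wq.items[wq.head++];

		w->fn(w);
	}
}

static void record(struct work *w)
{
	run_log[run_count++] = w->id;
}

static struct work filter_a = { record, 1 };
static struct work filter_b = { record, 2 };
static struct work final_work = { record, 99 };

/* Models the deferred step: flush, then queue the final step last. */
static void deferred(struct work *w)
{
	record(w);
	queue_work_model(&filter_b);	/* work posted by a callback during the flush */
	/* rcu_barrier() would sit here, so filter_b is queued before final_work */
	queue_work_model(&final_work);
}

static struct work deferred_work = { deferred, 50 };

/* Returns the id of the last work item that ran. */
static int demo(void)
{
	wq.head = wq.tail = 0;
	run_count = 0;
	queue_work_model(&filter_a);	/* already pending before the put */
	queue_work_model(&deferred_work);
	run_all();
	return run_log[run_count - 1];
}
```

In this model the items run in the order 1, 50, 2, 99: everything queued
before the final item has completed by the time it runs, so nothing needs to
flush the queue while holding a lock that the queued handlers also take.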
Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
On Mon, Oct 23, 2017 at 4:31 PM, Eric Dumazet wrote:
>
> I did not pretend to give a bug fix, I simply said your patch series was
> probably not the right way.

Generally I agree with you on avoiding synchronize_rcu(), but this case is
very special, you need to consider case by case, not just talking
generally.

>
> Sure, we could add back BKL and solve a lot of lockdep issues.
>

For the record, this case is about race conditions which lead to real bugs,
not merely lockdep warnings.
Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
From: Cong Wang
Date: Mon, 23 Oct 2017 15:02:49 -0700

> Recently, the RCU callbacks used in TC filters and TC actions keep
> drawing my attention, they introduce at least 4 race condition bugs:

Like Eric, I think doing a full RCU sync on every delete is too big
a pill to swallow.  This is a major control plane performance
regression.

Please find another reasonable way to fix this.

Thank you.
Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
On Mon, 2017-10-23 at 16:23 -0700, Cong Wang wrote:
> On Mon, Oct 23, 2017 at 4:16 PM, Eric Dumazet wrote:
> > On Mon, 2017-10-23 at 15:02 -0700, Cong Wang wrote:
> >
> >> b) As suggested by Paul, we could defer the work to a workqueue and
> >> gain the permission of holding RTNL again without any performance
> >> impact, however, this seems impossible too, because as lockdep
> >> complains we have a deadlock when flushing workqueue while holding
> >> RTNL lock, see the rcu_barrier() in tcf_block_put().
> >>
> >> Therefore, the simplest solution we have is probably just getting
> >> rid of these RCU callbacks, because they are not necessary at all,
> >> callers of these call_rcu() are all on slow paths and have RTNL
> >> lock, so blocking is allowed in their contexts.
> >
> > I am against these pessimistic changes, sorry for not following past
> > discussions last week.
>
> So even Cc'ing you doesn't work. :-D

Nope. At the end of the day, there are only 24 hours per day.

> >
> > I am asking a talk during upcoming netdev/netconf about this, if we need
> > to take a decision.
>
> I won't be able to make it.
>
> >
> > RTNL is already a big problem for many of us, adding synchronize_rcu()
> > calls while holding RTNL is a no - go, unless we have clear evidence it
> > can not be avoided.
>
> You omitted too much, including the evidence I provide. In short it is very
> hard to do, otherwise I should have already done it. I am very open to
> any simple solution to avoid it, but so far none...
>
> Saying no but without giving a possible solution does not help anything
> here.

I did not pretend to give a bug fix, I simply said your patch series was
probably not the right way.

Sure, we could add back BKL and solve a lot of lockdep issues.
Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
On Mon, Oct 23, 2017 at 4:16 PM, Eric Dumazet wrote:
> On Mon, 2017-10-23 at 15:02 -0700, Cong Wang wrote:
>
>> b) As suggested by Paul, we could defer the work to a workqueue and
>> gain the permission of holding RTNL again without any performance
>> impact, however, this seems impossible too, because as lockdep
>> complains we have a deadlock when flushing workqueue while holding
>> RTNL lock, see the rcu_barrier() in tcf_block_put().
>>
>> Therefore, the simplest solution we have is probably just getting
>> rid of these RCU callbacks, because they are not necessary at all,
>> callers of these call_rcu() are all on slow paths and have RTNL
>> lock, so blocking is allowed in their contexts.
>
> I am against these pessimistic changes, sorry for not following past
> discussions last week.

So even Cc'ing you doesn't work. :-D

>
> I am asking a talk during upcoming netdev/netconf about this, if we need
> to take a decision.

I won't be able to make it.

>
> RTNL is already a big problem for many of us, adding synchronize_rcu()
> calls while holding RTNL is a no - go, unless we have clear evidence it
> can not be avoided.

You omitted too much, including the evidence I provide. In short it is very
hard to do, otherwise I should have already done it. I am very open to
any simple solution to avoid it, but so far none...

Saying no but without giving a possible solution does not help anything
here.
Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
On Mon, 2017-10-23 at 15:02 -0700, Cong Wang wrote:

> b) As suggested by Paul, we could defer the work to a workqueue and
> gain the permission of holding RTNL again without any performance
> impact, however, this seems impossible too, because as lockdep
> complains we have a deadlock when flushing workqueue while holding
> RTNL lock, see the rcu_barrier() in tcf_block_put().
>
> Therefore, the simplest solution we have is probably just getting
> rid of these RCU callbacks, because they are not necessary at all,
> callers of these call_rcu() are all on slow paths and have RTNL
> lock, so blocking is allowed in their contexts.

I am against these pessimistic changes, sorry for not following past
discussions last week.

I am asking a talk during upcoming netdev/netconf about this, if we need
to take a decision.

RTNL is already a big problem for many of us, adding synchronize_rcu()
calls while holding RTNL is a no - go, unless we have clear evidence it
can not be avoided.

Thanks !
[Patch net 00/15] net_sched: remove RCU callbacks from TC
Recently, the RCU callbacks used in TC filters and TC actions keep
drawing my attention, they introduce at least 4 race condition bugs:

1. A simple one fixed by Daniel:

commit c78e1746d3ad7d548bdf3fe491898cc453911a49
Author: Daniel Borkmann
Date:   Wed May 20 17:13:33 2015 +0200

    net: sched: fix call_rcu() race on classifier module unloads

2. A very nasty one fixed by me:

commit 1697c4bb5245649a23f06a144cc38c06715e1b65
Author: Cong Wang
Date:   Mon Sep 11 16:33:32 2017 -0700

    net_sched: carefully handle tcf_block_put()

3. Two more bugs found by Chris:
https://patchwork.ozlabs.org/patch/826696/
https://patchwork.ozlabs.org/patch/826695/

Usually RCU callbacks are simple, however for TC filters and actions,
they are complex because at least TC actions could be destroyed
together with the TC filter in one callback. And RCU callbacks are
invoked in BH context, without locking they are parallel too. All of
these contribute to the cause of these nasty bugs. It looks like they
bring us more troubles than benefits.

Alternatively, we could also:

a) Introduce a spinlock to serialize these RCU callbacks. But as I
said in commit 1697c4bb5245 ("net_sched: carefully handle
tcf_block_put()"), it is very hard to do because of tcf_chain_dump().
Potentially we need to do a lot of work to make it possible, if not
impossible.

b) As suggested by Paul, we could defer the work to a workqueue and
gain the permission of holding RTNL again without any performance
impact, however, this seems impossible too, because as lockdep
complains we have a deadlock when flushing workqueue while holding
RTNL lock, see the rcu_barrier() in tcf_block_put().

Therefore, the simplest solution we have is probably just getting
rid of these RCU callbacks, because they are not necessary at all,
callers of these call_rcu() are all on slow paths and have RTNL
lock, so blocking is allowed in their contexts.

The downside is this could hurt the performance of deleting TC
filters, but again it is not a hot path and I already batch
synchronize_rcu() whenever needed.

Different filters have different data structures, so please also
see each patch for details.

Chris Mi (2):
  selftests: Introduce a new script to generate tc batch file
  selftests: Introduce a new test case to tc testsuite

Cong Wang (13):
  net_sched: remove RCU callbacks in basic filter
  net_sched: remove RCU callbacks in bpf filter
  net_sched: remove RCU callbacks in flower filter
  net_sched: remove RCU callbacks in matchall filter
  net_sched: remove RCU callbacks in cgroup filter
  net_sched: remove RCU callbacks in rsvp filter
  net_sched: remove RCU callbacks in flow filter
  net_sched: remove RCU callbacks in tcindex filter
  net_sched: remove RCU callbacks in u32 filter
  net_sched: remove RCU callbacks in fw filter
  net_sched: remove RCU callbacks in route filter
  net_sched: remove RCU callbacks in sample action
  net_sched: add rtnl assertion to tcf_exts_destroy()

 include/net/tc_act/tc_sample.h                     |  1 -
 net/sched/act_sample.c                             | 12 +---
 net/sched/cls_api.c                                |  1 +
 net/sched/cls_basic.c                              | 23
 net/sched/cls_bpf.c                                | 22
 net/sched/cls_cgroup.c                             | 18 +++---
 net/sched/cls_flow.c                               | 24
 net/sched/cls_flower.c                             | 46 +---
 net/sched/cls_fw.c                                 | 18 +++---
 net/sched/cls_matchall.c                           |  9 +--
 net/sched/cls_route.c                              | 21 +++
 net/sched/cls_rsvp.h                               | 13 +
 net/sched/cls_tcindex.c                            | 35 +---
 net/sched/cls_u32.c                                | 64 ++
 .../tc-testing/tc-tests/filters/tests.json         | 23 +++-
 tools/testing/selftests/tc-testing/tdc.py          | 20 +--
 tools/testing/selftests/tc-testing/tdc_batch.py    | 62 +
 tools/testing/selftests/tc-testing/tdc_config.py   |  2 +
 18 files changed, 222 insertions(+), 192 deletions(-)
 create mode 100644 tools/testing/selftests/tc-testing/tdc_batch.py

--
2.13.0