Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path
On 11/16/2018 06:41 AM, David Laight wrote:
> From: Eric Dumazet
>> Sent: 16 November 2018 14:35
> ...
>> I suggest to use a single cache line with a dedicated spinlock and these
>> three s64
>>
>> spinlock_t tcfp_lock ____cacheline_aligned_in_smp;
>> s64 ...
>> s64 ...
>> s64 ...
>
> Doesn't this do something really stupid when cache lines are big.
> If the spinlock is 8 bytes you never want more than 32 byte alignment.
> If cache lines are 256 bytes you don't even need that.

We do want that, even if cache lines are 256 bytes, thank you.

> Also ISTR that the kmalloc() only guarantees 8 byte alignment on x86_64.
> So aligning structure members to larger offsets is rather pointless.

No it is not, we use these hints all the time.

Just double check and report a bug to mm teams if you disagree.

Please do not send feedback if you are not sure.
RE: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path
From: Eric Dumazet
> Sent: 16 November 2018 14:35
...
> I suggest to use a single cache line with a dedicated spinlock and these
> three s64
>
> spinlock_t tcfp_lock ____cacheline_aligned_in_smp;
> s64 ...
> s64 ...
> s64 ...

Doesn't this do something really stupid when cache lines are big.
If the spinlock is 8 bytes you never want more than 32 byte alignment.
If cache lines are 256 bytes you don't even need that.

Also ISTR that the kmalloc() only guarantees 8 byte alignment on x86_64.
So aligning structure members to larger offsets is rather pointless.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path
On 11/16/2018 06:34 AM, Eric Dumazet wrote:
>
>> +	s64			tcfp_toks;
>> +	s64			tcfp_ptoks;
>> +	s64			tcfp_t_c;
>
> I suggest to use a single cache line with a dedicated spinlock and these
> three s64
>
> spinlock_t tcfp_lock ____cacheline_aligned_in_smp;
> s64 ...
> s64 ...
> s64 ...
>
>
>> 	struct tcf_police_params __rcu *params;
>
> Make sure to use a different cache line for *params
>
> struct tcf_police_params __rcu *params ____cacheline_aligned_in_smp;

Or move it before the cacheline used by the lock and three s64, since
'common' should be read-mostly. No need for a separate cache line.
Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path
On 11/16/2018 03:28 AM, Davide Caratti wrote:
> On Thu, 2018-11-15 at 05:53 -0800, Eric Dumazet wrote:
>>
>> On 11/15/2018 03:43 AM, Davide Caratti wrote:
>>> On Wed, 2018-11-14 at 22:46 -0800, Eric Dumazet wrote:
>>>>
>>>> On 09/13/2018 10:29 AM, Davide Caratti wrote:
>>>>> use RCU instead of spinlocks, to protect concurrent read/write on
>>>>> act_police configuration. This reduces the effects of contention in the
>>>>> data path, in case multiple readers are present.
>>>>>
>>>>> Signed-off-by: Davide Caratti
>>>>> ---
>>>>>  net/sched/act_police.c | 156 -
>>>>>  1 file changed, 92 insertions(+), 64 deletions(-)
>>>>>
>>>>
>>>> I must be missing something obvious with this patch.
>>>
>>> hello Eric,
>>>
>>> On the opposite, I missed something obvious when I wrote that patch: there
>>> is a race condition on tcfp_toks, tcfp_ptoks and tcfp_t_c: thank you for
>>> noticing it.
>>>
>>> These variables still need to be protected with a spinlock. I will do a
>>> patch and evaluate if 'act_police' is still faster than a version where
>>> 2d550dbad83c ("net/sched: act_police: don't use spinlock in the data
>>> path") is reverted, and share results in the next hours.
>>>
>>> Ok?
>>
>> SGTM, thanks.
>
> hello,
> I just finished the comparison of act_police, in the following cases:
>
> a) revert the RCU-ification (i.e. commit 2d550dbad83c ("net/sched:
> act_police: don't use spinlock in the data path")), and leave per-cpu
> counters used by the rate estimator
>
> b) keep RCU-ified configuration parameters, and protect read/update of
> tcfp_toks, tcfp_ptoks and tcfp_t_c with a spinlock (code at the bottom of
> this message).
>
> ## Test setup:
>
> $DEV is a 'dummy' with clsact qdisc; the following two commands,
>
> # test police with 'rate'
> $TC filter add dev $DEV egress matchall \
> 	action police rate 2gbit burst 100k conform-exceed pass/pass index 100
>
> # test police with 'avrate'
> $TC filter add dev prova egress estimator 1s 8s matchall \
> 	action police avrate 2gbit conform-exceed pass/pass index 100
>
> are tested with the following loop:
>
> for c in 1 2 4 8 16; do
> 	./pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 500 -i $DEV
> done
>
>
> ## Test results:
>
> using rate   |    reverted    |    patched
>     $c       | act_police (a) | act_police (b)
> -------------+----------------+---------------
>      1       |    3364442     |    3345580
>      2       |    2703030     |    2721919
>      4       |    1130146     |    1253555
>      8       |     664238     |     658777
>     16       |     154026     |     155259
>
>
> using avrate |    reverted    |    patched
>     $c       | act_police (a) | act_police (b)
> -------------+----------------+---------------
>      1       |    3621796     |    3658472
>      2       |    3075589     |    3548135
>      4       |    2313314     |    3343717
>      8       |     768458     |    3260480
>     16       |         16     |    3254128
>
>
> so, 'avrate' still gets a significant improvement because the 'conform/exceed'
> decision doesn't need the spinlock in this case. The estimation is probably
> less accurate, because it uses per-CPU variables: if this is not acceptable,
> then we need to revert also 93be42f9173b ("net/sched: act_police: use per-cpu
> counters").
>
> ## patch code:
>
> -- >8 --
> diff --git a/net/sched/act_police.c b/net/sched/act_police.c
> index 052855d..42db852 100644
> --- a/net/sched/act_police.c
> +++ b/net/sched/act_police.c
> @@ -27,10 +27,7 @@ struct tcf_police_params {
> 	u32			tcfp_ewma_rate;
> 	s64			tcfp_burst;
> 	u32			tcfp_mtu;
> -	s64			tcfp_toks;
> -	s64			tcfp_ptoks;
> 	s64			tcfp_mtu_ptoks;
> -	s64			tcfp_t_c;
> 	struct psched_ratecfg	rate;
> 	bool			rate_present;
> 	struct psched_ratecfg	peak;
> @@ -40,6 +37,9 @@ struct tcf_police_params {
>
> struct tcf_police {
> 	struct tc_action	common;
> +	s64			tcfp_toks;
> +	s64			tcfp_ptoks;
> +	s64			tcfp_t_c;

I suggest to use a single cache line with a dedicated spinlock and these
three s64

spinlock_t tcfp_lock ____cacheline_aligned_in_smp;
s64 ...
s64 ...
s64 ...

> 	struct tcf_police_params __rcu *params;

Make sure to use a different cache line for *params

struct tcf_police_params __rcu *params ____cacheline_aligned_in_smp;

> };
>
> @@ -186,12 +186,6 @@ static int tcf_police_init(struct net *net, struct
> nlattr *nla,
> 	}
>
> 	new->tcfp_burst = PSCHED_TICKS2NS(parm->burst);
> -	new->tcfp_toks = new->tcfp_burst;
> -	if (new->peak_present) {
> -		new->tcfp_mtu_ptoks = (s64)psched_l2t_
Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path
On Thu, 2018-11-15 at 05:53 -0800, Eric Dumazet wrote:
>
> On 11/15/2018 03:43 AM, Davide Caratti wrote:
>> On Wed, 2018-11-14 at 22:46 -0800, Eric Dumazet wrote:
>>>
>>> On 09/13/2018 10:29 AM, Davide Caratti wrote:
>>>> use RCU instead of spinlocks, to protect concurrent read/write on
>>>> act_police configuration. This reduces the effects of contention in the
>>>> data path, in case multiple readers are present.
>>>>
>>>> Signed-off-by: Davide Caratti
>>>> ---
>>>>  net/sched/act_police.c | 156 -
>>>>  1 file changed, 92 insertions(+), 64 deletions(-)
>>>>
>>>
>>> I must be missing something obvious with this patch.
>>
>> hello Eric,
>>
>> On the opposite, I missed something obvious when I wrote that patch: there
>> is a race condition on tcfp_toks, tcfp_ptoks and tcfp_t_c: thank you for
>> noticing it.
>>
>> These variables still need to be protected with a spinlock. I will do a
>> patch and evaluate if 'act_police' is still faster than a version where
>> 2d550dbad83c ("net/sched: act_police: don't use spinlock in the data
>> path") is reverted, and share results in the next hours.
>>
>> Ok?
>
> SGTM, thanks.

hello,
I just finished the comparison of act_police, in the following cases:

a) revert the RCU-ification (i.e. commit 2d550dbad83c ("net/sched:
act_police: don't use spinlock in the data path")), and leave per-cpu
counters used by the rate estimator

b) keep RCU-ified configuration parameters, and protect read/update of
tcfp_toks, tcfp_ptoks and tcfp_t_c with a spinlock (code at the bottom of
this message).
## Test setup:

$DEV is a 'dummy' with clsact qdisc; the following two commands,

# test police with 'rate'
$TC filter add dev $DEV egress matchall \
	action police rate 2gbit burst 100k conform-exceed pass/pass index 100

# test police with 'avrate'
$TC filter add dev prova egress estimator 1s 8s matchall \
	action police avrate 2gbit conform-exceed pass/pass index 100

are tested with the following loop:

for c in 1 2 4 8 16; do
	./pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 500 -i $DEV
done


## Test results:

using rate   |    reverted    |    patched
    $c       | act_police (a) | act_police (b)
-------------+----------------+---------------
     1       |    3364442     |    3345580
     2       |    2703030     |    2721919
     4       |    1130146     |    1253555
     8       |     664238     |     658777
    16       |     154026     |     155259


using avrate |    reverted    |    patched
    $c       | act_police (a) | act_police (b)
-------------+----------------+---------------
     1       |    3621796     |    3658472
     2       |    3075589     |    3548135
     4       |    2313314     |    3343717
     8       |     768458     |    3260480
    16       |         16     |    3254128


so, 'avrate' still gets a significant improvement because the 'conform/exceed'
decision doesn't need the spinlock in this case. The estimation is probably
less accurate, because it uses per-CPU variables: if this is not acceptable,
then we need to revert also 93be42f9173b ("net/sched: act_police: use per-cpu
counters").
## patch code:

-- >8 --
diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index 052855d..42db852 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -27,10 +27,7 @@ struct tcf_police_params {
 	u32			tcfp_ewma_rate;
 	s64			tcfp_burst;
 	u32			tcfp_mtu;
-	s64			tcfp_toks;
-	s64			tcfp_ptoks;
 	s64			tcfp_mtu_ptoks;
-	s64			tcfp_t_c;
 	struct psched_ratecfg	rate;
 	bool			rate_present;
 	struct psched_ratecfg	peak;
@@ -40,6 +37,9 @@ struct tcf_police_params {

 struct tcf_police {
 	struct tc_action	common;
+	s64			tcfp_toks;
+	s64			tcfp_ptoks;
+	s64			tcfp_t_c;
 	struct tcf_police_params __rcu *params;
 };

@@ -186,12 +186,6 @@ static int tcf_police_init(struct net *net, struct nlattr *nla,
 	}

 	new->tcfp_burst = PSCHED_TICKS2NS(parm->burst);
-	new->tcfp_toks = new->tcfp_burst;
-	if (new->peak_present) {
-		new->tcfp_mtu_ptoks = (s64)psched_l2t_ns(&new->peak,
-							 new->tcfp_mtu);
-		new->tcfp_ptoks = new->tcfp_mtu_ptoks;
-	}
 	if (tb[TCA_POLICE_AVRATE])
 		new->tcfp_ewma_rate = nla_get_u32(tb[TCA_POLICE_AVRATE]);
@@ -207,7 +201,14 @@ static int tcf_police_init(struct net *net, struct nlattr *nla,
 	}

 	spin_lock_bh(&police->tcf_lock);
-	new->tcfp_t_c = ktime_get_ns();
+	police->tcfp_t_c = ktime_get_ns();
+	police->tcfp_toks = new->tcfp_burst;
+	if (new->peak_present) {
+
Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path
On 11/15/2018 03:43 AM, Davide Caratti wrote:
> On Wed, 2018-11-14 at 22:46 -0800, Eric Dumazet wrote:
>>
>> On 09/13/2018 10:29 AM, Davide Caratti wrote:
>>> use RCU instead of spinlocks, to protect concurrent read/write on
>>> act_police configuration. This reduces the effects of contention in the
>>> data path, in case multiple readers are present.
>>>
>>> Signed-off-by: Davide Caratti
>>> ---
>>>  net/sched/act_police.c | 156 -
>>>  1 file changed, 92 insertions(+), 64 deletions(-)
>>>
>>
>> I must be missing something obvious with this patch.
>
> hello Eric,
>
> On the opposite, I missed something obvious when I wrote that patch: there
> is a race condition on tcfp_toks, tcfp_ptoks and tcfp_t_c: thank you for
> noticing it.
>
> These variables still need to be protected with a spinlock. I will do a
> patch and evaluate if 'act_police' is still faster than a version where
> 2d550dbad83c ("net/sched: act_police: don't use spinlock in the data
> path") is reverted, and share results in the next hours.
>
> Ok?

SGTM, thanks.
Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path
On Wed, 2018-11-14 at 22:46 -0800, Eric Dumazet wrote:
>
> On 09/13/2018 10:29 AM, Davide Caratti wrote:
>> use RCU instead of spinlocks, to protect concurrent read/write on
>> act_police configuration. This reduces the effects of contention in the
>> data path, in case multiple readers are present.
>>
>> Signed-off-by: Davide Caratti
>> ---
>>  net/sched/act_police.c | 156 -
>>  1 file changed, 92 insertions(+), 64 deletions(-)
>>
>
> I must be missing something obvious with this patch.

hello Eric,

On the opposite, I missed something obvious when I wrote that patch: there
is a race condition on tcfp_toks, tcfp_ptoks and tcfp_t_c: thank you for
noticing it.

These variables still need to be protected with a spinlock. I will do a
patch and evaluate if 'act_police' is still faster than a version where
2d550dbad83c ("net/sched: act_police: don't use spinlock in the data
path") is reverted, and share results in the next hours.

Ok?

--
davide
Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path
On 09/13/2018 10:29 AM, Davide Caratti wrote:
> use RCU instead of spinlocks, to protect concurrent read/write on
> act_police configuration. This reduces the effects of contention in the
> data path, in case multiple readers are present.
>
> Signed-off-by: Davide Caratti
> ---
>  net/sched/act_police.c | 156 -
>  1 file changed, 92 insertions(+), 64 deletions(-)
>

I must be missing something obvious with this patch.

How can the following piece of code in tcf_police_act() possibly be run
without a spinlock or something preventing multiple cpus messing badly
with the state variables ?

		now = ktime_get_ns();
		toks = min_t(s64, now - p->tcfp_t_c, p->tcfp_burst);
		if (p->peak_present) {
			ptoks = toks + p->tcfp_ptoks;
			if (ptoks > p->tcfp_mtu_ptoks)
				ptoks = p->tcfp_mtu_ptoks;
			ptoks -= (s64)psched_l2t_ns(&p->peak,
						    qdisc_pkt_len(skb));
		}
		toks += p->tcfp_toks;
		if (toks > p->tcfp_burst)
			toks = p->tcfp_burst;
		toks -= (s64)psched_l2t_ns(&p->rate, qdisc_pkt_len(skb));
		if ((toks|ptoks) >= 0) {
			p->tcfp_t_c = now;
			p->tcfp_toks = toks;
			p->tcfp_ptoks = ptoks;
			ret = p->tcfp_result;
			goto inc_drops;
		}