Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-29 Thread Raghavendra K T

On 08/29/2015 10:41 AM, David Miller wrote:

From: Raghavendra K T 
Date: Sat, 29 Aug 2015 08:27:15 +0530


resending the patch with memset. Please let me know if you want to
resend all the patches.


Do not post patches as replies to existing discussion threads.

Instead, make a new, fresh, patch posting, updating the Subject line
as needed.



Sure.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-29 Thread Raghavendra K T

On 08/29/2015 08:56 AM, Eric Dumazet wrote:

On Sat, 2015-08-29 at 08:27 +0530, Raghavendra K T wrote:


/* Use put_unaligned() because stats may not be aligned for u64. */
put_unaligned(items, &stats[0]);




for (i = 1; i < items; i++)
-   put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]);
+   put_unaligned(buff[i], &stats[i]);



I believe Joe suggested following code instead :

buff[0] = items;
memcpy(stats, buff, items * sizeof(u64));


Thanks. Sure, will use this.

(I missed that. I thought that it was applicable only when we have
aligned data, and for POWER, put_unaligned was not a nop, unlike Intel.)



Also please move buff[] array into __snmp6_fill_stats64() to make it
clear it is used in a 'leaf' function.


Correct.



(even if calling memcpy()/memset() makes it not a leaf function)





Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread David Miller
From: Raghavendra K T 
Date: Sat, 29 Aug 2015 08:27:15 +0530

> resending the patch with memset. Please let me know if you want to
> resend all the patches.

Do not post patches as replies to existing discussion threads.

Instead, make a new, fresh, patch posting, updating the Subject line
as needed.


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Sat, 2015-08-29 at 08:27 +0530, Raghavendra K T wrote:
>  
>   /* Use put_unaligned() because stats may not be aligned for u64. */
>   put_unaligned(items, &stats[0]);


>   for (i = 1; i < items; i++)
> - put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]);
> + put_unaligned(buff[i], &stats[i]);
>  

I believe Joe suggested following code instead :

buff[0] = items;
memcpy(stats, buff, items * sizeof(u64));

Also please move buff[] array into __snmp6_fill_stats64() to make it
clear it is used in a 'leaf' function.

(even if calling memcpy()/memset() makes it not a leaf function)




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Raghavendra K T
* David Miller  [2015-08-28 11:24:13]:

> From: Raghavendra K T 
> Date: Fri, 28 Aug 2015 12:09:52 +0530
> 
> > On 08/28/2015 12:08 AM, David Miller wrote:
> >> From: Raghavendra K T 
> >> Date: Wed, 26 Aug 2015 23:07:33 +0530
> >>
> >>> @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64
> >>> *stats, void __percpu *mib,
> >>>   static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int
> >>>   attrtype,
> >>>int bytes)
> >>>   {
> >>> + u64 buff[IPSTATS_MIB_MAX] = {0,};
> >>> +
>  ...
> > hope you wanted to know the overhead than to change the current
> > patch. please let me know..
> 
> I want you to change that variable initializer to an explicit memset().
> 
> The compiler is emitting a memset() or similar _anyways_.
> 
> Not because it will have any impact at all upon performance, but because
> of how it looks to people trying to read and understand the code.
> 
> 

Hi David,
resending the patch with memset. Please let me know if you want to
resend all the patches.

8<
From: Raghavendra K T 
Subject: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking
 all the percpu data at once

Docker container creation time increased linearly from around 1.6 sec to
7.5 sec (at 1000 containers), and perf data showed 50% overhead in
snmp_fold_field.

reason: currently __snmp6_fill_stats64 calls snmp_fold_field, which walks
through the per-cpu data of one item at a time (iteratively, for around
90 items).

idea: This patch aggregates the statistics by going through all the
items of each cpu sequentially, which reduces cache misses.

Docker creation got faster by more than 2x after the patch.

Result:
   Before   After
Docker creation time   6.836s   3.357s
cache miss 2.7% 1.38%

perf before:
50.73%  docker   [kernel.kallsyms]   [k] snmp_fold_field
 9.07%  swapper  [kernel.kallsyms]   [k] snooze_loop
 3.49%  docker   [kernel.kallsyms]   [k] veth_stats_one
 2.85%  swapper  [kernel.kallsyms]   [k] _raw_spin_lock

perf after:
10.56%  swapper  [kernel.kallsyms] [k] snooze_loop
 8.72%  docker   [kernel.kallsyms] [k] snmp_get_cpu_field
 7.59%  docker   [kernel.kallsyms] [k] veth_stats_one
 3.65%  swapper  [kernel.kallsyms] [k] _raw_spin_lock

Signed-off-by: Raghavendra K T 
---
 net/ipv6/addrconf.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

 Change in V2:
 - Allocate stat calculation buffer on the stack (Eric)
 - Use memset to zero the temp buffer (David)

Thanks David and Eric for comments on V1. As both of them pointed out,
unfortunately we cannot get rid of the calculation buffer without
avoiding the unaligned op.


diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 21c2c81..9bdfba3 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4624,16 +4624,22 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
 }
 
 static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
- int items, int bytes, size_t syncpoff)
+   int items, int bytes, size_t syncpoff,
+   u64 *buff)
 {
-   int i;
+   int i, c;
int pad = bytes - sizeof(u64) * items;
BUG_ON(pad < 0);
 
/* Use put_unaligned() because stats may not be aligned for u64. */
put_unaligned(items, &stats[0]);
+
+   for_each_possible_cpu(c)
+   for (i = 1; i < items; i++)
+   buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);
+
for (i = 1; i < items; i++)
-   put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]);
+   put_unaligned(buff[i], &stats[i]);
 
memset(&stats[items], 0, pad);
 }
@@ -4641,10 +4647,13 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
 static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
 int bytes)
 {
+   u64 buff[IPSTATS_MIB_MAX];
+
switch (attrtype) {
case IFLA_INET6_STATS:
-   __snmp6_fill_stats64(stats, idev->stats.ipv6,
-IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp));
+   memset(buff, 0, sizeof(buff));
+   __snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, bytes,
+offsetof(struct ipstats_mib, syncp), buff);
break;
case IFLA_INET6_ICMP6STATS:
__snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes);
-- 
1.7.11.7


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 17:35 -0700, Joe Perches wrote:

> That of course depends on what a "leaf" is and
> whether or not any other function call in the
> "leaf" consumes stack.
> 
> inet6_fill_ifla6_attrs does call other functions
> (none of which has the stack frame size of k.alloc)

Just define/use this automatic array in the damn leaf function.

That should not be hard, and maybe no one will complain and we can
work on more complex issues.




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 17:06 -0700, Eric Dumazet wrote:
> On Fri, 2015-08-28 at 16:12 -0700, Joe Perches wrote: 
> > Generally true.  It's always difficult to know how much
> > stack has been consumed though and smaller stack frames
> > are generally better.
[] 
> So for a _leaf_ function, it is better to declare an automatic variable,
> as you in fact reduce max stack depth.

That of course depends on what a "leaf" is and
whether or not any other function call in the
"leaf" consumes stack.

inet6_fill_ifla6_attrs does call other functions
(none of which has the stack frame size of k.alloc)

> Not only it uses less kernel stack, it is also way faster, as you avoid
> kmalloc()/kfree() overhead and reuse probably already hot cache lines in
> kernel stack.

yup.

You'll also never neglect to free stack like the
original RFC patch neglected to free the alloc.

cheers, Joe



Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 16:12 -0700, Joe Perches wrote:

> Generally true.  It's always difficult to know how much
> stack has been consumed though and smaller stack frames
> are generally better.

Calling kmalloc(288, GFP_KERNEL) might use way more than 288 bytes in
kernel stack on 64 bit arch.

__slab_alloc() itself for example uses 208 bytes on stack, so add all
others, and you might go above 500 bytes.
 
So for a _leaf_ function, it is better to declare an automatic variable,
as you in fact reduce max stack depth.

Not only it uses less kernel stack, it is also way faster, as you avoid
kmalloc()/kfree() overhead and reuse probably already hot cache lines in
kernel stack.





Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 15:29 -0700, Eric Dumazet wrote:
> On Fri, 2015-08-28 at 14:26 -0700, Joe Perches wrote:
> 1) u64 array[XX] on stack is naturally aligned,

Of course it is.

> kzalloc() wont improve this at all. Not sure what you believe.

An alloc would only reduce stack use.

Copying into the buffer, then copying the buffer into the
skb may be desirable on some arches though.

> 2) put_unaligned() is basically a normal memory write on x86.
>  memcpy(dst,src,...) will have a problem anyway on arches that care,
> because src & dst wont have same alignment.

OK, so all the world's an x86?

On arm32, copying 288 bytes using nearly all aligned word
transfers is generally faster than using only unsigned
short transfers.

> 288 bytes on stack in a leaf function in this path is totally fine, it
> is not like we're calling ext4/xfs/nfs code after this point.

Generally true.  It's always difficult to know how much
stack has been consumed though and smaller stack frames
are generally better.

Anyway, the block copy from either the alloc'd or stack
buffer amounts only to a slight performance improvement
for arm32.  It doesn't really have much other utility.




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 14:26 -0700, Joe Perches wrote:

> Always a possibility, but I don't think so.
> 
> > put_unaligned is happening on a space allocated from rtnetlink skb, not
> > the temp space needed to perform the per cpu folding.
> 
> That's why I suggested changing the snmp_fill_stats arguments.
> 
> If the naturally aligned allocated u64 array is used and then
> copied as a block to the rtnetlink skb, I believe there's no
> alignment issue that would require put_unaligned.

1) u64 array[XX] on stack is naturally aligned,
kzalloc() wont improve this at all. Not sure what you believe.

2) put_unaligned() is basically a normal memory write on x86.
 memcpy(dst,src,...) will have a problem anyway on arches that care,
because src & dst wont have same alignment.

288 bytes on stack in a leaf function in this path is totally fine, it
is not like we're calling ext4/xfs/nfs code after this point.




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 14:14 -0700, Eric Dumazet wrote:
> On Fri, 2015-08-28 at 14:09 -0700, Joe Perches wrote:
> > On Fri, 2015-08-28 at 13:55 -0700, Eric Dumazet wrote:
> > > On Fri, 2015-08-28 at 13:53 -0700, Joe Perches wrote:
> > > > It's 288 bytes on stack, maybe a kzalloc would be clearer too.
> > > 
> > > Could you read patch history and check why this has been rejected ?
> > 
> > I don't see a rejection, just that the initial
> > submission didn't check the allocation or add
> > an allocation buffer via kcalloc/kzalloc to the
> > inet6_fill_ifla6_attrs caller and change the
> > snmp6_fill_stats arguments.
> > 
> > It could also eliminate the put_unaligned calls.
> 
> Not really. You do not properly read this code.

Always a possibility, but I don't think so.

> put_unaligned is happening on a space allocated from rtnetlink skb, not
> the temp space needed to perform the per cpu folding.

That's why I suggested changing the snmp_fill_stats arguments.

If the naturally aligned allocated u64 array is used and then
copied as a block to the rtnetlink skb, I believe there's no
alignment issue that would require put_unaligned.

Do I still miss something?

> https://lkml.org/lkml/2015/8/25/476

I read that.



Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 14:09 -0700, Joe Perches wrote:
> On Fri, 2015-08-28 at 13:55 -0700, Eric Dumazet wrote:
> > On Fri, 2015-08-28 at 13:53 -0700, Joe Perches wrote:
> > > On Fri, 2015-08-28 at 13:33 -0700, Eric Dumazet wrote:
> > > > We do not bother for small struct.
> > > > 
> > > > Here, the array is big enough that David prefers having an explicit
> > > > memset() so that it clearly shows that author of this code was aware of
> > > > this.
> > > 
> > > It's 288 bytes on stack, maybe a kzalloc would be clearer too.
> > 
> > Could you read patch history and check why this has been rejected ?
> 
> I don't see a rejection, just that the initial
> submission didn't check the allocation or add
> an allocation buffer via kcalloc/kzalloc to the
> inet6_fill_ifla6_attrs caller and change the
> snmp6_fill_stats arguments.
> 
> It could also eliminate the put_unaligned calls.

Not really. You do not properly read this code.

put_unaligned is happening on a space allocated from rtnetlink skb, not
the temp space needed to perform the per cpu folding.

> 
> https://lkml.org/lkml/2015/8/25/114
> 
> Was there some other thread?

Same thread

https://lkml.org/lkml/2015/8/25/476





Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 13:55 -0700, Eric Dumazet wrote:
> On Fri, 2015-08-28 at 13:53 -0700, Joe Perches wrote:
> > On Fri, 2015-08-28 at 13:33 -0700, Eric Dumazet wrote:
> > > We do not bother for small struct.
> > > 
> > > Here, the array is big enough that David prefers having an explicit
> > > memset() so that it clearly shows that author of this code was aware of
> > > this.
> > 
> > It's 288 bytes on stack, maybe a kzalloc would be clearer too.
> 
> Could you read patch history and check why this has been rejected ?

I don't see a rejection, just that the initial
submission didn't check the allocation or add
an allocation buffer via kcalloc/kzalloc to the
inet6_fill_ifla6_attrs caller and change the
snmp6_fill_stats arguments.

It could also eliminate the put_unaligned calls.

https://lkml.org/lkml/2015/8/25/114

Was there some other thread?




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 13:53 -0700, Joe Perches wrote:
> On Fri, 2015-08-28 at 13:33 -0700, Eric Dumazet wrote:
> > We do not bother for small struct.
> > 
> > Here, the array is big enough that David prefers having an explicit
> > memset() so that it clearly shows that author of this code was aware of
> > this.
> 
> It's 288 bytes on stack, maybe a kzalloc would be clearer too.

Could you read patch history and check why this has been rejected ?

Thank you.




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 13:33 -0700, Eric Dumazet wrote:
> We do not bother for small struct.
> 
> Here, the array is big enough that David prefers having an explicit
> memset() so that it clearly shows that author of this code was aware of
> this.

It's 288 bytes on stack, maybe a kzalloc would be clearer too.




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 12:20 -0700, Joe Perches wrote:

> I don't read it as particularly different.
> 
> There are > 100 uses of the not quite a memset initialization
> style using "= { <0,> }" in net/
> 
> $ git grep -E "=\s*\{\s*0?\s*,?\s*\}" net | wc -l
> 138
> 
> There is a difference though if a struct is copied to
> user-space as a {} initialization only guarantees that
> struct members are initialized to 0 where memset also
> zeros any alignment padding.

We do not bother for small struct.

Here, the array is big enough that David prefers having an explicit
memset() so that it clearly shows that author of this code was aware of
this.





Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 11:24 -0700, David Miller wrote:
> From: Raghavendra K T 
> Date: Fri, 28 Aug 2015 12:09:52 +0530
> 
> > On 08/28/2015 12:08 AM, David Miller wrote:
> >> From: Raghavendra K T 
> >> Date: Wed, 26 Aug 2015 23:07:33 +0530
> >>
> >>> @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64
> >>> *stats, void __percpu *mib,
> >>>   static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int
> >>>   attrtype,
> >>>int bytes)
> >>>   {
> >>> + u64 buff[IPSTATS_MIB_MAX] = {0,};
> >>> +
>  ...
> > hope you wanted to know the overhead than to change the current
> > patch. please let me know..
> 
> I want you to change that variable initializer to an explicit memset().
> 
> The compiler is emitting a memset() or similar _anyways_.
> 
> Not because it will have any impact at all upon performance, but because
> of how it looks to people trying to read and understand the code.

I don't read it as particularly different.

There are > 100 uses of the not quite a memset initialization
style using "= { <0,> }" in net/

$ git grep -E "=\s*\{\s*0?\s*,?\s*\}" net | wc -l
138

There is a difference though if a struct is copied to
user-space as a {} initialization only guarantees that
struct members are initialized to 0 where memset also
zeros any alignment padding.

Maybe a checkpatch rule like this?
---
 scripts/checkpatch.pl | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index e14dcdb..f79e5c9 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3237,6 +3237,13 @@ sub process {
next;
}
 
+# check for non-global initializations that could be memset
+   if ($realfile =~ m@^(drivers/net/|net/)@ &&
+   $sline =~ /^.\s+$Declare\s*$Ident\s*=\s*\{\s*0?\s*,?\s*\}/) {
+   CHK("BRACE_INITIALIZATION",
+   "Prefer an explicit memset to a declaration initialization\n" . $herecurr);
+   }
+
 # check for initialisation to aggregates open brace on the next line
if ($line =~ /^.\s*{/ &&
$prevline =~ /(?:^|[^=])=\s*$/) {





Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread David Miller
From: Raghavendra K T 
Date: Fri, 28 Aug 2015 12:09:52 +0530

> On 08/28/2015 12:08 AM, David Miller wrote:
>> From: Raghavendra K T 
>> Date: Wed, 26 Aug 2015 23:07:33 +0530
>>
>>> @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64
>>> *stats, void __percpu *mib,
>>>   static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int
>>>   attrtype,
>>>  int bytes)
>>>   {
>>> +   u64 buff[IPSTATS_MIB_MAX] = {0,};
>>> +
 ...
> hope you wanted to know the overhead than to change the current
> patch. please let me know..

I want you to change that variable initializer to an explicit memset().

The compiler is emitting a memset() or similar _anyways_.

Not because it will have any impact at all upon performance, but because
of how it looks to people trying to read and understand the code.


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Raghavendra K T

On 08/28/2015 12:08 AM, David Miller wrote:

From: Raghavendra K T 
Date: Wed, 26 Aug 2015 23:07:33 +0530


@@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
  static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
 int bytes)
  {
+   u64 buff[IPSTATS_MIB_MAX] = {0,};
+
switch (attrtype) {
case IFLA_INET6_STATS:
-   __snmp6_fill_stats64(stats, idev->stats.ipv6,


I would suggest using an explicit memset() here, it makes the overhead incurred
by this scheme clearer.



I changed the code to look like below to measure fill_stat overhead:

container creation now took: 3.012s
it was:
without patch : 6.86sec
with current patch: 3.34sec

and perf did not show the snmp6_fill_stats() parent traces.

changed code:
snmp6_fill_stats(...)
{
switch (attrtype) {
case IFLA_INET6_STATS:
put_unaligned(IPSTATS_MIB_MAX, &stats[0]);
memset(&stats[1], 0, IPSTATS_MIB_MAX-1);

//__snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, bytes,
//   offsetof(struct ipstats_mib, syncp), buff);

.
}

So in summary:
The current patch removes the major overhead in fill_stat, though
there is still per-cpu walk overhead (0.33 sec difference).

[ per-cpu walk overhead grows when creating e.g. 3k containers. ]

cache miss: there was no major difference (around 1.4%) w.r.t. the patch

Hi David,
hope you wanted to know the overhead rather than a change to the
current patch. Please let me know..


Eric, does the V2 patch look good now? Please add your ack/review.

Details:
time
=
time docker run -itd  ubuntu:15.04  /bin/bash
b6670c321b5957f004e281cbb14512deafd0c0be6a39707c2f3dc95649bbc394

real0m3.012s
user0m0.093s
sys 0m0.009s

perf:
==
# Samples: 18K of event 'cycles'
# Event count (approx.): 12838752009
# Overhead  Command  Shared Object      Symbol
#
15.29%  swapper  [kernel.kallsyms]  [k] snooze_loop 

 9.37%  docker   docker [.] scanblock 

 6.47%  docker   [kernel.kallsyms]  [k] veth_stats_one 

 3.87%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock 


 2.71%  docker   docker [.]




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Raghavendra K T

On 08/28/2015 12:08 AM, David Miller wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Date: Wed, 26 Aug 2015 23:07:33 +0530


@@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, 
void __percpu *mib,
  static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
 int bytes)
  {
+   u64 buff[IPSTATS_MIB_MAX] = {0,};
+
switch (attrtype) {
case IFLA_INET6_STATS:
-   __snmp6_fill_stats64(stats, idev-stats.ipv6,


I would suggest using an explicit memset() here, it makes the overhead incurred
by this scheme clearer.



I changed the code to look like below to measure fill_stat overhead:

container creation now took: 3.012s
it was:
without patch : 6.86sec
with current patch: 3.34sec

and perf did not show the snmp6_fill_stats() parent traces.

changed code:
snmp6_fill_stats(...)
{
switch (attrtype) {
case IFLA_INET6_STATS:
put_unaligned(IPSTATS_MIB_MAX, stats[0]);
memset(stats[1], 0, IPSTATS_MIB_MAX-1);

//__snmp6_fill_stats64(stats, idev-stats.ipv6, 
IPSTATS_MIB_MAX, bytes,
//   offsetof(struct ipstats_mib, 
syncp), buff);

.
}

So in summary:
The current patch amounts to reduction in major overhead in fill_stat,
though there is still percpu walk overhead (0.33sec difference).

[ percpu walk overead grows when create for e.g. 3k containers].

cache miss: there was no major difference (around 1.4%) w.r.t patch

Hi David,
hope you wanted to know the overhead than to change the current patch. 
please let me know..


Eric, does V2 patch look good now.. please add your ack/review

Details:
time
=
time docker run -itd  ubuntu:15.04  /bin/bash
b6670c321b5957f004e281cbb14512deafd0c0be6a39707c2f3dc95649bbc394

real0m3.012s
user0m0.093s
sys 0m0.009s

perf:
==
# Samples: 18K of event 'cycles'
# Event count (approx.): 12838752009
# Overhead  Command  Shared Object  Symbol 


#   ...  .  
#
15.29%  swapper  [kernel.kallsyms]  [k] snooze_loop 

 9.37%  docker   docker [.] scanblock 

 6.47%  docker   [kernel.kallsyms]  [k] veth_stats_one 

 3.87%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock 


 2.71%  docker   docker [.]


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/






Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Raghavendra K T
* David Miller <da...@davemloft.net> [2015-08-28 11:24:13]:

> From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> Date: Fri, 28 Aug 2015 12:09:52 +0530
> 
> > On 08/28/2015 12:08 AM, David Miller wrote:
> > > From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> > > Date: Wed, 26 Aug 2015 23:07:33 +0530
> > >
> > > > @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
> > > >  static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
> > > > 			     int bytes)
> > > >  {
> > > > +	u64 buff[IPSTATS_MIB_MAX] = {0,};
> > > > +
> > ...
> > hope you wanted to know the overhead than to change the current
> > patch. please let me know..
> 
> I want you to change that variable initializer to an explicit memset().
> 
> The compiler is emitting a memset() or similar _anyways_.
> 
> Not because it will have any impact at all upon performance, but because
> of how it looks to people trying to read and understand the code.

Hi David,
resending the patch with memset. Please let me know if you want me to
resend all the patches.

---8<---
From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Subject: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking
 all the percpu data at once

Docker container creation time linearly increased from around 1.6 sec to
7.5 sec (at 1000 containers), and perf data showed 50% overhead in
snmp_fold_field.

reason: currently __snmp6_fill_stats64 calls snmp_fold_field, which walks
through the per-cpu data of one item at a time (iteratively for around 90
items).

idea: This patch aggregates the statistics by going through all the items
of each cpu sequentially, which reduces cache misses.

Docker creation got faster by more than 2x after the patch.

Result:
                       Before   After
Docker creation time   6.836s   3.357s
cache miss             2.7%     1.38%

perf before:
50.73%  docker   [kernel.kallsyms]   [k] snmp_fold_field
 9.07%  swapper  [kernel.kallsyms]   [k] snooze_loop
 3.49%  docker   [kernel.kallsyms]   [k] veth_stats_one
 2.85%  swapper  [kernel.kallsyms]   [k] _raw_spin_lock

perf after:
10.56%  swapper  [kernel.kallsyms] [k] snooze_loop
 8.72%  docker   [kernel.kallsyms] [k] snmp_get_cpu_field
 7.59%  docker   [kernel.kallsyms] [k] veth_stats_one
 3.65%  swapper  [kernel.kallsyms] [k] _raw_spin_lock

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 net/ipv6/addrconf.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

 Change in V2:
 - Allocate stat calculation buffer on stack (Eric)
 - Use memset to zero temp buffer (David)

Thanks David and Eric for comments on V1; as both of them pointed out,
unfortunately we cannot get rid of the buffer for calculation without
resorting to unaligned operations.


diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 21c2c81..9bdfba3 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4624,16 +4624,22 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
 }
 
 static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
-					int items, int bytes, size_t syncpoff)
+					int items, int bytes, size_t syncpoff,
+					u64 *buff)
 {
-	int i;
+	int i, c;
 	int pad = bytes - sizeof(u64) * items;
 	BUG_ON(pad < 0);
 
 	/* Use put_unaligned() because stats may not be aligned for u64. */
 	put_unaligned(items, &stats[0]);
+
+	for_each_possible_cpu(c)
+		for (i = 1; i < items; i++)
+			buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);
+
 	for (i = 1; i < items; i++)
-		put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]);
+		put_unaligned(buff[i], &stats[i]);
 
 	memset(&stats[items], 0, pad);
 }
@@ -4641,10 +4647,13 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
 static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
 			     int bytes)
 {
+	u64 buff[IPSTATS_MIB_MAX];
+
 	switch (attrtype) {
 	case IFLA_INET6_STATS:
-		__snmp6_fill_stats64(stats, idev->stats.ipv6,
-				     IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp));
+		memset(buff, 0, sizeof(buff));
+		__snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, bytes,
+				     offsetof(struct ipstats_mib, syncp), buff);
 		break;
 	case IFLA_INET6_ICMP6STATS:
 		__snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes);
-- 
1.7.11.7


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread David Miller
From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Date: Fri, 28 Aug 2015 12:09:52 +0530

> On 08/28/2015 12:08 AM, David Miller wrote:
> > From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> > Date: Wed, 26 Aug 2015 23:07:33 +0530
> >
> > > @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
> > >  static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
> > > 			     int bytes)
> > >  {
> > > +	u64 buff[IPSTATS_MIB_MAX] = {0,};
> > > +
> ...
> hope you wanted to know the overhead than to change the current
> patch. please let me know..

I want you to change that variable initializer to an explicit memset().

The compiler is emitting a memset() or similar _anyways_.

Not because it will have any impact at all upon performance, but because
of how it looks to people trying to read and understand the code.


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 15:29 -0700, Eric Dumazet wrote:
> On Fri, 2015-08-28 at 14:26 -0700, Joe Perches wrote:
> 1) u64 array[XX] on stack is naturally aligned,

Of course it is.

> kzalloc() wont improve this at all. Not sure what you believe.

An alloc would only reduce stack use.

Copying into the buffer, then copying the buffer into the
skb may be desirable on some arches though.

> 2) put_unaligned() is basically a normal memory write on x86.
>  memcpy(dst,src,...) will have a problem anyway on arches that care,
> because src & dst wont have same alignment.

OK, so all the world's an x86?

On arm32, copying 288 bytes using nearly all aligned word
transfers is generally faster than using only unsigned
short transfers.

> 288 bytes on stack in a leaf function in this path is totally fine, it
> is not like we're calling ext4/xfs/nfs code after this point.

Generally true.  It's always difficult to know how much
stack has been consumed though and smaller stack frames
are generally better.

Anyway, the block copy from either the alloc'd or stack
buffer amounts only to a slight performance improvement
for arm32.  It doesn't really have much other utility.



Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 14:26 -0700, Joe Perches wrote:

> Always a possibility, but I don't think so.
> 
> > put_unaligned is happening on a space allocated from rtnetlink skb, not
> > the temp space needed to perform the per cpu folding.
> 
> That's why I suggested changing the snmp_fill_stats arguments.
> 
> If the naturally aligned allocated u64 array is used and then
> copied as a block to the rtnetlink skb, I believe there's no
> alignment issue that would require put_unaligned.

1) u64 array[XX] on stack is naturally aligned,
kzalloc() wont improve this at all. Not sure what you believe.

2) put_unaligned() is basically a normal memory write on x86.
 memcpy(dst,src,...) will have a problem anyway on arches that care,
because src & dst wont have same alignment.

288 bytes on stack in a leaf function in this path is totally fine, it
is not like we're calling ext4/xfs/nfs code after this point.




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 13:33 -0700, Eric Dumazet wrote:
> We do not bother for small struct.
> 
> Here, the array is big enough that David prefers having an explicit
> memset() so that it clearly shows that author of this code was aware of
> this.

It's 288 bytes on stack, maybe a kzalloc would be clearer too.



Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 13:55 -0700, Eric Dumazet wrote:
> On Fri, 2015-08-28 at 13:53 -0700, Joe Perches wrote:
> > On Fri, 2015-08-28 at 13:33 -0700, Eric Dumazet wrote:
> > > We do not bother for small struct.
> > > 
> > > Here, the array is big enough that David prefers having an explicit
> > > memset() so that it clearly shows that author of this code was aware of
> > > this.
> > 
> > It's 288 bytes on stack, maybe a kzalloc would be clearer too.
> 
> Could you read patch history and check why this has been rejected ?

I don't see a rejection, just that the initial
submission didn't check the allocation or add
an allocation buffer via kcalloc/kzalloc to the
inet6_fill_ifla6_attrs caller and change the
snmp6_fill_stats arguments.

It could also eliminate the put_unaligned calls.

https://lkml.org/lkml/2015/8/25/114

Was there some other thread?




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 13:53 -0700, Joe Perches wrote:
> On Fri, 2015-08-28 at 13:33 -0700, Eric Dumazet wrote:
> > We do not bother for small struct.
> > 
> > Here, the array is big enough that David prefers having an explicit
> > memset() so that it clearly shows that author of this code was aware of
> > this.
> 
> It's 288 bytes on stack, maybe a kzalloc would be clearer too.

Could you read patch history and check why this has been rejected ?

Thank you.




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 14:09 -0700, Joe Perches wrote:
> On Fri, 2015-08-28 at 13:55 -0700, Eric Dumazet wrote:
> > On Fri, 2015-08-28 at 13:53 -0700, Joe Perches wrote:
> > > On Fri, 2015-08-28 at 13:33 -0700, Eric Dumazet wrote:
> > > > We do not bother for small struct.
> > > > 
> > > > Here, the array is big enough that David prefers having an explicit
> > > > memset() so that it clearly shows that author of this code was aware of
> > > > this.
> > > 
> > > It's 288 bytes on stack, maybe a kzalloc would be clearer too.
> > 
> > Could you read patch history and check why this has been rejected ?
> 
> I don't see a rejection, just that the initial
> submission didn't check the allocation or add
> an allocation buffer via kcalloc/kzalloc to the
> inet6_fill_ifla6_attrs caller and change the
> snmp6_fill_stats arguments.
> 
> It could also eliminate the put_unaligned calls.

Not really. You do not properly read this code.

put_unaligned is happening on a space allocated from rtnetlink skb, not
the temp space needed to perform the per cpu folding.

> 
> https://lkml.org/lkml/2015/8/25/114
> 
> Was there some other thread?

Same thread

https://lkml.org/lkml/2015/8/25/476





Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 14:14 -0700, Eric Dumazet wrote:
> On Fri, 2015-08-28 at 14:09 -0700, Joe Perches wrote:
> > On Fri, 2015-08-28 at 13:55 -0700, Eric Dumazet wrote:
> > > On Fri, 2015-08-28 at 13:53 -0700, Joe Perches wrote:
> > > > It's 288 bytes on stack, maybe a kzalloc would be clearer too.
> > > 
> > > Could you read patch history and check why this has been rejected ?
> > 
> > I don't see a rejection, just that the initial
> > submission didn't check the allocation or add
> > an allocation buffer via kcalloc/kzalloc to the
> > inet6_fill_ifla6_attrs caller and change the
> > snmp6_fill_stats arguments.
> > 
> > It could also eliminate the put_unaligned calls.
> 
> Not really. You do not properly read this code.

Always a possibility, but I don't think so.

> put_unaligned is happening on a space allocated from rtnetlink skb, not
> the temp space needed to perform the per cpu folding.

That's why I suggested changing the snmp_fill_stats arguments.

If the naturally aligned allocated u64 array is used and then
copied as a block to the rtnetlink skb, I believe there's no
alignment issue that would require put_unaligned.

Do I still miss something?

> https://lkml.org/lkml/2015/8/25/476

I read that.



Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 12:20 -0700, Joe Perches wrote:

> I don't read it as particularly different.
> 
> There are > 100 uses of the not quite a memset initialization
> style using = { 0, } in net/
> 
> $ git grep -E "=\s*\{\s*0?\s*,?\s*\}" net | wc -l
> 138
> 
> There is a difference though if a struct is copied to
> user-space as a {} initialization only guarantees that
> struct members are initialized to 0 where memset also
> zeros any alignment padding.

We do not bother for small struct.

Here, the array is big enough that David prefers having an explicit
memset() so that it clearly shows that author of this code was aware of
this.





Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 11:24 -0700, David Miller wrote:
> From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> Date: Fri, 28 Aug 2015 12:09:52 +0530
> 
> > On 08/28/2015 12:08 AM, David Miller wrote:
> > > From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> > > Date: Wed, 26 Aug 2015 23:07:33 +0530
> > >
> > > > @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
> > > >  static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
> > > > 			     int bytes)
> > > >  {
> > > > +	u64 buff[IPSTATS_MIB_MAX] = {0,};
> > > > +
> > ...
> > hope you wanted to know the overhead than to change the current
> > patch. please let me know..
> 
> I want you to change that variable initializer to an explicit memset().
> 
> The compiler is emitting a memset() or similar _anyways_.
> 
> Not because it will have any impact at all upon performance, but because
> of how it looks to people trying to read and understand the code.

I don't read it as particularly different.

There are > 100 uses of the not quite a memset initialization
style using = { 0, } in net/

$ git grep -E "=\s*\{\s*0?\s*,?\s*\}" net | wc -l
138

There is a difference though if a struct is copied to
user-space as a {} initialization only guarantees that
struct members are initialized to 0 where memset also
zeros any alignment padding.

Maybe a checkpatch rule like this?
---
 scripts/checkpatch.pl | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index e14dcdb..f79e5c9 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3237,6 +3237,13 @@ sub process {
 			next;
 		}
 
+# check for non-global initializations that could be memset
+		if ($realfile =~ m@^(drivers/net/|net/)@ &&
+		    $sline =~ /^.\s+$Declare\s*$Ident\s*=\s*\{\s*0?\s*,?\s*\}/) {
+			CHK("BRACE_INITIALIZATION",
+			    "Prefer an explicit memset to a declaration initialization\n" . $herecurr);
+		}
+
 # check for initialisation to aggregates open brace on the next line
 		if ($line =~ /^.\s*{/ &&
 		    $prevline =~ /(?:^|[^=])=\s*$/) {





Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 16:12 -0700, Joe Perches wrote:

> Generally true.  It's always difficult to know how much
> stack has been consumed though and smaller stack frames
> are generally better.

Calling kmalloc(288, GFP_KERNEL) might use way more than 288 bytes in
kernel stack on 64 bit arch.

__slab_alloc() itself for example uses 208 bytes on stack, so add all
others, and you might go above 500 bytes.

So for a _leaf_ function, it is better to declare an automatic variable,
as you in fact reduce max stack depth.

Not only it uses less kernel stack, it is also way faster, as you avoid
kmalloc()/kfree() overhead and reuse probably already hot cache lines in
kernel stack.





Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 17:06 -0700, Eric Dumazet wrote:
> On Fri, 2015-08-28 at 16:12 -0700, Joe Perches wrote: 
> > Generally true.  It's always difficult to know how much
> > stack has been consumed though and smaller stack frames
> > are generally better.
[] 
> So for a _leaf_ function, it is better to declare an automatic variable,
> as you in fact reduce max stack depth.

That of course depends on what a leaf is and
whether or not any other function call in the
leaf consumes stack.

inet6_fill_ifla6_attrs does call other functions
(none of which has the stack frame size of k.alloc)

> Not only it uses less kernel stack, it is also way faster, as you avoid
> kmalloc()/kfree() overhead and reuse probably already hot cache lines in
> kernel stack.

yup.

You'll also never neglect to free stack like the
original RFC patch neglected to free the alloc.

cheers, Joe



Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 17:35 -0700, Joe Perches wrote:

> That of course depends on what a leaf is and
> whether or not any other function call in the
> leaf consumes stack.
> 
> inet6_fill_ifla6_attrs does call other functions
> (none of which has the stack frame size of k.alloc)

Just define/use this automatic array in the damn leaf function.

That should not be hard, and maybe no one will complain and we can
work on more complex issues.




Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-27 Thread David Miller
From: Raghavendra K T 
Date: Wed, 26 Aug 2015 23:07:33 +0530

> @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
>  static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
> 			     int bytes)
>  {
> +	u64 buff[IPSTATS_MIB_MAX] = {0,};
> +
> 	switch (attrtype) {
> 	case IFLA_INET6_STATS:
> -		__snmp6_fill_stats64(stats, idev->stats.ipv6,

I would suggest using an explicit memset() here, it makes the overhead incurred
by this scheme clearer.

Thanks.

