Re: [PATCH] lib: Make _find_next_bit helper function inline

2015-08-30 Thread Rasmus Villemoes
I've lost track of what's up and down in this, but now that I look at
it again, let me throw in my two observations of stupid gcc behaviour:
For the current code, both Debian's gcc (4.7) and gcc 5.1 partially inline
_find_next_bit, namely the "if (!nbits || start >= nbits)" test. I know gcc
does this to avoid a function call, but in this case the early-return
condition is unlikely, so there's not much to gain. Moreover, it fails
to optimize the test to simply "if (start >= nbits)" - everything being
unsigned, the two are obviously equivalent.
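
To see the equivalence, here's a minimal standalone sketch (plain C;
unsigned long, as in lib/find_bit.c):

#include <stdbool.h>

/* If nbits == 0, then start >= nbits holds for every unsigned start,
 * so the !nbits disjunct adds nothing. */
static bool early_return_verbose(unsigned long nbits, unsigned long start)
{
	return !nbits || start >= nbits;
}

static bool early_return_simple(unsigned long nbits, unsigned long start)
{
	return start >= nbits;	/* identical truth table for unsigned inputs */
}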

Rasmus


Re: [PATCH] lib: Make _find_next_bit helper function inline

2015-08-29 Thread Yury


On 24.08.2015 01:53, Alexey Klimov wrote:

Hi Cassidy,


On Wed, Jul 29, 2015 at 11:40 PM, Cassidy Burden cbur...@codeaurora.org wrote:

I changed the test module to now set the entire array to all 0/1s and
only flip a few bits. There appears to be a performance benefit, but
it's only 2-3% better (if that). If the main benefit of the original
patch was to save space then inlining definitely doesn't seem worth the
small gains in real use cases.

find_next_zero_bit (us)
old    new    inline
14440  17080  17086
4779   5181   5069
10844  12720  12746
9642   11312  11253
3858   3818   3668
10540  12349  12307
12470  14716  14697
5403   6002   5942
2282   1820   1418
13632  16056  15998
11048  13019  13030
6025   6790   6706
13255  15586  15605
3038   2744   2539
10353  12219  12239
10498  12251  12322
14767  17452  17454
12785  15048  15052
1655   1034   691
9924   11611  11558

find_next_bit (us)
old    new    inline
8535   9936   9667
14666  17372  16880
2315   1799   1355
6578   9092   8806
6548   7558   7274
9448   11213  10821
3467   3497   3449
2719   3079   2911
6115   7989   7796
13582  16113  15643
4643   4946   4766
3406   3728   3536
7118   9045   8805
3174   3011   2701
13300  16780  16252
14285  16848  16330
11583  13669  13207
13063  15455  14989
12661  14955  14500
12068  14166  13790

On 7/29/2015 6:30 AM, Alexey Klimov wrote:


I will re-check on another machine. It's really interesting if
__always_inline makes things better for aarch64 and worse for x86_64. It
would be nice if someone could check it on x86_64 too.



Very odd, this may be related to the other compiler optimizations Yuri
mentioned?


It's better to ask Yury, I hope he can answer some day.

Do you need this re-checked (with more iterations, or on other machines)?



Hi, Alexey, Cassidy,

(restoring Rasmus, George)

I found no difference between original and inline versions for x86_64:
(Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz)

find_next_bit        find_next_zero_bit
old  new  inline     old  new  inline
24   27   28         22   28   28
24   27   28         23   27   28
24   27   28         23   27   28

Inspecting the assembler code, I found that GCC wants to keep the helper
separate even if you provide '__always_inline':

inline <find_next_bit_new>:

280:	cmp    %rdx,%rsi
283:	jbe    295 <find_next_bit_new+0x15>
285:	test   %rsi,%rsi
288:	je     295 <find_next_bit_new+0x15>
28a:	push   %rbp
28b:	mov    %rsp,%rbp
28e:	callq  0 <find_next_bit_new.part.0>
293:	pop    %rbp
294:	retq
295:	mov    %rsi,%rax
298:	retq
299:	nopl   0x0(%rax)

current <find_next_bit_new>:

210:	cmp    %rdx,%rsi
213:	jbe    227 <find_next_bit_new+0x17>
215:	test   %rsi,%rsi
218:	je     227 <find_next_bit_new+0x17>
21a:	push   %rbp
21b:	xor    %ecx,%ecx
21d:	mov    %rsp,%rbp
220:	callq  0 <_find_next_bit.part.0>
225:	pop    %rbp
226:	retq
227:	mov    %rsi,%rax
22a:	retq
22b:	nopl   0x0(%rax,%rax,1)

So it looks like x86_64 gcc (at least 4.9.2 as built for Ubuntu) ignores
the '__always_inline' hint just as it ignores 'inline'. Worse, with
__always_inline the compiler does something not really smart: it introduces
<find_next_bit_new.part.0> and <find_next_zero_bit_new.part.1> helpers
and so increases text size from 0x250 to 0x2b9 bytes, but doesn't actually
inline anything, so the push/pop and call/ret are still there. I don't like
inline, as I already said, but I believe that disabling it completely is a
bad idea. Maybe someone knows another trick to make inline work?
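
For reference, the __always_inline variant compared above is essentially
the helper below - a sketch reconstructed from the then-current
lib/find_bit.c with the stronger qualifier, not the exact object I
disassembled:

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/kernel.h>

/* Common helper for find_next_bit() and find_next_zero_bit(); "invert"
 * is XORed with each fetched word before searching it for one bits. */
static __always_inline unsigned long _find_next_bit(const unsigned long *addr,
		unsigned long nbits, unsigned long start, unsigned long invert)
{
	unsigned long tmp;

	if (!nbits || start >= nbits)
		return nbits;

	tmp = addr[start / BITS_PER_LONG] ^ invert;

	/* Handle 1st word. */
	tmp &= BITMAP_FIRST_WORD_MASK(start);
	start = round_down(start, BITS_PER_LONG);

	while (!tmp) {
		start += BITS_PER_LONG;
		if (start >= nbits)
			return nbits;

		tmp = addr[start / BITS_PER_LONG] ^ invert;
	}

	return min(start + __ffs(tmp), nbits);
}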


Re: [PATCH] lib: Make _find_next_bit helper function inline

2015-08-23 Thread Alexey Klimov
Hi Cassidy,


On Wed, Jul 29, 2015 at 11:40 PM, Cassidy Burden cbur...@codeaurora.org wrote:
> I changed the test module to now set the entire array to all 0/1s and
> only flip a few bits. There appears to be a performance benefit, but
> it's only 2-3% better (if that). If the main benefit of the original
> patch was to save space then inlining definitely doesn't seem worth the
> small gains in real use cases.
>
> find_next_zero_bit (us)
> old    new    inline
> 14440  17080  17086
> 4779   5181   5069
> 10844  12720  12746
> 9642   11312  11253
> 3858   3818   3668
> 10540  12349  12307
> 12470  14716  14697
> 5403   6002   5942
> 2282   1820   1418
> 13632  16056  15998
> 11048  13019  13030
> 6025   6790   6706
> 13255  15586  15605
> 3038   2744   2539
> 10353  12219  12239
> 10498  12251  12322
> 14767  17452  17454
> 12785  15048  15052
> 1655   1034   691
> 9924   11611  11558
>
> find_next_bit (us)
> old    new    inline
> 8535   9936   9667
> 14666  17372  16880
> 2315   1799   1355
> 6578   9092   8806
> 6548   7558   7274
> 9448   11213  10821
> 3467   3497   3449
> 2719   3079   2911
> 6115   7989   7796
> 13582  16113  15643
> 4643   4946   4766
> 3406   3728   3536
> 7118   9045   8805
> 3174   3011   2701
> 13300  16780  16252
> 14285  16848  16330
> 11583  13669  13207
> 13063  15455  14989
> 12661  14955  14500
> 12068  14166  13790
>
> On 7/29/2015 6:30 AM, Alexey Klimov wrote:
>>
>> I will re-check on another machine. It's really interesting if
>> __always_inline makes things better for aarch64 and worse for x86_64. It
>> would be nice if someone could check it on x86_64 too.
>
>
> Very odd, this may be related to the other compiler optimizations Yuri
> mentioned?

It's better to ask Yury, I hope he can answer some day.

Do you need this re-checked (with more iterations, or on other machines)?

-- 
Best regards, Klimov Alexey


Re: [PATCH] lib: Make _find_next_bit helper function inline

2015-07-29 Thread Cassidy Burden

I changed the test module to now set the entire array to all 0/1s and
only flip a few bits. There appears to be a performance benefit, but
it's only 2-3% better (if that). If the main benefit of the original
patch was to save space then inlining definitely doesn't seem worth the
small gains in real use cases.
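
Roughly, the changed setup looks like the sketch below (illustrative only,
not the actual module source - the names are made up):

#include <linux/bitmap.h>
#include <linux/bitops.h>

#define TEST_NBITS (1024 * 1024)	/* illustrative size */

static DECLARE_BITMAP(test_map, TEST_NBITS);

static void setup_sparse_map(void)
{
	/* start from all zeros ... */
	bitmap_zero(test_map, TEST_NBITS);

	/* ... then flip only a few bits, so each find_next_bit() call
	 * scans long runs of zero words, as in real usage */
	__set_bit(0, test_map);
	__set_bit(TEST_NBITS / 2, test_map);
	__set_bit(TEST_NBITS - 1, test_map);
}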

find_next_zero_bit (us)
old    new    inline
14440  17080  17086
4779   5181   5069
10844  12720  12746
9642   11312  11253
3858   3818   3668
10540  12349  12307
12470  14716  14697
5403   6002   5942
2282   1820   1418
13632  16056  15998
11048  13019  13030
6025   6790   6706
13255  15586  15605
3038   2744   2539
10353  12219  12239
10498  12251  12322
14767  17452  17454
12785  15048  15052
1655   1034   691
9924   11611  11558

find_next_bit (us)
old    new    inline
8535   9936   9667
14666  17372  16880
2315   1799   1355
6578   9092   8806
6548   7558   7274
9448   11213  10821
3467   3497   3449
2719   3079   2911
6115   7989   7796
13582  16113  15643
4643   4946   4766
3406   3728   3536
7118   9045   8805
3174   3011   2701
13300  16780  16252
14285  16848  16330
11583  13669  13207
13063  15455  14989
12661  14955  14500
12068  14166  13790

On 7/29/2015 6:30 AM, Alexey Klimov wrote:

I will re-check on another machine. It's really interesting if
__always_inline makes things better for aarch64 and worse for x86_64. It
would be nice if someone could check it on x86_64 too.


Very odd, this may be related to the other compiler optimizations Yuri
mentioned?
--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH] lib: Make _find_next_bit helper function inline

2015-07-29 Thread Alexey Klimov
On Tue, 2015-07-28 at 14:45 -0700, Andrew Morton wrote:
> On Wed, 29 Jul 2015 00:23:18 +0300 Yury yury.no...@gmail.com wrote:
> 
> > But I think before/after numbers for x86 are needed as well.
> 
> That would be nice.
> 
> > And why don't you consider '__always_inline__'? Plain inline is only a
> > hint and guarantees nothing.
> 
> Yup.  My x86_64 compiler just ignores the "inline".  When I use
> __always_inline, find_bit.o's text goes from 776 bytes to 863. 
> Hopefully we get something in return for that bloat!

On my x86_64 (core-i5 something, with cpufreq disabled) I got the
following numbers:
find_next_zero_bit
old new __always_inline
20  21  22  
20  21  22
20  22  23
21  21  22
21  21  23
20  21  22
20  21  23
21  22  23
20  22  22
21  21  22

find_next_bit
old new __always_inline
19  21  24
19  22  24
19  22  24
19  21  24
20  22  24
19  21  23
19  21  23
20  21  24
19  22  24
19  21  24

I will re-check on another machine. It's really interesting if
__always_inline makes things better for aarch64 and worse for x86_64. It
would be nice if someone could check it on x86_64 too.

Best regards,
Alexey Klimov.

> Also, if _find_next_bit() benefits from this then _find_next_bit_le()
> will presumably also benefit.
> 
> 




Re: [PATCH] lib: Make _find_next_bit helper function inline

2015-07-28 Thread Andrew Morton
On Wed, 29 Jul 2015 00:23:18 +0300 Yury yury.no...@gmail.com wrote:

> But I think before/after numbers for x86 are needed as well.

That would be nice.

> And why don't you consider '__always_inline__'? Plain inline is only a
> hint and guarantees nothing.

Yup.  My x86_64 compiler just ignores the "inline".  When I use
__always_inline, find_bit.o's text goes from 776 bytes to 863. 
Hopefully we get something in return for that bloat!


Also, if _find_next_bit() benefits from this then _find_next_bit_le()
will presumably also benefit.




Re: [PATCH] lib: Make _find_next_bit helper function inline

2015-07-28 Thread Yury

On 29.07.2015 00:23, Yury wrote:

On 28.07.2015 22:09, Cassidy Burden wrote:
I've tested Yury Norov's find_bit reimplementation with the test_find_bit
module (https://lkml.org/lkml/2015/3/8/141) and measured about 35-40%
performance degradation on arm64 3.18 run with fixed CPU frequency.

The performance degradation appears to be caused by the
helper function _find_next_bit. After inlining this function into
find_next_bit and find_next_zero_bit I get slightly better performance
than the old implementation:

find_next_zero_bit      find_next_bit
old  new  inline        old  new  inline
26   36   24            24   33   23
25   36   24            24   33   23
26   36   24            24   33   23
25   36   24            24   33   23
25   36   24            24   33   23
25   37   24            24   33   23
25   37   24            24   33   23
25   37   24            24   33   23
25   36   24            24   33   23
25   37   24            24   33   23

Signed-off-by: Cassidy Burden cbur...@codeaurora.org
Cc: Alexey Klimov klimov.li...@gmail.com
Cc: David S. Miller da...@davemloft.net
Cc: Daniel Borkmann dbork...@redhat.com
Cc: Hannes Frederic Sowa han...@stressinduktion.org
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Mark Salter msal...@redhat.com
Cc: AKASHI Takahiro takahiro.aka...@linaro.org
Cc: Thomas Graf tg...@suug.ch
Cc: Valentin Rothberg valentinrothb...@gmail.com
Cc: Chris Wilson ch...@chris-wilson.co.uk
---
  lib/find_bit.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18072ea..d0e04f9 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -28,7 +28,7 @@
   * find_next_zero_bit.  The difference is the "invert" argument, which
   * is XORed with each fetched word before searching it for one bits.
   */
-static unsigned long _find_next_bit(const unsigned long *addr,
+static inline unsigned long _find_next_bit(const unsigned long *addr,
  unsigned long nbits, unsigned long start, unsigned long invert)
  {
  unsigned long tmp;

Hi Cassidi,

First of all, I'm really surprised that there's no assembler implementation
of the find_bit routines for aarch64. Aarch32 has ones...

I thought about inlining the helper, but decided not to, for three reasons:

1. The test is not too realistic. https://lkml.org/lkml/2015/2/1/224
The typical usage pattern is to look for a single bit or a range of bits,
so in practice nobody calls find_next_bit a thousand times in a row.

2. It's way more important to fit the functions into as few cache lines as
possible. https://lkml.org/lkml/2015/2/12/114
In this case, inlining almost doubles cache line consumption...

3. Inlining prevents the compiler from making some other possible
optimizations. It's probable that in a real module the compiler will
inline the callers of _find_next_bit, and the final output will be better.
I don't like telling the compiler how to do its work.

Nevertheless, if this is your real use case, and inlining helps, I'm OK
with it.

But I think before/after numbers for x86 are needed as well.
And why don't you consider '__always_inline__'? Plain inline is only a
hint and guarantees nothing.


(Sorry for the typo in your name. Call me Yuri next time.)

Adding Rasmus and George to CC



Re: [PATCH] lib: Make _find_next_bit helper function inline

2015-07-28 Thread Yury

On 28.07.2015 22:09, Cassidy Burden wrote:

I've tested Yury Norov's find_bit reimplementation with the test_find_bit
module (https://lkml.org/lkml/2015/3/8/141) and measured about 35-40%
performance degradation on arm64 3.18 run with fixed CPU frequency.

The performance degradation appears to be caused by the
helper function _find_next_bit. After inlining this function into
find_next_bit and find_next_zero_bit I get slightly better performance
than the old implementation:

find_next_zero_bit      find_next_bit
old  new  inline        old  new  inline
26   36   24            24   33   23
25   36   24            24   33   23
26   36   24            24   33   23
25   36   24            24   33   23
25   36   24            24   33   23
25   37   24            24   33   23
25   37   24            24   33   23
25   37   24            24   33   23
25   36   24            24   33   23
25   37   24            24   33   23

Signed-off-by: Cassidy Burden cbur...@codeaurora.org
Cc: Alexey Klimov klimov.li...@gmail.com
Cc: David S. Miller da...@davemloft.net
Cc: Daniel Borkmann dbork...@redhat.com
Cc: Hannes Frederic Sowa han...@stressinduktion.org
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Mark Salter msal...@redhat.com
Cc: AKASHI Takahiro takahiro.aka...@linaro.org
Cc: Thomas Graf tg...@suug.ch
Cc: Valentin Rothberg valentinrothb...@gmail.com
Cc: Chris Wilson ch...@chris-wilson.co.uk
---
  lib/find_bit.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18072ea..d0e04f9 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -28,7 +28,7 @@
   * find_next_zero_bit.  The difference is the "invert" argument, which
   * is XORed with each fetched word before searching it for one bits.
   */
-static unsigned long _find_next_bit(const unsigned long *addr,
+static inline unsigned long _find_next_bit(const unsigned long *addr,
		unsigned long nbits, unsigned long start, unsigned long invert)
 {
	unsigned long tmp;


Hi Cassidi,

First of all, I'm really surprised that there's no assembler implementation
of the find_bit routines for aarch64. Aarch32 has ones...

I thought about inlining the helper, but decided not to, for three reasons:

1. The test is not too realistic. https://lkml.org/lkml/2015/2/1/224
The typical usage pattern is to look for a single bit or a range of bits,
so in practice nobody calls find_next_bit a thousand times in a row.

2. It's way more important to fit the functions into as few cache lines as
possible. https://lkml.org/lkml/2015/2/12/114
In this case, inlining almost doubles cache line consumption...

3. Inlining prevents the compiler from making some other possible
optimizations. It's probable that in a real module the compiler will
inline the callers of _find_next_bit, and the final output will be better.
I don't like telling the compiler how to do its work.

Nevertheless, if this is your real use case, and inlining helps, I'm OK
with it.

But I think before/after numbers for x86 are needed as well.
And why don't you consider '__always_inline__'? Plain inline is only a
hint and guarantees nothing.
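
For context, the strong variant is defined along these lines (paraphrasing
include/linux/compiler.h):

/* Plain "inline" is only a suggestion to gcc; __always_inline attaches
 * the attribute that makes inlining mandatory. */
#define __always_inline inline __attribute__((always_inline))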


[PATCH] lib: Make _find_next_bit helper function inline

2015-07-28 Thread Cassidy Burden
I've tested Yury Norov's find_bit reimplementation with the test_find_bit
module (https://lkml.org/lkml/2015/3/8/141) and measured about 35-40%
performance degradation on arm64 3.18 run with fixed CPU frequency.

The performance degradation appears to be caused by the
helper function _find_next_bit. After inlining this function into
find_next_bit and find_next_zero_bit I get slightly better performance
than the old implementation:

find_next_zero_bit      find_next_bit
old  new  inline        old  new  inline
26   36   24            24   33   23
25   36   24            24   33   23
26   36   24            24   33   23
25   36   24            24   33   23
25   36   24            24   33   23
25   37   24            24   33   23
25   37   24            24   33   23
25   37   24            24   33   23
25   36   24            24   33   23
25   37   24            24   33   23
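
For reference, the timing loop in the test module looks roughly like the
sketch below (illustrative only, not the actual test_find_bit source -
the names are made up):

#include <linux/module.h>
#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/ktime.h>

#define BENCH_NBITS (1024 * 1024)	/* illustrative size */

static DECLARE_BITMAP(bench_map, BENCH_NBITS);

static int __init bench_init(void)
{
	unsigned long bit, found = 0;
	ktime_t t0, t1;

	bitmap_fill(bench_map, BENCH_NBITS);

	t0 = ktime_get();
	/* walks the bitmap via find_next_bit() under the hood */
	for_each_set_bit(bit, bench_map, BENCH_NBITS)
		found++;
	t1 = ktime_get();

	pr_info("find_next_bit: %lld us (%lu bits found)\n",
		ktime_us_delta(t1, t0), found);
	return 0;
}

static void __exit bench_exit(void)
{
}

module_init(bench_init);
module_exit(bench_exit);
MODULE_LICENSE("GPL");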

Signed-off-by: Cassidy Burden cbur...@codeaurora.org
Cc: Alexey Klimov klimov.li...@gmail.com
Cc: David S. Miller da...@davemloft.net
Cc: Daniel Borkmann dbork...@redhat.com
Cc: Hannes Frederic Sowa han...@stressinduktion.org
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Mark Salter msal...@redhat.com
Cc: AKASHI Takahiro takahiro.aka...@linaro.org
Cc: Thomas Graf tg...@suug.ch
Cc: Valentin Rothberg valentinrothb...@gmail.com
Cc: Chris Wilson ch...@chris-wilson.co.uk
---
 lib/find_bit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18072ea..d0e04f9 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -28,7 +28,7 @@
  * find_next_zero_bit.  The difference is the "invert" argument, which
  * is XORed with each fetched word before searching it for one bits.
  */
-static unsigned long _find_next_bit(const unsigned long *addr,
+static inline unsigned long _find_next_bit(const unsigned long *addr,
 		unsigned long nbits, unsigned long start, unsigned long invert)
 {
 	unsigned long tmp;
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
