Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Linus Torvalds
On Tue, Jun 5, 2018 at 4:20 PM Alexey Dobriyan  wrote:
>
> This is Broadwell Xeon E5-2620 v4.
> Which is somewhat strange indeed because it should be modern enough.

Yeah, odd.

Here's the benchmark I used:

  #define SIZE 4068

  int main(int argc, char **argv)
  {
int i;
unsigned char buffer[SIZE], *p;

for (i = 0; i < 100; i++)
asm volatile(
"1: movq %[zero],(%[mem]); addq %[eight],%[mem]; decl
%[count]; jne 1b"
: [mem] "=r" (p)
: [zero] "i" (0l), [eight] "i" (8l),
 "0" (buffer), [count] "r" (SIZE/8));
  }

where you can change that "i" for [zero] and [eight] to be "r" to get
the register version.

I just timed it, because I'm lazy and perf seemed to be overkill.

It might be some very specific loop buffer issue or something.

Or maybe my benchmark above is broken, I didn't really verify that the
end result was any good (I just did an objdump to verify the asm code
superficially).

 Linus


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Linus Torvalds
On Tue, Jun 5, 2018 at 4:20 PM Alexey Dobriyan  wrote:
>
> This is Broadwell Xeon E5-2620 v4.
> Which is somewhat strange indeed because it should be modern enough.

Yeah, odd.

Here's the benchmark I used:

  #define SIZE 4068

  int main(int argc, char **argv)
  {
int i;
unsigned char buffer[SIZE], *p;

for (i = 0; i < 100; i++)
asm volatile(
"1: movq %[zero],(%[mem]); addq %[eight],%[mem]; decl
%[count]; jne 1b"
: [mem] "=r" (p)
: [zero] "i" (0l), [eight] "i" (8l),
 "0" (buffer), [count] "r" (SIZE/8));
  }

where you can change that "i" for [zero] and [eight] to be "r" to get
the register version.

I just timed it, because I'm lazy and perf seemed to be overkill.

It might be some very specific loop buffer issue or something.

Or maybe my benchmark above is broken, I didn't really verify that the
end result was any good (I just did an objdump to verify the asm code
superficially).

 Linus


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Alexey Dobriyan
On Tue, Jun 05, 2018 at 04:04:37PM -0700, Linus Torvalds wrote:
> On Tue, Jun 5, 2018 at 4:01 PM Linus Torvalds
>  wrote:
> >
> > On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan  wrote:
> > >
> > > On my potato performance increase is 33%, sheesh.
> > > And CPU starts doing 3 instructions per cycle vs 2.
> >
> > Whee. That's a shockingly big difference. On my CPU (i7-6700K) it
> > makes absolutely no difference whether the values are integers or in
> > registers.
> 
> In fact, looking at Agner Fog's instruction lists, I don't see any CPU
> where it would make a difference, except for the P4 (where the
> immediate looks like it's a bad idea because it's an extra uop, but it
> might pack fine and not be noticeable).
> 
> But maybe I'm missing something subtle. What CPU, out of morbid interest?

This is Broadwell Xeon E5-2620 v4.
Which is somewhat strange indeed because it should be modern enough.


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Alexey Dobriyan
On Tue, Jun 05, 2018 at 04:04:37PM -0700, Linus Torvalds wrote:
> On Tue, Jun 5, 2018 at 4:01 PM Linus Torvalds
>  wrote:
> >
> > On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan  wrote:
> > >
> > > On my potato performance increase is 33%, sheesh.
> > > And CPU starts doing 3 instructions per cycle vs 2.
> >
> > Whee. That's a shockingly big difference. On my CPU (i7-6700K) it
> > makes absolutely no difference whether the values are integers or in
> > registers.
> 
> In fact, looking at Agner Fog's instruction lists, I don't see any CPU
> where it would make a difference, except for the P4 (where the
> immediate looks like it's a bad idea because it's an extra uop, but it
> might pack fine and not be noticeable).
> 
> But maybe I'm missing something subtle. What CPU, out of morbid interest?

This is Broadwell Xeon E5-2620 v4.
Which is somewhat strange indeed because it should be modern enough.


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Linus Torvalds
On Tue, Jun 5, 2018 at 4:01 PM Linus Torvalds
 wrote:
>
> On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan  wrote:
> >
> > On my potato performance increase is 33%, sheesh.
> > And CPU starts doing 3 instructions per cycle vs 2.
>
> Whee. That's a shockingly big difference. On my CPU (i7-6700K) it
> makes absolutely no difference whether the values are integers or in
> registers.

In fact, looking at Agner Fog's instruction lists, I don't see any CPU
where it would make a difference, except for the P4 (where the
immediate looks like it's a bad idea because it's an extra uop, but it
might pack fine and not be noticeable).

But maybe I'm missing something subtle. What CPU, out of morbid interest?

   Linus


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Linus Torvalds
On Tue, Jun 5, 2018 at 4:01 PM Linus Torvalds
 wrote:
>
> On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan  wrote:
> >
> > On my potato performance increase is 33%, sheesh.
> > And CPU starts doing 3 instructions per cycle vs 2.
>
> Whee. That's a shockingly big difference. On my CPU (i7-6700K) it
> makes absolutely no difference whether the values are integers or in
> registers.

In fact, looking at Agner Fog's instruction lists, I don't see any CPU
where it would make a difference, except for the P4 (where the
immediate looks like it's a bad idea because it's an extra uop, but it
might pack fine and not be noticeable).

But maybe I'm missing something subtle. What CPU, out of morbid interest?

   Linus


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Linus Torvalds
On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan  wrote:
>
> On my potato performance increase is 33%, sheesh.
> And CPU starts doing 3 instructions per cycle vs 2.

Whee. That's a shockingly big difference. On my CPU (i7-6700K) it
makes absolutely no difference whether the values are integers or in
registers.

   Linus


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Linus Torvalds
On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan  wrote:
>
> On my potato performance increase is 33%, sheesh.
> And CPU starts doing 3 instructions per cycle vs 2.

Whee. That's a shockingly big difference. On my CPU (i7-6700K) it
makes absolutely no difference whether the values are integers or in
registers.

   Linus


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Alexey Dobriyan
On Tue, Jun 05, 2018 at 10:32:55AM -0700, Linus Torvalds wrote:
> On Tue, Jun 5, 2018 at 10:22 AM Alexey Dobriyan  wrote:
> >
> > Tested? :^) I had P4 maybe ~15(?) years ago.
> 
> Did you EVEN test it on what you have today?
> 
> Do you have any numbers at all, in other words?
> 
> Micro-optimizations need numbers. Otherwise they aren't
> micro-optimizations, they are just "change code randomly".

On my potato performance increase is 33%, sheesh.
And CPU starts doing 3 instructions per cycle vs 2.

benchmark is "clear_user(p + 4096 - 4068, 4068)"
4068 comes from booting Debian 8 with printk.

f0(4068) (old clear_user)

$ taskset -c 15 perf stat -r 16 ./a.out

 Performance counter stats for './a.out' (16 runs):

   2033.189084  task-clock (msec) #1.000 CPUs utilized  
  ( +-  0.41% )
 2  context-switches  #0.001 K/sec  
  ( +- 11.11% )
 0  cpu-migrations#0.000 K/sec
46  page-faults   #0.023 K/sec  
  ( +-  0.91% )
 4,268,425,486  cycles#2.099 GHz
  ( +-  0.41% )
 8,672,326,256  instructions  #2.03  insn per cycle 
  ( +-  0.00% )
 2,169,900,710  branches  # 1067.240 M/sec  
  ( +-  0.00% )
 4,226,258  branch-misses #0.19% of all branches
  ( +-  0.01% )

   2.033700109 seconds time elapsed 
 ( +-  0.41% )

f1(4068) (new clear_user)
$ taskset -c 15 perf stat -r 16 ./a.out

 Performance counter stats for './a.out' (16 runs):

   1345.149992  task-clock (msec) #1.000 CPUs utilized  
  ( +-  0.01% )
 2  context-switches  #0.002 K/sec  
  ( +-  8.35% )
 0  cpu-migrations#0.000 K/sec
46  page-faults   #0.034 K/sec  
  ( +-  0.82% )
 2,823,965,728  cycles#2.099 GHz
  ( +-  0.01% )
 8,661,733,733  instructions  #3.07  insn per cycle 
  ( +-  0.00% )
 2,169,437,410  branches  # 1612.785 M/sec  
  ( +-  0.00% )
 4,216,469  branch-misses #0.19% of all branches
  ( +-  0.01% )

   1.345375114 seconds time elapsed 
 ( +-  0.01% )

-
CFLAGS = -Wall -fno-strict-aliasing -fno-common -fshort-wchar -std=gnu89 
-mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 
-falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 
-mskip-rax-setup -mtune=generic -mno-red-zone -funit-at-a-time -pipe 
-Wno-sign-compare -fno-asynchronous-unwind-tables 
-fno-delete-null-pointer-checks -O2 --param=allow-store-data-races=0 
-fno-stack-protector -fomit-frame-pointer -fno-var-tracking-assignments -g 
-femit-struct-debug-baseonly -fno-var-tracking -fno-strict-overflow 
-fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack


0780 :
 780:   movrax,rsi
 783:   movrcx,rsi
 786:   xoredx,edx
 788:   andeax,0x7
 78b:   shrrcx,0x3
 78f:   movesi,0x8
 794:   test   rcx,rcx
 797:   je 7a3 
 799:   movQWORD PTR [rdi],rdx
 79c:   addrdi,rsi
 79f:   dececx
 7a1:   jne799 
 7a3:   movrcx,rax
 7a6:   test   ecx,ecx
 7a8:   je 7b3 
 7aa:   movBYTE PTR [rdi],dl
 7ac:   incrdi
 7af:   dececx
 7b1:   jne7aa 
 7b3:   movrax,rcx
 7b6:   ret

07c0 :
 7c0:   movrax,rsi
 7c3:   shrrsi,0x3
 7c7:   andeax,0x7
 7ca:   movrcx,rsi
 7cd:   test   rcx,rcx
 7d0:   je 7e1 
 7d2:   movQWORD PTR [rdi],0x0
 7d9:   addrdi,0x8
 7dd:   dececx
 7df:   jne7d2 
 7e1:   movrcx,rax
 7e4:   test   ecx,ecx
 7e6:   je 7f2 
 7e8:   movBYTE PTR [rdi],0x0
 7eb:   incrdi
 7ee:   dececx
 7f0:   jne7e8 
 7f2:   movrax,rcx
 7f5:   ret


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Alexey Dobriyan
On Tue, Jun 05, 2018 at 10:32:55AM -0700, Linus Torvalds wrote:
> On Tue, Jun 5, 2018 at 10:22 AM Alexey Dobriyan  wrote:
> >
> > Tested? :^) I had P4 maybe ~15(?) years ago.
> 
> Did you EVEN test it on what you have today?
> 
> Do you have any numbers at all, in other words?
> 
> Micro-optimizations need numbers. Otherwise they aren't
> micro-optimizations, they are just "change code randomly".

On my potato performance increase is 33%, sheesh.
And CPU starts doing 3 instructions per cycle vs 2.

benchmark is "clear_user(p + 4096 - 4068, 4068)"
4068 comes from booting Debian 8 with printk.

f0(4068) (old clear_user)

$ taskset -c 15 perf stat -r 16 ./a.out

 Performance counter stats for './a.out' (16 runs):

   2033.189084  task-clock (msec) #1.000 CPUs utilized  
  ( +-  0.41% )
 2  context-switches  #0.001 K/sec  
  ( +- 11.11% )
 0  cpu-migrations#0.000 K/sec
46  page-faults   #0.023 K/sec  
  ( +-  0.91% )
 4,268,425,486  cycles#2.099 GHz
  ( +-  0.41% )
 8,672,326,256  instructions  #2.03  insn per cycle 
  ( +-  0.00% )
 2,169,900,710  branches  # 1067.240 M/sec  
  ( +-  0.00% )
 4,226,258  branch-misses #0.19% of all branches
  ( +-  0.01% )

   2.033700109 seconds time elapsed 
 ( +-  0.41% )

f1(4068) (new clear_user)
$ taskset -c 15 perf stat -r 16 ./a.out

 Performance counter stats for './a.out' (16 runs):

   1345.149992  task-clock (msec) #1.000 CPUs utilized  
  ( +-  0.01% )
 2  context-switches  #0.002 K/sec  
  ( +-  8.35% )
 0  cpu-migrations#0.000 K/sec
46  page-faults   #0.034 K/sec  
  ( +-  0.82% )
 2,823,965,728  cycles#2.099 GHz
  ( +-  0.01% )
 8,661,733,733  instructions  #3.07  insn per cycle 
  ( +-  0.00% )
 2,169,437,410  branches  # 1612.785 M/sec  
  ( +-  0.00% )
 4,216,469  branch-misses #0.19% of all branches
  ( +-  0.01% )

   1.345375114 seconds time elapsed 
 ( +-  0.01% )

-
CFLAGS = -Wall -fno-strict-aliasing -fno-common -fshort-wchar -std=gnu89 
-mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 
-falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 
-mskip-rax-setup -mtune=generic -mno-red-zone -funit-at-a-time -pipe 
-Wno-sign-compare -fno-asynchronous-unwind-tables 
-fno-delete-null-pointer-checks -O2 --param=allow-store-data-races=0 
-fno-stack-protector -fomit-frame-pointer -fno-var-tracking-assignments -g 
-femit-struct-debug-baseonly -fno-var-tracking -fno-strict-overflow 
-fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack


0780 :
 780:   movrax,rsi
 783:   movrcx,rsi
 786:   xoredx,edx
 788:   andeax,0x7
 78b:   shrrcx,0x3
 78f:   movesi,0x8
 794:   test   rcx,rcx
 797:   je 7a3 
 799:   movQWORD PTR [rdi],rdx
 79c:   addrdi,rsi
 79f:   dececx
 7a1:   jne799 
 7a3:   movrcx,rax
 7a6:   test   ecx,ecx
 7a8:   je 7b3 
 7aa:   movBYTE PTR [rdi],dl
 7ac:   incrdi
 7af:   dececx
 7b1:   jne7aa 
 7b3:   movrax,rcx
 7b6:   ret

07c0 :
 7c0:   movrax,rsi
 7c3:   shrrsi,0x3
 7c7:   andeax,0x7
 7ca:   movrcx,rsi
 7cd:   test   rcx,rcx
 7d0:   je 7e1 
 7d2:   movQWORD PTR [rdi],0x0
 7d9:   addrdi,0x8
 7dd:   dececx
 7df:   jne7d2 
 7e1:   movrcx,rax
 7e4:   test   ecx,ecx
 7e6:   je 7f2 
 7e8:   movBYTE PTR [rdi],0x0
 7eb:   incrdi
 7ee:   dececx
 7f0:   jne7e8 
 7f2:   movrax,rcx
 7f5:   ret


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Linus Torvalds
On Tue, Jun 5, 2018 at 10:22 AM Alexey Dobriyan  wrote:
>
> Tested? :^) I had P4 maybe ~15(?) years ago.

Did you EVEN test it on what you have today?

Do you have any numbers at all, in other words?

Micro-optimizations need numbers. Otherwise they aren't
micro-optimizations, they are just "change code randomly".

Linus


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Linus Torvalds
On Tue, Jun 5, 2018 at 10:22 AM Alexey Dobriyan  wrote:
>
> Tested? :^) I had P4 maybe ~15(?) years ago.

Did you EVEN test it on what you have today?

Do you have any numbers at all, in other words?

Micro-optimizations need numbers. Otherwise they aren't
micro-optimizations, they are just "change code randomly".

Linus


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Alexey Dobriyan
On Tue, Jun 05, 2018 at 05:05:14PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds  wrote:
> 
> > On Mon, Jun 4, 2018 at 5:21 AM Ingo Molnar  wrote:
> > >
> > >  - __clear_user() micro-optimization (Alexey Dobriyan)
> > 
> > Was this actually tested?
> 
> I'm not sure - Alexey?
> 
> > I think one reason people avoided the constant was that on some
> > microarchitecture it ended up being a separate uop just for the
> > constant generation, because it wouldn't fit in a single uop.

> Ok, fair point and agreed - if Alexey sends some measurements to back the 
> change 
> I'll keep this, otherwise queue up a revert.

Tested? :^) I had P4 maybe ~15(?) years ago.

godbolt.org earliest compiler is 4.1.2 and it generates "movb [r32], imm8"
with "-m32 -O2 -march=pentium4" for simple memset-style loop
if it counts for something.

Actually I think __clear_user should be rewritten in C with assembly.
It's biggest user is probably ELF loader and those partial page .bss
clears should be noticeable.


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Alexey Dobriyan
On Tue, Jun 05, 2018 at 05:05:14PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds  wrote:
> 
> > On Mon, Jun 4, 2018 at 5:21 AM Ingo Molnar  wrote:
> > >
> > >  - __clear_user() micro-optimization (Alexey Dobriyan)
> > 
> > Was this actually tested?
> 
> I'm not sure - Alexey?
> 
> > I think one reason people avoided the constant was that on some
> > microarchitecture it ended up being a separate uop just for the
> > constant generation, because it wouldn't fit in a single uop.

> Ok, fair point and agreed - if Alexey sends some measurements to back the 
> change 
> I'll keep this, otherwise queue up a revert.

Tested? :^) I had P4 maybe ~15(?) years ago.

godbolt.org earliest compiler is 4.1.2 and it generates "movb [r32], imm8"
with "-m32 -O2 -march=pentium4" for simple memset-style loop
if it counts for something.

Actually I think __clear_user should be rewritten in C with assembly.
It's biggest user is probably ELF loader and those partial page .bss
clears should be noticeable.


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Linus Torvalds
On Tue, Jun 5, 2018 at 8:05 AM Ingo Molnar  wrote:
>
> Ok, fair point and agreed - if Alexey sends some measurements to back the 
> change
> I'll keep this, otherwise queue up a revert.

I don't think it needs to be reverted, it's not like it's likely to
hurt on any modern CPU's. The issues I talked about are fairly
historical - barely even 64-bit cpus - and I'm not sure an extra uop
to carry a constant around even matters in that code sequence.

It was more a generic issue - any micro-optimization should be based
on numbers (and there should be some numbers in the commit message),
not on "this should be faster". Because while intuitively immediates
_should_ be faster than registers, that's simply not always "obviously
true". It _may_ be true. But numbers talk.

  Linus


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Linus Torvalds
On Tue, Jun 5, 2018 at 8:05 AM Ingo Molnar  wrote:
>
> Ok, fair point and agreed - if Alexey sends some measurements to back the 
> change
> I'll keep this, otherwise queue up a revert.

I don't think it needs to be reverted, it's not like it's likely to
hurt on any modern CPU's. The issues I talked about are fairly
historical - barely even 64-bit cpus - and I'm not sure an extra uop
to carry a constant around even matters in that code sequence.

It was more a generic issue - any micro-optimization should be based
on numbers (and there should be some numbers in the commit message),
not on "this should be faster". Because while intuitively immediates
_should_ be faster than registers, that's simply not always "obviously
true". It _may_ be true. But numbers talk.

  Linus


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Ingo Molnar


* Linus Torvalds  wrote:

> On Mon, Jun 4, 2018 at 5:21 AM Ingo Molnar  wrote:
> >
> >  - __clear_user() micro-optimization (Alexey Dobriyan)
> 
> Was this actually tested?

I'm not sure - Alexey?

> I think one reason people avoided the constant was that on some
> microarchitecture it ended up being a separate uop just for the
> constant generation, because it wouldn't fit in a single uop.
> 
> I'm pretty sure that used to be the case for P4, for example.
> 
> Afaik there have also been issues with decoding instructions that have
> both an immediate and a memory offset.
> 
> I suspect none of this is an issue on modern cores, but there really
> at least historically were cases where
> 
>mov %reg,mem
> 
> was better than
> 
>mov $imm,mem
> 
> if %reg already had the right value, so it's not at all 100% obvious
> that the micro-optimization really _optimizes_ anything.
> 
> Any time people do this, they should add numbers.

Ok, fair point and agreed - if Alexey sends some measurements to back the 
change 
I'll keep this, otherwise queue up a revert.

Thanks,

Ingo


Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

2018-06-05 Thread Ingo Molnar


* Linus Torvalds  wrote:

> On Mon, Jun 4, 2018 at 5:21 AM Ingo Molnar  wrote:
> >
> >  - __clear_user() micro-optimization (Alexey Dobriyan)
> 
> Was this actually tested?

I'm not sure - Alexey?

> I think one reason people avoided the constant was that on some
> microarchitecture it ended up being a separate uop just for the
> constant generation, because it wouldn't fit in a single uop.
> 
> I'm pretty sure that used to be the case for P4, for example.
> 
> Afaik there have also been issues with decoding instructions that have
> both an immediate and a memory offset.
> 
> I suspect none of this is an issue on modern cores, but there really
> at least historically were cases where
> 
>mov %reg,mem
> 
> was better than
> 
>mov $imm,mem
> 
> if %reg already had the right value, so it's not at all 100% obvious
> that the micro-optimization really _optimizes_ anything.
> 
> Any time people do this, they should add numbers.

Ok, fair point and agreed - if Alexey sends some measurements to back the 
change 
I'll keep this, otherwise queue up a revert.

Thanks,

Ingo


Re: [GIT PULL] x86/asm changes for v4.18

2018-06-04 Thread Linus Torvalds
On Mon, Jun 4, 2018 at 5:21 AM Ingo Molnar  wrote:
>
>  - __clear_user() micro-optimization (Alexey Dobriyan)

Was this actually tested?

I think one reason people avoided the constant was that on some
microarchitecture it ended up being a separate uop just for the
constant generation, because it wouldn't fit in a single uop.

I'm pretty sure that used to be the case for P4, for example.

Afaik there have also been issues with decoding instructions that have
both an immediate and a memory offset.

I suspect none of this is an issue on modern cores, but there really
at least historically were cases where

   mov %reg,mem

was better than

   mov $imm,mem

if %reg already had the right value, so it's not at all 100% obvious
that the micro-optimization really _optimizes_ anything.

Any time people do this, they should add numbers.

Linus


Re: [GIT PULL] x86/asm changes for v4.18

2018-06-04 Thread Linus Torvalds
On Mon, Jun 4, 2018 at 5:21 AM Ingo Molnar  wrote:
>
>  - __clear_user() micro-optimization (Alexey Dobriyan)

Was this actually tested?

I think one reason people avoided the constant was that on some
microarchitecture it ended up being a separate uop just for the
constant generation, because it wouldn't fit in a single uop.

I'm pretty sure that used to be the case for P4, for example.

Afaik there have also been issues with decoding instructions that have
both an immediate and a memory offset.

I suspect none of this is an issue on modern cores, but there really
at least historically were cases where

   mov %reg,mem

was better than

   mov $imm,mem

if %reg already had the right value, so it's not at all 100% obvious
that the micro-optimization really _optimizes_ anything.

Any time people do this, they should add numbers.

Linus


[GIT PULL] x86/asm changes for v4.18

2018-06-04 Thread Ingo Molnar
Linus,

Please pull the latest x86-asm-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-asm-for-linus

   # HEAD: 6469a0ee0a06b2ea1f5afbb1d5a3feed017d4c7a x86/io: Define 
readq()/writeq() to use 64-bit type

Two smaller changes:

 - better support (non-atomic) 64-bit readq()/writeq() variants (Andy 
Shevchenko)

 - __clear_user() micro-optimization (Alexey Dobriyan)

 Thanks,

Ingo

-->
Alexey Dobriyan (1):
  x86/asm/64: Micro-optimize __clear_user() - Use immediate constants

Andy Shevchenko (1):
  x86/io: Define readq()/writeq() to use 64-bit type


 arch/x86/include/asm/io.h  | 8 
 arch/x86/lib/usercopy_64.c | 9 -
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index f6e5b9375d8c..6de64840dd22 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -94,10 +94,10 @@ build_mmio_write(__writel, "l", unsigned int, "r", )
 
 #ifdef CONFIG_X86_64
 
-build_mmio_read(readq, "q", unsigned long, "=r", :"memory")
-build_mmio_read(__readq, "q", unsigned long, "=r", )
-build_mmio_write(writeq, "q", unsigned long, "r", :"memory")
-build_mmio_write(__writeq, "q", unsigned long, "r", )
+build_mmio_read(readq, "q", u64, "=r", :"memory")
+build_mmio_read(__readq, "q", u64, "=r", )
+build_mmio_write(writeq, "q", u64, "r", :"memory")
+build_mmio_write(__writeq, "q", u64, "r", )
 
 #define readq_relaxed(a)   __readq(a)
 #define writeq_relaxed(v, a)   __writeq(v, a)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 75d3776123cc..a624dcc4de10 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -23,13 +23,13 @@ unsigned long __clear_user(void __user *addr, unsigned long 
size)
asm volatile(
"   testq  %[size8],%[size8]\n"
"   jz 4f\n"
-   "0: movq %[zero],(%[dst])\n"
-   "   addq   %[eight],%[dst]\n"
+   "0: movq $0,(%[dst])\n"
+   "   addq   $8,%[dst]\n"
"   decl %%ecx ; jnz   0b\n"
"4: movq  %[size1],%%rcx\n"
"   testl %%ecx,%%ecx\n"
"   jz 2f\n"
-   "1: movb   %b[zero],(%[dst])\n"
+   "1: movb   $0,(%[dst])\n"
"   incq   %[dst]\n"
"   decl %%ecx ; jnz  1b\n"
"2:\n"
@@ -40,8 +40,7 @@ unsigned long __clear_user(void __user *addr, unsigned long 
size)
_ASM_EXTABLE(0b,3b)
_ASM_EXTABLE(1b,2b)
: [size8] "="(size), [dst] "=" (__d0)
-   : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr),
- [zero] "r" (0UL), [eight] "r" (8UL));
+   : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
clac();
return size;
 }


[GIT PULL] x86/asm changes for v4.18

2018-06-04 Thread Ingo Molnar
Linus,

Please pull the latest x86-asm-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-asm-for-linus

   # HEAD: 6469a0ee0a06b2ea1f5afbb1d5a3feed017d4c7a x86/io: Define 
readq()/writeq() to use 64-bit type

Two smaller changes:

 - better support (non-atomic) 64-bit readq()/writeq() variants (Andy 
Shevchenko)

 - __clear_user() micro-optimization (Alexey Dobriyan)

 Thanks,

Ingo

-->
Alexey Dobriyan (1):
  x86/asm/64: Micro-optimize __clear_user() - Use immediate constants

Andy Shevchenko (1):
  x86/io: Define readq()/writeq() to use 64-bit type


 arch/x86/include/asm/io.h  | 8 
 arch/x86/lib/usercopy_64.c | 9 -
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index f6e5b9375d8c..6de64840dd22 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -94,10 +94,10 @@ build_mmio_write(__writel, "l", unsigned int, "r", )
 
 #ifdef CONFIG_X86_64
 
-build_mmio_read(readq, "q", unsigned long, "=r", :"memory")
-build_mmio_read(__readq, "q", unsigned long, "=r", )
-build_mmio_write(writeq, "q", unsigned long, "r", :"memory")
-build_mmio_write(__writeq, "q", unsigned long, "r", )
+build_mmio_read(readq, "q", u64, "=r", :"memory")
+build_mmio_read(__readq, "q", u64, "=r", )
+build_mmio_write(writeq, "q", u64, "r", :"memory")
+build_mmio_write(__writeq, "q", u64, "r", )
 
 #define readq_relaxed(a)   __readq(a)
 #define writeq_relaxed(v, a)   __writeq(v, a)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 75d3776123cc..a624dcc4de10 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -23,13 +23,13 @@ unsigned long __clear_user(void __user *addr, unsigned long 
size)
asm volatile(
"   testq  %[size8],%[size8]\n"
"   jz 4f\n"
-   "0: movq %[zero],(%[dst])\n"
-   "   addq   %[eight],%[dst]\n"
+   "0: movq $0,(%[dst])\n"
+   "   addq   $8,%[dst]\n"
"   decl %%ecx ; jnz   0b\n"
"4: movq  %[size1],%%rcx\n"
"   testl %%ecx,%%ecx\n"
"   jz 2f\n"
-   "1: movb   %b[zero],(%[dst])\n"
+   "1: movb   $0,(%[dst])\n"
"   incq   %[dst]\n"
"   decl %%ecx ; jnz  1b\n"
"2:\n"
@@ -40,8 +40,7 @@ unsigned long __clear_user(void __user *addr, unsigned long 
size)
_ASM_EXTABLE(0b,3b)
_ASM_EXTABLE(1b,2b)
: [size8] "="(size), [dst] "=" (__d0)
-   : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr),
- [zero] "r" (0UL), [eight] "r" (8UL));
+   : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
clac();
return size;
 }