[tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE=y in the 64-bit defconfig

2013-01-26 Thread tip-bot for Ma Ling
Commit-ID:  d94ffd677469ef729e9d6e968191872577a6119e
Gitweb: http://git.kernel.org/tip/d94ffd677469ef729e9d6e968191872577a6119e
Author: Ma Ling 
AuthorDate: Fri, 25 Jan 2013 09:11:01 -0500
Committer:  Ingo Molnar 
CommitDate: Sat, 26 Jan 2013 13:09:15 +0100

x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE=y in the 64-bit defconfig

Currently we use -O2 as the compiler option for better performance,
even though it enlarges the code size. On modern CPUs, larger
instruction and unified caches plus sophisticated instruction prefetch
reduce the cost of instruction-cache misses, and flags such as
-falign-functions, -falign-jumps, -falign-loops and -falign-labels
help CPU front-end throughput because the CPU fetches instructions
in aligned 16-byte blocks each cycle.
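
For illustration only (not part of this commit): the per-function
equivalent of what -falign-functions (e.g. -falign-functions=16) asks
the compiler to do, written as a GCC attribute; the function name is
made up.

/* request a 16-byte-aligned entry point for this one function by hand;
 * under -O2, -falign-functions does this for every function so entry
 * points start on a fetch-block boundary */
__attribute__((aligned(16)))
int hot_path(int x)
{
        return x * 2;   /* the body is irrelevant, only the entry alignment matters */
}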

To save power and gain performance, Sandy Bridge introduced a decoded
instruction cache (uop cache) that keeps instructions after the decode
stage. When the CPU re-fetches instructions, the decoded cache can
deliver an aligned 32-byte instruction block instead of 16 bytes from
the I-cache, and its shorter pipeline reduces the branch-miss penalty.
This requires keeping hot code in the decoded cache as much as we can.
Sandy Bridge, Ivy Bridge and Haswell all implement this feature, so
-Os (optimize for size) should do better than -O2 on them.

Based on the above, we compiled Linux kernel 3.6.9 with -O2 and -Os
respectively. The results show that -Os improves netperf performance
by 4.8% and volano by 2.7%, as below:

O2 + netperf
Performance counter stats for 'netperf' (3 runs):

       5416.157986 task-clock                #    0.541 CPUs utilized            ( +-  0.19% )
           348,249 context-switches          #    0.064 M/sec                    ( +-  0.17% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               353 page-faults               #    0.000 M/sec                    ( +-  0.16% )
    13,166,254,384 cycles                    #    2.431 GHz                      ( +-  0.18% )
     8,827,499,807 stalled-cycles-frontend   #   67.05% frontend cycles idle     ( +-  0.29% )
     5,951,234,060 stalled-cycles-backend    #   45.20% backend  cycles idle     ( +-  0.44% )
     8,122,481,914 instructions              #    0.62  insns per cycle
                                             #    1.09  stalled cycles per insn  ( +-  0.17% )
     1,415,864,138 branches                  #  261.415 M/sec                    ( +-  0.17% )
        16,975,308 branch-misses             #    1.20% of all branches          ( +-  0.61% )

      10.007215371 seconds time elapsed                                          ( +-  0.03% )

Os + netperf

Performance counter stats for 'netperf' (3 runs):

       5395.386704 task-clock                #    0.539 CPUs utilized            ( +-  0.14% )
           345,880 context-switches          #    0.064 M/sec                    ( +-  0.25% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               354 page-faults               #    0.000 M/sec                    ( +-  0.00% )
    13,142,706,297 cycles                    #    2.436 GHz                      ( +-  0.23% )
     8,379,382,641 stalled-cycles-frontend   #   63.76% frontend cycles idle     ( +-  0.50% )
     5,513,722,219 stalled-cycles-backend    #   41.95% backend  cycles idle     ( +-  0.71% )
     8,554,202,795 instructions              #    0.65  insns per cycle
                                             #    0.98  stalled cycles per insn  ( +-  0.25% )
     1,530,020,505 branches                  #  283.579 M/sec                    ( +-  0.25% )
        17,710,406 branch-misses             #    1.16% of all branches          ( +-  1.00% )

      10.004859867 seconds time elapsed

Over the same elapsed time (10.004859867 seconds), IPC with -Os is 0.65
versus 0.62 with -O2 (0.65 / 0.62 ≈ 1.048), i.e. -Os improves netperf
performance by about 4.8%.

O2 + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     210627.115313 task-clock                #    0.781 CPUs utilized            ( +-  0.92% )
        13,812,610 context-switches          #    0.066 M/sec                    ( +-  0.17% )
         2,352,755 CPU-migrations            #    0.011 M/sec                    ( +-  0.84% )
           208,333 page-faults               #    0.001 M/sec                    ( +-  1.58% )
   525,627,073,405 cycles                    #    2.496 GHz                      ( +-  0.96% )
   428,177,571,365 stalled-cycles-frontend   #   81.46% frontend cycles idle     ( +-  1.09% )
   370,885,224,739 stalled-cycles-backend    #   70.56% backend  cycles idle     ( +-  1.18% )
   187,662,577,544 instructions              #    0.36  insns per cycle
                                             #    2.28  stalled cycles per insn  ( +-  0.31% )
    35,684,976,425 branches                  #  169.423 M/sec                    ( +-  0.45% )
     1,062,086,942 branch-misses             #    2.98


[tip:x86/asm] x86/asm: Clean up copy_page_*() comments and code

2012-10-24 Thread tip-bot for Ma Ling
Commit-ID:  269833bd5a0f4443873da358b71675a890b47c3c
Gitweb: http://git.kernel.org/tip/269833bd5a0f4443873da358b71675a890b47c3c
Author: Ma Ling 
AuthorDate: Thu, 18 Oct 2012 03:52:45 +0800
Committer:  Ingo Molnar 
CommitDate: Wed, 24 Oct 2012 12:42:47 +0200

x86/asm: Clean up copy_page_*() comments and code

Modern CPUs use fast-string instructions to accelerate copy
performance by combining data into 128-bit chunks.

Modify the comments and coding style to match that.
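
For illustration only (not part of this patch): a minimal userspace
sketch of the rep-movsq fast-string copy that copy_page_rep uses,
written as GNU C inline assembly; the function name is made up.

#include <stdint.h>

/* copy one 4096-byte page as 512 quadwords; the fast-string microcode
 * moves the data in wide (e.g. 128-bit) chunks internally */
static void copy_page_rep_demo(void *dst, const void *src)
{
        uint64_t count = 4096 / 8;

        asm volatile("rep movsq"
                     : "+D" (dst), "+S" (src), "+c" (count)
                     :
                     : "memory");
}

This is the kind of copy the kernel selects via the alternatives
mechanism when X86_FEATURE_REP_GOOD is set, as discussed further down
in the thread.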

Signed-off-by: Ma Ling 
Cc: i...@google.com
Link: 
http://lkml.kernel.org/r/1350503565-19167-1-git-send-email-ling...@intel.com
[ Cleaned up the clean up. ]
Signed-off-by: Ingo Molnar 
---
 arch/x86/lib/copy_page_64.S |  120 +--
 1 files changed, 59 insertions(+), 61 deletions(-)

diff --git a/arch/x86/lib/copy_page_64.S b/arch/x86/lib/copy_page_64.S
index 6b34d04..176cca6 100644
--- a/arch/x86/lib/copy_page_64.S
+++ b/arch/x86/lib/copy_page_64.S
@@ -5,91 +5,89 @@
 #include <asm/alternative-asm.h>
 
ALIGN
-copy_page_c:
+copy_page_rep:
CFI_STARTPROC
-   movl $4096/8,%ecx
-   rep movsq
+   movl$4096/8, %ecx
+   rep movsq
ret
CFI_ENDPROC
-ENDPROC(copy_page_c)
+ENDPROC(copy_page_rep)
 
-/* Don't use streaming store because it's better when the target
-   ends up in cache. */
-   
-/* Could vary the prefetch distance based on SMP/UP */
+/*
+ *  Don't use streaming copy unless the CPU indicates X86_FEATURE_REP_GOOD.
+ *  Could vary the prefetch distance based on SMP/UP.
+*/
 
 ENTRY(copy_page)
CFI_STARTPROC
-   subq$2*8,%rsp
+   subq$2*8,   %rsp
CFI_ADJUST_CFA_OFFSET 2*8
-   movq%rbx,(%rsp)
+   movq%rbx,   (%rsp)
CFI_REL_OFFSET rbx, 0
-   movq%r12,1*8(%rsp)
+   movq%r12,   1*8(%rsp)
CFI_REL_OFFSET r12, 1*8
 
-   movl$(4096/64)-5,%ecx
+   movl$(4096/64)-5,   %ecx
.p2align 4
 .Loop64:
-   dec %rcx
-
-   movq(%rsi), %rax
-   movq  8 (%rsi), %rbx
-   movq 16 (%rsi), %rdx
-   movq 24 (%rsi), %r8
-   movq 32 (%rsi), %r9
-   movq 40 (%rsi), %r10
-   movq 48 (%rsi), %r11
-   movq 56 (%rsi), %r12
+   dec %rcx
+   movq0x8*0(%rsi), %rax
+   movq0x8*1(%rsi), %rbx
+   movq0x8*2(%rsi), %rdx
+   movq0x8*3(%rsi), %r8
+   movq0x8*4(%rsi), %r9
+   movq0x8*5(%rsi), %r10
+   movq0x8*6(%rsi), %r11
+   movq0x8*7(%rsi), %r12
 
prefetcht0 5*64(%rsi)
 
-   movq %rax,(%rdi)
-   movq %rbx,  8 (%rdi)
-   movq %rdx, 16 (%rdi)
-   movq %r8,  24 (%rdi)
-   movq %r9,  32 (%rdi)
-   movq %r10, 40 (%rdi)
-   movq %r11, 48 (%rdi)
-   movq %r12, 56 (%rdi)
+   movq%rax, 0x8*0(%rdi)
+   movq%rbx, 0x8*1(%rdi)
+   movq%rdx, 0x8*2(%rdi)
+   movq%r8,  0x8*3(%rdi)
+   movq%r9,  0x8*4(%rdi)
+   movq%r10, 0x8*5(%rdi)
+   movq%r11, 0x8*6(%rdi)
+   movq%r12, 0x8*7(%rdi)
 
-   leaq64 (%rsi), %rsi
-   leaq64 (%rdi), %rdi
+   leaq64 (%rsi), %rsi
+   leaq64 (%rdi), %rdi
 
-   jnz .Loop64
+   jnz .Loop64
 
-   movl$5,%ecx
+   movl$5, %ecx
.p2align 4
 .Loop2:
-   decl   %ecx
-
-   movq(%rsi), %rax
-   movq  8 (%rsi), %rbx
-   movq 16 (%rsi), %rdx
-   movq 24 (%rsi), %r8
-   movq 32 (%rsi), %r9
-   movq 40 (%rsi), %r10
-   movq 48 (%rsi), %r11
-   movq 56 (%rsi), %r12
-
-   movq %rax,(%rdi)
-   movq %rbx,  8 (%rdi)
-   movq %rdx, 16 (%rdi)
-   movq %r8,  24 (%rdi)
-   movq %r9,  32 (%rdi)
-   movq %r10, 40 (%rdi)
-   movq %r11, 48 (%rdi)
-   movq %r12, 56 (%rdi)
-
-   leaq64(%rdi),%rdi
-   leaq64(%rsi),%rsi
-
+   decl%ecx
+
+   movq0x8*0(%rsi), %rax
+   movq0x8*1(%rsi), %rbx
+   movq0x8*2(%rsi), %rdx
+   movq0x8*3(%rsi), %r8
+   movq0x8*4(%rsi), %r9
+   movq0x8*5(%rsi), %r10
+   movq0x8*6(%rsi), %r11
+   movq0x8*7(%rsi), %r12
+
+   movq%rax, 0x8*0(%rdi)
+   movq%rbx, 0x8*1(%rdi)
+   movq%rdx, 0x8*2(%rdi)
+   movq%r8,  0x8*3(%rdi)
+   movq%r9,  0x8*4(%rdi)
+   movq%r10, 0x8*5(%rdi)
+   movq%r11, 0x8*6(%rdi)
+   movq%r12, 0x8*7(%rdi)
+
+   leaq64(%rdi), %rdi
+   leaq64(%rsi), %rsi
jnz .Loop2
 
-   movq(%rsp),%rbx
+   movq(%rsp), %rbx
CFI_RESTORE rbx
-   movq1*8(%rsp),%r12
+   movq1*8(%rsp), %r12
CFI_RESTORE r12
-   addq$2*8,%rsp
+   addq$2*8, %rsp
CFI_ADJUST_CFA_OFFSET -2*8
ret
 .Lcopy_page_end:
@@ -103,7 +101,7 @@ ENDPROC(copy_page)
 
.section


RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-14 Thread Ma, Ling
Thanks Boris!
So the patch is helpful and has no impact on other/older machines;
I will re-send a new version according to the comments.
Any further comments are appreciated!

Regards
Ling

> -Original Message-
> From: Borislav Petkov [mailto:b...@alien8.de]
> Sent: Sunday, October 14, 2012 6:58 PM
> To: Ma, Ling
> Cc: Konrad Rzeszutek Wilk; mi...@elte.hu; h...@zytor.com;
> t...@linutronix.de; linux-kernel@vger.kernel.org; i...@google.com;
> George Spelvin
> Subject: Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging
> instruction sequence and saving register
> 
> On Fri, Oct 12, 2012 at 08:04:11PM +0200, Borislav Petkov wrote:
> > Right, so benchmark shows around 20% speedup on Bulldozer but this is
> > a microbenchmark and before pursue this further, we need to verify
> > whether this brings any palpable speedup with a real benchmark, I
> > don't know, kernbench, netbench, whatever. Even something as boring
> as
> > kernel build. And probably check for perf regressions on the rest of
> > the uarches.
> 
> Ok, so to summarize, on AMD we're using REP MOVSQ which is even faster
> than the unrolled version. I've added the REP MOVSQ version to the
> µbenchmark. It nicely validates that we're correctly setting
> X86_FEATURE_REP_GOOD on everything >= F10h and some K8s.
> 
> So, to answer Konrad's question: those patches don't concern AMD
> machines.
> 
> Thanks.
> 
> --
> Regards/Gruss,
> Boris.


RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Ma, Ling
> If you can't test the CPUs who run this code I think it's safer if you
> add a new variant for Atom, not change the existing well tested code.
> Otherwise you risk performance regressions on these older CPUs.

I found one older machine and tested the code on it; the results are almost
the same, as below (CPU info attached).

                                   copy_page_org   copy_page_new
  TPT: Len 4096, alignment  0/ 0:           2252            2218
  TPT: Len 4096, alignment  0/ 0:           2244            2193
  TPT: Len 4096, alignment  0/ 0:           2261            2227
  TPT: Len 4096, alignment  0/ 0:           2235            2244
  TPT: Len 4096, alignment  0/ 0:           2261            2184

Thanks
Ling


xeon-cpu-info
Description: xeon-cpu-info


RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Ma, Ling
> > > So is that also true for AMD CPUs?
> > Although Bulldozer put 32byte instruction into decoupled 16byte entry
> > buffers, it still decode 4 instructions per cycle, so 4 instructions
> > will be fed into execution unit and
> > 2 loads ,1 write will be issued per cycle.
> 
> I'd be very interested with what benchmarks are you seeing that perf
> improvement on Atom and who knows, maybe I could find time to run them
> on Bulldozer and see how your patch behaves there :-).M
I use another benchmark from gcc; there is a lot of code, so I extracted one
simple benchmark that you can use to test (cc -o copy_page copy_page.c).
My initial result shows the new copy_page version is still better on a
Bulldozer machine, but because the machine is a first release, please verify
the result.
And CC to Ian.

Thanks
Ling

#include <stdio.h>
#include <stdlib.h>


typedef unsigned long long int hp_timing_t;
#define  MAXSAMPLESTPT  1000
#define  MAXCOPYSIZE  (1024 * 1024)
#define  ORIG  0
#define  NEW   1
static char* buf1 = NULL;
static char* buf2 = NULL;
static int repeat_one_test = 32;

hp_timing_t _dl_hp_timing_overhead;
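/* HP_TIMING_NOW reads the CPU time-stamp counter with rdtsc and combines
   EDX:EAX into a single 64-bit cycle count */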
# define HP_TIMING_NOW(Var) \
  ({ unsigned long long _hi, _lo; \
 asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
 (Var) = _hi << 32 | _lo; })

#define HP_TIMING_DIFF(Diff, Start, End)   (Diff) = ((End) - (Start))
#define HP_TIMING_TOTAL(total_time, start, end) \
  do\
{   \
  hp_timing_t tmptime;  \
  HP_TIMING_DIFF (tmptime, start + _dl_hp_timing_overhead, end);\
total_time += tmptime;  \
}   \
  while (0)

#define HP_TIMING_BEST(best_time, start, end)   \
  do\
{   \
  hp_timing_t tmptime;  \
  HP_TIMING_DIFF (tmptime, start + _dl_hp_timing_overhead, end);\
  if (best_time > tmptime)  \
best_time = tmptime;\
}   \
  while (0)


void copy_page_org(char *dst, char *src, int len);
void copy_page_new(char *dst, char *src, int len);
void memcpy_c(char *dst, char *src, int len);
void (*do_memcpy)(char *dst, char *src, int len);

static void
do_one_test ( char *dst, char *src,
 size_t len)
{
  hp_timing_t start __attribute ((unused));
  hp_timing_t stop __attribute ((unused));
  hp_timing_t best_time = ~ (hp_timing_t) 0;
  size_t i,j;

  for (i = 0; i < repeat_one_test; ++i)
{
  HP_TIMING_NOW (start);
  do_memcpy ( dst, src, len);
  HP_TIMING_NOW (stop);
  HP_TIMING_BEST (best_time, start, stop);
}

  printf ("\t%zd", (size_t) best_time);
}

static void
do_test (size_t align1, size_t align2, size_t len)
{
  size_t i, j;
  char *s1, *s2;

  s1 = (char *) (buf1 + align1);
  s2 = (char *) (buf2 + align2);


   printf ("TPT: Len %4zd, alignment %2zd/%2zd:", len, align1, align2);
   do_memcpy = copy_page_org;
   do_one_test (s2, s1, len);
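   /* the second variant runs on buffers offset by 64 KiB (1 << 16), so the
      two routines are not timed on exactly the same cache lines */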
   do_memcpy = copy_page_new;
   do_one_test (s2+ (1 << 16), s1 + (1 << 16), len);
putchar ('\n');
}

static void test_init(void)
{
  int i;
  buf1 = valloc(MAXCOPYSIZE);
  buf2 = valloc(MAXCOPYSIZE);

  for (i = 0; i < MAXCOPYSIZE ; i = i + 64) {
buf1[i] = buf2[i] = i & 0xff;
  }

}

void copy_page_new(char *dst, char *src, int len)
{
__asm__("mov$(4096/64)-5, %ecx");
__asm__("1:");
__asm__("prefetcht0 5*64(%rsi)");
__asm__("decb   %cl");

__asm__("movq   0x8*0(%rsi), %r10");
__asm__("movq   0x8*1(%rsi), %rax");
__asm__("movq   0x8*2(%rsi), %r8");
__asm__("movq   0x8*3(%rsi), %r9");
__asm__("movq   %r10, 0x8*0(%rdi)");
__asm__("movq   %rax, 0x8*1(%rdi)");
__asm__("movq   %r8, 0x8*2(%rdi)");
__asm__("movq   %r9, 0x8*3(%rdi)");

__asm__("movq   0x8*4(%rsi), %r10");
__asm__("movq   0x8*5(%rsi), %rax");
__asm__("movq   0x8*6(%rsi), %r8");
__asm__("movq   0x8*7(%rsi), %r9");
__asm__("leaq   64(%rsi), %rsi");
__asm__("movq   %r10, 0x8*4(%rdi)");
__asm__("movq   %rax, 0x8*5(%rdi)");
__asm__("movq   %r8, 0x8*6(%rdi)");
__asm__("movq   %r9, 0x8*7(%rdi)");
__asm__("leaq   64(%rdi), %rdi");
__asm__("jnz 1b");
__asm__("mov$5, %dl");
__asm__("2:");
__asm__("decb   %dl");
__asm__("movq   0x8*0(%rsi), %r10");
__asm__("movq   0x8*1(%rsi), %rax");
__asm__("movq   0x8*2(%rsi), %r8");
__asm__("movq   0x8*3(%rsi), %r9");


RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Ma, Ling
> > Load and write operation occupy about 35% and 10% respectively for
> > most industry benchmarks. Fetched 16-aligned bytes code include about
> > 4 instructions, implying 1.4 (0.35 * 4) load, 0.4 write.
> > Modern CPU support 2 load and 1 write per cycle, so throughput from
> > write is bottleneck for memcpy or copy_page, and some slight CPU only
> > support one mem operation per cycle. So it is enough to issue one
> read
> > and write instruction per cycle, and we can save registers.
> 
> So is that also true for AMD CPUs?
Although Bulldozer puts 32 bytes of instructions into decoupled 16-byte entry
buffers, it still decodes 4 instructions per cycle, so 4 instructions are fed
into the execution units and 2 loads and 1 write can be issued per cycle.

Thanks
Ling


RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Ma, Ling
> > Load and write operation occupy about 35% and 10% respectively for
> > most industry benchmarks. Fetched 16-aligned bytes code include about
  4 instructions, implying 1.4 (0.35 * 4) load, 0.4 write.
> > Modern CPU support 2 load and 1 write per cycle, so throughput from
> > write is bottleneck for memcpy or copy_page, and some slight CPU only
> > support one mem operation per cycle. So it is enough to issue one
> read
> > and write instruction per cycle, and we can save registers.
> 
> I don't think "saving registers" is a useful goal here.

Ling: issuing one read and one write op per cycle is enough for copy_page or
memcpy performance, so we can avoid the save/restore of the extra
callee-saved registers.

> >
> > In this patch we also re-arrange instruction sequence to improve
> > performance The performance on atom is improved about 11%, 9% on
> > hot/cold-cache case respectively.
> 
> That's great, but the question is what happened to the older CPUs that
> also this sequence. It may be safer to add a new variant for Atom,
> unless you can benchmark those too.

Ling:
I tested the new and original versions on Core2; the patch improved
performance by about 9%. Although Core2 has an out-of-order pipeline, which
weakens the instruction-sequence requirement, the ROB size is limited, so the
new patch, by issuing write operations earlier, exposes more parallelism
between the write and load ops and gives a better result.
Attached core2-cpu-info (I have no older machine)


Thanks
Ling

 


core2-cpu-info
Description: core2-cpu-info

