[tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE=y in the 64-bit defconfig
Commit-ID:  d94ffd677469ef729e9d6e968191872577a6119e
Gitweb:     http://git.kernel.org/tip/d94ffd677469ef729e9d6e968191872577a6119e
Author:     Ma Ling
AuthorDate: Fri, 25 Jan 2013 09:11:01 -0500
Committer:  Ingo Molnar
CommitDate: Sat, 26 Jan 2013 13:09:15 +0100

x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE=y in the 64-bit defconfig

Currently we use -O2 as the compiler option for better performance, although
it enlarges code size. On modern CPUs, larger instruction and unified caches
plus sophisticated instruction prefetch weaken the cost of instruction cache
misses, and flags such as -falign-functions, -falign-jumps, -falign-loops and
-falign-labels help CPU front-end throughput, because the CPU fetches
instructions in 16-byte aligned code blocks each cycle.

To save power and gain performance, Sandy Bridge introduced a decoded
instruction cache: instructions are kept there after the decode stage. When
the CPU refetches an instruction, the decoded cache can provide a 32-byte
aligned instruction block instead of 16 bytes from the I-cache, and the
shorter pipeline means a smaller branch-miss penalty. This makes it desirable
to fit as much hot code into the decoded cache as possible. Sandy Bridge,
Ivy Bridge and Haswell all implement this feature, so -Os (optimize for size)
should be better than -O2 on them.

For the above reasons, we compiled Linux kernel 3.6.9 with -O2 and -Os
respectively.
The results show that -Os improves netperf performance by 4.8% and volano by 2.7%, as below:

O2 + netperf

 Performance counter stats for 'netperf' (3 runs):

       5416.157986 task-clock                #    0.541 CPUs utilized            ( +-  0.19% )
           348,249 context-switches          #    0.064 M/sec                    ( +-  0.17% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               353 page-faults               #    0.000 M/sec                    ( +-  0.16% )
    13,166,254,384 cycles                    #    2.431 GHz                      ( +-  0.18% )
     8,827,499,807 stalled-cycles-frontend   #   67.05% frontend cycles idle     ( +-  0.29% )
     5,951,234,060 stalled-cycles-backend    #   45.20% backend  cycles idle     ( +-  0.44% )
     8,122,481,914 instructions              #    0.62  insns per cycle
                                             #    1.09  stalled cycles per insn  ( +-  0.17% )
     1,415,864,138 branches                  #  261.415 M/sec                    ( +-  0.17% )
        16,975,308 branch-misses             #    1.20% of all branches          ( +-  0.61% )

      10.007215371 seconds time elapsed                                          ( +-  0.03% )

Os + netperf

 Performance counter stats for 'netperf' (3 runs):

       5395.386704 task-clock                #    0.539 CPUs utilized            ( +-  0.14% )
           345,880 context-switches          #    0.064 M/sec                    ( +-  0.25% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               354 page-faults               #    0.000 M/sec                    ( +-  0.00% )
    13,142,706,297 cycles                    #    2.436 GHz                      ( +-  0.23% )
     8,379,382,641 stalled-cycles-frontend   #   63.76% frontend cycles idle     ( +-  0.50% )
     5,513,722,219 stalled-cycles-backend    #   41.95% backend  cycles idle     ( +-  0.71% )
     8,554,202,795 instructions              #    0.65  insns per cycle
                                             #    0.98  stalled cycles per insn  ( +-  0.25% )
     1,530,020,505 branches                  #  283.579 M/sec                    ( +-  0.25% )
        17,710,406 branch-misses             #    1.16% of all branches          ( +-  1.00% )

      10.004859867 seconds time elapsed

During the same elapsed time (10.004859867 seconds), the IPC with -Os is 0.65 versus 0.62 with -O2, so -Os improved performance by 4.8%.
O2 + volano

 Performance counter stats for './loopclient.sh openjdk' (3 runs):

     210627.115313 task-clock                #    0.781 CPUs utilized            ( +-  0.92% )
        13,812,610 context-switches          #    0.066 M/sec                    ( +-  0.17% )
         2,352,755 CPU-migrations            #    0.011 M/sec                    ( +-  0.84% )
           208,333 page-faults               #    0.001 M/sec                    ( +-  1.58% )
   525,627,073,405 cycles                    #    2.496 GHz                      ( +-  0.96% )
   428,177,571,365 stalled-cycles-frontend   #   81.46% frontend cycles idle     ( +-  1.09% )
   370,885,224,739 stalled-cycles-backend    #   70.56% backend  cycles idle     ( +-  1.18% )
   187,662,577,544 instructions              #    0.36  insns per cycle
                                             #    2.28  stalled cycles per insn  ( +-  0.31% )
    35,684,976,425 branches                  #  169.423 M/sec                    ( +-  0.45% )
     1,062,086,942 branch-misses             #    2.98
[tip:x86/asm] x86/asm: Clean up copy_page_*() comments and code
Commit-ID:  269833bd5a0f4443873da358b71675a890b47c3c
Gitweb:     http://git.kernel.org/tip/269833bd5a0f4443873da358b71675a890b47c3c
Author:     Ma Ling
AuthorDate: Thu, 18 Oct 2012 03:52:45 +0800
Committer:  Ingo Molnar
CommitDate: Wed, 24 Oct 2012 12:42:47 +0200

x86/asm: Clean up copy_page_*() comments and code

Modern CPUs use fast-string instructions to accelerate copy performance,
by combining data into 128-bit chunks. Modify the comments and coding
style to match it.

Signed-off-by: Ma Ling
Cc: i...@google.com
Link: http://lkml.kernel.org/r/1350503565-19167-1-git-send-email-ling...@intel.com
[ Cleaned up the clean up. ]
Signed-off-by: Ingo Molnar
---
 arch/x86/lib/copy_page_64.S | 120 +-
 1 files changed, 59 insertions(+), 61 deletions(-)

diff --git a/arch/x86/lib/copy_page_64.S b/arch/x86/lib/copy_page_64.S
index 6b34d04..176cca6 100644
--- a/arch/x86/lib/copy_page_64.S
+++ b/arch/x86/lib/copy_page_64.S
@@ -5,91 +5,89 @@
 #include <asm/alternative-asm.h>

 	ALIGN
-copy_page_c:
+copy_page_rep:
 	CFI_STARTPROC
-	movl $4096/8,%ecx
-	rep movsq
+	movl	$4096/8, %ecx
+	rep	movsq
 	ret
 	CFI_ENDPROC
-ENDPROC(copy_page_c)
+ENDPROC(copy_page_rep)

-/* Don't use streaming store because it's better when the target
-   ends up in cache. */
-
-/* Could vary the prefetch distance based on SMP/UP */
+/*
+ * Don't use streaming copy unless the CPU indicates X86_FEATURE_REP_GOOD.
+ * Could vary the prefetch distance based on SMP/UP.
+ */
 ENTRY(copy_page)
 	CFI_STARTPROC
-	subq	$2*8,%rsp
+	subq	$2*8, %rsp
 	CFI_ADJUST_CFA_OFFSET 2*8
-	movq	%rbx,(%rsp)
+	movq	%rbx, (%rsp)
 	CFI_REL_OFFSET rbx, 0
-	movq	%r12,1*8(%rsp)
+	movq	%r12, 1*8(%rsp)
 	CFI_REL_OFFSET r12, 1*8
-	movl	$(4096/64)-5,%ecx
+	movl	$(4096/64)-5, %ecx
 	.p2align 4
 .Loop64:
-	dec	%rcx
-
-	movq	(%rsi), %rax
-	movq	8 (%rsi), %rbx
-	movq	16 (%rsi), %rdx
-	movq	24 (%rsi), %r8
-	movq	32 (%rsi), %r9
-	movq	40 (%rsi), %r10
-	movq	48 (%rsi), %r11
-	movq	56 (%rsi), %r12
+	dec	%rcx
+	movq	0x8*0(%rsi), %rax
+	movq	0x8*1(%rsi), %rbx
+	movq	0x8*2(%rsi), %rdx
+	movq	0x8*3(%rsi), %r8
+	movq	0x8*4(%rsi), %r9
+	movq	0x8*5(%rsi), %r10
+	movq	0x8*6(%rsi), %r11
+	movq	0x8*7(%rsi), %r12

 	prefetcht0 5*64(%rsi)

-	movq	%rax,(%rdi)
-	movq	%rbx, 8 (%rdi)
-	movq	%rdx, 16 (%rdi)
-	movq	%r8, 24 (%rdi)
-	movq	%r9, 32 (%rdi)
-	movq	%r10, 40 (%rdi)
-	movq	%r11, 48 (%rdi)
-	movq	%r12, 56 (%rdi)
+	movq	%rax, 0x8*0(%rdi)
+	movq	%rbx, 0x8*1(%rdi)
+	movq	%rdx, 0x8*2(%rdi)
+	movq	%r8,  0x8*3(%rdi)
+	movq	%r9,  0x8*4(%rdi)
+	movq	%r10, 0x8*5(%rdi)
+	movq	%r11, 0x8*6(%rdi)
+	movq	%r12, 0x8*7(%rdi)

-	leaq	64 (%rsi), %rsi
-	leaq	64 (%rdi), %rdi
+	leaq	64(%rsi), %rsi
+	leaq	64(%rdi), %rdi

-	jnz	.Loop64
+	jnz	.Loop64

-	movl	$5,%ecx
+	movl	$5, %ecx
 	.p2align 4
 .Loop2:
-	decl	%ecx
-
-	movq	(%rsi), %rax
-	movq	8 (%rsi), %rbx
-	movq	16 (%rsi), %rdx
-	movq	24 (%rsi), %r8
-	movq	32 (%rsi), %r9
-	movq	40 (%rsi), %r10
-	movq	48 (%rsi), %r11
-	movq	56 (%rsi), %r12
-
-	movq	%rax,(%rdi)
-	movq	%rbx, 8 (%rdi)
-	movq	%rdx, 16 (%rdi)
-	movq	%r8, 24 (%rdi)
-	movq	%r9, 32 (%rdi)
-	movq	%r10, 40 (%rdi)
-	movq	%r11, 48 (%rdi)
-	movq	%r12, 56 (%rdi)
-
-	leaq	64(%rdi),%rdi
-	leaq	64(%rsi),%rsi
-
+	decl	%ecx
+
+	movq	0x8*0(%rsi), %rax
+	movq	0x8*1(%rsi), %rbx
+	movq	0x8*2(%rsi), %rdx
+	movq	0x8*3(%rsi), %r8
+	movq	0x8*4(%rsi), %r9
+	movq	0x8*5(%rsi), %r10
+	movq	0x8*6(%rsi), %r11
+	movq	0x8*7(%rsi), %r12
+
+	movq	%rax, 0x8*0(%rdi)
+	movq	%rbx, 0x8*1(%rdi)
+	movq	%rdx, 0x8*2(%rdi)
+	movq	%r8,  0x8*3(%rdi)
+	movq	%r9,  0x8*4(%rdi)
+	movq	%r10, 0x8*5(%rdi)
+	movq	%r11, 0x8*6(%rdi)
+	movq	%r12, 0x8*7(%rdi)
+
+	leaq	64(%rdi), %rdi
+	leaq	64(%rsi), %rsi

 	jnz	.Loop2

-	movq	(%rsp),%rbx
+	movq	(%rsp), %rbx
 	CFI_RESTORE rbx
-	movq	1*8(%rsp),%r12
+	movq	1*8(%rsp), %r12
 	CFI_RESTORE r12
-	addq	$2*8,%rsp
+	addq	$2*8, %rsp
 	CFI_ADJUST_CFA_OFFSET -2*8
 	ret
 .Lcopy_page_end:
@@ -103,7 +101,7 @@ ENDPROC(copy_page)
 	.section
RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register
Thanks Boris! So the patch is helpful and has no impact on other/older
machines; I will re-send a new version according to the comments. Any
further comments are appreciated!

Regards
Ling

> -----Original Message-----
> From: Borislav Petkov [mailto:b...@alien8.de]
> Sent: Sunday, October 14, 2012 6:58 PM
> To: Ma, Ling
> Cc: Konrad Rzeszutek Wilk; mi...@elte.hu; h...@zytor.com;
> t...@linutronix.de; linux-kernel@vger.kernel.org; i...@google.com;
> George Spelvin
> Subject: Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging
> instruction sequence and saving register
>
> On Fri, Oct 12, 2012 at 08:04:11PM +0200, Borislav Petkov wrote:
> > Right, so benchmark shows around 20% speedup on Bulldozer but this is
> > a microbenchmark, and before pursuing this further, we need to verify
> > whether this brings any palpable speedup with a real benchmark, I
> > don't know, kernbench, netbench, whatever. Even something as boring as
> > a kernel build. And probably check for perf regressions on the rest of
> > the uarches.
>
> Ok, so to summarize, on AMD we're using REP MOVSQ which is even faster
> than the unrolled version. I've added the REP MOVSQ version to the
> µbenchmark. It nicely validates that we're correctly setting
> X86_FEATURE_REP_GOOD on everything >= F10h and some K8s.
>
> So, to answer Konrad's question: those patches don't concern AMD
> machines.
>
> Thanks.
>
> --
> Regards/Gruss,
> Boris.
RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register
> If you can't test the CPUs who run this code I think it's safer if you
> add a new variant for Atom, not change the existing well tested code.
> Otherwise you risk performance regressions on these older CPUs.

I found one older machine and tested the code on it; the results for the
two versions are almost the same, as below (CPU info attached):

                                   copy_page_org   copy_page_new
TPT: Len 4096, alignment  0/ 0:        2252            2218
TPT: Len 4096, alignment  0/ 0:        2244            2193
TPT: Len 4096, alignment  0/ 0:        2261            2227
TPT: Len 4096, alignment  0/ 0:        2235            2244
TPT: Len 4096, alignment  0/ 0:        2261            2184

Thanks
Ling

xeon-cpu-info
Description: xeon-cpu-info
RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register
> > > So is that also true for AMD CPUs?
> >
> > Although Bulldozer put 32byte instruction into decoupled 16byte entry
> > buffers, it still decode 4 instructions per cycle, so 4 instructions
> > will be fed into execution unit and 2 loads, 1 write will be issued
> > per cycle.
>
> I'd be very interested with what benchmarks are you seeing that perf
> improvement on Atom and who knows, maybe I could find time to run them
> on Bulldozer and see how your patch behaves there :-)

I use another benchmark from gcc; there is a lot of code, so I extracted
one simple benchmark which you may use to test (cc -o copy_page
copy_page.c). My initial result shows the new copy_page version is still
better on the Bulldozer machine, but because the machine is a first
release, please verify the result. And CC to Ian.

Thanks
Ling

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long int hp_timing_t;
#define MAXSAMPLESTPT	1000
#define MAXCOPYSIZE	(1024 * 1024)
#define ORIG	0
#define NEW	1
static char* buf1 = NULL;
static char* buf2 = NULL;
static int repeat_one_test = 32;

hp_timing_t _dl_hp_timing_overhead;

# define HP_TIMING_NOW(Var) \
  ({ unsigned long long _hi, _lo; \
     asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
     (Var) = _hi << 32 | _lo; })

#define HP_TIMING_DIFF(Diff, Start, End)	(Diff) = ((End) - (Start))

#define HP_TIMING_TOTAL(total_time, start, end)	\
  do \
    { \
      hp_timing_t tmptime; \
      HP_TIMING_DIFF (tmptime, start + _dl_hp_timing_overhead, end); \
      total_time += tmptime; \
    } \
  while (0)

#define HP_TIMING_BEST(best_time, start, end)	\
  do \
    { \
      hp_timing_t tmptime; \
      HP_TIMING_DIFF (tmptime, start + _dl_hp_timing_overhead, end); \
      if (best_time > tmptime) \
	best_time = tmptime; \
    } \
  while (0)

void copy_page_org(char *dst, char *src, int len);
void copy_page_new(char *dst, char *src, int len);
void memcpy_c(char *dst, char *src, int len);
void (*do_memcpy)(char *dst, char *src, int len);

static void
do_one_test (char *dst, char *src, size_t len)
{
  hp_timing_t start __attribute ((unused));
  hp_timing_t stop __attribute ((unused));
  hp_timing_t best_time = ~ (hp_timing_t) 0;
  size_t i, j;

  for (i = 0; i < repeat_one_test; ++i)
    {
      HP_TIMING_NOW (start);
      do_memcpy (dst, src, len);
      HP_TIMING_NOW (stop);
      HP_TIMING_BEST (best_time, start, stop);
    }

  printf ("\t%zd", (size_t) best_time);
}

static void
do_test (size_t align1, size_t align2, size_t len)
{
  size_t i, j;
  char *s1, *s2;

  s1 = (char *) (buf1 + align1);
  s2 = (char *) (buf2 + align2);

  printf ("TPT: Len %4zd, alignment %2zd/%2zd:", len, align1, align2);
  do_memcpy = copy_page_org;
  do_one_test (s2, s1, len);
  do_memcpy = copy_page_new;
  do_one_test (s2 + (1 << 16), s1 + (1 << 16), len);
  putchar ('\n');
}

static void
test_init(void)
{
  int i;
  buf1 = valloc(MAXCOPYSIZE);
  buf2 = valloc(MAXCOPYSIZE);

  for (i = 0; i < MAXCOPYSIZE; i = i + 64) {
	buf1[i] = buf2[i] = i & 0xff;
  }
}

void copy_page_new(char *dst, char *src, int len)
{
	__asm__("mov	$(4096/64)-5, %ecx");
	__asm__("1:");
	__asm__("prefetcht0 5*64(%rsi)");
	__asm__("decb	%cl");
	__asm__("movq	0x8*0(%rsi), %r10");
	__asm__("movq	0x8*1(%rsi), %rax");
	__asm__("movq	0x8*2(%rsi), %r8");
	__asm__("movq	0x8*3(%rsi), %r9");
	__asm__("movq	%r10, 0x8*0(%rdi)");
	__asm__("movq	%rax, 0x8*1(%rdi)");
	__asm__("movq	%r8, 0x8*2(%rdi)");
	__asm__("movq	%r9, 0x8*3(%rdi)");
	__asm__("movq	0x8*4(%rsi), %r10");
	__asm__("movq	0x8*5(%rsi), %rax");
	__asm__("movq	0x8*6(%rsi), %r8");
	__asm__("movq	0x8*7(%rsi), %r9");
	__asm__("leaq	64(%rsi), %rsi");
	__asm__("movq	%r10, 0x8*4(%rdi)");
	__asm__("movq	%rax, 0x8*5(%rdi)");
	__asm__("movq	%r8, 0x8*6(%rdi)");
	__asm__("movq	%r9, 0x8*7(%rdi)");
	__asm__("leaq	64(%rdi), %rdi");
	__asm__("jnz	1b");
	__asm__("mov	$5, %dl");
	__asm__("2:");
	__asm__("decb	%dl");
	__asm__("movq	0x8*0(%rsi), %r10");
	__asm__("movq	0x8*1(%rsi), %rax");
	__asm__("movq	0x8*2(%rsi), %r8");
	__asm__("movq	0x8*3(%rsi), %r9");
	__asm__("movq	%r10, 0x8*0(%rdi)");
	__asm__("movq	%rax, 0x8*1(%rdi)");
RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register
> > Load and write operations occupy about 35% and 10% respectively for
> > most industry benchmarks. A fetched 16-byte-aligned code block includes
> > about 4 instructions, implying 1.4 (0.35 * 4) loads and 0.4 writes.
> > Modern CPUs support 2 loads and 1 write per cycle, so write throughput
> > is the bottleneck for memcpy or copy_page, and some simpler CPUs only
> > support one memory operation per cycle. So it is enough to issue one
> > read and one write instruction per cycle, and we can save registers.
>
> So is that also true for AMD CPUs?

Although Bulldozer puts 32-byte instruction blocks into decoupled 16-byte
entry buffers, it still decodes 4 instructions per cycle, so 4 instructions
will be fed into the execution unit and 2 loads, 1 write will be issued per
cycle.

Thanks
Ling
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register
> > Load and write operations occupy about 35% and 10% respectively for
> > most industry benchmarks. A fetched 16-byte-aligned code block includes
> > about 4 instructions, implying 1.4 (0.35 * 4) loads and 0.4 writes.
> > Modern CPUs support 2 loads and 1 write per cycle, so write throughput
> > is the bottleneck for memcpy or copy_page, and some simpler CPUs only
> > support one memory operation per cycle. So it is enough to issue one
> > read and one write instruction per cycle, and we can save registers.
>
> I don't think "saving registers" is a useful goal here.

Ling: Issuing one read and one write op per cycle is enough for copy_page or
memcpy performance, so we can avoid the register save/restore operations.

> > In this patch we also re-arrange the instruction sequence to improve
> > performance. The performance on Atom is improved by about 11% and 9%
> > in the hot-cache and cold-cache cases respectively.
>
> That's great, but the question is what happens on the older CPUs that
> also use this sequence. It may be safer to add a new variant for Atom,
> unless you can benchmark those too.

Ling: I tested the new and original versions on Core 2; the patch improved
performance by about 9%. Although Core 2 has an out-of-order pipeline, which
weakens the instruction-sequence requirement, the ROB size is limited, so
the new patch's earlier write issue gives more opportunity to run write/load
pairs in parallel and hence a better result. Attached core2-cpu-info
(I have no older machine).

Thanks
Ling

core2-cpu-info
Description: core2-cpu-info