Re: [PATCH] reduce inlined x86 memcpy by 2 bytes
On Sunday 20 March 2005 15:17, Adrian Bunk wrote:
> Hi Denis,
>
> what do your benchmarks say about replacing the whole assembler code
> with a
>
> #define __memcpy __builtin_memcpy

It generates a call to the out-of-line memcpy() if count is non-constant.

# cat t.c
extern char *a, *b;
extern int n;
void f() { __builtin_memcpy(a,b,n); }
void g() { __builtin_memcpy(a,b,24); }
# gcc -S -O2 --omit-frame-pointer t.c
# cat t.s
	.file	"t.c"
	.text
	.p2align 2,,3
.globl f
	.type	f, @function
f:
	subl	$16, %esp
	pushl	n
	pushl	b
	pushl	a
	call	memcpy
	addl	$28, %esp
	ret
	.size	f, .-f
	.p2align 2,,3
.globl g
	.type	g, @function
g:
	pushl	%edi
	pushl	%esi
	movl	a, %edi
	movl	b, %esi
	cld
	movl	$6, %ecx
	rep movsl
	popl	%esi
	popl	%edi
	ret
	.size	g, .-g
	.section	.note.GNU-stack,"",@progbits
	.ident	"GCC: (GNU) 3.4.1"

Proving that it is slower than inline is left as an exercise for the
reader :)

The kernel one will always be inlined. void h() { __memcpy(a,b,n); } is:

	movl	n, %eax
	pushl	%edi
	movl	%eax, %ecx
	pushl	%esi
	movl	a, %edi
	movl	b, %esi
	shrl	$2, %ecx
#APP
	rep ; movsl
	movl %eax,%ecx
	andl $3,%ecx
	jz 1f
	rep ; movsb
1:
#NO_APP
	popl	%esi
	popl	%edi
	ret
--
vda
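For context on "the kernel one will always be inlined": kernel code does not
reach __memcpy directly; include/asm-i386/string.h routes memcpy() through
__builtin_constant_p() so constant sizes get a compile-time specialized copy
and everything else gets the inline rep-movs version. The sketch below is a
paraphrase from memory, not the verbatim 2.6.11 header (the real one also has
a 3DNow! variant under CONFIG_X86_USE_3DNOW):

	/*
	 * Paraphrased dispatch from include/asm-i386/string.h (sketch, not
	 * verbatim): constant n goes to a size-specialized copy, variable n
	 * to the inline asm __memcpy -- so no out-of-line memcpy() call is
	 * ever emitted, unlike the __builtin_memcpy() case shown above.
	 */
	#include <stddef.h>

	void *__memcpy(void *to, const void *from, size_t n);          /* inline asm version */
	void *__constant_memcpy(void *to, const void *from, size_t n); /* size-specialized */

	#define memcpy(t, f, n)				\
		(__builtin_constant_p(n)		\
		 ? __constant_memcpy((t), (f), (n))	\
		 : __memcpy((t), (f), (n)))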
Re: [PATCH] reduce inlined x86 memcpy by 2 bytes
Hi Denis,

what do your benchmarks say about replacing the whole assembler code
with a

  #define __memcpy __builtin_memcpy

?

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out of the
darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
                                      Pearl S. Buck - Dragon Seed
Re: [PATCH] reduce inlined x86 memcpy by 2 bytes
On Friday 18 March 2005 11:21, Denis Vlasenko wrote:
> This memcpy() is 2 bytes shorter than the one currently in mainline
> and it has one branch less. It is also 3-4% faster in microbenchmarks
> on small blocks if the block size is a multiple of 4. Mainline is
> slower because it has to branch twice per memcpy, both mispredicted
> (but branch prediction hides that in a microbenchmark).
>
> The last remaining branch can be dropped too, but then we execute the
> second 'rep movsb' always, even if blocksize%4==0. This is slower than
> mainline because 'rep movsb' is microcoded. I wonder, though, whether
> 'branchlessness' wins over this in real-world use (not in a benchmark).
>
> I think blocksize%4==0 happens more than 25% of the time.

s/%4/&3 of course.
--
vda
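(The correction is cosmetic: it makes the prose match the 'andl $3' in
the asm. For unsigned sizes, n % 4 and n & 3 are the same value, which a
throwaway userspace check confirms:)

	#include <assert.h>

	int main(void)
	{
		unsigned int n;

		/* the low two bits of an unsigned value are its remainder mod 4 */
		for (n = 0; n < 100000; n++)
			assert((n % 4) == (n & 3));
		return 0;
	}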
[PATCH] reduce inlined x86 memcpy by 2 bytes
This memcpy() is 2 bytes shorter than the one currently in mainline
and it has one branch less. It is also 3-4% faster in microbenchmarks
on small blocks if the block size is a multiple of 4. Mainline is
slower because it has to branch twice per memcpy, both mispredicted
(but branch prediction hides that in a microbenchmark).

The last remaining branch can be dropped too, but then we execute the
second 'rep movsb' always, even if blocksize%4==0. This is slower than
mainline because 'rep movsb' is microcoded. I wonder, though, whether
'branchlessness' wins over this in real-world use (not in a benchmark).

I think blocksize%4==0 happens more than 25% of the time.

This is how much an 'allyesconfig' vmlinux gains on branchless memcpy():

# size vmlinux.org vmlinux.memcpy
    text    data     bss      dec     hex filename
18178950 6293427 1808916 26281293 191054d vmlinux.org
18165160 6293427 1808916 26267503 190cf6f vmlinux.memcpy
# echo $(( (18178950-18165160) ))
13790      <= bytes saved on allyesconfig
# echo $(( (18178950-18165160)/4 ))
3447       <= memcpy() callsites optimized

The attached patch (with one branch) would save 6.5k instead of 13k.

The patch is run-tested.
--
vda

--- linux-2.6.11.src/include/asm-i386/string.h.orig	Thu Mar  3 09:31:08 2005
+++ linux-2.6.11.src/include/asm-i386/string.h	Fri Mar 18 10:55:51 2005
@@ -198,15 +198,13 @@ static inline void * __memcpy(void * to,
 int d0, d1, d2;
 __asm__ __volatile__(
 	"rep ; movsl\n\t"
-	"testb $2,%b4\n\t"
-	"je 1f\n\t"
-	"movsw\n"
-	"1:\ttestb $1,%b4\n\t"
-	"je 2f\n\t"
-	"movsb\n"
-	"2:"
+	"movl %4,%%ecx\n\t"
+	"andl $3,%%ecx\n\t"
+	"jz 1f\n\t" /* pay 2 byte penalty for a chance to skip microcoded rep */
+	"rep ; movsb\n\t"
+	"1:"
 	: "=&c" (d0), "=&D" (d1), "=&S" (d2)
-	:"0" (n/4), "q" (n),"1" ((long) to),"2" ((long) from)
+	: "0" (n/4), "g" (n), "1" ((long) to), "2" ((long) from)
 	: "memory");
 return (to);
 }
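To make the patched asm easier to follow, here is a C rendering of what it
does: copy n/4 dwords, then finish the 1-3 tail bytes only when n is not a
multiple of 4. This is a readability sketch only (the helper name is
invented, and it assumes 4-byte ints, as on i386); the real code has to stay
in asm, since a plain C loop leaves the compiler free to emit whatever copy
sequence it likes:

	/*
	 * C sketch of the patched __memcpy above. Hypothetical helper,
	 * not the kernel's code; assumes sizeof(unsigned int) == 4.
	 */
	static inline void *__memcpy_sketch(void *to, const void *from,
					    unsigned int n)
	{
		unsigned int *d = to;
		const unsigned int *s = from;
		unsigned int i;

		for (i = 0; i < n / 4; i++)	/* "rep ; movsl": copy n/4 dwords */
			*d++ = *s++;
		if (n & 3) {			/* "jz 1f": skip the microcoded rep */
			char *dc = (char *) d;
			const char *sc = (const char *) s;

			for (i = 0; i < (n & 3); i++)	/* "rep ; movsb": 1-3 tail bytes */
				*dc++ = *sc++;
		}
		return to;
	}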