RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
We also checked spec2000 and eembc_2_0 on Atom - no visible regressions or gains.

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Xinliang David Li
Sent: Friday, December 21, 2012 11:26 AM
To: Jan Hubicka
Cc: GCC Patches; Ahmad Sharif
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

Ahmad has helped with some Atom performance testing (ChromeOS benchmarks) of this patch.  In summary, there is no statistically significant regression.  There is one improvement of about +1.9% (the v8 benchmark) which looks real.

David

On Wed, Dec 12, 2012 at 9:24 AM, Xinliang David Li <davi...@google.com> wrote:
> On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> [snip]
RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
So far we see a regression on one of the eembc_1_1 tests because of the following change:

   /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
      from FP to FP.  */
-  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
+  m_AMDFAM10 | m_GENERIC,

Probably we should keep it as is, since there is indeed nothing about it in the docs...

Thanks,
Igor

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jan Hubicka
Sent: Wednesday, December 12, 2012 8:37 PM
To: Xinliang David Li
Cc: GCC Patches
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

> Push/pops wrt moves were always difficult to tune on old CPUs, so I am
> happy it is gone from generic (in fact I had a similar patch pending).
> Are you sure about Atom having a stack engine, too?
>
> [snip]
RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
We checked, no significant gains or losses.

-----Original Message-----
From: H.J. Lu [mailto:hjl.to...@gmail.com]
Sent: Friday, December 14, 2012 1:03 AM
To: Jan Hubicka
Cc: Jakub Jelinek; Xinliang David Li; GCC Patches; Teresa Johnson; Melik-adamyan, Areg
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> > Here we speak about memcpy/memset only.  I never got around to
>> > modernizing strlen and friends, unfortunately...  memcmp and friends
>> > are different beasts.  They really need some TLC...
>>
>> memcpy and memset in glibc are also extremely fast.
>
> The default strategy now is to inline only when the block is known to
> be small (either constant or via profile feedback; we do not really use
> the info on the upper bound of the size of the copied object, which
> would be useful but is not readily available at expansion time).  You
> can try the test_stringop script I attached and send me the results.

Areg, can you give it a try?  Thanks.

> For me libc starts to be a win only for rather large blocks (i.e. 8KB).

Which glibc are you using?

--
H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Thu, Dec 20, 2012 at 4:13 AM, Melik-adamyan, Areg <areg.melik-adam...@intel.com> wrote:
> We checked, no significant gains or losses.

Hi Areg,

Did you mean that inlined memcpy/memset are as fast as the ones in
libc.so on both ia32 and Intel64?

Please keep in mind that memcpy/memset in libc.a may not be optimized.
You must not use -static for linking.

--
H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> Hi Areg,
>
> Did you mean that inlined memcpy/memset are as fast as the ones in
> libc.so on both ia32 and Intel64?

I would be interested in the output of the stringop script.

> Please keep in mind that memcpy/memset in libc.a may not be optimized.
> You must not use -static for linking.

In my setup I use dynamic linking...  (This is quite an annoying
property in general - people tend to use --static for
performance-critical binaries to save the expense of PIC.  It would be
really cool to have a way to call the proper stringops based on the
-march switch.)

Honza
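[Editorial illustration, not part of the original mail: glibc picks its
tuned memcpy/memset variants at load time through the GNU ifunc
mechanism, which only works for dynamically linked ELF code - one reason
the -static issue above matters.  A minimal sketch with made-up names; a
real resolver would dispatch on CPUID results:]

#include <stddef.h>

static void *
memcpy_bytes (void *d, const void *s, size_t n)
{
  /* Deliberately naive fallback implementation.  */
  char *dp = d;
  const char *sp = s;
  while (n--)
    *dp++ = *sp++;
  return d;
}

/* The resolver runs once when the dynamic linker resolves the symbol;
   a real one would select among several tuned variants via CPUID.  */
static void *(*resolve_my_memcpy (void)) (void *, const void *, size_t)
{
  return memcpy_bytes;
}

void *my_memcpy (void *, const void *, size_t)
  __attribute__ ((ifunc ("resolve_my_memcpy")));

[This is GNU/ELF-specific; a statically linked binary bypasses the
dynamic symbol resolution that makes the dispatch work.]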
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> Hi Areg,
>
> Did you mean that inlined memcpy/memset are as fast as the ones in
> libc.so on both ia32 and Intel64?

I would be interested in the output of the stringop script.

Also, as far as I can remember, none of the spec2k6 benchmarks is really
stringop bound.  On Spec2k, GCC was quite bound by memset (within
alloc_rtx and bitmap operations), but mostly by collecting page faults
there.  Inlining that one made quite a lot of difference on K8 hardware,
but not on later chips.

Honza
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Thu, Dec 20, 2012 at 7:06 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Also, as far as I can remember, none of the spec2k6 benchmarks is
> really stringop bound.  On Spec2k, GCC was quite bound by memset
> (within alloc_rtx and bitmap operations), but mostly by collecting page
> faults there.  Inlining that one made quite a lot of difference on K8
> hardware, but not on later chips.

There is a GCC performance regression bug on EEMBC.  It turned out that
-static was used for linking, so the optimized memory functions weren't
used.  Removing -static fixed the performance regression.

--
H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Try the following two:

1) -minline-all-stringops -mstringop-strategy=rep_8byte -O2, vs
2) -mstringop-strategy=libcall -O2.

David

#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef LEN
#define LEN 16
#endif

void copy (char *s1, char *s2, int len) __attribute__ ((noinline));
void copy (char *s1, char *s2, int len)
{
  memcpy (s2, s1, len);
}

I guess the catch here is that you force the copy to be noinline and
thus you eliminate the benefits of the inlined sequence.  With inline
stringops one saves the regalloc and often can get rid of the alignment
tests.

This is the script I use to tune the tables.

Honza

test()
{
rm -f a.out
cat << END | $1 -x c -O3 $3 -DAVG_SIZE=$2 $STRINGOP -DMEMORY_COPIES=$memsize -
#define BUFFER_SIZE (16*1024*1024 + AVG_SIZE*2)
/*#define MEMORY_COPIES (1024*1024*64*(long long)10)*/
$type t[BUFFER_SIZE];
main()
{
  unsigned int i;
  for (i=0; i<((long long)MEMORY_COPIES + AVG_SIZE * 2 - 1)/AVG_SIZE*2; i++)
#ifdef test_memset
    __builtin_memset (t+(i*1024*1024+i*1)%(BUFFER_SIZE - AVG_SIZE*2), i,
                      (AVG_SIZE + i) % (AVG_SIZE * 2 + 0));
#else
    __builtin_memcpy (t+(i*1024*1024+i*1)%(BUFFER_SIZE - AVG_SIZE*2),
                      t+((i+1)*1024*1024*4+i*1)%(BUFFER_SIZE - AVG_SIZE*2),
                      (AVG_SIZE + i) % (AVG_SIZE * 2 + 0));
#endif
  return 0;
}
END
TIME=`/usr/bin/time -f "%E" ./a.out 2>&1`
echo -n " $TIME"
echo "$TIME $4" >> /tmp/accum
}

testrow()
{
echo -n "" > /tmp/accum
printf "block size %7i " $3
test "$2" "$3" "-mstringop-strategy=libcall" libcall
test "$2" "$3" "-mstringop-strategy=rep_byte -malign-stringops" rep1
test "$2" "$3" "-mstringop-strategy=rep_byte -mno-align-stringops" rep1noalign
test "$2" "$3" "-mstringop-strategy=rep_4byte -malign-stringops" rep4
test "$2" "$3" "-mstringop-strategy=rep_4byte -mno-align-stringops" rep4noalign
if [ "$mode" == 64 ]
then
  test "$2" "$3" "-mstringop-strategy=rep_8byte -malign-stringops" rep8
  test "$2" "$3" "-mstringop-strategy=rep_8byte -mno-align-stringops" rep8noalign
fi
test "$2" "$3" "-mstringop-strategy=loop -malign-stringops" loop
test "$2" "$3" "-mstringop-strategy=loop -mno-align-stringops" loopnoalign
test "$2" "$3" "-mstringop-strategy=unrolled_loop -malign-stringops" unrl
test "$2" "$3" "-mstringop-strategy=unrolled_loop -mno-align-stringops" unrlnoalign
test "$2" "$3" "-mstringop-strategy=sse_loop -malign-stringops" sse
test "$2" "$3" "-mstringop-strategy=sse_loop -mno-align-stringops -msse2" ssenoalign
test "$2" "$3" "-mstringop-strategy=byte_loop" byte
best=`cat /tmp/accum | sort | head -1`
test "$2" "$3" "-fprofile-generate" > /dev/null 2>&1
test "$2" "$3" "-fprofile-use" profiled
test "$2" "$3" "-minline-stringops-dynamically" dynamic
echo " best: $best"
}

test_all_sizes()
{
if [ "$mode" == 64 ]
then
  echo "              libcall    rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg     sse   noalg    byte profiled dynamic"
else
  echo "              libcall    rep1   noalg    rep4   noalg    loop   noalg    unrl   noalg     sse   noalg    byte profiled dynamic"
fi
#for size in 1 2 3 4 6 8 10 12 14 16 24 32 48 64 128 256 512 1024 4096 8192 81920 819200 8192000
#for size in 8192000 819200 81920 8192 4096 2048 1024 512 256 128 64 48 32 24 16 14 12 10 8 6 5 4 3 2 1
for size in 8192000 819200 81920 20480 8192 4096 2048 1024 512 256 128 64 48 32 24 16 14 12 10 8 6 4 1
#for size in 128 256 1024 4096 8192 81920 819200
do
  testrow "$1" "$2" $size
done
}

mode=$1
shift
export memsize=$1
shift
cmdline=$*
if [ "$mode" != 32 ]
then
if [ "$mode" != 64 ]
then
echo "Usage:"
echo "test_stringop mode size cmdline"
echo "mode is either 32 or 64"
echo "size is the amount of memory copied in each test.  Should be chosen small enough so the runtime is less than a minute for each test and sorting works"
echo "Example: test_stringop 32 64000 ./xgcc -B ./ -march=pentium3"
exit
fi
fi
echo "memcpy mode:$mode size:$memsize"
export STRINGOP=""
type=char
test_all_sizes $mode "$cmdline -m$mode"
echo "Aligned"
type=long
test_all_sizes $mode "$cmdline -m$mode"
echo "memset"
type=char
export STRINGOP=-Dtest_memset=1
test_all_sizes $mode "$cmdline -m$mode"
echo "Aligned"
type=long
test_all_sizes $mode "$cmdline -m$mode"
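[Editorial note: per the script's own usage text, the intended
invocation is mode (32 or 64), total bytes copied per test, then the
compiler command to be tested, e.g.:

  ./test_stringop 32 64000 ./xgcc -B ./ -march=pentium3

Each output row then holds one timing per strategy for a given block
size, so the strategy tables can be tuned from the fastest column.]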
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> [snip]
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for sizes
>> smaller than 8, where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest?
>
> I can't believe that, say, a 16 byte or 32 byte memcpy can ever be
> faster using a libcall.  The PLT call overhead is simply too high.

The x86 string/memory functions in the current glibc are extremely fast
and tuned for Core 2/Core i7.  GCC is having a very hard time beating
them with inlining:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

--
H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> The x86 string/memory functions in the current glibc are extremely fast
> and tuned for Core 2/Core i7.  GCC is having a very hard time beating
> them with inlining:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Here we speak about memcpy/memset only.  I never got around to
modernizing strlen and friends, unfortunately...  memcmp and friends are
different beasts.  They really need some TLC...

Honza
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Thu, Dec 13, 2012 at 12:26 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Here we speak about memcpy/memset only.  I never got around to
> modernizing strlen and friends, unfortunately...  memcmp and friends
> are different beasts.  They really need some TLC...

memcpy and memset in glibc are also extremely fast.

--
H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> memcpy and memset in glibc are also extremely fast.

The default strategy now is to inline only when the block is known to be
small (either constant or via profile feedback; we do not really use the
info on the upper bound of the size of the copied object, which would be
useful but is not readily available at expansion time).  You can try the
test_stringop script I attached and send me the results.  For me, libc
starts to be a win only for rather large blocks (i.e. 8KB).

Honza
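[Editorial illustration of the "inline only when the block is known to
be small" strategy Honza describes - a made-up example, not from the
thread.  Compile each with gcc -O2 -S and compare the output:]

#include <string.h>

/* The constant-size copy below has a size known at expansion time, so
   GCC expands it inline (the exact sequence depends on the tuning
   tables and -mstringop-strategy discussed in this thread).  */
void
copy_const (char *dst, const char *src)
{
  memcpy (dst, src, 16);
}

/* The variable-size copy is typically left as a call to the libc
   memcpy, unless profile feedback indicates the block is small.  */
void
copy_var (char *dst, const char *src, size_t n)
{
  memcpy (dst, src, n);
}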
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> > Here we speak about memcpy/memset only.  I never got around to
>> > modernizing strlen and friends, unfortunately...  memcmp and friends
>> > are different beasts.  They really need some TLC...
>>
>> memcpy and memset in glibc are also extremely fast.
>
> The default strategy now is to inline only when the block is known to
> be small (either constant or via profile feedback; we do not really use
> the info on the upper bound of the size of the copied object, which
> would be useful but is not readily available at expansion time).  You
> can try the test_stringop script I attached and send me the results.

Areg, can you give it a try?  Thanks.

> For me libc starts to be a win only for rather large blocks (i.e. 8KB).

Which glibc are you using?

--
H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
>> For me libc starts to be a win only for rather large blocks (i.e. 8KB).
>
> Which glibc are you using?

2.15, as it comes with openSUSE 12.2.

Honza
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Tue, Dec 11, 2012 at 11:53 PM, Xinliang David Li <davi...@google.com> wrote:
> The following is the O2 size data from SPEC2k.  Note that with
> push/pop, it is always a net win (negative delta) in terms of total
> binary or total loadable section size.

Thanks for the data!

Richard.

> thanks,
>
> David
>
>                  .text  .eh_frame  Total_binary
> vortex-move     440252      40796        584066
> vortex-push     415436      57452        575906
> delta            -5.6%      40.8%       -1.397%
>
> twolf-move      169324      10748        223521
> twolf-push      168876      11124        223449
> delta            -0.3%       3.5%       -0.032%
>
> gzip-move        30668       3652        374399
> gzip-push        30524       3740        374343
> delta            -0.5%       2.4%       -0.015%
>
> bzip2-move       22748       3196        111616
> bzip2-push       22636       3284        111592
> delta            -0.5%       2.8%       -0.022%
>
> vpr-move        104684       9380        147378
> vpr-push        104236       9788        147338
> delta            -0.4%       4.3%       -0.027%
>
> mcf-move          8444       1244         26760
> mcf-push          8444       1244         26760
> delta             0.0%       0.0%        0.000%
>
> cc1-move       1093964      90772       1576994
> cc1-push       1078988     104068       1575314
> delta            -1.4%      14.6%       -0.107%
>
> crafty-move     130556       5508       1256037
> crafty-push     130236       5772       1255981
> delta            -0.2%       4.8%       -0.004%
>
> eon-move        333660      33220        516491
> eon-push        330140      35812        51
> delta            -1.1%       7.8%       -0.181%
>
> gap-move        404092      46732       1457735
> gap-push        396012      53180       1456103
> delta            -2.0%      13.8%       -0.112%
>
> perlbmk-move    456572      45324        618585
> perlbmk-push    449516      52340        618545
> delta            -1.5%      15.5%       -0.006%
>
> parser-move      81244      15788        334003
> parser-push      80684      16332        333987
> delta            -0.7%       3.4%       -0.005%
>
> On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li <davi...@google.com> wrote:
>> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener <richard.guent...@gmail.com> wrote:
>>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump <mikest...@comcast.net> wrote:
>>>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li <davi...@google.com> wrote:
>>>>> I have not measured the CFI size impact -- but conceivably it
>>>>> should be larger -- which is unfortunate.
>>>>
>>>> Code speed and size are preferable to optimizing dwarf size… :-)
>>>> I'd let dwarf 5 fix it!
>>>
>>> Well, unlike debug info, CFI data has to be in memory to make
>>> unwinding work.  These days most Linux distributions enable
>>> asynchronous unwind tables, so any size savings due to shorter
>>> push/pop epilogue/prologue sequences are offset by the increase in
>>> CFI data.  I'm not sure there is really a speed difference between
>>> the two variants (well, maybe due to better icache footprint of the
>>> push/pop variant).
>>
>> Yes, for large applications, this can be crucial to performance.
>>
>>> That said - I'd prefer to have more data on this before making the
>>> switch for the generic model.  What was your original motivation?
>>> Just theory or was it a real case?
>>
>> 1) some of the very large internal apps I measured benefit from this
>>    change (in terms of performance)
>> 2) both ICC and LLVM do the same.
>>
>> I have already committed the patch.  I will find some time to collect
>> more size data and post it later.
>>
>> thanks,
>> David
>>
>>> Thanks,
>>> Richard.
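[Editorial note: the thread does not say how this table was produced;
per-section sizes like .text and .eh_frame can be read from a binary
with standard binutils, e.g. `size -A -d ./a.out` or `readelf -S
./a.out`.]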
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> I noticed that in the prologue/epilogue, GCC prefers to use MOVs
> followed by an SP adjustment instead of a sequence of pushes/pops.  The
> preference for MOVs is good for old CPU micro-architectures (before
> Pentium 4, K10), because it breaks the data dependency.  In modern
> micro-architectures, push/pop is implemented using a mechanism called
> the stack engine.  The data dependency is removed by the hardware, and
> push/pop becomes very cheap (1 uop, 1 cycle latency), and the
> instructions are smaller.  There is no longer a need to avoid using
> them.  This is also what ICC does.
>
> The following patch fixes the problem.  It passes bootstrap/regression
> test.  OK to install?
>
> thanks,
>
> David
>
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 194324)
> +++ config/i386/i386.c  (working copy)
> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
>    /* X86_TUNE_PROLOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,
>
>    /* X86_TUNE_EPILOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,

Push/pops wrt moves were always difficult to tune on old CPUs, so I am
happy it is gone from generic (in fact I had a similar patch pending).
Are you sure about Atom having a stack engine, too?

A related thing is accumulate_outgoing_args.  Igor is testing it on Core
and I will give it a try on K10.

Honza

I am attaching the changes for core costs I made, if someone is
interested in testing them.  If we can declare P4/PPro and maybe K8
chips obsolete for generic, there is room for improvement in generic,
too.  Like using inc/dec again.

Honza

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 194452)
+++ config/i386/i386.c  (working copy)
@@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (8),                    /* cost of FABS instruction.  */
   COSTS_N_INSNS (8),                    /* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),                   /* cost of FSQRT instruction.  */
-  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
+  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
              {-1, libcall, false}}}},
   {{libcall, {{6, loop_1_byte, true}, {24, loop, true},
              {8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
              {-1, libcall, false}}}},
   1,                                    /* scalar_stmt_cost.  */
   1,                                    /* scalar load_cost.  */
@@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO,

   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
-  m_CORE2I7 | m_GENERIC,
+  m_GENERIC | m_CORE2,

   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
    * on 16-bit immediate moves into memory on Core2 and Corei7.  */
@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,

   /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

   /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
   m_PENT4,
@@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
   m_COREI7 | m_BDVER,

   /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
-  m_BDVER ,
+  m_BDVER,

   /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
      are resolved on SSE register parts instead of whole registers, so we may
@@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
   m_ATHLON_K8,

   /* X86_TUNE_SSE_TYPELESS_STORES */
-  m_AMD_MULTIPLE,
+  m_AMD_MULTIPLE | m_CORE2I7,

   /* X86_TUNE_SSE_LOAD0_BY_PXOR */
-  m_PPRO | m_P4_NOCONA,
+  m_PPRO | m_P4_NOCONA | m_CORE2I7,

   /* X86_TUNE_MEMORY_MISMATCH_STALL */
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
@@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
   /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
      than 4 branch instructions in the 16 byte window.  */
-  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
+  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

   /* X86_TUNE_SCHEDULE */
   m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE
   | m_GENERIC,
@@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
   m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

   /* X86_TUNE_USE_INCDEC */
-  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
+  ~(m_P4_NOCONA | m_ATOM |
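[Editorial illustration, not from the thread: a schematic look at the
two prologue styles under discussion.  Register choice and offsets are
illustrative only; compile something like this with gcc -O2 -S to see
what your compiler actually emits.]

/* A function with enough live values that a few call-saved registers
   get saved in the prologue.  */
long
f (long a, long b, long c, long d)
{
  long x = a * b + c * d;
  long y = a * c + b * d;
  long z = a * d + b * c;
  return x * y + x * z + y * z;
}

/* mov-style prologue (X86_TUNE_PROLOGUE_USING_MOVE):
       subq    $24, %rsp
       movq    %rbx, (%rsp)
       movq    %rbp, 8(%rsp)
       movq    %r12, 16(%rsp)

   push-style prologue (what the patch enables for modern CPUs):
       pushq   %r12
       pushq   %rbp
       pushq   %rbx

   The push form is smaller; on CPUs with a stack engine the implicit
   %rsp updates are tracked by the front end, so the data-dependency
   argument for the mov form no longer applies.  */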
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Push/pops wrt moves were always difficult to tune on old CPUs, so I am
> happy it is gone from generic (in fact I had a similar patch pending).
> Are you sure about Atom having a stack engine, too?

Good question.  The instruction latency table
(http://www.agner.org/optimize/instruction_tables.pdf) shows that on
Atom, push r has 1 uop and 1 cycle latency.  However, the instruction is
not pairable, which will affect ILP.  The guide at
http://www.agner.org/optimize/microarchitecture.pdf does not mention
Atom having a stack engine either.

I will help collect some performance data on Atom.

thanks,

David

> A related thing is accumulate_outgoing_args.  Igor is testing it on
> Core and I will give it a try on K10.
>
> Honza
>
> I am attaching the changes for core costs I made, if someone is
> interested in testing them.  If we can declare P4/PPro and maybe K8
> chips obsolete for generic, there is room for improvement in generic,
> too.  Like using inc/dec again.
>
> Honza
>
> [snip]
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Honza, can you explain each change and point to the reference?

thanks,

David

On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> I am attaching the changes for core costs I made, if someone is
> interested in testing them.  If we can declare P4/PPro and maybe K8
> chips obsolete for generic, there is room for improvement in generic,
> too.  Like using inc/dec again.
>
> Honza
>
> [snip]
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Concerning 1 push per cycle, I think it is the same as the K7 hardware
did, so the move prologue should be a win.

> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 194452)
> +++ config/i386/i386.c  (working copy)
> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},

libcall is not faster up to 8KB than the rep sequence, which is better
for regalloc/code cache than a full-blown function call.

> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>    m_PPRO,
>
>    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> -  m_CORE2I7 | m_GENERIC,
> +  m_GENERIC | m_CORE2,

This disables shifts that store just some flags.  According to Agner's
manual, the i7 handles this well:

  Partial flags stall: The Sandy Bridge uses the method of an extra uop
  to join partial registers not only for general purpose registers but
  also for the flags register, unlike previous processors which used
  this method only for general purpose registers.  This occurs when a
  write to a part of the flags register is followed by a read from a
  larger part of the flags register.  The partial flags stall of
  previous processors (see page 75) is therefore replaced by an extra
  uop.  The Sandy Bridge also generates an extra uop when reading the
  flags after a rotate instruction.

This is cheaper than the 7-cycle delay on Core2 that this flag is trying
to avoid.

> /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>  * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>    m_K6,
>
>    /* X86_TUNE_USE_CLTD */
> -  ~(m_PENT | m_ATOM | m_K6),
> +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

None of the CPUs that generic cares about are !USE_CLTD now, after your
change.

> @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>    m_ATHLON_K8,
>
>    /* X86_TUNE_SSE_TYPELESS_STORES */
> -  m_AMD_MULTIPLE,
> +  m_AMD_MULTIPLE | m_CORE2I7,

Hmm, I cannot seem to find this in the manual now, but I believe that
stores also do not type, so a movaps store is preferred over a movapd
store because it is shorter.  If not, this change should produce a lot
of slowdowns.

>    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
> -  m_PPRO | m_P4_NOCONA,
> +  m_PPRO | m_P4_NOCONA | m_CORE2I7,

Agner:

  A common way of setting a register to zero is XOR EAX,EAX or SUB
  EBX,EBX.  The Core2 and Nehalem processors recognize that certain
  instructions are independent of the prior value of the register if the
  source and destination registers are the same.  This applies to all of
  the following instructions: XOR, SUB, PXOR, XORPS, XORPD, and all
  variants of PSUBxxx and PCMPxxx except PCMPEQQ.

>    /* X86_TUNE_MEMORY_MISMATCH_STALL */
>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict
>       more than 4 branch instructions in the 16 byte window.  */
> -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

This is a special pass to handle limitations of AMD's K7/K8/K10 branch
prediction.  Intel never had a similar design, so this flag is
pointless.  We apparently ought to disable it for K10, at least per
Agner's manual.

>    /* X86_TUNE_SCHEDULE */
>    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE
>    | m_GENERIC,
> @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
>    /* X86_TUNE_USE_INCDEC */
> -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
> +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),

Skipping inc/dec is to avoid the partial flag stall, which happens on P4
only.

>    /* X86_TUNE_PAD_RETURNS */
> -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
> +  m_AMD_MULTIPLE | m_GENERIC,

Again, this deals specifically with AMD K7/K8/K10 branch prediction.  I
am not even sure this should be enabled for K10.

>    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short function.  */
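[Editorial illustration of the XOR/PXOR zeroing idiom from the Agner Fog
excerpt above - a made-up sketch, not from the thread.  Compile with
gcc -O2 on x86-64 and inspect the assembly:]

#include <emmintrin.h>

/* Returns a zeroed vector.  GCC normally emits this as a pxor/xorpd of
   a register with itself - the dependency-breaking zeroing idiom that
   X86_TUNE_SSE_LOAD0_BY_PXOR prefers over loading 0.0 from memory.  */
__m128d
vec_zero (void)
{
  return _mm_setzero_pd ();
}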
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Jan Hubicka <hubi...@ucw.cz> writes:
> libcall is not faster up to 8KB than the rep sequence, which is better
> for regalloc/code cache than a full-blown function call.

I noticed btw that some of the generated string instructions are slower
than just calling the C library.  rep scasb etc. is rarely a win over an
optimized library function; it's not very optimized.  Perhaps those
patterns should just be disabled.

The way to optimize that on modern CPUs is to use PCMP*STR*, but that's
quite a bit more complicated and has some constraints.

>> /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict
>>    more than 4 branch instructions in the 16 byte window.  */
>> -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
> prediction.  Intel never had a similar design, so this flag is
> pointless.

Actually the Sandy Bridge decoded icache has a limit of 3 jumps per 16
byte window.  If you exceed that, it falls back to running the full
decoder from the normal icache.

I don't have solid data, but it may be a win for frontend-limited code
(otherwise possibly more in power than performance).  I would revisit
that for Sandy Bridge.

-Andi

--
a...@linux.intel.com -- Speaking for myself only
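[Editorial sketch of the PCMP*STR* approach Andi mentions - not from the
thread: an SSE4.2 strlen built on the PCMPISTRI instruction via its
intrinsic.  Caveat, as hinted above: the unaligned 16-byte loads may
read past the terminator, so a production version must ensure reads
never cross into an unmapped page.]

#include <nmmintrin.h>   /* SSE4.2; compile with gcc -O2 -msse4.2 */
#include <stddef.h>

size_t
strlen_sse42 (const char *s)
{
  const __m128i zero = _mm_setzero_si128 ();
  size_t i = 0;

  for (;;)
    {
      __m128i chunk = _mm_loadu_si128 ((const __m128i *) (s + i));
      /* pcmpistri with "equal each" against an all-zero operand yields
         the index of the first zero byte in CHUNK, or 16 if this
         16-byte block contains none.  */
      int idx = _mm_cmpistri (zero, chunk,
                              _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
      if (idx < 16)
        return i + idx;
      i += 16;
    }
}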
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> I noticed btw that some of the generated string instructions are slower
> than just calling the C library.  rep scasb etc. is rarely a win over
> an optimized library function; it's not very optimized.  Perhaps those
> patterns should just be disabled.
>
> The way to optimize that on modern CPUs is to use PCMP*STR*, but that's
> quite a bit more complicated and has some constraints.

This is only about memset/memcpy expansion.  The other sequences are
quite lame indeed...

> Actually the Sandy Bridge decoded icache has a limit of 3 jumps per 16
> byte window.  If you exceed that, it falls back to running the full
> decoder from the normal icache.
>
> I don't have solid data, but it may be a win for frontend-limited code
> (otherwise possibly more in power than performance).  I would revisit
> that for Sandy Bridge.

We are not particularly good at avoiding the branches - basically the
code inserts alignment whenever it thinks the 4 consecutive branches fit
in the window.  I can make a patch to change this to 3 and we can see if
it helps at all.

Honza
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Andi Kleen <a...@firstfloor.org> writes:
>> Actually the Sandy Bridge decoded icache has a limit of 3 jumps per 16
>> byte window.

Actually it's four per 32 bytes, sorry.  Here's an old patch I had lying
around to optimize for that.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1b871be..9b57316 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2713,6 +2713,7 @@ ix86_target_string (HOST_WIDE_INT isa, int flags, const char *arch,
     { "-mavx256-split-unaligned-load",  MASK_AVX256_SPLIT_UNALIGNED_LOAD},
     { "-mavx256-split-unaligned-store", MASK_AVX256_SPLIT_UNALIGNED_STORE},
     { "-mprefer-avx128",                MASK_PREFER_AVX128},
+    { "-mjump-pad-32bytes",             MASK_JUMP_PAD_32BYTES},
   };

   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
@@ -32182,6 +32183,7 @@ ix86_avoid_jump_mispredicts (void)
   rtx insn, start = get_insns ();
   int nbytes = 0, njumps = 0;
   int isjump = 0;
+  int jump_pad_window_size = TARGET_JUMP_PAD_32BYTES ? 32 : 16;

   /* Look for all minimal intervals of instructions containing 4 jumps.
      The intervals are bounded by START and INSN.  NBYTES is the total
@@ -32202,8 +32204,8 @@ ix86_avoid_jump_mispredicts (void)
            int align = label_to_alignment (insn);
            int max_skip = label_to_max_skip (insn);

-           if (max_skip > 15)
-             max_skip = 15;
+           if (max_skip > jump_pad_window_size - 1)
+             max_skip = jump_pad_window_size - 1;
            /* If align > 3, only up to 16 - max_skip - 1 bytes can be
               already in the current 16 byte page, because otherwise
               ASM_OUTPUT_MAX_SKIP_ALIGN could skip max_skip or fewer
@@ -32216,7 +32218,7 @@ ix86_avoid_jump_mispredicts (void)
                     INSN_UID (insn), max_skip);
            if (max_skip)
              {
-               while (nbytes + max_skip >= 16)
+               while (nbytes + max_skip >= jump_pad_window_size)
                  {
                    start = NEXT_INSN (start);
                    if ((JUMP_P (start)
@@ -32262,10 +32264,11 @@ ix86_avoid_jump_mispredicts (void)
        fprintf (dump_file, "Interval %i to %i has %i bytes\n",
                 INSN_UID (start), INSN_UID (insn), nbytes);

-      if (njumps == 3 && isjump && nbytes < 16)
+      if (njumps == 3 && isjump && nbytes < jump_pad_window_size)
        {
-         int padsize = 15 - nbytes + min_insn_size (insn);
+         int padsize = jump_pad_window_size - 1 - nbytes
+                       + min_insn_size (insn);

          if (dump_file)
            fprintf (dump_file, "Padding insn %i by %i bytes!\n",
                     INSN_UID (insn), padsize);
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 6c516e7..b38d163 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -223,6 +223,10 @@ mintel-syntax
 Target Undocumented Alias(masm=, intel, att) Warn(%<-mintel-syntax%> and %<-mno-intel-syntax%> are deprecated; use %<-masm=intel%> and %<-masm=att%> instead)
 ;; Deprecated

+mjump-pad-32bytes
+Target RejectNegative Mask(JUMP_PAD_32BYTES) Save
+Avoid more than 4 jumps in each 32byte code window.
+
 mms-bitfields
 Target Report Mask(MS_BITFIELD_LAYOUT) Save
 Use native (MS) bitfield layout

--
a...@linux.intel.com -- Speaking for myself only
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Concerning 1 push per cycle, I think it is the same as the K7 hardware
> did, so the move prologue should be a win.
>
>> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>
> libcall is not faster up to 8KB than the rep sequence, which is better
> for regalloc/code cache than a full-blown function call.

Be careful with this.  My recollection is that the REP sequence is good
for any size -- for smaller sizes, the REP initial setup cost is too
high (10s of cycles), while for large size copies, it is less efficient
compared with the library version.

>>    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> -  m_CORE2I7 | m_GENERIC,
>> +  m_GENERIC | m_CORE2,
>
> This disables shifts that store just some flags.  According to Agner's
> manual, the i7 handles this well.

ok.

> [snip]
>
> This is cheaper than the 7-cycle delay on Core2 that this flag is
> trying to avoid.

ok.

>>    /* X86_TUNE_USE_CLTD */
>> -  ~(m_PENT | m_ATOM | m_K6),
>> +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

My change was to enable CLTD for generic.  Is your change intended to
revert that?

> None of the CPUs that generic cares about are !USE_CLTD now, after your
> change.
>
>>    /* X86_TUNE_SSE_TYPELESS_STORES */
>> -  m_AMD_MULTIPLE,
>> +  m_AMD_MULTIPLE | m_CORE2I7,
>
> Hmm, I cannot seem to find this in the manual now, but I believe that
> stores also do not type, so a movaps store is preferred over a movapd
> store because it is shorter.  If not, this change should produce a lot
> of slowdowns.
>
>>    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
>> -  m_PPRO | m_P4_NOCONA,
>> +  m_PPRO | m_P4_NOCONA | m_CORE2I7,
>
> Agner: A common way of setting a register to zero is XOR EAX,EAX or SUB
> EBX,EBX.  The Core2 and Nehalem processors recognize that certain
> instructions are independent of the prior value of the register if the
> source and destination registers are the same.  This applies to all of
> the following instructions: XOR, SUB, PXOR, XORPS, XORPD, and all
> variants of PSUBxxx and PCMPxxx except PCMPEQQ.
>
>>    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict
>>       more than 4 branch instructions in the 16 byte window.  */
>> -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
> prediction.  Intel never had a similar design, so this flag is
> pointless.

I noticed that too, but Andi has a better answer to it.

> We apparently ought to disable it for K10, at least per Agner's manual.
>
>>    /* X86_TUNE_SCHEDULE */
>>    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE
>>    | m_GENERIC,
>> @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>>    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE |
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 4:16 PM, Xinliang David Li davi...@google.com wrote:
On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka hubi...@ucw.cz wrote:

Concerning 1 push per cycle, I think it is the same as what the K7 hardware did, so the move prologue should be a win.

Index: config/i386/i386.c
===
--- config/i386/i386.c (revision 194452)
+++ config/i386/i386.c (working copy)
@@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (8), /* cost of FABS instruction. */
   COSTS_N_INSNS (8), /* cost of FCHS instruction. */
   COSTS_N_INSNS (40), /* cost of FSQRT instruction. */
-  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
+  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
              {-1, libcall, false}}}},
   {{libcall, {{6, loop_1_byte, true}, {24, loop, true}, {8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
              {-1, libcall, false}}}},

libcall is not faster up to 8KB than the rep sequence, which is better for regalloc/code cache than a full-blown function call.

Be careful with this. My recollection is that the REP sequence is good for any size -- for smaller sizes, the REP initial setup cost is too high (tens of cycles), while for large copies it is less efficient compared with the library version.

s/good/not good/
David

@@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO,

 /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
-  m_CORE2I7 | m_GENERIC,
+  m_GENERIC | m_CORE2,

This disables shifts that store only some flags. According to Agner's manual, the i7 handles this well.

ok.

Partial flags stall: The Sandy Bridge uses the method of an extra µop to join partial registers not only for general purpose registers but also for the flags register, unlike previous processors which used this method only for general purpose registers. This occurs when a write to a part of the flags register is followed by a read from a larger part of the flags register. The partial flags stall of previous processors (see page 75) is therefore replaced by an extra µop. The Sandy Bridge also generates an extra µop when reading the flags after a rotate instruction.

This is cheaper than the 7 cycle delay on Core that this flag is trying to avoid.

ok.

/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
 * on 16-bit immediate moves into memory on Core2 and Corei7. */

@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,

 /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

My change was to enable CLTD for generic. Is your change intended to revert that? None of the CPUs that generic cares about are !USE_CLTD now after your change.

@@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
   m_ATHLON_K8,

 /* X86_TUNE_SSE_TYPELESS_STORES */
-  m_AMD_MULTIPLE,
+  m_AMD_MULTIPLE | m_CORE2I7,

Hmm, I cannot seem to find this in the manual now, but I believe that stores are also typeless, so a movaps store is preferred over a movapd store because it is shorter. If not, this change should produce a lot of slowdowns.

 /* X86_TUNE_SSE_LOAD0_BY_PXOR */
-  m_PPRO | m_P4_NOCONA,
+  m_PPRO | m_P4_NOCONA | m_CORE2I7,

Agner: A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The Core2 and Nehalem processors recognize that certain instructions are independent of the prior value of the register if the source and destination registers are the same. This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, and all variants of PSUBxxx and PCMPxxx except PCMPEQQ.

/* X86_TUNE_MEMORY_MISMATCH_STALL */
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

@@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
 /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
    than 4 branch instructions in the 16 byte window. */
-   m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
+   m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

This is a special pass to handle limitations of AMD's K7/K8/K10 branch prediction. Intel never had a similar design, so this flag is pointless there.

I noticed that too, but Andi has a better answer to it. We apparently ought to disable it for K10 as well, at least per Agner's manual.

/* X86_TUNE_SCHEDULE */
   m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
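[To make the zero-idiom point concrete, a minimal sketch of my own -- not part of the patch, and the function name is made up. With X86_TUNE_SSE_LOAD0_BY_PXOR set, GCC clears an SSE register by xoring it with itself, which these cores recognize as independent of the old register value, instead of loading a 0.0 constant from memory:]

  /* zero_vec.c -- illustration only.  At -O2 this typically compiles
     to "pxor %xmm0, %xmm0" (or xorpd), a dependency-free zero idiom,
     rather than a load from the constant pool.  */
  #include <emmintrin.h>

  __m128d
  zero_vec (void)
  {
    return _mm_setzero_pd ();
  }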
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
libcall is not faster up to 8KB than the rep sequence, which is better for regalloc/code cache than a full-blown function call.

Be careful with this. My recollection is that the REP sequence is not good for any size -- for smaller sizes, the REP initial setup cost is too high (tens of cycles), while for large copies it is less efficient compared with the library version.

Well, this is based on the data from the memtest script. Core has a good REP implementation - it is a win from rather small blocks (16 bytes if I recall) and it does not need alignment. The library version starts to be interesting with caching hints, but I think till 80KB it is still not a win for my setup (glibc-2.15).

/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
 * on 16-bit immediate moves into memory on Core2 and Corei7. */

@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,

 /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

My change was to enable CLTD for generic. Is your change intended to revert that?

No, it is a merge conflict, sorry. I will update it in my tree.

Skipping inc/dec is to avoid the partial flag stall, which happens on P4 only. K8 and K10 partition the flags into groups. References to flags in the same group can still cause the stall -- not sure how that can be handled.

I believe the stalls happen only in quite special cases where a compare instruction combines flags from multiple instructions. GCC doesn't generate this type of code, so we should be safe.

Honza
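[For readers unfamiliar with X86_TUNE_USE_CLTD, a small example of my own, not from the thread: signed 32-bit division must sign-extend EAX into EDX before idiv, and this flag chooses between the one-byte cltd and a mov/sar pair for that extension.]

  /* quotient.c -- illustration only.  With USE_CLTD the sign extension
     before idivl is "cltd"; with the flag off, GCC emits
     "movl %eax, %edx; sarl $31, %edx" instead.  */
  int
  quotient (int a, int b)
  {
    return a / b;
  }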
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka hubi...@ucw.cz wrote:

libcall is not faster up to 8KB than the rep sequence, which is better for regalloc/code cache than a full-blown function call.

Be careful with this. My recollection is that the REP sequence is not good for any size -- for smaller sizes, the REP initial setup cost is too high (tens of cycles), while for large copies it is less efficient compared with the library version.

Well, this is based on the data from the memtest script. Core has a good REP implementation - it is a win from rather small blocks (16 bytes if I recall) and it does not need alignment. The library version starts to be interesting with caching hints, but I think till 80KB it is still not a win for my setup (glibc-2.15).

A simple test shows that -mstringop-strategy=libcall always beats -mstringop-strategy=rep_8byte (on core2 and corei7), except for sizes smaller than 8, where the rep_8byte strategy simply bypasses REP movs. Can you share your memtest?

thanks, David

/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
 * on 16-bit immediate moves into memory on Core2 and Corei7. */

@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,

 /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

My change was to enable CLTD for generic. Is your change intended to revert that?

No, it is a merge conflict, sorry. I will update it in my tree.

Skipping inc/dec is to avoid the partial flag stall, which happens on P4 only. K8 and K10 partition the flags into groups. References to flags in the same group can still cause the stall -- not sure how that can be handled.

I believe the stalls happen only in quite special cases where a compare instruction combines flags from multiple instructions. GCC doesn't generate this type of code, so we should be safe.

Honza
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka hubi...@ucw.cz wrote:

libcall is not faster up to 8KB than the rep sequence, which is better for regalloc/code cache than a full-blown function call.

Be careful with this. My recollection is that the REP sequence is not good for any size -- for smaller sizes, the REP initial setup cost is too high (tens of cycles), while for large copies it is less efficient compared with the library version.

Well, this is based on the data from the memtest script. Core has a good REP implementation - it is a win from rather small blocks (16 bytes if I recall) and it does not need alignment. The library version starts to be interesting with caching hints, but I think till 80KB it is still not a win for my setup (glibc-2.15).

A simple test shows that -mstringop-strategy=libcall always beats -mstringop-strategy=rep_8byte (on core2 and corei7), except for sizes smaller than 8, where the rep_8byte strategy simply bypasses REP movs. Can you share your memtest?

I can't believe that, say, a 16 byte or 32 byte memcpy can ever be faster using a libcall. The PLT call overhead is simply too high.

Jakub
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Try the following one.

1) -minline-all-stringops -mstringop-strategy=rep_8byte -O2 vs
2) -mstringop-strategy=libcall -O2.

David

  #include <string.h>
  #include <stdio.h>
  #include <stdlib.h>

  #ifndef LEN
  #define LEN 16
  #endif

  void copy (char *s1, char *s2, int len) __attribute__ ((noinline));

  void
  copy (char *s1, char *s2, int len)
  {
    memcpy (s2, s1, len);
  }

  int
  main ()
  {
    char *s1 = (char *) malloc (LEN + 10);
    char *s2 = (char *) malloc (LEN + 10);
    int i = 0;
    for (i = 0; i < 10; i++)
      copy (s1 + 1, s2 + 3, LEN);
  }

On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek ja...@redhat.com wrote:
On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:

A simple test shows that -mstringop-strategy=libcall always beats -mstringop-strategy=rep_8byte (on core2 and corei7), except for sizes smaller than 8, where the rep_8byte strategy simply bypasses REP movs. Can you share your memtest?

I can't believe that, say, a 16 byte or 32 byte memcpy can ever be faster using a libcall. The PLT call overhead is simply too high.

Jakub
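[In case anyone wants to reproduce this, one plausible way to drive the test above -- my reconstruction of the two configurations David lists; the file name memtest.c is assumed, and the loop count in the test is small, so raise it for stable timing:]

  gcc -O2 -DLEN=16 -minline-all-stringops -mstringop-strategy=rep_8byte memtest.c -o memtest-rep
  gcc -O2 -DLEN=16 -mstringop-strategy=libcall memtest.c -o memtest-libcall
  time ./memtest-rep; time ./memtest-libcall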
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump mikest...@comcast.net wrote:
On Dec 10, 2012, at 12:42 PM, Xinliang David Li davi...@google.com wrote:

I have not measured the CFI size impact -- but conceivably it should be larger -- which is unfortunate.

Code speed and size are preferable to optimizing dwarf size… :-) I'd let dwarf 5 fix it!

Well, unlike debug info, CFI data has to be in memory to make unwinding work. These days most Linux distributions enable asynchronous unwind tables, so any size savings from shorter push/pop prologue/epilogue sequences have to be offset against the increase in CFI data. I'm not sure there is really a speed difference between the two variants (well, maybe due to the better icache footprint of the push/pop variant).

That said - I'd prefer to have more data on this before making the switch for the generic model. What was your original motivation? Just theory, or was it a real case?

Thanks, Richard.
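[To see the trade-off Richard describes, a small example of my own, not from the thread: a function that keeps values live across calls, so several callee-saved registers must be saved in the prologue. Compiling it with gcc -O2 -S shows one .cfi_def_cfa_offset record per push in the push-based prologue, since the CFA moves at every push, versus a single stack adjustment in the mov-based one -- shorter .text, larger .eh_frame.]

  /* cfi_demo.c -- illustration only; "work" is a made-up external.  */
  extern int work (int, int, int, int);

  int
  use_callee_saved (int a, int b, int c, int d)
  {
    int x = work (a, b, c, d);
    int y = work (b, c, d, a);
    int z = work (c, d, a, b);
    return x + y + z + a + b + c + d;
  }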
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
The following is the -O2 size data from SPEC2k. Note that with push/pop it is always a net win (negative delta) in terms of total binary or total loadable section size.

thanks, David

                 .text     .eh_frame  Total_binary
vortex-move      440252    40796      584066
vortex-push      415436    57452      575906
delta            -5.6%     40.8%      -1.397%

twolf-move       169324    10748      223521
twolf-push       168876    11124      223449
delta            -0.3%     3.5%       -0.032%

gzip-move        30668     3652       374399
gzip-push        30524     3740       374343
delta            -0.5%     2.4%       -0.015%

bzip2-move       22748     3196       111616
bzip2-push       22636     3284       111592
delta            -0.5%     2.8%       -0.022%

vpr-move         104684    9380       147378
vpr-push         104236    9788       147338
delta            -0.4%     4.3%       -0.027%

mcf-move         8444      1244       26760
mcf-push         8444      1244       26760
delta            0.0%      0.0%       0.000%

cc1-move         1093964   90772      1576994
cc1-push         1078988   104068     1575314
delta            -1.4%     14.6%      -0.107%

crafty-move      130556    5508       1256037
crafty-push      130236    5772       1255981
delta            -0.2%     4.8%       -0.004%

eon-move         333660    33220      516491
eon-push         330140    35812      51
delta            -1.1%     7.8%       -0.181%

gap-move         404092    46732      1457735
gap-push         396012    53180      1456103
delta            -2.0%     13.8%      -0.112%

perlbmk-move     456572    45324      618585
perlbmk-push     449516    52340      618545
delta            -1.5%     15.5%      -0.006%

parser-move      81244     15788      334003
parser-push      80684     16332      333987
delta            -0.7%     3.4%       -0.005%

On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li davi...@google.com wrote:
On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener richard.guent...@gmail.com wrote:
On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump mikest...@comcast.net wrote:
On Dec 10, 2012, at 12:42 PM, Xinliang David Li davi...@google.com wrote:

I have not measured the CFI size impact -- but conceivably it should be larger -- which is unfortunate.

Code speed and size are preferable to optimizing dwarf size… :-) I'd let dwarf 5 fix it!

Well, unlike debug info, CFI data has to be in memory to make unwinding work. These days most Linux distributions enable asynchronous unwind tables, so any size savings from shorter push/pop prologue/epilogue sequences have to be offset against the increase in CFI data. I'm not sure there is really a speed difference between the two variants (well, maybe due to the better icache footprint of the push/pop variant).

Yes, for large applications, this can be crucial to performance.

That said - I'd prefer to have more data on this before making the switch for the generic model. What was your original motivation? Just theory, or was it a real case?

1) Some of the very large internal apps I measured benefit from this change (in terms of performance). 2) Both ICC and LLVM do the same.

I have already committed the patch. I will find some time to collect more size data and post it later.

thanks, David

Thanks, Richard.
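[For reference, per-section numbers like the ones above can be pulled from the binaries with binutils -- a sketch; the binary names are whatever the move and push builds were called:]

  size -A vortex-push | grep -E '\.text|\.eh_frame'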
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Some SPEC2k performance numbers (with 3 runs on core2): push wins over move on 3 benchmarks; the others are noise.

perlbmk: ~+1.9%
gap: ~+1.4%
vortex: ~+0.7%

David

On Tue, Dec 11, 2012 at 2:53 PM, Xinliang David Li davi...@google.com wrote:

The following is the -O2 size data from SPEC2k. Note that with push/pop it is always a net win (negative delta) in terms of total binary or total loadable section size.

thanks, David
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak ubiz...@gmail.com wrote:

Hello!

I noticed that in the prologue/epilogue, GCC prefers to use MOVs followed by an SP adjustment instead of a sequence of pushes/pops. The preference for MOVs is good for old CPU micro-architectures (before Pentium 4, K10), because it breaks the data dependency. In modern micro-architectures, push/pop is implemented using a mechanism called the stack engine. The data dependency is removed by the hardware, and push/pop becomes very cheap (1 uOp, 1 cycle latency), and the instructions are smaller. There is no longer any need to avoid using them. This is also what ICC does.

2012-12-08 Xinliang David Li davi...@google.com
* config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen CPUs.

s/moderen/modern

OK for mainline SVN.

It's also more costly for unwind info in the prologue/epilogue. Thus, did you measure the effect on CFI size?

Thanks, Richard.

Thanks, Uros.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
I have not measured the CFI size impact -- but conceivably it should be larger -- which is unfortunate.

David

On Mon, Dec 10, 2012 at 1:23 AM, Richard Biener richard.guent...@gmail.com wrote:
On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak ubiz...@gmail.com wrote:

Hello!

I noticed that in the prologue/epilogue, GCC prefers to use MOVs followed by an SP adjustment instead of a sequence of pushes/pops. The preference for MOVs is good for old CPU micro-architectures (before Pentium 4, K10), because it breaks the data dependency. In modern micro-architectures, push/pop is implemented using a mechanism called the stack engine. The data dependency is removed by the hardware, and push/pop becomes very cheap (1 uOp, 1 cycle latency), and the instructions are smaller. There is no longer any need to avoid using them. This is also what ICC does.

2012-12-08 Xinliang David Li davi...@google.com
* config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen CPUs.

s/moderen/modern

OK for mainline SVN.

It's also more costly for unwind info in the prologue/epilogue. Thus, did you measure the effect on CFI size?

Thanks, Richard.

Thanks, Uros.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Dec 10, 2012, at 12:42 PM, Xinliang David Li davi...@google.com wrote:

I have not measured the CFI size impact -- but conceivably it should be larger -- which is unfortunate.

Code speed and size are preferable to optimizing dwarf size… :-) I'd let dwarf 5 fix it!
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
s/Eanble/Enable/

Thanks, Dmitry

2012/12/9 Uros Bizjak ubiz...@gmail.com:

Hello!

I noticed that in the prologue/epilogue, GCC prefers to use MOVs followed by an SP adjustment instead of a sequence of pushes/pops. The preference for MOVs is good for old CPU micro-architectures (before Pentium 4, K10), because it breaks the data dependency. In modern micro-architectures, push/pop is implemented using a mechanism called the stack engine. The data dependency is removed by the hardware, and push/pop becomes very cheap (1 uOp, 1 cycle latency), and the instructions are smaller. There is no longer any need to avoid using them. This is also what ICC does.

2012-12-08 Xinliang David Li davi...@google.com
* config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen CPUs.

s/moderen/modern

OK for mainline SVN.

Thanks, Uros.
[PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
I noticed that in the prologue/epilogue, GCC prefers to use MOVs followed by an SP adjustment instead of a sequence of pushes/pops. The preference for MOVs is good for old CPU micro-architectures (before Pentium 4, K10), because it breaks the data dependency. In modern micro-architectures, push/pop is implemented using a mechanism called the stack engine. The data dependency is removed by the hardware, and push/pop becomes very cheap (1 uOp, 1 cycle latency), and the instructions are smaller. There is no longer any need to avoid using them. This is also what ICC does.

The following patch fixes the problem. It passes bootstrap/regression test. OK to install?

thanks, David

Index: config/i386/i386.c
===
--- config/i386/i386.c (revision 194324)
+++ config/i386/i386.c (working copy)
@@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

 /* X86_TUNE_PROLOGUE_USING_MOVE */
-  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
+  m_PPRO | m_ATHLON_K8,

 /* X86_TUNE_EPILOGUE_USING_MOVE */
-  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
+  m_PPRO | m_ATHLON_K8,

 /* X86_TUNE_SHIFT1 */
   ~m_486,

2012-12-08 Xinliang David Li davi...@google.com
* config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen CPUs.
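[For readers outside the i386 backend, a standalone sketch of my own showing how a tune table like the one patched above is consumed; names and values are simplified stand-ins, not the real i386.c definitions. Each X86_TUNE_* entry is a bitmask of CPU classes, the bit for the selected -mtune CPU is cached at startup, and removing m_CORE2I7/m_ATOM/m_GENERIC from the PROLOGUE/EPILOGUE_USING_MOVE masks is what flips those targets onto the push/pop path.]

  /* tune_sketch.c -- simplified illustration, not GCC code.  */
  #include <stdio.h>

  enum { M_PPRO = 1 << 0, M_ATHLON_K8 = 1 << 1, M_CORE2I7 = 1 << 2,
         M_ATOM = 1 << 3, M_GENERIC = 1 << 4 };

  enum { TUNE_PROLOGUE_USING_MOVE, TUNE_EPILOGUE_USING_MOVE, TUNE_LAST };

  /* After the patch, only PPro and Athlon/K8 keep the mov-based sequence.  */
  static const unsigned tune_masks[TUNE_LAST] = {
    M_PPRO | M_ATHLON_K8,       /* TUNE_PROLOGUE_USING_MOVE */
    M_PPRO | M_ATHLON_K8,       /* TUNE_EPILOGUE_USING_MOVE */
  };

  static int tune_features[TUNE_LAST];

  /* Cache, per feature, whether the selected CPU's bit is in the mask.  */
  static void
  set_tune_features (unsigned cpu_bit)
  {
    for (int i = 0; i < TUNE_LAST; i++)
      tune_features[i] = (tune_masks[i] & cpu_bit) != 0;
  }

  int
  main (void)
  {
    set_tune_features (M_CORE2I7);
    /* Prints 0: Core2/i7 no longer uses the mov prologue, so push/pop.  */
    printf ("prologue uses moves: %d\n",
            tune_features[TUNE_PROLOGUE_USING_MOVE]);
    return 0;
  }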