RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
So far we see a regression on one of the eembc_1_1 tests because of the
following change:

   /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
      from FP to FP. */
-  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
+  m_AMDFAM10 | m_GENERIC,

We should probably keep it as is, since there is indeed nothing about it in
the docs...

Thanks,
Igor

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jan Hubicka
Sent: Wednesday, December 12, 2012 8:37 PM
To: Xinliang David Li
Cc: GCC Patches
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

> I noticed that in the prologue/epilogue, GCC prefers to use MOVs followed
> by an SP adjustment instead of a sequence of pushes/pops. The preference
> for the MOVs is good for old CPU micro-architectures (before Pentium 4,
> K10), because it breaks the data dependency. In modern
> micro-architectures, push/pop is implemented using a mechanism called the
> stack engine. The data dependency is removed by the hardware, and
> push/pop becomes very cheap (1 uop, 1 cycle latency), and the
> instructions are smaller. There is no longer a need to avoid using them.
> This is also what ICC does.
>
> The following patch fixes the problem. It passes the bootstrap/regression
> test. OK to install?
>
> thanks,
>
> David
>
> Index: config/i386/i386.c
> ===
> --- config/i386/i386.c (revision 194324)
> +++ config/i386/i386.c (working copy)
> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
>    /* X86_TUNE_PROLOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,
>
>    /* X86_TUNE_EPILOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,

Push/pops wrt moves were always difficult to tune on old CPUs, so I am
happy it is gone from generic (in fact I had a similar patch pending).
Are you sure about Atom having a stack engine, too?

A related thing is accumulate_outgoing_args. Igor is testing it on Core and
I will give it a try on K10.

Honza

I am attaching the changes for core costs I made if someone is interested
in testing them. If we can declare P4/PPro and maybe K8 chips obsolete for
generic, there is room for improvement in generic, too. Like using inc/dec
again.

Honza

Index: config/i386/i386.c
===
--- config/i386/i386.c (revision 194452)
+++ config/i386/i386.c (working copy)
@@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (8),  /* cost of FABS instruction. */
   COSTS_N_INSNS (8),  /* cost of FCHS instruction. */
   COSTS_N_INSNS (40), /* cost of FSQRT instruction. */
-  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
+  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
              {-1, libcall, false}}}},
   {{libcall, {{6, loop_1_byte, true},
              {24, loop, true},
              {8192, rep_prefix_4_byte, true},
              {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
              {-1, libcall, false}}}},
   1, /* scalar_stmt_cost. */
   1, /* scalar load_cost. */
@@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO,

   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
-  m_CORE2I7 | m_GENERIC,
+  m_GENERIC | m_CORE2,

   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
    * on 16-bit immediate moves into memory on Core2 and Corei7. */
@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,

   /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

   /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx. */
   m_PENT4,
@@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
   m_COREI7 | m_BDVER,

   /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
-  m_BDVER ,
+  m_BDVER,

   /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and
      dependencies are resolved on SSE register parts instead of whole
      registers, so we may
@@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
   m_ATHLON_K8,

   /* X86_TUNE_SSE_TYPELESS_STORES */
-  m_AMD_MULTIPLE,
+  m_AMD_MULTIPLE | m_CORE2I7, /**/

   /* X86_TUNE_SSE_LOAD0_BY_PXOR */
-  m_PPRO | m_P4_NOCONA,
+  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/

   /* X86_T
RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
We also checked spec2000 and eembc_2_0 on Atom - no visible regressions or
gains.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Xinliang David Li
Sent: Friday, December 21, 2012 11:26 AM
To: Jan Hubicka
Cc: GCC Patches; Ahmad Sharif
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

Ahmad has helped with some Atom performance testing (ChromeOS benchmarks)
with this patch. In summary, there is no statistically significant
regression seen. There is one improvement of about +1.9% (v8 benchmark)
which looks real.

David

On Wed, Dec 12, 2012 at 9:24 AM, Xinliang David Li wrote:
> On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka wrote:
>>> I noticed that in the prologue/epilogue, GCC prefers to use MOVs
>>> followed by an SP adjustment instead of a sequence of pushes/pops. The
>>> preference for the MOVs is good for old CPU micro-architectures (before
>>> Pentium 4, K10), because it breaks the data dependency. In modern
>>> micro-architectures, push/pop is implemented using a mechanism called
>>> the stack engine. The data dependency is removed by the hardware, and
>>> push/pop becomes very cheap (1 uop, 1 cycle latency), and the
>>> instructions are smaller. There is no longer a need to avoid using
>>> them. This is also what ICC does.
>>>
>>> The following patch fixes the problem. It passes the
>>> bootstrap/regression test. OK to install?
>>>
>>> thanks,
>>>
>>> David
>>>
>>> Index: config/i386/i386.c
>>> ===
>>> --- config/i386/i386.c (revision 194324)
>>> +++ config/i386/i386.c (working copy)
>>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>>
>>>    /* X86_TUNE_PROLOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>>
>>>    /* X86_TUNE_EPILOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>
>> Push/pops wrt moves were always difficult to tune on old CPUs, so I am
>> happy it is gone from generic (in fact I had a similar patch pending).
>> Are you sure about Atom having a stack engine, too?
>>
>
> Good question. The instruction latency table
> (http://www.agner.org/optimize/instruction_tables.pdf) shows that for
> Atom, push r is 1 uop with 1 cycle latency. However, the instruction is
> not pairable, which will affect ILP. The guide at
> http://www.agner.org/optimize/microarchitecture.pdf does not mention that
> Atom has a stack engine either.
>
> I will help collect some performance data on Atom.
>
>
> thanks,
>
> David
>
>
>> A related thing is accumulate_outgoing_args. Igor is testing it on Core
>> and I will give it a try on K10.
>>
>> Honza
>>
>> I am attaching the changes for core costs I made if someone is
>> interested in testing them. If we can declare P4/PPro and maybe K8 chips
>> obsolete for generic, there is room for improvement in generic, too.
>> Like using inc/dec again.
>>
>> Honza
>>
>> Index: config/i386/i386.c
>> ===
>> --- config/i386/i386.c (revision 194452)
>> +++ config/i386/i386.c (working copy)
>> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>    COSTS_N_INSNS (8),  /* cost of FABS instruction. */
>>    COSTS_N_INSNS (8),  /* cost of FCHS instruction. */
>>    COSTS_N_INSNS (40), /* cost of FSQRT instruction. */
>> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>               {-1, libcall, false}}}},
>>    {{libcall, {{6, loop_1_byte, true},
>>               {24, loop, true},
>>               {8192, rep_prefix_4_byte, true},
>>               {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>               {-1, libcall, false}}}},
>>    1, /* scalar_stmt_cost. */
>>    1, /* scalar load_cost. */
>> @@ -1806,7 +1806,7 @@ sta
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Ahmad has helped with some Atom performance testing (ChromeOS
benchmarks) with this patch. In summary, there is no statistically
significant regression seen. There is one improvement of about +1.9%
(v8 benchmark) which looks real.
David
On Wed, Dec 12, 2012 at 9:24 AM, Xinliang David Li wrote:
> On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka wrote:
>>> I noticed that in the prologue/epilogue, GCC prefers to use MOVs
>>> followed by an SP adjustment instead of a sequence of pushes/pops. The
>>> preference for the MOVs is good for old CPU micro-architectures (before
>>> Pentium 4, K10), because it breaks the data dependency. In modern
>>> micro-architectures, push/pop is implemented using a mechanism called
>>> the stack engine. The data dependency is removed by the hardware, and
>>> push/pop becomes very cheap (1 uop, 1 cycle latency), and the
>>> instructions are smaller. There is no longer a need to avoid using
>>> them. This is also what ICC does.
>>>
>>> The following patch fixes the problem. It passes the bootstrap/regression
>>> test. OK to install?
>>>
>>> thanks,
>>>
>>> David
>>>
>>> Index: config/i386/i386.c
>>> ===
>>> --- config/i386/i386.c (revision 194324)
>>> +++ config/i386/i386.c (working copy)
>>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>>
>>>/* X86_TUNE_PROLOGUE_USING_MOVE */
>>> - m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> + m_PPRO | m_ATHLON_K8,
>>>
>>>/* X86_TUNE_EPILOGUE_USING_MOVE */
>>> - m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> + m_PPRO | m_ATHLON_K8,
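For illustration, a minimal hand-written sketch of the two prologue styles
this flag selects between; the function and the assembly are illustrative,
not output from this patch:

/* A function with enough register pressure that callee-saved registers
   are spilled, so a prologue/epilogue must be emitted.  */
long f (long a, long b, long c, long d)
{
  long x = a * b + c;
  long y = b * c + d;
  return x * y + a + d;
}

/* Mov-style prologue (X86_TUNE_PROLOGUE_USING_MOVE set):
     subq  $16, %rsp
     movq  %rbx, (%rsp)
     movq  %rbp, 8(%rsp)
   Push-style prologue (flag clear, as the patch selects for
   Core/Atom/generic):
     pushq %rbx
     pushq %rbp
   On stack-engine CPUs each push is 1 uop with 1 cycle latency and has
   the shorter encoding.  */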
>>
>> Push/pops wrt moves were always difficult to tune on old CPUs, so I am
>> happy it is gone from generic (in fact I had a similar patch pending).
>> Are you sure about Atom having a stack engine, too?
>>
>
> Good question. The instruction latency table
> (http://www.agner.org/optimize/instruction_tables.pdf) shows that for
> Atom, push r is 1 uop with 1 cycle latency. However, the instruction is
> not pairable, which will affect ILP. The guide at
> http://www.agner.org/optimize/microarchitecture.pdf does not mention
> that Atom has a stack engine either.
>
> I will help collect some performance data on Atom.
>
>
> thanks,
>
> David
>
>
>> A related thing is accumulate_outgoing_args. Igor is testing it on Core
>> and I will give it a try on K10.
>>
>> Honza
>>
>> I am attaching the changes for core costs I made if someone is interested
>> in testing them. If we can declare P4/PPro and maybe K8 chips obsolete for
>> generic, there is room for improvement in generic, too. Like using inc/dec
>> again.
>>
>> Honza
>>
>> Index: config/i386/i386.c
>> ===
>> --- config/i386/i386.c (revision 194452)
>> +++ config/i386/i386.c (working copy)
>> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>COSTS_N_INSNS (8), /* cost of FABS instruction. */
>>COSTS_N_INSNS (8), /* cost of FCHS instruction. */
>>COSTS_N_INSNS (40), /* cost of FSQRT instruction. */
>> - {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> - {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> + {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>    {-1, libcall, false}}}},
>>{{libcall, {{6, loop_1_byte, true},
>>{24, loop, true},
>>{8192, rep_prefix_4_byte, true},
>>{-1, libcall, false}}},
>> - {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>    {-1, libcall, false}}}},
>>1, /* scalar_stmt_cost. */
>>1, /* scalar load_cost. */
>> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>>m_PPRO,
>>
>>/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> - m_CORE2I7 | m_GENERIC,
>> + m_GENERIC | m_CORE2,
>>
>>/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>> * on 16-bit immediate moves into memory on Core2 and Corei7. */
>> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>>m_K6,
>>
>>/* X86_TUNE_USE_CLTD */
>> - ~(m_PENT | m_ATOM | m_K6),
>> + ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>>
>>/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx. */
>>m_PENT4,
>> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>>m_COREI7 | m_BDVER,
>>
>>/* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
>> - m_BDVER ,
>> + m_BDVER,
>>
>>/* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and
>> dependencies
>> are resolved on SSE register parts instead of whole registers, so we
>> may
>> @@ -1910,10
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Thu, Dec 20, 2012 at 7:06 AM, Jan Hubicka wrote:
>> > Hi Areg,
>> >
>> > Did you mean inlined memcpy/memset are as fast as
>> > the ones in libc.so on both ia32 and Intel64?
>>
>> I would be interested in the output of the stringop script.
>
> Also as far as I can remember, none of the spec2k6 benchmarks is really
> stringop bound. On Spec2k, GCC was quite bound by memset (within
> alloc_rtx and bitmap operations) but mostly by collecting page faults
> there. Inlining that one made quite a lot of difference on K8 hardware,
> but not on later chips.
>

There is a GCC performance regression bug on EEMBC. It turns out that
-static was used for linking, so the optimized memory functions weren't
used. Removing -static fixed the performance regression.

--
H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> > Hi Areg,
> >
> > Did you mean inlined memcpy/memset are as fast as
> > the ones in libc.so on both ia32 and Intel64?

I would be interested in the output of the stringop script.

Also as far as I can remember, none of the spec2k6 benchmarks is really
stringop bound. On Spec2k, GCC was quite bound by memset (within alloc_rtx
and bitmap operations) but mostly by collecting page faults there. Inlining
that one made quite a lot of difference on K8 hardware, but not on later
chips.

Honza
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> Hi Areg,
>
> Did you mean inlined memcpy/memset are as fast as
> the ones in libc.so on both ia32 and Intel64?

I would be interested in the output of the stringop script.

> Please keep in mind that memcpy/memset in libc.a
> may not be optimized. You must not use -static for
> linking.

In my setup I use dynamic linking... (This is quite an annoying property in
general - people tend to use --static for performance-critical binaries to
save the expense of PIC. It would be really cool to have a way to call the
proper stringops based on the -march switch.)

Honza

> --
> H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Thu, Dec 20, 2012 at 4:13 AM, Melik-adamyan, Areg wrote:
> We checked, no significant gains or losses.
>
> -----Original Message-----
> From: H.J. Lu [mailto:[email protected]]
> Sent: Friday, December 14, 2012 1:03 AM
> To: Jan Hubicka
> Cc: Jakub Jelinek; Xinliang David Li; GCC Patches; Teresa Johnson;
> Melik-adamyan, Areg
> Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
>
> On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka wrote:
>>> > Here we speak about memcpy/memset only. I never got around to
>>> > modernizing strlen and friends, unfortunately...
>>> >
>>> > memcmp and friends are different beasts. They really need some TLC...
>>>
>>> memcpy and memset in glibc are also extremely fast.
>>
>> The default strategy now is to inline only when the block is known to
>> be small (either constant or via profile feedback; we do not really
>> use the info on the upper bound of the size of the copied object, which
>> would be useful but is not readily available at expansion time).
>>
>> You can try the test_stringop script I attached and send me the
>> results. For
>
> Areg, can you give it a try? Thanks.
>

Hi Areg,

Did you mean inlined memcpy/memset are as fast as
the ones in libc.so on both ia32 and Intel64?

Please keep in mind that memcpy/memset in libc.a
may not be optimized. You must not use -static for
linking.

--
H.J.
RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
We checked, no significant gains or losses.

-----Original Message-----
From: H.J. Lu [mailto:[email protected]]
Sent: Friday, December 14, 2012 1:03 AM
To: Jan Hubicka
Cc: Jakub Jelinek; Xinliang David Li; GCC Patches; Teresa Johnson; Melik-adamyan, Areg
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka wrote:
>> > Here we speak about memcpy/memset only. I never got around to
>> > modernizing strlen and friends, unfortunately...
>> >
>> > memcmp and friends are different beasts. They really need some TLC...
>>
>> memcpy and memset in glibc are also extremely fast.
>
> The default strategy now is to inline only when the block is known to
> be small (either constant or via profile feedback; we do not really
> use the info on the upper bound of the size of the copied object, which
> would be useful but is not readily available at expansion time).
>
> You can try the test_stringop script I attached and send me the
> results. For

Areg, can you give it a try? Thanks.

> me libc starts to be a win only for rather large blocks (i.e. >8KB)
>

Which glibc are you using?

--
H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> > me libc starts to be a win only for rather large blocks (i.e. >8KB)
> >
> Which glibc are you using?

2.15, as it comes with openSUSE 12.2.

Honza

> --
> H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka wrote:
>> > Here we speak about memcpy/memset only. I never got around to
>> > modernizing strlen and friends, unfortunately...
>> >
>> > memcmp and friends are different beasts. They really need some TLC...
>>
>> memcpy and memset in glibc are also extremely fast.
>
> The default strategy now is to inline only when the block is known to be
> small (either constant or via profile feedback; we do not really use the
> info on the upper bound of the size of the copied object, which would be
> useful but is not readily available at expansion time).
>
> You can try the test_stringop script I attached and send me the results.
> For

Areg, can you give it a try? Thanks.

> me libc starts to be a win only for rather large blocks (i.e. >8KB)
>

Which glibc are you using?

--
H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> > Here we speak about memcpy/memset only. I never got around to
> > modernizing strlen and friends, unfortunately...
> >
> > memcmp and friends are different beasts. They really need some TLC...
>
> memcpy and memset in glibc are also extremely fast.

The default strategy now is to inline only when the block is known to be
small (either constant or via profile feedback; we do not really use the
info on the upper bound of the size of the copied object, which would be
useful but is not readily available at expansion time).

You can try the test_stringop script I attached and send me the results.
For me libc starts to be a win only for rather large blocks (i.e. >8KB).

Honza
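To make the comparison concrete, a minimal hand-written sketch; the
-mstringop-strategy values are the ones used elsewhere in this thread, and
the expected expansions in the comments are assumptions, not measurements:

#include <string.h>

/* A copy whose size is only known at run time: whether this expands to an
   inline rep movs sequence or to a call to memcpy is exactly what the
   cost tables above decide.  */
void copy_n (char *dst, const char *src, size_t n)
{
  memcpy (dst, src, n);
}

/* Compare, e.g.:
     gcc -O2 -mstringop-strategy=rep_8byte -S stringop.c
     gcc -O2 -mstringop-strategy=libcall   -S stringop.c
   and inspect the generated assembly for copy_n.  */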
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Thu, Dec 13, 2012 at 12:26 PM, Jan Hubicka wrote:
>> On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek wrote:
>> > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
>> >> >> > libcall is not faster up to 8KB compared to the rep sequence,
>> >> >> > which is better for regalloc/code cache than a fully blown
>> >> >> > function call.
>> >> >>
>> >> >> Be careful with this. My recollection is that the REP sequence is
>> >> >> good for any size -- for smaller sizes, the REP initial set-up
>> >> >> cost is too high (10s of cycles), while for large copies it is
>> >> >> less efficient compared with the library version.
>> >> >
>> >> > Well this is based on the data from the memtest script.
>> >> > Core has a good REP implementation - it is a win from rather small
>> >> > blocks (16 bytes if I recall) and it does not need alignment.
>> >> > The library version starts to be interesting with caching hints,
>> >> > but I think till 80KB it is still not a win for my setup
>> >> > (glibc-2.15).
>> >>
>> >> A simple test shows that -mstringop-strategy=libcall always beats
>> >> -mstringop-strategy=rep_8byte (on core2 and corei7) except for sizes
>> >> smaller than 8, where the rep_8byte strategy simply bypasses REP
>> >> movs. Can you share your memtest?
>> >
>> > I can't believe that, say, a 16 byte or 32 byte memcpy can ever be
>> > faster using a libcall. The PLT call overhead is simply too high.
>> >
>>
>> The x86 string/memory functions in the current glibc are
>> extremely fast and tuned for Core 2/Core i7. GCC is having
>> a very hard time beating them with inlining:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
>
> Here we speak about memcpy/memset only. I never got around to
> modernizing strlen and friends, unfortunately...
>
> memcmp and friends are different beasts. They really need some TLC...

memcpy and memset in glibc are also extremely fast.

--
H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek wrote:
> > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
> >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
> >> >> > libcall is not faster up to 8KB compared to the rep sequence,
> >> >> > which is better for regalloc/code cache than a fully blown
> >> >> > function call.
> >> >>
> >> >> Be careful with this. My recollection is that the REP sequence is
> >> >> good for any size -- for smaller sizes, the REP initial set-up cost
> >> >> is too high (10s of cycles), while for large copies it is less
> >> >> efficient compared with the library version.
> >> >
> >> > Well this is based on the data from the memtest script.
> >> > Core has a good REP implementation - it is a win from rather small
> >> > blocks (16 bytes if I recall) and it does not need alignment.
> >> > The library version starts to be interesting with caching hints, but
> >> > I think till 80KB it is still not a win for my setup (glibc-2.15).
> >>
> >> A simple test shows that -mstringop-strategy=libcall always beats
> >> -mstringop-strategy=rep_8byte (on core2 and corei7) except for sizes
> >> smaller than 8, where the rep_8byte strategy simply bypasses REP movs.
> >> Can you share your memtest?
> >
> > I can't believe that, say, a 16 byte or 32 byte memcpy can ever be
> > faster using a libcall. The PLT call overhead is simply too high.
> >
>
> The x86 string/memory functions in the current glibc are
> extremely fast and tuned for Core 2/Core i7. GCC is having
> a very hard time beating them with inlining:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Here we speak about memcpy/memset only. I never got around to modernizing
strlen and friends, unfortunately...

memcmp and friends are different beasts. They really need some TLC...

Honza
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek wrote: > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote: >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote: >> >> > libcall is not faster up to 8KB to rep sequence that is better for >> >> > regalloc/code >> >> > cache than fully blowin function call. >> >> >> >> Be careful with this. My recollection is that REP sequence is good for >> >> any size -- for smaller size, the REP initial set up cost is too high >> >> (10s of cycles), while for large size copy, it is less efficient >> >> compared with library version. >> > >> > Well this is based on the data from the memtest script. >> > Core has good REP implementation - it is a win from rather small blocks (16 >> > bytes if I recall) and it does not need alignment. >> > Library version starts to be interesting with caching hints, but I think >> > till 80KB >> > it is still not a win for my setup (glibc-2.15) >> >> A simple test shows that -mstringop-strategy=libcall always beats >> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size >> smaller than 8 where the rep_8byte strategy simply bypasses REP movs. >> Can you share your memtest ? > > I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a > libcall. The PLT call overhead is simply too high. > The x86 string/memory functions in the current glibc are extremely fast and tuned for Core 2/Core i7. GCC is having a very hard time to beat them with inlining: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052 -- H.J.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> Try the following one: 1) -minline-all-stringops
> -mstringop-strategy=rep_8byte -O2 vs 2) -mstringop-strategy=libcall
> -O2.
>
> David
>
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #ifndef LEN
> #define LEN 16
> #endif
>
> void copy(char* s1, char* s2,int len) __attribute__((noinline));
> void copy(char* s1, char* s2,int len)
> {
>memcpy(s2,s1,len);
> }
I guess the catch here is that you force the copy to be noinline and thus
eliminate the benefits of the inlined sequence. With an inlined stringop
one saves regalloc and can often get rid of the alignment tests.
This is the script I use to tune the tables.
Honza
test()
{
rm -f a.out
cat <&1`
echo -n " "$TIME
echo $TIME $4 >>/tmp/accum
}
testrow()
{
echo -n "" >/tmp/accum
printf "block size %7i" $3
test "$2" "$3" "-mstringop-strategy=libcall" libcall
test "$2" "$3" "-mstringop-strategy=rep_byte -malign-stringops" rep1
test "$2" "$3" "-mstringop-strategy=rep_byte -mno-align-stringops" rep1noalign
test "$2" "$3" "-mstringop-strategy=rep_4byte -malign-stringops" rep4
test "$2" "$3" "-mstringop-strategy=rep_4byte -mno-align-stringops" rep4noalign
if [ "$mode" == 64 ]
then
test "$2" "$3" "-mstringop-strategy=rep_8byte -malign-stringops" rep8
test "$2" "$3" "-mstringop-strategy=rep_8byte -mno-align-stringops" rep8noalign
fi
test "$2" "$3" "-mstringop-strategy=loop -malign-stringops" loop
test "$2" "$3" "-mstringop-strategy=loop -mno-align-stringops" loopnoalign
test "$2" "$3" "-mstringop-strategy=unrolled_loop -malign-stringops" unrl
test "$2" "$3" "-mstringop-strategy=unrolled_loop -mno-align-stringops"
unrlnoalign
test "$2" "$3" "-mstringop-strategy=sse_loop -malign-stringops" sse
test "$2" "$3" "-mstringop-strategy=sse_loop -mno-align-stringops -msse2"
ssenoalign
test "$2" "$3" "-mstringop-strategy=byte_loop" byte
best=`cat /tmp/accum | sort | head -1`
test "$2" "$3" " -fprofile-generate" >/dev/null 2>&1
test "$2" "$3" " -fprofile-use"
test "$2" "$3" " -minline-stringops-dynamically"
echo " best: $best"
}
test_all_sizes()
{
if [ "$mode" == 64 ]
then
echo " libcall rep1 noalgrep4 noalgrep8 noalg
loop noalgunrl noalgsse noalgbyte profiled dynamic"
else
echo " libcall rep1 noalgrep4 noalgloop noalg
unrl noalgssenoalgbyte profiled dynamic"
fi
#for size in 1 2 3 4 6 8 10 12 14 16 24 32 48 64 128 256 512 1024 4096 8192 81920 819200 8192000
#for size in 8192000 819200 81920 8192 4096 2048 1024 512 256 128 64 48 32 24 16 14 12 10 8 6 5 4 3 2 1
for size in 8192000 819200 81920 20480 8192 4096 2048 1024 512 256 128 64 48 32 24 16 14 12 10 8 6 4 1
#for size in 128 256 1024 4096 8192 81920 819200
do
testrow "$1" "$2" $size
done
}
mode=$1
shift
export memsize=$1
shift
cmdline=$*
if [ "$mode" != 32 ]
then
if [ "$mode" != 64 ]
then
echo "Usage:"
echo "test_stringop mode size cmdline"
echo "mode is either 32 or 64"
echo "size is amount of memory copied in each test. Should be chosed small
enough so runtime is less than minute for each test and sorting works"
echo "Example: test_stringop 32 64000 ./xgcc -B ./ -march=pentium3"
exit
fi
fi
echo "memcpy mode:$mode size:$memsize"
export STRINGOP=""
type=char
test_all_sizes $mode "$cmdline -m$mode"
echo "Aligned"
type=long
test_all_sizes $mode "$cmdline -m$mode"
echo "memset"
type=char
export STRINGOP="-Dtest_memset=1"
test_all_sizes $mode "$cmdline -m$mode"
echo "Aligned"
type=long
test_all_sizes $mode "$cmdline -m$mode"
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Thu, Dec 13, 2012 at 7:21 AM, Jakub Jelinek wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
>> >> > libcall is not faster up to 8KB compared to the rep sequence, which
>> >> > is better for regalloc/code cache than a fully blown function call.
>> >>
>> >> Be careful with this. My recollection is that the REP sequence is
>> >> good for any size -- for smaller sizes, the REP initial set-up cost
>> >> is too high (10s of cycles), while for large copies it is less
>> >> efficient compared with the library version.
>> >
>> > Well this is based on the data from the memtest script.
>> > Core has a good REP implementation - it is a win from rather small
>> > blocks (16 bytes if I recall) and it does not need alignment.
>> > The library version starts to be interesting with caching hints, but I
>> > think till 80KB it is still not a win for my setup (glibc-2.15).
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for sizes
>> smaller than 8, where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest?
>
> I can't believe that, say, a 16 byte or 32 byte memcpy can ever be faster
> using a libcall. The PLT call overhead is simply too high.

I believe the PLT call overhead may be effectively zero if the benchmark
is just a loop around a memcpy. Thus, for measuring the PLT overhead, I
call the benchmark broken ;)

Richard.

> Jakub
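A minimal sketch of the kind of benchmark being criticized (names are
illustrative): the first iteration pays the lazy PLT resolution once, and
every later call goes through the already-patched PLT slot, so the loop
hides exactly the overhead Jakub is worried about:

#include <string.h>

static char src[64], dst[64];

int main (void)
{
  /* ~100M warmed-up calls; only call #1 resolves memcpy through the
     dynamic linker, so the PLT cost is amortized to near zero.  */
  for (long i = 0; i < 100000000L; i++)
    memcpy (dst, src, 16);
  return 0;
}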
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Try the following one. 1) -minline-all-stringops
-mstringop-strategy=rep_8byte -O2 vs 1) -mstringop_strategy=libcall
-O2.
David
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef LEN
#define LEN 16
#endif
void copy(char* s1, char* s2,int len) __attribute__((noinline));
void copy(char* s1, char* s2,int len)
{
memcpy(s2,s1,len);
}
int main() {
char* s1 = (char*) malloc(LEN +10);
char* s2 = (char*) malloc(LEN +10);
int i = 0;
for (i = 0; i < 10; i++)
{
copy(s1+1,s2+3,LEN);
}
}
On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
>> >> > libcall is not faster up to 8KB compared to the rep sequence, which
>> >> > is better for regalloc/code cache than a fully blown function call.
>> >>
>> >> Be careful with this. My recollection is that REP sequence is good for
>> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> (10s of cycles), while for large size copy, it is less efficient
>> >> compared with library version.
>> >
>> > Well this is based on the data from the memtest script.
>> > Core has good REP implementation - it is a win from rather small blocks (16
>> > bytes if I recall) and it does not need alignment.
>> > Library version starts to be interesting with caching hints, but I think
>> > till 80KB
>> > it is still not a win for my setup (glibc-2.15)
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest?
>
> I can't believe that, say, a 16 byte or 32 byte memcpy can ever be
> faster using a libcall. The PLT call overhead is simply too high.
>
> Jakub
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
> >> > libcall is not faster up to 8KB compared to the rep sequence, which
> >> > is better for regalloc/code cache than a fully blown function call.
> >>
> >> Be careful with this. My recollection is that the REP sequence is good
> >> for any size -- for smaller sizes, the REP initial set-up cost is too
> >> high (10s of cycles), while for large copies it is less efficient
> >> compared with the library version.
> >
> > Well this is based on the data from the memtest script.
> > Core has a good REP implementation - it is a win from rather small
> > blocks (16 bytes if I recall) and it does not need alignment.
> > The library version starts to be interesting with caching hints, but I
> > think till 80KB it is still not a win for my setup (glibc-2.15).
>
> A simple test shows that -mstringop-strategy=libcall always beats
> -mstringop-strategy=rep_8byte (on core2 and corei7) except for sizes
> smaller than 8, where the rep_8byte strategy simply bypasses REP movs.
> Can you share your memtest?

I can't believe that, say, a 16 byte or 32 byte memcpy can ever be faster
using a libcall. The PLT call overhead is simply too high.

Jakub
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
>> > libcall is not faster up to 8KB compared to the rep sequence, which is
>> > better for regalloc/code cache than a fully blown function call.
>>
>> Be careful with this. My recollection is that the REP sequence is good
>> for any size -- for smaller sizes, the REP initial set-up cost is too
>> high (10s of cycles), while for large copies it is less efficient
>> compared with the library version.
>
> Well this is based on the data from the memtest script.
> Core has a good REP implementation - it is a win from rather small blocks
> (16 bytes if I recall) and it does not need alignment.
> The library version starts to be interesting with caching hints, but I
> think till 80KB it is still not a win for my setup (glibc-2.15).

A simple test shows that -mstringop-strategy=libcall always beats
-mstringop-strategy=rep_8byte (on core2 and corei7) except for sizes
smaller than 8, where the rep_8byte strategy simply bypasses REP movs.
Can you share your memtest?

thanks,

David

>> >> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix
>> >> >       stall on 16-bit immediate moves into memory on Core2 and
>> >> >       Corei7. */
>> >> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>> >> >    m_K6,
>> >> >
>> >> >    /* X86_TUNE_USE_CLTD */
>> >> > -  ~(m_PENT | m_ATOM | m_K6),
>> >> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>>
>> My change was to enable CLTD for generic. Is your change intended to
>> revert that?
>
> No, it is a merge conflict, sorry. I will update it in my tree.
>
>> > Skipping inc/dec is to avoid the partial flag stall, which happens on
>> > P4 only.
>>
>> K8 and K10 partition the flags into groups. References to flags in the
>> same group can still cause the stall -- not sure how that can be
>> handled.
>
> I believe the stalls happen only in quite special cases where a compare
> instruction combines flags from multiple instructions. GCC doesn't
> generate this type of code, so we should be safe.
>
> Honza
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> > libcall is not faster up to 8KB compared to the rep sequence, which is
> > better for regalloc/code cache than a fully blown function call.
>
> Be careful with this. My recollection is that the REP sequence is good
> for any size -- for smaller sizes, the REP initial set-up cost is too
> high (10s of cycles), while for large copies it is less efficient
> compared with the library version.

Well this is based on the data from the memtest script.
Core has a good REP implementation - it is a win from rather small blocks
(16 bytes if I recall) and it does not need alignment.
The library version starts to be interesting with caching hints, but I
think till 80KB it is still not a win for my setup (glibc-2.15).

> >> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix
> >> >       stall on 16-bit immediate moves into memory on Core2 and
> >> >       Corei7. */
> >> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
> >> >    m_K6,
> >> >
> >> >    /* X86_TUNE_USE_CLTD */
> >> > -  ~(m_PENT | m_ATOM | m_K6),
> >> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
> My change was to enable CLTD for generic. Is your change intended to
> revert that?

No, it is a merge conflict, sorry. I will update it in my tree.

> > Skipping inc/dec is to avoid the partial flag stall, which happens on
> > P4 only.
>
> K8 and K10 partition the flags into groups. References to flags in the
> same group can still cause the stall -- not sure how that can be handled.

I believe the stalls happen only in quite special cases where a compare
instruction combines flags from multiple instructions. GCC doesn't
generate this type of code, so we should be safe.

Honza
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 4:16 PM, Xinliang David Li wrote:
> On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka wrote:
>> Concerning 1 push per cycle, I think it is the same as K7 hardware did,
>> so the move prologue should be a win.
>>> > Index: config/i386/i386.c
>>> > ===
>>> > --- config/i386/i386.c (revision 194452)
>>> > +++ config/i386/i386.c (working copy)
>>> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>> >COSTS_N_INSNS (8), /* cost of FABS instruction. */
>>> >COSTS_N_INSNS (8), /* cost of FCHS instruction. */
>>> >COSTS_N_INSNS (40), /* cost of FSQRT instruction. */
>>> > - {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>>> > - {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>>> > + {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>>> > + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>> >    {-1, libcall, false}}}},
>>> >{{libcall, {{6, loop_1_byte, true},
>>> >{24, loop, true},
>>> >{8192, rep_prefix_4_byte, true},
>>> >{-1, libcall, false}}},
>>> > - {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>>> > + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>
>> libcall is not faster up to 8KB compared to the rep sequence, which is
>> better for regalloc/code cache than a fully blown function call.
>
> Be careful with this. My recollection is that REP sequence is good for
s/good/not good/
David
> any size -- for smaller size, the REP initial set up cost is too high
> (10s of cycles), while for large size copy, it is less efficient
> compared with library version.
>
>
>>> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>>> >m_PPRO,
>>> >
>>> >/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>>> > - m_CORE2I7 | m_GENERIC,
>>> > + m_GENERIC | m_CORE2,
>>
>> This disables shifts that store just some flags. According to Agner's
>> manual, i7 handles this well.
>>
>
> ok.
>
>> Partial flags stall
>> The Sandy Bridge uses the method of an extra µop to join partial
>> registers not only for general purpose registers but also for the flags
>> register, unlike previous processors which used this method only for
>> general purpose registers. This occurs when a write to a part of the
>> flags register is followed by a read from a larger part of the flags
>> register. The partial flags stall of previous processors (See page 75)
>> is therefore replaced by an extra µop. The Sandy Bridge also generates
>> an extra µop when reading the flags after a rotate instruction.
>>
>> This is cheaper than the 7 cycle delay on Core that this flag is trying to avoid.
>
> ok.
>
>>> >
>>> >/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>> > * on 16-bit immediate moves into memory on Core2 and Corei7. */
>>> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>>> >m_K6,
>>> >
>>> >/* X86_TUNE_USE_CLTD */
>>> > - ~(m_PENT | m_ATOM | m_K6),
>>> > + ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
> My change was to enable CLTD for generic. Is your change intended to
> revert that?
>
>>
>> None of the CPUs that generic cares about are !USE_CLTD now after your change.
>>> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>>> >m_ATHLON_K8,
>>> >
>>> >/* X86_TUNE_SSE_TYPELESS_STORES */
>>> > - m_AMD_MULTIPLE,
>>> > + m_AMD_MULTIPLE | m_CORE2I7, /**/
>>
>> Hmm, I cannot seem to find this in the manual now, but I believe that
>> stores are also typeless, so a movaps store is preferred over a movapd
>> store because it is shorter. If not, this change should produce a lot of
>> slowdowns.
>>> >
>>> >/* X86_TUNE_SSE_LOAD0_BY_PXOR */
>>> > - m_PPRO | m_P4_NOCONA,
>>> > + m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
>>
>> Agner:
>> A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX.
>> The Core2 and Nehalem processors recognize that certain instructions are
>> independent of the prior value of the register if the source and
>> destination registers are the same.
>>
>> This applies to all of the following instructions: XOR, SUB, PXOR, XORPS,
>> XORPD, and all variants of PSUBxxx and PCMPxxx except PCMPEQQ.
>>> >
>>> >/* X86_TUNE_MEMORY_MISMATCH_STALL */
>>> >m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict
>>> > more
>>> > than 4 branch instructions in the 16 byte window. */
>>> > - m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > + m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
>> prediction.
>> Intel never had a similar design, so this flag is pointless.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka wrote:
> Concerning 1 push per cycle, I think it is the same as K7 hardware did,
> so the move prologue should be a win.
>> > Index: config/i386/i386.c
>> > ===
>> > --- config/i386/i386.c (revision 194452)
>> > +++ config/i386/i386.c (working copy)
>> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>> >COSTS_N_INSNS (8), /* cost of FABS instruction. */
>> >COSTS_N_INSNS (8), /* cost of FCHS instruction. */
>> >COSTS_N_INSNS (40), /* cost of FSQRT instruction. */
>> > - {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> > - {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> > + {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> > + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>> >    {-1, libcall, false}}}},
>> >{{libcall, {{6, loop_1_byte, true},
>> >{24, loop, true},
>> >{8192, rep_prefix_4_byte, true},
>> >{-1, libcall, false}}},
>> > - {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> > + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>
> libcall is not faster up to 8KB compared to the rep sequence, which is
> better for regalloc/code cache than a fully blown function call.
Be careful with this. My recollection is that REP sequence is good for
any size -- for smaller size, the REP initial set up cost is too high
(10s of cycles), while for large size copy, it is less efficient
compared with library version.
>> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>> >m_PPRO,
>> >
>> >/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> > - m_CORE2I7 | m_GENERIC,
>> > + m_GENERIC | m_CORE2,
>
> This disables shifts that store just some flags. According to Agner's
> manual, i7 handles this well.
>
ok.
> Partial flags stall
> The Sandy Bridge uses the method of an extra µop to join partial
> registers not only for general purpose registers but also for the flags
> register, unlike previous processors which used this method only for
> general purpose registers. This occurs when a write to a part of the
> flags register is followed by a read from a larger part of the flags
> register. The partial flags stall of previous processors (See page 75)
> is therefore replaced by an extra µop. The Sandy Bridge also generates
> an extra µop when reading the flags after a rotate instruction.
>
> This is cheaper than the 7 cycle delay on Core that this flag is trying to avoid.
ok.
>> >
>> >/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>> > * on 16-bit immediate moves into memory on Core2 and Corei7. */
>> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>> >m_K6,
>> >
>> >/* X86_TUNE_USE_CLTD */
>> > - ~(m_PENT | m_ATOM | m_K6),
>> > + ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
My change was to enable CLTD for generic. Is your change intended to
revert that?
>
> None of the CPUs that generic cares about are !USE_CLTD now after your change.
>> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>> >m_ATHLON_K8,
>> >
>> >/* X86_TUNE_SSE_TYPELESS_STORES */
>> > - m_AMD_MULTIPLE,
>> > + m_AMD_MULTIPLE | m_CORE2I7, /**/
>
> Hmm, I cannot seem to find this in the manual now, but I believe that
> stores are also typeless, so a movaps store is preferred over a movapd
> store because it is shorter. If not, this change should produce a lot of
> slowdowns.
>> >
>> >/* X86_TUNE_SSE_LOAD0_BY_PXOR */
>> > - m_PPRO | m_P4_NOCONA,
>> > + m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
>
> Agner:
> A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX.
> The Core2 and Nehalem processors recognize that certain instructions are
> independent of the prior value of the register if the source and
> destination registers are the same.
>
> This applies to all of the following instructions: XOR, SUB, PXOR, XORPS,
> XORPD, and all variants of PSUBxxx and PCMPxxx except PCMPEQQ.
>> >
>> >/* X86_TUNE_MEMORY_MISMATCH_STALL */
>> >m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>> > than 4 branch instructions in the 16 byte window. */
>> > - m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > + m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
> prediction.
> Intel never had a similar design, so this flag is pointless.
I noticed that too, but Andi has a better answer to it.
>
> We apparently ought to disable it for K10, at least per Agner's manual.
>> >
>> >/* X86_TUNE_SCHEDULE */
>> >
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Andi Kleen writes:
>
>>> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict
>>> > more
>>> > than 4 branch instructions in the 16 byte window. */
>>> > - m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > + m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
>> prediction.
>> Intel never had a similar design, so this flag is pointless.
>
> Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
> 16 byte window.
Actually it's four per 32 bytes, sorry.
Here's an old patch I had lying around to optimize for that.
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1b871be..9b57316 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2713,6 +2713,7 @@ ix86_target_string (HOST_WIDE_INT isa, int flags, const char *arch,
{ "-mavx256-split-unaligned-load", MASK_AVX256_SPLIT_UNALIGNED_LOAD},
{ "-mavx256-split-unaligned-store",
MASK_AVX256_SPLIT_UNALIGNED_STORE},
{ "-mprefer-avx128", MASK_PREFER_AVX128},
+{ "-mjump-pad-32bytes",MASK_JUMP_PAD_32BYTES},
};
const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
@@ -32182,6 +32183,7 @@ ix86_avoid_jump_mispredicts (void)
rtx insn, start = get_insns ();
int nbytes = 0, njumps = 0;
int isjump = 0;
+ int jump_pad_window_size = TARGET_JUMP_PAD_32BYTES ? 32 : 16;
/* Look for all minimal intervals of instructions containing 4 jumps.
The intervals are bounded by START and INSN. NBYTES is the total
@@ -32202,8 +32204,8 @@ ix86_avoid_jump_mispredicts (void)
int align = label_to_alignment (insn);
int max_skip = label_to_max_skip (insn);
- if (max_skip > 15)
- max_skip = 15;
+ if (max_skip > jump_pad_window_size - 1)
+ max_skip = jump_pad_window_size - 1;
/* If align > 3, only up to 16 - max_skip - 1 bytes can be
already in the current 16 byte page, because otherwise
ASM_OUTPUT_MAX_SKIP_ALIGN could skip max_skip or fewer
@@ -32216,7 +32218,7 @@ ix86_avoid_jump_mispredicts (void)
INSN_UID (insn), max_skip);
if (max_skip)
{
- while (nbytes + max_skip >= 16)
+ while (nbytes + max_skip >= jump_pad_window_size)
{
start = NEXT_INSN (start);
if ((JUMP_P (start)
@@ -32262,10 +32264,11 @@ ix86_avoid_jump_mispredicts (void)
fprintf (dump_file, "Interval %i to %i has %i bytes\n",
INSN_UID (start), INSN_UID (insn), nbytes);
- if (njumps == 3 && isjump && nbytes < 16)
+ if (njumps == 3 && isjump && nbytes < jump_pad_window_size)
{
- int padsize = 15 - nbytes + min_insn_size (insn);
-
+ int padsize = jump_pad_window_size - 1 - nbytes +
+ min_insn_size (insn);
+
if (dump_file)
fprintf (dump_file, "Padding insn %i by %i bytes!\n",
INSN_UID (insn), padsize);
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 6c516e7..b38d163 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -223,6 +223,10 @@ mintel-syntax
Target Undocumented Alias(masm=, intel, att) Warn(%<-mintel-syntax%> and
%<-mno-intel-syntax%> are deprecated; use %<-masm=intel%> and %<-masm=att%>
instead)
;; Deprecated
+mjump-pad-32bytes
+Target RejectNegative Mask(JUMP_PAD_32BYTES) Save
+Avoid more than 4 jumps in each 32byte code window.
+
mms-bitfields
Target Report Mask(MS_BITFIELD_LAYOUT) Save
Use native (MS) bitfield layout
--
[email protected] -- Speaking for myself only
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> Jan Hubicka writes:
> >
> > libcall is not faster up to 8KB compared to the rep sequence, which is
> > better for regalloc/code cache than a fully blown function call.
>
> I noticed btw that some of the generated string instructions are slower
> than just calling the C library.
>
> rep scasb etc. is rarely a win over an optimized library function;
> it's not very optimized. Perhaps those patterns should just be disabled.
> The way to optimize that on modern CPUs is to use PCMP*STR*, but that's
> quite a bit more complicated and has some constraints.

This is only about memset/memcpy expanding. The other sequences are quite
lame indeed...

> >> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to
> >> >       predict more than 4 branch instructions in the 16 byte
> >> >       window. */
> >> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> >> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> >
> > This is a special pass to handle limitations of AMD's K7/K8/K10 branch
> > prediction. Intel never had a similar design, so this flag is pointless.
>
> Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
> 16 byte window. If you exceed that, it falls back to running
> the full decoder from the normal icache.
>
> I don't have solid data, but it may be a win for frontend-limited
> code (otherwise possibly more in power than performance).
>
> I would revisit that for Sandy Bridge.

We are not particularly good at avoiding the branches - basically the code
inserts alignment whenever it thinks 4 consecutive branches fit in the
window. I can make a patch to change this to 3 and we can see if it helps
at all.

> -Andi
> --
> [email protected] -- Speaking for myself only
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Jan Hubicka writes:
>
> libcall is not faster up to 8KB compared to the rep sequence, which is
> better for regalloc/code cache than a fully blown function call.

I noticed btw that some of the generated string instructions are slower
than just calling the C library.

rep scasb etc. is rarely a win over an optimized library function; it's
not very optimized. Perhaps those patterns should just be disabled.
The way to optimize that on modern CPUs is to use PCMP*STR*, but that's
quite a bit more complicated and has some constraints.

>>  >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict
>>  >       more than 4 branch instructions in the 16 byte window. */
>>  > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>  > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
> prediction. Intel never had a similar design, so this flag is pointless.

Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
16 byte window. If you exceed that, it falls back to running
the full decoder from the normal icache.

I don't have solid data, but it may be a win for frontend-limited
code (otherwise possibly more in power than performance).

I would revisit that for Sandy Bridge.

-Andi
--
[email protected] -- Speaking for myself only
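The PCMP*STR* family mentioned here is SSE4.2; for reference, a hand-written
sketch of the simpler SSE2 cousin of the same idea (pcmpeqb + pmovmskb +
bit scan) - an illustration, not glibc's implementation. The aligned
16-byte loads keep it from faulting across page boundaries:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t
strlen_sse2 (const char *s)
{
  const __m128i zero = _mm_setzero_si128 ();
  /* Round down to a 16-byte boundary, then mask off the bytes before S
     in the first chunk.  */
  const char *p = (const char *) ((uintptr_t) s & ~(uintptr_t) 15);
  unsigned mask = (unsigned) _mm_movemask_epi8
    (_mm_cmpeq_epi8 (_mm_load_si128 ((const __m128i *) p), zero));
  mask &= ~0u << (s - p);
  while (mask == 0)
    {
      p += 16;
      mask = (unsigned) _mm_movemask_epi8
        (_mm_cmpeq_epi8 (_mm_load_si128 ((const __m128i *) p), zero));
    }
  /* Bit index of the first zero byte gives the length.  */
  return (size_t) (p - s) + (size_t) __builtin_ctz (mask);
}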
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Concerning 1 push per cycle, I think it is the same as K7 hardware did, so
the move prologue should be a win.
> > Index: config/i386/i386.c
> > ===
> > --- config/i386/i386.c (revision 194452)
> > +++ config/i386/i386.c (working copy)
> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
> >COSTS_N_INSNS (8), /* cost of FABS instruction. */
> >COSTS_N_INSNS (8), /* cost of FCHS instruction. */
> >COSTS_N_INSNS (40), /* cost of FSQRT instruction. */
> > - {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > - {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> > + {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
> >    {-1, libcall, false}}}},
> >{{libcall, {{6, loop_1_byte, true},
> >{24, loop, true},
> >{8192, rep_prefix_4_byte, true},
> >{-1, libcall, false}}},
> > - {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> > + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
libcall is not faster up to 8KB compared to the rep sequence, which is
better for regalloc/code cache than a fully blown function call.
> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
> >m_PPRO,
> >
> >/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> > - m_CORE2I7 | m_GENERIC,
> > + m_GENERIC | m_CORE2,
This disables shifts that store just some flags. According to Agner's
manual, i7 handles this well.
Partial flags stall
The Sandy Bridge uses the method of an extra µop to join partial registers
not only for general purpose registers but also for the flags register,
unlike previous processors which used this method only for general purpose
registers. This occurs when a write to a part of the flags register is
followed by a read from a larger part of the flags register. The partial
flags stall of previous processors (See page 75) is therefore replaced by
an extra µop. The Sandy Bridge also generates an extra µop when reading
the flags after a rotate instruction.
This is cheaper than the 7 cycle delay on Core that this flag is trying to avoid.
> >
> >/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> > * on 16-bit immediate moves into memory on Core2 and Corei7. */
> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
> >m_K6,
> >
> >/* X86_TUNE_USE_CLTD */
> > - ~(m_PENT | m_ATOM | m_K6),
> > + ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
None of the CPUs that generic cares about are !USE_CLTD now after your change.
> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
> >m_ATHLON_K8,
> >
> >/* X86_TUNE_SSE_TYPELESS_STORES */
> > - m_AMD_MULTIPLE,
> > + m_AMD_MULTIPLE | m_CORE2I7, /**/
Hmm, I cannot seem to find this in the manual now, but I believe that
stores are also typeless, so a movaps store is preferred over a movapd
store because it is shorter. If not, this change should produce a lot of
slowdowns.
> >
> >/* X86_TUNE_SSE_LOAD0_BY_PXOR */
> > - m_PPRO | m_P4_NOCONA,
> > + m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
Agner:
A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX.
The Core2 and Nehalem processors recognize that certain instructions are
independent of the prior value of the register if the source and
destination registers are the same.

This applies to all of the following instructions: XOR, SUB, PXOR, XORPS,
XORPD, and all variants of PSUBxxx and PCMPxxx except PCMPEQQ.
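A minimal hand-written sketch of what X86_TUNE_SSE_LOAD0_BY_PXOR changes
(the assembly and label name are illustrative, not compiler output):

double zero (void) { return 0.0; }

/* With the flag set, the zero can be materialized with the
   dependency-breaking idiom
       pxor  %xmm0, %xmm0
   instead of a constant-pool load such as
       movsd .LC0(%rip), %xmm0  */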
> >
> >/* X86_TUNE_MEMORY_MISMATCH_STALL */
> >m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
> >
> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
> > than 4 branch instructions in the 16 byte window. */
> > - m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> > + m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
This is a special pass to handle limitations of AMD's K7/K8/K10 branch
prediction.
Intel never had a similar design, so this flag is pointless.
We apparently ought to disable it for K10, at least per Agner's manual.
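A sketch of the code shape the limit guards against (illustrative only):

/* A dense chain of compare-and-branch puts one branch roughly every
   5 bytes, so more than four can land in a single aligned 16-byte
   window; when the flag is set GCC pads the code to avoid that.  */
int classify (int x)
{
  if (x == 1) return 10;
  if (x == 2) return 20;
  if (x == 3) return 30;
  if (x == 4) return 40;
  return 0;
}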
> >
> >/* X86_TUNE_SCHEDULE */
> >m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE |
> > m_GENERIC,
> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
> >m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> >
> >/* X86_TUNE_USE_INCDEC */
> > - ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
> > + ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
Skipping inc/dec avoids the partial flags stall, which happens on P4 only.
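For illustration, the difference in flags behavior (hypothetical codegen
in the comment):

/* incl/decl write OF/SF/ZF/AF/PF but preserve CF, so on P4 a later
   CF consumer sees a partial flags write and stalls; addl/subl $1
   write all the arithmetic flags and avoid that.  */
unsigned count_down (unsigned n)
{
  unsigned steps = 0;
  while (n--)      /* decl %eax  vs.  subl $1, %eax in the loop */
    steps++;
  return steps;
}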
> >
> >/* X86_TUNE_PAD_RETURNS */
> > - m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
> > + m_AMD_MULTIPLE | m_GENERIC,
Again, this deals specifically with AMD K7/K8/K10 branch prediction.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Honza, can you explain each change and point to the reference?
thanks,
David
On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka wrote:
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency. In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them. This is
>> also what ICC does.
>>
>> The following patch fixed the problem. It passes bootstrap/regression
>> test. OK to install?
>>
>> thanks,
>>
>> David
>>
>> Index: config/i386/i386.c
>> ===
>> --- config/i386/i386.c (revision 194324)
>> +++ config/i386/i386.c (working copy)
>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>/* X86_TUNE_PROLOGUE_USING_MOVE */
>> - m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> + m_PPRO | m_ATHLON_K8,
>>
>>/* X86_TUNE_EPILOGUE_USING_MOVE */
>> - m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> + m_PPRO | m_ATHLON_K8,
>
> Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
> is gone from generic (in fact I had similar patch pending).
> Are you sure about Atom having stack engine, too?
>
> Related thing is accumulate_outgoing_args. Igor is testing it on Core and I
> will
> give it a try on K10.
>
> Honza
>
> I am attaching the changes for core costs I made if someone is interested in
> testing them. If we can declare P4/PPro and maybe K8 chips obsolete for
> generic, there is room for improvement in generic, too. Like using inc/dec
> again.
>
> Honza
>
> Index: config/i386/i386.c
> ===
> --- config/i386/i386.c (revision 194452)
> +++ config/i386/i386.c (working copy)
> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>COSTS_N_INSNS (8), /* cost of FABS instruction. */
>COSTS_N_INSNS (8), /* cost of FCHS instruction. */
>COSTS_N_INSNS (40), /* cost of FSQRT instruction. */
> - {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> - {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> + {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>    {-1, libcall, false}}}},
>{{libcall, {{6, loop_1_byte, true},
>{24, loop, true},
>{8192, rep_prefix_4_byte, true},
>{-1, libcall, false}}},
> - {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>    {-1, libcall, false}}}},
>1, /* scalar_stmt_cost. */
>1, /* scalar load_cost. */
> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>m_PPRO,
>
>/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> - m_CORE2I7 | m_GENERIC,
> + m_GENERIC | m_CORE2,
>
>/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> * on 16-bit immediate moves into memory on Core2 and Corei7. */
> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>m_K6,
>
>/* X86_TUNE_USE_CLTD */
> - ~(m_PENT | m_ATOM | m_K6),
> + ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
>/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx. */
>m_PENT4,
> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>m_COREI7 | m_BDVER,
>
>/* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
> - m_BDVER ,
> + m_BDVER,
>
>/* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and
> dependencies
> are resolved on SSE register parts instead of whole registers, so we may
> @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>m_ATHLON_K8,
>
>/* X86_TUNE_SSE_TYPELESS_STORES */
> - m_AMD_MULTIPLE,
> + m_AMD_MULTIPLE | m_CORE2I7, /**/
>
>/* X86_TUNE_SSE_LOAD0_BY_PXOR */
> - m_PPRO | m_P4_NOCONA,
> + m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
>
>/* X86_TUNE_MEMORY_MISMATCH_STALL */
>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>
>/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
> than 4 branch instructions in the 16 byte window. */
> - m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> + m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
>/*
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka wrote:
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency. In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them. This is
>> also what ICC does.
>>
>> The following patch fixed the problem. It passes bootstrap/regression
>> test. OK to install?
>>
>> thanks,
>>
>> David
>>
>> Index: config/i386/i386.c
>> ===
>> --- config/i386/i386.c (revision 194324)
>> +++ config/i386/i386.c (working copy)
>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>/* X86_TUNE_PROLOGUE_USING_MOVE */
>> - m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> + m_PPRO | m_ATHLON_K8,
>>
>>/* X86_TUNE_EPILOGUE_USING_MOVE */
>> - m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> + m_PPRO | m_ATHLON_K8,
>
> Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
> is gone from generic (in fact I had similar patch pending).
> Are you sure about Atom having stack engine, too?
>
Good question. The instruction latency table
(http://www.agner.org/optimize/instruction_tables.pdf) shows that for
Atom, push r has 1 uop and 1 cycle latency. However, the instruction is
not pairable, which will affect ILP. The microarchitecture guide at
http://www.agner.org/optimize/microarchitecture.pdf does not mention
that Atom has a stack engine either.
I will help collect some performance data on Atom.
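For reference, a sketch of the two prologue/epilogue styles being measured
(hypothetical codegen, not from an actual compile):

/* A function that must preserve %rbx and %r12:

   mov-based:                        push-based:
       subq  $16, %rsp                   pushq %rbx
       movq  %rbx, 8(%rsp)               pushq %r12
       movq  %r12, (%rsp)                ...body...
       ...body...                        popq  %r12
       movq  (%rsp), %r12                popq  %rbx
       movq  8(%rsp), %rbx               ret
       addq  $16, %rsp
       ret

   The push/pop form is several bytes shorter per register; with a
   stack engine each push/pop is a single uop with no explicit SP
   arithmetic, which is what the tuning change relies on.  */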
thanks,
David
> Related thing is accumulate_outgoing_args. Igor is testing it on Core and I
> will
> give it a try on K10.
>
> Honza
>
> I am attaching the changes for core costs I made if someone is interested in
> testing them. If we can declare P4/PPro and maybe K8 chips obsolete for
> generic, there is room for improvement in generic, too. Like using inc/dec
> again.
>
> Honza
>
> Index: config/i386/i386.c
> ===
> --- config/i386/i386.c (revision 194452)
> +++ config/i386/i386.c (working copy)
> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>COSTS_N_INSNS (8), /* cost of FABS instruction. */
>COSTS_N_INSNS (8), /* cost of FCHS instruction. */
>COSTS_N_INSNS (40), /* cost of FSQRT instruction. */
> - {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> - {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> + {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>    {-1, libcall, false}}}},
>{{libcall, {{6, loop_1_byte, true},
>{24, loop, true},
>{8192, rep_prefix_4_byte, true},
>{-1, libcall, false}}},
> - {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> + {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>    {-1, libcall, false}}}},
>1, /* scalar_stmt_cost. */
>1, /* scalar load_cost. */
> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>m_PPRO,
>
>/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> - m_CORE2I7 | m_GENERIC,
> + m_GENERIC | m_CORE2,
>
>/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> * on 16-bit immediate moves into memory on Core2 and Corei7. */
> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>m_K6,
>
>/* X86_TUNE_USE_CLTD */
> - ~(m_PENT | m_ATOM | m_K6),
> + ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
>/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx. */
>m_PENT4,
> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>m_COREI7 | m_BDVER,
>
>/* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
> - m_BDVER ,
> + m_BDVER,
>
>/* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and
> dependencies
> are resolved on SSE register parts instead of whole registers, so we may
> @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>m_ATHLON_K8,
>
>/* X86_TUNE_SSE_TYPELESS_STORES */
> - m_AMD_MULTIPLE,
> + m_AMD_MULTIPLE | m_CORE2I7, /**/
>
>/* X86_TUNE_SSE_LOAD0_BY_PXOR */
> - m_PPRO | m_P4_NOCONA,
> + m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
>
>/* X86_TUNE_MEMORY_MISMATCH_STALL */
>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
> SP adjustment instead of a sequence of pushes/pops. The preference to
> the MOVs are good for old CPU micro-architectures (before pentium-4,
> K10), because it breaks the data dependency. In modern
> micro-architecture, push/pop is implemented using a mechanism called
> stack engine. The data dependency is removed by the hardware, and
> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
> smaller. There is no longer the need to avoid using them. This is
> also what ICC does.
>
> The following patch fixed the problem. It passes bootstrap/regression
> test. OK to install?
>
> thanks,
>
> David
>
> Index: config/i386/i386.c
> ===
> --- config/i386/i386.c (revision 194324)
> +++ config/i386/i386.c (working copy)
> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
>/* X86_TUNE_PROLOGUE_USING_MOVE */
> - m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> + m_PPRO | m_ATHLON_K8,
>
>/* X86_TUNE_EPILOGUE_USING_MOVE */
> - m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> + m_PPRO | m_ATHLON_K8,
Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
is gone from generic (in fact I had similar patch pending).
Are you sure about Atom having stack engine, too?
Related thing is accumulate_outgoing_args. Igor is testing it on Core and I will
give it a try on K10.
Honza
I am attaching the changes for core costs I made if someone is interested in
testing them. If we can declare P4/PPro and maybe K8 chips obsolete for
generic, there is room for improvement in generic, too. Like using inc/dec
again.
Honza
Index: config/i386/i386.c
===
--- config/i386/i386.c (revision 194452)
+++ config/i386/i386.c (working copy)
@@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
COSTS_N_INSNS (8), /* cost of FABS instruction. */
COSTS_N_INSNS (8), /* cost of FCHS instruction. */
COSTS_N_INSNS (40), /* cost of FSQRT instruction. */
- {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
- {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
+ {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
+ {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
   {-1, libcall, false}}}},
{{libcall, {{6, loop_1_byte, true},
{24, loop, true},
{8192, rep_prefix_4_byte, true},
{-1, libcall, false}}},
- {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
+ {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
   {-1, libcall, false}}}},
1, /* scalar_stmt_cost. */
1, /* scalar load_cost. */
@@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
m_PPRO,
/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
- m_CORE2I7 | m_GENERIC,
+ m_GENERIC | m_CORE2,
/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
* on 16-bit immediate moves into memory on Core2 and Corei7. */
@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
m_K6,
/* X86_TUNE_USE_CLTD */
- ~(m_PENT | m_ATOM | m_K6),
+ ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx. */
m_PENT4,
@@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
m_COREI7 | m_BDVER,
/* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
- m_BDVER ,
+ m_BDVER,
/* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
are resolved on SSE register parts instead of whole registers, so we may
@@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
m_ATHLON_K8,
/* X86_TUNE_SSE_TYPELESS_STORES */
- m_AMD_MULTIPLE,
+ m_AMD_MULTIPLE | m_CORE2I7, /**/
/* X86_TUNE_SSE_LOAD0_BY_PXOR */
- m_PPRO | m_P4_NOCONA,
+ m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
/* X86_TUNE_MEMORY_MISMATCH_STALL */
m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
@@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
than 4 branch instructions in the 16 byte window. */
- m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
+ m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
/* X86_TUNE_SCHEDULE */
m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE |
m_GENERIC,
@@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
/* X86_TUNE_USE_INCDEC */
- ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
+ ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Tue, Dec 11, 2012 at 11:53 PM, Xinliang David Li wrote:
> The following is the O2 size data from SPEC2k. Note that with push/pop,
> it is always a net win (negative delta) in terms of total binary or
> total loadable section size.

Thanks for the data!

Richard.

> thanks,
>
> David
>
>                .text     .eh_frame  Total_binary
> vortex-move    440252    40796      584066
> vortex-push    415436    57452      575906
> delta          -5.6%     40.8%      -1.397%
>
> twolf-move     169324    10748      223521
> twolf-push     168876    11124      223449
> delta          -0.3%     3.5%       -0.032%
>
> gzip-move      30668     3652       374399
> gzip-push      30524     3740       374343
> delta          -0.5%     2.4%       -0.015%
>
> bzip2-move     22748     3196       111616
> bzip2-push     22636     3284       111592
> delta          -0.5%     2.8%       -0.022%
>
> vpr-move       104684    9380       147378
> vpr-push       104236    9788       147338
> delta          -0.4%     4.3%       -0.027%
>
> mcf-move       8444      1244       26760
> mcf-push       8444      1244       26760
> delta          0.0%      0.0%       0.000%
>
> cc1-move       1093964   90772      1576994
> cc1-push       1078988   104068     1575314
> delta          -1.4%     14.6%      -0.107%
>
> crafty-move    130556    5508       1256037
> crafty-push    130236    5772       1255981
> delta          -0.2%     4.8%       -0.004%
>
> eon-move       333660    33220      516491
> eon-push       330140    35812      51
> delta          -1.1%     7.8%       -0.181%
>
> gap-move       404092    46732      1457735
> gap-push       396012    53180      1456103
> delta          -2.0%     13.8%      -0.112%
>
> perlbmk-move   456572    45324      618585
> perlbmk-push   449516    52340      618545
> delta          -1.5%     15.5%      -0.006%
>
> parser-move    81244     15788      334003
> parser-push    80684     16332      333987
> delta          -0.7%     3.4%       -0.005%
>
> On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li wrote:
>> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener wrote:
>>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump wrote:
>>>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li wrote:
>>>>> I have not measured the CFI size impact -- but conceivably it should
>>>>> be larger -- which is unfortunate.
>>>>
>>>> Code speed and size are preferable to optimizing dwarf size… :-) I'd
>>>> let dwarf 5 fix it!
>>>
>>> Well, different to debug info, CFI data has to be in memory to make
>>> unwinding work.
>>> These days most Linux distributions enable asynchronous unwind tables,
>>> so any size savings due to shorter push/pop epilogue/prologue sequences
>>> have to be offset by the increase in CFI data. I'm not sure there is
>>> really a speed difference between both variants (well, maybe due to
>>> better icache footprint of the push/pop variant).
>>
>> Yes, for large applications, this can be crucial to performance.
>>
>>> That said - I'd prefer to have more data on this before making the
>>> switch for the generic model. What was your original motivation? Just
>>> "theory" or was it a real case?
>>
>> 1) some of the very large internal apps I measured benefit from this
>> change (in terms of performance)
>> 2) both ICC and LLVM do the same.
>>
>> I have already committed the patch. I will find some time to collect
>> more size data and post it later.
>>
>> thanks,
>>
>> David
>>
>>> Thanks,
>>> Richard.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Some SPEC2k performance numbers (with 3 runs on core2): push wins over
move on 3 benchmarks. The others are noise.

perlbmk: ~+1.9%
gap:     ~+1.4%
vortex:  ~+0.7%

David

On Tue, Dec 11, 2012 at 2:53 PM, Xinliang David Li wrote:
> The following is the O2 size data from SPEC2k. Note that with push/pop,
> it is always a net win (negative delta) in terms of total binary or
> total loadable section size.
>
> thanks,
>
> David
>
>                .text     .eh_frame  Total_binary
> vortex-move    440252    40796      584066
> vortex-push    415436    57452      575906
> delta          -5.6%     40.8%      -1.397%
>
> twolf-move     169324    10748      223521
> twolf-push     168876    11124      223449
> delta          -0.3%     3.5%       -0.032%
>
> gzip-move      30668     3652       374399
> gzip-push      30524     3740       374343
> delta          -0.5%     2.4%       -0.015%
>
> bzip2-move     22748     3196       111616
> bzip2-push     22636     3284       111592
> delta          -0.5%     2.8%       -0.022%
>
> vpr-move       104684    9380       147378
> vpr-push       104236    9788       147338
> delta          -0.4%     4.3%       -0.027%
>
> mcf-move       8444      1244       26760
> mcf-push       8444      1244       26760
> delta          0.0%      0.0%       0.000%
>
> cc1-move       1093964   90772      1576994
> cc1-push       1078988   104068     1575314
> delta          -1.4%     14.6%      -0.107%
>
> crafty-move    130556    5508       1256037
> crafty-push    130236    5772       1255981
> delta          -0.2%     4.8%       -0.004%
>
> eon-move       333660    33220      516491
> eon-push       330140    35812      51
> delta          -1.1%     7.8%       -0.181%
>
> gap-move       404092    46732      1457735
> gap-push       396012    53180      1456103
> delta          -2.0%     13.8%      -0.112%
>
> perlbmk-move   456572    45324      618585
> perlbmk-push   449516    52340      618545
> delta          -1.5%     15.5%      -0.006%
>
> parser-move    81244     15788      334003
> parser-push    80684     16332      333987
> delta          -0.7%     3.4%       -0.005%
>
> On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li wrote:
>> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener wrote:
>>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump wrote:
>>>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li wrote:
>>>>> I have not measured the CFI size impact -- but conceivably it should
>>>>> be larger -- which is unfortunate.
>>>>
>>>> Code speed and size are preferable to optimizing dwarf size… :-) I'd
>>>> let dwarf 5 fix it!
>>>
>>> Well, different to debug info, CFI data has to be in memory to make
>>> unwinding work.
>>> These days most Linux distributions enable asynchronous unwind tables,
>>> so any size savings due to shorter push/pop epilogue/prologue sequences
>>> have to be offset by the increase in CFI data. I'm not sure there is
>>> really a speed difference between both variants (well, maybe due to
>>> better icache footprint of the push/pop variant).
>>
>> Yes, for large applications, this can be crucial to performance.
>>
>>> That said - I'd prefer to have more data on this before making the
>>> switch for the generic model. What was your original motivation? Just
>>> "theory" or was it a real case?
>>
>> 1) some of the very large internal apps I measured benefit from this
>> change (in terms of performance)
>> 2) both ICC and LLVM do the same.
>>
>> I have already committed the patch. I will find some time to collect
>> more size data and post it later.
>>
>> thanks,
>>
>> David
>>
>>> Thanks,
>>> Richard.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
The following is the O2 size data from SPEC2k. Note that with push/pop,
it is always a net win (negative delta) in terms of total binary or
total loadable section size.

thanks,

David

               .text     .eh_frame  Total_binary
vortex-move    440252    40796      584066
vortex-push    415436    57452      575906
delta          -5.6%     40.8%      -1.397%

twolf-move     169324    10748      223521
twolf-push     168876    11124      223449
delta          -0.3%     3.5%       -0.032%

gzip-move      30668     3652       374399
gzip-push      30524     3740       374343
delta          -0.5%     2.4%       -0.015%

bzip2-move     22748     3196       111616
bzip2-push     22636     3284       111592
delta          -0.5%     2.8%       -0.022%

vpr-move       104684    9380       147378
vpr-push       104236    9788       147338
delta          -0.4%     4.3%       -0.027%

mcf-move       8444      1244       26760
mcf-push       8444      1244       26760
delta          0.0%      0.0%       0.000%

cc1-move       1093964   90772      1576994
cc1-push       1078988   104068     1575314
delta          -1.4%     14.6%      -0.107%

crafty-move    130556    5508       1256037
crafty-push    130236    5772       1255981
delta          -0.2%     4.8%       -0.004%

eon-move       333660    33220      516491
eon-push       330140    35812      51
delta          -1.1%     7.8%       -0.181%

gap-move       404092    46732      1457735
gap-push       396012    53180      1456103
delta          -2.0%     13.8%      -0.112%

perlbmk-move   456572    45324      618585
perlbmk-push   449516    52340      618545
delta          -1.5%     15.5%      -0.006%

parser-move    81244     15788      334003
parser-push    80684     16332      333987
delta          -0.7%     3.4%       -0.005%

On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li wrote:
> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener wrote:
>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump wrote:
>>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li wrote:
>>>> I have not measured the CFI size impact -- but conceivably it should
>>>> be larger -- which is unfortunate.
>>>
>>> Code speed and size are preferable to optimizing dwarf size… :-) I'd
>>> let dwarf 5 fix it!
>>
>> Well, different to debug info, CFI data has to be in memory to make
>> unwinding work.
>> These days most Linux distributions enable asynchronous unwind tables,
>> so any size savings due to shorter push/pop epilogue/prologue sequences
>> have to be offset by the increase in CFI data. I'm not sure there is
>> really a speed difference between both variants (well, maybe due to
>> better icache footprint of the push/pop variant).
>
> Yes, for large applications, this can be crucial to performance.
>
>> That said - I'd prefer to have more data on this before making the
>> switch for the generic model. What was your original motivation? Just
>> "theory" or was it a real case?
>
> 1) some of the very large internal apps I measured benefit from this
> change (in terms of performance)
> 2) both ICC and LLVM do the same.
>
> I have already committed the patch. I will find some time to collect
> more size data and post it later.
>
> thanks,
>
> David
>
>> Thanks,
>> Richard.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener wrote:
> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump wrote:
>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li wrote:
>>> I have not measured the CFI size impact -- but conceivably it should
>>> be larger -- which is unfortunate.
>>
>> Code speed and size are preferable to optimizing dwarf size… :-) I'd
>> let dwarf 5 fix it!
>
> Well, different to debug info, CFI data has to be in memory to make
> unwinding work.
> These days most Linux distributions enable asynchronous unwind tables,
> so any size savings due to shorter push/pop epilogue/prologue sequences
> have to be offset by the increase in CFI data. I'm not sure there is
> really a speed difference between both variants (well, maybe due to
> better icache footprint of the push/pop variant).

Yes, for large applications, this can be crucial to performance.

> That said - I'd prefer to have more data on this before making the
> switch for the generic model. What was your original motivation? Just
> "theory" or was it a real case?

1) some of the very large internal apps I measured benefit from this
change (in terms of performance)
2) both ICC and LLVM do the same.

I have already committed the patch. I will find some time to collect
more size data and post it later.

thanks,

David

> Thanks,
> Richard.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump wrote:
> On Dec 10, 2012, at 12:42 PM, Xinliang David Li wrote:
>> I have not measured the CFI size impact -- but conceivably it should
>> be larger -- which is unfortunate.
>
> Code speed and size are preferable to optimizing dwarf size… :-) I'd
> let dwarf 5 fix it!

Well, different to debug info, CFI data has to be in memory to make
unwinding work.
These days most Linux distributions enable asynchronous unwind tables,
so any size savings due to shorter push/pop epilogue/prologue sequences
have to be offset by the increase in CFI data. I'm not sure there is
really a speed difference between both variants (well, maybe due to
better icache footprint of the push/pop variant).

That said - I'd prefer to have more data on this before making the
switch for the generic model. What was your original motivation? Just
"theory" or was it a real case?

Thanks,
Richard.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Dec 10, 2012, at 12:42 PM, Xinliang David Li wrote:
> I have not measured the CFI size impact -- but conceivably it should
> be larger -- which is unfortunate.

Code speed and size are preferable to optimizing dwarf size… :-) I'd
let dwarf 5 fix it!
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
I have not measured the CFI size impact -- but conceivably it should
be larger -- which is unfortunate.

David

On Mon, Dec 10, 2012 at 1:23 AM, Richard Biener wrote:
> On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak wrote:
>> Hello!
>>
>>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>>> SP adjustment instead of a sequence of pushes/pops. The preference to
>>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>>> K10), because it breaks the data dependency. In modern
>>> micro-architecture, push/pop is implemented using a mechanism called
>>> stack engine. The data dependency is removed by the hardware, and
>>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>>> smaller. There is no longer the need to avoid using them. This is
>>> also what ICC does.
>>
>>> 2012-12-08  Xinliang David Li
>>>     * config/i386/i386.c: Eanble push/pop in pro/epilogue for
>>> moderen CPUs.
>>
>> s/moderen/modern
>>
>> OK for mainline SVN.
>
> It's also more costly for unwind info in the prologue/epilogue. Thus,
> did you measure the effect on CFI size?
>
> Thanks,
> Richard.
>
>> Thanks,
>> Uros.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak wrote:
> Hello!
>
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency. In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them. This is
>> also what ICC does.
>
>> 2012-12-08  Xinliang David Li
>>     * config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen
>> CPUs.
>
> s/moderen/modern
>
> OK for mainline SVN.

It's also more costly for unwind info in the prologue/epilogue. Thus,
did you measure the effect on CFI size?

Thanks,
Richard.

> Thanks,
> Uros.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
s/Eanble/Enable/

Thanks,
Dmitry

2012/12/9 Uros Bizjak :
> Hello!
>
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency. In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them. This is
>> also what ICC does.
>
>> 2012-12-08  Xinliang David Li
>>     * config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen
>> CPUs.
>
> s/moderen/modern
>
> OK for mainline SVN.
>
> Thanks,
> Uros.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Hello!

> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
> SP adjustment instead of a sequence of pushes/pops. The preference to
> the MOVs are good for old CPU micro-architectures (before pentium-4,
> K10), because it breaks the data dependency. In modern
> micro-architecture, push/pop is implemented using a mechanism called
> stack engine. The data dependency is removed by the hardware, and
> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
> smaller. There is no longer the need to avoid using them. This is
> also what ICC does.

> 2012-12-08  Xinliang David Li
>     * config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen
> CPUs.

s/moderen/modern

OK for mainline SVN.

Thanks,
Uros.
