RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-21 Thread Zamyatin, Igor
So far we see a regression on one of the eembc_1_1 tests because of the following change:

   /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
  from FP to FP. */
-  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
+  m_AMDFAM10 | m_GENERIC,

Probably we should keep it as is, since there is indeed nothing about it in the
docs...
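
For reference, a hedged sketch of what X86_TUNE_USE_VECTOR_FP_CONVERTS
selects (my gloss, not taken from the thread):

/* Scalar FP convert: writes only the low half of the destination, so it
   carries a false dependency on the previous value of %xmm0:
     cvtss2sd %xmm1, %xmm0
   Packed convert of the same value, with no such dependency:
     cvtps2pd %xmm1, %xmm0  */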


Thanks,
Igor


-Original Message-
From: [email protected] [mailto:[email protected]] On 
Behalf Of Jan Hubicka
Sent: Wednesday, December 12, 2012 8:37 PM
To: Xinliang David Li
Cc: GCC Patches
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

> I noticed that in the prologue/epilogue, GCC prefers to use MOVs followed by
> an SP adjustment instead of a sequence of pushes/pops. The preference for
> the MOVs is good for old CPU micro-architectures (before Pentium 4,
> K10), because it breaks the data dependency.  In modern
> micro-architectures, push/pop is implemented using a mechanism called the
> stack engine. The data dependency is removed by the hardware, and
> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
> smaller. There is no longer a need to avoid using them.  This is
> also what ICC does.
> 
> The following patch fixed the problem. It passes bootstrap/regression 
> test. OK to install?
> 
> thanks,
> 
> David
> 
> Index: config/i386/i386.c
> ===
> --- config/i386/i386.c (revision 194324)
> +++ config/i386/i386.c (working copy)
> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> 
>/* X86_TUNE_PROLOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,
> 
>/* X86_TUNE_EPILOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,

Push/pop vs. moves was always difficult to tune on old CPUs, so I am happy it
is gone from generic (in fact I had a similar patch pending).
Are you sure about Atom having a stack engine, too?
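
For concreteness, a hedged sketch of the two prologue styles under discussion
(illustrative only, not taken from the patch):

/* Move-based prologue (X86_TUNE_PROLOGUE_USING_MOVE): one SP adjustment,
   then independent stores:
     subq  $16, %rsp
     movq  %rbx, (%rsp)
     movq  %r12, 8(%rsp)
   Push-based prologue, cheap once a stack engine renames the implicit SP
   updates away (1 uop, 1 cycle each, and smaller encodings):
     pushq %rbx
     pushq %r12  */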

A related thing is accumulate_outgoing_args. Igor is testing it on Core and I
will give it a try on K10.

Honza

I am attaching the changes for core costs I made if someone is interested in
testing them.  If we can declare P4/PPro and maybe K8 chips obsolete for
generic, there is room for improvement in generic, too, like using inc/dec
again.

Honza

Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 194452)
+++ config/i386/i386.c  (working copy)
@@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
   COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
-  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
+  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
   {-1, libcall, false}}}},
   {{libcall, {{6, loop_1_byte, true},
   {24, loop, true},
   {8192, rep_prefix_4_byte, true},
   {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
   {-1, libcall, false}}}},
   1,   /* scalar_stmt_cost.  */
   1,   /* scalar load_cost.  */
@@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO,
 
   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
-  m_CORE2I7 | m_GENERIC,
+  m_GENERIC | m_CORE2,
 
   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
* on 16-bit immediate moves into memory on Core2 and Corei7.  */
@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,
 
   /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
 
   /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
   m_PENT4,
@@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
   m_COREI7 | m_BDVER,
 
   /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
-  m_BDVER ,
+  m_BDVER,
 
   /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
  are resolved on SSE register parts instead of whole registers, so we may 
@@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
   m_ATHLON_K8,
 
   /* X86_TUNE_SSE_TYPELESS_STORES */
-  m_AMD_MULTIPLE,
+  m_AMD_MULTIPLE | m_CORE2I7, /**/
 
   /* X86_TUNE_SSE_LOAD0_BY_PXOR */
-  m_PPRO | m_P4_NOCONA,
+  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
 
   /* X86_T

RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-21 Thread Zamyatin, Igor
We also checked spec2000 and eembc_2_0 on Atom - no visible regressions or
gains

-Original Message-
From: [email protected] [mailto:[email protected]] On 
Behalf Of Xinliang David Li
Sent: Friday, December 21, 2012 11:26 AM
To: Jan Hubicka
Cc: GCC Patches; Ahmad Sharif
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

Ahmad has helped with some Atom performance testing (ChromeOS
benchmarks) for this patch. In summary, there is no statistically significant
regression seen. There is one improvement of about +1.9%
(the v8 benchmark) which looks real.

David

On Wed, Dec 12, 2012 at 9:24 AM, Xinliang David Li  wrote:
> On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka  wrote:
>>> I noticed that in the prologue/epilogue, GCC prefers to use MOVs followed
>>> by an SP adjustment instead of a sequence of pushes/pops. The preference
>>> for the MOVs is good for old CPU micro-architectures (before
>>> Pentium 4, K10), because it breaks the data dependency.  In modern
>>> micro-architectures, push/pop is implemented using a mechanism called
>>> the stack engine. The data dependency is removed by the hardware, and
>>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>>> smaller. There is no longer a need to avoid using them.  This is
>>> also what ICC does.
>>>
>>> The following patch fixed the problem. It passes 
>>> bootstrap/regression test. OK to install?
>>>
>>> thanks,
>>>
>>> David
>>>
>>> Index: config/i386/i386.c
>>> ===
>>> --- config/i386/i386.c (revision 194324)
>>> +++ config/i386/i386.c (working copy)
>>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>>
>>>/* X86_TUNE_PROLOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>>
>>>/* X86_TUNE_EPILOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>
>> Push/pop vs. moves was always difficult to tune on old CPUs, so I am
>> happy it is gone from generic (in fact I had a similar patch pending).
>> Are you sure about Atom having a stack engine, too?
>>
>
> Good question. The instruction latency table
> (http://www.agner.org/optimize/instruction_tables.pdf) shows that for
> Atom, push r has 1 uop, 1 cycle latency. However, the instruction is
> not pairable, which will affect ILP. The guide at
> http://www.agner.org/optimize/microarchitecture.pdf does not mention
> Atom having a stack engine either.
>
> I will help collect some performance data on Atom.
>
>
> thanks,
>
> David
>
>
>> A related thing is accumulate_outgoing_args. Igor is testing it on Core
>> and I will give it a try on K10.
>>
>> Honza
>>
>> I am attaching the changes for core costs I made if someone is
>> interested in testing them.  If we can declare P4/PPro and maybe K8
>> chips obsolete for generic, there is room for improvement in
>> generic, too, like using inc/dec again.
>>
>> Honza
>>
>> Index: config/i386/i386.c
>> ===
>> --- config/i386/i386.c  (revision 194452)
>> +++ config/i386/i386.c  (working copy)
>> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
>>COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
>>COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
>> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>    {-1, libcall, false}}}},
>>{{libcall, {{6, loop_1_byte, true},
>>{24, loop, true},
>>{8192, rep_prefix_4_byte, true},
>>{-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>    {-1, libcall, false}}}},
>>1,   /* scalar_stmt_cost.  */
>>1,   /* scalar load_cost.  */
>> @@ -1806,7 +1806,7 @@ sta

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread Xinliang David Li
Ahmad has helped with some Atom performance testing (ChromeOS
benchmarks) for this patch. In summary, there is no statistically
significant regression seen. There is one improvement of about +1.9%
(the v8 benchmark) which looks real.

David

On Wed, Dec 12, 2012 at 9:24 AM, Xinliang David Li  wrote:
> On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka  wrote:
>>> I noticed that in the prologue/epilogue, GCC prefers to use MOVs followed by
>>> an SP adjustment instead of a sequence of pushes/pops. The preference for
>>> the MOVs is good for old CPU micro-architectures (before Pentium 4,
>>> K10), because it breaks the data dependency.  In modern
>>> micro-architectures, push/pop is implemented using a mechanism called the
>>> stack engine. The data dependency is removed by the hardware, and
>>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>>> smaller. There is no longer a need to avoid using them.  This is
>>> also what ICC does.
>>>
>>> The following patch fixed the problem. It passes bootstrap/regression
>>> test. OK to install?
>>>
>>> thanks,
>>>
>>> David
>>>
>>> Index: config/i386/i386.c
>>> ===
>>> --- config/i386/i386.c (revision 194324)
>>> +++ config/i386/i386.c (working copy)
>>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>>
>>>/* X86_TUNE_PROLOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>>
>>>/* X86_TUNE_EPILOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>
>> Push/pop vs. moves was always difficult to tune on old CPUs, so I am happy
>> it is gone from generic (in fact I had a similar patch pending).
>> Are you sure about Atom having a stack engine, too?
>>
>
> Good question. The instruction latency table
> (http://www.agner.org/optimize/instruction_tables.pdf) shows that for
> Atom, push r has 1 uop, 1 cycle latency. However, the instruction is
> not pairable, which will affect ILP. The guide at
> http://www.agner.org/optimize/microarchitecture.pdf does not mention
> Atom having a stack engine either.
>
> I will help collect some performance data on Atom.
>
>
> thanks,
>
> David
>
>
>> A related thing is accumulate_outgoing_args. Igor is testing it on Core and
>> I will give it a try on K10.
>>
>> Honza
>>
>> I am attaching the changes for core costs I made if someone is interested in
>> testing them.  If we can declare P4/PPro and maybe K8 chips obsolete for
>> generic, there is room for improvement in generic, too, like using inc/dec
>> again.
>>
>> Honza
>>
>> Index: config/i386/i386.c
>> ===
>> --- config/i386/i386.c  (revision 194452)
>> +++ config/i386/i386.c  (working copy)
>> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
>>COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
>>COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
>> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>    {-1, libcall, false}}}},
>>{{libcall, {{6, loop_1_byte, true},
>>{24, loop, true},
>>{8192, rep_prefix_4_byte, true},
>>{-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>    {-1, libcall, false}}}},
>>1,   /* scalar_stmt_cost.  */
>>1,   /* scalar load_cost.  */
>> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>>m_PPRO,
>>
>>/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> -  m_CORE2I7 | m_GENERIC,
>> +  m_GENERIC | m_CORE2,
>>
>>/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>> * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>>m_K6,
>>
>>/* X86_TUNE_USE_CLTD */
>> -  ~(m_PENT | m_ATOM | m_K6),
>> +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>>
>>/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
>>m_PENT4,
>> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>>m_COREI7 | m_BDVER,
>>
>>/* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
>> -  m_BDVER ,
>> +  m_BDVER,
>>
>>/* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and 
>> dependencies
>>   are resolved on SSE register parts instead of whole registers, so we 
>> may
>> @@ -1910,10

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread H.J. Lu
On Thu, Dec 20, 2012 at 7:06 AM, Jan Hubicka  wrote:
>> > Hi Areg,
>> >
>> > Did you mean inlined memcpy/memset are as fast as
>> > the ones in libc.so on both ia32 and Intel64?
>>
>> I would be interested in output of the stringop script.
>
> Also as far as I can remember, none of the spec2k6 benchmarks is really
> stringop bound.  On Spec2k, GCC was quite bound by memset (within alloc_rtx
> and bitmap operations), but mostly by collecting page faults there.  Inlining
> that one made quite a lot of difference on K8 hardware, but not on later chips.
>

There is a GCC performance regression bug on EEMBC.  It turns out
that -static was used for linking, so the optimized memory functions weren't
used.  Removing -static fixed the performance regression.

-- 
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread Jan Hubicka
> > Hi Areg,
> > 
> > Did you mean inlined memcpy/memset are as fast as
> > the ones in libc.so on both ia32 and Intel64?
> 
> I would be interested in output of the stringop script.

Also as far as I can remember, none of the spec2k6 benchmarks is really stringop
bound.  On Spec2k, GCC was quite bound by memset (within alloc_rtx and bitmap
operations), but mostly by collecting page faults there.  Inlining that one made
quite a lot of difference on K8 hardware, but not on later chips.

Honza


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread Jan Hubicka
> Hi Areg,
> 
> Did you mean inlined memcpy/memset are as fast as
> the ones in libc.so on both ia32 and Intel64?

I would be interested in output of the stringop script.
> 
> Please keep in mind that memcpy/memset in libc.a
> may not be optimized.  You must not use -static for
> linking.

In my setup I use dynamic linking...
(this is a quite annoying property in general - people tend to use --static for
performance-critical binaries to save the expense of PIC.  It would be really
cool to have a way to call the proper stringops based on the -march switch)

Honza
> 
> -- 
> H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread H.J. Lu
On Thu, Dec 20, 2012 at 4:13 AM, Melik-adamyan, Areg wrote:
> We checked,  no significant gains or losses.
>
> -Original Message-
> From: H.J. Lu [mailto:[email protected]]
> Sent: Friday, December 14, 2012 1:03 AM
> To: Jan Hubicka
> Cc: Jakub Jelinek; Xinliang David Li; GCC Patches; Teresa Johnson; 
> Melik-adamyan, Areg
> Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
>
> On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka  wrote:
>>> > Here we speak about memcpy/memset only.  I never got around to
>>> > modernizing strlen and friends, unfortunately...
>>> >
>>> > memcmp and friends are different beasts.  They really need some TLC...
>>>
>>> memcpy and memset in glibc are also extremely fast.
>>
>> The default strategy now is to inline only when the block is known to
>> be small (either constant or via profile feedback; we do not really
>> use the info on the upper bound of the size of the copied object, which
>> would be useful but not readily available at expansion time).
>>
>> You can try the test_stringop script I attached and send me the
>> results.  For
>
> Areg, can you give it a try?  Thanks.
>

Hi Areg,

Did you mean inlined memcpy/memset are as fast as
the ones in libc.so on both ia32 and Intel64?

Please keep in mind that memcpy/memset in libc.a
may not be optimized.  You must not use -static for
linking.

-- 
H.J.


RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread Melik-adamyan, Areg
We checked,  no significant gains or losses.

-Original Message-
From: H.J. Lu [mailto:[email protected]] 
Sent: Friday, December 14, 2012 1:03 AM
To: Jan Hubicka
Cc: Jakub Jelinek; Xinliang David Li; GCC Patches; Teresa Johnson; 
Melik-adamyan, Areg
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka  wrote:
>> > Here we speak about memcpy/memset only.  I never got around to
>> > modernizing strlen and friends, unfortunately...
>> >
>> > memcmp and friends are different beasts.  They really need some TLC...
>>
>> memcpy and memset in glibc are also extremely fast.
>
> The default strategy now is to inline only when the block is known to
> be small (either constant or via profile feedback; we do not really
> use the info on the upper bound of the size of the copied object, which
> would be useful but not readily available at expansion time).
>
> You can try the test_stringop script I attached and send me the 
> results.  For

Areg, can you give it a try?  Thanks.

> me libc starts to be a win only for rather large blocks (i.e. >8KB)
>

Which glibc are you using?

--
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread Jan Hubicka
> > me libc starts to be a win only for rather large blocks (i.e. >8KB)
> >
> 
> Which glibc are you using?

2.15 as it comes with opensuse 12.2

Honza
> 
> -- 
> H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread H.J. Lu
On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka  wrote:
>> > Here we speak about memcpy/memset only.  I never got around to
>> > modernizing strlen and friends, unfortunately...
>> >
>> > memcmp and friends are different beasts.  They really need some TLC...
>>
>> memcpy and memset in glibc are also extremely fast.
>
>> The default strategy now is to inline only when the block is known to be small
>> (either constant or via profile feedback; we do not really use the info on
>> the upper bound of the size of the copied object, which would be useful but
>> not readily available at expansion time).
>
> You can try the test_stringop script I attached and send me the results.  For

Areg, can you give it a try?  Thanks.

> me libc starts to be a win only for rather large blocks (i.e. >8KB)
>

Which glibc are you using?

-- 
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread Jan Hubicka
> > Here we speak about memcpy/memset only.  I never got around to
> > modernizing strlen and friends, unfortunately...
> >
> > memcmp and friends are different beasts.  They really need some TLC...
> 
> memcpy and memset in glibc are also extremely fast.

The default strategy now is to inline only when the block is known to be small
(either constant or via profile feedback; we do not really use the info on the
upper bound of the size of the copied object, which would be useful but not
readily available at expansion time).

You can try the test_stringop script I attached and send me the results.  For
me libc starts to be a win only for rather large blocks (i.e. >8KB)
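
As a hedged illustration of that default (my example, not from the mail):

#include <string.h>

/* Size known at compile time: GCC can expand this copy inline.  */
void small_copy (char *dst, const char *src)
{
  memcpy (dst, src, 16);
}

/* Size unknown at expansion time: with the default tables this stays
   a call to the libc memcpy.  */
void any_copy (char *dst, const char *src, size_t n)
{
  memcpy (dst, src, n);
}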

Honza


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread H.J. Lu
On Thu, Dec 13, 2012 at 12:26 PM, Jan Hubicka  wrote:
>> On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek  wrote:
>> > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka  wrote:
>> >> >> > libcall is not faster up to 8KB compared to a rep sequence, which is
>> >> >> > better for regalloc/code cache than a fully blown function call.
>> >> >>
>> >> >> Be careful with this. My recollection is that REP sequence is good for
>> >> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> >> (10s of cycles), while for large size copy, it is less efficient
>> >> >> compared with library version.
>> >> >
>> >> > Well this is based on the data from the memtest script.
>> >> > Core has good REP implementation - it is a win from rather small blocks 
>> >> > (16
>> >> > bytes if I recall) and it does not need alignment.
>> >> > Library version starts to be interesting with caching hints, but I 
>> >> > think till 80KB
>> >> > it is still not a win for my setup (glibc-2.15)
>> >>
>> >> A simple test shows that -mstringop-strategy=libcall always beats
>> >> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> >> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> >> Can you share your memtest ?
>> >
>> > I can't believe that, say, a 16 byte or 32 byte memcpy can ever be faster
>> > using a libcall.  The PLT call overhead is simply too high.
>> >
>>
>> The x86 string/memory functions in the current glibc are
>> extremely fast and tuned for Core 2/Core i7.  GCC is having
>> a very hard time to beat them with inlining:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
>
> Here we speak about memcpy/memset only.  I never got around to modernizing
> strlen and friends, unfortunately...
>
> memcmp and friends are different beasts.  They really need some TLC...

memcpy and memset in glibc are also extremely fast.


-- 
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread Jan Hubicka
> On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek  wrote:
> > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
> >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka  wrote:
> >> >> > libcall is not faster up to 8KB compared to a rep sequence, which is
> >> >> > better for regalloc/code cache than a fully blown function call.
> >> >>
> >> >> Be careful with this. My recollection is that REP sequence is good for
> >> >> any size -- for smaller size, the REP initial set up cost is too high
> >> >> (10s of cycles), while for large size copy, it is less efficient
> >> >> compared with library version.
> >> >
> >> > Well this is based on the data from the memtest script.
> >> > Core has good REP implementation - it is a win from rather small blocks 
> >> > (16
> >> > bytes if I recall) and it does not need alignment.
> >> > Library version starts to be interesting with caching hints, but I think 
> >> > till 80KB
> >> > it is still not a win for my setup (glibc-2.15)
> >>
> >> A simple test shows that -mstringop-strategy=libcall always beats
> >> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
> >> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
> >> Can you share your memtest ?
> >
> > I can't believe that, say, a 16 byte or 32 byte memcpy can ever be faster
> > using a libcall.  The PLT call overhead is simply too high.
> >
> 
> The x86 string/memory functions in the current glibc are
> extremely fast and tuned for Core 2/Core i7.  GCC is having
> a very hard time to beat them with inlining:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Here we speak about memcpy/memset only.  I never got around to modernizing
strlen and friends, unfortunately...

memcmp and friends are different beasts.  They really need some TLC...

Honza


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread H.J. Lu
On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek  wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka  wrote:
>> >> > libcall is not faster up to 8KB compared to a rep sequence, which is
>> >> > better for regalloc/code cache than a fully blown function call.
>> >>
>> >> Be careful with this. My recollection is that REP sequence is good for
>> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> (10s of cycles), while for large size copy, it is less efficient
>> >> compared with library version.
>> >
>> > Well this is based on the data from the memtest script.
>> > Core has good REP implementation - it is a win from rather small blocks (16
>> > bytes if I recall) and it does not need alignment.
>> > Library version starts to be interesting with caching hints, but I think 
>> > till 80KB
>> > it is still not a win for my setup (glibc-2.15)
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest ?
>
> I can't believe that, say, a 16 byte or 32 byte memcpy can ever be faster
> using a libcall.  The PLT call overhead is simply too high.
>

The x86 string/memory functions in the current glibc are
extremely fast and tuned for Core 2/Core i7.  GCC is having
a very hard time to beat them with inlining:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

-- 
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread Jan Hubicka
> Try the following one: 1) -minline-all-stringops
> -mstringop-strategy=rep_8byte -O2 vs 2) -mstringop-strategy=libcall
> -O2.
> 
> David
> 
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #ifndef LEN
> #define LEN 16
> #endif
> 
> void copy(char* s1, char* s2,int len) __attribute__((noinline));
> void copy(char* s1, char* s2,int len)
> {
>memcpy(s2,s1,len);
> }

I guess the catch here is that you force the copy to be noinline and thus you
eliminate the benefits of the inlined sequence.  With an inline stringop one
saves regalloc and can often get rid of the alignment tests.

This is the script I use to tune the tables.

Honza

test()
{
rm -f a.out
cat <&1`
echo -n " "$TIME
echo $TIME $4 >>/tmp/accum
}

testrow()
{
echo -n "" >/tmp/accum
printf "block size %7i" $3
test "$2" "$3" "-mstringop-strategy=libcall" libcall
test "$2" "$3" "-mstringop-strategy=rep_byte -malign-stringops" rep1
test "$2" "$3" "-mstringop-strategy=rep_byte -mno-align-stringops" rep1noalign
test "$2" "$3" "-mstringop-strategy=rep_4byte -malign-stringops" rep4
test "$2" "$3" "-mstringop-strategy=rep_4byte -mno-align-stringops" rep4noalign
if [ "$mode" == 64 ]
then
test "$2" "$3" "-mstringop-strategy=rep_8byte -malign-stringops" rep8
test "$2" "$3" "-mstringop-strategy=rep_8byte -mno-align-stringops" rep8noalign
fi
test "$2" "$3" "-mstringop-strategy=loop -malign-stringops"  loop
test "$2" "$3" "-mstringop-strategy=loop -mno-align-stringops"  loopnoalign
test "$2" "$3" "-mstringop-strategy=unrolled_loop -malign-stringops" unrl
test "$2" "$3" "-mstringop-strategy=unrolled_loop -mno-align-stringops" 
unrlnoalign
test "$2" "$3" "-mstringop-strategy=sse_loop -malign-stringops" sse
test "$2" "$3" "-mstringop-strategy=sse_loop -mno-align-stringops -msse2" 
ssenoalign
test "$2" "$3" "-mstringop-strategy=byte_loop" byte
best=`cat /tmp/accum | sort | head -1`
test "$2" "$3" " -fprofile-generate" >/dev/null 2>&1
test "$2" "$3" " -fprofile-use"
test "$2" "$3" " -minline-stringops-dynamically"
echo " best: $best"
}

test_all_sizes()
{
if [ "$mode" == 64 ]
then
echo "   libcall   rep1   noalgrep4   noalgrep8   noalg 
   loop   noalgunrl   noalgsse   noalgbyte profiled dynamic"
else
echo "   libcall   rep1   noalgrep4   noalgloop   noalg 
   unrl   noalgssenoalgbyte profiled dynamic"
fi
#for size in 1 2 3 4 6 8 10 12 14 16 24 32 48 64 128 256 512 1024 4096 8192 81920 819200 8192000
#for size in 8192000 819200 81920 8192 4096 2048 1024 512 256 128 64 48 32 24 16 14 12 10 8 6 5 4 3 2 1
for size in 8192000 819200 81920 20480 8192 4096 2048 1024 512 256 128 64 48 32 24 16 14 12 10 8 6 4 1
#for size in 128 256 1024 4096 8192 81920 819200
do
testrow "$1" "$2" $size
done
}

mode=$1
shift
export memsize=$1
shift
cmdline=$*
if [ "$mode" != 32 ]
then
  if [ "$mode" != 64 ]
  then
echo "Usage:"
echo "test_stringop mode size cmdline"
echo "mode is either 32 or 64"
echo "size is amount of memory copied in each test.  Should be chosed small 
enough so runtime is less than minute for each test and sorting works"
echo "Example: test_stringop 32 64000 ./xgcc -B ./ -march=pentium3"
exit
  fi
fi
echo "memcpy mode:$mode size:$memsize"
export STRINGOP=""
type=char
test_all_sizes $mode "$cmdline -m$mode"
echo "Aligned"
type=long
test_all_sizes $mode "$cmdline -m$mode"
echo "memset"
type=char
export STRINGOP="-Dtest_memset=1"
test_all_sizes $mode "$cmdline -m$mode"
echo "Aligned"
type=long
test_all_sizes $mode "$cmdline -m$mode"


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread Richard Biener
On Thu, Dec 13, 2012 at 7:21 AM, Jakub Jelinek  wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka  wrote:
>> >> > libcall is not faster up to 8KB compared to a rep sequence, which is
>> >> > better for regalloc/code cache than a fully blown function call.
>> >>
>> >> Be careful with this. My recollection is that REP sequence is good for
>> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> (10s of cycles), while for large size copy, it is less efficient
>> >> compared with library version.
>> >
>> > Well this is based on the data from the memtest script.
>> > Core has good REP implementation - it is a win from rather small blocks (16
>> > bytes if I recall) and it does not need alignment.
>> > Library version starts to be interesting with caching hints, but I think 
>> > till 80KB
>> > it is still not a win for my setup (glibc-2.15)
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest ?
>
> I can't believe that, say, a 16 byte or 32 byte memcpy can ever be faster
> using a libcall.  The PLT call overhead is simply too high.

I believe the PLT call overhead may be effectively zero if the benchmarking
is just a loop around a memcpy.  Thus for measuring the PLT overhead
I call the benchmark broken ;)

Richard.

> Jakub


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
Try the following one: 1) -minline-all-stringops
-mstringop-strategy=rep_8byte -O2 vs 2) -mstringop-strategy=libcall
-O2.

David


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef LEN
#define LEN 16
#endif

void copy(char* s1, char* s2,int len) __attribute__((noinline));
void copy(char* s1, char* s2,int len)
{
   memcpy(s2,s1,len);
}


int main() {

  char* s1 = (char*) malloc(LEN  +10);
  char* s2 = (char*) malloc(LEN  +10);
  int i = 0;
  for (i =  0; i < 10; i++)
  {
copy(s1+1,s2+3,LEN);
  }
}

On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek  wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka  wrote:
>> >> > libcall is not faster up to 8KB compared to a rep sequence, which is
>> >> > better for regalloc/code cache than a fully blown function call.
>> >>
>> >> Be careful with this. My recollection is that REP sequence is good for
>> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> (10s of cycles), while for large size copy, it is less efficient
>> >> compared with library version.
>> >
>> > Well this is based on the data from the memtest script.
>> > Core has good REP implementation - it is a win from rather small blocks (16
>> > bytes if I recall) and it does not need alignment.
>> > Library version starts to be interesting with caching hints, but I think 
>> > till 80KB
>> > it is still not a win for my setup (glibc-2.15)
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest ?
>
> I can't believe that, say, a 16 byte or 32 byte memcpy can ever be faster
> using a libcall.  The PLT call overhead is simply too high.
>
> Jakub


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Jakub Jelinek
On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka  wrote:
> >> > libcall is not faster up to 8KB compared to a rep sequence, which is
> >> > better for regalloc/code cache than a fully blown function call.
> >>
> >> Be careful with this. My recollection is that REP sequence is good for
> >> any size -- for smaller size, the REP initial set up cost is too high
> >> (10s of cycles), while for large size copy, it is less efficient
> >> compared with library version.
> >
> > Well this is based on the data from the memtest script.
> > Core has good REP implementation - it is a win from rather small blocks (16
> > bytes if I recall) and it does not need alignment.
> > Library version starts to be interesting with caching hints, but I think 
> > till 80KB
> > it is still not a win for my setup (glibc-2.15)
> 
> A simple test shows that -mstringop-strategy=libcall always beats
> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
> Can you share your memtest ?

I can't believe that, say, a 16 byte or 32 byte memcpy can ever be faster using
a libcall.  The PLT call overhead is simply too high.

Jakub


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka  wrote:
>> > libcall is not faster up to 8KB compared to a rep sequence, which is
>> > better for regalloc/code cache than a fully blown function call.
>>
>> Be careful with this. My recollection is that REP sequence is good for
>> any size -- for smaller size, the REP initial set up cost is too high
>> (10s of cycles), while for large size copy, it is less efficient
>> compared with library version.
>
> Well this is based on the data from the memtest script.
> Core has good REP implementation - it is a win from rather small blocks (16
> bytes if I recall) and it does not need alignment.
> Library version starts to be interesting with caching hints, but I think till 
> 80KB
> it is still not a win for my setup (glibc-2.15)

A simple test shows that -mstringop-strategy=libcall always beats
-mstringop-strategy=rep_8byte (on core2 and corei7) except for size
smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
Can you share your memtest ?

thanks,

David

>> >> >
>> >> >/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix 
>> >> > stall
>> >> > * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> >> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>> >> >m_K6,
>> >> >
>> >> >/* X86_TUNE_USE_CLTD */
>> >> > -  ~(m_PENT | m_ATOM | m_K6),
>> >> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>>
>> My change was to enable CLTD for generic. Is your change intended to
>> revert that?
>
> No, it is a merge conflict, sorry.  I will update it in my tree.
>> > Skipping inc/dec is to avoid partial flag stall happening on P4 only.
>> >> >
>>
>>
>> K8 and K10 partition the flags into groups. References to flags in
>> the same group can still cause the stall -- not sure how that can be
>> handled.
>
> I believe the stalls happen only in quite special cases where a compare
> instruction combines flags from multiple instructions.  GCC doesn't generate
> this type of code, so we should be safe.
>
> Honza


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Jan Hubicka
> > libcall is not faster up to 8KB compared to a rep sequence, which is
> > better for regalloc/code cache than a fully blown function call.
> 
> Be careful with this. My recollection is that REP sequence is good for
> any size -- for smaller size, the REP initial set up cost is too high
> (10s of cycles), while for large size copy, it is less efficient
> compared with library version.

Well this is based on the data from the memtest script.  
Core has good REP implementation - it is a win from rather small blocks (16
bytes if I recall) and it does not need alignment.
Library version starts to be interesting with caching hints, but I think till 
80KB
it is still not a win for my setup (glibc-2.15)
> >> >
> >> >/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> >> > * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> >> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
> >> >m_K6,
> >> >
> >> >/* X86_TUNE_USE_CLTD */
> >> > -  ~(m_PENT | m_ATOM | m_K6),
> >> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
> 
> My change was to enable CLTD for generic. Is your change intended to
> revert that?

No, it is a merge conflict, sorry.  I will update it in my tree.
> > Skipping inc/dec is to avoid partial flag stall happening on P4 only.
> >> >
> 
> 
> K8 and K10 partition the flags into groups. References to flags in
> the same group can still cause the stall -- not sure how that can be
> handled.

I believe the stalls happen only in quite special cases where a compare
instruction combines flags from multiple instructions.  GCC doesn't generate
this type of code, so we should be safe.

Honza


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
On Wed, Dec 12, 2012 at 4:16 PM, Xinliang David Li  wrote:
> On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka  wrote:
>> Concerning 1 push per cycle, I think it is the same as the K7 hardware did,
>> so the move prologue should be a win.
>>> > Index: config/i386/i386.c
>>> > ===
>>> > --- config/i386/i386.c  (revision 194452)
>>> > +++ config/i386/i386.c  (working copy)
>>> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>> >COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
>>> >COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
>>> >COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
>>> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>>> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>>> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>> >    {-1, libcall, false}}}},
>>> >{{libcall, {{6, loop_1_byte, true},
>>> >{24, loop, true},
>>> >{8192, rep_prefix_4_byte, true},
>>> >{-1, libcall, false}}},
>>> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>
>> libcall is not faster up to 8KB compared to a rep sequence, which is better
>> for regalloc/code cache than a fully blown function call.
>
> Be careful with this. My recollection is that REP sequence is good for


s/good/not good/


David

> any size -- for smaller size, the REP initial set up cost is too high
> (10s of cycles), while for large size copy, it is less efficient
> compared with library version.
>
>
>>> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>>> >m_PPRO,
>>> >
>>> >/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>>> > -  m_CORE2I7 | m_GENERIC,
>>> > +  m_GENERIC | m_CORE2,
>>
>> This disables shifts that store just some flags. According to Agner's
>> manual, I7 handles this well.
>>
>
> ok.
>
>> Partial flags stall
>> The Sandy Bridge uses the method of an extra µop to join partial registers
>> not only for general purpose registers but also for the flags register,
>> unlike previous processors which used this method only for general purpose
>> registers. This occurs when a write to a part of the flags register is
>> followed by a read from a larger part of the flags register. The partial
>> flags stall of previous processors (See page 75) is therefore replaced by
>> an extra µop. The Sandy Bridge also generates an extra µop when reading the
>> flags after a rotate instruction.
>>
>> This is cheaper than the 7 cycle delay on Core that this flag is trying to
>> avoid.
>
> ok.
>
>>> >
>>> >/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>> > * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>>> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>>> >m_K6,
>>> >
>>> >/* X86_TUNE_USE_CLTD */
>>> > -  ~(m_PENT | m_ATOM | m_K6),
>>> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
> My change was to enable CLTD for generic. Is your change intended to
> revert that?
>
>>
>> None of the CPUs that generic cares about are !USE_CLTD now after your change.
>>> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>>> >m_ATHLON_K8,
>>> >
>>> >/* X86_TUNE_SSE_TYPELESS_STORES */
>>> > -  m_AMD_MULTIPLE,
>>> > +  m_AMD_MULTIPLE | m_CORE2I7, /**/
>>
>> Hmm, I cannot seem to find this in the manual now, but I believe that
>> stores are also untyped, so a movaps store is preferred over a movapd store
>> because it is shorter.  If not, this change should produce a lot of
>> slowdowns.
>>> >
>>> >/* X86_TUNE_SSE_LOAD0_BY_PXOR */
>>> > -  m_PPRO | m_P4_NOCONA,
>>> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
>>
>> Agner:
>> A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
>> Core2 and Nehalem processors recognize that certain instructions are 
>> independent of the
>> prior value of the register if the source and destination registers are the 
>> same.
>>
>> This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, 
>> XORPD, and all
>> variants of PSUBxxx and PCMPxxx except PCMPEQQ.
>>> >
>>> >/* X86_TUNE_MEMORY_MISMATCH_STALL */
>>> >m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict 
>>> > more
>>> >   than 4 branch instructions in the 16 byte window.  */
>>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
>> prediction.
>> Intel never h

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka  wrote:
> Concerning 1 push per cycle, I think it is the same as the K7 hardware did,
> so the move prologue should be a win.
>> > Index: config/i386/i386.c
>> > ===
>> > --- config/i386/i386.c  (revision 194452)
>> > +++ config/i386/i386.c  (working copy)
>> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>> >COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
>> >COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
>> >COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
>> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>> >    {-1, libcall, false}}}},
>> >{{libcall, {{6, loop_1_byte, true},
>> >{24, loop, true},
>> >{8192, rep_prefix_4_byte, true},
>> >{-1, libcall, false}}},
>> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>
> libcall is not faster up to 8KB compared to a rep sequence, which is better
> for regalloc/code cache than a fully blown function call.

Be careful with this. My recollection is that REP sequence is good for
any size -- for smaller size, the REP initial set up cost is too high
(10s of cycles), while for large size copy, it is less efficient
compared with library version.


>> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>> >m_PPRO,
>> >
>> >/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> > -  m_CORE2I7 | m_GENERIC,
>> > +  m_GENERIC | m_CORE2,
>
> This disables shifts that store just some flags. According to Agner's manual,
> I7 handles this well.
>

ok.

> Partial flags stall
> The Sandy Bridge uses the method of an extra µop to join partial registers
> not only for general purpose registers but also for the flags register,
> unlike previous processors which used this method only for general purpose
> registers. This occurs when a write to a part of the flags register is
> followed by a read from a larger part of the flags register. The partial
> flags stall of previous processors (See page 75) is therefore replaced by an
> extra µop. The Sandy Bridge also generates an extra µop when reading the
> flags after a rotate instruction.
>
> This is cheaper than the 7 cycle delay on Core that this flag is trying to
> avoid.

ok.

>> >
>> >/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>> > * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>> >m_K6,
>> >
>> >/* X86_TUNE_USE_CLTD */
>> > -  ~(m_PENT | m_ATOM | m_K6),
>> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

My change was to enable CLTD for generic. Is your change intended to
revert that?

>
> None of the CPUs that generic cares about are !USE_CLTD now after your change.
>> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>> >m_ATHLON_K8,
>> >
>> >/* X86_TUNE_SSE_TYPELESS_STORES */
>> > -  m_AMD_MULTIPLE,
>> > +  m_AMD_MULTIPLE | m_CORE2I7, /**/
>
> Hmm, I cannot seem to find this in the manual now, but I believe that stores
> are also untyped, so a movaps store is preferred over a movapd store because
> it is shorter.  If not, this change should produce a lot of slowdowns.
>> >
>> >/* X86_TUNE_SSE_LOAD0_BY_PXOR */
>> > -  m_PPRO | m_P4_NOCONA,
>> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
>
> Agner:
> A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
> Core2 and Nehalem processors recognize that certain instructions are 
> independent of the
> prior value of the register if the source and destination registers are the 
> same.
>
> This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, 
> XORPD, and all
> variants of PSUBxxx and PCMPxxx except PCMPEQQ.
>> >
>> >/* X86_TUNE_MEMORY_MISMATCH_STALL */
>> >m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>> >   than 4 branch instructions in the 16 byte window.  */
>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
> prediction.
> Intel never had a similar design, so this flag is pointless.

I noticed that too, but Andi has a better answer to it.

>
> We apparently ought to disable it for K10, at least per Agner's manual.
>> >
>> >/* X86_TUNE_SCHEDULE */
>> >  

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Andi Kleen
Andi Kleen  writes:
>
>>> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict 
>>> > more
>>> >   than 4 branch instructions in the 16 byte window.  */
>>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
>> prediction.
>> Intel never had a similar design, so this flag is pointless.
>
> Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
> 16 byte window.

Actually it's four per 32 bytes, sorry.

Here's an old patch I had lying around to optimize for that.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1b871be..9b57316 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2713,6 +2713,7 @@ ix86_target_string (HOST_WIDE_INT isa, int flags, const 
char *arch,
 { "-mavx256-split-unaligned-load", MASK_AVX256_SPLIT_UNALIGNED_LOAD},
 { "-mavx256-split-unaligned-store",
MASK_AVX256_SPLIT_UNALIGNED_STORE},
 { "-mprefer-avx128",   MASK_PREFER_AVX128},
+{ "-mjump-pad-32bytes",MASK_JUMP_PAD_32BYTES},
   };
 
   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
@@ -32182,6 +32183,7 @@ ix86_avoid_jump_mispredicts (void)
   rtx insn, start = get_insns ();
   int nbytes = 0, njumps = 0;
   int isjump = 0;
+  int jump_pad_window_size = TARGET_JUMP_PAD_32BYTES ? 32 : 16;
 
   /* Look for all minimal intervals of instructions containing 4 jumps.
  The intervals are bounded by START and INSN.  NBYTES is the total
@@ -32202,8 +32204,8 @@ ix86_avoid_jump_mispredicts (void)
  int align = label_to_alignment (insn);
  int max_skip = label_to_max_skip (insn);
 
- if (max_skip > 15)
-   max_skip = 15;
+ if (max_skip > jump_pad_window_size - 1)
+   max_skip = jump_pad_window_size - 1;
  /* If align > 3, only up to 16 - max_skip - 1 bytes can be
 already in the current 16 byte page, because otherwise
 ASM_OUTPUT_MAX_SKIP_ALIGN could skip max_skip or fewer
@@ -32216,7 +32218,7 @@ ix86_avoid_jump_mispredicts (void)
 INSN_UID (insn), max_skip);
  if (max_skip)
{
- while (nbytes + max_skip >= 16)
+ while (nbytes + max_skip >= jump_pad_window_size)
{
  start = NEXT_INSN (start);
  if ((JUMP_P (start)
@@ -32262,10 +32264,11 @@ ix86_avoid_jump_mispredicts (void)
 fprintf (dump_file, "Interval %i to %i has %i bytes\n",
 INSN_UID (start), INSN_UID (insn), nbytes);
 
-  if (njumps == 3 && isjump && nbytes < 16)
+  if (njumps == 3 && isjump && nbytes < jump_pad_window_size)
{
- int padsize = 15 - nbytes + min_insn_size (insn);
-
+ int padsize = jump_pad_window_size - 1 - nbytes + 
+   min_insn_size (insn);
+ 
  if (dump_file)
fprintf (dump_file, "Padding insn %i by %i bytes!\n",
 INSN_UID (insn), padsize);
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 6c516e7..b38d163 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -223,6 +223,10 @@ mintel-syntax
 Target Undocumented Alias(masm=, intel, att) Warn(%<-mintel-syntax%> and 
%<-mno-intel-syntax%> are deprecated; use %<-masm=intel%> and %<-masm=att%> 
instead)
 ;; Deprecated
 
+mjump-pad-32bytes
+Target RejectNegative Mask(JUMP_PAD_32BYTES) Save
+Avoid more than 4 jumps in each 32byte code window.
+
 mms-bitfields
 Target Report Mask(MS_BITFIELD_LAYOUT) Save
 Use native (MS) bitfield layout


-- 
[email protected] -- Speaking for myself only


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Jan Hubicka
> Jan Hubicka  writes:
> >
> > libcall is not faster up to 8KB compared to a rep sequence, which is better
> > for regalloc/code cache than a fully blown function call.
> 
> I noticed btw that some of the generated string instructions are slower 
> than just calling the C library.
> 
> rep scasb etc. is rarely a win over an optimized library function,
> it's not very optimized. Perhaps those patterns should just be disabled.
> The way to optimize that on modern CPUs is to use PCMP*STR*, but that's
> quite a bit more complicated and has some constraints.

This is only about memset/memcpy expansion.  The other sequences are quite
lame indeed...
> 
> 
> >> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict 
> >> > more
> >> >   than 4 branch instructions in the 16 byte window.  */
> >> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | 
> >> > m_GENERIC,
> >> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> >
> > This is special passs to handle limitations of AMD's K7/K8/K10 branch 
> > prediction.
> > Intel never had similar design, so this flag is pointless.
> 
> Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
> 16 byte window. If you exceed that it falls back to running 
> the full decoder from the normal icache.
> 
> I don't have solid data, but it may be a win for frontend limited
> code (otherwise possibly more in power than performance)
> 
> I would revisit that for Sandy Bridge

We are not particularly good at avoiding the branches - basically the code
inserts alignment whenever it thinks the 4 consecutive branches fit in the
window.  I can make a patch to change this to 3 and we can see if it helps at
all.
> 
> -Andi
> -- 
> [email protected] -- Speaking for myself only


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Andi Kleen
Jan Hubicka  writes:
>
> libcall is not faster up to 8KB compared to a rep sequence, which is better
> for regalloc/code cache than a fully blown function call.

I noticed btw that some of the generated string instructions are slower 
than just calling the C library.

rep scasb etc. is rarely a win over an optimized library function,
it's not very optimized. Perhaps those patterns should just be disabled.
The way to optimize that on modern CPUs is to use PCMP*STR*, but that's
quite a bit more complicated and has some constraints.


>> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>> >   than 4 branch instructions in the 16 byte window.  */
>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
> This is a special pass to handle limitations of AMD's K7/K8/K10 branch
> prediction.
> Intel never had a similar design, so this flag is pointless.

Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
16 byte window. If you exceed that it falls back to running 
the full decoder from the normal icache.

I don't have solid data, but it may be a win for frontend limited
code (otherwise possibly more in power than performance)

I would revisit that for Sandy Bridge

-Andi
-- 
[email protected] -- Speaking for myself only


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Jan Hubicka
Concerning 1 push per cycle, I think it is the same as the K7 hardware did, so
the move prologue should be a win.
> > Index: config/i386/i386.c
> > ===
> > --- config/i386/i386.c  (revision 194452)
> > +++ config/i386/i386.c  (working copy)
> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
> >COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
> >COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
> >COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
> >    {-1, libcall, false}}}},
> >{{libcall, {{6, loop_1_byte, true},
> >{24, loop, true},
> >{8192, rep_prefix_4_byte, true},
> >{-1, libcall, false}}},
> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},

libcall is not faster up to 8KB compared to a rep sequence, which is better for
regalloc/code cache than a fully blown function call.
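
A hedged reading of the table entries being changed, following the
stringop_algs layout in i386.h (my gloss, not part of the mail):

/* {libcall,                            size unknown at expansion: libcall
    {{8192, rep_prefix_8_byte, true},   known size <= 8192: rep movsq; the
                                        bool requests no alignment prologue
     {-1, libcall, false}}}             anything larger: call libc  */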
> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
> >m_PPRO,
> >
> >/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> > -  m_CORE2I7 | m_GENERIC,
> > +  m_GENERIC | m_CORE2,

This disables shifts that store just some flags. According to Agner's manual,
I7 handles this well.

Partial flags stall
The Sandy Bridge uses the method of an extra µop to join partial registers not
only for general purpose registers but also for the flags register, unlike
previous processors which used this method only for general purpose registers.
This occurs when a write to a part of the flags register is followed by a read
from a larger part of the flags register. The partial flags stall of previous
processors (See page 75) is therefore replaced by an extra µop. The Sandy
Bridge also generates an extra µop when reading the flags after a rotate
instruction.

This is cheaper than the 7 cycle delay on Core that this flag is trying to
avoid.
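
A hedged example of the pattern this flag is about (mine, not from the mail):
shifts write only some of the EFLAGS bits, so consuming flags straight from a
shift can force a flags merge on the affected cores:

/*   shrl  $2, %eax
     jz    done            ; reads ZF produced by the shift
   with the flag set, GCC emits an explicit test instead:
     shrl  $2, %eax
     testl %eax, %eax
     jz    done  */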
> >
> >/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> > * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
> >m_K6,
> >
> >/* X86_TUNE_USE_CLTD */
> > -  ~(m_PENT | m_ATOM | m_K6),
> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

None of the CPUs that generic cares about are !USE_CLTD now after your change.
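
For reference, the choice this flag controls is how %eax is sign-extended into
%edx before a signed division; a sketch (representative output only, the exact
registers depend on allocation):

/* Signed 32-bit division needs %eax sign-extended into %edx:%eax.  */
int
sdiv (int a, int b)
{
  return a / b;
}

/* With X86_TUNE_USE_CLTD:    cltd             ; idivl %ecx
   Without it:                movl %eax, %edx  ; sarl $31, %edx ; idivl %ecx */
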
> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
> >m_ATHLON_K8,
> >
> >/* X86_TUNE_SSE_TYPELESS_STORES */
> > -  m_AMD_MULTIPLE,
> > +  m_AMD_MULTIPLE | m_CORE2I7, /**/

Hmm, I cannot seem to find this in the manual now, but I believe that stores
are also typeless, so a movaps store is preferred over a movapd store because
it is shorter.  If not, this change should produce a lot of slowdowns.
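
The size argument is visible in the encodings: a movaps store (0F 29) has no
66 prefix, so it is one byte shorter than movapd (66 0F 29) while storing the
same 128 bits.  A sketch of the substitution, written by hand with a cast:

#include <emmintrin.h>

void
store_typed (double *p, __m128d v)
{
  _mm_store_pd (p, v);                            /* movapd store */
}

/* What X86_TUNE_SSE_TYPELESS_STORES permits: same bits, shorter insn.  */
void
store_typeless (double *p, __m128d v)
{
  _mm_store_ps ((float *) p, _mm_castpd_ps (v));  /* movaps store */
}
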
> >
> >/* X86_TUNE_SSE_LOAD0_BY_PXOR */
> > -  m_PPRO | m_P4_NOCONA,
> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/

Agner:
A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
Core2 and Nehalem processors recognize that certain instructions are 
independent of the
prior value of the register if the source and destination registers are the 
same.

This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, 
XORPD, and all
variants of PSUBxxx and PCMPxxx except PCMPEQQ.
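
In intrinsics terms the idiom looks like this (a sketch; the actual
instruction selection is up to the backend):

#include <emmintrin.h>

/* Zeroing idioms recognized as dependency-breaking on Core2/Nehalem:
   the result does not wait on the prior value of the register.  */
__m128i
zero_int (void)
{
  return _mm_setzero_si128 ();  /* typically pxor %xmm0, %xmm0 */
}

__m128
zero_ps (void)
{
  return _mm_setzero_ps ();     /* typically xorps %xmm0, %xmm0 */
}
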
> >
> >/* X86_TUNE_MEMORY_MISMATCH_STALL */
> >m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
> >
> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
> >   than 4 branch instructions in the 16 byte window.  */
> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

This is a special pass to handle limitations of AMD's K7/K8/K10 branch
prediction.  Intel never had a similar design, so this flag is pointless.

We apparently ought to disable it for K10, at least per Agner's manual.
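
A hypothetical function dense enough to trip the limit; each test may compile
to a short cmp+je pair, so four of them can land in one 16-byte fetch window,
which is what the flag pads against on the affected AMD cores:

int
classify (int x)
{
  if (x == 1) return 10;   /* may become cmpl $1, %edi ; je ...  */
  if (x == 2) return 20;
  if (x == 3) return 30;
  if (x == 4) return 40;
  return 0;
}
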
> >
> >/* X86_TUNE_SCHEDULE */
> >m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | 
> > m_GENERIC,
> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
> >m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> >
> >/* X86_TUNE_USE_INCDEC */
> > -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
> > +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),

Skipping inc/dec is there to avoid the partial flags stall, which happens on P4
only.
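
The trade-off in a sketch: inc/dec are shorter than add/sub $1 but leave CF
untouched, which is exactly what created the false flag dependence on P4:

/* With USE_INCDEC the "n--" may become the shorter decl; without it GCC
   emits subl $1, which also writes CF and therefore cannot create the
   false CF dependence that hurt the Pentium 4.  */
unsigned
sum_bytes (const unsigned char *p, unsigned n)
{
  unsigned s = 0;
  while (n--)
    s += *p++;
  return s;
}
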
> >
> >/* X86_TUNE_PAD_RETURNS */
> > -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
> > +  m_AMD_MULTIPLE | m_GENERIC,

Again, this deals specifically with AMD K7/K8/K10 branch prediction.

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
Honza, can you explain each change and point to the reference?

thanks,

David

On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka  wrote:
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency.  In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them.   This is
>> also what ICC does.
>>
>> The following patch fixed the problem. It passes bootstrap/regression
>> test. OK to install?
>>
>> thanks,
>>
>> David
>>
>> Index: config/i386/i386.c
>> ===================================================================
>> --- config/i386/i386.c (revision 194324)
>> +++ config/i386/i386.c (working copy)
>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>/* X86_TUNE_PROLOGUE_USING_MOVE */
>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> +  m_PPRO | m_ATHLON_K8,
>>
>>/* X86_TUNE_EPILOGUE_USING_MOVE */
>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> +  m_PPRO | m_ATHLON_K8,
>
> Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
> is gone from generic (in fact I had similar patch pending).
> Are you sure about Atom having stack engine, too?
>
> Related thing is accumulate_outgoing_args. Igor is testing it on Core and I 
> will
> give it a try on K10.
>
> Honza
>
> I am attaching the changes for core costs I made if someone is interested in
> testing them.  If we can declare P4/PPRo and maybe K8 chips obsolette for
> generic, there is room for improvement in generic, too. Like using inc/dec
> again.
>
> Honza
>
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 194452)
> +++ config/i386/i386.c  (working copy)
> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
>COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
>COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>{-1, libcall, false,
>{{libcall, {{6, loop_1_byte, true},
>{24, loop, true},
>{8192, rep_prefix_4_byte, true},
>{-1, libcall, false}}},
> -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>{-1, libcall, false,
>1,   /* scalar_stmt_cost.  */
>1,   /* scalar load_cost.  */
> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>m_PPRO,
>
>/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> -  m_CORE2I7 | m_GENERIC,
> +  m_GENERIC | m_CORE2,
>
>/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>m_K6,
>
>/* X86_TUNE_USE_CLTD */
> -  ~(m_PENT | m_ATOM | m_K6),
> +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
>/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
>m_PENT4,
> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>m_COREI7 | m_BDVER,
>
>/* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
> -  m_BDVER ,
> +  m_BDVER,
>
>/* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and 
> dependencies
>   are resolved on SSE register parts instead of whole registers, so we may
> @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>m_ATHLON_K8,
>
>/* X86_TUNE_SSE_TYPELESS_STORES */
> -  m_AMD_MULTIPLE,
> +  m_AMD_MULTIPLE | m_CORE2I7, /**/
>
>/* X86_TUNE_SSE_LOAD0_BY_PXOR */
> -  m_PPRO | m_P4_NOCONA,
> +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
>
>/* X86_TUNE_MEMORY_MISMATCH_STALL */
>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>
>/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>   than 4 branch instructions in the 16 byte window.  */
> -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
>/*

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka  wrote:
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency.  In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them.   This is
>> also what ICC does.
>>
>> The following patch fixed the problem. It passes bootstrap/regression
>> test. OK to install?
>>
>> thanks,
>>
>> David
>>
>> Index: config/i386/i386.c
>> ===================================================================
>> --- config/i386/i386.c (revision 194324)
>> +++ config/i386/i386.c (working copy)
>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>/* X86_TUNE_PROLOGUE_USING_MOVE */
>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> +  m_PPRO | m_ATHLON_K8,
>>
>>/* X86_TUNE_EPILOGUE_USING_MOVE */
>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> +  m_PPRO | m_ATHLON_K8,
>
> Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
> is gone from generic (in fact I had similar patch pending).
> Are you sure about Atom having stack engine, too?
>

Good question. The instruction latency table
(http://www.agner.org/optimize/instruction_tables.pdf) shows that for
Atom, push r has 1 uop and 1-cycle latency. However, the instruction is
not pairable, which will affect ILP. The guide at
http://www.agner.org/optimize/microarchitecture.pdf does not mention
Atom having a stack engine either.

I will help collect some performance data on Atom.


thanks,

David


> Related thing is accumulate_outgoing_args. Igor is testing it on Core and I 
> will
> give it a try on K10.
>
> Honza
>
> I am attaching the changes for core costs I made if someone is interested in
> testing them.  If we can declare P4/PPRo and maybe K8 chips obsolette for
> generic, there is room for improvement in generic, too. Like using inc/dec
> again.
>
> Honza
>
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 194452)
> +++ config/i386/i386.c  (working copy)
> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
>COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
>COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>{-1, libcall, false,
>{{libcall, {{6, loop_1_byte, true},
>{24, loop, true},
>{8192, rep_prefix_4_byte, true},
>{-1, libcall, false}}},
> -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>{-1, libcall, false,
>1,   /* scalar_stmt_cost.  */
>1,   /* scalar load_cost.  */
> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>m_PPRO,
>
>/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> -  m_CORE2I7 | m_GENERIC,
> +  m_GENERIC | m_CORE2,
>
>/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>m_K6,
>
>/* X86_TUNE_USE_CLTD */
> -  ~(m_PENT | m_ATOM | m_K6),
> +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
>/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
>m_PENT4,
> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>m_COREI7 | m_BDVER,
>
>/* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
> -  m_BDVER ,
> +  m_BDVER,
>
>/* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and 
> dependencies
>   are resolved on SSE register parts instead of whole registers, so we may
> @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>m_ATHLON_K8,
>
>/* X86_TUNE_SSE_TYPELESS_STORES */
> -  m_AMD_MULTIPLE,
> +  m_AMD_MULTIPLE | m_CORE2I7, /**/
>
>/* X86_TUNE_SSE_LOAD0_BY_PXOR */
> -  m_PPRO | m_P4_NOCONA,
> +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
>
>/* X86_TUNE_MEMORY_MISMATCH_STALL */
>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> @@ -1938,7 +1938,7 @@ static

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Jan Hubicka
> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
> SP adjustment instead of a sequence of pushes/pops. The preference to
> the MOVs are good for old CPU micro-architectures (before pentium-4,
> K10), because it breaks the data dependency.  In modern
> micro-architecture, push/pop is implemented using a mechanism called
> stack engine. The data dependency is removed by the hardware, and
> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
> smaller. There is no longer the need to avoid using them.   This is
> also what ICC does.
> 
> The following patch fixed the problem. It passes bootstrap/regression
> test. OK to install?
> 
> thanks,
> 
> David
> 
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c (revision 194324)
> +++ config/i386/i386.c (working copy)
> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> 
>/* X86_TUNE_PROLOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,
> 
>/* X86_TUNE_EPILOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,

Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
is gone from generic (in fact I had similar patch pending).
Are you sure about Atom having stack engine, too?

Related thing is accumulate_outgoing_args. Igor is testing it on Core and I will
give it a try on K10.

Honza

I am attaching the changes for core costs I made if someone is interested in
testing them.  If we can declare P4/PPRo and maybe K8 chips obsolette for
generic, there is room for improvement in generic, too. Like using inc/dec
again.

Honza

Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 194452)
+++ config/i386/i386.c  (working copy)
@@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
   COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
-  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
+  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
   {-1, libcall, false,
   {{libcall, {{6, loop_1_byte, true},
   {24, loop, true},
   {8192, rep_prefix_4_byte, true},
   {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
   {-1, libcall, false,
   1,   /* scalar_stmt_cost.  */
   1,   /* scalar load_cost.  */
@@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO,
 
   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
-  m_CORE2I7 | m_GENERIC,
+  m_GENERIC | m_CORE2,
 
   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
* on 16-bit immediate moves into memory on Core2 and Corei7.  */
@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,
 
   /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
 
   /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
   m_PENT4,
@@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
   m_COREI7 | m_BDVER,
 
   /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
-  m_BDVER ,
+  m_BDVER,
 
   /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
  are resolved on SSE register parts instead of whole registers, so we may
@@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
   m_ATHLON_K8,
 
   /* X86_TUNE_SSE_TYPELESS_STORES */
-  m_AMD_MULTIPLE,
+  m_AMD_MULTIPLE | m_CORE2I7, /**/
 
   /* X86_TUNE_SSE_LOAD0_BY_PXOR */
-  m_PPRO | m_P4_NOCONA,
+  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
 
   /* X86_TUNE_MEMORY_MISMATCH_STALL */
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
@@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
  than 4 branch instructions in the 16 byte window.  */
-  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
+  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_SCHEDULE */
   m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | 
m_GENERIC,
@@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
   m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_USE_INCDEC */
-  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
+  ~

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Richard Biener
On Tue, Dec 11, 2012 at 11:53 PM, Xinliang David Li  wrote:
> The following is the -O2 size data from SPEC2k.  Note that with push/pop,
> it is always a net win (negative delta) in terms of total binary or
> total loadable section size.

Thanks for the data!

Richard.

> thanks,
>
> David
>
>                .text   .eh_frame   Total_binary
> vortex-move   440252       40796         584066
> vortex-push   415436       57452         575906
> delta          -5.6%       40.8%        -1.397%
>
> twolf-move    169324       10748         223521
> twolf-push    168876       11124         223449
> delta          -0.3%        3.5%        -0.032%
>
> gzip-move      30668        3652         374399
> gzip-push      30524        3740         374343
> delta          -0.5%        2.4%        -0.015%
>
> bzip2-move     22748        3196         111616
> bzip2-push     22636        3284         111592
> delta          -0.5%        2.8%        -0.022%
>
> vpr-move      104684        9380         147378
> vpr-push      104236        9788         147338
> delta          -0.4%        4.3%        -0.027%
>
> mcf-move        8444        1244          26760
> mcf-push        8444        1244          26760
> delta           0.0%        0.0%         0.000%
>
> cc1-move     1093964       90772        1576994
> cc1-push     1078988      104068        1575314
> delta          -1.4%       14.6%        -0.107%
>
> crafty-move   130556        5508        1256037
> crafty-push   130236        5772        1255981
> delta          -0.2%        4.8%        -0.004%
>
> eon-move      333660       33220         516491
> eon-push      330140       35812         51
> delta          -1.1%        7.8%        -0.181%
>
> gap-move      404092       46732        1457735
> gap-push      396012       53180        1456103
> delta          -2.0%       13.8%        -0.112%
>
> perlbmk-move  456572       45324         618585
> perlbmk-push  449516       52340         618545
> delta          -1.5%       15.5%        -0.006%
>
> parser-move    81244       15788         334003
> parser-push    80684       16332         333987
> delta          -0.7%        3.4%        -0.005%
>
>
> On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li  wrote:
>> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
>>  wrote:
>>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump  wrote:
 On Dec 10, 2012, at 12:42 PM, Xinliang David Li  wrote:
> I have not measured the CFI size impact -- but conceivably it should
> be larger -- which is unfortunate.

 Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
 dwarf 5 fix it!
>>>
>>> Well, unlike debug info, CFI data has to be in memory to make
>>> unwinding work.
>>> These days most Linux distributions enable asynchronous unwind tables, so any
>>> size savings due to shorter push/pop epilogue/prologue sequences have to be
>>> offset against the increase in CFI data.  I'm not sure there is really a
>>> speed difference between the two variants (well, maybe due to the better
>>> icache footprint of the push/pop variant).
>>
>> Yes, for large applications, this can be crucial to performance.
>>
>>>
>>> That said - I'd prefer to have more data on this before making the switch 
>>> for
>>> the generic model.  What was your original motivation?  Just "theory" or was
>>> it a real case?
>>
>> 1) some of the very large internal apps I measured benefit from this
>> change (in terms of performance)
>> 2) both ICC and LLVM do the same.
>>
>> I have already committed the patch. I will find some time to collect
>> more size data and post it later.
>>
>> thanks,
>>
>> David
>>
>>
>>>
>>> Thanks,
>>> Richard.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-11 Thread Xinliang David Li
Some SPEC2k performance numbers (3 runs on Core2):

Push wins over move on 3 benchmarks; the others are noise.

perlbmk: ~+1.9%
gap:     ~+1.4%
vortex:  ~+0.7%

David

On Tue, Dec 11, 2012 at 2:53 PM, Xinliang David Li  wrote:
> The following is the -O2 size data from SPEC2k.  Note that with push/pop,
> it is always a net win (negative delta) in terms of total binary or
> total loadable section size.
>
> thanks,
>
> David
>
>                .text   .eh_frame   Total_binary
> vortex-move   440252       40796         584066
> vortex-push   415436       57452         575906
> delta          -5.6%       40.8%        -1.397%
>
> twolf-move    169324       10748         223521
> twolf-push    168876       11124         223449
> delta          -0.3%        3.5%        -0.032%
>
> gzip-move      30668        3652         374399
> gzip-push      30524        3740         374343
> delta          -0.5%        2.4%        -0.015%
>
> bzip2-move     22748        3196         111616
> bzip2-push     22636        3284         111592
> delta          -0.5%        2.8%        -0.022%
>
> vpr-move      104684        9380         147378
> vpr-push      104236        9788         147338
> delta          -0.4%        4.3%        -0.027%
>
> mcf-move        8444        1244          26760
> mcf-push        8444        1244          26760
> delta           0.0%        0.0%         0.000%
>
> cc1-move     1093964       90772        1576994
> cc1-push     1078988      104068        1575314
> delta          -1.4%       14.6%        -0.107%
>
> crafty-move   130556        5508        1256037
> crafty-push   130236        5772        1255981
> delta          -0.2%        4.8%        -0.004%
>
> eon-move      333660       33220         516491
> eon-push      330140       35812         51
> delta          -1.1%        7.8%        -0.181%
>
> gap-move      404092       46732        1457735
> gap-push      396012       53180        1456103
> delta          -2.0%       13.8%        -0.112%
>
> perlbmk-move  456572       45324         618585
> perlbmk-push  449516       52340         618545
> delta          -1.5%       15.5%        -0.006%
>
> parser-move    81244       15788         334003
> parser-push    80684       16332         333987
> delta          -0.7%        3.4%        -0.005%
>
>
> On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li  wrote:
>> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
>>  wrote:
>>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump  wrote:
 On Dec 10, 2012, at 12:42 PM, Xinliang David Li  wrote:
> I have not measured the CFI size impact -- but conceivably it should
> be larger -- which is unfortunate.

 Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
 dwarf 5 fix it!
>>>
>>> Well, unlike debug info, CFI data has to be in memory to make
>>> unwinding work.
>>> These days most Linux distributions enable asynchronous unwind tables, so any
>>> size savings due to shorter push/pop epilogue/prologue sequences have to be
>>> offset against the increase in CFI data.  I'm not sure there is really a
>>> speed difference between the two variants (well, maybe due to the better
>>> icache footprint of the push/pop variant).
>>
>> Yes, for large applications, this can be crucial to performance.
>>
>>>
>>> That said - I'd prefer to have more data on this before making the switch 
>>> for
>>> the generic model.  What was your original motivation?  Just "theory" or was
>>> it a real case?
>>
>> 1) some of the very large internal apps I measured benefit from this
>> change (in terms of performance)
>> 2) both ICC and LLVM do the same.
>>
>> I have already committed the patch. I will find some time to collect
>> more size data and post it later.
>>
>> thanks,
>>
>> David
>>
>>
>>>
>>> Thanks,
>>> Richard.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-11 Thread Xinliang David Li
The following is the -O2 size data from SPEC2k.  Note that with push/pop,
it is always a net win (negative delta) in terms of total binary or
total loadable section size.

thanks,

David

               .text   .eh_frame   Total_binary
vortex-move   440252       40796         584066
vortex-push   415436       57452         575906
delta          -5.6%       40.8%        -1.397%

twolf-move    169324       10748         223521
twolf-push    168876       11124         223449
delta          -0.3%        3.5%        -0.032%

gzip-move      30668        3652         374399
gzip-push      30524        3740         374343
delta          -0.5%        2.4%        -0.015%

bzip2-move     22748        3196         111616
bzip2-push     22636        3284         111592
delta          -0.5%        2.8%        -0.022%

vpr-move      104684        9380         147378
vpr-push      104236        9788         147338
delta          -0.4%        4.3%        -0.027%

mcf-move        8444        1244          26760
mcf-push        8444        1244          26760
delta           0.0%        0.0%         0.000%

cc1-move     1093964       90772        1576994
cc1-push     1078988      104068        1575314
delta          -1.4%       14.6%        -0.107%

crafty-move   130556        5508        1256037
crafty-push   130236        5772        1255981
delta          -0.2%        4.8%        -0.004%

eon-move      333660       33220         516491
eon-push      330140       35812         51
delta          -1.1%        7.8%        -0.181%

gap-move      404092       46732        1457735
gap-push      396012       53180        1456103
delta          -2.0%       13.8%        -0.112%

perlbmk-move  456572       45324         618585
perlbmk-push  449516       52340         618545
delta          -1.5%       15.5%        -0.006%

parser-move    81244       15788         334003
parser-push    80684       16332         333987
delta          -0.7%        3.4%        -0.005%


On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li  wrote:
> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
>  wrote:
>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump  wrote:
>>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li  wrote:
 I have not measured the CFI size impact -- but conceivably it should
 be larger -- which is unfortunate.
>>>
>>> Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
>>> dwarf 5 fix it!
>>
>> Well, unlike debug info, CFI data has to be in memory to make
>> unwinding work.
>> These days most Linux distributions enable asynchronous unwind tables, so any
>> size savings due to shorter push/pop epilogue/prologue sequences have to be
>> offset against the increase in CFI data.  I'm not sure there is really a
>> speed difference between the two variants (well, maybe due to the better
>> icache footprint of the push/pop variant).
>
> Yes, for large applications, this can be crucial to performance.
>
>>
>> That said - I'd prefer to have more data on this before making the switch for
>> the generic model.  What was your original motivation?  Just "theory" or was
>> it a real case?
>
> 1) some of the very large internal apps I measured benefit from this
> change (in terms of performance)
> 2) both ICC and LLVM do the same.
>
> I have already committed the patch. I will find some time to collect
> more size data and post it later.
>
> thanks,
>
> David
>
>
>>
>> Thanks,
>> Richard.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-11 Thread Xinliang David Li
On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
 wrote:
> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump  wrote:
>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li  wrote:
>>> I have not measured the CFI size impact -- but conceivably it should
>>> be larger -- which is unfortunate.
>>
>> Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
>> dwarf 5 fix it!
>
> Well, unlike debug info, CFI data has to be in memory to make
> unwinding work.
> These days most Linux distributions enable asynchronous unwind tables, so any
> size savings due to shorter push/pop epilogue/prologue sequences have to be
> offset against the increase in CFI data.  I'm not sure there is really a
> speed difference between the two variants (well, maybe due to the better
> icache footprint of the push/pop variant).

Yes, for large applications, this can be crucial to performance.

>
> That said - I'd prefer to have more data on this before making the switch for
> the generic model.  What was your original motivation?  Just "theory" or was
> it a real case?

1) some of the very large internal apps I measured benefit from this
change (in terms of performance)
2) both ICC and LLVM do the same.

I have already committed the patch. I will find some time to collect
more size data and post it later.

thanks,

David


>
> Thanks,
> Richard.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-11 Thread Richard Biener
On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump  wrote:
> On Dec 10, 2012, at 12:42 PM, Xinliang David Li  wrote:
>> I have not measured the CFI size impact -- but conceivably it should
>> be larger -- which is unfortunate.
>
> Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
> dwarf 5 fix it!

Well, unlike debug info, CFI data has to be in memory to make
unwinding work.
These days most Linux distributions enable asynchronous unwind tables, so any
size savings due to shorter push/pop epilogue/prologue sequences have to be
offset against the increase in CFI data.  I'm not sure there is really a
speed difference between the two variants (well, maybe due to the better
icache footprint of the push/pop variant).
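
To make the CFI cost concrete, a hedged sketch of the annotations each
prologue style needs (representative directives, not exact GCC output; DWARF
register 3 is %rbx, 6 is %rbp):

/* Each push moves the CFA, so asynchronous unwind tables need one row
   per push; the mov-based prologue needs a single CFA adjustment.

   push-based:                        mov-based:
     pushq %rbp                         subq  $24, %rsp
     .cfi_def_cfa_offset 16             .cfi_def_cfa_offset 32
     .cfi_offset 6, -16                 movq  %rbx, (%rsp)
     pushq %rbx                         .cfi_offset 3, -32
     .cfi_def_cfa_offset 24             movq  %rbp, 8(%rsp)
     .cfi_offset 3, -24                 .cfi_offset 6, -24  */
extern void cfi_cost_sketch (void);  /* placeholder so this compiles as C */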

That said - I'd prefer to have more data on this before making the switch for
the generic model.  What was your original motivation?  Just "theory" or was
it a real case?

Thanks,
Richard.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-10 Thread Mike Stump
On Dec 10, 2012, at 12:42 PM, Xinliang David Li  wrote:
> I have not measured the CFI size impact -- but conceivably it should
> be larger -- which is unfortunate.

Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
dwarf 5 fix it!



Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-10 Thread Xinliang David Li
I have not measured the CFI size impact -- but conceivably it should
be larger -- which is unfortunate.

David

On Mon, Dec 10, 2012 at 1:23 AM, Richard Biener
 wrote:
> On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak  wrote:
>> Hello!
>>
>>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>>> SP adjustment instead of a sequence of pushes/pops. The preference to
>>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>>> K10), because it breaks the data dependency.  In modern
>>> micro-architecture, push/pop is implemented using a mechanism called
>>> stack engine. The data dependency is removed by the hardware, and
>>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>>> smaller. There is no longer the need to avoid using them.   This is
>>> also what ICC does.
>>
>>> 2012-12-08  Xinliang David Li  
>>>* config/i386/i386.c: Eanble push/pop in pro/epilogue for 
>>> moderen CPUs.
>>
>> s/moderen/modern
>>
>> OK for mainline SVN.
>
> It's also more costly for unwind info in the prologue/epilogue.  Thus, did you
> measure the effect on CFI size?
>
> Thanks,
> Richard.
>
>> Thanks,
>> Uros.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-10 Thread Richard Biener
On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak  wrote:
> Hello!
>
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency.  In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them.   This is
>> also what ICC does.
>
>> 2012-12-08  Xinliang David Li  
>>* config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen 
>> CPUs.
>
> s/moderen/modern
>
> OK for mainline SVN.

It's also more costly for unwind info in the prologue/epilogue.  Thus, did you
measure the effect on CFI size?

Thanks,
Richard.

> Thanks,
> Uros.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-09 Thread Дмитрий Дьяченко
s/Eanble/Enable/


Thanks,
Dmitry

2012/12/9 Uros Bizjak :
> Hello!
>
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency.  In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them.   This is
>> also what ICC does.
>
>> 2012-12-08  Xinliang David Li  
>>* config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen 
>> CPUs.
>
> s/moderen/modern
>
> OK for mainline SVN.
>
> Thanks,
> Uros.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-09 Thread Uros Bizjak
Hello!

> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
> SP adjustment instead of a sequence of pushes/pops. The preference to
> the MOVs are good for old CPU micro-architectures (before pentium-4,
> K10), because it breaks the data dependency.  In modern
> micro-architecture, push/pop is implemented using a mechanism called
> stack engine. The data dependency is removed by the hardware, and
> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
> smaller. There is no longer the need to avoid using them.   This is
> also what ICC does.

> 2012-12-08  Xinliang David Li  
>* config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen 
> CPUs.

s/moderen/modern

OK for mainline SVN.

Thanks,
Uros.