RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-21 Thread Zamyatin, Igor
We checked also spec2000 and eembc_2_0 on Atom - no visible regressions and 
gains

-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
Behalf Of Xinliang David Li
Sent: Friday, December 21, 2012 11:26 AM
To: Jan Hubicka
Cc: GCC Patches; Ahmad Sharif
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

Ahmad has helped with some Atom performance testing (ChromeOS
benchmarks) of this patch. In summary, no statistically significant
regression is seen. There is one improvement of about +1.9%
(the v8 benchmark) which looks real.

David

On Wed, Dec 12, 2012 at 9:24 AM, Xinliang David Li davi...@google.com wrote:
 On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka hubi...@ucw.cz wrote:
 I noticed in prologue/epilogue, GCC prefers to use MOVs followed by 
 an SP adjustment instead of a sequence of pushes/pops. The preference 
 for MOVs is good for old CPU micro-architectures (before 
 pentium-4, K10), because it breaks the data dependency.  In modern 
 micro-architecture, push/pop is implemented using a mechanism called 
 stack engine. The data dependency is removed by the hardware, and 
 push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
 smaller. There is no longer the need to avoid using them.   This is
 also what ICC does.
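For illustration (a hand-written sketch, not part of the quoted mail or the patch): a function that needs callee-saved registers, with the two prologue styles being discussed shown as comments. Register choice and stack offsets are schematic; real output depends on GCC version and flags.

extern int g (int, int);

int
f (int a, int b)
{
  int x = g (a, b);        /* a and b are live across the call, so they end
                              up in callee-saved registers */
  return g (x + a, x + b);
}

/* mov-style prologue (X86_TUNE_PROLOGUE_USING_MOVE set), schematic:
       subl  $12, %esp
       movl  %ebx, 4(%esp)
       movl  %esi, 8(%esp)
   push-style prologue (flag cleared, as with this patch), schematic:
       pushl %ebx
       pushl %esi
       subl  $4, %esp                                                  */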

 The following patch fixed the problem. It passes 
 bootstrap/regression test. OK to install?

 thanks,

 David

 Index: config/i386/i386.c
 ===
 --- config/i386/i386.c (revision 194324)
 +++ config/i386/i386.c (working copy)
 @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

/* X86_TUNE_PROLOGUE_USING_MOVE */
 -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
 +  m_PPRO | m_ATHLON_K8,

/* X86_TUNE_EPILOGUE_USING_MOVE */
 -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
 +  m_PPRO | m_ATHLON_K8,

 Push/pops wrt moves was always difficult to tune on old CPUs, so I am 
 happy it is gone from generic (in fact I had similar patch pending).
 Are you sure about Atom having stack engine, too?


 Good question. The instruction latency table
 (http://www.agner.org/optimize/instruction_tables.pdf) shows that for
 Atom: push r has 1 uop, 1 cycle latency. However the instruction is 
 not pairable, which will affect ILP. The guide here, 
 http://www.agner.org/optimize/microarchitecture.pdf, does not mention 
 Atom having a stack engine either.

 I will help collect some performance data on Atom.


 thanks,

 David


 Related thing is accumulate_outgoing_args. Igor is testing it on Core 
 and I will give it a try on K10.

 Honza

 I am attaching the changes for core costs I made if someone is 
 interested in testing them.  If we can declare P4/PPro and maybe K8 
 chips obsolete for generic, there is room for improvement in 
 generic, too. Like using inc/dec again.

 Honza

 Index: config/i386/i386.c
 ===
 --- config/i386/i386.c  (revision 194452)
 +++ config/i386/i386.c  (working copy)
 @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
 -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
 -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
 +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
 +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
{-1, libcall, false}}}},
{{libcall, {{6, loop_1_byte, true},
{24, loop, true},
{8192, rep_prefix_4_byte, true},
{-1, libcall, false}}},
 -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
 +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
{-1, libcall, false}}}},
1,   /* scalar_stmt_cost.  */
1,   /* scalar load_cost.  */
 @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
m_PPRO,

/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
 -  m_CORE2I7 | m_GENERIC,
 +  m_GENERIC | m_CORE2,

/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
 * on 16-bit immediate moves into memory on Core2 and Corei7.  */ 
 @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
m_K6,

/* X86_TUNE_USE_CLTD */
 -  ~(m_PENT | m_ATOM | m_K6),
 +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
m_PENT4,
 @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
m_COREI7 | m_BDVER,

/* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
 -  m_BDVER ,
 +  m_BDVER

RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-21 Thread Zamyatin, Igor
So far we see a regression on one of the eembc_1_1 tests because of the following change:

   /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
  from FP to FP. */
-  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
+  m_AMDFAM10 | m_GENERIC,

Probably we should keep it as is, since there is indeed nothing about it in the 
docs...
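For reference, a minimal example of the scalar conversion this tunable affects (illustrative only; the instruction names are the usual choices for this flag, not verified output from the tree being tested):

/* Scalar float -> double.  With X86_TUNE_USE_VECTOR_FP_CONVERTS the compiler
   may use a packed form (e.g. unpcklps + cvtps2pd) instead of the scalar
   cvtss2sd, trading an extra instruction for breaking the partial SSE
   register dependence of the scalar convert.  */
double
widen (float x)
{
  return (double) x;
}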


Thanks,
Igor


-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
Behalf Of Jan Hubicka
Sent: Wednesday, December 12, 2012 8:37 PM
To: Xinliang David Li
Cc: GCC Patches
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

 I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a 
 SP adjustment instead of a sequence of pushes/pops. The preference for 
 MOVs is good for old CPU micro-architectures (before pentium-4, 
 K10), because it breaks the data dependency.  In modern 
 micro-architecture, push/pop is implemented using a mechanism called 
 stack engine. The data dependency is removed by the hardware, and 
 push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
 smaller. There is no longer the need to avoid using them.   This is
 also what ICC does.
 
 The following patch fixed the problem. It passes bootstrap/regression 
 test. OK to install?
 
 thanks,
 
 David
 
 Index: config/i386/i386.c
 ===
 --- config/i386/i386.c (revision 194324)
 +++ config/i386/i386.c (working copy)
 @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
/* X86_TUNE_PROLOGUE_USING_MOVE */
 -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
 +  m_PPRO | m_ATHLON_K8,
 
/* X86_TUNE_EPILOGUE_USING_MOVE */
 -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
 +  m_PPRO | m_ATHLON_K8,

Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it 
is gone from generic (in fact I had similar patch pending).
Are you sure about Atom having stack engine, too?

Related thing is accumulate_outgoing_args. Igor is testing it on Core and I 
will give it a try on K10.

Honza

I am attaching the changes for core costs I made if someone is interested in 
testing them.  If we can declare P4/PPro and maybe K8 chips obsolete for 
generic, there is room for improvement in generic, too. Like using inc/dec 
again.

Honza

Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 194452)
+++ config/i386/i386.c  (working copy)
@@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
   COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
-  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
+  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
   {-1, libcall, false}}}},
   {{libcall, {{6, loop_1_byte, true},
   {24, loop, true},
   {8192, rep_prefix_4_byte, true},
   {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
   {-1, libcall, false}}}},
   1,   /* scalar_stmt_cost.  */
   1,   /* scalar load_cost.  */
@@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO,
 
   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
-  m_CORE2I7 | m_GENERIC,
+  m_GENERIC | m_CORE2,
 
   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
* on 16-bit immediate moves into memory on Core2 and Corei7.  */
@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,
 
   /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
 
   /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
   m_PENT4,
@@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
   m_COREI7 | m_BDVER,
 
   /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
-  m_BDVER ,
+  m_BDVER,
 
   /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
  are resolved on SSE register parts instead of whole registers, so we may 
@@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
   m_ATHLON_K8,
 
   /* X86_TUNE_SSE_TYPELESS_STORES */
-  m_AMD_MULTIPLE,
+  m_AMD_MULTIPLE | m_CORE2I7, /**/
 
   /* X86_TUNE_SSE_LOAD0_BY_PXOR */
-  m_PPRO | m_P4_NOCONA,
+  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
 
   /* X86_TUNE_MEMORY_MISMATCH_STALL */
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
@@ -1938,7 +1938,7 @@ static unsigned

RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread Melik-adamyan, Areg
We checked,  no significant gains or losses.

-Original Message-
From: H.J. Lu [mailto:hjl.to...@gmail.com] 
Sent: Friday, December 14, 2012 1:03 AM
To: Jan Hubicka
Cc: Jakub Jelinek; Xinliang David Li; GCC Patches; Teresa Johnson; 
Melik-adamyan, Areg
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka hubi...@ucw.cz wrote:
  Here we speak about memcpy/memset only.  I never got around to 
  modernize strlen and friends, unfortunately...
 
  memcmp and friends are different beasts.  They really need some TLC...

 memcpy and memset in glibc are also extremely fast.

 The default strategy now is to inline only when the block is known to 
 be small (either constant or via profile feedback, we do not really 
 use the info on upper bound of size of the copied object that would be 
 useful but not readily available at expansion time).

 You can try the test_stringop script I attached and send me the 
 results.  For

Areg, can you give it a try?  Thanks.

 me libc starts to be a win only for rather large blocks (i.e. 8KB)


Which glibc are you using?

--
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread H.J. Lu
On Thu, Dec 20, 2012 at 4:13 AM, Melik-adamyan, Areg
areg.melik-adam...@intel.com wrote:
 We checked,  no significant gains or losses.

 -Original Message-
 From: H.J. Lu [mailto:hjl.to...@gmail.com]
 Sent: Friday, December 14, 2012 1:03 AM
 To: Jan Hubicka
 Cc: Jakub Jelinek; Xinliang David Li; GCC Patches; Teresa Johnson; 
 Melik-adamyan, Areg
 Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

 On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka hubi...@ucw.cz wrote:
  Here we speak about memcpy/memset only.  I never got around to
  modernize strlen and friends, unfortunately...
 
  memcmp and friends are different beasts.  They really need some TLC...

 memcpy and memset in glibc are also extremely fast.

 The default strategy now is to inline only when the block is known to
 be small (either constant or via profile feedback, we do not really
 use the info on upper bound of size of the copied object that would be
 useful but not readily available at expansion time).

 You can try the test_stringop script I attached and send me the
 results.  For

 Areg, can you give it a try?  Thanks.


Hi Areg,

Did you mean inlined memcpy/memset are as fast as
the ones in libc.so on both ia32 and Intel64?

Please keep in mind that memcpy/memset in libc.a
may not be optimized.  You must not use -static for
linking.

-- 
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread Jan Hubicka
 Hi Areg,
 
 Did you mean inlined memcpy/memset are as fast as
 the ones in libc.so on both ia32 and Intel64?

I would be interested in output of the stringop script.
 
 Please keep in mind that memcpy/memset in libc.a
 may not be optimized.  You must not use -static for
 linking.

In my setup I use dynamic linking...
(this is quite an annoying property in general - people tend to use --static for
performance critical binaries to save the expense of PIC.  It would be really cool
to have a way to call the proper stringops based on the -march switch)

Honza
 
 -- 
 H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread Jan Hubicka
  Hi Areg,
  
  Did you mean inlined memcpy/memset are as fast as
  the ones in libc.so on both ia32 and Intel64?
 
 I would be interested in output of the stringop script.

Also, as far as I can remember, none of the spec2k6 benchmarks is really stringop
bound.  On Spec2k, GCC was quite bound by memset (within alloc_rtx and bitmap
operations) but mostly by collecting page faults there.  Inlining that one made
quite a lot of difference on K8 hardware, but not on later chips.

Honza


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-20 Thread H.J. Lu
On Thu, Dec 20, 2012 at 7:06 AM, Jan Hubicka hubi...@ucw.cz wrote:
  Hi Areg,
 
  Did you mean inlined memcpy/memset are as fast as
  the ones in libc.so on both ia32 and Intel64?

 I would be interested in output of the stringop script.

 Also as far as I can remember, none of spec2k6 benchmarks is really stringop
 bound.  On Spec2k GCC was quite bound by memset (within alloc_rtx and bitmap
 operations) but mostly by collecting page faults there.  Inlining that one made
 quite a lot of difference on K8 hardware, but not on later chips.


There is a GCC performance regression bug on EEMBC.  It turns out
that -static was used for linking and optimized memory functions weren't
used.  Removing -static fixed the performance regression.

-- 
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread Jan Hubicka
 Try the following one. 1) -minline-all-stringops
 -mstringop-strategy=rep_8byte -O2 vs 2) -mstringop-strategy=libcall
 -O2.
 
 David
 
 
 #include <string.h>
 #include <stdio.h>
 #include <stdlib.h>
 #ifndef LEN
 #define LEN 16
 #endif
 
 void copy(char* s1, char* s2,int len) __attribute__((noinline));
 void copy(char* s1, char* s2,int len)
 {
memcpy(s2,s1,len);
 }

I guess the catch here is that you force the copy to be noinline and thus you
eliminate the benefits of inlined sequence.  With inline stringop one saves
regalloc and often can get rid of the alignment tests.

This is script I use to tune the tables.

Honza

test()
{
rm -f a.out
cat <<END | $1 -x c -O3 $3 -DAVG_SIZE=$2 $STRINGOP -DMEMORY_COPIES=$memsize -
#define BUFFER_SIZE (16*1024*1024 + AVG_SIZE*2)
/*#define MEMORY_COPIES (1024*1024*64*(long long)10)*/
$type t[BUFFER_SIZE];
main()
{
  unsigned int i;
  for (i=0;i<((long long)MEMORY_COPIES + AVG_SIZE * 2 - 1)/AVG_SIZE*2;i++)
#ifdef test_memset
__builtin_memset (t+(i*1024*1024+i*1)%(BUFFER_SIZE - AVG_SIZE*2), i, 
(AVG_SIZE + i) % (AVG_SIZE * 2 + 0));
#else
__builtin_memcpy (t+(i*1024*1024+i*1)%(BUFFER_SIZE - AVG_SIZE*2), 
t+((i+1)*1024*1024*4+i*1)%(BUFFER_SIZE - AVG_SIZE *2), (AVG_SIZE + i) % 
(AVG_SIZE * 2 + 0));
#endif
  return 0;
}
END
TIME=`/usr/bin/time -f %E ./a.out 2>&1`
echo -n " $TIME"
echo $TIME $4 >> /tmp/accum
}

testrow()
{
echo -n > /tmp/accum
printf "block size %7i" $3
test $2 $3 -mstringop-strategy=libcall libcall
test $2 $3 -mstringop-strategy=rep_byte -malign-stringops rep1
test $2 $3 -mstringop-strategy=rep_byte -mno-align-stringops rep1noalign
test $2 $3 -mstringop-strategy=rep_4byte -malign-stringops rep4
test $2 $3 -mstringop-strategy=rep_4byte -mno-align-stringops rep4noalign
if [ $mode == 64 ]
then
test $2 $3 -mstringop-strategy=rep_8byte -malign-stringops rep8
test $2 $3 -mstringop-strategy=rep_8byte -mno-align-stringops rep8noalign
fi
test $2 $3 -mstringop-strategy=loop -malign-stringops  loop
test $2 $3 -mstringop-strategy=loop -mno-align-stringops  loopnoalign
test $2 $3 -mstringop-strategy=unrolled_loop -malign-stringops unrl
test $2 $3 -mstringop-strategy=unrolled_loop -mno-align-stringops 
unrlnoalign
test $2 $3 -mstringop-strategy=sse_loop -malign-stringops sse
test $2 $3 -mstringop-strategy=sse_loop -mno-align-stringops -msse2 
ssenoalign
test $2 $3 -mstringop-strategy=byte_loop byte
best=`cat /tmp/accum | sort | head -1`
test $2 $3  -fprofile-generate >/dev/null 2>&1
test $2 $3  -fprofile-use
test $2 $3  -minline-stringops-dynamically
echo  best: $best
}

test_all_sizes()
{
if [ $mode == 64 ]
then
echolibcall   rep1   noalgrep4   noalgrep8   noalg 
   loop   noalgunrl   noalgsse   noalgbyte profiled dynamic
else
echolibcall   rep1   noalgrep4   noalgloop   noalg 
   unrl   noalgssenoalgbyte profiled dynamic
fi
#for size in 1 2 3 4 6 8 10 12 14 16 24 32 48 64 128 256 512 1024 4096 8192 
81920 819200 8192000
#for size in 8192000 819200 81920 8192 4096 2048 1024 512 256 128 64 48 32 24 
16 14 12 10 8 6 5 4 3 2 1
for size in 8192000 819200 81920 20480 8192 4096 2048 1024 512 256 128 64 48 32 
24 16 14 12 10 8 6 4 1
#for size in 128 256 1024 4096 8192 81920 819200
do
testrow $1 $2 $size
done
}

mode=$1
shift
export memsize=$1
shift
cmdline=$*
if [ $mode != 32 ]
then
  if [ $mode != 64 ]
  then
echo Usage:
echo test_stringop mode size cmdline
echo mode is either 32 or 64
echo size is amount of memory copied in each test.  Should be chosen small 
enough so runtime is less than minute for each test and sorting works
echo Example: test_stringop 32 64000 ./xgcc -B ./ -march=pentium3
exit
  fi
fi
echo memcpy mode:$mode size:$memsize
export STRINGOP=
type=char
test_all_sizes $mode $cmdline -m$mode
echo Aligned
type=long
test_all_sizes $mode $cmdline -m$mode
echo memset
type=char
export STRINGOP=-Dtest_memset=1
test_all_sizes $mode $cmdline -m$mode
echo Aligned
type=long
test_all_sizes $mode $cmdline -m$mode


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread H.J. Lu
On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek ja...@redhat.com wrote:
 On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
 On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka hubi...@ucw.cz wrote:
   libcall is not faster up to 8KB to rep sequence that is better for 
   regalloc/code
   cache than a fully blown function call.
 
  Be careful with this. My recollection is that REP sequence is good for
  any size -- for smaller size, the REP initial set up cost is too high
  (10s of cycles), while for large size copy, it is less efficient
  compared with library version.
 
  Well this is based on the data from the memtest script.
  Core has good REP implementation - it is a win from rather small blocks (16
  bytes if I recall) and it does not need alignment.
  Library version starts to be interesting with caching hints, but I think 
  till 80KB
  it is still not a win for my setup (glibc-2.15)

 A simple test shows that -mstringop-strategy=libcall always beats
 -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
 smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
 Can you share your memtest ?

 I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
 libcall.  The PLT call overhead is simply too high.
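For illustration, the kind of call being argued about (a sketch; the cost notes in the comment paraphrase the points above, they are not measurements):

#include <string.h>

/* A fixed 16-byte copy.  Expanded inline this is just two 8-byte (or one
   16-byte SSE) load/store pairs; as a libcall it also pays for argument
   setup, the call through the PLT and the callee's size/alignment dispatch.  */
void
copy16 (void *d, const void *s)
{
  memcpy (d, s, 16);
}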


The x86 string/memory functions in the current glibc are
extremely fast and tuned for Core 2/Core i7.  GCC is having
a very hard time to beat them with inlining:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

-- 
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread Jan Hubicka
 On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek ja...@redhat.com wrote:
  On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
  On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka hubi...@ucw.cz wrote:
libcall is not faster up to 8KB to rep sequence that is better for 
regalloc/code
 cache than a fully blown function call.
  
   Be careful with this. My recollection is that REP sequence is good for
   any size -- for smaller size, the REP initial set up cost is too high
   (10s of cycles), while for large size copy, it is less efficient
   compared with library version.
  
   Well this is based on the data from the memtest script.
   Core has good REP implementation - it is a win from rather small blocks 
   (16
   bytes if I recall) and it does not need alignment.
   Library version starts to be interesting with caching hints, but I think 
   till 80KB
   it is still not a win for my setup (glibc-2.15)
 
  A simple test shows that -mstringop-strategy=libcall always beats
  -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
  smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
  Can you share your memtest ?
 
  I can't believe that say 16 byte or 32 byte memcpy can be ever faster using 
  a
  libcall.  The PLT call overhead is simply too high.
 
 
 The x86 string/memory functions in the current glibc are
 extremely fast and tuned for Core 2/Core i7.  GCC is having
 a very hard time to beat them with inlining:
 
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Here we speak about memcpy/memset only.  I never got around to modernize
strlen and friends, unfortunately...

memcmp and friends are different beasts.  They really need some TLC...

Honza


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread H.J. Lu
On Thu, Dec 13, 2012 at 12:26 PM, Jan Hubicka hubi...@ucw.cz wrote:
 On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek ja...@redhat.com wrote:
  On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
  On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka hubi...@ucw.cz wrote:
libcall is not faster up to 8KB to rep sequence that is better for 
regalloc/code
  cache than a fully blown function call.
  
   Be careful with this. My recollection is that REP sequence is good for
   any size -- for smaller size, the REP initial set up cost is too high
   (10s of cycles), while for large size copy, it is less efficient
   compared with library version.
  
   Well this is based on the data from the memtest script.
   Core has good REP implementation - it is a win from rather small blocks 
   (16
   bytes if I recall) and it does not need alignment.
   Library version starts to be interesting with caching hints, but I 
   think till 80KB
   it is still not a win for my setup (glibc-2.15)
 
  A simple test shows that -mstringop-strategy=libcall always beats
  -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
  smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
  Can you share your memtest ?
 
  I can't believe that say 16 byte or 32 byte memcpy can be ever faster 
  using a
  libcall.  The PLT call overhead is simply too high.
 

 The x86 string/memory functions in the current glibc are
 extremely fast and tuned for Core 2/Core i7.  GCC is having
 a very hard time to beat them with inlining:

 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

 Here we speak about memcpy/memset only.  I never got around to modernize
 strlen and friends, unfortunately...

 memcmp and friends are different beasts.  They really need some TLC...

memcpy and memset in glibc are also extremely fast.


-- 
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread Jan Hubicka
  Here we speak about memcpy/memset only.  I never got around to modernize
  strlen and friends, unfortunately...
 
  memcmp and friends are different beasts.  They really need some TLC...
 
 memcpy and memset in glibc are also extremely fast.

The default strategy now is to inline only when the block is known to be small
(either constant or via profile feedback, we do not really use the info on
upper bound of the size of the copied object that would be useful but not readily
available at expansion time).
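A minimal sketch of what that size-based decision looks like from the source side (the struct and function names are made up for illustration, and the comments restate the strategy described above rather than verified code generation):

#include <string.h>

struct pkt { char hdr[16]; };        /* hypothetical type, for illustration */

void
copy_fixed (struct pkt *d, const struct pkt *s)
{
  memcpy (d, s, sizeof *s);          /* known-small constant size: inlined */
}

void
copy_var (char *d, const char *s, size_t n)
{
  memcpy (d, s, n);                  /* size unknown at expansion time: falls
                                        back to the libcall/rep strategy from
                                        the cost tables, or whatever
                                        -mstringop-strategy= forces */
}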

You can try the test_stringop script I attached and send me the results.  For
me libc starts to be a win only for rather large blocks (i.e. 8KB)

Honza


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread H.J. Lu
On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka hubi...@ucw.cz wrote:
  Here we speak about memcpy/memset only.  I never got around to modernize
  strlen and friends, unfortunately...
 
  memcmp and friends are different beasts.  They really need some TLC...

 memcpy and memset in glibc are also extremely fast.

 The default strategy now is to inline only when the block is known to be small
 (either constant or via profile feedback, we do not really use the info on
 upper bound of the size of the copied object that would be useful but not readily
 available at expansion time).

 You can try the test_stringop script I attached and send me the results.  For

Areg, can you give it a try?  Thanks.

 me libc starts to be a win only for rather large blocks (i.e. 8KB)


Which glibc are you using?

-- 
H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-13 Thread Jan Hubicka
  me libc starts to be a win only for rather large blocks (i.e. 8KB)
 
 
 Which glibc are you using?

2.15 as it comes with opensuse 12.2

Honza
 
 -- 
 H.J.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Richard Biener
On Tue, Dec 11, 2012 at 11:53 PM, Xinliang David Li davi...@google.com wrote:
 The following is the O2 size data from SPEC2k.  Note that with push/pop,
 it is always a net win (negative delta) in terms of total binary or
 total loadable section size.

Thanks for the data!

Richard.

 thanks,

 David

              .text  .eh_frame  Total_binary
 vortex-move 440252 40796 584066
 vortex-push 415436 57452 575906
 delta-5.6% 40.8%  -1.397%

 twolf-move 169324 10748 223521
 twolf-push 168876 11124 223449
 delta   -0.3% 3.5% -0.032%

 gzip-move 30668 3652 374399
 gzip-push 30524 3740 374343
 delta -0.5% 2.4% -0.015%

 bzip2-move 22748 3196 111616
 bzip2-push 22636 3284 111592
 delta  -0.5% 2.8% -0.022%

 vpr-move 104684 9380 147378
 vpr-push 104236 9788 147338
 delta -0.4% 4.3% -0.027%

 mcf-move 8444 1244 26760
 mcf-push 8444 1244 26760
 delta0.0% 0.0% 0.000%

 cc1-move 1093964 90772 1576994
 cc1-push 1078988 104068 1575314
 delta  -1.4% 14.6% -0.107%

 crafty-move 130556 5508 1256037
 crafty-push 130236 5772 1255981
 delta-0.2% 4.8% -0.004%

 eon-move 333660 33220 516491
 eon-push 330140 35812 51
 delta -1.1% 7.8% -0.181%

 gap-move 404092 46732 1457735
 gap-push 396012 53180 1456103
 delta -2.0% 13.8% -0.112%

 perlbmk-move 456572 45324 618585
 perlbmk-push 449516 52340 618545
 delta -1.5% 15.5% -0.006%

 parser-move 81244 15788 334003
 parser-push 80684 16332 333987
 delta   -0.7% 3.4% -0.005%


 On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li davi...@google.com wrote:
 On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
 richard.guent...@gmail.com wrote:
 On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump mikest...@comcast.net wrote:
 On Dec 10, 2012, at 12:42 PM, Xinliang David Li davi...@google.com wrote:
 I have not measured the CFI size impact -- but conceivably it should
 be larger -- which is unfortunate.

 Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
 dwarf 5 fix it!

 Well, unlike debug info, CFI data has to be in memory to make
 unwinding work.
 These days most Linux distributions enable asynchronous unwind tables, so any
 size savings due to shorter push/pop epilogue/prologue sequences have to be
 offset by the increase in CFI data.  I'm not sure there is really a
 speed difference
 between both variants (well, maybe due to the better icache footprint of
 the push/pop
 variant).

 Yes, for large applications, this can be crucial to performance.


 That said - I'd prefer to have more data on this before making the switch 
 for
 the generic model.  What was your original motivation?  Just theory or was
 it a real case?

 1) some of the very large internal apps I measured benefit from this
 change (in terms of performance)
 2) both ICC and LLVM do the same.

 I have already committed the patch. I will find some time to collect
 more size data and post it later.

 thanks,

 David



 Thanks,
 Richard.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Jan Hubicka
 I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
 SP adjustment instead of a sequence of pushes/pops. The preference for
 MOVs is good for old CPU micro-architectures (before pentium-4,
 K10), because it breaks the data dependency.  In modern
 micro-architecture, push/pop is implemented using a mechanism called
 stack engine. The data dependency is removed by the hardware, and
 push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
 smaller. There is no longer the need to avoid using them.   This is
 also what ICC does.
 
 The following patch fixed the problem. It passes bootstrap/regression
 test. OK to install?
 
 thanks,
 
 David
 
 Index: config/i386/i386.c
 ===
 --- config/i386/i386.c (revision 194324)
 +++ config/i386/i386.c (working copy)
 @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
/* X86_TUNE_PROLOGUE_USING_MOVE */
 -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
 +  m_PPRO | m_ATHLON_K8,
 
/* X86_TUNE_EPILOGUE_USING_MOVE */
 -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
 +  m_PPRO | m_ATHLON_K8,

Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
is gone from generic (in fact I had similar patch pending).
Are you sure about Atom having stack engine, too?

Related thing is accumulate_outgoing_args. Igor is testing it on Core and I will
give it a try on K10.

Honza

I am attaching the changes for core costs I made if someone is interested in
testing them.  If we can declare P4/PPro and maybe K8 chips obsolete for
generic, there is room for improvement in generic, too. Like using inc/dec
again.

Honza

Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 194452)
+++ config/i386/i386.c  (working copy)
@@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
   COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
-  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
+  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
   {-1, libcall, false}}}},
   {{libcall, {{6, loop_1_byte, true},
   {24, loop, true},
   {8192, rep_prefix_4_byte, true},
   {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
   {-1, libcall, false}}}},
   1,   /* scalar_stmt_cost.  */
   1,   /* scalar load_cost.  */
@@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO,
 
   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
-  m_CORE2I7 | m_GENERIC,
+  m_GENERIC | m_CORE2,
 
   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
* on 16-bit immediate moves into memory on Core2 and Corei7.  */
@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,
 
   /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
 
   /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
   m_PENT4,
@@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
   m_COREI7 | m_BDVER,
 
   /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
-  m_BDVER ,
+  m_BDVER,
 
   /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
  are resolved on SSE register parts instead of whole registers, so we may
@@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
   m_ATHLON_K8,
 
   /* X86_TUNE_SSE_TYPELESS_STORES */
-  m_AMD_MULTIPLE,
+  m_AMD_MULTIPLE | m_CORE2I7, /**/
 
   /* X86_TUNE_SSE_LOAD0_BY_PXOR */
-  m_PPRO | m_P4_NOCONA,
+  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/
 
   /* X86_TUNE_MEMORY_MISMATCH_STALL */
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
@@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
  than 4 branch instructions in the 16 byte window.  */
-  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
+  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_SCHEDULE */
   m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | 
m_GENERIC,
@@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
   m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_USE_INCDEC */
-  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
+  ~(m_P4_NOCONA | m_ATOM | 

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka hubi...@ucw.cz wrote:
 I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
 SP adjustment instead of a sequence of pushes/pops. The preference for
 MOVs is good for old CPU micro-architectures (before pentium-4,
 K10), because it breaks the data dependency.  In modern
 micro-architecture, push/pop is implemented using a mechanism called
 stack engine. The data dependency is removed by the hardware, and
 push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
 smaller. There is no longer the need to avoid using them.   This is
 also what ICC does.

 The following patch fixed the problem. It passes bootstrap/regression
 test. OK to install?

 thanks,

 David

 Index: config/i386/i386.c
 ===
 --- config/i386/i386.c (revision 194324)
 +++ config/i386/i386.c (working copy)
 @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

/* X86_TUNE_PROLOGUE_USING_MOVE */
 -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
 +  m_PPRO | m_ATHLON_K8,

/* X86_TUNE_EPILOGUE_USING_MOVE */
 -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
 +  m_PPRO | m_ATHLON_K8,

 Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
 is gone from generic (in fact I had similar patch pending).
 Are you sure about Atom having stack engine, too?


Good question. The instruction latency table
(http://www.agner.org/optimize/instruction_tables.pdf) shows that for
Atom: push r has 1 uop, 1 cycle latency. However the instruction is
not pairable, which will affect ILP. The guide here,
http://www.agner.org/optimize/microarchitecture.pdf, does not mention
Atom having a stack engine either.

I will help collect some performance data on Atom.


thanks,

David


 Related thing is accumulate_outgoing_args. Igor is testing it on Core and I 
 will
 give it a try on K10.

 Honza

 I am attaching the changes for core costs I made if someone is interested in
 testing them.  If we can declare P4/PPro and maybe K8 chips obsolete for
 generic, there is room for improvement in generic, too. Like using inc/dec
 again.

 Honza

 Index: config/i386/i386.c
 ===
 --- config/i386/i386.c  (revision 194452)
 +++ config/i386/i386.c  (working copy)
 @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
 -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
 -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
 +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
 +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
{-1, libcall, false}}}},
{{libcall, {{6, loop_1_byte, true},
{24, loop, true},
{8192, rep_prefix_4_byte, true},
{-1, libcall, false}}},
 -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
 +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
{-1, libcall, false}}}},
1,   /* scalar_stmt_cost.  */
1,   /* scalar load_cost.  */
 @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
m_PPRO,

/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
 -  m_CORE2I7 | m_GENERIC,
 +  m_GENERIC | m_CORE2,

/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
 * on 16-bit immediate moves into memory on Core2 and Corei7.  */
 @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
m_K6,

/* X86_TUNE_USE_CLTD */
 -  ~(m_PENT | m_ATOM | m_K6),
 +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
m_PENT4,
 @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
m_COREI7 | m_BDVER,

/* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
 -  m_BDVER ,
 +  m_BDVER,

/* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and 
 dependencies
   are resolved on SSE register parts instead of whole registers, so we may
 @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
m_ATHLON_K8,

/* X86_TUNE_SSE_TYPELESS_STORES */
 -  m_AMD_MULTIPLE,
 +  m_AMD_MULTIPLE | m_CORE2I7, /**/

/* X86_TUNE_SSE_LOAD0_BY_PXOR */
 -  m_PPRO | m_P4_NOCONA,
 +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/

/* X86_TUNE_MEMORY_MISMATCH_STALL */
m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe

/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
   than 4 

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
Honza, can you explain each change and point to the reference?

thanks,

David

On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka hubi...@ucw.cz wrote:
 I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
 SP adjustment instead of a sequence of pushes/pops. The preference for
 MOVs is good for old CPU micro-architectures (before pentium-4,
 K10), because it breaks the data dependency.  In modern
 micro-architecture, push/pop is implemented using a mechanism called
 stack engine. The data dependency is removed by the hardware, and
 push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
 smaller. There is no longer the need to avoid using them.   This is
 also what ICC does.

 The following patch fixed the problem. It passes bootstrap/regression
 test. OK to install?

 thanks,

 David

 Index: config/i386/i386.c
 ===
 --- config/i386/i386.c (revision 194324)
 +++ config/i386/i386.c (working copy)
 @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

/* X86_TUNE_PROLOGUE_USING_MOVE */
 -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
 +  m_PPRO | m_ATHLON_K8,

/* X86_TUNE_EPILOGUE_USING_MOVE */
 -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
 +  m_PPRO | m_ATHLON_K8,

 Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
 is gone from generic (in fact I had similar patch pending).
 Are you sure about Atom having stack engine, too?

 Related thing is accumulate_outgoing_args. Igor is testing it on Core and I 
 will
 give it a try on K10.

 Honza

 I am attaching the changes for core costs I made if someone is interested in
 testing them.  If we can declare P4/PPro and maybe K8 chips obsolete for
 generic, there is room for improvement in generic, too. Like using inc/dec
 again.

 Honza

 Index: config/i386/i386.c
 ===
 --- config/i386/i386.c  (revision 194452)
 +++ config/i386/i386.c  (working copy)
 @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
 -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
 -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
 +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
 +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
{-1, libcall, false}}}},
{{libcall, {{6, loop_1_byte, true},
{24, loop, true},
{8192, rep_prefix_4_byte, true},
{-1, libcall, false}}},
 -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
 +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
{-1, libcall, false}}}},
1,   /* scalar_stmt_cost.  */
1,   /* scalar load_cost.  */
 @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
m_PPRO,

/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
 -  m_CORE2I7 | m_GENERIC,
 +  m_GENERIC | m_CORE2,

/* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
 * on 16-bit immediate moves into memory on Core2 and Corei7.  */
 @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
m_K6,

/* X86_TUNE_USE_CLTD */
 -  ~(m_PENT | m_ATOM | m_K6),
 +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
m_PENT4,
 @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
m_COREI7 | m_BDVER,

/* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
 -  m_BDVER ,
 +  m_BDVER,

/* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and 
 dependencies
   are resolved on SSE register parts instead of whole registers, so we may
 @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
m_ATHLON_K8,

/* X86_TUNE_SSE_TYPELESS_STORES */
 -  m_AMD_MULTIPLE,
 +  m_AMD_MULTIPLE | m_CORE2I7, /**/

/* X86_TUNE_SSE_LOAD0_BY_PXOR */
 -  m_PPRO | m_P4_NOCONA,
 +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/

/* X86_TUNE_MEMORY_MISMATCH_STALL */
m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe

/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
   than 4 branch instructions in the 16 byte window.  */
 -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

/* X86_TUNE_SCHEDULE */
m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | 
 m_GENERIC,
 @@ -1947,10 +1947,10 @@ 

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Jan Hubicka
Concerning 1 push per cycle, I think it is the same as K7 hardware did, so a move
prologue should be a win.
  Index: config/i386/i386.c
  ===
  --- config/i386/i386.c  (revision 194452)
  +++ config/i386/i386.c  (working copy)
  @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
 COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
 COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
 COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
  -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
  -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
  +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
  +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
  {-1, libcall, false}}}},
 {{libcall, {{6, loop_1_byte, true},
 {24, loop, true},
 {8192, rep_prefix_4_byte, true},
 {-1, libcall, false}}},
  -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
  +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},

Up to 8KB a libcall is not faster than the rep sequence, which is better for regalloc/code
cache than a fully blown function call.
  @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
 m_PPRO,
 
 /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
  -  m_CORE2I7 | m_GENERIC,
  +  m_GENERIC | m_CORE2,

This disables shifts that store just some flags. According to Agner's manual, I7 
handles
this well.

Partial flags stall
The Sandy Bridge uses the method of an extra µop to join partial registers not 
only for
general purpose registers but also for the flags register, unlike previous 
processors which
used this method only for general purpose registers. This occurs when a write 
to a part of
the flags register is followed by a read from a larger part of the flags 
register. The partial
flags stall of previous processors (See page 75) is therefore replaced by an 
extra µop. The 
Sandy Bridge also generates an extra µop when reading the flags after a rotate 
instruction.

This is cheaper than the 7 cycle delay on Core that this flag is trying to avoid.
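As a concrete example of the pattern this tunable steers (illustrative only, codegen not verified):

/* A shift whose result feeds a condition.  Branching directly on the flags
   left by the shift risks the partial-flag merge described above; with
   X86_TUNE_PARTIAL_FLAG_REG_STALL set for a CPU, the compiler prefers to
   emit a separate test/compare instead.  */
int
shift_nonzero (unsigned x, unsigned n)
{
  return (x >> n) != 0;
}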
 
 /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
  * on 16-bit immediate moves into memory on Core2 and Corei7.  */
  @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
 m_K6,
 
 /* X86_TUNE_USE_CLTD */
  -  ~(m_PENT | m_ATOM | m_K6),
  +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

None of the CPUs that generic cares about are !USE_CLTD now after your change.
  @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
 m_ATHLON_K8,
 
 /* X86_TUNE_SSE_TYPELESS_STORES */
  -  m_AMD_MULTIPLE,
  +  m_AMD_MULTIPLE | m_CORE2I7, /**/

Hmm, I cannot seem to find this in the manual now, but I believe that stores also 
do not care about type,
so a movaps store is preferred over a movapd store because it is shorter.  If not, this 
change should
produce a lot of slowdowns.
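For illustration (not verified output), the kind of store where the shorter encoding matters:

#include <emmintrin.h>

/* An aligned 16-byte store of double data.  movaps and movapd move the same
   bytes, but movaps has no 66 prefix and is one byte shorter, which is what
   X86_TUNE_SSE_TYPELESS_STORES exploits when the type can be ignored.  */
void
store_pair (double *p, __m128d v)
{
  _mm_store_pd (p, v);
}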
 
 /* X86_TUNE_SSE_LOAD0_BY_PXOR */
  -  m_PPRO | m_P4_NOCONA,
  +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/

Agner:
A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
Core2 and Nehalem processors recognize that certain instructions are 
independent of the
prior value of the register if the source and destination registers are the 
same.

This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, 
XORPD, and all
variants of PSUBxxx and PCMPxxx except PCMPEQQ.
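A small illustrative case (instruction selection not verified) of the zeroing this flag controls:

/* Materializing 0.0 in an SSE register.  The tunable picks the zeroing idiom
   (pxor vs. xorps/xorpd); per the excerpt above, Core2/Nehalem treat all of
   them as dependency-breaking, so the choice is mostly about encoding and
   decode.  */
double
zero (void)
{
  return 0.0;
}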
 
 /* X86_TUNE_MEMORY_MISMATCH_STALL */
 m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
  @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
 
 /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
than 4 branch instructions in the 16 byte window.  */
  -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
  +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

This is a special pass to handle limitations of AMD's K7/K8/K10 branch 
prediction.
Intel never had a similar design, so this flag is pointless.

We apparently ought to disable it for K10, at least per Agner's manual.
 
 /* X86_TUNE_SCHEDULE */
 m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | 
  m_GENERIC,
  @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
 m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
 /* X86_TUNE_USE_INCDEC */
  -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
  +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),

Skipping inc/dec is to avoid partial flag stall happening on P4 only.
 
 /* X86_TUNE_PAD_RETURNS */
  -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
  +  m_AMD_MULTIPLE | m_GENERIC,

Again this deals specifically with AMD K7/K8/K10 branch prediction.  I am not 
even
sure this should be enabled for K10.
 
 /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
 

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Andi Kleen
Jan Hubicka hubi...@ucw.cz writes:

 libcall is not faster up to 8KB to rep sequence that is better for 
 regalloc/code
 cache than a fully blown function call.

I noticed btw that some of the generated string instructions are slower 
than just calling the C library.

rep scasb etc. is rarely a win over an optimized library function,
it's not very optimized. Perhaps those patterns should just be disabled.
The way to optimize that on modern CPUs is to use PCMP*STR*, but that's
quite a bit more complicated and has some constraints.


 /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
than 4 branch instructions in the 16 byte window.  */
  -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
  +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

 This is a special pass to handle limitations of AMD's K7/K8/K10 branch 
 prediction.
 Intel never had similar design, so this flag is pointless.

Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
16 byte window. If you exceed that it falls back to running 
the full decoder from the normal icache.

I don't have solid data, but it may be a win for frontend limited
code (otherwise possibly more in power than performance)

I would revisit that for Sandy Bridge

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Jan Hubicka
 Jan Hubicka hubi...@ucw.cz writes:
 
  libcall is not faster up to 8KB to rep sequence that is better for 
  regalloc/code
  cache than a fully blown function call.
 
 I noticed btw that some of the generated string instructions are slower 
 than just calling the C library.
 
 rep scasb etc. is rarely a win over an optimized library function,
 it's not very optimized. Perhaps those patterns should just be disabled.
 The way to optimize that on modern CPUs is to use PCMP*STR*, but that's
 quite a bit more complicated and has some constraints.

This is only about memset/memcpy expanding.  The other sequences are quite lame 
indeed...
 
 
  /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict 
   more
 than 4 branch instructions in the 16 byte window.  */
   -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | 
   m_GENERIC,
   +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
  This is a special pass to handle limitations of AMD's K7/K8/K10 branch 
  prediction.
  Intel never had similar design, so this flag is pointless.
 
 Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
 16 byte window. If you exceed that it falls back to running 
 the full decoder from the normal icache.
 
 I don't have solid data, but it may be a win for frontend limited
 code (otherwise possibly more in power than performance)
 
 I would revisit that for Sandy Bridge

We are not particularly good at avoiding the branches - basically the code 
inserts alignment
whenever it thinks the 4 consecutive branches fit in the window.
I can make patch to change this to 3 and we can see if it helps at all.
 
 -Andi
 -- 
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Andi Kleen
Andi Kleen a...@firstfloor.org writes:

 /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict 
  more
than 4 branch instructions in the 16 byte window.  */
  -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
  +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

 This is a special pass to handle limitations of AMD's K7/K8/K10 branch 
 prediction.
 Intel never had similar design, so this flag is pointless.

 Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
 16 byte window.

Actually it's four per 32bytes, sorry.

Here's an old patch I had lying around to optimize for that.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1b871be..9b57316 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2713,6 +2713,7 @@ ix86_target_string (HOST_WIDE_INT isa, int flags, const 
char *arch,
 { "-mavx256-split-unaligned-load", MASK_AVX256_SPLIT_UNALIGNED_LOAD},
 { "-mavx256-split-unaligned-store", MASK_AVX256_SPLIT_UNALIGNED_STORE},
 { "-mprefer-avx128",   MASK_PREFER_AVX128},
+{ "-mjump-pad-32bytes", MASK_JUMP_PAD_32BYTES},
   };
 
   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
@@ -32182,6 +32183,7 @@ ix86_avoid_jump_mispredicts (void)
   rtx insn, start = get_insns ();
   int nbytes = 0, njumps = 0;
   int isjump = 0;
+  int jump_pad_window_size = TARGET_JUMP_PAD_32BYTES ? 32 : 16;
 
   /* Look for all minimal intervals of instructions containing 4 jumps.
  The intervals are bounded by START and INSN.  NBYTES is the total
@@ -32202,8 +32204,8 @@ ix86_avoid_jump_mispredicts (void)
  int align = label_to_alignment (insn);
  int max_skip = label_to_max_skip (insn);
 
- if (max_skip > 15)
-   max_skip = 15;
+ if (max_skip > jump_pad_window_size - 1)
+   max_skip = jump_pad_window_size - 1;
 /* If align > 3, only up to 16 - max_skip - 1 bytes can be
 already in the current 16 byte page, because otherwise
 ASM_OUTPUT_MAX_SKIP_ALIGN could skip max_skip or fewer
@@ -32216,7 +32218,7 @@ ix86_avoid_jump_mispredicts (void)
 INSN_UID (insn), max_skip);
  if (max_skip)
{
- while (nbytes + max_skip >= 16)
+ while (nbytes + max_skip >= jump_pad_window_size)
{
  start = NEXT_INSN (start);
  if ((JUMP_P (start)
@@ -32262,10 +32264,11 @@ ix86_avoid_jump_mispredicts (void)
fprintf (dump_file, "Interval %i to %i has %i bytes\n",
 INSN_UID (start), INSN_UID (insn), nbytes);
 
-  if (njumps == 3 && isjump && nbytes < 16)
+  if (njumps == 3 && isjump && nbytes < jump_pad_window_size)
{
- int padsize = 15 - nbytes + min_insn_size (insn);
-
+ int padsize = jump_pad_window_size - 1 - nbytes + 
+   min_insn_size (insn);
+ 
  if (dump_file)
fprintf (dump_file, "Padding insn %i by %i bytes!\n",
 INSN_UID (insn), padsize);
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 6c516e7..b38d163 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -223,6 +223,10 @@ mintel-syntax
Target Undocumented Alias(masm=, intel, att) Warn(%<-mintel-syntax%> and 
%<-mno-intel-syntax%> are deprecated; use %<-masm=intel%> and %<-masm=att%> 
instead)
 ;; Deprecated
 
+mjump-pad-32bytes
+Target RejectNegative Mask(JUMP_PAD_32BYTES) Save
+Avoid more than 4 jumps in each 32byte code window.
+
 mms-bitfields
 Target Report Mask(MS_BITFIELD_LAYOUT) Save
 Use native (MS) bitfield layout


-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka hubi...@ucw.cz wrote:
 Concerning 1push per cycle, I think it is same as K7 hardware did, so move
 prologue should be a win.
  Index: config/i386/i386.c
  ===
  --- config/i386/i386.c  (revision 194452)
  +++ config/i386/i386.c  (working copy)
  @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
 COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
 COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
 COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
  -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
  -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
  +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
  +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
  {-1, libcall, false}}}},
 {{libcall, {{6, loop_1_byte, true},
 {24, loop, true},
 {8192, rep_prefix_4_byte, true},
 {-1, libcall, false}}},
  -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
  +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},

 libcall is not faster up to 8KB to rep sequence that is better for 
 regalloc/code
 cache than a fully blown function call.

Be careful with this. My recollection is that REP sequence is good for
any size -- for smaller size, the REP initial set up cost is too high
(10s of cycles), while for large size copy, it is less efficient
compared with library version.


  @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
 m_PPRO,
 
 /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
  -  m_CORE2I7 | m_GENERIC,
  +  m_GENERIC | m_CORE2,

 This disables shifts that store just some flags. According to Agner's manual, I7 
 handles
 this well.


ok.

 Partial flags stall
 The Sandy Bridge uses the method of an extra µop to join partial registers 
 not only for
 general purpose registers but also for the flags register, unlike previous 
 processors which
 used this method only for general purpose registers. This occurs when a write 
 to a part of
 the flags register is followed by a read from a larger part of the flags 
 register. The partial
 flags stall of previous processors (See page 75) is therefore replaced by an 
 extra µop. The
 Sandy Bridge also generates an extra µop when reading the flags after a 
 rotate instruction.

 This is cheaper than the 7 cycle delay on Core that this flag is trying to avoid.

ok.

 
 /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
  * on 16-bit immediate moves into memory on Core2 and Corei7.  */
  @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
 m_K6,
 
 /* X86_TUNE_USE_CLTD */
  -  ~(m_PENT | m_ATOM | m_K6),
  +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

My change was to enable CLTD for generic. Is your change intended to
revert that?


 None of the CPUs that generic cares about are !USE_CLTD now, after your change.
  @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
 m_ATHLON_K8,
 
 /* X86_TUNE_SSE_TYPELESS_STORES */
  -  m_AMD_MULTIPLE,
  +  m_AMD_MULTIPLE | m_CORE2I7, /**/

 Hmm, I cannot seem to find this in the manual now, but I believe that stores are also
 typeless, so a movaps store is preferred over a movapd store because it is shorter.
 If not, this change should produce a lot of slowdowns.
 
 /* X86_TUNE_SSE_LOAD0_BY_PXOR */
  -  m_PPRO | m_P4_NOCONA,
  +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/

 Agner:
 A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
 Core2 and Nehalem processors recognize that certain instructions are 
 independent of the
 prior value of the register if the source and destination registers are the 
 same.

 This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, 
 XORPD, and all
 variants of PSUBxxx and PCMPxxx except PCMPEQQ.
 
 /* X86_TUNE_MEMORY_MISMATCH_STALL */
 m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
  @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
 
 /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
than 4 branch instructions in the 16 byte window.  */
  -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
  +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

 This is a special pass to handle limitations of AMD's K7/K8/K10 branch prediction.
 Intel never had a similar design, so this flag is pointless.

I noticed that too, but Andi has a better answer to it.


 We apparently ought to disable it for K10, at least per Agner's manual.
 
 /* X86_TUNE_SCHEDULE */
 m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | 
  m_GENERIC,
  @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
 m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | 

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
On Wed, Dec 12, 2012 at 4:16 PM, Xinliang David Li davi...@google.com wrote:
 On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka hubi...@ucw.cz wrote:
 Concerning 1push per cycle, I think it is same as K7 hardware did, so move
 prologue should be a win.
  Index: config/i386/i386.c
  ===
  --- config/i386/i386.c  (revision 194452)
  +++ config/i386/i386.c  (working copy)
  @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
 COSTS_N_INSNS (8),   /* cost of FABS instruction.  */
 COSTS_N_INSNS (8),   /* cost of FCHS instruction.  */
 COSTS_N_INSNS (40),  /* cost of FSQRT instruction.  */
  -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
  -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
  +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
  +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
 {-1, libcall, false,
 {{libcall, {{6, loop_1_byte, true},
 {24, loop, true},
 {8192, rep_prefix_4_byte, true},
 {-1, libcall, false}}},
  -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
  +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},

 libcall is not faster up to 8KB to rep sequence that is better for 
 regalloc/code
 cache than fully blowin function call.

 Be careful with this. My recollection is that REP sequence is good for


s/good/not good/


David

 any size -- for smaller size, the REP initial set up cost is too high
 (10s of cycles), while for large size copy, it is less efficient
 compared with library version.


  @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
 m_PPRO,
 
 /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
  -  m_CORE2I7 | m_GENERIC,
  +  m_GENERIC | m_CORE2,

 This disable shifts that store just some flags. Acroding to Agner's manual 
 I7 handle
 this well.


 ok.

 Partial flags stall
 The Sandy Bridge uses the method of an extra µop to join partial registers 
 not only for
 general purpose registers but also for the flags register, unlike previous 
 processors which
 used this method only for general purpose registers. This occurs when a 
 write to a part of
 the flags register is followed by a read from a larger part of the flags 
 register. The partial
 flags stall of previous processors (See page 75) is therefore replaced by an 
 extra µop. The
 Sandy Bridge also generates an extra µop when reading the flags after a 
 rotate instruction.

 This is cheaper than the 7 cycle delay on Core this flags is trying to avoid.

 ok.

 
 /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
  * on 16-bit immediate moves into memory on Core2 and Corei7.  */
  @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
 m_K6,
 
 /* X86_TUNE_USE_CLTD */
  -  ~(m_PENT | m_ATOM | m_K6),
  +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

 My change was to enable CLTD for generic. Is your change intended to
 revert that?


 None of CPUs that generic care about are !USE_CLTD now after your change.
  @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
 m_ATHLON_K8,
 
 /* X86_TUNE_SSE_TYPELESS_STORES */
  -  m_AMD_MULTIPLE,
  +  m_AMD_MULTIPLE | m_CORE2I7, /**/

 Hmm, I can not seem to find this in manual now, but I believe that stores 
 also do not type,
 so movaps is preferred over movapd store because it is shorter.  If not, 
 this change should
 produce a lot of slowdowns.
 
 /* X86_TUNE_SSE_LOAD0_BY_PXOR */
  -  m_PPRO | m_P4_NOCONA,
  +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /**/

 Agner:
 A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
 Core2 and Nehalem processors recognize that certain instructions are 
 independent of the
 prior value of the register if the source and destination registers are the 
 same.

 This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, 
 XORPD, and all
 variants of PSUBxxx and PCMPxxx except PCMPEQQ.
 
 /* X86_TUNE_MEMORY_MISMATCH_STALL */
 m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
  @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
 
 /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict 
  more
than 4 branch instructions in the 16 byte window.  */
  -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
  +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

 This is special passs to handle limitations of AMD's K7/K8/K10 branch 
 prediction.
 Intel never had similar design, so this flag is pointless.

 I noticed that too, but Andi has a better answer to it.


 We apparently ought to disable it for K10, at least per Agner's manual.
 
 /* X86_TUNE_SCHEDULE */
 m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | 
  m_GENERIC,
  

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Jan Hubicka
  libcall is not faster up to 8KB to rep sequence that is better for 
  regalloc/code
  cache than fully blowin function call.
 
 Be careful with this. My recollection is that REP sequence is good for
 any size -- for smaller size, the REP initial set up cost is too high
 (10s of cycles), while for large size copy, it is less efficient
 compared with library version.

Well, this is based on the data from the memtest script.
Core has a good REP implementation - it is a win from rather small blocks (16
bytes, if I recall) and it does not need alignment.
The library version starts to be interesting with caching hints, but I think up
to 80KB it is still not a win for my setup (glibc-2.15).
  
  /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
   * on 16-bit immediate moves into memory on Core2 and Corei7.  */
   @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
  m_K6,
  
  /* X86_TUNE_USE_CLTD */
   -  ~(m_PENT | m_ATOM | m_K6),
   +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
 
 My change was to enable CLTD for generic. Is your change intended to
 revert that?

No, it is a merge conflict, sorry.  I will update it in my tree.
  Skipping inc/dec is to avoid partial flag stall happening on P4 only.
  
 
 
 K8 and K10 partition the flags into groups. References to flags in
 the same group can still cause the stall -- not sure how that can be
 handled.

I believe the stalls happen only in quite special cases where a compare
instruction combines flags from multiple instructions.  GCC doesn't generate
this type of code, so we should be safe.

Honza


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka hubi...@ucw.cz wrote:
  libcall is not faster up to 8KB to rep sequence that is better for 
  regalloc/code
  cache than fully blowin function call.

 Be careful with this. My recollection is that REP sequence is good for
 any size -- for smaller size, the REP initial set up cost is too high
 (10s of cycles), while for large size copy, it is less efficient
 compared with library version.

 Well this is based on the data from the memtest script.
 Core has good REP implementation - it is a win from rather small blocks (16
 bytes if I recall) and it does not need alignment.
 Library version starts to be interesting with caching hints, but I think till 
 80KB
 it is still not a win for my setup (glibc-2.15)

A simple test shows that -mstringop-strategy=libcall always beats
-mstringop-strategy=rep_8byte (on Core2 and Corei7) except for sizes
smaller than 8, where the rep_8byte strategy simply bypasses REP movs.
Can you share your memtest?

thanks,

David

  
  /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix 
   stall
   * on 16-bit immediate moves into memory on Core2 and Corei7.  */
   @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
  m_K6,
  
  /* X86_TUNE_USE_CLTD */
   -  ~(m_PENT | m_ATOM | m_K6),
   +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

 My change was to enable CLTD for generic. Is your change intended to
 revert that?

 No, it is merge conflict, sorry.  I will update it in my tree.
  Skipping inc/dec is to avoid partial flag stall happening on P4 only.
  


 K8 and K10 partitions the flags into groups. References to flags to
 the same group can still cause the stall -- not sure how that can be
 handled.

 I believe the stalls happen only in quite special cases where a compare
 instruction combines flags from multiple instructions.  GCC doesn't generate
 this type of code, so we should be safe.

 Honza


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Jakub Jelinek
On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
 On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka hubi...@ucw.cz wrote:
   libcall is not faster up to 8KB to rep sequence that is better for 
   regalloc/code
   cache than fully blowin function call.
 
  Be careful with this. My recollection is that REP sequence is good for
  any size -- for smaller size, the REP initial set up cost is too high
  (10s of cycles), while for large size copy, it is less efficient
  compared with library version.
 
  Well this is based on the data from the memtest script.
  Core has good REP implementation - it is a win from rather small blocks (16
  bytes if I recall) and it does not need alignment.
  Library version starts to be interesting with caching hints, but I think 
  till 80KB
  it is still not a win for my setup (glibc-2.15)
 
 A simple test shows that -mstringop-strategy=libcall always beats
 -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
 smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
 Can you share your memtest ?

I can't believe that, say, a 16-byte or 32-byte memcpy can ever be faster using a
libcall.  The PLT call overhead is simply too high.

Jakub


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-12 Thread Xinliang David Li
Try the following one: 1) -minline-all-stringops
-mstringop-strategy=rep_8byte -O2 vs. 2) -mstringop-strategy=libcall
-O2.

David


#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#ifndef LEN
#define LEN 16
#endif

void copy(char* s1, char* s2, int len) __attribute__((noinline));
void copy(char* s1, char* s2, int len)
{
   memcpy(s2, s1, len);
}


int main() {

  char* s1 = (char*) malloc(LEN + 10);
  char* s2 = (char*) malloc(LEN + 10);
  int i = 0;
  for (i = 0; i < 10; i++)
  {
    copy(s1+1, s2+3, LEN);
  }
}
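
(For reference, a self-contained timing variant of the same test is sketched
below.  It is not part of the original message: the iteration count ITERS and
the use of clock_gettime() are illustrative assumptions, and on older glibc the
program may need to be linked with -lrt.  Build it twice, once per
-mstringop-strategy setting, and compare the reported times.)

/* Hypothetical timing harness around the same copy() routine, so the two
   stringop strategies can be compared by wall-clock time.  */
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifndef LEN
#define LEN 16
#endif
#define ITERS 100000000   /* made-up iteration count, adjust as needed */

void copy(char* s1, char* s2, int len) __attribute__((noinline));
void copy(char* s1, char* s2, int len)
{
  memcpy(s2, s1, len);
}

int main() {
  char* s1 = (char*) malloc(LEN + 10);
  char* s2 = (char*) malloc(LEN + 10);
  struct timespec t0, t1;
  long i;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (i = 0; i < ITERS; i++)
    copy(s1+1, s2+3, LEN);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  printf("LEN=%d: %.3f s\n", LEN,
         (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
  return 0;
}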

On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek ja...@redhat.com wrote:
 On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
 On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka hubi...@ucw.cz wrote:
   libcall is not faster up to 8KB to rep sequence that is better for 
   regalloc/code
   cache than fully blowin function call.
 
  Be careful with this. My recollection is that REP sequence is good for
  any size -- for smaller size, the REP initial set up cost is too high
  (10s of cycles), while for large size copy, it is less efficient
  compared with library version.
 
  Well this is based on the data from the memtest script.
  Core has good REP implementation - it is a win from rather small blocks (16
  bytes if I recall) and it does not need alignment.
  Library version starts to be interesting with caching hints, but I think 
  till 80KB
  it is still not a win for my setup (glibc-2.15)

 A simple test shows that -mstringop-strategy=libcall always beats
 -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
 smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
 Can you share your memtest ?

 I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
 libcall.  The PLT call overhead is simply too high.

 Jakub


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-11 Thread Richard Biener
On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump mikest...@comcast.net wrote:
 On Dec 10, 2012, at 12:42 PM, Xinliang David Li davi...@google.com wrote:
 I have not measured the CFI size impact -- but conceivably it should
 be larger -- which is unfortunate.

 Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
 dwarf 5 fix it!

Well, unlike debug info, CFI data has to be in memory to make unwinding work.
These days most Linux distributions enable asynchronous unwind tables, so any
size savings due to shorter push/pop prologue/epilogue sequences have to be
offset against the increase in CFI data.  I'm not sure there is really a
speed difference between the two variants (well, maybe due to the better
icache footprint of the push/pop variant).

That said - I'd prefer to have more data on this before making the switch for
the generic model.  What was your original motivation?  Just theory or was
it a real case?

Thanks,
Richard.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-11 Thread Xinliang David Li
The following is the -O2 size data from SPEC2k.  Note that with push/pop,
it is always a net win (negative delta) in terms of total binary or
total loadable section size.

thanks,

David

               .text     .eh_frame  Total_binary
vortex-move    440252    40796      584066
vortex-push    415436    57452      575906
delta          -5.6%     40.8%      -1.397%

twolf-move     169324    10748      223521
twolf-push     168876    11124      223449
delta          -0.3%     3.5%       -0.032%

gzip-move      30668     3652       374399
gzip-push      30524     3740       374343
delta          -0.5%     2.4%       -0.015%

bzip2-move     22748     3196       111616
bzip2-push     22636     3284       111592
delta          -0.5%     2.8%       -0.022%

vpr-move       104684    9380       147378
vpr-push       104236    9788       147338
delta          -0.4%     4.3%       -0.027%

mcf-move       8444      1244       26760
mcf-push       8444      1244       26760
delta          0.0%      0.0%       0.000%

cc1-move       1093964   90772      1576994
cc1-push       1078988   104068     1575314
delta          -1.4%     14.6%      -0.107%

crafty-move    130556    5508       1256037
crafty-push    130236    5772       1255981
delta          -0.2%     4.8%       -0.004%

eon-move       333660    33220      516491
eon-push       330140    35812      51
delta          -1.1%     7.8%       -0.181%

gap-move       404092    46732      1457735
gap-push       396012    53180      1456103
delta          -2.0%     13.8%      -0.112%

perlbmk-move   456572    45324      618585
perlbmk-push   449516    52340      618545
delta          -1.5%     15.5%      -0.006%

parser-move    81244     15788      334003
parser-push    80684     16332      333987
delta          -0.7%     3.4%       -0.005%


On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li davi...@google.com wrote:
 On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
 richard.guent...@gmail.com wrote:
 On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump mikest...@comcast.net wrote:
 On Dec 10, 2012, at 12:42 PM, Xinliang David Li davi...@google.com wrote:
 I have not measured the CFI size impact -- but conceivably it should
 be larger -- which is unfortunate.

 Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
 dwarf 5 fix it!

 Well, unlike debug info, CFI data has to be in memory to make unwinding work.
 These days most Linux distributions enable asynchronous unwind tables, so any
 size savings due to shorter push/pop prologue/epilogue sequences have to be
 offset against the increase in CFI data.  I'm not sure there is really a
 speed difference between the two variants (well, maybe due to the better
 icache footprint of the push/pop variant).

 Yes, for large applications, this can be crucial to performance.


 That said - I'd prefer to have more data on this before making the switch for
 the generic model.  What was your original motivation?  Just theory or was
 it a real case?

 1) some of the very large internal apps I measured benefit from this
 change (in terms of performance)
 2) both ICC and LLVM do the same.

 I have already committed the patch. I will find some time to collect
 more size data and post it later.

 thanks,

 David



 Thanks,
 Richard.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-11 Thread Xinliang David Li
Some SPEC2k performance numbers (with 3 runs on Core2):

Push wins over move on 3 benchmarks.  The others are noise.

perlbmk: ~+1.9%
gap: ~+1.4%
vortex: ~+0.7%

David

On Tue, Dec 11, 2012 at 2:53 PM, Xinliang David Li davi...@google.com wrote:
 The following is the -O2 size data from SPEC2k.  Note that with push/pop,
 it is always a net win (negative delta) in terms of total binary or
 total loadable section size.

 thanks,

 David

.text.eh_frame  Total_binary
 vortex-move 440252 40796 584066
 vortex-push 415436 57452 575906
 delta-5.6% 40.8%  -1.397%

 twolf-move 169324 10748 223521
 twolf-push 168876 11124 223449
 delta   -0.3% 3.5% -0.032%

 gzip-move 30668 3652 374399
 gzip-push 30524 3740 374343
 delta -0.5% 2.4% -0.015%

 bzip2-move 22748 3196 111616
 bzip2-push 22636 3284 111592
 delta  -0.5% 2.8% -0.022%

 vpr-move 104684 9380 147378
 vpr-push 104236 9788 147338
 delta -0.4% 4.3% -0.027%

 mcf-move 8444 1244 26760
 mcf-push 8444 1244 26760
 delta0.0% 0.0% 0.000%

 cc1-move 1093964 90772 1576994
 cc1-push 1078988 104068 1575314
 delta  -1.4% 14.6% -0.107%

 crafty-move 130556 5508 1256037
 crafty-push 130236 5772 1255981
 delta-0.2% 4.8% -0.004%

 eon-move 333660 33220 516491
 eon-push 330140 35812 51
 delta -1.1% 7.8% -0.181%

 gap-move 404092 46732 1457735
 gap-push 396012 53180 1456103
 delta -2.0% 13.8% -0.112%

 perlbmk-move 456572 45324 618585
 perlbmk-push 449516 52340 618545
 delta -1.5% 15.5% -0.006%

 parser-move 81244 15788 334003
 parser-push 80684 16332 333987
 delta   -0.7% 3.4% -0.005%


 On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li davi...@google.com wrote:
 On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
 richard.guent...@gmail.com wrote:
 On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump mikest...@comcast.net wrote:
 On Dec 10, 2012, at 12:42 PM, Xinliang David Li davi...@google.com wrote:
 I have not measured the CFI size impact -- but conceivably it should
 be larger -- which is unfortunate.

 Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
 dwarf 5 fix it!

 Well, unlike debug info, CFI data has to be in memory to make unwinding work.
 These days most Linux distributions enable asynchronous unwind tables, so any
 size savings due to shorter push/pop prologue/epilogue sequences have to be
 offset against the increase in CFI data.  I'm not sure there is really a
 speed difference between the two variants (well, maybe due to the better
 icache footprint of the push/pop variant).

 Yes, for large applications, this can be crucial to performance.


 That said - I'd prefer to have more data on this before making the switch 
 for
 the generic model.  What was your original motivation?  Just theory or was
 it a real case?

 1) some of the very large internal apps I measured benefit from this
 change (in terms of performance)
 2) both ICC and LLVM do the same.

 I have already committed the patch. I will find some time to collect
 more size data and post it later.

 thanks,

 David



 Thanks,
 Richard.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-10 Thread Richard Biener
On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak ubiz...@gmail.com wrote:
 Hello!

 I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
 SP adjustment instead of a sequence of pushes/pops. The preference to
 the MOVs are good for old CPU micro-architectures (before pentium-4,
 K10), because it breaks the data dependency.  In modern
 micro-architecture, push/pop is implemented using a mechanism called
 stack engine. The data dependency is removed by the hardware, and
 push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
 smaller. There is no longer the need to avoid using them.   This is
 also what ICC does.

 2012-12-08  Xinliang David Li  davi...@google.com
* config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen 
 CPUs.

 s/moderen/modern

 OK for mainline SVN.

It's also more costly for unwind info in the prologue/epilogue.  Thus, did you
measure the effect on CFI size?

Thanks,
Richard.

 Thanks,
 Uros.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-10 Thread Xinliang David Li
I have not measured the CFI size impact -- but conceivably it should
be larger -- which is unfortunate.

David

On Mon, Dec 10, 2012 at 1:23 AM, Richard Biener
richard.guent...@gmail.com wrote:
 On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak ubiz...@gmail.com wrote:
 Hello!

 I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
 SP adjustment instead of a sequence of pushes/pops. The preference to
 the MOVs are good for old CPU micro-architectures (before pentium-4,
 K10), because it breaks the data dependency.  In modern
 micro-architecture, push/pop is implemented using a mechanism called
 stack engine. The data dependency is removed by the hardware, and
 push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
 smaller. There is no longer the need to avoid using them.   This is
 also what ICC does.

 2012-12-08  Xinliang David Li  davi...@google.com
* config/i386/i386.c: Eanble push/pop in pro/epilogue for 
 moderen CPUs.

 s/moderen/modern

 OK for mainline SVN.

 It's also more costly for unwind info in the prologue/epilogue.  Thus, did you
 measure the effect on CFI size?

 Thanks,
 Richard.

 Thanks,
 Uros.


Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-10 Thread Mike Stump
On Dec 10, 2012, at 12:42 PM, Xinliang David Li davi...@google.com wrote:
 I have not measured the CFI size impact -- but conceivably it should
 be larger -- which is unfortunate.

Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let 
dwarf 5 fix it!



Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-09 Thread Дмитрий Дьяченко
s/Eanble/Enable/


Thanks,
Dmitry

2012/12/9 Uros Bizjak ubiz...@gmail.com:
 Hello!

 I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
 SP adjustment instead of a sequence of pushes/pops. The preference to
 the MOVs are good for old CPU micro-architectures (before pentium-4,
 K10), because it breaks the data dependency.  In modern
 micro-architecture, push/pop is implemented using a mechanism called
 stack engine. The data dependency is removed by the hardware, and
 push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
 smaller. There is no longer the need to avoid using them.   This is
 also what ICC does.

 2012-12-08  Xinliang David Li  davi...@google.com
* config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen 
 CPUs.

 s/moderen/modern

 OK for mainline SVN.

 Thanks,
 Uros.


[PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

2012-12-08 Thread Xinliang David Li
I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
SP adjustment instead of a sequence of pushes/pops. The preference to
the MOVs are good for old CPU micro-architectures (before pentium-4,
K10), because it breaks the data dependency.  In modern
micro-architecture, push/pop is implemented using a mechanism called
stack engine. The data dependency is removed by the hardware, and
push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
smaller. There is no longer the need to avoid using them.   This is
also what ICC does.

The following patch fixed the problem. It passes bootstrap/regression
test. OK to install?

thanks,

David

Index: config/i386/i386.c
===
--- config/i386/i386.c (revision 194324)
+++ config/i386/i386.c (working copy)
@@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

   /* X86_TUNE_PROLOGUE_USING_MOVE */
-  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
+  m_PPRO | m_ATHLON_K8,

   /* X86_TUNE_EPILOGUE_USING_MOVE */
-  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
+  m_PPRO | m_ATHLON_K8,

   /* X86_TUNE_SHIFT1 */
   ~m_486,


2012-12-08  Xinliang David Li  davi...@google.com
   * config/i386/i386.c: Eanble push/pop in pro/epilogue for
moderen CPUs.
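
(As an illustration only, not part of the patch above: the difference shows up
in any function that keeps several values live across a call and therefore is
likely to need callee-saved registers.  The function below is a made-up
example; compiling it with "gcc -O2 -S" before and after a tuning change like
this one should show either MOV-based register saves plus a single
stack-pointer adjustment, or a short push sequence in the prologue with the
matching pops in the epilogue.)

/* Made-up example: t1..t3 are live across the call, so the compiler will
   typically keep them in callee-saved registers that the prologue has to
   save and the epilogue has to restore.  */
extern int sink(int, int, int, int, int, int);

int spill_heavy(int a, int b, int c, int d, int e, int f)
{
  int t1 = a * b + c;
  int t2 = c * d + e;
  int t3 = e * f + a;
  int r = sink(t1, t2, t3, a + b, c + d, e + f);
  /* The temporaries are needed after the call, so they must survive it.  */
  return r + t1 + t2 + t3;
}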