[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-06 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #37 from oleg at smolsky dot net 2012-03-06 16:34:27 UTC ---
Hey Jakub, is this smaller example digestible?
 http://gcc.gnu.org/bugzilla/attachment.cgi?id=26814

The asm output is straightforward, but I obviously have no clue about 
how complex the corresponding compiler's internal state is...


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-06 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #38 from Jakub Jelinek jakub at gcc dot gnu.org 2012-03-06 
17:26:24 UTC ---
Sorry, can't reproduce any performance degradation between 4.1 and 4.6
on the http://gcc.gnu.org/bugzilla/attachment.cgi?id=26814 testcase (-O3 -m64,
default -mtune=generic):
on i7-2600 4.1 user time is 0m3.833s, 4.6 0m3.411s and 4.7 0m5.102s,
on AMD Barcelona 4.1 user time is 0m8.798s, 4.6 0m5.875s and 4.7 0m5.855s.


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-06 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #39 from oleg at smolsky dot net 2012-03-06 19:39:03 UTC ---
Hmm... funky. I can reproduce the issue on a newer Intel machine:

$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 23
model name  : Intel(R) Xeon(R) CPU   L5410  @ 2.33GHz
stepping: 6
cpu MHz : 2327.445
cache size  : 6144 KB
physical id : 0
siblings: 4
core id : 0
cpu cores   : 4


$ time ./test41
real    0m6.270s
user    0m6.268s
sys     0m0.000s

$ time ./test44
real    0m5.524s
user    0m5.523s
sys     0m0.000s

$ time ./test46
real    0m11.721s
user    0m11.718s
sys     0m0.001s

P.S. the middle one is made using g++ (GCC) 4.4.5 20110214 (Red Hat 
4.4.5-6). The rest are original binaries made a couple of days ago.


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-02 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #30 from Jakub Jelinek jakub at gcc dot gnu.org 2012-03-02 
08:07:15 UTC ---
Created attachment 26809
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=26809
pr50182.C

Even the reduced testcase is orders of magnitude longer than what would be
desirable for analysis. I've tried to reduce it to just the templates that are
actually needed (and that can be measured simply with time); does this reflect
the slowdowns you are seeing?  The next step in reducing would be to remove all
the template mess, instantiate it by hand, and perhaps also inline by hand.
There is no reason why we shouldn't end up with just one loop with all the
statements in it.  On this reduced testcase on an Intel i7-2600 CPU with -O3,
-DFAST_VER/-DNOINLINE don't seem to make any difference, but 4.6 is measurably
faster than 4.7.

In any case, this is way too late for 4.7.


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-02 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #31 from oleg at smolsky dot net 2012-03-02 08:21:41 UTC ---
I don't think there is a need to actually check the result in this 
benchmarkable fragment, so that will reduce the code a little. The only 
difficulty I ran into was forcing the compiler not to discard the 
intermediate result, so that it actually performs every calculation 
and iteration :)

Let me try to digest this further. I'll also get you a result from our 
production compiler (v4.1, which emits the fastest code).
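
One common way to keep a benchmark loop from being optimized away is to feed every intermediate result into a value the compiler must treat as observable. The sketch below is only an illustration of that idea, not the code from the attachment: `g_sink`, `sum_plus_ten` and `run` are hypothetical names, and the "+10" merely mimics the constant add discussed in this bug. The narrowing conversion back to `int8_t` is assumed to wrap modulo 256, as it does with GCC.

```cpp
#include <cstdint>

// Hypothetical sink: a volatile store that the optimizer has to perform.
static volatile std::uint32_t g_sink;

// Adds 10 to each element and accumulates in int8_t (wraps modulo 256 with GCC).
std::int8_t sum_plus_ten(const std::int8_t* first, int count) {
    std::int8_t result = 0;
    for (int n = 0; n < count; ++n)
        result = static_cast<std::int8_t>(result + first[n] + 10);
    return result;
}

// Repeats the inner sum and makes each partial total observable.
std::uint32_t run(const std::int8_t* first, int count, int iterations) {
    std::uint32_t rv = 0;
    for (int i = 0; i < iterations; ++i) {
        rv += static_cast<std::uint32_t>(sum_plus_ten(first, count));
        g_sink = rv;  // side effect the compiler may not drop
    }
    return rv;
}
```

Returning `rv` and printing it in the caller, as the later test runs do with `rv=...`, serves the same purpose. Neither technique is a guarantee that the compiler re-executes the inner loop on every iteration, so the generated assembly is still worth checking.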


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-02 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #32 from Jakub Jelinek jakub at gcc dot gnu.org 2012-03-02 
08:28:34 UTC ---
For me, 4.1 is equally fast to 4.6 on my CPU and on the reduced testcase I've
attached (not clear if it models what the original benchmark did right or not),
and on the trunk regressed with
http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=176072
Before that the inner loop looked like:
.L12:
        addl    $10, %edx
        addb    0(%rbp,%rcx), %dl
        addq    $1, %rcx
        cmpl    %ecx, %ebx
        jg      .L12
and now it looks like:
.L12:
        movzbl  0(%rbp,%rdx), %r8d
        addq    $1, %rdx
        cmpl    %edx, %ebx
        leal    10(%rcx,%r8), %ecx
        jg      .L12


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-02 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #33 from Jakub Jelinek jakub at gcc dot gnu.org 2012-03-02 
09:13:52 UTC ---
After Jason's patch (which needs to be kept, it was a wrong-code bugfix), we
get out of the FE the addition in int type, while previously it was in unsigned
char type.  I.e.

  int D.2177;
  signed char D.2138;
  T D.2178;
  T D.2179;
  T D.2180;
  signed char result;
D.2138 = custom_constant_add<signed char>::do_shift (D.2177);
D.2178 = (T) result;
D.2179 = (T) D.2138;
D.2180 = D.2178 + D.2179;
result = (signed char) D.2180;
where T used to be unsigned char before and now is int.
And no GIMPLE optimization pass manages to narrow the addition operation
(together with the previous sign extensions and following demotion) to an
unsigned char operation (signed char would be wrong, because of the possible
overflow).  I bet such narrowing in these cases could even help the vectorizer,
which if it were to vectorize this or similar loops (it doesn't in this case),
would do the promotions/demotions needlessly.
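
The narrowing described above relies on the fact that adding in int and then demoting gives the same bits as adding in unsigned char. A minimal sketch that checks this exhaustively follows; the function names are hypothetical, and the conversion back to signed char is assumed to wrap modulo 256 (GCC's documented behaviour, implementation-defined before C++20).

```cpp
// Addition carried out in int, then demoted (T = int in the dump above).
inline signed char add_in_int(signed char result, signed char v) {
    int wide = static_cast<int>(result) + static_cast<int>(v);
    return static_cast<signed char>(wide);
}

// The same addition carried out in unsigned char (T = unsigned char).
inline signed char add_in_uchar(signed char result, signed char v) {
    unsigned char narrow = static_cast<unsigned char>(
        static_cast<unsigned char>(result) + static_cast<unsigned char>(v));
    return static_cast<signed char>(narrow);
}

// Returns true when both forms agree for every pair of signed char inputs.
inline bool forms_agree() {
    for (int a = -128; a <= 127; ++a)
        for (int b = -128; b <= 127; ++b) {
            signed char x = static_cast<signed char>(a);
            signed char y = static_cast<signed char>(b);
            if (add_in_int(x, y) != add_in_uchar(x, y))
                return false;
        }
    return true;
}
```

Doing the addition directly in signed char would not be a valid replacement at the GIMPLE level, because signed overflow is undefined there; that is why the narrowed operation has to be unsigned.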


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-02 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #34 from oleg at smolsky dot net 2012-03-03 02:19:21 UTC ---
OK, here are some benchmark numbers for the test compiled verbatim with 
g++41/g++463 -O2:

$ time ./test41
rv=4243767296

real    0m6.063s
user    0m6.058s
sys     0m0.001s

$ time ./test46
rv=4243767296

real    0m11.425s
user    0m11.415s
sys     0m0.003s

$ time ./test46-fast   # i.e. built with -DFAST_VER
rv=4243767296

real    0m11.389s
user    0m11.383s
sys     0m0.003s

Let me see how the sample can be digested further down...


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-02 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #35 from oleg at smolsky dot net 2012-03-03 02:45:15 UTC ---
Here is a smaller version. BTW, I've noticed another regression in 
optimization in v4.1 when using a const global...


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-02 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #36 from oleg at smolsky dot net 2012-03-03 02:59:11 UTC ---
Here is the code emitted by g++ 4.6.3 for smaller_test.cpp (attached to 
the bug)

  unsigned int test_constant proc near
 mov r9d, cs:iterations
 xor r8d, r8d
 xor eax, eax
 test r9d, r9d
 jle short locret_400552
 db  66h, 66h, 66h
 nop
 db  66h, 66h
 nop

loc_400528:
 xor ecx, ecx
 xor edx, edx
 test esi, esi
 jle short loc_40054E

loc_400530:
 add edx, 0Ah
 add dl, [rdi+rcx]
 add rcx, 1
 cmp esi, ecx
 jg  short loc_400530
 movsx   edx, dl

loc_400541:
 add r8d, 1
 add eax, edx
 cmp r8d, r9d
 jnz short loc_400528
 rep retn

loc_40054E:
 xor edx, edx
 jmp short loc_400541

locret_400552:
 rep retn


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-03-01 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #29 from oleg at smolsky dot net 2012-03-02 00:54:53 UTC ---
Is it possible to target this for 4.7? These optimization issues result 
in measurably slower code...


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-01-11 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-01-11
     Ever Confirmed|0                           |1

--- Comment #27 from Richard Guenther rguenth at gcc dot gnu.org 2012-01-11 
09:41:25 UTC ---
Confirmed.  Can somebody summarize please and point to the relevant short
testcase that shows the regression (is there only one kind of problem?  this
seems to be a benchmark suite).  A short testcase is preprocessed and
at most a few hundred lines.


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-01-11 Thread xinliangli at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #28 from davidxl xinliangli at gmail dot com 2012-01-11 17:26:46 
UTC ---
See comment 24 for shorter test case.

Summary:

1) The regression reported by Oleg in gcc4_6 and earlier versions is due to a
difference in FE code generation, which leads the backend to generate code
that causes partial register stalls.
2) The RAT stall problem is fixed in gcc4_7.
3) However, in 4_7 there is a different problem: redundant sign-extension and
move instructions are generated. This could be due to limitations of the RTL
forward propagation and combine passes in dealing with multiple downward uses.

David


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2012-01-10 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #26 from oleg at smolsky dot net 2012-01-10 18:06:28 UTC ---
Could someone toggle the state and assign a milestone please?


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-10-24 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #16 from oleg at smolsky dot net 2011-10-24 18:27:28 UTC ---
$ /work/tools/gcc47/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/work/tools/gcc47/bin/g++
COLLECT_LTO_WRAPPER=/work/tools/gcc47/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-4.7/configure --prefix=/work/tools/gcc47 
--enable-languages=c,c++ --with-system-zlib 
--with-mpfr=/work/tools/mpfr24 --with-gmp=/work/tools/gmp 
--with-mpc=/work/tools/mpc 
LD_LIBRARY_PATH=/work/tools/mpfr/lib24:/work/tools/gmp/lib:/work/tools/mpc/lib
Thread model: posix
gcc version 4.7.0 20111001 (experimental) (GCC)

The test case, test.cpp was compiled with this command:
/work/tools/gcc47/bin/g++ -I. -g -O3 -static-libstdc++ -static-libgcc 
-march=native test.cpp -o test


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-10-24 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #17 from oleg at smolsky dot net 2011-10-24 18:27:31 UTC ---
Created attachment 25595
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25595
test.cpp.144t.optimized

--- Comment #18 from oleg at smolsky dot net 2011-10-24 18:27:31 UTC ---
Created attachment 25596
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25596
test.s


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-10-24 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #19 from oleg at smolsky dot net 2011-10-24 18:33:23 UTC ---
Also note that Bugzilla has quietly replaced an older attachment, 
test.cpp, with a new one without adding a comment...


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-10-24 Thread xinliangli at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #20 from davidxl xinliangli at gmail dot com 2011-10-24 19:33:18 
UTC ---
The test.cpp attached seems to be the same as the old version.

David


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-10-24 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #21 from oleg at smolsky dot net 2011-10-24 19:48:57 UTC ---
OK, just in case, here is my current test.


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-10-24 Thread xinliangli at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #22 from davidxl xinliangli at gmail dot com 2011-10-24 19:58:23 
UTC ---
(In reply to comment #21)
> OK, just in case, here is my current test.

Preprocessed test case? I saw the main assembly difference that can explain the
performance diff, but want to make sure it is not due to your new source change
(I saw some print statements added).

David


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-10-24 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #23 from oleg at smolsky dot net 2011-10-24 21:11:21 UTC ---
Here is the source preprocessed for gcc47. The test exhibits the 
slowdown mentioned in comment 11.


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-10-24 Thread xinliangli at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #24 from davidxl xinliangli at gmail dot com 2011-10-24 23:00:22 
UTC ---
(In reply to comment #23)
> Here is the source preprocessed for gcc47. The test exhibits the 
> slowdown mentioned in comment 11.


The problem can be reproduced with a simplified test case. Basically,
depending on how the result value from the inner loop is used in the outer loop
(related to casting), the inner loop code is quite different: in the slow
case, two redundant sign extensions and a move instruction are generated.

# the fast version
gcc -O3 -DFAST_VER bug.cpp

./a.out 
rv=4282167296

test      description            absolute   operations   ratio with
number                           time       per second   test0

 0        int8_t constant add    1.05 sec   1523.81 M    1.00

Total absolute time for int8_t constant folding: 1.05 sec

# the slow version:

gcc -O3 bug.cpp
./a.out 
rv=4282167296

test      description            absolute   operations   ratio with
number                           time       per second   test0

 0        int8_t constant add    1.57 sec   1019.11 M    1.00

Total absolute time for int8_t constant folding: 1.57 sec


# however, when disabling inlining of check_shifted_sum_1 in the slow case, the
runtime is recovered:

gcc -O3 -DNOINLINE bug.cpp

./a.out 
rv=4282167296

test      description            absolute   operations   ratio with
number                           time       per second   test0

 0        int8_t constant add    1.05 sec   1523.81 M    1.00

Total absolute time for int8_t constant folding: 1.05 sec



The inner loop body in faster case:

.L60:
        movzbl  0(%rbp,%rcx), %r9d
        addq    $1, %rcx
        cmpl    %ecx, %ebx
        leal    10(%r8,%r9), %r8d
# SUCC: 4 [91.0%]  (dfs_back,can_fallthru) 5 [9.0%]  (fallthru,can_fallthru,loop_exit)
        jg      .L60

while for the slow case:

.L60:
        movzbl  (%r12,%rcx), %eax
        movsbl  %r8b, %r8d
        addq    $1, %rcx
        leal    10(%rax), %r9d
        movsbl  %r9b, %r9d
        addl    %r8d, %r9d
        cmpl    %ecx, %ebp
        movl    %r9d, %r8d
# SUCC: 4 [91.0%]  (dfs_back,can_fallthru) 5 [9.0%]  (fallthru,can_fallthru,loop_exit)
        jg      .L60


The relevant source change:

#ifdef NOINLINE
#define INL __attribute__((noinline))
#else
#define INL inline
#endif

template <typename T, typename T2, typename Shifter>
INL void check_shifted_sum_1(T2 result) {
 T temp = (T)SIZE * Shifter::do_shift((T)init_value);
 if (!tolerance_equal<T>((T)result,temp))
  printf("test %i failed\n", current_test);
}

#ifdef FAST_VER
#define TYPE u_int32_t
#else
#define TYPE int8_t
#endif


template <typename T, typename Shifter>
__attribute__((noinline)) u_int32_t test_constant(T* first, int count,
                                                  const char *label)
{
    int i;
    u_int32_t rv = 0;

    start_timer();

    for (i = 0; i < iterations; ++i) {
        T result = 0;
        for (int n = 0; n < count; ++n) {
            result += Shifter::do_shift( first[n] );
        }
        rv += result;
        check_shifted_sum_1<T, TYPE, Shifter>(result);
    }

    record_result( timer(), label );
    return rv;
}
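
For readers who want to look at the inner loop without the benchmark harness, the fragment above can be cut down to something like the following sketch. It is not the attached test case: the definition of `custom_constant_add` is a hypothetical stand-in (the thread never shows it; the "+10" is inferred from the `addl $10` in the generated code), the timer and the result check are dropped, and the value of `iterations` is purely illustrative.

```cpp
#include <cstdint>

// Hypothetical stand-in for the benchmark's shifter; not taken from the thread.
template <typename T>
struct custom_constant_add {
    static T do_shift(T input) { return static_cast<T>(input + T(10)); }
};

int iterations = 1000;  // illustrative value only

// Same loop nest as test_constant above, without timing or checking.
template <typename T, typename Shifter>
__attribute__((noinline)) std::uint32_t test_constant(const T* first, int count)
{
    std::uint32_t rv = 0;
    for (int i = 0; i < iterations; ++i) {
        T result = 0;
        for (int n = 0; n < count; ++n)
            result += Shifter::do_shift(first[n]);  // narrow accumulate, wraps with GCC
        rv += result;  // the use of result in the outer loop
    }
    return rv;
}
```

Instantiating it as `test_constant<std::int8_t, custom_constant_add<std::int8_t> >(data, n)` and compiling with -O3 -S should be enough to inspect the inner loop, though whether this cut-down version shows the same slowdown has not been verified here.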


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-10-24 Thread xinliangli at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #25 from davidxl xinliangli at gmail dot com 2011-10-24 23:02:14 
UTC ---
Created attachment 25600
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25600
test case for 47



Note that with gcc46, the result is even slower -- it has the RAT stall problem
which is fixed in 47.


David


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-10-21 Thread xinliangli at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #15 from davidxl xinliangli at gmail dot com 2011-10-21 23:02:16 
UTC ---
(In reply to comment #14)
> (In reply to comment #13)
> > David, it looks like we are seeing different things with v4.7... See my 
> > comment 11 - I am still observing the slowdown. Do you have access to 
> > v4.1 and v4.6? Could you try reproducing my test please?
> 
> Sorry for the delay -- I am pretty swamped these days (till mid October). I
> will try to look at the problem more then.
> 
> David


I still cannot reproduce the problem with the trunk compiler:


rv=4282167296

test      description            absolute   operations   ratio with
number                           time       per second   test0

 0        int8_t constant add    1.09 sec   1467.89 M    1.00

Total absolute time for int8_t constant folding: 1.09 sec


Can you attach the output of -v and the assembly file with -fverbose-asm -dA
and the optimized dump file with option -fdump-tree-optimized-blocks using
trunk compiler?

thanks,

David


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-09-15 Thread oleg at smolsky dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #13 from oleg at smolsky dot net 2011-09-15 16:53:26 UTC ---
David, it looks like we are seeing different things with v4.7... See my 
comment 11 - I am still observing the slowdown. Do you have access to 
v4.1 and v4.6? Could you try reproducing my test please?


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-09-15 Thread xinliangli at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #14 from davidxl xinliangli at gmail dot com 2011-09-15 17:28:10 
UTC ---
(In reply to comment #13)
> David, it looks like we are seeing different things with v4.7... See my 
> comment 11 - I am still observing the slowdown. Do you have access to 
> v4.1 and v4.6? Could you try reproducing my test please?

Sorry for the delay -- I am pretty swamped these days (till mid October). I
will try to look at the problem more then.

David


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-30 Thread matt at use dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

Matt Hargett <matt at use dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matt at use dot net

--- Comment #12 from Matt Hargett matt at use dot net 2011-08-30 20:30:15 UTC 
---
Can you determine which release introduced the regression?


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek jakub at gcc dot gnu.org 2011-08-25 
08:55:42 UTC ---
The bug report is incomplete: I don't see anywhere where you state which g++
options were measured, which CPU it was on, whether it is -m32 or -m64, etc.
For me, on i7-2600 CPU 4.6.0 (both Fedora 4.6.0-10 and 20110727 4.6 branch
snapshot) is actually much faster than current trunk with -O3 -m64:
4.6.* gives roughly
 0 int8_t constant add   0.84 sec   1904.76 M 1.00
while trunk
 0 int8_t constant add   1.26 sec   1269.84 M 1.00
4.4.* gives also
 0 int8_t constant add   1.26 sec   1269.84 M 1.00
4.3.* gives
 0 int8_t constant add   1.26 sec   1269.84 M 1.00
4.2.* gives
 0 int8_t constant add   0.84 sec   1904.76 M 1.00
and 4.1.* doesn't compile, because the source has been preprocessed and STL is
dependent on the compiler version.


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread oleg.smolsky at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #5 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-25 
15:19:57 UTC ---
Created attachment 25103
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25103
The same test preprocessed with g++ 4.1


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread oleg.smolsky at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #6 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-25 
15:25:49 UTC ---
Oh, the settings and things were discussed in the mail thread... Here is the
digest:

I have compiled and run a set of C++ benchmarks on a CentOS4/64 box using the
following compilers:
 a) g++4.1 that is available for this distro (GCC version 4.1.2 20071124 (Red
Hat 4.1.2-42))
 b) g++4.6 that I built (stock version 4.6.1)

I built the compiler with all the default options (it just has a distinct
installation path):
 ../gcc-%{version}/configure --prefix=/work/tools/gcc46
--enable-languages=c,c++ --with-system-zlib --with-mpfr=/work/tools/mpfr24
--with-gmp=/work/tools/gmp --with-mpc=/work/tools/mpc
LD_LIBRARY_PATH=/work/tools/mpfr/lib24:/work/tools/gmp/lib:/work/tools/mpc/lib

Tests were compiled with -O2 and -O3, I later added -march=native to 4.6
builds.

The processor is Intel quad core something:

processor: 0
vendor_id: GenuineIntel
cpu family: 6
model: 15
model name: Genuine Intel(R) CPU  @ 2.40GHz
stepping: 4
cpu MHz: 2393.943
cache size: 4096 KB
physical id: 0
siblings: 4
core id: 0
cpu cores: 4
fpu: yes
fpu_exception: yes
cpuid level: 10
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm pni monitor
ds_cpl tm2 cx16 xtpr lahf_lm
bogomips: 4793.09
clflush size: 64
cache_alignment: 64
address sizes: 36 bits physical, 48 bits virtual
power management:


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread hjl.tools at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #7 from H.J. Lu hjl.tools at gmail dot com 2011-08-25 15:58:08 
UTC ---
(In reply to comment #6)

> The processor is Intel quad core something:
> 
> processor: 0
> vendor_id: GenuineIntel
> cpu family: 6
> model: 15
> model name: Genuine Intel(R) CPU  @ 2.40GHz
> stepping: 4

Are you using an engineering sample? It doesn't look
like a production processor.


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread xinliangli at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #8 from davidxl xinliangli at gmail dot com 2011-08-25 16:17:10 
UTC ---
gcc46 and gcc47 difference can be reproduced using -O2 -m64.

David


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread oleg.smolsky at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #9 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-25 
16:26:05 UTC ---
AFAIK it's a production processor, a couple of years old. From x86info:

Family: 6 Model: 15 Stepping: 4 Type: 0 Brand: 0
CPU Model: Core 2 Duo E6600 Original OEM
Feature flags:
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh
ds acpi mmx fxsr sse sse2 ss ht tm pbe sse3 monitor ds-cpl vmx tm2 ssse3 cx16
xTPR
Extended feature flags:
 SYSCALL xd em64t lahf_lm
Cache info
 L1 Instruction cache: 32KB, 8-way associative. 64 byte line size.
 L1 Data cache: 32KB, 8-way associative. 64 byte line size.
 L3 unified cache: 4MB, 16-way associative. 64 byte line size.
TLB info
 Instruction TLB: 4x 4MB page entries, or 8x 2MB pages entries, 4-way assoc..
 Instruction TLB: 4K pages, 4-way associative, 128 entries.
 Data TLB: 4MB pages, 4-way associative, 32 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 Data TLB: 4K pages, 4-way associative, 256 entries.
 Data TLB: 4MB pages, 4-way associative, 32 entries
 64 byte prefetching.
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 Data TLB: 4K pages, 4-way associative, 256 entries.
The physical package supports 4 logical processors


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread oleg.smolsky at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #10 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-25 
22:08:49 UTC ---
BTW, the uint16_t test also got slower for the very same reason. Here is the
innermost loop generated by g++4.6:

.text:00400DA0 loc_400DA0:
.text:00400DA0 add eax, 0Ah
.text:00400DA3 add ax, [rdx]
.text:00400DA6 add rdx, 2
.text:00400DAA cmp rdx, 5092E0h
.text:00400DB1 jnz short loc_400DA0


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread oleg.smolsky at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #11 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-26 
00:48:02 UTC ---
Also, I have just built the same suite with GCC version 4.7 that came from
ftp://gcc.gnu.org/pub/gcc/snapshots/4.7-20110820/gcc-4.7-20110820.tar.bz2 and
the performance degradation remains:

gcc41:
0 int8_t constant add   1.35 sec   1185.19 M 1.00

gcc47:
0 int8_t constant add   2.37 sec   675.11 M 1.00

Note, these are original unmodified tests, not my digested derivatives


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-24 Thread oleg.smolsky at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #1 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-24 
22:13:26 UTC ---
Created attachment 25097
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25097
The test case

This is the preprocessed source for the test discussed in the mail thread.


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-24 Thread xinliangli at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

davidxl <xinliangli at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |xinliangli at gmail dot com

--- Comment #2 from davidxl xinliangli at gmail dot com 2011-08-24 23:15:44 
UTC ---
The problem is fixed in trunk compiler:

1) with 4.6 compiler:


test      description            absolute   operations   ratio with
number                           time       per second   test0

 0        int8_t constant add    3.29 sec   486.32 M     1.00


RAT_STALLS.registers = 288249 (sampling count 10001)


2) with trunk compiler:


test      description            absolute   operations   ratio with
number                           time       per second   test0

 0        int8_t constant add    1.34 sec   1194.03 M    1.00

No partial register stalls from user functions.


Inner loop from trunk compiler:

.L55:
        movzbl  0(%rbp,%rcx), %r9d
        addq    $1, %rcx
        cmpl    %ecx, %ebx
        leal    10(%r8,%r9), %r8d
        jg      .L55


Inner loop from 46 compiler:

.L43:
        addl    $10, %eax
        addb    (%rdx), %al
        addq    $1, %rdx
        cmpq    $data8+8000, %rdx
        jne     .L43


RAT stalls (the event is not precise, so the instruction blamed for the stalls
is slightly off):
               :  400e27:  nopw   0x0(%rax,%rax,1)
   127  0.0440 :  400e30:  add    $0xa,%eax
  5869  2.0330 :  400e33:  add    (%rdx),%al
282125 97.7263 :  400e35:  add    $0x1,%rdx
               :  400e39:  cmp    $0x404560,%rdx
               :  400e40:  jne    400e30 <main+0xd0>


David


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-24 Thread xinliangli at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #3 from davidxl xinliangli at gmail dot com 2011-08-25 00:13:00 
UTC ---
Caused by differences in FE generated code:

46:


D.6887 = (int) D.6886;
D.6888 = custom_constant_add<signed char>::do_shift (D.6887);
D.6889 = (unsigned char) D.6888;
result.8 = (unsigned char) result;
D.6891 = D.6889 + result.8;
result = (signed char) D.6891;
n = n + 1;


trunk:


D.6938 = (int) D.6937;
D.6874 = custom_constant_add<signed char>::do_shift (D.6938);
D.6939 = (int) result;   <-- promoted to int
D.6940 = (int) D.6874;   <-- promoted to int
D.6941 = D.6939 + D.6940;
result = (signed char) D.6941;
n = n + 1;