[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #37 from oleg at smolsky dot net 2012-03-06 16:34:27 UTC --- Hey Jakub, is this smaller example digestible? http://gcc.gnu.org/bugzilla/attachment.cgi?id=26814 The asm output is straightforward, but I obviously have no clue how complex the corresponding compiler's internal state is...
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #38 from Jakub Jelinek jakub at gcc dot gnu.org 2012-03-06 17:26:24 UTC --- Sorry, can't reproduce any performance degradation between 4.1 and 4.6 on the http://gcc.gnu.org/bugzilla/attachment.cgi?id=26814 testcase (-O3 -m64, default -mtune=generic): on i7-2600 4.1 user time is 0m3.833s, 4.6 0m3.411s and 4.7 0m5.102s, on AMD Barcelona 4.1 user time is 0m8.798s, 4.6 0m5.875s and 4.7 0m5.855s.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #39 from oleg at smolsky dot net 2012-03-06 19:39:03 UTC ---
Hmm... funky. I can reproduce the issue on a newer Intel machine:

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU L5410 @ 2.33GHz
stepping : 6
cpu MHz : 2327.445
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4

$ time ./test41
real 0m6.270s
user 0m6.268s
sys 0m0.000s

$ time ./test44
real 0m5.524s
user 0m5.523s
sys 0m0.000s

$ time ./test46
real 0m11.721s
user 0m11.718s
sys 0m0.001s

P.S. The middle one is made using g++ (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6). The rest are the original binaries made a couple of days ago.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #30 from Jakub Jelinek jakub at gcc dot gnu.org 2012-03-02 08:07:15 UTC ---
Created attachment 26809 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=26809 pr50182.C

Even the reduced testcase is orders of magnitude longer than would be desirable for analysis. I've tried to reduce it to just the templates that are actually needed (and it can be measured just with time). Does this reflect the slowdowns you are seeing? The next step in reducing it would be to remove all the template mess, instantiate it by hand, and perhaps also inline by hand. There is no reason why we shouldn't have just one loop with all the statements in it.

On this reduced testcase, on an Intel i7-2600 CPU with -O3, -DFAST_VER/-DNOINLINE don't seem to make any difference, but 4.6 is measurably faster than 4.7. In any case, this is way too late for 4.7.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #31 from oleg at smolsky dot net 2012-03-02 08:21:41 UTC --- I don't think there is a need to actually check the result in this benchmarkable fragment, so that will reduce the code a little. The only difficulty I was hitting was fooling/forcing the compiler not to discard the intermediate result and to actually perform every calculation and iteration :) Let me try to digest this further. I'll also get you a result from our production compiler (v4.1, which emits the fastest code).
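One common way to keep a benchmark result alive without validating it is to publish it through a volatile object. The following is only a sketch of that idea, not code from the attached test: `g_sink`, `sum_bytes` and `run_benchmark` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sink: a store to a volatile object is an observable side
// effect, so the value written to it has to be computed.
volatile std::uint32_t g_sink = 0;

// Hypothetical kernel: sums `count` bytes with a wrapping 8-bit accumulator.
std::uint8_t sum_bytes(const std::uint8_t* first, std::size_t count) {
    std::uint8_t result = 0;
    for (std::size_t n = 0; n < count; ++n) {
        result = static_cast<std::uint8_t>(result + first[n]);
    }
    return result;
}

// Runs the kernel `iterations` times and publishes every running total
// through the sink instead of checking it against an expected value.
std::uint32_t run_benchmark(const std::uint8_t* first, std::size_t count,
                            int iterations) {
    std::uint32_t rv = 0;
    for (int i = 0; i < iterations; ++i) {
        rv += sum_bytes(first, count);
        g_sink = rv;  // keeps each iteration's running total live
    }
    return rv;
}
```

This keeps the totals from being discarded, but it does not by itself stop the compiler from hoisting loop-invariant work out of the outer loop; the test in this bug additionally marks its kernel with `__attribute__((noinline))`.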
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #32 from Jakub Jelinek jakub at gcc dot gnu.org 2012-03-02 08:28:34 UTC ---
For me, 4.1 is as fast as 4.6 on my CPU on the reduced testcase I've attached (it is not clear whether it models correctly what the original benchmark did), and the trunk regressed with http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=176072

Before that the inner loop looked like:

.L12:
        addl    $10, %edx
        addb    0(%rbp,%rcx), %dl
        addq    $1, %rcx
        cmpl    %ecx, %ebx
        jg      .L12

and now it looks like:

.L12:
        movzbl  0(%rbp,%rdx), %r8d
        addq    $1, %rdx
        cmpl    %edx, %ebx
        leal    10(%rcx,%r8), %ecx
        jg      .L12
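For readers who prefer source to assembly, the two loop shapes can be rendered roughly as follows. This is an illustrative C++ sketch, not compiler output: `sum_narrow` and `sum_wide` are hypothetical names, and the constant 10 is taken from the listings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Shape of the older loop: the accumulator is an 8-bit value that is
// updated in place (compare `addb 0(%rbp,%rcx), %dl`).
std::uint8_t sum_narrow(const std::uint8_t* first, std::size_t count) {
    std::uint8_t acc = 0;
    for (std::size_t n = 0; n < count; ++n) {
        acc = static_cast<std::uint8_t>(acc + 10);
        acc = static_cast<std::uint8_t>(acc + first[n]);
    }
    return acc;
}

// Shape of the newer loop: each byte is zero-extended (movzbl) and added
// into a 32-bit accumulator (leal); truncation happens once at the end.
std::uint8_t sum_wide(const std::uint8_t* first, std::size_t count) {
    std::uint32_t acc = 0;
    for (std::size_t n = 0; n < count; ++n) {
        acc += 10u + first[n];
    }
    return static_cast<std::uint8_t>(acc);
}
```

Both functions return the same low 8 bits because unsigned arithmetic is modular, so truncating once at the end is equivalent to truncating after every step.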
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #33 from Jakub Jelinek jakub at gcc dot gnu.org 2012-03-02 09:13:52 UTC ---
After Jason's patch (which needs to be kept, it was a wrong-code bugfix), we get out of the FE the addition in int type, while previously it was in unsigned char type. I.e.

int D.2177;
signed char D.2138;
T D.2178;
T D.2179;
T D.2180;
signed char result;

D.2138 = custom_constant_add<signed char>::do_shift (D.2177);
D.2178 = (T) result;
D.2179 = (T) D.2138;
D.2180 = D.2178 + D.2179;
result = (signed char) D.2180;

where T used to be unsigned char before and now is int. And no GIMPLE optimization pass manages to narrow the addition operation (together with the previous sign extensions and following demotion) to an unsigned char operation (signed char would be wrong, because of the possible overflow).

I bet such narrowing in these cases could even help the vectorizer, which, if it were to vectorize this or similar loops (it doesn't in this case), would do the promotions/demotions needlessly.
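The claim that the int addition can be narrowed to an unsigned char addition can be checked exhaustively for 8-bit operands. The sketch below is an illustration only; `add_in_int`, `add_in_uchar` and `count_mismatches` are hypothetical helper names, and results are compared as raw 8-bit patterns to sidestep the implementation-defined conversion back to signed char.

```cpp
#include <cassert>
#include <cstdint>

// Addition as the front end now emits it: both operands are promoted to
// int and the sum is truncated afterwards. Returned as the raw 8 bits.
std::uint8_t add_in_int(std::int8_t a, std::int8_t b) {
    int wide = static_cast<int>(a) + static_cast<int>(b);
    return static_cast<std::uint8_t>(wide);
}

// The narrowed form: the same addition carried out on unsigned char
// values, where wraparound is well defined.
std::uint8_t add_in_uchar(std::int8_t a, std::int8_t b) {
    std::uint8_t ua = static_cast<std::uint8_t>(a);
    std::uint8_t ub = static_cast<std::uint8_t>(b);
    return static_cast<std::uint8_t>(ua + ub);
}

// Counts the operand pairs for which the two forms disagree.
int count_mismatches() {
    int mismatches = 0;
    for (int a = -128; a <= 127; ++a) {
        for (int b = -128; b <= 127; ++b) {
            std::int8_t sa = static_cast<std::int8_t>(a);
            std::int8_t sb = static_cast<std::int8_t>(b);
            if (add_in_int(sa, sb) != add_in_uchar(sa, sb)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

The unsigned form is the safe target for narrowing because signed overflow is undefined in C++, whereas unsigned arithmetic wraps modulo 256.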
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #34 from oleg at smolsky dot net 2012-03-03 02:19:21 UTC ---
OK, here are some benchmark numbers for the test compiled verbatim with g++41/g++463 -O2:

$ time ./test41
rv=4243767296
real 0m6.063s
user 0m6.058s
sys 0m0.001s

$ time ./test46
rv=4243767296
real 0m11.425s
user 0m11.415s
sys 0m0.003s

$ time ./test46-fast #(i.e. built with -DFAST_VER)
rv=4243767296
real 0m11.389s
user 0m11.383s
sys 0m0.003s

Let me see how the sample can be digested further down...
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #35 from oleg at smolsky dot net 2012-03-03 02:45:15 UTC --- Here is a smaller version. BTW, I've noticed another regression in optimization in v4.1 when using a const global...
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #36 from oleg at smolsky dot net 2012-03-03 02:59:11 UTC ---
Here is the code emitted by g++ 4.6.3 for smaller_test.cpp (attached to the bug):

unsigned int test_constant proc near
        mov     r9d, cs:iterations
        xor     r8d, r8d
        xor     eax, eax
        test    r9d, r9d
        jle     short locret_400552
        db 66h, 66h, 66h
        nop
        db 66h, 66h
        nop
loc_400528:
        xor     ecx, ecx
        xor     edx, edx
        test    esi, esi
        jle     short loc_40054E
loc_400530:
        add     edx, 0Ah
        add     dl, [rdi+rcx]
        add     rcx, 1
        cmp     esi, ecx
        jg      short loc_400530
        movsx   edx, dl
loc_400541:
        add     r8d, 1
        add     eax, edx
        cmp     r8d, r9d
        jnz     short loc_400528
        rep retn
loc_40054E:
        xor     edx, edx
        jmp     short loc_400541
locret_400552:
        rep retn
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #29 from oleg at smolsky dot net 2012-03-02 00:54:53 UTC --- Is it possible to target this to 4.7? These optimization issues result in measurably slower code...
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182
Richard Guenther rguenth at gcc dot gnu.org changed:

           What    |Removed      |Added
             Status|UNCONFIRMED  |NEW
   Last reconfirmed|             |2012-01-11
     Ever Confirmed|0            |1

--- Comment #27 from Richard Guenther rguenth at gcc dot gnu.org 2012-01-11 09:41:25 UTC ---
Confirmed. Can somebody please summarize and point to the relevant short testcase that shows the regression? (Is there only one kind of problem? This seems to be a benchmark suite.) A short testcase is preprocessed and at most a few hundred lines.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #28 from davidxl xinliangli at gmail dot com 2012-01-11 17:26:46 UTC ---
See comment 24 for a shorter test case. Summary:
1) The regression reported by Oleg in gcc4_6 and earlier versions is due to a difference in FE code generation, which leads the backend to generate code that causes a partial register stall.
2) The RAT stall problem is fixed in gcc4_7.
3) However, in 4_7 there is a different problem: redundant sign-extension and move instructions are generated. This could be due to a limitation of the RTL forward propagation and combine passes in dealing with multiple downward uses.
David
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #26 from oleg at smolsky dot net 2012-01-10 18:06:28 UTC --- Could someone toggle the state and assign a milestone, please?
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #16 from oleg at smolsky dot net 2011-10-24 18:27:28 UTC ---
$ /work/tools/gcc47/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/work/tools/gcc47/bin/g++
COLLECT_LTO_WRAPPER=/work/tools/gcc47/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-4.7/configure --prefix=/work/tools/gcc47 --enable-languages=c,c++ --with-system-zlib --with-mpfr=/work/tools/mpfr24 --with-gmp=/work/tools/gmp --with-mpc=/work/tools/mpc LD_LIBRARY_PATH=/work/tools/mpfr/lib24:/work/tools/gmp/lib:/work/tools/mpc/lib
Thread model: posix
gcc version 4.7.0 20111001 (experimental) (GCC)

The test case, test.cpp, was compiled with this command:

/work/tools/gcc47/bin/g++ -I. -g -O3 -static-libstdc++ -static-libgcc -march=native test.cpp -o test
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #17 from oleg at smolsky dot net 2011-10-24 18:27:31 UTC ---
Created attachment 25595 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25595 test.cpp.144t.optimized
--- Comment #18 from oleg at smolsky dot net 2011-10-24 18:27:31 UTC ---
Created attachment 25596 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25596 test.s
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #19 from oleg at smolsky dot net 2011-10-24 18:33:23 UTC --- Also note that Bugzilla has quietly replaced an older attachment, test.cpp, with a new one without adding a comment...
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #20 from davidxl xinliangli at gmail dot com 2011-10-24 19:33:18 UTC --- The test.cpp attached seems to be the same as the old version. David
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #21 from oleg at smolsky dot net 2011-10-24 19:48:57 UTC --- OK, just in case, here is my current test.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #22 from davidxl xinliangli at gmail dot com 2011-10-24 19:58:23 UTC ---
(In reply to comment #21)
> OK, just in case, here is my current test.

Preprocessed test case? I saw the main assembly difference that can explain the performance diff, but I want to make sure it is not due to your new source change (I saw some print statements added).
David
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #23 from oleg at smolsky dot net 2011-10-24 21:11:21 UTC --- Here is the source preprocessed for gcc47. The test exhibits the slowdown mentioned in comment 11.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #24 from davidxl xinliangli at gmail dot com 2011-10-24 23:00:22 UTC ---
(In reply to comment #23)
> Here is the source preprocessed for gcc47. The test exhibits the slowdown
> mentioned in comment 11.

The problem can be reproduced with a simplified test case. Depending on how the result value from the inner loop is used in the outer loop (related to casting), the inner loop code is quite different: in the slow case, two redundant sign extensions and a move instruction are generated.

# the fast version
gcc -O3 -DFAST_VER bug.cpp
./a.out
rv=4282167296

test    description             absolute  operations  ratio with
number                          time      per second  test0
 0      int8_t constant add     1.05 sec  1523.81 M   1.00

Total absolute time for int8_t constant folding: 1.05 sec

# the slow version:
gcc -O3 bug.cpp
./a.out
rv=4282167296

test    description             absolute  operations  ratio with
number                          time      per second  test0
 0      int8_t constant add     1.57 sec  1019.11 M   1.00

Total absolute time for int8_t constant folding: 1.57 sec

# however, when disabling inlining of check_shifted_sum_1 in the slow case, the runtime is recovered:
gcc -O3 -DNOINLINE bug.cpp
./a.out
rv=4282167296

test    description             absolute  operations  ratio with
number                          time      per second  test0
 0      int8_t constant add     1.05 sec  1523.81 M   1.00

Total absolute time for int8_t constant folding: 1.05 sec

The inner loop body in the faster case:

.L60:
        movzbl  0(%rbp,%rcx), %r9d
        addq    $1, %rcx
        cmpl    %ecx, %ebx
        leal    10(%r8,%r9), %r8d
# SUCC: 4 [91.0%] (dfs_back,can_fallthru) 5 [9.0%] (fallthru,can_fallthru,loop_exit)
        jg      .L60

while for the slow case:

.L60:
        movzbl  (%r12,%rcx), %eax
        movsbl  %r8b, %r8d
        addq    $1, %rcx
        leal    10(%rax), %r9d
        movsbl  %r9b, %r9d
        addl    %r8d, %r9d
        cmpl    %ecx, %ebp
        movl    %r9d, %r8d
# SUCC: 4 [91.0%] (dfs_back,can_fallthru) 5 [9.0%] (fallthru,can_fallthru,loop_exit)
        jg      .L60

The relevant source change:

#ifdef NOINLINE
#define INL __attribute__((noinline))
#else
#define INL inline
#endif

template <typename T, typename T2, typename Shifter>
INL void check_shifted_sum_1(T2 result) {
  T temp = (T)SIZE * Shifter::do_shift((T)init_value);
  if (!tolerance_equal<T>((T)result,temp))
    printf("test %i failed\n", current_test);
}

#ifdef FAST_VER
#define TYPE u_int32_t
#else
#define TYPE int8_t
#endif

template <typename T, typename Shifter>
__attribute__((noinline)) u_int32_t test_constant(T* first, int count, const char *label) {
  int i;
  u_int32_t rv = 0;
  start_timer();
  for (i = 0; i < iterations; ++i) {
    T result = 0;
    for (int n = 0; n < count; ++n) {
      result += Shifter::do_shift( first[n] );
    }
    rv += result;
    check_shifted_sum_1<T, TYPE, Shifter>(result);
  }
  record_result( timer(), label );
  return rv;
}
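For experimenting outside the benchmark suite, the kernel can be cut down to a self-contained sketch like the one below. Several things here are assumptions rather than the original source: the body of `custom_constant_add` is guessed (a constant add of 10, which matches the `leal 10(...)` in the listings), the timer, label and result check are dropped, and `iterations` is passed as a parameter instead of being a global.

```cpp
#include <cassert>
#include <cstdint>

// Assumed stand-in for the benchmark's shifter: adds the constant 10.
// The real definition lives in the attached test and may differ.
template <typename T>
struct custom_constant_add {
    static T do_shift(T input) { return static_cast<T>(input + T(10)); }
};

// Reduced kernel: same loop nest as test_constant in the comment above,
// without timing or result checking.
template <typename T, typename Shifter>
std::uint32_t test_constant(const T* first, int count, int iterations) {
    std::uint32_t rv = 0;
    for (int i = 0; i < iterations; ++i) {
        T result = 0;
        for (int n = 0; n < count; ++n) {
            // For T = int8_t this promotes both operands to int, adds,
            // and converts the sum back to T.
            result += Shifter::do_shift(first[n]);
        }
        rv += result;  // a negative int8_t result wraps in the uint32_t sum
    }
    return rv;
}
```

A call such as `test_constant<std::int8_t, custom_constant_add<std::int8_t> >(data, count, iterations)` instantiates the int8_t case discussed in this bug.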
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #25 from davidxl xinliangli at gmail dot com 2011-10-24 23:02:14 UTC --- Created attachment 25600 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25600 test case for 47 Note that with gcc46, the result is even slower -- it has the RAT stall problem which is fixed in 47. David
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #15 from davidxl xinliangli at gmail dot com 2011-10-21 23:02:16 UTC ---
(In reply to comment #14)
> (In reply to comment #13)
> > David, it looks like we are seeing different things with v4.7... See my
> > comment 11 - I am still observing the slowdown. Do you have access to v4.1
> > and v4.6? Could you try reproducing my test please?
>
> Sorry for the delay -- I am pretty swamped these days (till mid October). I
> will try to look at the problem more then.
> David

I still can not reproduce the problem with the trunk compiler:

rv=4282167296

test    description             absolute  operations  ratio with
number                          time      per second  test0
 0      int8_t constant add     1.09 sec  1467.89 M   1.00

Total absolute time for int8_t constant folding: 1.09 sec

Can you attach the output of -v, the assembly file produced with -fverbose-asm -dA, and the optimized dump file produced with -fdump-tree-optimized-blocks, all using the trunk compiler?

thanks,
David
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #13 from oleg at smolsky dot net 2011-09-15 16:53:26 UTC --- David, it looks like we are seeing different things with v4.7... See my comment 11 - I am still observing the slowdown. Do you have access to v4.1 and v4.6? Could you try reproducing my test please?
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #14 from davidxl xinliangli at gmail dot com 2011-09-15 17:28:10 UTC ---
(In reply to comment #13)
> David, it looks like we are seeing different things with v4.7... See my
> comment 11 - I am still observing the slowdown. Do you have access to v4.1
> and v4.6? Could you try reproducing my test please?

Sorry for the delay -- I am pretty swamped these days (till mid October). I will try to look at the problem more then.
David
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182
Matt Hargett matt at use dot net changed:

           What    |Removed      |Added
                 CC|             |matt at use dot net

--- Comment #12 from Matt Hargett matt at use dot net 2011-08-30 20:30:15 UTC ---
Can you determine which release introduced the regression?
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182
Jakub Jelinek jakub at gcc dot gnu.org changed:

           What    |Removed      |Added
                 CC|             |jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek jakub at gcc dot gnu.org 2011-08-25 08:55:42 UTC ---
The bug report is incomplete: I don't see anywhere where you state which g++ options were measured, what CPU it was on, whether it is -m32 or -m64, etc.

For me, on an i7-2600 CPU, 4.6.0 (both Fedora 4.6.0-10 and the 20110727 4.6 branch snapshot) is actually much faster than current trunk with -O3 -m64:

4.6.* gives roughly
 0 int8_t constant add 0.84 sec 1904.76 M 1.00
while trunk gives
 0 int8_t constant add 1.26 sec 1269.84 M 1.00
4.4.* gives also
 0 int8_t constant add 1.26 sec 1269.84 M 1.00
4.3.* gives
 0 int8_t constant add 1.26 sec 1269.84 M 1.00
4.2.* gives
 0 int8_t constant add 0.84 sec 1904.76 M 1.00
and 4.1.* doesn't compile, because the source has been preprocessed and the STL is dependent on the compiler version.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #5 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-25 15:19:57 UTC --- Created attachment 25103 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25103 The same test preprocessed with g++ 4.1
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #6 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-25 15:25:49 UTC ---
Oh, the settings and things were discussed in the mail thread... Here is the digest:

I have compiled and run a set of C++ benchmarks on a CentOS4/64 box using the following compilers:
a) g++4.1 that is available for this distro (GCC version 4.1.2 20071124 (Red Hat 4.1.2-42))
b) g++4.6 that I built (stock version 4.6.1)

I built the compiler with all the default options (it just has a distinct installation path):

../gcc-%{version}/configure --prefix=/work/tools/gcc46 --enable-languages=c,c++ --with-system-zlib --with-mpfr=/work/tools/mpfr24 --with-gmp=/work/tools/gmp --with-mpc=/work/tools/mpc LD_LIBRARY_PATH=/work/tools/mpfr/lib24:/work/tools/gmp/lib:/work/tools/mpc/lib

Tests were compiled with -O2 and -O3; I later added -march=native to the 4.6 builds.

The processor is Intel quad core something:

processor: 0
vendor_id: GenuineIntel
cpu family: 6
model: 15
model name: Genuine Intel(R) CPU @ 2.40GHz
stepping: 4
cpu MHz: 2393.943
cache size: 4096 KB
physical id: 0
siblings: 4
core id: 0
cpu cores: 4
fpu: yes
fpu_exception: yes
cpuid level: 10
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm pni monitor ds_cpl tm2 cx16 xtpr lahf_lm
bogomips: 4793.09
clflush size: 64
cache_alignment: 64
address sizes: 36 bits physical, 48 bits virtual
power management:
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #7 from H.J. Lu hjl.tools at gmail dot com 2011-08-25 15:58:08 UTC ---
(In reply to comment #6)
> The processor is Intel quad core something:
> processor: 0
> vendor_id: GenuineIntel
> cpu family: 6
> model: 15
> model name: Genuine Intel(R) CPU @ 2.40GHz
> stepping: 4

Are you using an engineering sample? It doesn't look like a production processor.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #8 from davidxl xinliangli at gmail dot com 2011-08-25 16:17:10 UTC --- gcc46 and gcc47 difference can be reproduced using -O2 -m64. David
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #9 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-25 16:26:05 UTC ---
AFAIK it's a production processor, a couple of years old. From x86info:

Family: 6 Model: 15 Stepping: 4 Type: 0 Brand: 0
CPU Model: Core 2 Duo E6600
Original OEM
Feature flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh ds acpi mmx fxsr sse sse2 ss ht tm pbe sse3 monitor ds-cpl vmx tm2 ssse3 cx16 xTPR
Extended feature flags: SYSCALL xd em64t lahf_lm
Cache info
 L1 Instruction cache: 32KB, 8-way associative. 64 byte line size.
 L1 Data cache: 32KB, 8-way associative. 64 byte line size.
 L3 unified cache: 4MB, 16-way associative. 64 byte line size.
TLB info
 Instruction TLB: 4x 4MB page entries, or 8x 2MB pages entries, 4-way assoc..
 Instruction TLB: 4K pages, 4-way associative, 128 entries.
 Data TLB: 4MB pages, 4-way associative, 32 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 Data TLB: 4K pages, 4-way associative, 256 entries.
 Data TLB: 4MB pages, 4-way associative, 32 entries
 64 byte prefetching.
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 Data TLB: 4K pages, 4-way associative, 256 entries.
The physical package supports 4 logical processors
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #10 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-25 22:08:49 UTC ---
BTW, the uint16_t test also got slower for the very same reason. Here is the innermost loop generated by g++4.6:

.text:00400DA0 loc_400DA0:
.text:00400DA0         add     eax, 0Ah
.text:00400DA3         add     ax, [rdx]
.text:00400DA6         add     rdx, 2
.text:00400DAA         cmp     rdx, 5092E0h
.text:00400DB1         jnz     short loc_400DA0
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #11 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-26 00:48:02 UTC ---
Also, I have just built the same suite with GCC version 4.7 that came from ftp://gcc.gnu.org/pub/gcc/snapshots/4.7-20110820/gcc-4.7-20110820.tar.bz2 and the performance degradation remains:

gcc41: 0 int8_t constant add 1.35 sec 1185.19 M 1.00
gcc47: 0 int8_t constant add 2.37 sec 675.11 M 1.00

Note: these are the original unmodified tests, not my digested derivatives.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #1 from Oleg Smolsky oleg.smolsky at gmail dot com 2011-08-24 22:13:26 UTC --- Created attachment 25097 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=25097 The test case This is the preprocessed source for the test discussed in the mail thread.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182
davidxl xinliangli at gmail dot com changed:

           What    |Removed      |Added
                 CC|             |xinliangli at gmail dot com

--- Comment #2 from davidxl xinliangli at gmail dot com 2011-08-24 23:15:44 UTC ---
The problem is fixed in the trunk compiler:

1) with the 4.6 compiler:

test    description             absolute  operations  ratio with
number                          time      per second  test0
 0      int8_t constant add     3.29 sec  486.32 M    1.00

RAT_STALLS.registers = 288249 (sampling count 10001)

2) with the trunk compiler:

test    description             absolute  operations  ratio with
number                          time      per second  test0
 0      int8_t constant add     1.34 sec  1194.03 M   1.00

No partial register stalls from user functions.

Inner loop from the trunk compiler:

.L55:
        movzbl  0(%rbp,%rcx), %r9d
        addq    $1, %rcx
        cmpl    %ecx, %ebx
        leal    10(%r8,%r9), %r8d
        jg      .L55

Inner loop from the 46 compiler:

.L43:
        addl    $10, %eax
        addb    (%rdx), %al
        addq    $1, %rdx
        cmpq    $data8+8000, %rdx
        jne     .L43

RAT stalls (not a precise event, so the instruction causing the stalls is a little off):

                :  400e27: nopw   0x0(%rax,%rax,1)
   127   0.0440 :  400e30: add    $0xa,%eax
  5869   2.0330 :  400e33: add    (%rdx),%al
282125  97.7263 :  400e35: add    $0x1,%rdx
                :  400e39: cmp    $0x404560,%rdx
                :  400e40: jne    400e30 <main+0xd0>

David
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182 --- Comment #3 from davidxl xinliangli at gmail dot com 2011-08-25 00:13:00 UTC ---
Caused by differences in FE generated code:

46:

D.6887 = (int) D.6886;
D.6888 = custom_constant_add<signed char>::do_shift (D.6887);
D.6889 = (unsigned char) D.6888;
result.8 = (unsigned char) result;
D.6891 = D.6889 + result.8;
result = (signed char) D.6891;
n = n + 1;

trunk:

D.6938 = (int) D.6937;
D.6874 = custom_constant_add<signed char>::do_shift (D.6938);
D.6939 = (int) result;        <-- promoted to int
D.6940 = (int) D.6874;        <-- promoted to int
D.6941 = D.6939 + D.6940;
result = (signed char) D.6941;
n = n + 1;