[Bug tree-optimization/45241] [4.5/4.6 Regression] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre
--- Comment #9 from changpeng dot fang at amd dot com 2010-08-30 16:37 --- Review approval for the trunk: http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00931.html Review Approval for 4.5 branch: http://gcc.gnu.org/ml/gcc-patches/2010-08/msg02112.html -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241
[Bug tree-optimization/45241] [4.5/4.6 Regression] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre
--- Comment #10 from changpeng dot fang at amd dot com 2010-08-30 16:39 --- r163207 - in /trunk/gcc: ChangeLog testsuite/Ch... * From: cfang at gcc dot gnu dot org * To: gcc-cvs at gcc dot gnu dot org * Date: Thu, 12 Aug 2010 22:18:34 - * Subject: r163207 - in /trunk/gcc: ChangeLog testsuite/Ch... Author: cfang Date: Thu Aug 12 22:18:32 2010 New Revision: 163207 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163207 Log: pr45241 give up dot_prod pattern searching if stmt is outside the loop. * tree-vect-patterns.c (vect_recog_dot_prod_pattern): Give up dot_prod pattern searching if a stmt is outside the loop. * gcc.dg/vect/no-tree-pre-pr45241.c: New. Added: trunk/gcc/testsuite/gcc.dg/vect/no-tree-pre-pr45241.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-vect-patterns.c -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241
[Bug tree-optimization/45241] [4.5/4.6 Regression] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre
--- Comment #11 from changpeng dot fang at amd dot com 2010-08-30 16:40 --- r163286 - in /branches/gcc-4_5-branch/gcc: Chan... * From: cfang at gcc dot gnu dot org * To: gcc-cvs at gcc dot gnu dot org * Date: Mon, 16 Aug 2010 21:02:30 - * Subject: r163286 - in /branches/gcc-4_5-branch/gcc: Chan... Author: cfang Date: Mon Aug 16 21:02:29 2010 New Revision: 163286 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163286 Log: pr45241 give up dot_prod pattern searching if stmt is outside the loop. * tree-vect-patterns.c (vect_recog_dot_prod_pattern): Give up dot_prod pattern searching if a stmt is outside the loop. * gcc.dg/vect/no-tree-pre-pr45241.c: New. Added: branches/gcc-4_5-branch/gcc/testsuite/gcc.dg/vect/no-tree-pre-pr45241.c Modified: branches/gcc-4_5-branch/gcc/ChangeLog branches/gcc-4_5-branch/gcc/testsuite/ChangeLog branches/gcc-4_5-branch/gcc/tree-vect-patterns.c -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241
[Bug tree-optimization/45241] [4.5/4.6 Regression] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre
--- Comment #12 from changpeng dot fang at amd dot com 2010-08-30 16:41 --- Fixed! -- changpeng dot fang at amd dot com changed: What|Removed |Added Status|NEW |RESOLVED Resolution||FIXED http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241
[Bug target/45391] CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop
--- Comment #5 from changpeng dot fang at amd dot com 2010-08-24 22:13 --- For the test case in comment #2, if we do not vectorize the loop, the unroll_factor is incorrectly determined as 1; the insn-to-prefetch ratio (4) then prevents prefetching, and thus there is no performance regression. If we vectorize the loop, the prefetch_mod is smaller than the upper bound, so the unroll_factor is determined as 4; at that point the insn-to-prefetch ratio is large enough to allow prefetches, hence the 5% regression on 482.sphinx3. This regression would also have occurred with -fno-tree-vectorize if the unroll factor were set correctly. The actual problem is the unrolling itself: there is no regression if I just insert the prefetches and do not unroll the loop at all. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391
[Bug tree-optimization/45260] [4.5/4.6 Regression] g++4.5: -prefetch-loop-arrays internal compiler error: in verify_expr, at tree-cfg.c:2541
--- Comment #6 from changpeng dot fang at amd dot com 2010-08-23 18:59 --- Committed to trunk as Revision: 163475: http://gcc.gnu.org/ml/gcc-cvs/2010-08/msg00688.html Committed to 4.5 branch as Revision: 163483 http://gcc.gnu.org/ml/gcc-cvs/2010-08/msg00696.html -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45260
[Bug c/45389] New: CPU2006 cactusADM: gcc 4.6 15% regression from 4.5
On an AMD amdfam10 system, gcc 4.5 (892s) is 15% faster than gcc 4.6 (1026s) with the following settings:

4.6: gcc version 4.6.0 20100812 (experimental) (GCC)
COPTIMIZE = -Ofast -funroll-all-loops -fno-tree-pre --param prefetch-latency=700 -mveclibabi=acml -m64 -march=amdfam10
FOPTIMIZE = -Ofast -funroll-all-loops -fno-tree-pre -mveclibabi=acml -m64 -march=amdfam10
EXTRA_LDFLAGS = -L$(ACML_DIR) -lacml_mv

4.5: gcc version 4.5.2 20100818 (prerelease) (GCC)
COPTIMIZE = -O3 -ffast-math -funroll-all-loops -fno-tree-pre -fprefetch-loop-arrays --param prefetch-latency=700 -mveclibabi=acml -m64 -march=amdfam10
FOPTIMIZE = -O3 -ffast-math -funroll-all-loops -fno-tree-pre -mveclibabi=acml -m64 -march=amdfam10
EXTRA_LDFLAGS = -L$(ACML_DIR) -lacml_mv

NOTE that for gcc 4.6, -Ofast = -O3 -ffast-math, and -fprefetch-loop-arrays is turned on at -O3. Also, acml4.4.0 is used for both tests. -- Summary: CPU2006 cactusADM: gcc 4.6 15% regression from 4.5 Product: gcc Version: 4.6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45389
[Bug c/45390] New: CPU2006 434.zeusmp: gcc 4.6 7% regression from gcc 4.5
On an AMD amdfam10 system, gcc 4.5 (713s) is 7% faster than gcc 4.6 (763s) with the following settings:

4.6: gcc version 4.6.0 20100812 (experimental) (GCC)
FOPTIMIZE = -Ofast -funroll-all-loops -fno-tree-pre -mveclibabi=acml -m64 -march=amdfam10
EXTRA_LDFLAGS = -L$(ACML_DIR) -lacml_mv

4.5: gcc version 4.5.2 20100818 (prerelease) (GCC)
COPTIMIZE = -O3 -ffast-math -funroll-all-loops -fno-tree-pre
FOPTIMIZE = -O3 -ffast-math -funroll-all-loops -fno-tree-pre -mveclibabi=acml -m64 -march=amdfam10
EXTRA_LDFLAGS = -L$(ACML_DIR) -lacml_mv

NOTE that for gcc 4.6, -Ofast = -O3 -ffast-math, and -fprefetch-loop-arrays is turned on at -O3. Also, acml4.4.0 is used for both tests. -- Summary: CPU2006 434.zeusmp: gcc 4.6 7% regression from gcc 4.5 Product: gcc Version: 4.6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45390
[Bug target/45391] New: CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop
With gcc-4.6 -Ofast -funroll-all-loops -fno-tree-pre -mveclibabi=acml -m64 -march=amdfam10, sphinx3 runs 5% slower than with gcc-4.6 -Ofast -funroll-all-loops -fno-prefetch-loop-arrays -fno-tree-pre -mveclibabi=acml -m64 -march=amdfam10. Prefetching does not cause any slowdown if the vectorizer is turned off, or with -fno-fast-math. I believe the related loops are the reduction loops whose vectorization was enabled by the following commit: http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00277.html -- Summary: CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop Product: gcc Version: 4.6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391
[Bug target/45391] CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop
--- Comment #2 from changpeng dot fang at amd dot com 2010-08-24 00:03 ---

float f (float *x, float *y, float *z, unsigned n)
{
  float ret = 0.0;
  unsigned i;
  for (i = 0; i < n; i++)
    {
      float diff = x[i] - y[i];
      ret -= diff * diff * z[i];
    }
  return ret;
}

No, this is related to PR 45022 in a certain sense, but the underlying reason is yet unknown. For the above test case, if I compile with -O3 -march=amdfam10 -m64, the loop is not vectorized due to the floating point reduction. To my surprise, no prefetch is generated: the cost model filtered out the prefetches (we are trying to prefetch each of the three memory references):

Ahead 15, unroll factor 1, trip count -1
insn count 14, mem ref count 3, prefetch count 3
Not prefetching -- instruction to prefetch ratio (4) too small

However, if we compile with -O3 -ffast-math -march=amdfam10 -m64, the loop can be vectorized, and one of the array references is aligned. As a result, and due to PR 45022, we try to prefetch only the aligned reference, and one prefetch is inserted (this time the insn-to-prefetch ratio is big enough). The fix for PR 45022 will actually result in NO prefetch being generated, and will thus hide the problem. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391
[Bug target/45391] CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop
--- Comment #3 from changpeng dot fang at amd dot com 2010-08-24 00:22 --- I checked with open64 and did not find any regression. For the above test case, open64 generates 3 non-temporal prefetches. As a result, I am guessing that we are just unlucky: the prefetch kicks useful data out of the cache for such streaming accesses (gcc generates one prefetcht0):

.Lt_0_6402:  #loop Loop body line 8, nesting depth: 1, estimated iterations: 1000
        .loc 1 7 0
        movss 0(%r10),%xmm0     # [0] id:67
        subss 0(%r9),%xmm0      # [3]
        .loc 1 8 0
        mulss %xmm0,%xmm0       # [9]
        mulss 0(%rax),%xmm0     # [13]
        .loc 1 7 0
        prefetchnta 128(%r10)   # [17] L1
        prefetchnta 128(%r9)    # [17] L1
        .loc 1 8 0
        addq $4,%rax            # [17]
        addq $4,%r10            # [18]
        addq $4,%r9             # [18]
        cmpq %r11,%rax          # [18]
        prefetchnta 124(%rax)   # [19] L1
        subss %xmm0,%xmm1       # [19]
        jle .Lt_0_6402          # [19]

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391
[Bug target/45391] CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop
--- Comment #4 from changpeng dot fang at amd dot com 2010-08-24 00:46 --- Oops -- the open64-generated code posted in the last comment is for the non-vectorized loop; the vectorized one is similar:

.LBB23_f:
        .loc 1 7 0
        movups 0(%r10),%xmm3    # [0] id:65
        movups 0(%rax),%xmm1    # [1] id:64
        subps %xmm3,%xmm1       # [3]
        .loc 1 8 0
        mulps %xmm1,%xmm1       # [7]
        movups 0(%r9),%xmm2     # [9] id:66
        mulps %xmm2,%xmm1       # [11]
        addq $16,%rax           # [13]
        addq $16,%r9            # [14]
        addq $16,%r10           # [14]
        .loc 1 7 0
        prefetchnta 112(%rax)   # [14] L1
        prefetchnta 112(%r10)   # [15] L1
        .loc 1 8 0
        prefetchnta 112(%r9)    # [15] L1
        subps %xmm1,%xmm0       # [15]

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391
[Bug tree-optimization/45260] [4.5/4.6 Regression] g++4.5: -prefetch-loop-arrays internal compiler error: in verify_expr, at tree-cfg.c:2541
--- Comment #5 from changpeng dot fang at amd dot com 2010-08-20 22:48 --- I have a fix: http://gcc.gnu.org/ml/gcc-patches/2010-08/msg01625.html -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45260
[Bug middle-end/44206] [4.6 Regression] ICE: Inline clone with address taken
--- Comment #3 from changpeng dot fang at amd dot com 2010-08-18 19:43 --- *** Bug 45269 has been marked as a duplicate of this bug. *** -- changpeng dot fang at amd dot com changed: What|Removed |Added CC||changpeng dot fang at amd ||dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44206
[Bug c++/45269] CPU2006 450.soplex: verify_cgraph_node failed with -fprofile-generate
--- Comment #2 from changpeng dot fang at amd dot com 2010-08-18 19:43 --- http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00406.html Verified. If I back out the above change, the bug goes away. So it is a duplicate of bug 44206 *** This bug has been marked as a duplicate of 44206 *** -- changpeng dot fang at amd dot com changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution||DUPLICATE http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45269
[Bug tree-optimization/45260] [4.5/4.6 Regression] g++4.5: -prefetch-loop-arrays internal compiler error: in verify_expr, at tree-cfg.c:2541
--- Comment #4 from changpeng dot fang at amd dot com 2010-08-16 22:39 --- This bug should be related to VIEW_CONVERT_EXPR. If I use the following statement to filter out such prefetch candidates, the bug goes away:

  if (contains_view_convert_expr_p (ref))
    return false;

Otherwise, the prefetch pass generates ref + offset as the prefetching address. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45260
[Bug c/45268] New: CPU2006 458.sjeng: type mismatch in array reference with -fwhole-program -combine
458.sjeng compilation fails with the following config options (fails with gcc4.6, passes with gcc4.4; gcc4.5 not tested yet):

458.sjeng=peak=default:
ONESTEP = yes
COPTIMIZE = -fwhole-program -combine -march=amdfam10 -m64
PORTABILITY = -DSPEC_CPU_LP64
feedback = 0

Here is the message:

specmake build 2> make.err | tee make.out
/usr/local/bin/gcc -DSPEC_CPU -DNDEBUG -fwhole-program -combine -march=amdfam10 -m64 -DSPEC_CPU_LP64 attacks.c book.c crazy.c draw.c ecache.c epd.c eval.c leval.c moves.c neval.c partner.c proof.c rcfile.c search.c see.c seval.c sjeng.c ttable.c utils.c -o sjeng
sjeng.c: In function 'main':
sjeng.c:75:5: error: type mismatch in array reference
struct move_x
struct move_x
game_history_x[move_number.324] = path_x[0];
sjeng.c:75:5: error: type mismatch in array reference
struct move_x
struct move_x
game_history_x[move_number.390] = path_x[0];
sjeng.c:75:5: error: type mismatch in array reference
struct move_x
struct move_x
path_x[0] = game_history_x[move_number.428];
sjeng.c:75:5: error: type mismatch in array reference
struct move_x
struct move_x
path_x[0] = game_history_x[move_number.435];
sjeng.c:75:5: error: type mismatch in array reference
struct move_x
struct move_x
path_x[0] = game_history_x[move_number.439];
sjeng.c:75:5: internal compiler error: verify_gimple failed

-- Summary: CPU2006 458.sjeng: type mismatch in array reference with -fwhole-program -combine Product: gcc Version: 4.6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45268
[Bug c++/45269] New: CPU2006 450.soplex: verify_cgraph_node failed with -fprofile-generate
With gcc 4.6 on X86, 450.soplex ICEs with -fprofile-generate in spxmpsread.cc:

g++ -c -o spxmpsread.o -DSPEC_CPU -DNDEBUG -fprofile-generate -O2 -m64 -DSPEC_CPU_LP64 spxmpsread.cc
spxmpsread.cc:678:1: error: Inline clone with address taken
std::basic_ostream<_CharT, _Traits>& std::endl(std::basic_ostream<_CharT, _Traits>&) [with _CharT = char, _Traits = std::char_traits<char>]/276(-1) @0x7fafaf623000 (asm: _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_) (inline copy in virtual bool soplex::SPxLP::readMPS(std::istream&, soplex::NameSet*, soplex::NameSet*, soplex::DIdxSet*)/728) availability:local analyzed 71 time, 13 benefit (100 after inlining) 35 size, 4 benefit (75 after inlining) address_taken body local finalized inlinable
  called by: void soplex::_ZN6soplexL8readRowsERNS_8MPSInputERNS_8LPRowSetERNS_7NameSetE.constprop.9(soplex::MPSInput&, soplex::LPRowSet&, soplex::NameSet&)/268 (0.01 per call) (inlined) (can throw external)
  calls: built-in/722 (0.01 per call) std::basic_ios<_CharT, _Traits>::char_type std::basic_ios<_CharT, _Traits>::widen(char) const [with _CharT = char, _Traits = std::char_traits<char>, std::basic_ios<_CharT, _Traits>::char_type = char]/277 (inlined) (0.01 per call) (can throw external) std::basic_ostream<_CharT, _Traits>& std::basic_ostream<_CharT, _Traits>::put(std::basic_ostream<_CharT, _Traits>::char_type) [with _CharT = char, _Traits = std::char_traits<char>, std::basic_ostream<_CharT, _Traits>::char_type = char]/837 (0.01 per call) (can throw external) std::basic_ostream<_CharT, _Traits>& std::basic_ostream<_CharT, _Traits>::flush() [with _CharT = char, _Traits = std::char_traits<char>]/840 (0.01 per call) (can throw external)
  References: var:long int* __gcov_indirect_call_counters (read) var:void* __gcov_indirect_call_callee (read) var:long int *.LPBX1 [427] (write) var:void* __gcov_indirect_call_callee (write) var:long int *.LPBX1 [427] (read) var:long int *.LPBX1 [427] (write) var:long int *.LPBX1 [427] (read) var:long int *.LPBX1 [427] (write) var:long int *.LPBX1 [427] (read) var:long int *.LPBX1 [427] (write) var:long int *.LPBX1 [427] (read)
  Refering this function: fn:void soplex::_ZN6soplexL10readBoundsERNS_8MPSInputERNS_8LPColSetERNS_7NameSetEPNS_7DIdxSetE.constprop.13(soplex::MPSInput&, soplex::LPColSet&, soplex::NameSet&, soplex::DIdxSet*)/595 (addr) fn:void soplex::_ZN6soplexL10readRangesERNS_8MPSInputERNS_8LPRowSetERNS_7NameSetE.constprop.12(soplex::MPSInput&, soplex::LPRowSet&, soplex::NameSet&)/481 (addr) fn:void soplex::_ZN6soplexL7readRhsERNS_8MPSInputERNS_8LPRowSetERNS_7NameSetE.constprop.11(soplex::MPSInput&, soplex::LPRowSet&, soplex::NameSet&)/260 (addr) fn:void soplex::_ZN6soplexL8readRowsERNS_8MPSInputERNS_8LPRowSetERNS_7NameSetE.constprop.9(soplex::MPSInput&, soplex::LPRowSet&, soplex::NameSet&)/268 (addr) fn:void soplex::_ZN6soplexL8readNameERNS_8MPSInputE.constprop.7(soplex::MPSInput&)/369 (addr)
spxmpsread.cc:678:1: internal compiler error: verify_cgraph_node failed
Please submit a full bug report, with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions

-- Summary: CPU2006 450.soplex: verify_cgraph_node failed with -fprofile-generate Product: gcc Version: 4.6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45269
[Bug c/45270] New: CPU2006 435.gromacs: Segmentation fault with -fprofile-generate
With gcc 4.6 on x86, 435.gromacs hits a segmentation fault with -fprofile-generate in constr.c:

gcc -c -DSPEC_CPU -DNDEBUG -I. -DHAVE_CONFIG_H -fprofile-generate -O2 -m64 -DSPEC_CPU_LP64 constr.c
constr.c: In function 'count_constraints':
constr.c:624:5: internal compiler error: Segmentation fault

-- Summary: CPU2006 435.gromacs: Segmentation fault with -fprofile-generate Product: gcc Version: 4.6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45270
[Bug tree-optimization/45260] [4.5/4.6 Regression] g++4.5: -prefetch-loop-arrays internal compiler error: in verify_expr, at tree-cfg.c:2541
--- Comment #3 from changpeng dot fang at amd dot com 2010-08-12 00:38 --- (In reply to comment #2)
> It was caused by revision 153878:
> http://gcc.gnu.org/ml/gcc-cvs/2009-11/msg00094.html

I think the same patch was also committed to the 4.4 branch. Maybe some prefetch work in 4.5 triggered the bug.

> and disappeared with revision 159514:
> http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00566.html

I am not sure that it really fixed the bug. It could not be a valid fix, because it just disables some prefetches based on performance concerns. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45260
[Bug tree-optimization/45241] [4.5/4.6 Regression] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre
--- Comment #7 from changpeng dot fang at amd dot com 2010-08-10 21:44 --- (In reply to comment #5)
> (In reply to comment #1)
> > This patch should be a valid fix, because the recognition of the dot_prod
> > pattern is known to fail at this point if the stmt is outside the loop.
> > (I am not sure whether we should see this case in the vectorizer at this
> > point -- should previous analysis already have filtered it out?)
> I don't understand this. Where do we check if the stmt (which one?) is
> outside the loop?

Forget about this part of the comment (the vectorization analysis is correct; it is just that the pattern recognition traces the chain outside the loop).

> I was looking at PR 45239 and didn't notice that there is another PR and
> didn't see this comment. So I tested the same fix (successfully on
> x86_64-suse-linux). You can commit it if you like (just please notice that
> the bug exists on 4.5 as well).

I am going to add your testcase (in comment #4), do a bootstrap, and then commit to the trunk and the gcc 4.5 branch. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241
[Bug tree-optimization/45239] New: CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre
With gcc 4.6:

gfortran -c -o diis.fppized.o -O3 -fno-tree-pre -march=amdfam10 -m64 diis.fppized.f90
diis.fppized.f90: In function 'extrapolate':
diis.fppized.f90:882:0: internal compiler error: vector VEC(vec_void_p,base) index domain error, in vinfo_for_stmt at tree-vectorizer.h:595
Please submit a full bug report, with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions

This is invoked in vect_recog_dot_prod_pattern:

  stmt_vinfo = vinfo_for_stmt (stmt);

where stmt is not inside the loop, and thus stmt_vinfo was not set up. -- Summary: CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45239
[Bug tree-optimization/45241] New: CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre
With gcc 4.6:

gfortran -c -o diis.fppized.o -O3 -fno-tree-pre -march=amdfam10 -m64 diis.fppized.f90
diis.fppized.f90: In function 'extrapolate':
diis.fppized.f90:882:0: internal compiler error: vector VEC(vec_void_p,base) index domain error, in vinfo_for_stmt at tree-vectorizer.h:595
Please submit a full bug report, with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions

This is invoked in vect_recog_dot_prod_pattern:

  stmt_vinfo = vinfo_for_stmt (stmt);

where stmt is not inside the loop, and thus stmt_vinfo was not set up. -- Summary: CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241
[Bug tree-optimization/45241] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre
--- Comment #1 from changpeng dot fang at amd dot com 2010-08-09 17:52 --- This patch should be a valid fix, because the recognition of the dot_prod pattern is known to fail at this point if the stmt is outside the loop. (I am not sure whether we should see this case in the vectorizer at this point -- should previous analysis already have filtered it out?):

diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 19f0ae6..5f81a73 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -259,6 +259,10 @@ vect_recog_dot_prod_pattern (gimple last_stmt, tree *type_in, tree *type_out)
      inside the loop (in case we are analyzing an outer-loop).  */
   if (!is_gimple_assign (stmt))
     return NULL;
+
+  if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+    return NULL;
+
   stmt_vinfo = vinfo_for_stmt (stmt);
   gcc_assert (stmt_vinfo);
   if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_internal_def)

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241
[Bug tree-optimization/45022] No prefetch for the vectorized loop
--- Comment #4 from changpeng dot fang at amd dot com 2010-07-29 19:14 --- (In reply to comment #1)
> The misaligned indirect-refs will vanish soon.

I saw your patch that removes ALIGNED_INDIRECT_REF. Do you also plan to remove MISALIGNED_INDIRECT_REF? Thanks. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45022
[Bug tree-optimization/45021] Redundant prefetches for some loops (vectorizer produced ones too)
--- Comment #4 from changpeng dot fang at amd dot com 2010-07-28 18:22 --- Andrew's example is exactly what the prefetch pass sees for the test case (in the bug description). Unfortunately, the prefetch pass could not recognize that vect_pa.6_24 and vect_pa.20_38 are exactly the same address:

<bb 2>:
  pretmp.2_18 = (float) beta_4(D);
  vect_pa.9_22 = (vector(4) float *) &a;
  vect_pa.6_23 = vect_pa.9_22;
  vect_cst_.12_27 = {pretmp.2_18, pretmp.2_18, pretmp.2_18, pretmp.2_18};
  vect_pb.16_29 = (vector(4) float *) &b;
  vect_pb.13_30 = vect_pb.16_29;
  vect_pa.23_36 = (vector(4) float *) &a;
  vect_pa.20_37 = vect_pa.23_36;

<bb 3>:
  # vect_pa.6_24 = PHI <vect_pa.6_25(4), vect_pa.6_23(2)>
  # vect_pb.13_31 = PHI <vect_pb.13_32(4), vect_pb.13_30(2)>
  # vect_pa.20_38 = PHI <vect_pa.20_39(4), vect_pa.20_37(2)>
  # ivtmp.24_40 = PHI <ivtmp.24_41(4), 0(2)>
  vect_var_.10_26 = *vect_pa.6_24;
  vect_var_.11_28 = vect_cst_.12_27;
  vect_var_.17_33 = *vect_pb.13_31;
  vect_var_.18_34 = vect_var_.11_28 * vect_var_.17_33;
  vect_var_.19_35 = vect_var_.10_26 + vect_var_.18_34;
  *vect_pa.20_38 = vect_var_.19_35;
  vect_pa.6_25 = vect_pa.6_24 + 16;
  vect_pb.13_32 = vect_pb.13_31 + 16;
  vect_pa.20_39 = vect_pa.20_38 + 16;
  ivtmp.24_41 = ivtmp.24_40 + 1;
  if (ivtmp.24_41 < 256)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  goto <bb 3>;

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021
[Bug tree-optimization/45021] Redundant prefetches for some loops (vectorizer produced ones too)
--- Comment #5 from changpeng dot fang at amd dot com 2010-07-28 18:28 --- Things get a little more complicated if we change the code to: a[i] = a[i+1] + beta * b[i]; The prefetch pass wants to group a[i] and a[i+1], i.e. they have the same base address with an offset of 4 bytes. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021
[Bug tree-optimization/45022] No prefetch for the vectorized loop
--- Comment #2 from changpeng dot fang at amd dot com 2010-07-22 20:52 --- (In reply to comment #1)
> The misaligned indirect-refs will vanish soon.

From the prefetching point of view, is there any reason that we cannot prefetch mis-aligned or indirect refs? I understand that prefetching for indirect refs may be too aggressive. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45022
[Bug tree-optimization/45021] New: Redundant prefetches for the vectorized loop
For the following test case, prefetches are inserted for both the load and the store of a[i] if the loop is vectorized:

float a[1024], b[1024];

void foo(int beta)
{
  int i;
  for (i = 0; i < 1024; i++)
    a[i] = a[i] + beta * b[i];
}

With gcc -O3 -fprefetch-loop-arrays -march=amdfam10 -S, a piece of the assembly is:

        movaps  (%rcx), %xmm0
        addl    $4, %edi
        prefetcht0      (%rdx)
        prefetcht0      240(%rcx)
        prefetchw       (%rdx)
        leaq    64(%rax), %rsi
        mulps   %xmm1, %xmm0

If we don't vectorize the loop, we generate a prefetch only for the load of a[i]:

        addl    $16, %eax
        salq    $2, %rcx
        mulss   %xmm1, %xmm0
        prefetcht0      a+92(%rcx)
        prefetcht0      b+92(%rcx)
        movl    %esi, %ecx

-- Summary: Redundant prefetches for the vectorized loop Product: gcc Version: 4.6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021
[Bug tree-optimization/45022] New: No prefetch for the vectorized loop
For the following test case, if we compile with -O3 -fprefetch-loop-arrays -march=amdfam10, the loop is versioned (for runtime alias checking) and vectorized. However, we see prefetches in the non-vectorized version, but not in the vectorized version.

void foo(int beta, float *a, float *b)
{
  int i;
  for (i = 0; i < 1024; i++)
    a[i] = a[i] + beta * b[i];
}

For the vectorized loop, in tree-ssa-loop-prefetch.c (idx_analyze_ref):

  if (TREE_CODE (base) == MISALIGNED_INDIRECT_REF
      || TREE_CODE (base) == ALIGN_INDIRECT_REF)
    return false;

FALSE is returned due to the mis-aligned indirect references:

  M*vect_p.18_61{misalignment: 0}
  M*vect_p.23_66{misalignment: 0}
  M*vect_p.31_74{misalignment: 0}

-- Summary: No prefetch for the vectorized loop Product: gcc Version: 4.6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45022
[Bug tree-optimization/45021] Redundant prefetches for the vectorized loop
--- Comment #1 from changpeng dot fang at amd dot com 2010-07-21 18:26 --- The direct reason is that the prefetch pass cannot tell that the base addresses of the vectorized load and store (of a[i]) are the same:

  *vect_pa.6_24
  *vect_pa.19_37

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #23 from changpeng dot fang at amd dot com 2010-07-21 21:30 --- Fixed -- changpeng dot fang at amd dot com changed: What|Removed |Added Status|NEW |RESOLVED Resolution||FIXED http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug tree-optimization/44955] New: over-prefetched for arrays of complex number
While working on the prefetching-incurred performance degradation on 168.wupwise, I found that complex arrays are always over-prefetched: prefetches are generated for both the real part and the imaginary part.

      subroutine s311 (i,j,n,m,beta,a,b)
c
c     reductions
c     sum reduction
c
      integer n, i, j, beta, m
      complex a(n,n), b(n,n)
      do 1 j = 1,n
         do 10 i = 1,m
            a(i,j) = a(i,j) + beta * b(i,j)
 10      continue
 1    continue
      return
      end

For this example, two prefetches are generated for a, and two prefetches for b. -- Summary: over-prefetched for arrays of complex number Product: gcc Version: 4.6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44955
[Bug tree-optimization/44955] over-prefetched for arrays of complex number
--- Comment #1 from changpeng dot fang at amd dot com 2010-07-15 17:20 --- This is a piece of code that shows the two prefetches for b:

        mulss   %xmm4, %xmm5
        addq    $8, %rdx
        prefetcht0      96(%r11)
        prefetcht0      100(%r11)
        subss   %xmm2, %xmm1
        addss   %xmm5, %xmm0

When collecting memory references for the loops, the array of the imaginary part is put into a different group from that of the real part (and thus two prefetches are generated):

Reference 0x2d61e70: group 0x2d63630 (base REALPART_EXPR <*b_64(D)>...
Reference 0x2d615e0: group 0x2d40f40 (base IMAGPART_EXPR <*b_64(D)>...

I think the base should be reduced to the same, with an offset of 4, so that they can be in the same group. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44955
[Bug tree-optimization/44794] pre- and post-loops should not be unrolled.
--- Comment #4 from changpeng dot fang at amd dot com 2010-07-15 01:50 --- Created an attachment (id=21205) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=21205&action=view) Do not unroll pre and post loops

I did a quick test on polyhedron before and after applying the preliminary patch. Tests are based on -O3 -fprefetch-loop-arrays -funroll-loops.

               timing (s)            |        size (B)
           before  after  % reduc    |  before   after  % reduc
cacacita    14.35  10.88    24.18    |   90715   72843    19.70
gas_dyn     34.68  21.58    37.77    |  149608  100936    32.53
nf          33.91  19.32    43.03    |  139150   83054    40.31
protein     51.35  33.23    35.29    |  163672  122808    24.97
rnflow      60.90  43.28    28.93    |  268784  169152    37.07
test_fpu    52.61  30.35    42.31    |  234045  144285    38.35

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44794
[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling
--- Comment #20 from changpeng dot fang at amd dot com 2010-07-09 01:59 --- I submitted a patch for review that completely fixes the problem. The patch is an extension of Christian's speedup.patch: it splits the cost analysis into three small functions and quits further prefetching analysis as soon as we know prefetching is not going to benefit the loop. Here is the gcc-patches@ link: http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00734.html -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576
[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling
--- Comment #19 from changpeng dot fang at amd dot com 2010-07-07 19:00 --- (In reply to comment #18)
> Changpeng, should this PR be closed now?

No. I am still looking at the dependence computation cost. I just found that most of the time is spent in memory allocation and freeing of the data dependence relation structure. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576
[Bug tree-optimization/44794] pre- and post-loops should not be unrolled.
--- Comment #2 from changpeng dot fang at amd dot com 2010-07-06 17:58 --- We also need to handle the post-loop generated by unrolling itself. Suppose the unroll factor is 16; then the post-loop can have up to 15 iterations. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44794
[Bug tree-optimization/44794] pre- and post-loops should not be unrolled.
--- Comment #3 from changpeng dot fang at amd dot com 2010-07-06 18:35 --- Here is the impact of loop unrolling on the compilation time and code size on polyhedron test_fpu.f90:

-O3 -ftree-vectorize -fno-prefetch-loop-arrays -fno-unroll-loops:
  timing: 12.62s, size: 67069 bytes
-O3 -ftree-vectorize -fprefetch-loop-arrays -funroll-loops:
  timing: 51.77s, size: 234045 bytes

I also did an experiment on prefetching where we don't unroll the pre- and post-loops generated by the vectorizer:

-O3 -ftree-vectorize -fprefetch-loop-arrays:
  timing: 29.32s, size: 92541 bytes
-O3 -ftree-vectorize -fprefetch-loop-arrays (don't unroll pre-/post-loops):
  timing: 18.34s, size: 78909 bytes
-O3 -ftree-vectorize -fno-prefetch-loop-arrays:
  timing: 12.62s, size: 67069 bytes

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44794
[Bug tree-optimization/44794] New: pre- and post-loops should not be unrolled.
void foo(int *a, int *b, int n)
{
  int i;
  for (i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}

For this simple loop, the vectorizer does its job and peels the last few iterations into a post-loop that is not vectorized. But the RTL loop unroller does not know that this post-loop only has a few iterations (at most 3 in this case), and will unroll it. What is worse, if you compile the code with:

gcc -O3 -fprefetch-loop-arrays -funroll-loops

you may find that the prefetch pass will also unroll the post-loop, and generate a new post-loop (a post-post-loop) for this post-loop. Again, the RTL loop unroller cannot recognize this post-post-loop, and will unroll it. (The RTL loop unroller will generate yet another post-loop (a post-post-post-loop) for the post-post-loop :-)) This causes compilation time and code size to increase dramatically without any performance benefit. -- Summary: pre- and post-loops should not be unrolled. Product: gcc Version: lno Status: UNCONFIRMED Severity: major Priority: P3 Component: tree-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44794
[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling
--- Comment #17 from changpeng dot fang at amd dot com 2010-07-02 23:58 --- (In reply to comment #15) I have opened PR44794 for the unrolling of pre- and post-loop issue. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576
[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling
--- Comment #15 from changpeng dot fang at amd dot com 2010-07-01 00:34 --- Unrolling of the peeled loop is partially the reason for the test_fpu.f90 compilation time and code size increase. Vectorization peels a few iterations of the loop, but the prefetching and unrolling passes do not recognize that a loop is a peeled version and still unroll it.

MODULE kinds
  INTEGER, PARAMETER :: RK8 = SELECTED_REAL_KIND(15, 300)
END MODULE kinds
!
PROGRAM TEST_FPU     ! A number-crunching benchmark using matrix inversion.
USE kinds            ! Implemented by:    David Frank  dave_fr...@hotmail.com
IMPLICIT NONE        ! Gauss routine by:  Tim Prince   n...@aol.com
                     ! Crout routine by:  James Van Buskirk  tor...@ix.netcom.com
                     ! Lapack routine by: Jos Bergervoet  berge...@iaehv.nl
REAL(RK8) :: pool(101, 101, 1000), a(101, 101)
INTEGER :: i

DO i = 1, 1000
   a = pool(:,:,i)   ! get next matrix to invert
END DO

END PROGRAM TEST_FPU

In this example, prefetching will unroll the tree version of the innermost loop. If we turn off the vectorizer, it unrolls the only loop. In addition, -fprefetch-loop-arrays and -funroll-loops (turned on at the same time) will unroll the same loop. This is over-unrolling, and -funroll-loops should recognize that the loop has already been unrolled by prefetching. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576
[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling
--- Comment #13 from changpeng dot fang at amd dot com 2010-06-30 00:23 --- Here is the current status of this work:

patch1: http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02956.html
patch2: http://gcc.gnu.org/ml/gcc-patches/2010-06/msg03049.html

On my system, with -O3 zero_sized_1.f90 -fprefetch-loop-arrays -fno-unroll-loops --param max-completely-peeled-insns=2000:

original timing:        5m30s
with patch1:            1m20s
with patch1 + patch2:   1m03s
without prefetch:       0m30s

The timing with -fprefetch-loop-arrays is still double that of -fno-prefetch-loop-arrays after the two patches. The extra 33s is mostly spent in dependence computation for loops. For this test case, prefetching is the only optimization that invokes compute_all_dependences. I am not sure whether we should tolerate this timing increase with aggressive peeling and prefetching, or whether we should work on reducing the cost of dependence computation. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576
[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling
--- Comment #14 from changpeng dot fang at amd dot com 2010-06-30 00:36 --- (In reply to comment #7)
> A good chunk of time seems to be spent in the RTL loop unroller, triggered by
> array prefetching (testing with -O3 -funroll-loops). Otherwise it might as
> well be just excessive code growth caused by prefetching.

Yes, for test_fpu.f90, more than half of the time is spent in the RTL loop unroller, and if we manually set unroll_factor to 1 (don't unroll), the timing increase from array prefetching is negligible. With -O3 -funroll-loops, I don't expect code size or compilation time increases from the RTL loop unroller triggered by array prefetching. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576
[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling
--- Comment #11 from changpeng dot fang at amd dot com 2010-06-29 00:07 --- I have a patch that partially fixes the problem: http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02956.html Note that for this test case, the compile time doubled even though I don't compute the miss rate at all. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576
[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling
--- Comment #12 from changpeng dot fang at amd dot com 2010-06-29 00:49 --- Created an attachment (id=21034) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=21034&action=view) Early return in miss rate computation

The attached patch improves the computation of the miss rate. We can stop computing as soon as the running total of misses already exceeds the given acceptable threshold. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576
[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling
--- Comment #4 from changpeng dot fang at amd dot com 2010-06-25 17:08 --- (In reply to comment #3)
> Created an attachment (id=21001) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=21001&action=view)
> Potential fix for compile time regression

Here is a potential fix. We just limit prefetching to loops with a low number of memory references, and bail out if the number of references is too large. This should be a good fix for now, but the complexity of computing group reuse and miss rates is still a concern. I don't think we need to compute the miss rate exactly here. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576
[Bug tree-optimization/44503] control flow in the middle of basic block with -fprefetch-loop-arrays
--- Comment #3 from changpeng dot fang at amd dot com 2010-06-14 18:28 --- Actually, the prefetching is for the following loop:

  for (i = 0; i < p[2]; i++)
    q[i] = 0;

I do not understand why unrolling of this loop affects other parts of the program that have longjmp. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44503
[Bug tree-optimization/44503] control flow in the middle of basic block with -fprefetch-loop-arrays
--- Comment #4 from changpeng dot fang at amd dot com 2010-06-14 22:22 --- There is nothing wrong in the prefetch itself. The problem is the __builtin_prefetch call used to issue the prefetch instruction. Whenever there is a non-local label in the current function, any inserted __builtin_prefetch call will be considered a control-flow-altering statement:

is_ctrl_altering_stmt (gimple t)
{
  ...
  /* A non-pure/const call alters flow control if the current function
     has nonlocal labels.  */
  if (!(flags & (ECF_CONST | ECF_PURE)) && cfun->has_nonlocal_label)
    return true;
  ...
}

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44503
[Bug c/44503] New: control flow in the middle of basic block with -fprefetch-loop-arrays
Attached is a test case from the gcc regression test suite. verify_flow_info failed when I turned on prefetching.

gcc -O3 -fprefetch-loop-arrays setjmp-1.c
setjmp-1.c: In function 'main':
setjmp-1.c:17:1: error: control flow in the middle of basic block 20
setjmp-1.c:17:1: error: control flow in the middle of basic block 20
setjmp-1.c:17:1: internal compiler error: verify_flow_info failed
Please submit a full bug report,

It looks like loops with longjmp should not be unrolled. -- Summary: control flow in the middle of basic block with - fprefetch-loop-arrays Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44503
[Bug c/44503] control flow in the middle of basic block with -fprefetch-loop-arrays
--- Comment #1 from changpeng dot fang at amd dot com 2010-06-11 16:32 --- Created an attachment (id=20894) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20894action=view) prefetching for the while loop? -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44503
[Bug tree-optimization/44503] control flow in the middle of basic block with -fprefetch-loop-arrays
--- Comment #2 from changpeng dot fang at amd dot com 2010-06-11 18:45 --- Bug 39398 looks similar, but that one seems to involve exception handling instead of setjmp. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44503
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #21 from changpeng dot fang at amd dot com 2010-06-08 16:23 --- Just for the record, non-constant step prefetching improves 459.GemsFDTD by 5.5% (under -O3 + prefetch) on amd-linux64 systems. And the gains are from the following set of loops: NFT.fppized.f90:1268 NFT.fppized.f90:1227 NFT.fppized.f90:1186 NFT.fppized.f90:1148 NFT.fppized.f90:1109 NFT.fppized.f90:1072 -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #14 from changpeng dot fang at amd dot com 2010-06-07 18:27 --- Here is the current status of my investigation:

(1) 465.tonto regression (~9%): The regression mainly comes from loops which have array references with both constant (prefetch_mod = 8) and non-constant (prefetch_mod = 1) steps. The loops are unrolled 8 times, and 8 non-constant step prefetches are inserted into the unrolled loops. The ideal way to solve the problem is to compute the prefetch count considering the effect of unrolling, i.e. we should count 8 non-constant step prefetches instead of 1.

(2) 416.gamess regression (~5%): The regression is from non-constant step prefetching for outer loops. I am proposing not to do non-constant step prefetching for outer loops to solve the problem. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #15 from changpeng dot fang at amd dot com 2010-06-07 18:30 --- Created an attachment (id=20860) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20860action=view) Don't consider effect of unrolling in the computation of insn-to-prefetch ratio -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #16 from changpeng dot fang at amd dot com 2010-06-07 18:32 --- Created an attachment (id=20861) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20861action=view) Limit non-constant step prefetching only to the innermost loops -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #17 from changpeng dot fang at amd dot com 2010-06-07 18:37 --- (In reply to comment #15)
> Created an attachment (id=20860) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20860&action=view)
> Don't consider effect of unrolling in the computation of insn-to-prefetch ratio

To compute the insn-to-prefetch ratio precisely, we may need to compute it after schedule_prefetches, to know exactly how many prefetches are scheduled (we also need to compute the exact number of insns in the unrolled body). For now, I would like to disable my previous commit of using (unroll_factor * insns) for the total insns in the unrolled body. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #19 from changpeng dot fang at amd dot com 2010-06-07 22:30 --- Created an attachment (id=20862) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20862action=view) Account prefetch_mod and unroll_factor for the computation of the prefetch count Ooops. Attached a wrong patch previously. This one is what I have mentioned. -- changpeng dot fang at amd dot com changed: What|Removed |Added Attachment #20860|0 |1 is obsolete|| http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug tree-optimization/43529] G++ doesn't optimize away empty loop when index is a double
--- Comment #2 from changpeng dot fang at amd dot com 2010-06-04 23:15 --- Interesting! What's the difference between 17 and 18?

int main()
{
  double i;
  for (i = 0; i < 18; i += 1); /* gcc -O3, empty loop not removed */
}

int main()
{
  double i;
  for (i = 0; i < 17; i += 1); /* gcc -O3, empty loop removed */
}

-- changpeng dot fang at amd dot com changed: What|Removed |Added CC||changpeng dot fang at amd ||dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43529
[Bug tree-optimization/43529] G++ doesn't optimize away empty loop when index is a double
--- Comment #3 from changpeng dot fang at amd dot com 2010-06-04 23:29 --- (In reply to comment #2)
> Interesting! What's the difference between 17 and 18?
>
> int main() { double i; for (i = 0; i < 18; i += 1); /* gcc -O3, empty loop not removed */ }

The funny thing occurs in gcc 4, not gcc 6:

        .file   "empty.c"
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        addl    $1, %eax
        cmpl    $18, %eax
        jne     .L2
        rep
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu 4.4.1-4ubuntu9) 4.4.1"
        .section .note.GNU-stack,"",@progbits

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43529
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #11 from changpeng dot fang at amd dot com 2010-06-01 17:40 --- (In reply to comment #10)
> Created an attachment (id=20783) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20783&action=view)
> experimental patch to have separate values for min_insn_to_prefetch_ratio
>
> Changpeng, thank you for the feedback. Can you confirm that the regression was
> introduced by a prefetch with an unknown step, or is there still a bug in the
> calculation of the normal prefetches (e.g. by applying the first patch that
> disables non-constant steps)?
> Anyway, here is a patch that increases min_insn_to_prefetch_ratio for
> non-constant steps. Does that make a difference for tonto? Do you prefer other
> initial values?
> Thanks, Christian

Hi, Christian:

For constant step prefetching only, tonto regressed by ~7%, and for constant + invariant step prefetching combined, it regressed by ~16%.

I should have mentioned earlier that non-constant step prefetching has improved 459.GemsFDTD by 4~5% on amd-linux64 systems, and the tonto regression from non-constant step prefetching should be fixable by re-computing the prefetch count considering the unroll_factor.

However, I have found the non-temporal store problem which can cause 416.gamess to degrade by ~50%. I am not sure whether it is caused by non-constant step prefetching or not. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #13 from changpeng dot fang at amd dot com 2010-06-01 19:59 --- (In reply to comment #12)
> Ok. So I will let you continue to look into that and wait for your results?

Do you have any feedback on separate.patch and its influence on performance?

+  for (; groups; groups = groups->next)
+    for (ref = groups->refs; ref; ref = ref->next)
+      {
+        if (cst_and_fits_in_hwi (ref->group->step))
+          continue;
+        if (!ref->issue_prefetch_p)
+          continue;
+        insn_to_prefetch_ratio = (unroll_factor * ninsns) / prefetch_count;
+        if (insn_to_prefetch_ratio < MIN_INSN_TO_SPECULATIVE_PREFETCH)
+          {
+            ref->issue_prefetch_p = false;
+            if (dump_file && (dump_flags & TDF_DETAILS))
+              fprintf (dump_file,
+                       "Ignoring %p -- insn to prefetch ratio (%d) too small\n",
+                       (void *) ref, insn_to_prefetch_ratio);
+          }
+      }

The patch should fix the tonto regression caused by non-constant step prefetching. It is just that you should move the computation and comparison outside (before) the loop, and the debug dump after the loop. I am also thinking that for such loops we should do nothing: no non-temporal stores and no constant step prefetching, because nothing can be trusted. I am doing some experiments and will let you know what I find. Thanks. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #6 from changpeng dot fang at amd dot com 2010-05-28 16:46 --- (In reply to comment #4)
> Created an attachment (id=20767) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20767&action=view)
> Patch that makes loop invariant prefetches backend specific

Actually, I am the one who would like the invariant step prefetch to be backend independent. However, the current implementation seems a bit aggressive: its fundamental assumption is that the invariant step is big enough that there is no spatial reuse and we don't need to unroll the loop (prefetch_mod == 1). This assumption may be OK for C code (or integer code), but may not be appropriate for Fortran programs. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #7 from changpeng dot fang at amd dot com 2010-05-28 16:56 --- (In reply to comment #5)
> An alternative approach might be to have different values for
> prefetch-min-insn-to-mem-ratio and min-insn-to-prefetch-ratio depending on
> constant/non-constant step size.

It may be a good idea to limit non-constant step prefetching to big loops. This is because we are not very confident that the reference will cause a cache miss, and we should limit the prefetches generated. min-insn-to-prefetch-ratio may be a good parameter to work on. By the way, I think min-insn-to-prefetch-ratio should be backend dependent: in a certain sense, this parameter implies how many useless prefetches an architecture can tolerate. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #8 from changpeng dot fang at amd dot com 2010-05-28 18:30 --- (In reply to comment #4)
> Created an attachment (id=20767) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20767&action=view)
> Patch that makes loop invariant prefetches backend specific
>
> Three observations:
> 1. the patch had a bug which led to wrong calculations in some cases
> This commit should be applied to improve some other testcases:
> http://gcc.gnu.org/viewcvs?view=revision&revision=159816

Looks like this is a fix for the regressions. That is, the regressions were actually caused by the wrong calculation. This bug could be considered fixed, even though performance tuning may be necessary for non-constant step prefetching. Thanks. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #9 from changpeng dot fang at amd dot com 2010-05-28 18:36 --- (In reply to comment #8)
> Looks like this is a fix for the regressions. That is, the regressions were
> actually caused by the wrong calculation. This bug could be considered fixed,
> even though performance tuning may be necessary for non-constant step
> prefetching. Thanks.

Oh, NO! After this patch, 465.tonto has a big regression (-16%) compared to no prefetching. Note that prefetching caused 465.tonto a ~7% degradation originally (before non-constant step prefetching). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] New: Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
Tests are on amd-linux64 system with -O3 -fprefetch-loop-arrays Compare gcc-4.6-20100522.tar.bz2 to gcc-4.6-20100515.tar.bz2 459.GemsFDTD: -32.6% 434.zeusmp: -13.6% If I replace tree-ssa-loop-prefetch.c in gcc-4.6-20100522.tar.bz2 with the one in gcc-4.6-20100515.tar.bz2, The regression disappears. -- Summary: Big spec cpu2006 prefetch regressions on gcc 4.6 on x86 Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #1 from changpeng dot fang at amd dot com 2010-05-27 20:49 --- The regressions are most likely from the patch that added non-constant step prefetching:

* From: Andreas Krebbel krebbel at linux dot vnet dot ibm dot com
* To: Christian Borntraeger borntraeger at de dot ibm dot com
* Cc: gcc-patches gcc-patches at gcc dot gnu dot org
* Date: Wed, 19 May 2010 12:40:51 +0200
* Subject: Re: [patch 4/4 v4] Allow loop prefetch code to speculatively prefetch non constant steps

* tree-ssa-loop-prefetch.c (mem_ref_group): Change step to tree.
* tree-ssa-loop-prefetch.c (ar_data): Change step to tree.
* tree-ssa-loop-prefetch.c (dump_mem_ref): Adopt debug code to handle a tree as step. This also checks for a constant int vs. non-constant but loop-invariant steps.
* tree-ssa-loop-prefetch.c (find_or_create_group): Change the sort algorithm to only consider steps that are constant ints.
* tree-ssa-loop-prefetch.c (idx_analyze_ref): Adopt code to handle a tree instead of a HOST_WIDE_INT for step.
* tree-ssa-loop-prefetch.c (gather_memory_references_ref): Handle tree instead of int and be prepared to see a NULL_TREE.
* tree-ssa-loop-prefetch.c (prune_ref_by_self_reuse): Do not prune prefetches if the step cannot be calculated at compile time.
* tree-ssa-loop-prefetch.c (prune_ref_by_group_reuse): Do not prune prefetches if the step cannot be calculated at compile time.
* tree-ssa-loop-prefetch.c (issue_prefetch_ref): Issue prefetches for non-constant but loop-invariant steps.

Applied to mainline. Thanks! -- changpeng dot fang at amd dot com changed: What|Removed |Added CC||borntraeger at de dot ibm ||dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #2 from changpeng dot fang at amd dot com 2010-05-27 20:55 --- To me, non-constant step prefetching does not seem to fit into the existing prefetching framework. A non-constant stride prevents any reuse analysis, and thus prefetching is done kind of blindly. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
--- Comment #3 from changpeng dot fang at amd dot com 2010-05-27 23:51 --- I took a quick look at 434.zeusmp and found that prefetching for the following simple loop is responsible:

linpck.f: 131:
c
c        code for increment not equal to 1
c
      ix = 1
      smax = abs(sx(1))
      ix = ix + incx
      do 10 i = 2,n
         if(abs(sx(ix)).le.smax) go to 5
         isamax = i
         smax = abs(sx(ix))
    5    ix = ix + incx
   10 continue

Prefetching for this loop seems too aggressive with an unknown incx. It is not predictable which sx(ix) will cause a cache miss. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297
[Bug tree-optimization/43423] gcc should vectorize this loop through if-conversion
--- Comment #9 from changpeng dot fang at amd dot com 2010-05-24 22:47 --- (In reply to comment #8) -fgraphite-identity does iteration splitting for this case. Do you know why it could not be vectorized after iteration range splitting? -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43423
[Bug middle-end/44185] [4.6 regression] New prefetch test failures
--- Comment #6 from changpeng dot fang at amd dot com 2010-05-21 21:36 --- (In reply to comment #5)
> The fix introduced:
> FAIL: gcc.dg/tree-ssa/prefetch-7.c scan-assembler-times movnti 18
> FAIL: gcc.dg/tree-ssa/prefetch-7.c scan-tree-dump-times optimized ={nt} 18
> on Linux/ia32.

It seems the unrolling is quite different on different architectures. The count of movnti in the assembly code depends on the unroll_factor. I would propose to remove the movnti check on the assembly code. The dump in aprefetch shows that two non-temporal stores are generated, and this is enough. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44185
[Bug middle-end/44185] [4.6 regression] New prefetch test failures
--- Comment #2 from changpeng dot fang at amd dot com 2010-05-18 19:39 --- I have a patch to fix the test cases: http://gcc.gnu.org/ml/gcc-patches/2010-05/msg01359.html

For prefetch-6.c, patch http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00567.html applies the insn-to-prefetch ratio heuristic to loops with known trip count, and thus filtered one prefetch out. Adding --param min-insn-to-prefetch-ratio=6 (default is 10) fixes the problem.

For prefetch-7.c, patch http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00566.html does not generate prefetches if the loop is far from being unrolled as much as prefetching requires. In this case, prefetching requires the loop to be unrolled 16 times, but the loop is not unrolled due to the parameter constraint. We remove --param max-unrolled-insns=1 to allow unrolling and thus generate prefetches. The movnti count is also adjusted. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44185
[Bug tree-optimization/43425] gcc should vectorize this loop by substitution
--- Comment #3 from changpeng dot fang at amd dot com 2010-05-07 21:33 --- I just found that the test case is the same as (similar to) bug 35229. The subject of this bug is wrong: scalar expansion is not appropriate for this case. Actually, the loop can be transformed to:

void foo(int n)
{
  int i;
  a[0] = b[0]; /* + t if t is live before this point */
  for (i = 1; i < n; i++)
    a[i] = b[i] + b[i-1];
  /* t = b[n-1]; if t is live after this point */
}

Then this loop can be vectorized. In open64, this optimization is called forward (backward) substitution, i.e. substitute t with b[i-1]. I am not clear whether bug 35229 addresses the same issue. Maybe we should close one of them. -- changpeng dot fang at amd dot com changed: What|Removed |Added Summary|enhance scalar expansion to |gcc should vectorize this |vectorize this loop |loop by substitution http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43425
[Bug tree-optimization/43423] gcc should vectorize this loop through if-conversion
--- Comment #7 from changpeng dot fang at amd dot com 2010-05-07 21:41 --- (In reply to comment #4)
> (In reply to comment #3) Subject: Re: gcc should vectorize this loop through
> iteration range splitting
>
> You mean that the problem is the if-conversion of the stores a[i] = ...
> If we rewrite the code like:
>
> int a[100], b[100], c[100];
> void foo(int n, int mid)
> {
>   int i;
>   for (i = 0; i < n; i++)
>     {
>       int t;
>       int ai = a[i], bi = b[i], ci = c[i];
>       if (i < mid)
>         t = ai + bi;
>       else
>         t = ai + ci;
>       a[i] = t;
>     }
> }
> --- CUT ---
> This gets vectorized as we produce an if-cvt first.

There are both correctness and performance issues in the rewritten code: b[i] or c[i] may not be accessed in the original loop. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43423
[Bug tree-optimization/43543] New: Reorder the statements in the loop can vectorize it
int a[100], b[100], c[100], d[100];
void foo ()
{
  int i;
  for (i = 1; i < 99; i++)
    {
      a[i] = b[i-1] + c[i];
      b[i] = b[i+1] + d[i];
    }
}

gcc -O3 -ffast-math -ftree-vectorizer-verbose=2 -c foo.c
foo.c:6: note: not vectorized, possible dependence between data-refs b[D.2728_3] and b[i_17]
foo.c:3: note: vectorized 0 loops in function.

However, if we reorder the two statements in the loop, then it can be vectorized. open64 can do this reordering. -- Summary: Reorder the statements in the loop can vectorize it Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43543
[Bug tree-optimization/42906] [4.5 Regression] Empty loop not removed
--- Comment #20 from changpeng dot fang at amd dot com 2010-03-18 17:24 --- (In reply to comment #19) Splitting critical edges for CDDCE will probably also solve this problem. Richard. Yes, splitting critical edges is an enhancement to CDDCE and can solve this problem. There are two approaches to do this (1) add pass_split_crit_edges before each pass_cd_dce or (2) encode split_crit_edges into cddce as an initialization. What do you think? Thanks. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42906
[Bug c/43422] New: reversed loop is not vectorized
gcc could not vectorize this simple reversed loop:

int a[100], b[100];
void foo(int n)
{
  int i;
  for (i = n-2; i >= 0; i--)
    a[i+1] = a[i] + b[i];
}

chf...@pathscale:~/gcc$ gcc -O3 -ftree-vectorizer-verbose=2 -c foo.c
foo.c:6: note: not vectorized: complicated access pattern.
foo.c:3: note: vectorized 0 loops in function.

open64 can vectorize this loop:

chf...@pathscale:~/gcc$ opencc -O3 -LNO:simd_verbose=on -c foo.c
(foo.c:0) LOOP WAS VECTORIZED.

-- Summary: reversed loop is not vectorized Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43422
[Bug c/43423] New: gcc should vectorize this loop through iteration range splitting
chf...@pathscale:~/gcc$ cat foo.c
int a[100], b[100], c[100];
void foo(int n, int mid)
{
  int i;
  for (i = 0; i < n; i++)
    {
      if (i < mid)
        a[i] = a[i] + b[i];
      else
        a[i] = a[i] + c[i];
    }
}

chf...@pathscale:~/gcc$ gcc -O3 -ftree-vectorizer-verbose=7 -c foo.c
foo.c:6: note: not vectorized: control flow in loop.
foo.c:3: note: vectorized 0 loops in function.

This loop can be vectorized by icc. For this case, I would expect to see two loops with iteration ranges of [0, mid) and [mid, n). Then both loops can be vectorized. I am not sure which pass in gcc should do this iteration range splitting. -- Summary: gcc should vectorize this loop through iteration range splitting Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43423
[Bug c/43425] New: enhance scalar expansion to vectorize this loop
chf...@pathscale:~/gcc$ cat foo.c
int a[100], b[100];
void foo(int n, int mid)
{
  int i, t = 0;
  for (i = 0; i < n; i++)
    {
      a[i] = b[i] + t;
      t = b[i];
    }
}

chf...@pathscale:~/gcc$ gcc -O3 -ftree-vectorizer-verbose=7 -c foo.c
foo.c:6: note: not vectorized: unsupported use in stmt.
foo.c:3: note: vectorized 0 loops in function.

Scalar expansion of t into an array would carry the values across iterations. -- Summary: enhance scalar expansion to vectorize this loop Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: changpeng dot fang at amd dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43425
[Bug c/43427] New: The loop is not interchanged and thus could not be vectorized.
chf...@pathscale:~/gcc$ cat foo.c
float a[100][100], b[100][100];
void foo(int n)
{
  int i, j;
  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
      a[i][j] = a[i][j] + b[i][j];
}

chf...@pathscale:~/gcc$ gcc -O3 -ftree-vectorizer-verbose=2 -c foo.c
foo.c:6: note: not vectorized: can't create epilog loop 2.
foo.c:7: note: not vectorized: complicated access pattern.
foo.c:3: note: vectorized 0 loops in function.

Information from open64:

chf...@pathscale:~/gcc$ opencc -O3 -LNO:simd_verbose=on -c foo.c
(foo.c:0) LOOP WAS VECTORIZED.
(foo.c:0) LOOP WAS VECTORIZED.
chf...@pathscale:~/gcc$ opencc -O3 -LNO:simd_verbose=on:interchange=0 -c foo.c
(foo.c:0) Non-contiguous array a reference exists. Loop was not vectorized.
(foo.c:0) Non-contiguous array a reference exists. Loop was not vectorized.

Graphite may be able to do this basic loop interchange.

--
Summary: The loop is not interchanged and thus could not be vectorized.
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43427
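The interchange itself is a small rewrite (my own sketch, not from the report): since the body is a pure elementwise add, either loop order computes the same result, but after interchange the inner loop walks a[i][0..n-1] contiguously, which is the stride-1 access pattern the vectorizer wants.

```c
#include <assert.h>

#define N 8
float a[N][N], b[N][N];

/* Loop order from the report: the inner loop strides by a whole row. */
void add_ji(int n)
{
  int i, j;
  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
      a[i][j] = a[i][j] + b[i][j];
}

/* After interchange: the inner loop is contiguous and vectorizable. */
void add_ij(int n)
{
  int i, j;
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a[i][j] = a[i][j] + b[i][j];
}
```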
[Bug tree-optimization/43428] New: vectorizer should invoke loop distribution to partially vectorize this loop
chf...@pathscale:~/gcc$ cat foo.c
float a[100], b[100], c[100];
void foo(int n)
{
  int i;
  for (i = 1; i < n; i++) {
    a[i] = a[i] + c[i];
    b[i] = b[i-1] + a[i];
  }
}

chf...@pathscale:~/gcc$ gcc -O3 -ftree-vectorizer-verbose=2 -ftree-loop-distribution -c foo.c
foo.c:6: note: not vectorized, possible dependence between data-refs b[D.2730_7] and b[i_17]
foo.c:3: note: vectorized 0 loops in function.

Loop distribution itself may find it not profitable to do such a distribution.
However, partially vectorizing this loop may yield a big profit. ICC can
partially vectorize it.

--
Summary: vectorizer should invoke loop distribution to partially vectorize this loop
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43428
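The distribution being asked for can be sketched by hand (my own illustration, not from the report): the a statement never reads b, and the b statement reads the already-updated a[i], so splitting the loop preserves the semantics while leaving only the b recurrence scalar.

```c
#include <assert.h>

/* The loop from the report (b renamed bv to avoid shadowing). */
void dep_orig(float *a, float *bv, const float *c, int n)
{
  int i;
  for (i = 1; i < n; i++)
    {
      a[i] = a[i] + c[i];
      bv[i] = bv[i - 1] + a[i];
    }
}

/* After distribution: the a update has no loop-carried dependence and can
   be vectorized; only the b recurrence must stay sequential.  Legal
   because the second loop still sees every updated a[i].  */
void dep_distributed(float *a, float *bv, const float *c, int n)
{
  int i;
  for (i = 1; i < n; i++)   /* vectorizable half */
    a[i] = a[i] + c[i];
  for (i = 1; i < n; i++)   /* sequential recurrence */
    bv[i] = bv[i - 1] + a[i];
}
```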
[Bug tree-optimization/32824] Missed reduction vectorizer after store to global is LIM'd
--- Comment #8 from changpeng dot fang at amd dot com 2010-03-17 21:22 ---
Created an attachment (id=20133)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20133&action=view)
patch with the testcase

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32824
[Bug tree-optimization/42906] [4.5 Regression] Empty loop not removed
--- Comment #17 from changpeng dot fang at amd dot com 2010-03-17 00:18 ---
(In reply to comment #8)

And:

int foo (int b, int j)
{
  if (b)
    {
      int i;
      for (i = 0; i < 1000; ++i)
        ;
      j = b;
    }
  return j;
}

With j = b, b is not folded as a phi argument:

<bb 5>:
  # i_2 = PHI <0(3), i_6(4)>
  if (i_2 <= 999)
    goto <bb 4>;
  else
    goto <bb 6>;

<bb 6>:
  j_7 = b_3(D);

<bb 7>:
  # j_1 = PHI <j_4(D)(2), j_7(6)>

However, if j = 0, it is:

<bb 6>:
  j_7 = 0;

<bb 7>:
  # j_1 = PHI <j_4(D)(2), 0(6)>
  j_8 = j_1;
  return j_8;

Then copy propagation will remove j_7 = 0 (and thus bb 6) because it has no
users. So, one possible solution is to not remove trivial dead code in the
copy propagation pass. Any dce pass will remove such code.

Of course, if we follow Steven's suggestion not to use constants as phi
arguments, j_7 = 0 will not be removed by constant propagation, and we are
all fine.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42906
[Bug tree-optimization/42906] [4.5 Regression] Empty loop not removed
--- Comment #18 from changpeng dot fang at amd dot com 2010-03-17 00:22 ---
(In reply to comment #16)
> > In this case, the loop itself is empty and we can replace every use of
> > the phi with n (exit value of the iv).
> I don't think that is done by remove_empty_loop anyways and it is already
> done by sccp (propagation of constants using scev) which is enabled at -O1.

But n is not a constant. Of course we can modify the pass to compute the exit
value of the iv (integer overflow may be an issue).

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42906
[Bug middle-end/43238] GCC 4.5 ICE segfault on any -O flag
--- Comment #4 from changpeng dot fang at amd dot com 2010-03-02 21:56 ---
I have verified that the patch proposed in bug 43209 did fix this problem. I
am going to check in the change soon.

Thanks.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43238
[Bug tree-optimization/43209] [4.5 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:5238
--- Comment #5 from changpeng dot fang at amd dot com 2010-03-01 18:02 ---
I have a fix for this problem. We should not decrease the cost if the cost is
infinite.

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 74dadf7..9accda9 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -4124,7 +4124,11 @@ determine_use_iv_cost_condition (struct ivopts_data *data,
   if (integer_zerop (*bound_cst)
       && (operand_equal_p (*control_var, cand->var_after, 0)
          || operand_equal_p (*control_var, cand->var_before, 0)))
-    elim_cost.cost -= 1;
+    {
+      /* Should not decrease the cost if it is infinite */
+      if (!infinite_cost_p (elim_cost))
+        elim_cost.cost -= 1;
+    }

--
changpeng dot fang at amd dot com changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |changpeng dot fang at amd dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43209
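The idea behind the fix can be shown with a toy model (my own sketch, not GCC's actual `comp_cost` type or functions): when "infinite" is a sentinel value, ordinary arithmetic must not touch it, or the worst candidate suddenly looks attractive.

```c
#include <assert.h>
#include <limits.h>

/* Toy model of a saturating cost: INT_MAX stands in for "infinite". */
struct toy_cost { int cost; };

static int toy_infinite_cost_p(struct toy_cost c)
{
  return c.cost == INT_MAX;
}

/* Applying a bonus/discount to a cost: the fix is the guard -- never
   decrease an infinite cost, so it keeps comparing as worst.  */
struct toy_cost toy_discount(struct toy_cost c)
{
  if (!toy_infinite_cost_p(c))
    c.cost -= 1;
  return c;
}
```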
[Bug middle-end/43182] GCC does not pull out a[0] from loop that changes a[i] for i:[1,n]
--- Comment #4 from changpeng dot fang at amd dot com 2010-02-26 18:53 ---
Here is another similar case, but more general. We know that a(j) and a(i)
never access the same memory location. Intel ifort can vectorize this
triangular loop:

      do 10 j = 1,n
         do 20 i = j+1, n
            a(i) = a(i) - aa(i,j) * a(j)
 20      continue
 10   continue

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43182
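A C rendering of the triangular loop (my own sketch, not from the report): because the inner loop runs only over i > j, a[j] is never written inside it and can be hoisted into a scalar, leaving a plain vectorizable update of a[j+1..n-1].

```c
#include <assert.h>

/* C analogue of the Fortran triangular loop; aa is stored column-major
   with leading dimension lda, so aa(i,j) maps to aa[j * lda + i]. */
void tri_update(double *a, const double *aa, int lda, int n)
{
  int i, j;
  for (j = 0; j < n; j++)
    {
      double aj = a[j];             /* invariant in the inner loop: i > j */
      for (i = j + 1; i < n; i++)   /* vectorizable inner loop */
        a[i] = a[i] - aa[j * lda + i] * aj;
    }
}
```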
[Bug middle-end/43182] GCC does not pull out a[0] from loop that changes a[i] for i:[1,n]
--- Comment #6 from changpeng dot fang at amd dot com 2010-02-26 19:06 ---
> Actually it is a totally different case. Please file a new bug with that
> case; though there might already be a bug about that one.

I could not see the difference, even though j is not a compile-time constant
(it is invariant in the innermost loop). I can say: GCC does not pull out a[j]
from a loop that changes a[i] for i:[j+1,n].

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43182
[Bug middle-end/43182] New: gcc could not vectorize this simple loop (un-handled data-ref)
gcc 4.5 can not vectorize this simple loop:

void foo(int a[], int n)
{
  int i;
  for (i = 1; i < n; i++)
    a[i] = a[0];
}

gcc -O3 -fdump-tree-vect-all -c foo.c shows:
foo.c:3: note: not vectorized: unhandled data-ref
foo.c:3: note: bad data references.
foo.c:1: note: vectorized 0 loops in function.

It seems gcc gets confused at a[0] and gives up vectorization. There is no
dependence in this loop, and we should teach gcc to handle a[0] to vectorize
it.

--
Summary: gcc could not vectorize this simple loop (un-handled data-ref)
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43182
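The key observation, sketched by hand (my own illustration, not from the report): since i starts at 1, a[0] is never stored to inside the loop, so its load can be hoisted and the loop becomes a plain fill that a vectorizer would turn into a broadcast plus vector stores.

```c
#include <assert.h>

/* The loop from the report. */
void splat_orig(int a[], int n)
{
  int i;
  for (i = 1; i < n; i++)
    a[i] = a[0];
}

/* a[0] is loop-invariant (never written for i >= 1), so hoist the load;
   the remaining loop is a dependence-free splat of t into a[1..n-1]. */
void splat_hoisted(int a[], int n)
{
  int i;
  int t = a[0];           /* single load, then broadcast */
  for (i = 1; i < n; i++)
    a[i] = t;
}
```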
[Bug middle-end/43184] New: gcc could not vectorize floating point reduction statements
gcc 4.5 could not vectorize floating point reductions.

float sum(float a[], int n)
{
  int i;
  float total = 0.0;
  for (i = 0; i < n; i++)
    total += a[i];
  return total;
}

gcc -O3 -fdump-tree-vect-all shows:
foo.c:4: note: Unsupported pattern.
foo.c:4: note: not vectorized: unsupported use in stmt.
foo.c:4: note: unexpected pattern.
foo.c:1: note: vectorized 0 loops in function.

I have verified that gcc can vectorize integer reductions, but not float and
double.

--
Summary: gcc could not vectorize floating point reduction statements
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43184
[Bug middle-end/43184] gcc could not vectorize floating point reduction statements
--- Comment #2 from changpeng dot fang at amd dot com 2010-02-26 00:28 ---
Subject: RE: gcc could not vectorize floating point reduction statements

Thanks for pointing this out. Actually I am working on a fortran program and
found the reduction statement. The fortran code can not be vectorized even
with -ffast-math. Do you think this is a problem of the fortran frontend?

Thanks,

-- Changpeng

c%3.1
      subroutine s311 (ntimes,ld,n,ctime,dtime,a,b,c,d,e,aa,bb,cc)
c
c     reductions
c     sum reduction
c
      integer ntimes, ld, n, i, nl
      double precision a(n), b(n), c(n), d(n), e(n), aa(ld,n),
     +                 bb(ld,n), cc(ld,n)
      double precision chksum, sum
      real t1, t2, second, ctime, dtime
      call init(ld,n,a,b,c,d,e,aa,bb,cc,'s311 ')
      t1 = second()
      do 1 nl = 1,ntimes
         sum = 0.d0
         do 10 i = 1,n
            sum = sum + a(i)
  10     continue
         call dummy(ld,n,a,b,c,d,e,aa,bb,cc,sum)
  1   continue
      t2 = second() - t1 - ctime - ( dtime * float(ntimes) )
      chksum = sum
      call check (chksum,ntimes*n,n,t2,'s311 ')
      return
      end

From: pinskia at gcc dot gnu dot org [gcc-bugzi...@gcc.gnu.org]
Sent: Thursday, February 25, 2010 5:57 PM
To: Fang, Changpeng
Subject: [Bug middle-end/43184] gcc could not vectorize floating point
reduction statements

--- Comment #1 from pinskia at gcc dot gnu dot org 2010-02-25 23:57 ---
> gcc 4.5 could not vectorize floating point reductions.

Yes it can; add -ffast-math. Floating point reductions need -ffast-math as it
can change the results in some cases (negative zero and I think clamping
cases too).

--
pinskia at gcc dot gnu dot org changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED |RESOLVED
         Resolution|            |INVALID

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43184

--- You are receiving this mail because: ---
You reported the bug, or are watching the reporter.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43184
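Why -ffast-math is required can be shown by hand (my own sketch, not from the report): a vectorized reduction keeps one running partial sum per lane and combines them after the loop, which reassociates the floating point additions. The test inputs below are small integers so both orders happen to be exact; with general float data the two functions may round differently, which is exactly what -ffast-math licenses.

```c
#include <assert.h>

/* Scalar reduction, as in the report. */
float sum_scalar(const float a[], int n)
{
  int i;
  float total = 0.0f;
  for (i = 0; i < n; i++)
    total += a[i];
  return total;
}

/* Shape of a 4-wide vectorized reduction: four partial sums combined at
   the end.  This reassociates the additions, hence gcc only emits it
   under -ffast-math.  */
float sum_lanes(const float a[], int n)
{
  float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
  int i = 0;
  for (; i + 3 < n; i += 4)
    {
      s0 += a[i];
      s1 += a[i + 1];
      s2 += a[i + 2];
      s3 += a[i + 3];
    }
  float total = (s0 + s1) + (s2 + s3);
  for (; i < n; i++)        /* scalar epilogue */
    total += a[i];
  return total;
}
```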
[Bug tree-optimization/42906] [4.5 Regression] Empty loop not removed
--- Comment #15 from changpeng dot fang at amd dot com 2010-02-16 19:54 ---
Hello,

I am not sure whether CD-DCE can fully replace remove_empty_loop. However, I
would prefer to keep the remove_empty_loop pass. There are two reasons for
this proposal:

(1) remove_empty_loop was at level -O1 and above, but CD-DCE is at -O2 and
above.

(2) remove_empty_loop can be extended to handle other cases which CD-DCE is
not able to:

  for (i = 0; i < n; i++)
    ;
  j = i;

In this case, the loop itself is empty and we can replace every use of the
phi with n (the exit value of the iv).

What do you think about this (putting back the empty loop removal code)?

Thanks,

--
changpeng dot fang at amd dot com changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |cfang at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42906
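The suggested extension, sketched in C (my own illustration, not from the comment): the empty loop's only effect is the final value of i, which has a closed form. Note the closed form must guard the n <= 0 case, where the loop body never runs and i stays 0 (so the exit value is max(n, 0), not n alone).

```c
#include <assert.h>

/* The case from the comment: an empty loop whose only live value is the
   final iv, escaping through j.  */
int with_empty_loop(int n)
{
  int i, j;
  for (i = 0; i < n; i++)
    ;
  j = i;
  return j;
}

/* Replacing the phi use with the iv's exit value removes the loop; the
   n <= 0 guard keeps the never-executed case correct.  */
int closed_form(int n)
{
  return n > 0 ? n : 0;
}
```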