[Bug tree-optimization/45241] [4.5/4.6 Regression] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre

2010-08-30 Thread changpeng dot fang at amd dot com


--- Comment #9 from changpeng dot fang at amd dot com  2010-08-30 16:37 
---
Review approval for the trunk:
http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00931.html

Review Approval for 4.5 branch:
http://gcc.gnu.org/ml/gcc-patches/2010-08/msg02112.html


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241



[Bug tree-optimization/45241] [4.5/4.6 Regression] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre

2010-08-30 Thread changpeng dot fang at amd dot com


--- Comment #10 from changpeng dot fang at amd dot com  2010-08-30 16:39 
---
r163207 - in /trunk/gcc: ChangeLog testsuite/Ch...

* From: cfang at gcc dot gnu dot org
* To: gcc-cvs at gcc dot gnu dot org
* Date: Thu, 12 Aug 2010 22:18:34 -
* Subject: r163207 - in /trunk/gcc: ChangeLog testsuite/Ch...

Author: cfang
Date: Thu Aug 12 22:18:32 2010
New Revision: 163207

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163207
Log:
pr45241 give up dot_prod pattern searching if stmt is outside the loop.

* tree-vect-patterns.c (vect_recog_dot_prod_pattern): Give
up dot_prod pattern searching if a stmt is outside the loop.

* gcc.dg/vect/no-tree-pre-pr45241.c: New.

Added:
trunk/gcc/testsuite/gcc.dg/vect/no-tree-pre-pr45241.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-vect-patterns.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241



[Bug tree-optimization/45241] [4.5/4.6 Regression] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre

2010-08-30 Thread changpeng dot fang at amd dot com


--- Comment #11 from changpeng dot fang at amd dot com  2010-08-30 16:40 
---
r163286 - in /branches/gcc-4_5-branch/gcc: Chan...

* From: cfang at gcc dot gnu dot org
* To: gcc-cvs at gcc dot gnu dot org
* Date: Mon, 16 Aug 2010 21:02:30 -
* Subject: r163286 - in /branches/gcc-4_5-branch/gcc: Chan...

Author: cfang
Date: Mon Aug 16 21:02:29 2010
New Revision: 163286

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163286
Log:
pr45241 give up dot_prod pattern searching if stmt is outside the loop.

* tree-vect-patterns.c (vect_recog_dot_prod_pattern): Give
up dot_prod pattern searching if a stmt is outside the loop.

* gcc.dg/vect/no-tree-pre-pr45241.c: New.

Added:
branches/gcc-4_5-branch/gcc/testsuite/gcc.dg/vect/no-tree-pre-pr45241.c
Modified:
branches/gcc-4_5-branch/gcc/ChangeLog
branches/gcc-4_5-branch/gcc/testsuite/ChangeLog
branches/gcc-4_5-branch/gcc/tree-vect-patterns.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241



[Bug tree-optimization/45241] [4.5/4.6 Regression] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre

2010-08-30 Thread changpeng dot fang at amd dot com


--- Comment #12 from changpeng dot fang at amd dot com  2010-08-30 16:41 
---
Fixed!


-- 

changpeng dot fang at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241



[Bug target/45391] CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop

2010-08-24 Thread changpeng dot fang at amd dot com


--- Comment #5 from changpeng dot fang at amd dot com  2010-08-24 22:13 
---
For the test case in comment #2, if we don't vectorize the loop, the
unroll_factor is incorrectly determined as 1, and the insns-to-prefetch
ratio (4) then prevents prefetching, so there is no performance regression.

If we vectorize the loop, the prefetch_mod is smaller than the
upper_bound, so the unroll_factor is determined as 4. In that case the
insns-to-prefetch ratio is big enough to allow prefetches, hence the 5%
regression for 482.sphinx3.

This regression should also have occurred with -fno-tree-vectorize if
the unroll factor were set correctly. The actual problem is
the unrolling itself: there is no regression if I just insert
the prefetch and do not unroll the loop at all.
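
The interaction between the unroll factor and the cost model can be sketched
as follows. This is a simplified illustration, not the code in GCC's prefetch
pass: the function names and the threshold value are assumptions, and the
instruction and prefetch counts used in the usage note come from the dump
quoted in comment #2.

```c
#include <stdbool.h>

/* Hypothetical threshold; in GCC the real value is a tunable --param.  */
#define MIN_INSN_TO_PREFETCH_RATIO 9

/* Instructions per prefetch in the unrolled loop body (integer division).  */
static unsigned
insn_to_prefetch_ratio (unsigned unroll_factor, unsigned ninsns,
                        unsigned prefetch_count)
{
  return (unroll_factor * ninsns) / prefetch_count;
}

/* Prefetching is allowed only when the ratio reaches the threshold.  */
static bool
prefetching_allowed_p (unsigned unroll_factor, unsigned ninsns,
                       unsigned prefetch_count)
{
  return insn_to_prefetch_ratio (unroll_factor, ninsns, prefetch_count)
         >= MIN_INSN_TO_PREFETCH_RATIO;
}
```

With 14 instructions and 3 prefetches, an unroll factor of 1 gives a ratio of
4 and the prefetches are rejected; under the same assumptions an unroll factor
of 4 gives 18 and they are accepted.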


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391



[Bug tree-optimization/45260] [4.5/4.6 Regression] g++4.5: -prefetch-loop-arrays internal compiler error: in verify_expr, at tree-cfg.c:2541

2010-08-23 Thread changpeng dot fang at amd dot com


--- Comment #6 from changpeng dot fang at amd dot com  2010-08-23 18:59 
---
Committed to trunk as Revision: 163475:
http://gcc.gnu.org/ml/gcc-cvs/2010-08/msg00688.html

Committed to 4.5 branch as Revision: 163483
http://gcc.gnu.org/ml/gcc-cvs/2010-08/msg00696.html


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45260



[Bug c/45389] New: CPU2006 cactusADM: gcc 4.6 15% regression from 4.5

2010-08-23 Thread changpeng dot fang at amd dot com
On an AMD amdfam10 system, gcc 4.5 (892s) is 15% faster than gcc 4.6 (1026s)
With the following settings:

4.6: gcc version 4.6.0 20100812 (experimental) (GCC) 

COPTIMIZE = -Ofast -funroll-all-loops -fno-tree-pre --param
prefetch-latency=700 -mveclibabi=acml -m64 -march=amdfam10
FOPTIMIZE = -Ofast -funroll-all-loops -fno-tree-pre -mveclibabi=acml
-m64 -march=amdfam10
EXTRA_LDFLAGS = -L$(ACML_DIR) -lacml_mv

4.5: gcc version 4.5.2 20100818 (prerelease) (GCC)

COPTIMIZE = -O3 -ffast-math -funroll-all-loops -fno-tree-pre
-fprefetch-loop-arrays --param prefetch-latency=700 -mveclibabi=acml -m64
-march=amdfam10
FOPTIMIZE = -O3 -ffast-math -funroll-all-loops -fno-tree-pre
-mveclibabi=acml -m64 -march=amdfam10
EXTRA_LDFLAGS = -L$(ACML_DIR) -lacml_mv



NOTE that for gcc 4.6, -Ofast = -O3 -ffast-math and
-fprefetch-loop-arrays
is turned on @ -O3.

Also acml4.4.0 is used for both tests.


-- 
   Summary: CPU2006 cactusADM: gcc 4.6 15% regression from 4.5
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45389



[Bug c/45390] New: CPU2006 434.zeusmp: gcc 4.6 7% regression from gcc 4.5

2010-08-23 Thread changpeng dot fang at amd dot com
On an AMD amdfam10 system, gcc 4.5 (713s) is 7% faster than gcc 4.6 (763s)
With the following settings:

4.6: gcc version 4.6.0 20100812 (experimental) (GCC) 
FOPTIMIZE = -Ofast -funroll-all-loops -fno-tree-pre -mveclibabi=acml
-m64 -march=amdfam10
EXTRA_LDFLAGS = -L$(ACML_DIR) -lacml_mv

4.5: gcc version 4.5.2 20100818 (prerelease) (GCC)

COPTIMIZE = -O3 -ffast-math -funroll-all-loops -fno-tree-pre
FOPTIMIZE = -O3 -ffast-math -funroll-all-loops -fno-tree-pre
-mveclibabi=acml -m64 -march=amdfam10
EXTRA_LDFLAGS = -L$(ACML_DIR) -lacml_mv



NOTE that for gcc 4.6, -Ofast = -O3 -ffast-math and
-fprefetch-loop-arrays is turned on @ -O3.

Also acml4.4.0 is used for both tests.


-- 
   Summary: CPU2006 434.zeusmp: gcc 4.6 7% regression from gcc 4.5
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45390



[Bug target/45391] New: CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop

2010-08-23 Thread changpeng dot fang at amd dot com
With gcc-4.6 -Ofast -funroll-all-loops -fno-tree-pre -mveclibabi=acml -m64
-march=amdfam10
sphinx3 runs 5% slower than with
gcc-4.6 -Ofast -funroll-all-loops -fno-prefetch-loop-arrays -fno-tree-pre
-mveclibabi=acml -m64 -march=amdfam10

Prefetching does not cause any slowdown if the vectorizer is turned off, or
with -fno-fast-math.

I believe the affected loops are those with reductions, whose vectorization
was enabled by the following commit:
http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00277.html


-- 
   Summary: CPU2006 482.sphinx3: gcc4.6 5% regression from
prefetching of vectorized loop
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391



[Bug target/45391] CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop

2010-08-23 Thread changpeng dot fang at amd dot com


--- Comment #2 from changpeng dot fang at amd dot com  2010-08-24 00:03 
---
float f (float *x, float *y, float *z, unsigned n)
{
  float ret = 0.0;
  unsigned i;
  for (i = 0; i < n; i++)
    {
      float diff = x[i] - y[i];
      ret -= diff * diff * z[i];
    }
  return ret;
}

No, this is related to PR 45022 in a certain sense, but the underlying
cause is not yet known.

For the above test case, if I compile with -O3 -march=amdfam10 -m64,
the loop is not vectorized because of the floating-point reduction. To my
surprise, no prefetch is generated. The cost model filtered out the
prefetches (we are trying to prefetch for each of the three memory
references):
Ahead 15, unroll factor 1, trip count -1
insn count 14, mem ref count 3, prefetch count 3
Not prefetching -- instruction to prefetch ratio (4) too small

However, if we compile with -O3 -ffast-math -march=amdfam10 -m64,
the loop can be vectorized, and one of the array references is
aligned. As a result, and due to PR 45022, we try to prefetch
only for the aligned reference, and one prefetch is inserted (this
time, the insns-to-prefetch ratio is big enough).

The fix for PR 45022 will actually result in no prefetch being generated, and
thus hide the problem.




-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391



[Bug target/45391] CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop

2010-08-23 Thread changpeng dot fang at amd dot com


--- Comment #3 from changpeng dot fang at amd dot com  2010-08-24 00:22 
---
I checked with open64 and did not find any regression. For the above
test case, open64 generated 3 non-temporal prefetches. As a result, I am
guessing that we are just unlucky in that the prefetch evicts useful data
for such streaming accesses (gcc generates one prefetcht0):

.Lt_0_6402:
 #<loop> Loop body line 8, nesting depth: 1, estimated iterations: 1000
        .loc    1       7       0
        movss 0(%r10),%xmm0             # [0] id:67
        subss 0(%r9),%xmm0              # [3]
        .loc    1       8       0
        mulss %xmm0,%xmm0               # [9]
        mulss 0(%rax),%xmm0             # [13]
        .loc    1       7       0
        prefetchnta 128(%r10)           # [17] L1
        prefetchnta 128(%r9)            # [17] L1
        .loc    1       8       0
        addq $4,%rax                    # [17]
        addq $4,%r10                    # [18]
        addq $4,%r9                     # [18]
        cmpq %r11,%rax                  # [18]
        prefetchnta 124(%rax)           # [19] L1
        subss %xmm0,%xmm1               # [19]
        jle .Lt_0_6402                  # [19]
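
For comparison, a non-temporal hint like the one open64 chose can be requested
by hand in GCC through __builtin_prefetch, whose third argument is a
temporal-locality hint (0 means none, which on x86 is typically emitted as
prefetchnta). The sketch below only illustrates that builtin on the test case
from comment #2. The name f_nta and the distance of 32 floats (128 bytes, as in
the displacements above) are assumptions, not something the compiler produces.

```c
#include <stddef.h>

#define AHEAD 32  /* hypothetical prefetch distance, in floats (128 bytes) */

float
f_nta (const float *x, const float *y, const float *z, size_t n)
{
  float ret = 0.0f;
  size_t i;

  for (i = 0; i < n; i++)
    {
      /* Only form addresses that lie inside the arrays.  */
      if (i + AHEAD < n)
        {
          __builtin_prefetch (&x[i + AHEAD], 0, 0);  /* read, no locality */
          __builtin_prefetch (&y[i + AHEAD], 0, 0);
          __builtin_prefetch (&z[i + AHEAD], 0, 0);
        }
      float diff = x[i] - y[i];
      ret -= diff * diff * z[i];
    }
  return ret;
}
```

Issuing a prefetch on every iteration, as here, is more than is needed: one per
cache line would do. It is kept this way to mirror the listing above.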


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391



[Bug target/45391] CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop

2010-08-23 Thread changpeng dot fang at amd dot com


--- Comment #4 from changpeng dot fang at amd dot com  2010-08-24 00:46 
---
Oops, the open64-generated code posted in the last comment is for the
non-vectorized loop; the vectorized one is similar:

.LBB23_f:
        .loc    1       7       0
        movups 0(%r10),%xmm3            # [0] id:65
        movups 0(%rax),%xmm1            # [1] id:64
        subps %xmm3,%xmm1               # [3]
        .loc    1       8       0
        mulps %xmm1,%xmm1               # [7]
        movups 0(%r9),%xmm2             # [9] id:66
        mulps %xmm2,%xmm1               # [11]
        addq $16,%rax                   # [13]
        addq $16,%r9                    # [14]
        addq $16,%r10                   # [14]
        .loc    1       7       0
        prefetchnta 112(%rax)           # [14] L1
        prefetchnta 112(%r10)           # [15] L1
        .loc    1       8       0
        prefetchnta 112(%r9)            # [15] L1
        subps %xmm1,%xmm0               # [15]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391



[Bug tree-optimization/45260] [4.5/4.6 Regression] g++4.5: -prefetch-loop-arrays internal compiler error: in verify_expr, at tree-cfg.c:2541

2010-08-20 Thread changpeng dot fang at amd dot com


--- Comment #5 from changpeng dot fang at amd dot com  2010-08-20 22:48 
---
I have a fix:
http://gcc.gnu.org/ml/gcc-patches/2010-08/msg01625.html


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45260



[Bug middle-end/44206] [4.6 Regression] ICE: Inline clone with address taken

2010-08-18 Thread changpeng dot fang at amd dot com


--- Comment #3 from changpeng dot fang at amd dot com  2010-08-18 19:43 
---
*** Bug 45269 has been marked as a duplicate of this bug. ***


-- 

changpeng dot fang at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |changpeng dot fang at amd
                   |                            |dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44206



[Bug c++/45269] CPU2006 450.soplex: verify_cgraph_node failed with -fprofile-generate

2010-08-18 Thread changpeng dot fang at amd dot com


--- Comment #2 from changpeng dot fang at amd dot com  2010-08-18 19:43 
---
http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00406.html

Verified. If I back out the above change, the bug goes away.
So it is a duplicate of bug 44206

*** This bug has been marked as a duplicate of 44206 ***


-- 

changpeng dot fang at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |DUPLICATE


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45269



[Bug tree-optimization/45260] [4.5/4.6 Regression] g++4.5: -prefetch-loop-arrays internal compiler error: in verify_expr, at tree-cfg.c:2541

2010-08-16 Thread changpeng dot fang at amd dot com


--- Comment #4 from changpeng dot fang at amd dot com  2010-08-16 22:39 
---
This bug should be related to VIEW_CONVERT_EXPR.

If I use the following statement to filter the prefetch, the bug will go away:

if (contains_view_convert_expr_p (ref))
  return false;


Otherwise, the prefetch pass will generate ref + offset as the prefetching
address.



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45260



[Bug c/45268] New: CPU2006 458.sjeng: type mismatch in array reference with -fwhole-program -combine

2010-08-12 Thread changpeng dot fang at amd dot com
458.sjeng compilation fails with the following config options:
( fails with gcc4.6, passes with gcc4.4, gcc4.5 not tested yet)

458.sjeng=peak=default:
ONESTEP = yes
COPTIMIZE   = -fwhole-program -combine -march=amdfam10 -m64
PORTABILITY = -DSPEC_CPU_LP64
feedback = 0

Here is the message:

specmake build 2> make.err | tee make.out
/usr/local/bin/gcc -DSPEC_CPU -DNDEBUG -fwhole-program -combine
-march=amdfam10 -m64   -DSPEC_CPU_LP64   attacks.c book.c crazy.c
draw.c ecache.c epd.c eval.c leval.c moves.c neval.c partner.c proof.c rcfile.c
search.c see.c seval.c sjeng.c ttable.c utils.c   -o sjeng
sjeng.c: In function 'main':
sjeng.c:75:5: error: type mismatch in array reference
struct move_x

struct move_x

game_history_x[move_number.324] = path_x[0];

sjeng.c:75:5: error: type mismatch in array reference
struct move_x

struct move_x

game_history_x[move_number.390] = path_x[0];

sjeng.c:75:5: error: type mismatch in array reference
struct move_x

struct move_x

path_x[0] = game_history_x[move_number.428];

sjeng.c:75:5: error: type mismatch in array reference
struct move_x

struct move_x

path_x[0] = game_history_x[move_number.435];

sjeng.c:75:5: error: type mismatch in array reference
struct move_x

struct move_x

path_x[0] = game_history_x[move_number.439];

sjeng.c:75:5: internal compiler error: verify_gimple failed


-- 
   Summary: CPU2006 458.sjeng: type mismatch in array reference with
-fwhole-program -combine
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45268



[Bug c++/45269] New: CPU2006 450.soplex: verify_cgraph_node failed with -fprofile-generate

2010-08-12 Thread changpeng dot fang at amd dot com
With gcc 4.6 on x86, 450.soplex ICEs with -fprofile-generate in spxmpsread.cc:

g++ -c -o spxmpsread.o -DSPEC_CPU -DNDEBUG -fprofile-generate   -O2 -m64
-DSPEC_CPU_LP64  spxmpsread.cc
spxmpsread.cc:678:1: error: Inline clone with address taken
std::basic_ostream<_CharT, _Traits>& std::endl(std::basic_ostream<_CharT,
_Traits>&) [with _CharT = char, _Traits = std::char_traits<char>]/276(-1)
@0x7fafaf623000 (asm:
_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_) (inline copy in
virtual bool soplex::SPxLP::readMPS(std::istream&, soplex::NameSet*,
soplex::NameSet*, soplex::DIdxSet*)/728) availability:local analyzed 71 time,
13 benefit (100 after inlining) 35 size, 4 benefit (75 after inlining)
address_taken body local finalized inlinable
  called by: void
soplex::_ZN6soplexL8readRowsERNS_8MPSInputERNS_8LPRowSetERNS_7NameSetE.constprop.9(soplex::MPSInput&,
soplex::LPRowSet&, soplex::NameSet&)/268 (0.01 per call) (inlined) (can throw
external)
  calls: <built-in>/722 (0.01 per call) std::basic_ios<_CharT,
_Traits>::char_type std::basic_ios<_CharT, _Traits>::widen(char) const [with
_CharT = char, _Traits = std::char_traits<char>, std::basic_ios<_CharT,
_Traits>::char_type = char]/277 (inlined) (0.01 per call) (can throw external)
std::basic_ostream<_CharT, _Traits>& std::basic_ostream<_CharT,
_Traits>::put(std::basic_ostream<_CharT, _Traits>::char_type) [with _CharT =
char, _Traits = std::char_traits<char>, std::basic_ostream<_CharT,
_Traits>::char_type = char]/837 (0.01 per call) (can throw external)
std::basic_ostream<_CharT, _Traits>& std::basic_ostream<_CharT,
_Traits>::flush() [with _CharT = char, _Traits = std::char_traits<char>]/840
(0.01 per call) (can throw external)
  References:  var:long int* __gcov_indirect_call_counters (read) var:void*
__gcov_indirect_call_callee (read) var:long int *.LPBX1 [427] (write) var:void*
__gcov_indirect_call_callee (write) var:long int *.LPBX1 [427] (read) var:long
int *.LPBX1 [427] (write) var:long int *.LPBX1 [427] (read) var:long int
*.LPBX1 [427] (write) var:long int *.LPBX1 [427] (read) var:long int *.LPBX1
[427] (write) var:long int *.LPBX1 [427] (read)
  Refering this function:  fn:void
soplex::_ZN6soplexL10readBoundsERNS_8MPSInputERNS_8LPColSetERNS_7NameSetEPNS_7DIdxSetE.constprop.13(soplex::MPSInput&,
soplex::LPColSet&, soplex::NameSet&, soplex::DIdxSet*)/595 (addr) fn:void
soplex::_ZN6soplexL10readRangesERNS_8MPSInputERNS_8LPRowSetERNS_7NameSetE.constprop.12(soplex::MPSInput&,
soplex::LPRowSet&, soplex::NameSet&)/481 (addr) fn:void
soplex::_ZN6soplexL7readRhsERNS_8MPSInputERNS_8LPRowSetERNS_7NameSetE.constprop.11(soplex::MPSInput&,
soplex::LPRowSet&, soplex::NameSet&)/260 (addr) fn:void
soplex::_ZN6soplexL8readRowsERNS_8MPSInputERNS_8LPRowSetERNS_7NameSetE.constprop.9(soplex::MPSInput&,
soplex::LPRowSet&, soplex::NameSet&)/268 (addr) fn:void
soplex::_ZN6soplexL8readNameERNS_8MPSInputE.constprop.7(soplex::MPSInput&)/369
(addr)
spxmpsread.cc:678:1: internal compiler error: verify_cgraph_node failed
Please submit a full bug report,
with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions


-- 
   Summary: CPU2006 450.soplex: verify_cgraph_node failed with -
fprofile-generate
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45269



[Bug c/45270] New: CPU2006 435.gromacs: Segmentation fault with -fprofile-generate

2010-08-12 Thread changpeng dot fang at amd dot com
With gcc 4.6 on x86, 435.gromacs gets a segmentation fault with
-fprofile-generate in constr.c:

gcc -c -DSPEC_CPU -DNDEBUG  -I. -DHAVE_CONFIG_H  -fprofile-generate  -O2  -m64 
-DSPEC_CPU_LP64   constr.c
constr.c: In function ‘count_constraints’:
constr.c:624:5: internal compiler error: Segmentation fault


-- 
   Summary: CPU2006 435.gromacs: Segmentation fault with -fprofile-
generate
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45270



[Bug tree-optimization/45260] [4.5/4.6 Regression] g++4.5: -prefetch-loop-arrays internal compiler error: in verify_expr, at tree-cfg.c:2541

2010-08-11 Thread changpeng dot fang at amd dot com


--- Comment #3 from changpeng dot fang at amd dot com  2010-08-12 00:38 
---
(In reply to comment #2)
> It was caused by revision 153878:
>
> http://gcc.gnu.org/ml/gcc-cvs/2009-11/msg00094.html
>

I think the same patch was also committed to the 4.4 branch.
Maybe some prefetch work in 4.5 triggered the bug.

> and disappeared with revision 159514:
>
> http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00566.html
>
> I am not sure if it really fixed the bug.
>

This cannot be a valid fix, because it just disables some prefetches
out of performance concerns.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45260



[Bug tree-optimization/45241] [4.5/4.6 Regression] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre

2010-08-10 Thread changpeng dot fang at amd dot com


--- Comment #7 from changpeng dot fang at amd dot com  2010-08-10 21:44 
---
(In reply to comment #5)
> (In reply to comment #1)
> > This patch should be a valid fix, because the recognition of the dot_prod
> > pattern is known to fail at this point if the stmt is outside the loop.
> > (I am not sure whether we should not see this case in the vectorizer at this
> > point -- should previous analysis already filter out?):
> >
>
> I don't understand this. Where do we check if the stmt (which one?) is outside
> the loop?

Forget about this part of the comment. (The vectorization analysis is correct;
it is just that the pattern recognition traces the chain outside the loop.)

> I was looking at PR 45239 and didn't notice that there is another PR and
> didn't see this comment. So I tested the same fix (successfully on
> x86_64-suse-linux).
> You can commit it if you like (just please notice, that the bug exists on 4.5
> as well).
>

I am going to add your testcase (in comment #4), do a bootstrap, and
then commit to the trunk and the gcc 4.5 branch.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241



[Bug tree-optimization/45239] New: CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre

2010-08-09 Thread changpeng dot fang at amd dot com
With gcc 4.6:
gfortran  -c -o diis.fppized.o -O3 -fno-tree-pre -march=amdfam10 -m64
diis.fppized.f90

diis.fppized.f90: In function 'extrapolate':
diis.fppized.f90:882:0: internal compiler error: vector VEC(vec_void_p,base)
index domain error, in vinfo_for_stmt at tree-vectorizer.h:595
Please submit a full bug report,
with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions

This is invoked in vect_recog_dot_prod_pattern:
stmt_vinfo = vinfo_for_stmt (stmt);

Where stmt is not inside the loop, and thus stmt_vinfo was not set up.


-- 
   Summary: CPU2006 465.tonto ICE in the vectorizer with -fno-tree-
pre
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45239



[Bug tree-optimization/45241] New: CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre

2010-08-09 Thread changpeng dot fang at amd dot com
With gcc 4.6:
gfortran  -c -o diis.fppized.o -O3 -fno-tree-pre -march=amdfam10 -m64
diis.fppized.f90

diis.fppized.f90: In function 'extrapolate':
diis.fppized.f90:882:0: internal compiler error: vector VEC(vec_void_p,base)
index domain error, in vinfo_for_stmt at tree-vectorizer.h:595
Please submit a full bug report,
with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions

This is invoked in vect_recog_dot_prod_pattern:
stmt_vinfo = vinfo_for_stmt (stmt);

Where stmt is not inside the loop, and thus stmt_vinfo was not set up.


-- 
   Summary: CPU2006 465.tonto ICE in the vectorizer with -fno-tree-
pre
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241



[Bug tree-optimization/45241] CPU2006 465.tonto ICE in the vectorizer with -fno-tree-pre

2010-08-09 Thread changpeng dot fang at amd dot com


--- Comment #1 from changpeng dot fang at amd dot com  2010-08-09 17:52 
---
This patch should be a valid fix, because the recognition of the dot_prod
pattern is known to fail at this point if the stmt is outside the loop.
(I am not sure whether we should not see this case in the vectorizer at this
point -- should previous analysis already filter out?):

diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 19f0ae6..5f81a73 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -259,6 +259,10 @@ vect_recog_dot_prod_pattern (gimple last_stmt, tree
*type_in, tree *type_out)
  inside the loop (in case we are analyzing an outer-loop).  */
   if (!is_gimple_assign (stmt))
     return NULL;
+
+  if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+    return NULL;
+
   stmt_vinfo = vinfo_for_stmt (stmt);
   gcc_assert (stmt_vinfo);
   if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_internal_def)
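
The effect of the added check can be illustrated with a toy model that is
independent of GCC's data structures. Everything in it (struct toy_stmt, the
table size, the function name) is hypothetical: per-statement info exists only
for statements inside the loop, so a walk along a definition chain has to stop
before it looks up a statement that lies outside.

```c
#include <stdbool.h>
#include <stddef.h>

#define N_LOOP_STMTS 4   /* info is allocated only for in-loop statements */

struct toy_stmt
{
  unsigned uid;                 /* index into the info table */
  bool inside_loop;
  const struct toy_stmt *def;   /* statement defining the operand, or NULL */
};

static int info_table[N_LOOP_STMTS];

/* Follow the definition chain one step from LAST.  Return a pointer to the
   info of the defining statement, or NULL if there is none or if it lies
   outside the loop.  */
static const int *
toy_recog_pattern (const struct toy_stmt *last)
{
  const struct toy_stmt *stmt = last->def;

  if (stmt == NULL)
    return NULL;
  /* The guard: give up instead of indexing the table out of range.  */
  if (!stmt->inside_loop)
    return NULL;
  return &info_table[stmt->uid];
}
```

Without the guard, a statement outside the loop with uid 7 would index past the
end of the four-entry table, which is the toy analogue of the index domain
error reported in vinfo_for_stmt.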


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45241



[Bug tree-optimization/45022] No prefetch for the vectorized loop

2010-07-29 Thread changpeng dot fang at amd dot com


--- Comment #4 from changpeng dot fang at amd dot com  2010-07-29 19:14 
---
(In reply to comment #1)
> The misaligned indirect-refs will vanish soon.
>

I saw your patch that removes ALIGN_INDIRECT_REF. Do you also plan to remove
MISALIGNED_INDIRECT_REF? Thanks.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45022



[Bug tree-optimization/45021] Redundant prefetches for some loops (vectorizer produced ones too)

2010-07-28 Thread changpeng dot fang at amd dot com


--- Comment #4 from changpeng dot fang at amd dot com  2010-07-28 18:22 
---
Andrew's example is exactly what the prefetch sees for the test case (in the
bug description). Unfortunately, the prefetch pass could not recognize that
vect_pa.6_24 and vect_pa.20_38 are exactly the same address:

<bb 2>:
  pretmp.2_18 = (float) beta_4(D);
  vect_pa.9_22 = (vector(4) float *) a;
  vect_pa.6_23 = vect_pa.9_22;
  vect_cst_.12_27 = {pretmp.2_18, pretmp.2_18, pretmp.2_18, pretmp.2_18};
  vect_pb.16_29 = (vector(4) float *) b;
  vect_pb.13_30 = vect_pb.16_29;
  vect_pa.23_36 = (vector(4) float *) a;
  vect_pa.20_37 = vect_pa.23_36;

<bb 3>:
  # vect_pa.6_24 = PHI <vect_pa.6_25(4), vect_pa.6_23(2)>
  # vect_pb.13_31 = PHI <vect_pb.13_32(4), vect_pb.13_30(2)>
  # vect_pa.20_38 = PHI <vect_pa.20_39(4), vect_pa.20_37(2)>
  # ivtmp.24_40 = PHI <ivtmp.24_41(4), 0(2)>
  vect_var_.10_26 = *vect_pa.6_24;
  vect_var_.11_28 = vect_cst_.12_27;
  vect_var_.17_33 = *vect_pb.13_31;
  vect_var_.18_34 = vect_var_.11_28 * vect_var_.17_33;
  vect_var_.19_35 = vect_var_.10_26 + vect_var_.18_34;
  *vect_pa.20_38 = vect_var_.19_35;
  vect_pa.6_25 = vect_pa.6_24 + 16;
  vect_pb.13_32 = vect_pb.13_31 + 16;
  vect_pa.20_39 = vect_pa.20_38 + 16;
  ivtmp.24_41 = ivtmp.24_40 + 1;
  if (ivtmp.24_41 < 256)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  goto <bb 3>;
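
In the dump above, both vect_pa pointers start from the address of a and both
advance by 16 bytes per iteration, so they hold the same address in every
iteration. A minimal sketch of the comparison that would detect this follows.
The struct and function names are hypothetical, and this is not how the
prefetch pass represents references.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical summary of a pointer induction variable.  */
struct ptr_iv
{
  const char *base;   /* symbol the initial value is derived from */
  long init_offset;   /* byte offset of the initial value from BASE */
  long step;          /* bytes added per iteration */
};

/* Two such pointers hold the same address in every iteration when they
   start at the same place and advance by the same amount.  */
static bool
same_address_p (const struct ptr_iv *p, const struct ptr_iv *q)
{
  return strcmp (p->base, q->base) == 0
         && p->init_offset == q->init_offset
         && p->step == q->step;
}
```

Described this way, vect_pa.6_24 and vect_pa.20_38 are both ("a", 0, 16) and
compare equal, while vect_pb.13_31 is ("b", 0, 16) and does not.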


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021



[Bug tree-optimization/45021] Redundant prefetches for some loops (vectorizer produced ones too)

2010-07-28 Thread changpeng dot fang at amd dot com


--- Comment #5 from changpeng dot fang at amd dot com  2010-07-28 18:28 
---
Things are a little more complicated if we change the code to:

a[i] = a[i+1] + beta * b[i];

The prefetch pass wants to group a[i] and a[i+1], i.e. they have
the same base address with an offset of 4 bytes.
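
Why grouping pays off here can be seen with a little arithmetic. This is a
sketch that assumes 4-byte floats, a 64-byte cache line and a line-aligned
array, none of which the comment states; the function name is hypothetical.
It counts the iterations in which a[i] and a[i+1] fall into different cache
lines.

```c
#include <stddef.h>

/* Count the i in [0, n-1) for which elements i and i+1 of an array of
   ELEM_SIZE-byte elements, starting on a line boundary, lie in different
   LINE_SIZE-byte cache lines.  */
static size_t
count_line_crossings (size_t n, size_t elem_size, size_t line_size)
{
  size_t i, crossings = 0;

  for (i = 0; i + 1 < n; i++)
    if ((i * elem_size) / line_size != ((i + 1) * elem_size) / line_size)
      crossings++;
  return crossings;
}
```

For 1024 floats the two references share a line in all but 63 of the 1023
iterations, so under these assumptions a single prefetch for the group covers
both references almost all the time.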


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021



[Bug tree-optimization/45022] No prefetch for the vectorized loop

2010-07-22 Thread changpeng dot fang at amd dot com


--- Comment #2 from changpeng dot fang at amd dot com  2010-07-22 20:52 
---
(In reply to comment #1)
> The misaligned indirect-refs will vanish soon.
>

From the prefetching point of view, is there any reason why we cannot
prefetch for misaligned or indirect refs? I understand that prefetching for
indirect refs may be too aggressive.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45022



[Bug tree-optimization/45021] New: Redundant prefetches for the vectorized loop

2010-07-21 Thread changpeng dot fang at amd dot com
For the following test case, prefetches will be inserted for both the load and
store of a[i] if the loop is vectorized:

float a[1024], b[1024];
void foo(int beta)
{
  int i;
  for(i=0; i<1024; i++)
    a[i] = a[i] + beta * b[i];
}

with gcc -O3 -fprefetch-loop-arrays -march=amdfam10 -S, a piece of the assembly
is:
        movaps  (%rcx), %xmm0
        addl    $4, %edi
        prefetcht0      (%rdx)
        prefetcht0      240(%rcx)
        prefetchw       (%rdx)
        leaq    64(%rax), %rsi
        mulps   %xmm1, %xmm0


If we don't vectorize the loop, only the load of a[i] is prefetched (there is
no separate prefetch for the store):
        addl    $16, %eax
        salq    $2, %rcx
        mulss   %xmm1, %xmm0
        prefetcht0      a+92(%rcx)
        prefetcht0      b+92(%rcx)
        movl    %esi, %ecx


-- 
   Summary: Redundant prefetches for the vectorized loop
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021



[Bug tree-optimization/45022] New: No prefetch for the vectorized loop

2010-07-21 Thread changpeng dot fang at amd dot com
For the following test case, if we compile with -O3 -fprefetch-loop-arrays
-march=amdfam10, the loop is versioned (for runtime alias checking) to be
vectorized. However, we see prefetches in the non-vectorized version, but
not in the vectorized version.

void foo(int beta, float *a, float *b)
{
  int i;
  for(i=0; i<1024; i++)
    a[i] = a[i] + beta * b[i];
}

For the vectorized loop, in tree-ssa-loop-prefetch.c (idx_analyze_ref):
  if (TREE_CODE (base) == MISALIGNED_INDIRECT_REF
      || TREE_CODE (base) == ALIGN_INDIRECT_REF)
    return false;

FALSE is returned due to mis-aligned indirect reference:
M*vect_p.18_61{misalignment: 0}
M*vect_p.23_66{misalignment: 0}
M*vect_p.31_74{misalignment: 0}


-- 
   Summary: No prefetch for the vectorized loop
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45022



[Bug tree-optimization/45021] Redundant prefetches for the vectorized loop

2010-07-21 Thread changpeng dot fang at amd dot com


--- Comment #1 from changpeng dot fang at amd dot com  2010-07-21 18:26 
---
The direct cause is that the prefetch pass cannot tell that the base
addresses of the vectorized load and store (of a[i]) are the same:
*vect_pa.6_24
*vect_pa.19_37


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-07-21 Thread changpeng dot fang at amd dot com


--- Comment #23 from changpeng dot fang at amd dot com  2010-07-21 21:30 
---
Fixed


-- 

changpeng dot fang at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug tree-optimization/44955] New: over-prefetched for arrays of complex number

2010-07-15 Thread changpeng dot fang at amd dot com
While working on a prefetching-induced performance degradation in
168.wupwise, I found that complex arrays are always over-prefetched:
prefetches are generated for both the real part and the imaginary part.

      subroutine s311 (i,j,n,m,beta,a,b)
c
c     reductions
c     sum reduction
c
      integer n, i, j, beta, m
      complex a(n,n), b(n,n)

      do 1 j = 1,n
      do 10 i = 1,m
         a(i,j) = a(i,j) + beta * b(i,j)
  10  continue
  1   continue

      return
      end


For this example, two prefetches are generated for a, and two prefetches for
b.


-- 
   Summary: over-prefetched for arrays of complex number
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44955



[Bug tree-optimization/44955] over-prefetched for arrays of complex number

2010-07-15 Thread changpeng dot fang at amd dot com


--- Comment #1 from changpeng dot fang at amd dot com  2010-07-15 17:20 
---
This is a piece of code that shows the two prefetches for b.

mulss   %xmm4, %xmm5
addq    $8, %rdx
prefetcht0  96(%r11)
prefetcht0  100(%r11)
subss   %xmm2, %xmm1
addss   %xmm5, %xmm0

In collecting memory references for the loops, the reference to the imaginary
part is put into a different group from that of the real part (and thus two
prefetches are generated).

Reference 0x2d61e70:
  group 0x2d63630 (base REALPART_EXPR *b_64(D)...

Reference 0x2d615e0:
  group 0x2d40f40 (base IMAGPART_EXPR *b_64(D)...

I think that both bases should be reduced to the same base, with an offset of 4
for the imaginary part, so that they can be in the same group.
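
As a rough illustration of why one prefetch would be enough, the sketch below
assumes a C99 `float complex` element (real part at offset 0, imaginary part at
offset 4) and a 64-byte cache line. The helper name `same_cache_line` and the
line size are assumptions made for this example; they are not taken from the
prefetch pass.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; the real value is target dependent.  */
#define CACHE_LINE_SIZE 64

/* Return 1 if the real and imaginary parts of element IDX of an array
   starting at byte address BASE fall into the same cache line.  */
int
same_cache_line (uintptr_t base, size_t idx)
{
  /* C99 lays a complex out as two consecutive values of the real type.  */
  assert (sizeof (float complex) == 2 * sizeof (float));

  uintptr_t re = base + idx * sizeof (float complex);  /* REALPART_EXPR */
  uintptr_t im = re + sizeof (float);                  /* IMAGPART_EXPR */
  return re / CACHE_LINE_SIZE == im / CACHE_LINE_SIZE;
}
```

With a naturally aligned array, both parts of every element share a line, so
the second prefetch (at offset 100 in the dump above) fetches a line that the
first one (at offset 96) already covers.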


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44955



[Bug tree-optimization/44794] pre- and post-loops should not be unrolled.

2010-07-14 Thread changpeng dot fang at amd dot com


--- Comment #4 from changpeng dot fang at amd dot com  2010-07-15 01:50 
---
Created an attachment (id=21205)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=21205&action=view)
Do not unroll pre and post loops

I did a quick test on polyhedron before and after applying the preliminary
patch. Tests are based on -O3 -fprefetch-loop-arrays -funroll-loops.

               timing (s)         |        size (B)
          before  after   %reduc  | before  after   %reduc
cacacita  14.35   10.88   24.18   | 90715   72843   19.7
gas_dyn   34.68   21.58   37.77   | 149608  100936  32.53
nf        33.91   19.32   43.03   | 139150  83054   40.31
protein   51.35   33.23   35.29   | 163672  122808  24.97
rnflow    60.9    43.28   28.93   | 268784  169152  37.07
test_fpu  52.61   30.35   42.31   | 234045  144285  38.35


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44794



[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling

2010-07-08 Thread changpeng dot fang at amd dot com


--- Comment #20 from changpeng dot fang at amd dot com  2010-07-09 01:59 
---
I submitted a patch for review to completely fix the problem. The patch is an
extension to Christian's speedup.patch. It splits the cost analysis into
three small functions and quits further prefetching analysis as soon as we
know that prefetching is not going to be beneficial to the loop.

Here is the gcc-patches@ link:
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00734.html


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576



[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling

2010-07-07 Thread changpeng dot fang at amd dot com


--- Comment #19 from changpeng dot fang at amd dot com  2010-07-07 19:00 
---
(In reply to comment #18)
> Changpeng, should this PR be closed now?
> 

No. I am still looking at the dependence computation cost. I just found that
most of the time is spent in memory allocation and freeing of the data
dependence relation structure.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576



[Bug tree-optimization/44794] pre- and post-loops should not be unrolled.

2010-07-06 Thread changpeng dot fang at amd dot com


--- Comment #2 from changpeng dot fang at amd dot com  2010-07-06 17:58 
---
We also need to handle the post loop of unrolling. Suppose the unroll_factor
is 16, then the post-loop should have up to 15 iterations.
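
A minimal sketch of that bound, assuming the post-loop simply executes the
iterations left over after the unrolled body has run as often as it can (the
function name is made up for this example):

```c
#include <assert.h>

/* Number of iterations left for the post-loop when a loop of NITER
   iterations is unrolled UNROLL_FACTOR times.  The result is always
   smaller than UNROLL_FACTOR.  */
unsigned
post_loop_iterations (unsigned niter, unsigned unroll_factor)
{
  assert (unroll_factor > 0);
  return niter % unroll_factor;
}
```

For an unroll factor of 16 the result never exceeds 15, which is why unrolling
the post-loop again cannot pay off.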


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44794



[Bug tree-optimization/44794] pre- and post-loops should not be unrolled.

2010-07-06 Thread changpeng dot fang at amd dot com


--- Comment #3 from changpeng dot fang at amd dot com  2010-07-06 18:35 
---
Here is the impact of loop unrolling on the compilation time and code size
on polyhedron test_fpu.f90:

-O3 -ftree-vectorize -fno-prefetch-loop-arrays -fno-unroll-loops:
timing: 12.62s,  size: 67069  bytes
-O3 -ftree-vectorize -fprefetch-loop-arrays -funroll-loops:
timing: 51.77s,  size: 234045 bytes


I also did an experiment on prefetching that we don't unroll the pre- and
post-loop generated by the vectorizer:
-O3 -ftree-vectorize -fprefetch-loop-arrays:
timing: 29.32s   size: 92541 bytes
-O3 -ftree-vectorize -fprefetch-loop-arrays (don't unroll pre- postloops)
timing: 18.34s   size: 78909 bytes 
-O3 -ftree-vectorize -fno-prefetch-loop-arrays
timing: 12.62s,  size: 67069  bytes


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44794



[Bug tree-optimization/44794] New: pre- and post-loops should not be unrolled.

2010-07-02 Thread changpeng dot fang at amd dot com
void foo(int *a, int *b, int n)
{
  int i;
  for(i = 0; i < n; i++)
 a[i] = a[i] + b[i];
}

For this simple loop, the vectorizer does its job and peels the last few
iterations into a post-loop that is not vectorized. But the RTL loop unroller
does not know that this post-loop has only a few (at most 3 in this case)
iterations, and will unroll it.

What is worse, if you compile it with:
  gcc -O3 -fprefetch-loop-arrays -funroll-loops

you may find that the prefetch pass also unrolls the post-loop and generates
a new post-loop (post-post-loop) for it. Again, the RTL loop unroller does
not recognize this post-post-loop, and will unroll it (generating yet
another post-loop, a post-post-post-loop, for the post-post-loop :-)).

This increases compilation time and code size dramatically without
any performance benefit.
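
To make the "at most 3" concrete, here is a hand-written sketch of how the loop
looks after the split, assuming a vectorization factor of 4 (four ints per
vector). The inner `k` loop only stands in for one vector operation, and the
function name and return value are assumptions added for illustration.

```c
#include <assert.h>

#define VF 4  /* assumed vectorization factor: four ints per vector */

/* Adds b[] into a[] the way the vectorizer splits the loop, and returns
   the number of iterations executed by the scalar post-loop.  */
int
add_with_post_loop (int *a, const int *b, int n)
{
  int i, k, post_iters = 0;

  assert (n >= 0);

  /* Main loop: stands in for the vectorized loop, VF elements per trip.  */
  for (i = 0; i + VF <= n; i += VF)
    for (k = 0; k < VF; k++)
      a[i + k] = a[i + k] + b[i + k];

  /* Post-loop: at most VF - 1 iterations, so unrolling it is wasted work.  */
  for (; i < n; i++)
    {
      a[i] = a[i] + b[i];
      post_iters++;
    }

  return post_iters;
}
```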


-- 
   Summary: pre- and post-loops should not be unrolled.
   Product: gcc
   Version: lno
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44794



[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling

2010-07-02 Thread changpeng dot fang at amd dot com


--- Comment #17 from changpeng dot fang at amd dot com  2010-07-02 23:58 
---
(In reply to comment #15)
I have opened PR44794 for the issue of unrolling pre- and post-loops.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576



[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling

2010-06-30 Thread changpeng dot fang at amd dot com


--- Comment #15 from changpeng dot fang at amd dot com  2010-07-01 00:34 
---
Unrolling of the peeled loop is partially the reason for the test_fpu.f90
compilation time and code size increase. Vectorization peels a few iterations
of the loop, and the prefetching and unrolling passes do not recognize that a
loop is a peeled version and still unroll it.

 MODULE kinds
   INTEGER, PARAMETER :: RK8 = SELECTED_REAL_KIND(15, 300)
END MODULE kinds
! 
PROGRAM TEST_FPU  ! A number-crunching benchmark using matrix inversion.
USE kinds ! Implemented by:David Frank  dave_fr...@hotmail.com
IMPLICIT NONE ! Gauss  routine by: Tim Prince   n...@aol.com
  ! Crout  routine by: James Van Buskirk  tor...@ix.netcom.com
  ! Lapack routine by: Jos Bergervoet berge...@iaehv.nl

REAL(RK8) :: pool(101, 101,1000), a(101, 101)
INTEGER :: i

  DO i = 1,1000
 a = pool(:,:,i) ! get next matrix to invert
  END DO

END PROGRAM TEST_FPU


In this example, prefetching will unroll three versions of the innermost loop.
If we turn off the vectorizer, it unrolls the only loop.

In addition, -fprefetch-loop-arrays and -funroll-loops (turned on at the same
time) will unroll the same loop. This is over-unrolling, and -funroll-loops
should recognize that the loop has already been unrolled by prefetching.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576



[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling

2010-06-29 Thread changpeng dot fang at amd dot com


--- Comment #13 from changpeng dot fang at amd dot com  2010-06-30 00:23 
---
Here is the current status of this work:
patch1: http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02956.html
patch2: http://gcc.gnu.org/ml/gcc-patches/2010-06/msg03049.html
On my system with -O3 zero_sized_1.f90 -fprefetch-loop-arrays 
-fno-unroll-loops --param max-completely-peeled-insns=2000:

original timing:  5m30s
with patch1:  1m20s
with patch1 + patch2: 1m03s
without prefetch: 0m30s

The timing with prefetch-loop-arrays is still doubled after the two patches
compared to no-prefetch-loop-arrays. The extra 33s is mostly spent in 
dependence computation for loops. For this test case, prefetching is the
only optimization that invokes compute_all_dependences.

I am not sure whether we should tolerate this timing increase with aggressive
peeling and prefetching, or we should work on the cost reduction of dependence
computation.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576



[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling

2010-06-29 Thread changpeng dot fang at amd dot com


--- Comment #14 from changpeng dot fang at amd dot com  2010-06-30 00:36 
---

(In reply to comment #7)
> A good chunk of time seems to be spent in the RTL loop unroller, triggered
> by array prefetching (testing with -O3 -funroll-loops).  Otherwise it might
> as well be just excessive code growth caused by prefetching.

Yes, for test_fpu.f90, more than half of the time is spent in the RTL loop
unroller, and if manually set unroll_factor to 1 (don't unroll), the timing
increase by array prefetching is negligible.

With -O3 -funroll-loops, I don't expect code size or compilation time increase
from the RTL loop unroller, triggered by array prefetching.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576



[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling

2010-06-28 Thread changpeng dot fang at amd dot com


--- Comment #11 from changpeng dot fang at amd dot com  2010-06-29 00:07 
---
I have a patch that partially fixes the problem:
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02956.html

Note that for this test case, the compile time doubled even though
I don't compute the miss rate at all.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576



[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling

2010-06-28 Thread changpeng dot fang at amd dot com


--- Comment #12 from changpeng dot fang at amd dot com  2010-06-29 00:49 
---
Created an attachment (id=21034)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=21034&action=view)
Early return in miss rate computation

The attached patch improves the computation of the miss rate. We can stop
computing once the total misses have already exceeded the given acceptable
threshold.
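
The idea can be sketched as follows. This is not the code from the attachment;
the function name, the flat array of per-reference miss estimates, and the
`visited` counter are assumptions chosen only to show the early return.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the accumulated miss estimate stays within THRESHOLD,
   0 otherwise.  *VISITED receives how many entries were inspected.  */
int
misses_within_threshold (const unsigned *misses, size_t n,
                         unsigned threshold, size_t *visited)
{
  unsigned total = 0;
  size_t i;

  assert (visited != NULL);
  for (i = 0; i < n; i++)
    {
      total += misses[i];
      if (total > threshold)
        {
          /* Early return: the remaining entries cannot change the answer.  */
          *visited = i + 1;
          return 0;
        }
    }
  *visited = n;
  return 1;
}
```

Since the total only grows, the answer is known as soon as the threshold is
crossed, and the rest of the (possibly expensive) per-reference work is skipped.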


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576



[Bug middle-end/44576] [4.5/4.6 Regression] testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling

2010-06-25 Thread changpeng dot fang at amd dot com


--- Comment #4 from changpeng dot fang at amd dot com  2010-06-25 17:08 
---
(In reply to comment #3)
> Created an attachment (id=21001)
>  --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=21001&action=view) [edit]
> Potential fix for compile time regression
> 
> Here is a potential fix. We just limit prefetching to loops with a low amount
> of memory references and bail out if the amount of references is too large.
> 

This should be a good fix for now. But the complexities of computing group
reuse and miss rate are still a concern. I don't think we need to compute the
miss rate exactly here.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44576



[Bug tree-optimization/44503] control flow in the middle of basic block with -fprefetch-loop-arrays

2010-06-14 Thread changpeng dot fang at amd dot com


--- Comment #3 from changpeng dot fang at amd dot com  2010-06-14 18:28 
---
Actually, the prefetching is for the following loop:
for (i = 0; i < p[2]; i++)
  q[i] = 0;

I do not understand why unrolling this loop affects other parts of
the program that use longjmp.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44503



[Bug tree-optimization/44503] control flow in the middle of basic block with -fprefetch-loop-arrays

2010-06-14 Thread changpeng dot fang at amd dot com


--- Comment #4 from changpeng dot fang at amd dot com  2010-06-14 22:22 
---
There is nothing wrong in the prefetch itself. The problem is the
__builtin_prefetch call used for the prefetch instruction. Whenever
there is a non-local label in the current function, the inserted
__builtin_prefetch call will be considered a control flow statement:

is_ctrl_altering_stmt (gimple t)
{
   

/* A non-pure/const call alters flow control if the current
   function has nonlocal labels.  */
if (!(flags & (ECF_CONST | ECF_PURE)) && cfun->has_nonlocal_label) {
  return true;
...
}
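
The test quoted above can be isolated in a few lines. In this sketch the flag
values and the function name `call_alters_flow_p` are assumptions made so that
the snippet stands alone; only the boolean condition is taken from the excerpt.

```c
#include <assert.h>

/* Illustrative flag bits; the values are assumptions for this sketch.  */
#define ECF_CONST (1 << 0)
#define ECF_PURE  (1 << 1)

/* Mirrors the quoted test: a call that is neither const nor pure is
   treated as altering control flow when the function has nonlocal labels.  */
int
call_alters_flow_p (int flags, int has_nonlocal_label)
{
  assert (has_nonlocal_label == 0 || has_nonlocal_label == 1);
  return !(flags & (ECF_CONST | ECF_PURE)) && has_nonlocal_label;
}
```

A call carrying neither flag in a function with a nonlocal label satisfies the
condition, so a prefetch call inserted in the middle of a basic block trips
verify_flow_info.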


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44503



[Bug c/44503] New: control flow in the middle of basic block with -fprefetch-loop-arrays

2010-06-11 Thread changpeng dot fang at amd dot com
Attached is a test case from gcc regression test. verify_flow_info failed
when I turned on prefetching.

gcc -O3 -fprefetch-loop-arrays setjmp-1.c
setjmp-1.c: In function ‘main’:
setjmp-1.c:17:1: error: control flow in the middle of basic block 20
setjmp-1.c:17:1: error: control flow in the middle of basic block 20
setjmp-1.c:17:1: internal compiler error: verify_flow_info failed
Please submit a full bug report,

Looks like loops with longjmp should not be unrolled.


-- 
   Summary: control flow in the middle of basic block with -fprefetch-loop-arrays
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44503



[Bug c/44503] control flow in the middle of basic block with -fprefetch-loop-arrays

2010-06-11 Thread changpeng dot fang at amd dot com


--- Comment #1 from changpeng dot fang at amd dot com  2010-06-11 16:32 
---
Created an attachment (id=20894)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20894&action=view)
prefetching for the while loop?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44503



[Bug tree-optimization/44503] control flow in the middle of basic block with -fprefetch-loop-arrays

2010-06-11 Thread changpeng dot fang at amd dot com


--- Comment #2 from changpeng dot fang at amd dot com  2010-06-11 18:45 
---
Bug 39398 looks similar, but that one seems to involve exception handling
instead of setjmp.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44503



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-06-08 Thread changpeng dot fang at amd dot com


--- Comment #21 from changpeng dot fang at amd dot com  2010-06-08 16:23 
---
Just for the record, non-constant step prefetching improves 459.GemsFDTD
by 5.5% (under -O3 + prefetch) on amd-linux64 systems. And the gains are
from the following set of loops:
NFT.fppized.f90:1268
NFT.fppized.f90:1227
NFT.fppized.f90:1186
NFT.fppized.f90:1148
NFT.fppized.f90:1109
NFT.fppized.f90:1072


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-06-07 Thread changpeng dot fang at amd dot com


--- Comment #14 from changpeng dot fang at amd dot com  2010-06-07 18:27 
---
Here is the current status of my investigation:

(1) 465.tonto regression (~9%):
The regression mainly comes from loops which have array references with both
constant (prefetch_mod = 8) and non-constant (prefetch_mod = 1) steps. The loops
are unrolled 8 times, and 8 non-constant step prefetches are inserted into the
unrolled loops.

The ideal way to solve the problem is to compute the prefetch count considering
the effect of unrolling, i.e. we should count 8 non-constant step prefetches
instead of 1.

(2) 416.gamess regression (~5%):
The regression is from non-constant-step prefetching for outer loops. I am
proposing not to do non-constant step prefetching for outer loops to solve the
problem.
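
For item (1), the corrected count per reference can be sketched like this. The
rounding-up division is my assumption about how the count would be taken once
unrolling is accounted for, and the function name is invented for the example.

```c
#include <assert.h>

/* Prefetches issued for one reference in the unrolled body: one for
   every PREFETCH_MOD-th copy of the original iteration, rounded up.  */
unsigned
prefetches_in_unrolled_body (unsigned unroll_factor, unsigned prefetch_mod)
{
  assert (unroll_factor > 0 && prefetch_mod > 0);
  return (unroll_factor + prefetch_mod - 1) / prefetch_mod;
}
```

With unroll_factor = 8 this yields 1 prefetch for a constant step reference
(prefetch_mod = 8) and 8 prefetches for a non-constant step reference
(prefetch_mod = 1), matching the counts described in (1).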


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-06-07 Thread changpeng dot fang at amd dot com


--- Comment #15 from changpeng dot fang at amd dot com  2010-06-07 18:30 
---
Created an attachment (id=20860)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20860&action=view)
Don't consider effect of unrolling in the computation of insn-to-prefetch ratio


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-06-07 Thread changpeng dot fang at amd dot com


--- Comment #16 from changpeng dot fang at amd dot com  2010-06-07 18:32 
---
Created an attachment (id=20861)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20861&action=view)
Limit non-constant step prefetching only to the innermost loops


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-06-07 Thread changpeng dot fang at amd dot com


--- Comment #17 from changpeng dot fang at amd dot com  2010-06-07 18:37 
---
(In reply to comment #15)
> Created an attachment (id=20860)
>  --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20860&action=view) [edit]
> Don't consider effect of unrolling in the computation of insn-to-prefetch
> ratio
> 

To compute the insn-to-prefetch ratio precisely, we may need to compute it
after schedule_prefetches, to know exactly how many prefetches are scheduled
(we also need to compute the exact number of insns in the unrolled body). For
now, I would like to disable my previous commit, which uses
(unroll_factor * insns) for the total insns in the unrolled body.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-06-07 Thread changpeng dot fang at amd dot com


--- Comment #19 from changpeng dot fang at amd dot com  2010-06-07 22:30 
---
Created an attachment (id=20862)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20862&action=view)
Account prefetch_mod and unroll_factor for the computation of the prefetch
count

Ooops. Attached a wrong patch previously. This one is what I have mentioned.


-- 

changpeng dot fang at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #20860|0                           |1
        is obsolete|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug tree-optimization/43529] G++ doesn't optimize away empty loop when index is a double

2010-06-04 Thread changpeng dot fang at amd dot com


--- Comment #2 from changpeng dot fang at amd dot com  2010-06-04 23:15 
---
Interesting! What's the difference between 17 and 18?

int main()
{
  double i;
  for(i=0; i<18; i+=1); /* gcc -O3, empty loop not removed */
}

int main()
{
  double i;
  for(i=0; i<17; i+=1); /* gcc -O3, empty loop removed */
}


-- 

changpeng dot fang at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |changpeng dot fang at amd
                   |                            |dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43529



[Bug tree-optimization/43529] G++ doesn't optimize away empty loop when index is a double

2010-06-04 Thread changpeng dot fang at amd dot com


--- Comment #3 from changpeng dot fang at amd dot com  2010-06-04 23:29 
---
(In reply to comment #2)
> Interesting! What's the difference between 17 and 18?
> 
> int main()
> {
>   double i;
>   for(i=0; i<18; i+=1); /* gcc -O3, empty loop not removed */
> }


The funny thing occurs in gcc 4, not gcc 6:

        .file   "empty.c"
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        addl    $1, %eax
        cmpl    $18, %eax
        jne     .L2
        rep
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu 4.4.1-4ubuntu9) 4.4.1"
        .section        .note.GNU-stack,"",@progbits


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43529



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-06-01 Thread changpeng dot fang at amd dot com


--- Comment #11 from changpeng dot fang at amd dot com  2010-06-01 17:40 
---
(In reply to comment #10)
> Created an attachment (id=20783)
>  --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20783&action=view) [edit]
> experimental patch to have separate values for min_insn_to_prefetch_ration
> 
> Changpeng,
> 
> thank you for the feedback.
> Can you confirm that the regression was introduced by a prefetch with an
> unknown step or is there still a bug in the calculation of the normal
> prefetches (e.g. by applying the first patch that disables non-constant steps)
> 
> Anyway, here is a patch that increases min_insn_to_prefetch_ratio for
> non-constant steps. Does that make a difference for tonto? Do you prefer other
> initial values?
> Thanks
> 
> Christian
> 
Hi, Christian:

For constant step prefetching only, tonto regressed by ~7%, and for const +
invariant step prefetching combined, it regressed by ~16%.

I should have mentioned earlier that non-constant step prefetching has improved
459.GemsFDTD by 4~5% on amd-linux64 systems, and that the tonto regression from
non-constant step prefetching should be fixable by recomputing the
prefetch count with the unroll_factor taken into account. However, I have found
a non-temporal store problem which can cause a 416.gamess degradation of ~50%.
I am not sure whether it is caused by non-constant step prefetching or not.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-06-01 Thread changpeng dot fang at amd dot com


--- Comment #13 from changpeng dot fang at amd dot com  2010-06-01 19:59 
---
(In reply to comment #12)
> Ok. So I will let you continue to look into that and wait for your results?
> 
> Do you have any feedback on separate.patch and its influence on performance?
> 

+   for (; groups; groups = groups->next)
+     for (ref = groups->refs; ref; ref = ref->next)
+       {
+         if (cst_and_fits_in_hwi (ref->group->step))
+           continue;
+         if (!ref->issue_prefetch_p)
+           continue;
+         insn_to_prefetch_ratio = (unroll_factor * ninsns) / prefetch_count;
+         if (insn_to_prefetch_ratio < MIN_INSN_TO_SPECULATIVE_PREFETCH)
+           {
+             ref->issue_prefetch_p = false;
+             if (dump_file && (dump_flags & TDF_DETAILS))
+               fprintf (dump_file,
+                        "Ignoring %p-- insn to prefetch ratio (%d) too small\n",
+                        (void *) ref, insn_to_prefetch_ratio);
+           }
+       }
+ }

The patch should fix the tonto regression caused by non-constant step
prefetching. It is just that you should move the computation and comparison
outside (before) the loop, and the debug dump after the loop.
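
A minimal sketch of the suggested restructuring, assuming a flat array of
references instead of the group lists. The struct, its fields, and the function
name are hypothetical; only the ratio formula comes from the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

struct ref_sketch
{
  int constant_step_p;   /* nonzero if the step is a compile-time constant */
  int issue_prefetch_p;  /* nonzero if a prefetch is planned for this ref */
};

/* Disables speculative (non-constant step) prefetches when the loop body
   is too small.  Returns the computed insn-to-prefetch ratio.  */
unsigned
prune_speculative_prefetches (struct ref_sketch *refs, size_t nrefs,
                              unsigned unroll_factor, unsigned ninsns,
                              unsigned prefetch_count, unsigned min_ratio)
{
  unsigned ratio;
  size_t i;

  assert (prefetch_count > 0);

  /* Loop-invariant: computed and compared once, before walking the refs.  */
  ratio = (unroll_factor * ninsns) / prefetch_count;
  if (ratio >= min_ratio)
    return ratio;

  for (i = 0; i < nrefs; i++)
    if (!refs[i].constant_step_p)
      refs[i].issue_prefetch_p = 0;

  return ratio;
}
```

The ratio does not depend on the reference being visited, so hoisting it avoids
recomputing the same value for every reference; a single dump line could then be
printed after the loop.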

I am just thinking that for such loops we should do nothing: no non-temporal
stores and no constant step prefetching, because nothing can be trusted.

I am doing some experiments and will let you know what I find. Thanks.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-05-28 Thread changpeng dot fang at amd dot com


--- Comment #6 from changpeng dot fang at amd dot com  2010-05-28 16:46 
---
(In reply to comment #4)
> Created an attachment (id=20767)
>  --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20767&action=view) [edit]
> Patch that makes loop invariant prefetches backend specfic
> 

Actually, I am the one who would like the invariant step prefetch to
be backend independent. However, the current implementation seems a bit
aggressive: the fundamental assumption of the implementation is that
the invariant step is big enough so that there is no spatial reuse
and we don't need to unroll the loop (prefetch_mod == 1). This assumption
may be OK for C code (or integer code), but may not be appropriate for
Fortran programs.
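
To show what the assumption costs when the step turns out to be small, the
sketch below computes how many consecutive iterations fall into one cache line
for a given byte step. The 64-byte block size and the function name are
assumptions for this example, not values read from the pass.

```c
#include <assert.h>

/* Assumed prefetch block (cache line) size in bytes.  */
#define PREFETCH_BLOCK 64

/* How many consecutive iterations share one cache line for a reference
   that advances by STEP bytes per iteration (STEP > 0).  A value of 1
   means every iteration touches a new line, i.e. no spatial reuse.  */
unsigned
iterations_per_cache_line (unsigned step)
{
  assert (step > 0);
  if (step >= PREFETCH_BLOCK)
    return 1;
  return PREFETCH_BLOCK / step;
}
```

If a loop-invariant step happens to be 8 bytes at run time (a unit-stride
double precision array, for instance), 8 iterations share a line, and issuing
one prefetch per iteration produces mostly redundant prefetches.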





-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-05-28 Thread changpeng dot fang at amd dot com


--- Comment #7 from changpeng dot fang at amd dot com  2010-05-28 16:56 
---
(In reply to comment #5)
> An alternative approach might be have different values for 
> prefetch-min-insn-to-mem-ratio and min-insn-to-prefetch-ratio
> depending on constant/non-constant step size.
> 
It may be a good idea to limit non-constant step prefetching to
big loops. This is because we are not very confident that the
reference will cause a cache miss, and we should limit the prefetches
generated. min-insn-to-prefetch-ratio may be a good parameter to
work on.

By the way, I am thinking that min-insn-to-prefetch-ratio should
be backend dependent. In a certain sense, this parameter implies
how many useless prefetches an architecture can tolerate.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-05-28 Thread changpeng dot fang at amd dot com


--- Comment #8 from changpeng dot fang at amd dot com  2010-05-28 18:30 
---
(In reply to comment #4)
> Created an attachment (id=20767)
>  --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20767&action=view) [edit]
> Patch that makes loop invariant prefetches backend specfic
> 
> Three observations:
> 
> 1. the patch had a bug which led to wrong calculation in some cases
> This commit should be applied to improve some other testcases:
> http://gcc.gnu.org/viewcvs?view=revision&revision=159816

Looks like this is a fix to the regressions. That is, the regressions are
actually caused by the wrong calculation. This bug could be considered fixed,
even though performance tuning may be necessary for non-constant step
prefetching. Thanks. 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-05-28 Thread changpeng dot fang at amd dot com


--- Comment #9 from changpeng dot fang at amd dot com  2010-05-28 18:36 
---
(In reply to comment #8)

> Looks like this is a fix to the regressions. That is, the regressions are
> actually caused by the wrong calculation. This bug could be considered fixed,
> even though performance tuning may be necessary for non-constant step
> prefetching. Thanks. 
> 

Oh, NO! After this patch, 465.tonto has a big regression (-16%), compared to
no prefetching. Note that prefetching originally caused a ~7% degradation on
465.tonto (before non-constant step prefetching).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] New: Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-05-27 Thread changpeng dot fang at amd dot com
Tests are on amd-linux64 system with -O3 -fprefetch-loop-arrays

Compare gcc-4.6-20100522.tar.bz2 to gcc-4.6-20100515.tar.bz2
459.GemsFDTD: -32.6%
434.zeusmp:   -13.6%

If I replace tree-ssa-loop-prefetch.c in gcc-4.6-20100522.tar.bz2 with the one
in gcc-4.6-20100515.tar.bz2, the regressions disappear.


-- 
   Summary: Big spec cpu2006 prefetch regressions on gcc 4.6 on x86
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-05-27 Thread changpeng dot fang at amd dot com


--- Comment #1 from changpeng dot fang at amd dot com  2010-05-27 20:49 
---
The regressions are most likely from the patch that added non-constant step
prefetching:

* From: Andreas Krebbel krebbel at linux dot vnet dot ibm dot com
* To: Christian Borntraeger borntraeger at de dot ibm dot com
* Cc: gcc-patches gcc-patches at gcc dot gnu dot org
* Date: Wed, 19 May 2010 12:40:51 +0200
* Subject: Re: [patch 4/4 v4] Allow loop prefetch code to speculatively
prefetch non constant steps



 * tree-ssa-loop-prefetch.c (mem_ref_group): Change step to tree.
 * tree-ssa-loop-prefetch.c (ar_data): Change step to tree.
 * tree-ssa-loop-prefetch.c (dump_mem_ref): Adopt debug code to
 handle a tree as step.  This also checks for a constant int vs.
 non-constant but loop-invariant steps.
 * tree-ssa-loop-prefetch.c (find_or_create_group): Change the sort
 algorithm to only consider steps that are constant ints.
 * tree-ssa-loop-prefetch.c (idx_analyze_ref): Adopt code to handle
 a tree instead of a HOST_WIDE_INT for step.
 * tree-ssa-loop-prefetch.c (gather_memory_references_ref): Handle
 tree instead of int and be prepared to see a NULL_TREE.
 * tree-ssa-loop-prefetch.c (prune_ref_by_self_reuse): Do not prune
 prefetches if the step cannot be calculated at compile time.
 * tree-ssa-loop-prefetch.c (prune_ref_by_group_reuse): Do not prune
 prefetches if the step cannot be calculated at compile time.
 * tree-ssa-loop-prefetch.c (issue_prefetch_ref): Issue prefetches
 for non-constant but loop-invariant steps.


Applied to mainline. Thanks!


-- 

changpeng dot fang at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |borntraeger at de dot ibm
                   |                            |dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-05-27 Thread changpeng dot fang at amd dot com


--- Comment #2 from changpeng dot fang at amd dot com  2010-05-27 20:55 
---
To me, non-constant step prefetching does not seem to fit into the existing
prefetching framework. A non-constant stride prevents any reuse analysis, and
thus the prefetching is done more or less blindly.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug middle-end/44297] Big spec cpu2006 prefetch regressions on gcc 4.6 on x86

2010-05-27 Thread changpeng dot fang at amd dot com


--- Comment #3 from changpeng dot fang at amd dot com  2010-05-27 23:51 
---
I did a quick look at 434.zeusmp and found that prefetching for the following
simple loop is responsible: 

linpck.f: 131:
c
c        code for increment not equal to 1
c
      ix = 1
      smax = abs(sx(1))
      ix = ix + incx
      do 10 i = 2,n
         if(abs(sx(ix)).le.smax) go to 5
         isamax = i
         smax = abs(sx(ix))
    5    ix = ix + incx
   10 continue

Prefetching for this loop seems too aggressive with an unknown incx. It is not
predictable which sx(ix) will cause a cache miss.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44297



[Bug tree-optimization/43423] gcc should vectorize this loop through if-conversion

2010-05-24 Thread changpeng dot fang at amd dot com


--- Comment #9 from changpeng dot fang at amd dot com  2010-05-24 22:47 
---
(In reply to comment #8)
> -fgraphite-identity does iteration splitting for this case.

Do you know why it could not be vectorized after iteration 
range splitting?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43423



[Bug middle-end/44185] [4.6 regression] New prefetch test failures

2010-05-21 Thread changpeng dot fang at amd dot com


--- Comment #6 from changpeng dot fang at amd dot com  2010-05-21 21:36 
---
(In reply to comment #5)
> The fix introduced:
> 
> FAIL: gcc.dg/tree-ssa/prefetch-7.c scan-assembler-times movnti 18
> FAIL: gcc.dg/tree-ssa/prefetch-7.c scan-tree-dump-times optimized ={nt} 18
> 
> on Linux/ia32.
> 

It seems the unrolling is quite different on different architectures. The count
of movnti in the assembly code depends on the unroll_factor.

I would propose removing the movnti check on the assembly code. The dump
in aprefetch shows that two non-temporal stores are generated, and this is
enough.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44185



[Bug middle-end/44185] [4.6 regression] New prefetch test failures

2010-05-18 Thread changpeng dot fang at amd dot com


--- Comment #2 from changpeng dot fang at amd dot com  2010-05-18 19:39 
---
I have a patch to fix the test cases:
http://gcc.gnu.org/ml/gcc-patches/2010-05/msg01359.html

For prefetch-6.c, patch http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00567.html
applies the insn-to-prefetch ratio heuristic to loops with known trip count,
and thus filters one prefetch out.  Adding --param min-insn-to-prefetch-ratio=6
(default is 10) fixes the problem.

For prefetch-7.c, patch http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00566.html
does not generate prefetches if the loop is far from being unrolled as much as
prefetching requires.  In this case, prefetching requires the loop to be
unrolled 16 times, but the loop is not unrolled due to the parameter
constraint.  We remove --param max-unrolled-insns=1 to allow unrolling and thus
the generation of prefetches.  The movnti count is also adjusted.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44185



[Bug tree-optimization/43425] gcc should vectorize this loop by substitution

2010-05-07 Thread changpeng dot fang at amd dot com


--- Comment #3 from changpeng dot fang at amd dot com  2010-05-07 21:33 
---
I just found that the test case is the same as (or similar to) bug 35229.
The subject of this bug is wrong. Scalar expansion is not appropriate
for this case.

Actually the loop can be transformed to:

void foo(int n)
{
  int i;
  a[0] = b[0]; /* + t if t is live before this point */
  for(i=1; i<n; i++)
    {
      a[i] = b[i] + b[i-1];
    }
  /* t = b[n-1]; if t is live after this point */
}

Then this loop can be vectorized.

In open64, this optimization is called forward (backward) substitution, i.e.
substitute t with b[i-1].

I am not clear whether bug 35229 addresses the same issue. Maybe we should
close one of them.


-- 

changpeng dot fang at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|enhance scalar expansion to |gcc should vectorize this
                   |vectorize this loop         |loop by substitution


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43425



[Bug tree-optimization/43423] gcc should vectorize this loop through if-conversion

2010-05-07 Thread changpeng dot fang at amd dot com


--- Comment #7 from changpeng dot fang at amd dot com  2010-05-07 21:41 
---
(In reply to comment #4)
> (In reply to comment #3)
> > Subject: Re:  gcc should vectorize this loop
> >         through iteration range splitting
> > You mean that the problem is the if-conversion of the stores
> > a[i] = ...
> 
> If we rewrite the code like:
> int a[100], b[100], c[100];
> 
> void foo(int n, int mid)
> {
>   int i;
>   for(i=0; i<n; i++)
>     {
>       int t;
>       int ai = a[i], bi = b[i], ci = c[i];
>       if (i < mid)
>         t = ai + bi;
>       else
>         t = ai + ci;
>       a[i] = t;
>     }
> }
> 
> --- CUT ---
> This gets vectorized as we produce an if-cvt first.
> 

There are both correctness and performance issues in the rewritten code.
It loads both b[i] and c[i] in every iteration, whereas in the original loop
one of the two loads is not executed.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43423



[Bug tree-optimization/43543] New: Reorder the statements in the loop can vectorize it

2010-03-26 Thread changpeng dot fang at amd dot com
int a[100], b[100], c[100], d[100];

void foo ()
{
  int i;
  for(i=1; i < 99; i++)
  {
    a[i] = b[i-1] + c[i];
    b[i] = b[i+1] + d[i];
  }
}

gcc -O3 -ffast-math -ftree-vectorizer-verbose=2 -c foo.c

foo.c:6: note: not vectorized, possible dependence between data-refs
b[D.2728_3] and b[i_17]
foo.c:3: note: vectorized 0 loops in function.

However, if we reorder the two statements in the loop, then it can be
vectorized. open64 can do this reordering.
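
For illustration, here is a sketch of the reordering meant above, checked
against the reported loop. The arrays are passed as parameters so the two
versions can run on separate copies; the function names and the test data are
made up for this example. Swapping the statements is safe because a[i] reads
b[i-1], which the current iteration does not write.

```c
#include <assert.h>
#include <string.h>

enum { N = 100 };

/* Loop as reported: the read of b[i-1] comes before the write of b[i].  */
static void foo_original(int *a, int *b, const int *c, const int *d)
{
  int i;
  for (i = 1; i < N - 1; i++)
    {
      a[i] = b[i-1] + c[i];
      b[i] = b[i+1] + d[i];
    }
}

/* Same loop with the two statements swapped.  */
static void foo_reordered(int *a, int *b, const int *c, const int *d)
{
  int i;
  for (i = 1; i < N - 1; i++)
    {
      b[i] = b[i+1] + d[i];
      a[i] = b[i-1] + c[i];
    }
}

/* Run both versions on identical inputs and compare every element.  */
static void check_reorder(void)
{
  int a1[N], b1[N], a2[N], b2[N], c[N], d[N];
  int i;
  for (i = 0; i < N; i++)
    {
      a1[i] = a2[i] = 0;
      b1[i] = b2[i] = 3 * i + 1;
      c[i] = 7 - i;
      d[i] = i * i;
    }
  foo_original(a1, b1, c, d);
  foo_reordered(a2, b2, c, d);
  assert(memcmp(a1, a2, sizeof a1) == 0);
  assert(memcmp(b1, b2, sizeof b1) == 0);
}
```

Calling check_reorder() from main is enough to exercise it. Whether a given
gcc version vectorizes the reordered form is not something this sketch shows.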


-- 
   Summary: Reorder the statements in the loop can vectorize it
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43543



[Bug tree-optimization/42906] [4.5 Regression] Empty loop not removed

2010-03-18 Thread changpeng dot fang at amd dot com


--- Comment #20 from changpeng dot fang at amd dot com  2010-03-18 17:24 
---
(In reply to comment #19)
> Splitting critical edges for CDDCE will probably also solve this problem.
> 
> Richard.
> 

Yes, splitting critical edges is an enhancement to CDDCE and can solve this
problem. There are two ways to do this: (1) add pass_split_crit_edges
before each pass_cd_dce, or (2) build split_crit_edges into cddce as an
initialization step. What do you think? Thanks.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42906



[Bug c/43422] New: reversed loop is not vectorized

2010-03-18 Thread changpeng dot fang at amd dot com
gcc could not vectorize this simple reversed loop:
int a[100], b[100];
void foo(int n)
{
  int i;
  for(i=n-2; i>=0; i--)
    a[i+1] = a[i] + b[i];
}

chf...@pathscale:~/gcc$ gcc -O3 -ftree-vectorizer-verbose=2 -c foo.c
foo.c:6: note: not vectorized: complicated access pattern.
foo.c:3: note: vectorized 0 loops in function.

open64 can vectorize this loop:
chf...@pathscale:~/gcc$ opencc -O3 -LNO:simd_verbose=on -c foo.c
(foo.c:0) LOOP WAS VECTORIZED.


-- 
   Summary: reversed loop is not vectorized
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43422



[Bug c/43423] New: gcc should vectorize this loop through iteration range splitting

2010-03-18 Thread changpeng dot fang at amd dot com
chf...@pathscale:~/gcc$ cat foo.c
int a[100], b[100], c[100];

void foo(int n, int mid)
{
  int i;
  for(i=0; i<n; i++)
    {
      if (i < mid)
        a[i] = a[i] + b[i];
      else
        a[i] = a[i] + c[i];
    }
}


chf...@pathscale:~/gcc$ gcc -O3 -ftree-vectorizer-verbose=7 -c foo.c

foo.c:6: note: not vectorized: control flow in loop.
foo.c:3: note: vectorized 0 loops in function.

This loop can be vectorized by icc.

For this case, I would expect to see two loops with iteration range
of [0, mid) and [mid, n). Then both loops can be vectorized.
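
A hand-written sketch of the split I have in mind, compared against the
reported loop. The arrays are passed as parameters so both versions can run on
separate copies, and mid is clamped into [0, n], which is an assumption of
this example needed to keep both ranges valid. Function names are
hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { N = 100 };

/* Loop as reported: the branch on i < mid sits inside the loop body.  */
static void foo_branchy(int *a, const int *b, const int *c, int n, int mid)
{
  int i;
  for (i = 0; i < n; i++)
    {
      if (i < mid)
        a[i] = a[i] + b[i];
      else
        a[i] = a[i] + c[i];
    }
}

/* Hand-split version: [0, split) uses b, [split, n) uses c.
   split is mid clamped into [0, n] so neither loop leaves the range.  */
static void foo_split(int *a, const int *b, const int *c, int n, int mid)
{
  int i;
  int split = mid < 0 ? 0 : (mid > n ? n : mid);
  for (i = 0; i < split; i++)
    a[i] = a[i] + b[i];
  for (i = split; i < n; i++)
    a[i] = a[i] + c[i];
}

/* Compare both versions for one (n, mid) pair; n must not exceed N.  */
static void check_split(int n, int mid)
{
  int a1[N], a2[N], b[N], c[N];
  int i;
  for (i = 0; i < N; i++)
    {
      a1[i] = a2[i] = i;
      b[i] = 2 * i + 1;
      c[i] = 1000 - i;
    }
  foo_branchy(a1, b, c, n, mid);
  foo_split(a2, b, c, n, mid);
  assert(memcmp(a1, a2, sizeof a1) == 0);
}
```

Each of the two loops in foo_split is branch-free, which is the property the
vectorizer needs here.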

I am not sure which pass in gcc should do this iteration range splitting.


-- 
   Summary: gcc should vectorize this loop through iteration range
splitting
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43423



[Bug c/43425] New: enhance scalar expansion to vectorize this loop

2010-03-18 Thread changpeng dot fang at amd dot com
chf...@pathscale:~/gcc$ cat foo.c
int a[100], b[100];

void foo(int n, int mid)
{
  int i, t = 0;
  for(i=0; i<n; i++)
    {
      a[i] = b[i] + t;
      t = b[i];
    }
}
chf...@pathscale:~/gcc$ gcc -O3 -ftree-vectorizer-verbose=7 -c foo.c

foo.c:6: note: not vectorized: unsupported use in stmt.
foo.c:3: note: vectorized 0 loops in function.

This needs scalar expansion of t into an array to carry the values across
iterations.


-- 
   Summary: enhance scalar expansion to vectorize this loop
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43425



[Bug c/43427] New: The loop is not interchanged and thus could not be vectorized.

2010-03-18 Thread changpeng dot fang at amd dot com
chf...@pathscale:~/gcc$ cat foo.c
float a[100][100], b[100][100];

void foo(int n)
{
  int i, j;
  for(j=0; j<n; j++)
    for(i=0; i<n; i++)
      a[i][j] = a[i][j] + b[i][j];
}
chf...@pathscale:~/gcc$ gcc -O3 -ftree-vectorizer-verbose=2 -c foo.c

foo.c:6: note: not vectorized: can't create epilog loop 2.
foo.c:7: note: not vectorized: complicated access pattern.
foo.c:3: note: vectorized 0 loops in function.

Information from open64:
chf...@pathscale:~/gcc$ opencc -O3 -LNO:simd_verbose=on -c foo.c
(foo.c:0) LOOP WAS VECTORIZED.
(foo.c:0) LOOP WAS VECTORIZED.
chf...@pathscale:~/gcc$ opencc -O3 -LNO:simd_verbose=on:interchange=0 -c foo.c
(foo.c:0) Non-contiguous array a reference exists. Loop was not vectorized.
(foo.c:0) Non-contiguous array a reference exists. Loop was not vectorized.


Graphite may be able to do this basic loop interchange.
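
For reference, a sketch of the interchange done by hand, checked against the
reported nest. The function names and test data are made up for this example.
Every element is updated exactly once and independently, so the two loop
orders give the same result.

```c
#include <assert.h>
#include <string.h>

enum { N = 100 };

static float a1[N][N], a2[N][N], b[N][N];

/* Nest as reported: the inner loop walks down a column (stride of N floats).  */
static void foo_column_inner(float (*a)[N], int n)
{
  int i, j;
  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
      a[i][j] = a[i][j] + b[i][j];
}

/* Interchanged nest: the inner loop walks along a row (unit stride).  */
static void foo_row_inner(float (*a)[N], int n)
{
  int i, j;
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a[i][j] = a[i][j] + b[i][j];
}

/* Fill the arrays with small integer values, run both nests, compare.  */
static void check_interchange(int n)
{
  int i, j;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      {
        a1[i][j] = a2[i][j] = (float) (i - j);
        b[i][j] = (float) (i + 2 * j);
      }
  foo_column_inner(a1, n);
  foo_row_inner(a2, n);
  assert(memcmp(a1, a2, sizeof a1) == 0);
}
```

With the row walk innermost, the accesses to a and b are contiguous, which is
what the "complicated access pattern" note above is complaining about.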


-- 
   Summary: The loop is not interchanged and thus could not be
vectorized.
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43427



[Bug tree-optimization/43428] New: vectorizer should invoke loop distribution to partially vectorize this loop

2010-03-18 Thread changpeng dot fang at amd dot com
chf...@pathscale:~/gcc$ cat foo.c
float a[100], b[100], c[100];

void foo(int n)
{
  int i;
  for(i=1; i<n; i++)
  {
     a[i] = a[i] + c[i];
     b[i] = b[i-1] + a[i];
  }
}
chf...@pathscale:~/gcc$ gcc -O3 -ftree-vectorizer-verbose=2
-ftree-loop-distribution -c foo.c

foo.c:6: note: not vectorized, possible dependence between data-refs
b[D.2730_7] and b[i_17]
foo.c:3: note: vectorized 0 loops in function.

Loop distribution by itself may not find this distribution profitable.
However, partially vectorizing this loop may give a big gain. ICC can partially
vectorize it.
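
A sketch of the distribution I mean, written by hand and compared against the
reported loop. Function names and test data are hypothetical; the arrays are
passed as parameters so both versions can run on separate copies.

```c
#include <assert.h>
#include <string.h>

enum { N = 100 };

/* Loop as reported: the recurrence on b blocks vectorizing the whole body.  */
static void foo_fused(float *a, float *b, const float *c, int n)
{
  int i;
  for (i = 1; i < n; i++)
    {
      a[i] = a[i] + c[i];
      b[i] = b[i-1] + a[i];
    }
}

/* Distributed by hand into two loops.  */
static void foo_distributed(float *a, float *b, const float *c, int n)
{
  int i;
  /* No loop-carried dependence: a candidate for vectorization.  */
  for (i = 1; i < n; i++)
    a[i] = a[i] + c[i];
  /* Recurrence on b: stays scalar.  */
  for (i = 1; i < n; i++)
    b[i] = b[i-1] + a[i];
}

/* Small integer-valued data keeps every float operation exact.  */
static void check_distribution(void)
{
  float a1[N], b1[N], a2[N], b2[N], c[N];
  int i;
  for (i = 0; i < N; i++)
    {
      a1[i] = a2[i] = (float) i;
      b1[i] = b2[i] = 1.0f;
      c[i] = (float) (2 * i);
    }
  foo_fused(a1, b1, c, N);
  foo_distributed(a2, b2, c, N);
  assert(memcmp(a1, a2, sizeof a1) == 0);
  assert(memcmp(b1, b2, sizeof b1) == 0);
}
```

The split is legal because nothing in the a statement depends on b, so all of
the a updates can run before the b recurrence starts.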


-- 
   Summary: vectorizer should invoke loop distribution to partially
vectorize this loop
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43428



[Bug tree-optimization/32824] Missed reduction vectorizer after store to global is LIM'd

2010-03-17 Thread changpeng dot fang at amd dot com


--- Comment #8 from changpeng dot fang at amd dot com  2010-03-17 21:22 
---
Created an attachment (id=20133)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20133&action=view)
patch with the testcase


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32824



[Bug tree-optimization/42906] [4.5 Regression] Empty loop not removed

2010-03-16 Thread changpeng dot fang at amd dot com


--- Comment #17 from changpeng dot fang at amd dot com  2010-03-17 00:18 
---
(In reply to comment #8)
> And
> 
> int foo (int b, int j)
> {
>   if (b)
>     {
>       int i;
>       for (i = 0; i<1000; ++i)
>         ;
>       j = b;
>     }
>   return j;
> }
> 

With j=b, b is not folded as a phi argument: 

<bb 5>:
  # i_2 = PHI <0(3), i_6(4)>
  if (i_2 <= 999)
    goto <bb 4>;
  else
    goto <bb 6>;

<bb 6>:
  j_7 = b_3(D);

<bb 7>:
  # j_1 = PHI <j_4(D)(2), j_7(6)>

However, if j=0, it is:
<bb 6>:
  j_7 = 0;

<bb 7>:
  # j_1 = PHI <j_4(D)(2), 0(6)>
  j_8 = j_1;
  return j_8;

Then copy propagation will remove j_7 = 0 (and thus bb 6) because it has no
user.

So, one possible solution is not to remove trivial dead code in the
copy_propagation pass. Any dce pass will remove such code.

Of course, if we follow Steven's suggestion not to use constants as phi
arguments, j_7=0 will not be removed by constant propagation, and we are all
fine.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42906



[Bug tree-optimization/42906] [4.5 Regression] Empty loop not removed

2010-03-16 Thread changpeng dot fang at amd dot com


--- Comment #18 from changpeng dot fang at amd dot com  2010-03-17 00:22 
---
(In reply to comment #16)
> > In this case, the loop itself is empty and we can replace every use of the
> > phi with n (exit value of the iv).
> 
> I don't think that is done by remove_empty_loop anyways and it is already done
> by sccp (Propagation of constants using scev) which is enabled at -O1.
> 

But n is not a constant. Of course we can modify the pass to compute the exit
value of iv (integer overflow may be an issue).



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42906



[Bug middle-end/43238] GCC 4.5 ICE segfault on any -O flag

2010-03-02 Thread changpeng dot fang at amd dot com


--- Comment #4 from changpeng dot fang at amd dot com  2010-03-02 21:56 
---
I have verified that the patch proposed in bug 43209 did
fix this problem. I am going to check in the change soon.
Thanks.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43238



[Bug tree-optimization/43209] [4.5 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:5238

2010-03-01 Thread changpeng dot fang at amd dot com


--- Comment #5 from changpeng dot fang at amd dot com  2010-03-01 18:02 
---
I have a fix for this problem. We should not decrease the cost if the cost is
infinite.

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 74dadf7..9accda9 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -4124,7 +4124,11 @@ determine_use_iv_cost_condition (struct ivopts_data *data,
   if (integer_zerop (*bound_cst)
       && (operand_equal_p (*control_var, cand->var_after, 0)
           || operand_equal_p (*control_var, cand->var_before, 0)))
-    elim_cost.cost -= 1;
+    {
+      /* Should not decrease the cost if it is infinite */
+      if (!infinite_cost_p (elim_cost))
+        elim_cost.cost -= 1;
+    }


-- 

changpeng dot fang at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |changpeng dot fang at amd
                   |                            |dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43209



[Bug middle-end/43182] GCC does not pull out a[0] from loop that changes a[i] for i:[1,n]

2010-02-26 Thread changpeng dot fang at amd dot com


--- Comment #4 from changpeng dot fang at amd dot com  2010-02-26 18:53 
---
Here is another similar, but more general, case. We know that a(j) and a(i)
never access the same memory location. Intel ifort can vectorize this
triangular loop:

      do 10 j = 1,n
         do 20 i = j+1, n
            a(i) = a(i) - aa(i,j) * a(j)
   20    continue
   10 continue


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43182



[Bug middle-end/43182] GCC does not pull out a[0] from loop that changes a[i] for i:[1,n]

2010-02-26 Thread changpeng dot fang at amd dot com


--- Comment #6 from changpeng dot fang at amd dot com  2010-02-26 19:06 
---

 
> Actually it is a totally different case.  Please file a new bug with that
> case; though there might already be a bug about that one.
 

I could not see the difference, even though j is not a compile-time constant
(it is invariant in the innermost loop). I can say:

GCC does not pull out a[j] from loop that changes a[i] for i:[j+1,n]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43182



[Bug middle-end/43182] New: gcc could not vectorize this simple loop (un-handled data-ref)

2010-02-25 Thread changpeng dot fang at amd dot com
gcc 4.5 can not vectorize this simple loop:

void foo(int a[], int n) {
 int i;
 for(i=1; i < n; i++)
  a[i] = a[0];
}

gcc -O3 -fdump-tree-vect-all -c foo.c shows:
foo.c:3: note: not vectorized: unhandled data-ref 
foo.c:3: note: bad data references.
foo.c:1: note: vectorized 0 loops in function.

It seems gcc gets confused at a[0] and gives up vectorization. There
is no dependence in this loop, and we should teach gcc to handle a[0]
to vectorize it.
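
To make the point concrete, here is a sketch with the load of a[0] hoisted by
hand, checked against the reported loop. The function names are made up for
this example, and the early return for n <= 1 is an addition of mine so the
hoisted version never reads a[0] when the reported loop would not.

```c
#include <assert.h>
#include <string.h>

/* Loop as reported: a[0] is re-read in every iteration.  */
static void foo_reported(int a[], int n)
{
  int i;
  for (i = 1; i < n; i++)
    a[i] = a[0];
}

/* Hand-hoisted version: the loop only writes a[1..n-1], never a[0].  */
static void foo_hoisted(int a[], int n)
{
  int i, a0;
  if (n <= 1)
    return;          /* the reported loop would not touch a[] at all */
  a0 = a[0];
  for (i = 1; i < n; i++)
    a[i] = a0;
}

/* Run both versions on identical data and compare.  */
static void check_hoist(void)
{
  int x[8] = { 42, 1, 2, 3, 4, 5, 6, 7 };
  int y[8] = { 42, 1, 2, 3, 4, 5, 6, 7 };
  foo_reported(x, 8);
  foo_hoisted(y, 8);
  assert(memcmp(x, y, sizeof x) == 0);
}
```

After hoisting, the loop body is a plain store of a loop-invariant value.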


-- 
   Summary: gcc could not vectorize this simple loop (un-handled
data-ref)
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43182



[Bug middle-end/43184] New: gcc could not vectorize floating point reduction statements

2010-02-25 Thread changpeng dot fang at amd dot com
gcc 4.5 could not vectorize floating point reductions.
float sum(float a[], int n) {
 int i;
 float total=0.0;
 for(i=0; i < n; i++)
  total += a[i];
 return total;
}

gcc -O3 -fdump-tree-vect-all shows:
foo.c:4: note: Unsupported pattern.
foo.c:4: note: not vectorized: unsupported use in stmt.
foo.c:4: note: unexpected pattern.
foo.c:1: note: vectorized 0 loops in function.

I have verified that gcc can vectorize integer reduction, but not float and
double.


-- 
   Summary: gcc could not vectorize floating point reduction
statements
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43184



[Bug middle-end/43184] gcc could not vectorize floating point reduction statements

2010-02-25 Thread changpeng dot fang at amd dot com


--- Comment #2 from changpeng dot fang at amd dot com  2010-02-26 00:28 
---
Subject: RE:  gcc could not vectorize floating point
 reduction statements

Thanks for pointing this out. Actually I am working on a Fortran program and
found the reduction statement there.  The Fortran code cannot be vectorized
even with -ffast-math.
Do you think this is a problem in the Fortran front end? Thanks, -- Changpeng

c%3.1
      subroutine s311 (ntimes,ld,n,ctime,dtime,a,b,c,d,e,aa,bb,cc)
c
c     reductions
c     sum reduction
c
      integer ntimes, ld, n, i, nl
      double precision a(n), b(n), c(n), d(n), e(n), aa(ld,n),
     +                 bb(ld,n), cc(ld,n)
      double precision chksum, sum
      real t1, t2, second, ctime, dtime

      call init(ld,n,a,b,c,d,e,aa,bb,cc,'s311 ')
      t1 = second()
      do 1 nl = 1,ntimes
      sum = 0.d0
      do 10 i = 1,n
         sum = sum + a(i)
  10  continue
      call dummy(ld,n,a,b,c,d,e,aa,bb,cc,sum)
  1   continue
      t2 = second() - t1 - ctime - ( dtime * float(ntimes) )
      chksum = sum
      call check (chksum,ntimes*n,n,t2,'s311 ')
      return
      end




From: pinskia at gcc dot gnu dot org [gcc-bugzi...@gcc.gnu.org]
Sent: Thursday, February 25, 2010 5:57 PM
To: Fang, Changpeng
Subject: [Bug middle-end/43184] gcc could not vectorize floating point
reduction statements

--- Comment #1 from pinskia at gcc dot gnu dot org  2010-02-25 23:57
---
> gcc 4.5 could not vectorize floating point reductions.

Yes it can; add -ffast-math.  floating point reductions need -ffast-math as it
can change the results in some cases (negative zero and I think clamping cases
too).
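
To illustrate the reassociation issue, here is a rough sketch of what a
4-lane vector reduction computes, next to the in-order sum from the report.
The lane count of 4 and the function names are assumptions of this example,
not what any particular gcc version emits.

```c
#include <assert.h>

/* Strict left-to-right sum, as written in the report.  */
static float sum_in_order(const float a[], int n)
{
  int i;
  float total = 0.0f;
  for (i = 0; i < n; i++)
    total += a[i];
  return total;
}

/* Four independent partial sums, combined at the end; the additions
   are reassociated compared with sum_in_order.  */
static float sum_four_lanes(const float a[], int n)
{
  float lane[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
  float total;
  int i, k;
  for (i = 0; i + 4 <= n; i += 4)
    for (k = 0; k < 4; k++)
      lane[k] += a[i + k];
  total = (lane[0] + lane[1]) + (lane[2] + lane[3]);
  for (; i < n; i++)   /* scalar tail for the leftover elements */
    total += a[i];
  return total;
}

/* With small integer values every addition is exact, so both agree.  */
static void check_sums(void)
{
  static const float small[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
  assert(sum_in_order(small, 10) == 55.0f);
  assert(sum_four_lanes(small, 10) == 55.0f);
}
```

The two functions agree whenever every addition is exact, but they may return
different values for inputs such as { 1e8f, 1.0f, -1e8f, 1.0f }, where the
order of the additions decides which small terms are rounded away. That is
why the transformation needs -ffast-math.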


--

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |INVALID


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43184



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43184



[Bug tree-optimization/42906] [4.5 Regression] Empty loop not removed

2010-02-16 Thread changpeng dot fang at amd dot com


--- Comment #15 from changpeng dot fang at amd dot com  2010-02-16 19:54 
---
Hello,
I am not sure whether CD-DCE can fully replace remove_empty_loop. However,
I would prefer to keep the remove_empty_loop pass. There are two reasons for
this proposal:
(1) remove_empty_loop ran at -O1 and above, but CD-DCE runs only at -O2 and
above.
(2) remove_empty_loop can be extended to handle other cases which CD-DCE is not
able to handle:

for(i=0; i<n; i++);
j = i;
In this case, the loop itself is empty and we can replace every use of the
phi with n (exit value of the iv).
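
A minimal sketch of the replacement described in (2), assuming a signed int
counter that starts at 0; the function names are made up for illustration.
Note that the exit value equals n only when the loop body runs at least once;
for n <= 0 the counter stays at 0.

```c
#include <assert.h>

/* Empty loop as in the comment; only the final value of i is used.  */
static int exit_value_loop(int n)
{
  int i, j;
  for (i = 0; i < n; i++)
    ;
  j = i;
  return j;
}

/* Closed form of the exit value: n if the loop runs, otherwise 0.  */
static int exit_value_closed_form(int n)
{
  return n > 0 ? n : 0;
}

/* Compare the loop and the closed form over a small range of n.  */
static void check_exit_value(void)
{
  int n;
  for (n = -5; n <= 1000; n++)
    assert(exit_value_loop(n) == exit_value_closed_form(n));
}
```

Once every use of i after the loop is rewritten this way, the loop has no
remaining users and can be deleted.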

What do you think about this (put back the empty loop removal code)? Thanks,


-- 

changpeng dot fang at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |cfang at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42906