[Bug tree-optimization/63185] Improve DSE with branches

2018-05-18 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #12 from Marc Glisse  ---
(In reply to Richard Biener from comment #11)
> I guess you meant (notice the bogus memset size above):

True. And while it shouldn't make a difference in checking if the stores to c
are dead, it could (but doesn't, symbolic bounds are not so easy to handle)
make a difference for the reads since we could notice that the reads from a and
b are necessarily 0 (thus we write 0 to c[i], thus this loop is a memset, etc).

[Bug tree-optimization/63185] Improve DSE with branches

2018-05-18 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #11 from Richard Biener  ---
(In reply to Marc Glisse from comment #5)
> Another example:
> 
> #include 
> void f(){
>   const int n=1<<14;
>   double a[n];
>   double b[n];
>   double c[n];
>   __builtin_memset(a,0,n);
>   __builtin_memset(b,0,n);
>   __builtin_memset(c,0,n);
>   for(int i=0;i c[i]=a[i]+b[i];
>   time(0);
> }

I guess you meant (notice the bogus memset size above):

#include 
void f(){
const int n=1<<14;
double a[n];
double b[n];
double c[n];
__builtin_memset(a,0,n*sizeof(double));
__builtin_memset(b,0,n*sizeof(double));
__builtin_memset(c,0,n*sizeof(double));
for(int i=0;i If I make n a constant (using #define), DCE knows that c is not
> ref_may_be_aliased (that test could probably be weakened) and removes almost
> everything. We don't replace alloca_with_align with an array because the
> size is larger than 256 (with a more limited scope the limit would even be
> 25...). DSE is likely confused by the loop. And PRE and others don't know
> that a[i] is always 0. (llvm removes everything but the final call)

[Bug tree-optimization/63185] Improve DSE with branches

2018-05-18 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #10 from Richard Biener  ---
Created attachment 44146
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44146=edit
testresults

Related I checked

Index: gcc/passes.def
===
--- gcc/passes.def  (revision 260317)
+++ gcc/passes.def  (working copy)
@@ -209,8 +209,6 @@ along with GCC; see the file COPYING3.
   NEXT_PASS (pass_return_slot);
   NEXT_PASS (pass_fre);
   NEXT_PASS (pass_merge_phi);
-  NEXT_PASS (pass_thread_jumps);
-  NEXT_PASS (pass_vrp, true /* warn_array_bounds_p */);
   NEXT_PASS (pass_chkp_opt);
   NEXT_PASS (pass_dce);
   NEXT_PASS (pass_stdarg);
@@ -224,6 +222,7 @@ along with GCC; see the file COPYING3.
   NEXT_PASS (pass_ch);
   NEXT_PASS (pass_lower_complex);
   NEXT_PASS (pass_sra);
+  NEXT_PASS (pass_vrp, true /* warn_array_bounds_p */);
   /* The dom pass will also resolve all __builtin_constant_p calls
  that are still there to 0.  This has to be done after some
 propagations have already run, but before some more dead code

which as quite some regressions (attached).  Ignore any auto-fdo fails.
Will check w/o the removal of pass_thread_jumps.

[Bug tree-optimization/63185] Improve DSE with branches

2018-05-17 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #9 from Richard Biener  ---
So the following incomplete patch solves the value-numbering issue in FRE2.

diff --git a/gcc/tree-dfa.c b/gcc/tree-dfa.c
index a121b880bb0..745d60cbda5 100644
--- a/gcc/tree-dfa.c
+++ b/gcc/tree-dfa.c
@@ -529,6 +529,31 @@ get_ref_base_and_extent (tree exp, poly_int64_pod
*poffset,
/* Remember that we have seen an array ref with a variable
   index.  */
seen_variable_array_ref = true;
+
+   wide_int min, max;
+   if (TREE_CODE (index) == SSA_NAME
+   && integer_zerop (array_ref_low_bound (exp))
+   && (unit_size = array_ref_element_size (exp),
+   TREE_CODE (unit_size) == INTEGER_CST)
+   && get_range_info (index, , ) == VR_RANGE
+   && wi::gts_p (max, 0))
+ {
+   poly_offset_int rmaxsize;
+   rmaxsize = offset_int::from (max, UNSIGNED) * wi::to_offset
(unit_size);
+   if (known_size_p (maxsize))
+ {
+   if (known_lt (rmaxsize, maxsize))
+ {
+   maxsize = rmaxsize;
+   seen_variable_array_ref = false;
+   }
+ }
+   else
+ {
+   maxsize = rmaxsize;
+   seen_variable_array_ref = false;
+ }
+ }
  }
  }
  break;
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 1463c1d4116..dad4e4fecdf 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -1960,19 +1960,46 @@ vn_reference_lookup_3 (ao_ref *ref, tree vuse, void
*vr_,
   && gimple_call_builtin_p (def_stmt, BUILT_IN_MEMSET)
   && integer_zerop (gimple_call_arg (def_stmt, 1))
   && poly_int_tree_p (gimple_call_arg (def_stmt, 2))
-  && TREE_CODE (gimple_call_arg (def_stmt, 0)) == ADDR_EXPR)
+  && (TREE_CODE (gimple_call_arg (def_stmt, 0)) == ADDR_EXPR
+ || TREE_CODE (gimple_call_arg (def_stmt, 0)) == SSA_NAME))
 {
-  tree ref2 = TREE_OPERAND (gimple_call_arg (def_stmt, 0), 0);
   tree base2;
   poly_int64 offset2, size2, maxsize2;
   bool reverse;
-  base2 = get_ref_base_and_extent (ref2, , , ,
-  );
+  tree ref2 = gimple_call_arg (def_stmt, 0);
+  if (TREE_CODE (ref2) == SSA_NAME)
+   {
+ ref2 = SSA_VAL (ref2);
+ if (TREE_CODE (ref2) == SSA_NAME
+ && (TREE_CODE (base) != MEM_REF
+ || TREE_OPERAND (base, 0) != ref2))
+   {
+ gimple *def_stmt = SSA_NAME_DEF_STMT (ref2);
+ if (gimple_assign_single_p (def_stmt)
+ && gimple_assign_rhs_code (def_stmt) == ADDR_EXPR)
+   ref2 = gimple_assign_rhs1 (def_stmt);
+   }
+   }
+  if (TREE_CODE (ref2) == ADDR_EXPR)
+   {
+ ref2 = TREE_OPERAND (ref2, 0);
+ base2 = get_ref_base_and_extent (ref2, , , ,
+  );
+ if (!known_size_p (maxsize2)
+ || !operand_equal_p (base, base2, 0))
+   return (void *)-1;
+   }
+  else
+   {
+ if (TREE_CODE (base) != MEM_REF
+ || TREE_OPERAND (base, 0) != ref2
+ || !integer_zerop (TREE_OPERAND (base, 1)))
+   return (void *)-1;
+ offset2 = 0;
+   }
   tree len = gimple_call_arg (def_stmt, 2);
-  if (known_size_p (maxsize2)
- && operand_equal_p (base, base2, 0)
- && known_subrange_p (offset, maxsize, offset2,
-  wi::to_poly_offset (len) << LOG2_BITS_PER_UNIT))
+  if (known_subrange_p (offset, maxsize, offset2,
+   wi::to_poly_offset (len) << LOG2_BITS_PER_UNIT))
{
  tree val = build_zero_cst (vr->type);
  return vn_reference_lookup_or_insert_for_pieces

[Bug tree-optimization/63185] Improve DSE with branches

2018-05-17 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #8 from Richard Biener  ---
comment#5 runs into the fact that we cannot reliably compute kills for
loop-variant stores and thus we give up completely:

  if (gimple_code (temp) == GIMPLE_PHI)
{
  /* If we visit this PHI by following a backedge then we have to
 make sure ref->ref only refers to SSA names that are invariant
 with respect to the loop represented by this PHI node.  */
  if (dominated_by_p (CDI_DOMINATORS, gimple_bb (stmt),
  gimple_bb (temp))
  && !for_each_index (ref->ref ? >ref : >base,
  check_name, gimple_bb (temp)))
return DSE_STORE_LIVE;

but what we could do is somehow massage the ref to not get false
positive stmt_kills_ref_p or false negative ref_maybe_used_by_stmt_p
results.  That could for example involve stripping all components
with variable names in them (and then also disabling byte-tracking).
For varying bases we still need to give up completely.

A backward data-flow algorithm might have an easier time dealing with this
case.


As for the value-numbering issue we currently do not use number of iteration
analysis / range-info to constrain the ao_ref max_size/offset.  That's probably
a useful thing to add and might already help.  In this case it doesn't
because the range is one too large (head controlled loop).  And VRP doesn't
run after CH when FRE/PRE still runs afterwards.  And value-numbering visiting
memset doesn't handle pointer destinations (that's easy to fix).

[Bug tree-optimization/63185] Improve DSE with branches

2018-05-17 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #7 from Marc Glisse  ---
This PR is messy. To sum up, comment #0 was recently fixed, comment #5 is not
(not noticing that the writes in the loop are dead), and comment #6 asks for
increasing the alignment of VLAs the same way we sometimes increase the
alignment of arrays to help vectorization.

[Bug tree-optimization/63185] Improve DSE with branches

2015-12-15 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #6 from Marc Glisse  ---
In addition to the issues already described, it seems that we generate better
code if I replace the VLAs with calls to alloca. Indeed, we assume that alloca
returns 16-aligned memory, while with __builtin_alloca_with_align(..., 64), we
don't seem to have code to turn it into __builtin_alloca_with_align(..., 128)
so we could avoid all the loop adjustment code.

[Bug tree-optimization/63185] Improve DSE with branches

2014-10-24 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #5 from Marc Glisse glisse at gcc dot gnu.org ---
Another example:

#include time.h
void f(){
  const int n=114;
  double a[n];
  double b[n];
  double c[n];
  __builtin_memset(a,0,n);
  __builtin_memset(b,0,n);
  __builtin_memset(c,0,n);
  for(int i=0;in;++i)
c[i]=a[i]+b[i];
  time(0);
}

If I make n a constant (using #define), DCE knows that c is not
ref_may_be_aliased (that test could probably be weakened) and removes almost
everything. We don't replace alloca_with_align with an array because the size
is larger than 256 (with a more limited scope the limit would even be 25...).
DSE is likely confused by the loop. And PRE and others don't know that a[i] is
always 0. (llvm removes everything but the final call)


[Bug tree-optimization/63185] Improve DSE with branches

2014-10-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #3 from Richard Biener rguenth at gcc dot gnu.org ---
Btw, what was the original testcase?

It's all a little bit complicated and not really fit for the current simple
handling of PHIs.  Because with a conditional store we have to continue
looking for uses on the other path (for regular stores we'd have expected
code sinking to sink the store).

That is, without making the walk very much more expensive there isn't anything
we can do about this.


[Bug tree-optimization/63185] Improve DSE with branches

2014-10-16 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #4 from Marc Glisse glisse at gcc dot gnu.org ---
(In reply to Richard Biener from comment #3)
 Btw, what was the original testcase?

Sorry, I don't remember. Probably something involving std::vector or
std::string. Since llvm optimizes it just fine, I hadn't realized it would be
so hard.

 That is, without making the walk very much more expensive there isn't
 anything we can do about this.

Ok, thanks for looking at it. Down to P5?


[Bug tree-optimization/63185] Improve DSE with branches

2014-09-05 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

Richard Biener rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2014-09-05
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener rguenth at gcc dot gnu.org ---
Hm?  The PHI does post-dominate the memset.  I think you run into another
check,
namely we visited the call already (temp != NULL).

We don't special-case a non-diamond if (where there are never two possible
uses on different CFG paths).  That is, the walking over all uses is
somewhat stupid.


[Bug tree-optimization/63185] Improve DSE with branches

2014-09-05 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185

--- Comment #2 from Marc Glisse glisse at gcc dot gnu.org ---
(In reply to Richard Biener from comment #1)
 Hm?  The PHI does post-dominate the memset.  I think you run into another
 check, namely we visited the call already (temp != NULL).

Oups... The original testcase was failing because a PHI didn't post-dominate,
and I wasn't very careful when reducing. Bah, the issue seems close enough...