Re: [WIP PATCH] Re: Inefficient end-of-loop value computation - missed optimization somewhere?

2012-03-08 Thread Ulrich Weigand
Richard Guenther wrote:
 On Tue, Feb 28, 2012 at 4:10 PM, Ulrich Weigand uweig...@de.ibm.com wrote:
  I'll still need to do proper testing and benchmarking, but I thought
  I'd post the patch anyway just as a heads-up ...
 
  Does this look reasonable to you?
 
 Yes, that looks reasonable.  Though instead of unsigned_type_for
 please use build_nonstandard_integer_type instead (if the type
 is not already unsigned).  unsigned_type_for does not necessarily
 preserve type-precision and may even return NULL.

OK, I've changed the patch to use build_nonstandard_integer_type, and it
still fixes the original problem.  However, on further examination it
turns out that the fix works not because the expression is actually
simplified at the tree level.  Rather, it is just that, because of some
minor reassociation (the +1 is moved to the end), the *RTL* optimizers
now see a (just as complex but) slightly different RTL sequence, and
therefore combine now just happens to be able to fold everything
together (using the RTL-level simplify_plus_minus machinery).

Even worse, the patch causes a number of regressions:

 FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized \\+ 0
 FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized n_. \\* n_. 1
 FAIL: gcc.dg/vect/no-scevccp-noreassoc-outer-4.c scan-tree-dump-times vect OUTER LOOP VECTORIZED. 1


The loop-15.c problem is probably just a missed optimization
elsewhere that can be fixed.  The problem is that a loop-end
value used to be computed as (n - 1) * n + n which got
simplified via fold_plusminus_mult_expr to n * n.  When
computing in unsigned, however, the expression becomes something
along the lines of

  [...] * (unsigned) n + (unsigned) n

and the top-level of fold_binary_loc removes the *outermost* NOP_EXPRs
resulting in

  [...] * (unsigned) n + n

which fold_plusminus_mult_expr no longer recognizes as something
it can handle (since (unsigned) n is not operand_equal_p to n).
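
For reference, the situation is the one exercised by a test case along
these lines (a paraphrase from memory, not the verbatim loop-15.c
source):

  /* The loop should be removed entirely, with the final value of j
     computed in closed form: (n - 1) * n + n  ==  n * n.  */
  int foo (int n)
  {
    int i, j = 0;
    for (i = 0; i < n; i++)
      j += n;
    return j;
  }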

I think this can be fixed by having fold_plusminus_mult_expr call
STRIP_NOPS on sub-operands, as is already done elsewhere in fold_binary_loc.

This basically fixes the test case (due to some remaining casts, the
scan-tree-dump patterns still don't match as-is, but the expected code
is again generated).
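
Concretely, the change amounts to something like this (a rough sketch
against fold_plusminus_mult_expr, not a tested patch; arg00/arg01 are
the decomposed sub-operands of the multiplication):

  /* Sketch only: strip conversions from the sub-operands before the
     operand_equal_p checks, so that (unsigned) n and n compare equal
     again.  */
  arg00 = TREE_OPERAND (arg0, 0);
  arg01 = TREE_OPERAND (arg0, 1);
  STRIP_NOPS (arg00);
  STRIP_NOPS (arg01);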


The no-scevccp-noreassoc-outer-4.c problem is more difficult, and would
seem to point to a fundamental problem with performing the computation
in unsigned mode.  In this case, the end value of an inner loop forms
part of a scalar evolution of the outer loop.  When computing the end
value in unsigned type, that scalar evolution is no longer detected.
In part, this is because the scalar evolution machinery might need to
be improved in handling casts.  In particular, in some places it only
does STRIP_USELESS_TYPE_CONVERSION; changing those to STRIP_NOPS makes
it detect the evolution again.  But I'm not fully convinced this is
a safe change ...   In any case, this still doesn't make the outer
loop vectorizable, since it is no longer provably finite.  Without
the patch, this could be proven *because* the computation of the
inner loop's end value used signed arithmetic which cannot overflow.
By performing the computation in unsigned, this information is in
fact lost.
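
To make the loss concrete (an assumed illustration, not the actual
testcase):

  /* The inner loop's end value feeds the outer loop's exit condition.
     Signed form: signed overflow is undefined, so for n > 0 the
     compiler may assume start + n > start, which helps prove the
     outer loop finite.  */
  int end_signed (int start, int n)
  {
    return start + n;
  }

  /* Unsigned form: the addition wraps modulo 2^32, so no such bound
     on the result can be derived.  */
  int end_unsigned (int start, int n)
  {
    return (int) ((unsigned int) start + (unsigned int) n);
  }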


This would seem to indicate that unconditionally using unsigned
arithmetic even if not strictly necessary might not always be
the best solution either.  I've gone back and looked at the
original problem, namely that

  (start + 1) + (int)((unsigned int) ~start + (unsigned int) end)

is not simplified by fold-const.

Interestingly enough, it *is* valid to simplify this expression
to just plain end, even with signed arithmetic, since the
transformation does not introduce any new potential overflows.
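
To spell the argument out (two's-complement identities; the final
conversion back to int is implementation-defined in C, but is the
identity on all targets GCC supports; fold_me is just an illustrative
name):

  /* ~start == -start - 1 (mod 2^N), hence
       (unsigned) ~start + (unsigned) end == end - start - 1 (mod 2^N)
     and
       (start + 1) + (end - start - 1)   == end.
     A brute-force sanity check over a small range: */
  #include <assert.h>

  static int fold_me (int start, int end)
  {
    return (start + 1) + (int) ((unsigned int) ~start + (unsigned int) end);
  }

  int main (void)
  {
    int start, end;
    for (start = -100; start <= 100; start++)
      for (end = -100; end <= 100; end++)
        assert (fold_me (start, end) == end);
    return 0;
  }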

This is actually recognized in fold_binary_loc, but only in one special
case.  See the code around the following comment:
  /* The only case we can still associate with two variables
     is if they are the same, modulo negation.  */

Unfortunately, this handles only A and -A; it doesn't recognize that
A+1 is in fact the negation of ~A ...

There is also code in tree-ssa-forwprop.c:associate_plusminus that
handles quite a number of special cases where re-association *is*
allowed even for signed arithmetic:

        (A +- B) - A       ->  +- B
        (A +- B) -+ B      ->  A
        (CST +- A) +- CST  ->  CST +- A
        (A + CST) +- CST   ->  A + CST
        ~A + A             ->  -1
        ~A + 1             ->  -A
        A - (A +- B)       ->  -+ B
        A +- (B +- A)      ->  +- B
        CST +- (CST +- A)  ->  CST +- A
        CST +- (A +- CST)  ->  CST +- A
        A + ~A             ->  -1

This handles some cases involving ~, but still not quite the case
we need for this expression.  In addition, the forward propagation logic
doesn't seem to handle casts very well (as opposed to fold-const, which
makes liberal use of STRIP_NOPS).
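
As an illustration of the cast problem (a GIMPLE-style sketch, not
actual dump output): the ~A + 1 -> -A pattern matches the bare
statements, but not the cast-wrapped variant, even though both are
equivalent at equal precision:

  a_2 = ~a_1;
  b_3 = a_2 + 1;          /* matched: folds to -a_1 */

  u_2 = (unsigned int) a_1;
  u_3 = ~u_2;
  a_4 = (int) u_3;
  b_5 = a_4 + 1;          /* not matched: the conversions hide the
                             ~A + 1 shape */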


I'm wondering where to go from there:

- Why are those special 

Re: [WIP PATCH] Re: Inefficient end-of-loop value computation - missed optimization somewhere?

2012-02-28 Thread Richard Guenther
On Tue, Feb 28, 2012 at 4:10 PM, Ulrich Weigand uweig...@de.ibm.com wrote:
 Richard Guenther wrote:
 On Mon, Feb 20, 2012 at 11:19 PM, Ulrich Weigand uweig...@de.ibm.com wrote:
  we've noticed that the loop optimizer sometimes inserts weirdly
  inefficient code to compute the value of an induction variable
  at the end of the loop.
 [snip]

 The issue is that (start + 1) + 1 * (int) ... is computed using signed
 integer arithmetic which may invoke undefined behavior when it wraps.
 Thus we cannot re-associate it.  I suppose if you'd arrange the code
 to do the arithmetic in unsigned and only cast the result back to the
 desired type we might fold it ...

 I'm not completely sure I've got the correct place for the conversions,
 but using the patch below does indeed fix the inefficient code.

 I'll still need to do proper testing and benchmarking, but I thought
 I'd post the patch anyway just as a heads-up ...

 Does this look reasonable to you?

Yes, that looks reasonable.  Though instead of unsigned_type_for
please use build_nonstandard_integer_type instead (if the type
is not already unsigned).  unsigned_type_for does not necessarily
preserve type-precision and may even return NULL.
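
In other words, something like the following (a sketch; the type is
simply reused if it is already unsigned):

  tree type  = TREE_TYPE (evolution_fn);
  tree utype = TYPE_UNSIGNED (type)
               ? type
               : build_nonstandard_integer_type (TYPE_PRECISION (type),
                                                 /*unsignedp=*/1);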

Richard.

 Thanks,
 Ulrich


 ChangeLog:

        * tree-scalar-evolution.c (compute_overall_effect_of_inner_loop):
        Perform chrec_apply computation in an unsigned type.

 Index: gcc/tree-scalar-evolution.c
 ===================================================================
 --- gcc/tree-scalar-evolution.c (revision 184439)
 +++ gcc/tree-scalar-evolution.c (working copy)
 @@ -474,11 +474,18 @@
            return chrec_dont_know;
          else
            {
 +             tree type = TREE_TYPE (evolution_fn);
 +             tree utype = unsigned_type_for (type);
              tree res;

              /* evolution_fn is the evolution function in LOOP.  Get
 -                its value in the nb_iter-th iteration.  */
 -             res = chrec_apply (inner_loop->num, evolution_fn, nb_iter);
 +                its value in the nb_iter-th iteration.  Since the
 +                number of iterations is always unsigned, we perform
 +                the computation in an unsigned type to allow for
 +                improved simplification of subexpressions.  */
 +             res = chrec_convert (utype, evolution_fn, NULL);
 +             res = chrec_apply (inner_loop->num, res, nb_iter);
 +             res = chrec_convert (type, res, NULL);

              if (chrec_contains_symbols_defined_in_loop (res, loop->num))
                res = instantiate_parameters (loop, res);

 --
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com