[Bug tree-optimization/114769] [14 Regression] Suspicious code in vect_recog_sad_pattern() since r14-1832

2024-04-19 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114769

--- Comment #3 from Feng Xue  ---
When half_type is null, the call chain would be:

vect_supportable_direct_optab_p (vinfo, sum_type, SAD_EXPR, NULL, ...)
  -> get_vectype_for_scalar_type (vinfo, NULL)
    -> get_related_vectype_for_scalar_type (vinfo, /*scalar_type=*/NULL, ...)

       if (!INTEGRAL_TYPE_P (scalar_type)
           && !POINTER_TYPE_P (scalar_type)
           && !SCALAR_FLOAT_TYPE_P (scalar_type))

Here scalar_type is dereferenced, resulting in a segfault.

[Bug tree-optimization/114769] New: Suspicious code in vect_recog_sad_pattern()

2024-04-18 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114769

Bug ID: 114769
   Summary: Suspicious code in vect_recog_sad_pattern()
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

This function may contain buggy code, introduced by the recent support for the
new ABD pattern. It calls "vect_recog_absolute_difference" to check the
ABS/ABSU statement; if that succeeds, it expects that "vect_unpromoted_value
unprom[2]" has been properly set for the two operands. However, in the final
code snippet this step is missed, so "unprom" could hold unwanted values.

  /* Failed to find a widen operation so we check for a regular MINUS_EXPR.  */
  gassign *diff = dyn_cast <gassign *> (STMT_VINFO_STMT (diff_stmt_vinfo));
  if (diff_stmt && diff
      && gimple_assign_rhs_code (diff) == MINUS_EXPR
      && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (abs_oprnd)))
    {
      *diff_stmt = diff;
      *half_type = NULL_TREE;
      // unprom is not set accordingly
      return true;
    }

This execution path would be triggered on an architecture that only supports
the dot-product instruction, but lacks the "IFN_ABD" and "IFN_VEC_WIDEN_ABD"
instructions.

[Bug tree-optimization/114440] New: Fail to recognize a chain of lane-reduced operations for loop reduction vect

2024-03-23 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440

Bug ID: 114440
   Summary: Fail to recognize a chain of lane-reduced operations
for loop reduction vect
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

In a loop reduction path containing a lane-reduced operation
(DOT_PROD/SAD/WIDEN_SUM), the current vectorizer cannot handle the pattern if
there are other operations, each of which might be a normal or another
lane-reduced one. A pseudo example is represented as:

   char *d0, *d1;
   char *s0, *s1;
   char *w;
   int *n;

   ...
   int sum = 0;

   for (i) {
     ...
     sum += d0[i] * d1[i];       /* DOT_PROD */
     ...
     sum += abs(s0[i] - s1[i]);  /* SAD */
     ...
     sum += w[i];                /* WIDEN_SUM */
     ...
     sum += n[i];                /* Normal */
     ...
   }

   ... = sum;

For this case, the reduction vectype would vary with the operations, which
causes a mismatch between the counts of vectorized defs and uses. A possible
fix might be to generate extra trivial pass-through copies. Given a concrete
example:

   sum = 0;
   for (i) {
     sum += d0[i] * d1[i];   /* 16*char -> 4*int */
     sum += n[i];            /*   4*int -> 4*int */
   }

The final vectorized statements could be:

   sum_v0 = { 0, 0, 0, 0 };
   sum_v1 = { 0, 0, 0, 0 };
   sum_v2 = { 0, 0, 0, 0 };
   sum_v3 = { 0, 0, 0, 0 };

   for (i / 16) {
     sum_v0 += DOT_PROD (v_d0[i: 0 .. 15], v_d1[i: 0 .. 15]);
     sum_v1 += 0;  // copy
     sum_v2 += 0;  // copy
     sum_v3 += 0;  // copy

     sum_v0 += v_n[i:  0 .. 3];
     sum_v1 += v_n[i:  4 .. 7];
     sum_v2 += v_n[i:  8 .. 11];
     sum_v3 += v_n[i: 12 .. 15];
   }

   sum = REDUC_PLUS(sum_v0 + sum_v1 + sum_v2 + sum_v3);

In the above sequence, each summation statement forms exactly one pattern.
However, we could easily compose a somewhat more complicated variant that gets
into a similar situation, where a chain of lane-reduced operations comes from
the non-reduction addend of one summation statement, like:

   sum += d0[i] * d1[i] + abs(s0[i] - s1[i]) + n[i];

Probably, this requires some extension in the vector pattern formation stage to
split the patterns.

[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement

2024-01-30 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091

Feng Xue  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from Feng Xue  ---
Fixed

[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand

2024-01-11 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #7 from Feng Xue  ---
(In reply to Richard Biener from comment #6)
> (In reply to Andrew Pinski from comment #5)
> > One more thing:
> > ```
> >  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
> >  vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
> >  vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
> >  vect_shift_3 = vect_value >> { 12, 13, 14, 15 };
> > ```
> > vs
> > ```
> >  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
> >  vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
> >  vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
> >  vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
> > ```
> > 
> > the first has fully independent operations while in the second case, there
> > is one dependent operation and the rest are independent operations.
> > 
> > On cores which has many vector units the first one might be faster than the
> > second one.  So this needs a cost model too.
> 
> Note the vectorizer has the shift values dependent as well (across
> iterations),
> we just constant propagate after unrolling here.
> 
> Note this is basically asking for "strength-reduction" of expensive
> constants which could be more generally useful and not only for this
> specific shift case.  Consider the same example but with an add instead
> of a shift for example, the same exact set of constants will appear.

It is. But I only find that vector shift has special treatment for constant
operands based on their numerical pattern. Not sure whether any other operator
does.

BTW, here is a scalar-version strength-reduction for shift, like:

  int a = value >> n;
  int b = value >> (n + 6);

  ==>

  int a = value >> n;
  int b = a >> 6;  // (n + 6) is not needed

But this is not covered by current scalar strength-reduction pass.

[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand

2024-01-10 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #3 from Feng Xue  ---

(In reply to Andrew Pinski from comment #1)
> Note on aarch64 with SVE, you should be able to generate those constants
> without a load, using the index instruction.
Ok. Thanks for the note. This still requires an extra instruction, whereas the
constant delta could be encoded in the shift instruction as an IMM operand.

> Basically this requires an "un-shift" pass and most likely should be done at
> the RTL level though that might be too late.
> Maybe isel?
I'm thinking of adding the processing in pass_lower_vector_ssa, which also
contains other peephole vector ssa optimizations, not just lowering.

[Bug tree-optimization/113326] New: Optimize vector shift with constant delta on shifting-count operand

2024-01-10 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

Bug ID: 113326
   Summary: Optimize vector shift with constant delta on
shifting-count operand
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

For a shift by an induction variable, loop vectorization could generate a
series of vector shifts whose shifting-count operands advance by a constant
delta. We could use the result of the first vector shift as a base, and
transform the others into vector shifts of that result by a scalar delta. This
would eliminate the non-uniform shift-count vector constants, and thereby save
a load on most architectures.

 int test(int array[16]);

 int foo(int value)
 {
   int array[16];

   for (int i = 0; i < 16; i++)
  array[i] = value >> i;

   return test(array);
 }

 // Current
 vect_value = { value, value, value, value };
 vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
 vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
 vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
 vect_shift_3 = vect_value >> { 12, 13, 14, 15 };

 // To be optimized to
 vect_value = { value, value, value, value };
 vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
 vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
 vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
 vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };

[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2024-01-09 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Feng Xue  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #7 from Feng Xue  ---
Partial permutation (especially extract-low/high) would incur inefficient
splicing in some situations, even when the vector mode with the lowest cost is
used.

  int test(unsigned array[4][4]);

  int foo(unsigned short *a, unsigned long n)
  {
unsigned array[4][4];

for (unsigned i = 0; i < 4; i++, a += n)
  {
array[i][0] = (a[0] << 3) - (a[4] << 6);
array[i][1] = (a[1] << 3) - (a[5] << 6);
array[i][2] = (a[2] << 3) - (a[6] << 6);
array[i][3] = (a[3] << 3) - (a[7] << 6);
  }

return test(array);
  }

// After vect-compare-cost fix
mov x2, x0
stp x29, x30, [sp, -80]!
add x3, x2, x1, lsl 1
lsl x1, x1, 1
mov x29, sp
add x4, x3, x1
add x0, sp, 16
ldr q5, [x2]
ldr q31, [x4, x1]
ldr q0, [x3, x1]
ldr q1, [x2, x1]
movi    v28.4s, 0                  //
zip1    v29.2d, v0.2d, v31.2d      //
zip1    v2.2d, v5.2d, v1.2d        //
zip2    v31.2d, v0.2d, v31.2d      //
zip2    v1.2d, v5.2d, v1.2d        //
zip1    v30.8h, v29.8h, v28.8h     //
zip1    v4.8h, v2.8h, v28.8h       //  superfluous
zip1    v27.8h, v31.8h, v28.8h     //
zip1    v3.8h, v1.8h, v28.8h       //
zip2    v29.8h, v29.8h, v28.8h     //
zip2    v31.8h, v31.8h, v28.8h     //
zip2    v2.8h, v2.8h, v28.8h       //
zip2    v1.8h, v1.8h, v28.8h       //
shl v30.4s, v30.4s, 3
shl v29.4s, v29.4s, 3
shl v4.4s, v4.4s, 3
shl v2.4s, v2.4s, 3
shl v27.4s, v27.4s, 6
shl v31.4s, v31.4s, 6
shl v3.4s, v3.4s, 6
shl v1.4s, v1.4s, 6
sub v27.4s, v30.4s, v27.4s
sub v31.4s, v29.4s, v31.4s
sub v3.4s, v4.4s, v3.4s
sub v1.4s, v2.4s, v1.4s
stp q27, q31, [sp, 48]
stp q3, q1, [sp, 16]
bl  test
ldp x29, x30, [sp], 80
ret


// Expect it to be optimized as:
lsl x3, x1, 1
mov x2, x0
stp x29, x30, [sp, -80]!
add x1, x2, x1, lsl 1
add x4, x1, x3
mov x29, sp
add x0, sp, 16
ldr q30, [x2, x3]
ldr q0, [x2]
ushll   v31.4s, v30.4h, 3
ushll2  v30.4s, v30.8h, 6
ushll   v29.4s, v0.4h, 3
ushll2  v0.4s, v0.8h, 6
sub v30.4s, v31.4s, v30.4s
sub v0.4s, v29.4s, v0.4s
str q0, [sp, 16]
ldr q0, [x1, x3]
str q30, [sp, 32]
ldr q30, [x4, x3]
ushll   v29.4s, v0.4h, 3
ushll2  v0.4s, v0.8h, 6
ushll   v31.4s, v30.4h, 3
ushll2  v30.4s, v30.8h, 6
sub v0.4s, v29.4s, v0.4s
sub v30.4s, v31.4s, v30.4s
stp q0, q30, [sp, 48]
bl  test
ldp x29, x30, [sp], 80
ret

Based on the cost arising from splicing, we still need a way to select the most
profitable vector mode per SLP node, rather than the global vector mode
specified in vinfo.

[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement

2023-12-29 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091

--- Comment #8 from Feng Xue  ---
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/641547.html

[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement

2023-12-26 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091

--- Comment #7 from Feng Xue  ---
> The issue here is that because the "outer" pattern consumes
> patt_64 = (int) patt_63 it should have adjusted _2 = (int) _1
> stmt-to-vectorize
> as being the outer pattern root stmt for all this logic to work correctly.

We could not simply make this adjustment, since pattern recognition does not
require the replaced SSA name to have a single use. If we change the above case
to attach another scalar use to "_2":

  int foo(char *a, char *b)
  {
unsigned array[8];
int a0 = a[0];  // _2 = (int) _1;

array[0] = (a0 - b[0]);
array[1] = (a[1] - b[1]);
array[2] = (a[2] - b[2]);
array[3] = (a[3] - b[3]);
array[4] = (a[4] - b[4]);
array[5] = (a[5] - b[5]);
array[6] = (a[6] - b[6]);
array[7] = (a[7] - b[7]);

return test(array) + a0;
  }

The pattern statement "patt_64 = (int) patt_63" for "_2 = (int) _1" should be
kept. So we also need the check of "identifying whether a scalar stmt takes
part in vectorization or not" to ensure the adjustment is doable.

[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement

2023-12-21 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091

--- Comment #6 from Feng Xue  ---
(In reply to Richard Sandiford from comment #5)
> > The issue here is that because the "outer" pattern consumes
> > patt_64 = (int) patt_63 it should have adjusted _2 = (int) _1 
> > stmt-to-vectorize
> > as being the outer pattern root stmt for all this logic to work correctly.
> 
> I don't think it can though, at least not in general.  The final pattern
> stmt has to compute the same value as the original scalar stmt.

Could the current pattern replacement support an N:1 mapping (N stmts -> 1
pattern)? If not, this handling would probably break related code somewhere.

[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2023-12-21 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #2 from Feng Xue  ---
(In reply to Richard Biener from comment #1)
> See my proposal on the mailing list to lift the restriction of sticking to a
> single vector size, I think this is another example showing this.  If you
> use BB level vectorization by disabling loop vectorization but not SLP
> vectorization the code should improve?

Yes, the loop is fully unrolled, and BB SLP would improve the code.

I could not find the proposal; would you share a link? Thanks.

[Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations

2023-12-21 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Bug ID: 113104
   Summary: Suboptimal loop-based slp node splicing across
iterations
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Given a partial vector-sized SLP node in a loop, code generation would utilize
inter-iteration parallelism to achieve full vectorization by splicing defs of
the node in multiple iterations into one vector. This strategy is not always
good, and could be refined in some situations. To be specific, we had better
not splice a node if it participates in a full-vector-sized operation;
otherwise a permute and a vextract that are really unneeded would be
introduced.

Suppose the target vector size is 128 bits, and an SLP node is mapped to VEC_OP
in an iteration. Depending on whether the backend supports LO/HI versions of
the operation, there are two kinds of code sequences for splicing.

  // Isolated 2 iterations
  res_v128_I0 = VEC_OP(opnd_v64_I0, ...)   // iteration #0
  res_v128_I1 = VEC_OP(opnd_v64_I1, ...)   // iteration #1 


  // Spliced (1)
  opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }  // extra permute
  opnd_v64_lo = [vec_unpack_lo_expr] opnd_v128_I0_I1; // extra vextract
  opnd_v64_hi = [vec_unpack_hi_expr] opnd_v128_I0_I1; // extra vextract
  res_v128_I0 = VEC_OP(opnd_v64_lo, ...)
  res_v128_I1 = VEC_OP(opnd_v64_hi, ...)

  // Spliced (2)
  opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }  // extra permute
  res_v128_I0 = VEC_OP_LO(opnd_v128_i0_i1, ...)   // similar or same as VEC_OP
  res_v128_I1 = VEC_OP_HI(opnd_v128_i0_i1, ...)   // similar or same as VEC_OP

Sometimes such a permute and vextract might be optimized away by backend
passes, but sometimes they cannot be. Here is a case on aarch64.

  int test(unsigned array[4][4]);

  int foo(unsigned short *a, unsigned long n)
  {
unsigned array[4][4];

for (unsigned i = 0; i < 4; i++, a += n)
  {
array[i][0] = a[0] << 6;
array[i][1] = a[1] << 6;
array[i][2] = a[2] << 6;
array[i][3] = a[3] << 6;
  }

return test(array);
  }


// Current code generation
mov x2, x0
stp x29, x30, [sp, -80]!
add x3, x2, x1, lsl 1
lsl x1, x1, 1
mov x29, sp
add x4, x3, x1
ldr d0, [x2]
movi    v30.4s, 0
add x0, sp, 16
ldr d31, [x2, x1]
ldr d29, [x3, x1]
ldr d28, [x4, x1]
ins     v0.d[1], v31.d[0]      //
ins     v29.d[1], v28.d[0]     //
zip1    v1.8h, v0.8h, v30.8h   // superfluous
zip2    v0.8h, v0.8h, v30.8h   //
zip1    v31.8h, v29.8h, v30.8h //
zip2    v29.8h, v29.8h, v30.8h //
shl v1.4s, v1.4s, 6
shl v0.4s, v0.4s, 6
shl v31.4s, v31.4s, 6
shl v29.4s, v29.4s, 6
stp q1, q0, [sp, 16]
stp q31, q29, [sp, 48]
bl  test
ldp x29, x30, [sp], 80
ret


// May be optimized to:
stp x29, x30, [sp, -80]!
mov x29, sp
mov x2, x0
add x0, sp, 16
lsl x3, x1, 1
add x1, x2, x1, lsl 1
add x4, x1, x3
ldr d31, [x2, x3]
ushll   v31.4s, v31.4h, 6
ldr d30, [x2]
ushll   v30.4s, v30.4h, 6
str q30, [sp, 16]
ldr d30, [x1, x3]
ushll   v30.4s, v30.4h, 6
str q31, [sp, 32]
ldr d31, [x4, x3]
ushll   v31.4s, v31.4h, 6
stp q30, q31, [sp, 48]
bl  test
ldp x29, x30, [sp], 80
ret

[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement

2023-12-20 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091

--- Comment #3 from Feng Xue  ---
The function vect_bb_vectorization_profitable_p resorts to a recursive way to
identify scalar uses; for this case, setting STMT_VINFO_LIVE_P or not would not
change the scalar cost computation.

[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement

2023-12-20 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091

--- Comment #2 from Feng Xue  ---
(In reply to Richard Biener from comment #1)
> It's the logic
> 
>   FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
> {
>   if (svisited.contains (stmt_info))
> continue;
>   stmt_vec_info orig_stmt_info = vect_orig_stmt (stmt_info);
>   if (STMT_VINFO_IN_PATTERN_P (orig_stmt_info)
>   && STMT_VINFO_RELATED_STMT (orig_stmt_info) != stmt_info)
> /* Only the pattern root stmt computes the original scalar value.  */
> continue;
>   bool mark_visited = true;
>   gimple *orig_stmt = orig_stmt_info->stmt;
>   ssa_op_iter op_iter;
>   def_operand_p def_p;
>   FOR_EACH_PHI_OR_STMT_DEF (def_p, orig_stmt, op_iter, SSA_OP_DEF)
> {
>   imm_use_iterator use_iter;
>   gimple *use_stmt;
>   stmt_vec_info use_stmt_info;
>   FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, DEF_FROM_PTR (def_p))
> if (!is_gimple_debug (use_stmt))
>   {
> use_stmt_info = bb_vinfo->lookup_stmt (use_stmt);
> if (!use_stmt_info
> || !PURE_SLP_STMT (vect_stmt_to_vectorize
> (use_stmt_info)))
>   {
> STMT_VINFO_LIVE_P (stmt_info) = true;
> 
> specifically the last check.  That's supposed to pick up the "main" pattern
> that's now covering the scalar stmt.
> 
> But somehow the "main" pattern,
> 
> patt_67 = .VEC_WIDEN_MINUS (_1, _3);  //  _5 = _2 - _4;
>  patt_68 = (signed short) patt_67; //  _5 = _2 - _4;
>  patt_69 = (int) patt_68;  //  _5 = _2 - _4;
> 
> doesn't get picked up here.  I wonder what's the orig_stmt and the def
> picked and what original scalar use we end up in where the
> vect_stmt_to_vectorize isn't the "last" pattern.  Maybe we really want


This problem happens at slp node:

 note: node 0x425bc38 (max_nunits=8, refcnt=1) vector(8) char
 note: op template: _1 = *a_50(D);
 note:  stmt 0 _1 = *a_50(D);
 note:  stmt 1 _7 = MEM[(char *)a_50(D) + 1B];
 note:  stmt 2 _13 = MEM[(char *)a_50(D) + 2B];
 note:  stmt 3 _19 = MEM[(char *)a_50(D) + 3B];
 note:  stmt 4 _25 = MEM[(char *)a_50(D) + 4B];
 note:  stmt 5 _31 = MEM[(char *)a_50(D) + 5B];
 note:  stmt 6 _37 = MEM[(char *)a_50(D) + 6B];
 note:  stmt 7 _43 = MEM[(char *)a_50(D) + 7B];

The orig_stmt is "_1 = *a_50(D)"

The use stmt is "_2 = (int) _1", whose pattern statement is "patt_64 = (int)
patt_63", which is not referenced by any original or other pattern statements.
In other words, the orig_stmt could be absorbed into a vector operation,
without any outlier scalar use.

The afore-mentioned "last check" in vect_bb_slp_mark_live_stmts would mark the
orig_stmt as STMT_VINFO_LIVE_P, which actually implies it has a scalar use
(though it should not have); the difference is re-generating the def somewhere,
rather than retaining the original scalar statement. The following
"vectorizable_live_operation" would then account the new operations into the
vectorization cost of the SLP instance.

The function vect_bb_vectorization_profitable_p resorts to a recursive way to
identify scalar uses; for this case, setting STMT_VINFO_LIVE_P or not would
change the scalar cost computation. If we can avoid such a fake-liveness
adjustment on the statements we are interested in, the vectorization cost could
beat the scalar cost, and make the former succeed.

   Unvectorized: 
mov x2, x0
stp x29, x30, [sp, -48]!
mov x29, sp
ldrb    w3, [x1]
ldrb    w4, [x1, 1]
add x0, sp, 16
ldrb    w9, [x2]
ldrb    w8, [x2, 1]
sub w9, w9, w3
ldrb    w7, [x2, 2]
ldrb    w3, [x1, 2]
sub w8, w8, w4
ldrb    w6, [x2, 3]
ldrb    w4, [x1, 3]
sub w7, w7, w3
ldrb    w10, [x1, 5]
ldrb    w3, [x1, 4]
sub w6, w6, w4
ldrb    w5, [x2, 4]
ldrb    w4, [x2, 5]
sub w5, w5, w3
ldrb    w3, [x2, 6]
sub w4, w4, w10
ldrb    w2, [x2, 7]
ldrb    w10, [x1, 6]
ldrb    w1, [x1, 7]
sub w3, w3, w10
stp w9, w8, [sp, 16]
sub w1, w2, w1
stp w7, w6, [sp, 24]
stp w5, w4, [sp, 32]
stp w3, w1, [sp, 40]
bl  test
ldp x29, x30, [sp], 48
ret


Vectorized:
mov x2, x0
stp x29, x30, [sp, -48]!
mov x29, sp
ldr d1, [x1]
add x0, sp, 16
ldr d0, [x2]
usubl   v0.8h, v0.8b, v1.8b
sxtl    v1.4s, v0.4h
sxtl2   v0.4s, v0.8h
stp q1, q0, [sp, 16]
bl  test
ldp x29, x30, [sp], 48
ret


> these "overlapping" patterns, but IMHO having "two entries" into
> a chain of scalar stmts is bad and we should link up the whole matched
> sequence to the final "root" instead?
> 
> That said, the 

[Bug tree-optimization/113091] New: Over-estimate SLP vector-to-scalar cost for non-live pattern statement

2023-12-20 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091

Bug ID: 113091
   Summary: Over-estimate SLP vector-to-scalar cost for non-live
pattern statement
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Gcc fails to vectorize the below testcase on aarch64.

  int test(unsigned array[8]);

  int foo(char *a, char *b)
  {
unsigned array[8];

array[0] = (a[0] - b[0]);
array[1] = (a[1] - b[1]);
array[2] = (a[2] - b[2]);
array[3] = (a[3] - b[3]);
array[4] = (a[4] - b[4]);
array[5] = (a[5] - b[5]);
array[6] = (a[6] - b[6]);
array[7] = (a[7] - b[7]);

return test(array);
  }

The dump shows that loads to a[i] and b[i] are considered to be live as scalar
references, which results in over-estimated vector-to-scalar cost.

*a_50(D) 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 1B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 3B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 5B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 7B] 1 times vec_to_scalar costs 2 in epilogue
*b_51(D) 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 1B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 3B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 5B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 7B] 1 times vec_to_scalar costs 2 in epilogue

Subtraction on char type is recognized as widen-sub, and involves two kinds of
pattern replacement.

 * Original
 _1 = *a_50(D);
 _2 = (int) _1;
 _3 = *b_51(D);
 _4 = (int) _3;
 _5 = _2 - _4;


 * After pattern replacement
 patt_63 = (unsigned short) _1;  //  _2 = (int) _1;
 patt_64 = (int) patt_63;//  _2 = (int) _1;

 patt_65 = (unsigned short) _3;  //  _4 = (int) _3;
 patt_66 = (int) patt_65;//  _4 = (int) _3;

 patt_67 = .VEC_WIDEN_MINUS (_1, _3);  //  _5 = _2 - _4;
 patt_68 = (signed short) patt_67; //  _5 = _2 - _4;
 patt_69 = (int) patt_68;  //  _5 = _2 - _4;

For the statement "_2 = (int) _1", its vectorization representative "patt_64 =
(int) patt_63" is not marked as PURE_SLP, so it is conservatively considered to
have a scalar use and to be live outside of the SLP bb (in the function
vect_bb_slp_mark_live_stmts). However, the pattern definition is actually dead,
and should not contribute to the vector-to-scalar cost.

Since defs from pattern statements are not part of the function body, we could
not track their def/use chains as ordinary SSA names. We may have a quick fix
for one situation: if the original SSA name "_2" has a single use, its
existence should be covered only by the vectorized operation, no matter what it
would be w/o pattern replacement.

[Bug target/106671] aarch64: BTI instruction are not inserted for cross-section direct calls

2023-08-02 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671

Feng Xue  changed:

   What|Removed |Added

 CC||fxue at os dot 
amperecomputing.com

--- Comment #9 from Feng Xue  ---
On some occasions, we may not use the new ld; kernel building relies on its own
runtime linker, which is used for kernel modules. So I created a patch
(https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626084.html), which
provides the user another option that can be done at the compiler side.

[Bug tree-optimization/109427] New: Wrong param description in param.opt

2023-04-05 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109427

Bug ID: 109427
   Summary: Wrong param description in param.opt
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

-param=vect-induction-float=
Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRage(0,
1) Param Optimization
   ^^^

[Bug ipa/108445] Address expression on global variable is not normalized

2023-01-18 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108445

--- Comment #1 from Feng Xue  ---
Created attachment 54297
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54297&action=edit
testcase

[Bug ipa/108445] New: Address expression on global variable is not normalized

2023-01-18 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108445

Bug ID: 108445
   Summary: Address expression on global variable is not
normalized
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Created attachment 54296
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54296&action=edit
testcase

Compile two files with option "-flto -g -O2",

  addr_f1.c

  int gArray[16];


  addr_f2.c

  extern int gArray[];

  int foo(int *a)
  {
  int *p = a;

  return *p;
  }

  int main(int argc, char *argv[])
  {
  if (argc & 1)
  gArray[argc - 1] = 1;

  if (argc > 1)
  return foo(gArray);

  return 0;
  }

We will get an ICE as:

  addr_f2.c:10:5: error: invalid address operand in ‘mem_ref’
 10 | int main(int argc, char *argv[])
| ^
  MEM[(int *)];

  # VUSE <.MEM_9>
  _8 = MEM[(int *)];
  during GIMPLE pass: fixup_cfg
  addr_f2.c:10:5: internal compiler error: verify_gimple failed
  0xe87e8b verify_gimple_in_cfg(function*, bool, bool)
  ../../gcc/tree-cfg.cc:5647
  0xd55eb4 execute_function_todo
  ../../gcc/passes.cc:2091
  0xd567e3 do_per_function
  ../../gcc/passes.cc:1694
  0xd567e3 execute_todo
  ../../gcc/passes.cc:2145

The detailed representation of the above address operand is not a normalized
address expression, and is rejected by the gimple verifier. For an address
expression on a global variable, the LTO gimple serializer does a trick
transformation that changes it into such a redundant form, then relies on the
gimple de-serializer to revert the change. However, in some situations the
de-serializer may fail to do that (as in the testcase); this usually does not
cause an ICE because it almost always happens on a gimple-debug statement.

With the commit r13-4743-gda85bfc75024a92b97e60e4436863dd5789786ec, a constant
address expression might be shared by gimple-debug and other statements, which
exposes the issue.

  1.  # DEBUG BEGIN_STMT
  2.  # DEBUG a => 
  3.  # DEBUG INLINE_ENTRY foo
  4.  # DEBUG BEGIN_STMT
  5.  # DEBUG p => 
  6.  # DEBUG BEGIN_STMT
  7.  _10 = MEM[(int *)];

After early-inlining, the "main" function will end up with the above gimple
statements, in which stmts "5" and "7" refer to the shared address expression.

[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA

2022-12-15 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

--- Comment #8 from Feng Xue  ---
(In reply to Andrew Pinski from comment #7)
> (In reply to Feng Xue from comment #6)
> > (In reply to Andrew Pinski from comment #2)
> > > https://en.cppreference.com/w/c/program/setjmp
> > 
> > I think that most programmers are not aware of this, neither I for sure.
> > Usage of volatile here is not that intuitive as for purpose of
> > multi-execution synchronization
> 
> You should read up on
> https://en.cppreference.com/w/cpp/utility/program/signal too.

From the viewpoint of a programmer, setjmp/longjmp is somewhat similar to C++
exception handling, which happens in one logical execution context, while a
signal implies two unrelated contexts.

From another angle, because GCC already models control flow and the SSA web for
setjmp/longjmp, an explicit volatile specification is not really needed. But
the signal mechanism is a kind of asynchronous exception, for which it is
impractical for the compiler to do that, so volatile is a must.

[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA

2022-12-15 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

--- Comment #6 from Feng Xue  ---
(In reply to Andrew Pinski from comment #2)
> https://en.cppreference.com/w/c/program/setjmp

I think that most programmers are not aware of this; neither was I, for sure.
The usage of volatile here is not as intuitive as its use for multi-execution
synchronization; it just tells the compiler to keep the variable in memory
instead of a register. With the current SSA representation in GCC, this could
probably be identified automatically, without an explicit source-level
specifier. That is, any SSA name occurring in an abnormal phi should go to
memory during expansion to RTL.

[Bug rtl-optimization/108117] New: Wrong instruction scheduling on value coming from abnormal SSA

2022-12-14 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

Bug ID: 108117
   Summary: Wrong instruction scheduling on value coming from
abnormal SSA
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Compile the following code with "-O2" on AArch64.

#include <setjmp.h>
#include <stdio.h>

jmp_buf ex_buf;

__attribute__((noinline)) int fn_throw(int x)
{
   if (x == 1)
  longjmp(ex_buf, 1);
   return 1;
}

int main(int argc, char** argv)
{
int va = 0;
int vb = 0;

if (!setjmp(ex_buf)) {
va = fn_throw(1); /* throw via longjmp */
vb = 1;
} else
printf("Got exception, va = %d\n", va);

if (vb)
printf("Failed, vb should not = %d!\n", vb);

return 0;
}

Since "fn_throw" involves abnormal control flow transferring, any statement
after "va = fn_throw(1)" should not be hoisted prior to the call. In this case,
"vb = 1" is moved before it by RTL inst sched, and leads to incorrect result.

Similar to C++ exception handling, setjmp/longjmp would generate SSA names
occurring in abnormal phi, these should be specially treated. Though, it looks
like that RTL passes do not respect this characteristics in any kind of code
motions. Now the issue is only exposed on AArch64, for while inst sched1 pass
is enabled, but it is also a potential one on other backends.

[Bug tree-optimization/107828] tree-inlining would generate SSA with incorrect def stmt

2022-12-13 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107828

Feng Xue  changed:

   What|Removed |Added

 Resolution|INVALID |---
 Status|RESOLVED|WAITING

--- Comment #4 from Feng Xue  ---
(In reply to Martin Jambor from comment #2)
> I don't see any correctness issue here.  It is true that when IPA-SRA
> needs to create a temporary SSA_NAME for an assignment with
> VIEW_CONVERT_EXPR (that is the only statement that extra_stmts can
> contain at the moment), the SSA_NAME then get's re-mapped to another
> while woe could have just re-use the existing one, it is already
> created within the new function.
> 
> We could use the knowledge that extra_stmts is really only this one
> statement or nothing and avoid the remapping but that would be at the
> expense of universality and extensibility of the interface between
> tree-inline.cc and ipa-param-manipulations.cc.
> 
> Please let me know if I am missing something (and preferably with a
> test-case that demonstrates the problem).

Sorry for late response.

It is hard to compose a simple test case, since this can only be exposed with
our own private code. Reproducing it is not straightforward, but we can
forcefully do that with a minor change to
"ipa_param_body_adjustments::modify_assignment" in ipa-param-manipulation.cc.

diff --git a/gcc/ipa-param-manipulation.cc b/gcc/ipa-param-manipulation.cc
index cee0e23f946..202c0fda32b 100644
--- a/gcc/ipa-param-manipulation.cc
+++ b/gcc/ipa-param-manipulation.cc
@@ -1787,7 +1787,7 @@ ipa_param_body_adjustments::modify_assignment (gimple
*stmt,
   any = modify_expression (lhs_p, false);
   any |= modify_expression (rhs_p, false);
   if (any
-  && !useless_type_conversion_p (TREE_TYPE (*lhs_p), TREE_TYPE (*rhs_p)))
+  && true /* !useless_type_conversion_p (TREE_TYPE (*lhs_p), TREE_TYPE
(*rhs_p)) */)
 {
   if (TREE_CODE (*rhs_p) == CONSTRUCTOR)
{
@@ -1805,7 +1805,14 @@ ipa_param_body_adjustments::modify_assignment (gimple
*stmt,
  *rhs_p);
  tree tmp = force_gimple_operand (new_rhs, extra_stmts, true,
   NULL_TREE);
- gimple_assign_set_rhs1 (stmt, tmp);
+
+ tree tmp1 = make_ssa_name (tmp);
+ gimple *copy = gimple_build_assign (tmp1, tmp);
+
+ auto gsi = gsi_last (*extra_stmts);
+ gsi_insert_after_without_update (&gsi, copy, GSI_NEW_STMT);
+
+ gimple_assign_set_rhs1 (stmt, tmp1);
}
   return true;
 }

The purpose of the above code piece is to add a trivial SSA copy statement to
make the below "extra_stmts" non-empty:

remap_gimple_stmt()

{
  ...
  if (id->param_body_adjs)
{
  ...
  if (!gimple_seq_empty_p (extra_stmts))
{
  memset (&wi, 0, sizeof (wi));
  wi.info = id;
  for (gimple_stmt_iterator egsi = gsi_start (extra_stmts);
   !gsi_end_p (egsi);
   gsi_next (&egsi))
walk_gimple_op (gsi_stmt (egsi), remap_gimple_op_r, &wi);
 ^

...
}
}
   ...
}

Now, we use the case gcc.dg/ipa/ipa-sra-1.c, and the options are "-O3 -flto
-fipa-sra". Trace into "remap_gimple_op_r" as above, and check the remapping on
the copy:

<&0x76c282d0> ISRA.0_1 = ISRA.0;

A new SSA name "ISRA.0_2" is created after setting the definition statement for
it via:


  if (TREE_CODE (*tp) == SSA_NAME)
{
  *tp = remap_ssa_name (*tp, id);
  *walk_subtrees = 0;
  if (is_lhs)
 SSA_NAME_DEF_STMT (*tp) = wi_p->stmt;
 ^
  return NULL;
}

Here, both "ISRA.0_1" and "ISRA.O_2" belong to same function, and point to same
def stmt.

[Bug lto/107829] New: Trivial compile time tracking code

2022-11-22 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107829

Bug ID: 107829
   Summary: Trivial compile time tracking code
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

In the function "materialize_cgraph" of lto.cc:


  /* Start the appropriate timer depending on the mode that we are
 operating in.  */
  lto_timer = (flag_wpa) ? TV_WHOPR_WPA
  : (flag_ltrans) ? TV_WHOPR_LTRANS
  : TV_LTO;
  timevar_push (lto_timer);

  current_function_decl = NULL;
  set_cfun (NULL);

  if (!quiet_flag)
fprintf (stderr, "\n");

  timevar_pop (lto_timer);


Execution of the code enclosed by the time-variable "lto_timer" is expected to
take zero time. Adding time tracking here seems superfluous.

[Bug tree-optimization/107828] New: tree-inlining would generate SSA with incorrect def stmt

2022-11-22 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107828

Bug ID: 107828
   Summary: tree-inlining would generate SSA with incorrect def
stmt
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

In the function "remap_gimple_op_r" of tree-inlining.cc:

  ...

  if (TREE_CODE (*tp) == SSA_NAME)
{
  *tp = remap_ssa_name (*tp, id);
  *walk_subtrees = 0;
  if (is_lhs)
 SSA_NAME_DEF_STMT (*tp) = wi_p->stmt;
  return NULL;
}

The definition statement of the original "*tp" may be the same as wi_p->stmt.
So we will end up with a statement that is pointed to by both the old and new
SSA names, while the old one should have been reclaimed.

And this happens when some parameter needs to be adjusted during inline
versioning, as in:

remap_gimple_stmt()

{
  ...
  if (id->param_body_adjs)
{
  ...
  if (!gimple_seq_empty_p (extra_stmts))
{
  memset (&wi, 0, sizeof (wi));
  wi.info = id;
  for (gimple_stmt_iterator egsi = gsi_start (extra_stmts);
   !gsi_end_p (egsi);
   gsi_next (&egsi))
walk_gimple_op (gsi_stmt (egsi), remap_gimple_op_r, &wi);
 ^

...
}
}
   ...
}

[Bug tree-optimization/107818] New: Overflow of linemap breaks its chronological order

2022-11-22 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107818

Bug ID: 107818
   Summary: Overflow of linemap breaks its chronological order
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Large source code might exceed the representation space of the linemap. When
this happens, UNKNOWN_LOCATION (0) would be inserted at the end of the linemap.
This breaks the ordering constraint on the map, which requires that all logical
locations contained in it remain chronologically ordered, so that binary search
can be used (see the comments in linemap_ordinary_map_lookup).

In the function "linemap_add":

  ...

  if (start_location >= LINE_MAP_MAX_LOCATION)
/* We ran out of line map space.   */
start_location = 0;

  line_map_ordinary *map
= linemap_check_ordinary (new_linemap (set, start_location));
  ^^^
  UNKNOWN_LOCATION(0) is also added to linemap
  map->reason = reason;
  ...


Afterwards, a logical location would be mis-translated to a source line.

PR86872 only partially fixed the visible ICE due to linemap overflow, but left
this hidden issue.
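The ordering requirement can be illustrated with a small model (a sketch, not
GCC's linemap code; the numbers are illustrative): binary search over map start
locations silently misbehaves once a 0 is appended after larger values:

```python
import bisect

# Model of ordinary-map lookup: a map is found by binary search over the
# ascending start_location values of all ordinary maps.
starts = [10, 200, 3500]   # start_location of each ordinary map
starts.append(0)           # the overflow path appends UNKNOWN_LOCATION (0)

def lookup(loc):
    # Returns the start_location of the map assumed to contain loc.
    return starts[bisect.bisect_right(starts, loc) - 1]
```

A location such as 4000 should resolve to the map starting at 3500, but the
trailing 0 steers the binary search to the wrong entry.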

[Bug ipa/107670] New: Suspicious redundant code in ipa-cp

2022-11-13 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107670

Bug ID: 107670
   Summary: Suspicious redundant code in ipa-cp
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Function "ipa_prop_read_jump_functions()" in ipa-prop.c:

  ipa_check_create_node_params ();
  ipa_check_create_edge_args ();

These two are also called inside the immediately following statement:

  ipa_register_cgraph_hooks();



Function "ipa_fn_summary_read()" in ipa-fnsummary.cc:

  ipa-cp's hooks will be registered twice in this function. One registration is
done indirectly by ipa_prop_read_jump_functions(), and the other by a nearby
call to ipa_register_cgraph_hooks.

[Bug tree-optimization/107066] Field initialized before ctor is mis-optimized away by DSE

2022-09-28 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107066

--- Comment #3 from Feng Xue  ---
Got it. Thanks for that.

[Bug tree-optimization/107066] New: Field initialized before ctor is mis-optimized away by DSE

2022-09-28 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107066

Bug ID: 107066
   Summary: Field initialized before ctor is mis-optimized away by
DSE
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

By means of a user-defined new operator, it is possible that a field is
initialized before the constructor runs.

#include <stddef.h>

class A {
public:
  int f1;
  int f2;

  A() : f2(2) { }

  void *operator new(size_t size)
  {
  void *mem = ::operator new(size);
  A *obj = static_cast<A *>(mem);

  obj->f1 = 1;
  return obj;
  }

};

A* foo ()
{
  return new A();
}


The original gimple code of foo() is:

struct A * foo ()
{
  void * D.2444;
  void * _9;

   :
  _9 = operator new (8); 
  MEM[(struct A *)_9].f1 = 1;
  MEM[(struct A *)_9] ={v} {CLOBBER};
  MEM[(struct A *)_9].f2 = 2;
  return _9;

}

In gimple, there is a pseudo clobber statement marking the beginning of the
constructor code. Although the statement has no side effect, it is regarded as
a normal store by DSE when determining store redundancy. Consequently, DSE
thinks that "MEM[(struct A *)_9].f1 = 1" is killed by "MEM[(struct A *)_9]
={v} {CLOBBER}" and removes it. After the DSE pass, foo becomes:

struct A * foo ()
{
  void * D.2444;
  void * _9;

   :
  _9 = operator new (8);
  MEM[(struct A *)_9] ={v} {CLOBBER};
  MEM[(struct A *)_9].f2 = 2;
  return _9;

}

[Bug ipa/102513] [10/11/12 Regression] Many false positive warnings with recursive function

2022-02-14 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102513

--- Comment #10 from Feng Xue  ---
(In reply to Martin Jambor from comment #8)
> I am about to thest the following patch.  In longer-run, it would be better
> to never generate lattice values outside of the value_range but there is an
> ordering problem, we need the complete VR info before we can use it.  I plan
> to rearrange IPA-CP into making multiple passes over the lattice dependency
> graph and this should quite naturally be solved by doing this kind of
> resursive-value-generation only in second and later passes. 
> 
> 
> diff --git a/gcc/ipa-cp.cc b/gcc/ipa-cp.cc
> index 453e9c93cc3..cbbb8bbc80a 100644
> --- a/gcc/ipa-cp.cc
> +++ b/gcc/ipa-cp.cc
> @@ -6154,8 +6154,16 @@ decide_whether_version_node (struct cgraph_node *node)
> {
>   ipcp_value<tree> *val;
>   for (val = lat->values; val; val = val->next)
> -   ret |= decide_about_value (node, i, -1, val, &avals,
> -  &self_gen_clones);
> +   {
> + if (!plats->m_value_range.bottom_p ()
> + && !plats->m_value_range.m_vr.contains_p (val->value))
> +   {
> + gcc_checking_assert (val->self_recursion_generated_p ());
> + continue;
> +   }
> + ret |= decide_about_value (node, i, -1, val, &avals,
> +&self_gen_clones);
> +   }
> }
>  
>if (!plats->aggs_bottom)

Here is a complication: the value range for the recursion index might not be
easily computed, and this cannot prevent IPA-CP from generating useless copies.
The constraint on the recursion index comes from "block2[level][x]", not from
the value range deduced from the condition predicate (level > 0). Change the
case to hide the value range of "level", and we will get the same warning. So
in these circumstances, one way for us is to disable the warning for these
copies?

extern int block2[7][256];

extern unsigned G_START;
extern unsigned G_SCALE;

static int encode_block(int block2[7][256], unsigned level)
{
int best_score = 0;

for (unsigned x = G_START; x < G_SCALE * level; x++) {
int v = block2[1][x];
block2[level][x] = 0;
best_score  += v * v;
}

if (G_SCALE * level > G_START && best_score > 64) {
int score = 0;

score += encode_block(block2, level - 1);
score += encode_block(block2, level - 1);

if (score < best_score) {
best_score = score;
}
}

return best_score;
}

[Bug ipa/104377] Unreachable code in create_specialized_node of ipa-prop.c?

2022-02-14 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104377

--- Comment #4 from Feng Xue  ---
(In reply to Martin Jambor from comment #2)
> (In reply to Feng Xue from comment #1)
> > 
> > OK. I does missed something. Here we could not hold assumption that
> > ipcp_decision_stage() only sees raw cgraph node, since sometime in the
> > future some new ipa pass may be added prior to ipa-cp, and this pass
> > introduces clone node.
> 
> Right, initially IPA-SRA was developed as a pass before IPA-CP and it
> may well be that we decide to swap the order again.
> 
> > 
> > However, there is a questionable point about the code snippet
> > 
> >   if (!node->can_change_signature
> >   || old_adj->op != IPA_PARAM_OP_COPY
> >   || (!known_csts[old_adj->base_index]
> >   && ipa_is_param_used (info, old_adj->base_index)))
> > 
> > In ipa-cp, known_csts is for the node, has no relation to the node's origin
> > node, but here it is accessed via index of the latter (old_adj->base_index),
> > will this cause out-of-bound error?
> 
> I think the code is correct. Assume IPA-SRA running before IPA-CP, and
> we're compiling a function with two argument, with indices 0 and 1.
> 
> Analysis phases of both passes run before the IPA (WPA) phases of
> either.  This is important to keep in mind.
> 
> IPA SRA removes the first one with index zero as useless, IPA-CP wants
> to remove the second one with index 1, possibly because it is constant
> everywhere.  In oder to that it has to combine the pre-existing
> adjustments with its own changes.
> 
> Before create_specialized_node, the pass checks whether previous
> passes did not kill some parameters and stops caring about them, but
> it does not re-index anything, all lattices, jump functions,
> everything, still keep their positions (and thus indices) they got in
> the analysis phase.
> 
> Then create_specialized_node hits this loop.  For i=0 encounters an
> old_adj element that actually describes the parameter which originally
> had index 1.  The pass looks up in base_index what the original
> (pre-IPA) index of the parameter was (1) and uses those to look up in
> its own structures whether it wants to remove it or not.
> 
> Bounds should be always OK, base_index should never be greater than
> the original pre-IPA number of parameters (-1) and known_csts should
> always have just as many parameters.
> 
> Does that make sense?

Yes. Thanks for your explanation.

[Bug tree-optimization/102513] [10/11/12 Regression] Many false positive warnings with recursive function

2022-02-05 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102513

Feng Xue  changed:

   What|Removed |Added

 CC||fxue at os dot amperecomputing.com

--- Comment #7 from Feng Xue  ---
(In reply to Jakub Jelinek from comment #6)
> It is in
>   /* Recursively generate lattice values with a limited count.  */
>   FOR_EACH_VEC_ELT (val_seeds, i, src_val)
> {
>   for (int j = 1; j < max_recursive_depth; j++)
> {
>   tree cstval = get_val_across_arith_op (opcode, opnd1_type,
> opnd2,
>  src_val, res_type);
>   if (!cstval
>   || !ipacp_value_safe_for_type (res_type, cstval))
> break;
> 
>   ret |= dest_lat->add_value (cstval, cs, src_val, src_idx,
>   src_offset, _val, j);
>   gcc_checking_assert (src_val);
> }
> }
> (but there is another spot doing the similar thing) where it would be nice
> to also break if cstval is non-NULL and safe for type, but is outside of the
> value range.  I have no idea how to get from this spot at that value range
> though.

By default, ipcp is told to clone a recursive function 8 times, which exceeds
the value space of the index in this case. We could rely on the ipa fnsummary
condition predicate of a call to avoid generating never-executed copies. I will
take it.

[Bug ipa/104377] Unreachable code in create_specialized_node of ipa-prop.c?

2022-02-04 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104377

--- Comment #1 from Feng Xue  ---
(In reply to Feng Xue from comment #0)
> For function create_specialized_node(), the "node" to operated on seems
> always to be an original cgraph node, never a clone node. From call graph
> related to the function, we know that ipcp_decision_stage () only passes raw
> cgraph node downwards to its callees. Then, "node" reaching
> create_specialized_node() would not be a clone, so the code enclosed by "if
> (old_adjustments)" might be of no use. But I am not sure sure if there is
> some thing that I missed.
> 
> ipcp_driver
>  |
>  '--> ipcp_decision_stage
>|
>'--> decide_whether_version_node
>  |
>  |--> decide_about_value
>  | |
>  '-'--> create_specialized_node


OK, I did miss something. Here we cannot hold the assumption that
ipcp_decision_stage() only sees raw cgraph nodes, since sometime in the future
a new ipa pass may be added prior to ipa-cp, and that pass may introduce clone
nodes.

However, there is a questionable point about the code snippet

  if (!node->can_change_signature
  || old_adj->op != IPA_PARAM_OP_COPY
  || (!known_csts[old_adj->base_index]
  && ipa_is_param_used (info, old_adj->base_index)))

In ipa-cp, known_csts is for the node and has no relation to the node's origin
node, but here it is accessed via an index of the latter (old_adj->base_index);
will this cause an out-of-bound error?

[Bug ipa/104377] New: Unreachable code in create_specialized_node of ipa-prop.c?

2022-02-03 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104377

Bug ID: 104377
   Summary: Unreachable code in create_specialized_node of
ipa-prop.c?
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

For function create_specialized_node(), the "node" to operated on seems always
to be an original cgraph node, never a clone node. From call graph related to
the function, we know that ipcp_decision_stage () only passes raw cgraph node
downwards to its callees. Then, "node" reaching create_specialized_node() would
not be a clone, so the code enclosed by "if (old_adjustments)" might be of no
use. But I am not sure sure if there is some thing that I missed.

ipcp_driver
 |
 '--> ipcp_decision_stage
   |
   '--> decide_whether_version_node
 |
 |--> decide_about_value
 | |
 '-'--> create_specialized_node

[Bug tree-optimization/103786] New: Suspicious code in verify_type

2021-12-21 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103786

Bug ID: 103786
   Summary: Suspicious code in verify_type
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

tree ct = TYPE_CANONICAL (t);
  if (!ct)
;
  else if (TYPE_CANONICAL (t) != ct)
   ^ should be ct?

[Bug tree-optimization/100802] VRP fails to fold comparison using known value orders

2021-11-15 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100802

Feng Xue  changed:

   What|Removed |Added

 Status|RESOLVED|VERIFIED

--- Comment #2 from Feng Xue  ---
Verified.

[Bug bootstrap/102681] [12 Regression] AArch64 bootstrap failure

2021-10-14 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102681

--- Comment #4 from Feng Xue  ---
(In reply to Martin Sebor from comment #3)
> Simply initializing the variable as in the patch below avoids the warning. 
> The control flow in the code is sufficiently opaque to make it worthwhile
> from a readability point irrespective of whether or not the variable can, in
> fact, be used uninitialized.
> 
> index e50d3fc3b62..c7f0a405ff6 100644
> --- a/gcc/calls.c
> +++ b/gcc/calls.c
> @@ -199,7 +199,7 @@ stack_region_maybe_used_p (poly_uint64 lower_bound,
> poly_uint64 upper_bound,
>  static void
>  mark_stack_region_used (poly_uint64 lower_bound, poly_uint64 upper_bound)
>  {
> -  unsigned HOST_WIDE_INT const_lower, const_upper;
> +  unsigned HOST_WIDE_INT const_lower, const_upper = 0;
>const_lower = constant_lower_bound (lower_bound);
>if (upper_bound.is_constant (_upper))
>  for (unsigned HOST_WIDE_INT i = const_lower; i < const_upper; ++i)

This code change looks good; the warning seems to be overkill.
Will this change be checked in as a fix?

[Bug tree-optimization/102681] New: AArch64 bootstrap failure

2021-10-10 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102681

Bug ID: 102681
   Summary: AArch64 bootstrap failure
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

This occurs after commit "Loosen loop crossing restriction in threader"
(ec0124e0acb556cdf5dba0e8d0ca6b69d9537fcc).

In function ‘void mark_stack_region_used(poly_uint64, poly_uint64)’,
inlined from ‘rtx_def* emit_library_call_value_1(int, rtx, rtx,
libcall_type, machine_mode, int, rtx_mode_t*)’ at ../../gcc/calls.c:4536:29:
../../gcc/calls.c:206:26: error: ‘const_upper’ may be used uninitialized in
this function [-Werror=maybe-uninitialized]
  206 |   stack_usage_map[i] = 1;
  |   ~~~^~~
../../gcc/calls.c: In function ‘rtx_def* emit_library_call_value_1(int, rtx,
rtx, libcall_type, machine_mode, int, rtx_mode_t*)’:
../../gcc/calls.c:202:39: note: ‘const_upper’ was declared here
  202 |   unsigned HOST_WIDE_INT const_lower, const_upper;
  |   ^~~

[Bug tree-optimization/102451] Suspicious null-pointer dereference in delete_dead_or_redundant_call

2021-09-23 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102451

Feng Xue  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Feng Xue  ---
Fixed

[Bug tree-optimization/102400] Field might miss initialization in vn_reference_insert_pieces()

2021-09-23 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102400

Feng Xue  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #4 from Feng Xue  ---
Fixed

[Bug tree-optimization/102451] Suspicious null-pointer dereference in delete_dead_or_redundant_call

2021-09-22 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102451

--- Comment #2 from Feng Xue  ---
(In reply to Richard Biener from comment #1)
> Confirmed.  Mind fixing it by recording the basic-block index before
> removing/replacing?

OK.

[Bug tree-optimization/102451] New: Suspicious null-pointer dereference in delete_dead_or_redundant_call

2021-09-22 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102451

Bug ID: 102451
   Summary: Suspicious null-pointer dereference in
delete_dead_or_redundant_call
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

In the code snippet:

  if (lhs)
{
   
  if (gsi_replace (gsi, new_stmt, true))
bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);
}
  else
{
  
  if (gsi_remove (gsi, true))
bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);
}

It is known that "gsi" points to "stmt", and gsi_replace/gsi_remove will clear
the basic block of "stmt". Then bitmap_set_bit will dereference "index" from a
null bb.
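One possible shape of the fix suggested in comment #1 (a sketch against GCC
internals, not the committed patch) is to cache the block before the statement
is detached:

```
  /* Record the basic block before gsi_replace/gsi_remove detach stmt
     and clear its gimple_bb.  */
  basic_block bb = gimple_bb (stmt);

  if (lhs)
    {
      ...
      if (gsi_replace (gsi, new_stmt, true))
        bitmap_set_bit (need_eh_cleanup, bb->index);
    }
  else
    {
      ...
      if (gsi_remove (gsi, true))
        bitmap_set_bit (need_eh_cleanup, bb->index);
    }
```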

[Bug tree-optimization/102400] New: Field might miss initialization in vn_reference_insert_pieces()

2021-09-18 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102400

Bug ID: 102400
   Summary: Field might miss initialization in
vn_reference_insert_pieces()
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

There is no initialization for vr1->result_vdef, and it may contain a non-zero
garbage value.

[Bug c++/102214] New: ICE when compiling local class with -fno-weak

2021-09-06 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102214

Bug ID: 102214
   Summary: ICE when compiling local class with -fno-weak
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Created attachment 51413
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51413&action=edit
testcase

Compiling attached testcase with option "-flto -O3 -fno-weak" will get:

during IPA pass: *free_lang_data
ic.cpp:5:11: internal compiler error: in mangle_decl, at cp/mangle.c:4088   
5 | class InnerParent { 
  |   ^~~   
0xbe3ef2 mangle_decl(tree_node*)
../../gcc/cp/mangle.c:4088  
0x17b0ac4 decl_assembler_name(tree_node*)
../../gcc/tree.c:710
0x17bd358 assign_assembler_name_if_needed(tree_node*)
../../gcc/tree.c:6318
0x17bd471 free_lang_data_in_cgraph
../../gcc/tree.c:6366
0x17bd606 free_lang_data
../../gcc/tree.c:6411
0x17bd7ac execute
../../gcc/tree.c:6483

[Bug ipa/101502] Inconsistent behavior in maybe_record_node()

2021-07-23 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101502

--- Comment #2 from Feng Xue  ---
(In reply to Martin Liška from comment #1)
> @Honza: Can you please take a look?
> @Feng: Do you have a test-case for it, please?

No. I just got this from code logic, but not 100% sure.

[Bug ipa/101502] New: Inconsistent behavior in maybe_record_node()

2021-07-19 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101502

Bug ID: 101502
   Summary: Inconsistent behavior in maybe_record_node()
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

For the code snippet in maybe_record_node(), behavior might be inconsistent
when SANITIZE_UNREACHABLE is turned on. Suppose we are adding two functions
(method_fn/__cxa_pure_virtual); the content of nodes is order-sensitive.

  o. method_fn, __cxa_pure_virtual ->  nodes: [ method_fn ]
  o. __cxa_pure_virtual, method_fn ->  nodes: [ __cxa_pure_virtual, method_fn ]

BTW: the assertion on target_node->real_symbol_p() is redundant, since the
condition is already checked by the enclosing "else if".

  else if (target_node != NULL
   && (TREE_PUBLIC (target)
   || DECL_EXTERNAL (target)
   || target_node->definition)
   && target_node->real_symbol_p ())
{
  gcc_assert (!target_node->inlined_to);
  gcc_assert (target_node->real_symbol_p ());
  /* When sanitizing, do not assume that __cxa_pure_virtual is not called
 by valid program.  */
  if (flag_sanitize & SANITIZE_UNREACHABLE)
;
  /* Only add pure virtual if it is the only possible target.  This way
 we will preserve the diagnostics about pure virtual called in many
 cases without disabling optimization in other.  */
  else if (pure_virtual)
{
  if (nodes.length ())
return;
}
  /* If we found a real target, take away cxa_pure_virtual.  */
  else if (!pure_virtual && nodes.length () == 1
   && is_cxa_pure_virtual_p (nodes[0]->decl))
nodes.pop ();
  if (pure_virtual && nodes.length ())
return;
  if (!inserted->add (target))
{
  cached_polymorphic_call_targets->add (target_node);
  nodes.safe_push (target_node);
}
}
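The order-sensitivity can be checked with a small Python model of the quoted
control flow (a simplification for illustration, not the actual GCC logic):

```python
def maybe_record(nodes, name, pure_virtual, sanitize=True):
    """Simplified model of maybe_record_node's pure-virtual handling."""
    if sanitize:
        pass                        # pure-virtual special-casing is skipped
    elif pure_virtual:
        if nodes:
            return
    elif len(nodes) == 1 and nodes[0] == "__cxa_pure_virtual":
        nodes.pop()                 # a real target displaces cxa_pure_virtual
    if pure_virtual and nodes:      # note: still reached when sanitizing
        return
    nodes.append(name)
```

With sanitize=True, adding method_fn then __cxa_pure_virtual yields
[method_fn], while the reverse order keeps both entries, matching the two
orderings listed above: the trailing "if (pure_virtual && nodes.length ())"
check is still reached even when the sanitizer branch skips the special
casing.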

[Bug tree-optimization/100802] New: VRP fails to fold comparison with known value orders

2021-05-27 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100802

Bug ID: 100802
   Summary: VRP fails to fold comparison with known value orders
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

In the case below, it is obvious that we have the value order i < a <= b inside
the loop (i < a from the loop bound and a <= b from the guard, hence i < b). So
"if (i >= b)" could be optimized away, but VRP fails to do that.


int f1();
int f2();

int foo (unsigned a, unsigned b)
{
   if (a <= b)
{
   for (unsigned i = 0; i < a; i++)
 {
f1 ();

if (i >= b)
  f2 ();
 }

}

  return 0;
}

[Bug c++/100580] New: ICE with -fdump-passes

2021-05-12 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100580

Bug ID: 100580
   Summary: ICE with -fdump-passes
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Created attachment 50805
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50805&action=edit
Source file

Reproduce ICE using the command:

g++ -fpreprocessed TimeControl.ii -std=c++98 -fdump-passes

[Bug tree-optimization/100222] New: Redundant mark_irreducible_loops () in predicate.c

2021-04-22 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100222

Bug ID: 100222
   Summary: Redundant mark_irreducible_loops () in predicate.c
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Found two places where mark_irreducible_loops() is redundantly called after
loop_optimizer_init (LOOPS_NORMAL), since "LOOPS_NORMAL" already includes
"LOOPS_HAVE_MARKED_IRREDUCIBLE_REGIONS". The calls are in
pass_profile::execute () and report_predictor_hitrates().

[Bug ipa/99951] Dead return value after modify_call() is not released

2021-04-07 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99951

--- Comment #2 from Feng Xue  ---
Can we report an error in verify_ssa() when a non-default SSA name's defining
statement has a NULL bb, which always indicates the statement has been removed
somewhere?

[Bug ipa/99951] New: Dead return value after modify_call() is not released

2021-04-07 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99951

Bug ID: 99951
   Summary: Dead return value after modify_call() is not released
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

By ipa_param_adjustments::modify_call(), a call statement with a return value
might be replaced with a new void-returning call. But the original return-value
SSA name is not released, which leaves garbage SSA names behind when traversing
with FOR_EACH_SSA_NAME.

In the following code, all uses of stmt's lhs are redirected to a new one, and
it becomes unused, but is never released from SSA name table.

  if (tree lhs = gimple_call_lhs (stmt))
{
  if (!m_skip_return)
gimple_call_set_lhs (new_stmt, lhs);
  else if (TREE_CODE (lhs) == SSA_NAME)
{
  /* LHS should now be a default-def SSA.  Unfortunately default-def
 SSA_NAMEs need a backing variable (or at least some code examining
 SSAs assumes it is non-NULL).  So we either have to re-use the
 decl we have at hand or introduce a new one.  */
  tree repl = create_tmp_var (TREE_TYPE (lhs), "removed_return");
  repl = get_or_create_ssa_default_def (cfun, repl);
  SSA_NAME_IS_DEFAULT_DEF (repl) = true;
  imm_use_iterator ui;
  use_operand_p use_p;
  gimple *using_stmt;
  FOR_EACH_IMM_USE_STMT (using_stmt, ui, lhs)
{
  FOR_EACH_IMM_USE_ON_STMT (use_p, ui)
{
  SET_USE (use_p, repl);
}
  update_stmt (using_stmt);
}
}
}

[Bug ipa/98815] Redundant free_dominance_info in cgraph_node::analyze()

2021-01-25 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98815

--- Comment #2 from Feng Xue  ---
If we step into free_dominance_info(dir), it is a wrapper around
free_dominance_info (cfun, dir), which means it assumes a non-NULL "cfun".
Additionally, walking through the call chain of free_dominance_info():

   free_dominance_info ()
 -> if (!dom_info_available_p ())
   -> dom_info_state (fn, dir) != DOM_NONE
  ->  if (!fn->cfg)
return DOM_NONE;

This shows that when "fn->cfg" is NULL, free_dominance_info() does nothing.
In summary, if the conditional check "cfun && fn->cfg" is true, the two calls
to free_dominance_info() in cgraph_node::analyze() are redundant, and if the
check is false, the calls are no-ops. So we could remove those calls in
cgraph_node::analyze(), not in execute_pass_list().

[Bug tree-optimization/98815] New: Redundant free_dominance_info in cgraph_node::analyze()

2021-01-25 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98815

Bug ID: 98815
   Summary: Redundant free_dominance_info in
cgraph_node::analyze()
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Since execute_pass_list () already clears dominance information, the two
free_dominance_info() calls after the line "execute_pass_list (cfun,
g->get_passes ()->all_lowering_passes)" seem to be redundant and could be
removed.

[Bug rtl-optimization/98782] IRA artificially creating spills due to BB frequencies

2021-01-22 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

Feng Xue  changed:

   What|Removed |Added

 CC||fxue at os dot 
amperecomputing.com

--- Comment #1 from Feng Xue  ---
The value "foo + 1024" is spilled in both cases, but in different ways. In the
bad case, it is spilled outside the loop and only reloaded inside; in the good
case, the spill/reload pair occurs around the call to "bar", which might also
account for the extra cost of using caller-saved registers. It seems that IRA
has two different logics for handling spilling.

[Bad case]
foo:
stp x29, x30, [sp, -80]!
mov w5, 753
mov x29, sp
stp x19, x20, [sp, 16]
mov x19, x1
mul w1, w0, w5
stp x21, x22, [sp, 32]
mov w22, 5271
add w2, w1, 7
mov w21, w5
mul w3, w0, w22
mov w20, 760
mov w22, 0
str w0, [sp, 76]
add x0, x19, 1024
str x0, [sp, 64]  // Spill (foo + 1024)
.p2align 3,,7
.L5:
        ldrb    w0, [x19]
cbz w0, .L2
ldr w0, [sp, 76]
stp w1, w2, [sp, 56]
str w3, [sp, 72]
bl  bar
        ldrb    w0, [x19, 1]!
ldp w1, w2, [sp, 56]
add w21, w21, w0
ldr w3, [sp, 72]
mul w20, w20, w0
ldr x0, [sp, 64] // Reload (foo + 1024)
add w22, w22, w20
cmp x19, x0
bne .L5
b   .L4
.p2align 2,,3
.L2:
        ldrb    w0, [x19, 1]!
add w21, w21, w0
mul w20, w20, w0
ldr x0, [sp, 64] // Reload (foo + 1024)
add w22, w22, w20
cmp x0, x19
bne .L5
.L4:
add w0, w20, w21
add w0, w0, w22
ldp x19, x20, [sp, 16]
ldp x21, x22, [sp, 32]
ldp x29, x30, [sp], 80
ret

[Good case:]
foo:
stp x29, x30, [sp, -80]!
mov w5, 753
add x7, x1, 1024
mul w2, w0, w5
mov x29, sp
stp x21, x22, [sp, 32]
mov w21, 5271
mov w22, w5
stp x19, x20, [sp, 16]
mov x19, x1
mul w3, w0, w21
stp w2, w0, [sp, 72]  // Spill x(%w0)
add w2, w2, 7 // t2(%w2)
mov w21, 0
mov w20, 760
.p2align 3,,7
.L5:
        ldrb    w0, [x19]
cbz w0, .L2
ldp w1, w0, [sp, 72]  // Reload x 
stp w2, w3, [sp, 56]  // Spill t2
str x7, [sp, 64]  // Spill (foo + 1024)
bl  bar
        ldrb    w0, [x19, 1]!
ldr x7, [sp, 64]  // Reload (foo + 1024)
add w22, w22, w0
ldp w2, w3, [sp, 56]  // Reload t2
mul w20, w20, w0
add w21, w21, w20
cmp x19, x7
bne .L5
b   .L4
.p2align 2,,3
.L2:
        ldrb    w0, [x19, 1]!
add w22, w22, w0
mul w20, w20, w0
add w21, w21, w20
cmp x7, x19
bne .L5
.L4:
add w0, w20, w22
add w0, w0, w21
ldp x19, x20, [sp, 16]
ldp x21, x22, [sp, 32]
ldp x29, x30, [sp], 80
ret

Even in the good case, we could expect better spill/reload generation.
Referring to the comments above, "x" and "t2" are similar (both loop
invariant), but they are handled differently. Spilling "t2" inside the loop is
worse than spilling it outside, as IRA does for "x".

Both issues could be correlated to the same thing.

[Bug tree-optimization/97066] [11 Regression] regression caused by r11-3207

2020-09-16 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97066

--- Comment #1 from Feng Xue  ---
Both cases will be simplified by the new pattern, and this is correct. I will
change the test code so that this does not happen.

[Bug tree-optimization/94234] missed ccp folding for (addr + 8 * n) - (addr + 8 * (n - 1))

2020-09-15 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94234

Feng Xue  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #6 from Feng Xue  ---
Fixed.

[Bug tree-optimization/92712] [8/9 Regression] Performance regression with assumed values

2020-09-01 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92712

Feng Xue  changed:

   What|Removed |Added

 CC||fxue at os dot 
amperecomputing.com

--- Comment #24 from Feng Xue  ---
(In reply to Jakub Jelinek from comment #15)
> For
> A*B+A*C -> (B+C)*A
> the problematic cases are
> A==-1 and B > 0 and C==(max-B)+1 (i.e. when B+C overflows to min)
> or A==0 and B < 0 and C<min-B or A==0 and B > 0 and C>max-B
> (last two cases cover when B+C overflows)
> For
> A*B-A*C -> (B-C)*A
> the problematic cases are
> A==-1 and B > 0 and C==min+B (i.e. when B-C is min)
> or A==0 and B < -1 and C>B-min
> or A==0 and B >= 0 and C<B-max (last two cases cover when B-C overflows)
> Again, we perform the operation right now if A is not 0 and not -1 for
> certain.
> I guess we could handle those cases by using something like
> check_for_binary_op_overflow, except that for the case where A might be -1
> and plusminus equal to MINUS_EXPR we also need to make sure the result of
> B-C is known not to be min.

Another point: if B+-C can be folded to an existing gimple value, can we
deduce that B+-C does not overflow?

[Bug tree-optimization/94352] New: Suspicious code in tree-ssa-propagate.c

2020-03-27 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94352

Bug ID: 94352
   Summary: Suspicious code in tree-ssa-propagate.c
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

In function ssa_propagation_engine::ssa_propagate(), the call chain below
might lead to a reference to uninitialized data.

 ssa_propagate() -> ssa_prop_init() -> add_control_edge() -> "curr_order"

 But ssa_prop_init() is called before "curr_order" is initialized to 0.

[Bug tree-optimization/94234] New: missed ccp folding for (addr + 8 * n) - (addr + 8 * (n - 1))

2020-03-20 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94234

Bug ID: 94234
   Summary: missed ccp folding for (addr + 8 * n) - (addr + 8 * (n
- 1))
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

CCP cannot fold the following expression to the constant "8":

size_t foo (char *a, size_t n)
{
  char *b1 = a + 8 * n;
  char *b2 = a + 8 * (n - 1);

  return b1 - b2;
}

But if we change b1 and b2 to an integer type, the folding happens.

size_t foo (char *a, size_t n)
{
  size_t b1 = (size_t)(a + 8 * n);
  size_t b2 = (size_t)(a + 8 * (n - 1));

  return b1 - b2;
}

[Bug tree-optimization/93023] New: give preference to address iv without offset in ivopts

2019-12-20 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93023

Bug ID: 93023
   Summary: give preference to address iv without offset in ivopts
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

From address IVs with the same base and index, ivopts always picks the one
with a non-zero offset. This incurs no extra cost on an architecture like x86,
which has an LEA instruction to fold the offset into the address computation,
but on ARM one more add-with-offset instruction is required.

   X86:   lea addr_reg, base[index + offset]

   ARM:   add addr_reg, base, index
  add addr_reg, addr_reg, offset

So choosing the IV without an offset can save one instruction in most
situations. Here is an example; compile it on aarch64.

  int data[100];
  int fn1 ();

  void fn2 (int b, int n)
  {
int i;

for (i = 0; i < n; i++, b++)
  {
data[b + 10] = 1;
fn1 ();
data[b + 3] = 2;
  }
  }

Analysis of ivopts shows that those address IVs have the same in-loop cost,
and the IV without an offset does have a smaller pre-loop setup cost. But since
the setup cost is averaged over the iterations, the minor cost difference
vanishes due to round-off in integer division. To fix this round-off error, the
cost could be represented more accurately, for instance by adding a fractional
part to make it a fixed-point number.

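A toy sketch (hypothetical numbers, not ivopts' actual cost units) of how a
fractional part preserves the ordering that plain integer division loses:

```c
#include <assert.h>

/* Average a pre-loop setup cost over the iteration count.  With plain
   integer division, two slightly different setup costs can collapse to
   the same per-iteration value; keeping an 8-bit fraction (a fixed-point
   representation) preserves the ordering.  */
#define FRAC_BITS 8

static int avg_int (int setup, int niter) { return setup / niter; }
static int avg_fix (int setup, int niter) { return (setup << FRAC_BITS) / niter; }

int main (void)
{
  /* Two candidate IVs: setup cost 4 (no offset) vs 5 (with offset).  */
  assert (avg_int (5, 3) == avg_int (4, 3));  /* round-off: both are 1 */
  assert (avg_fix (5, 3) >  avg_fix (4, 3));  /* fixed point: 426 > 341 */
  return 0;
}
```
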
[Bug tree-optimization/93004] New: Suspicious code in tree-ssa-loop-ivopts.c

2019-12-18 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93004

Bug ID: 93004
   Summary: Suspicious code in tree-ssa-loop-ivopts.c
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

In function determine_group_iv_cost_address(),

  /* Uses in a group can share setup code, so only add setup cost once.  */
  cost -= cost.scratch;
  /* Compute and add costs for rest uses of this group.  */
  for (i = 1; i < group->vuses.length () && !sum_cost.infinite_cost_p (); i++)
{
  struct iv_use *next = group->vuses[i];

  /* TODO: We could skip computing cost for sub iv_use when it has the
 same cost as the first iv_use, but the cost really depends on the
 offset and where the iv_use is.  */
      cost = get_computation_cost (data, next, cand, true,
                                   NULL, &can_autoinc, &inv_expr);
if (inv_expr)
  {
if (!inv_exprs)
  inv_exprs = BITMAP_ALLOC (NULL);

bitmap_set_bit (inv_exprs, inv_expr->id);
  }
  sum_cost += cost;
}

The change to "cost" ahead of the loop is meaningless, since "cost" is
overwritten inside the loop. According to the comment, I guess
"cost -= cost.scratch" should be placed after "cost = get_computation_cost()".

[Bug tree-optimization/92862] New: Suspicious codes in tree-ssa-loop-niter.c and predict.c

2019-12-08 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92862

Bug ID: 92862
   Summary: Suspicious codes in tree-ssa-loop-niter.c and
predict.c
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Function loop_only_exit_p() in tree-ssa-loop-niter.c:

  for (i = 0; i < loop->num_nodes; i++)
{
      for (bsi = gsi_start_bb (body[i]); !gsi_end_p (bsi); gsi_next (&bsi))
if (stmt_can_terminate_bb_p (gsi_stmt (bsi)))
  {
return true;
  }
}

Shouldn't we return false here, not true?


Function predict_paths_leading_to_edge() in predict.c:

 FOR_EACH_EDGE (e2, ei, bb->succs)
if (e2->dest != e->src && e2->dest != e->dest
&& !unlikely_executed_edge_p (e)
&& !dominated_by_p (CDI_POST_DOMINATORS, e->src, e2->dest))
  {
has_nonloop_edge = true;
break;
  }

"e" is invariant within this loop; I guess that unlikely_executed_edge_p (e)
should be unlikely_executed_edge_p (e2), which is more reasonable. Similar
code in predict_paths_for_bb () uses the latter.

[Bug tree-optimization/92839] New: Normalize memory address to same base in non-loop code

2019-12-06 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92839

Bug ID: 92839
   Summary: Normalize memory address to same base in non-loop code
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

If possible, IVOPTs can transform memory accesses to make them use the same
base, which decreases register pressure in the loop. But it does not handle
memory addresses that are not IVs, or addresses in pure non-loop code.

Consider the following example, compiled with -O3: the data[b + ?] accesses
use the same base register. But if we remove the line "for (i = 0; i < n; i++,
b++)" to make this non-loop code, the data[b + ?] accesses use different
bases, with the result that we consume 10 registers as bases, a much higher
register pressure. And since the live ranges of these base registers cross a
function call, most of them will be spilled to memory.

int data[100];

int fn();

void foo(int b, int n)
{
  int i;

  for (i = 0; i < n; i++, b++)  // remove the statement to be non-loop 
{
  data[b + 1] += 1;
  data[b + 3] += 3;
  data[b + 5] += 5;
  data[b + 7] += 7;
  data[b + 9] += 9;
  data[b + 11] += 11;
  data[b + 13] += 13;
  data[b + 15] += 15;
  data[b + 17] += 17;
  data[b + 19] += 19;

  fn ();

  data[b + 1] -= 1;
  data[b + 3] -= 3;
  data[b + 5] -= 5;
  data[b + 7] -= 7;
  data[b + 9] -= 9;
  data[b + 11] -= 11;
  data[b + 15] -= 13;
  data[b + 17] -= 15;
  data[b + 19] -= 19;
}
}

[Bug ipa/92133] Support multi versioning on self recursive function

2019-12-02 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92133

--- Comment #11 from Feng Xue  ---
--param ipa-cp-eval-threshold=1 --param ipcp-unit-growth=80 is enough.

[Bug ipa/92133] Support multi versioning on self recursive function

2019-12-02 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92133

--- Comment #9 from Feng Xue  ---
Ok. For any followups on this, I'll create new tracker.

[Bug tree-optimization/92689] Improve stmt_may_clobber_ref_p_1 on constant memory reference

2019-11-27 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92689

--- Comment #6 from Feng Xue  ---
Good case. I did missed something, a const pointer does not imply it is
restrict and for a real const data, we can even create a non-const pointer
alias to it by using explicit type cast.

[Bug tree-optimization/92689] Improve stmt_may_clobber_ref_p_1 on constant memory reference

2019-11-27 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92689

--- Comment #2 from Feng Xue  ---
int fn();

int goo(const int *p)
{
  return fn();
}

int data;

int foo (const int *p)
{
  int i;
  int t;

  data = *p; 

  for (i = 0; i < 100; i++)
{
  int t = *p + 1;

  goo ();
}

  return *p;
}

Compile the above with -O3 -fno-early-inlining -fno-inline
-fno-partial-inlining, add a breakpoint at stmt_may_clobber_ref_p_1(), and you
will find that the call stmt_may_clobber_ref_p_1 (stmt/* data = *p; */, *p)
returns true.

[Bug tree-optimization/92689] New: Improve stmt_may_clobber_ref_p_1 on constant memory reference

2019-11-27 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92689

Bug ID: 92689
   Summary: Improve stmt_may_clobber_ref_p_1 on constant memory
reference
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

If ao_ref represents a constant data, stmt_may_clobber_ref_p_1() and
call_may_clobber_ref_p() can return false. But not sure there is any subtlety
behind it, which I missed.

[Bug ipa/91682] IPA-cp can not propagate value for by-ref argument in form of *arg = param op constant

2019-11-13 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91682

Feng Xue  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Feng Xue  ---
Fixed and close it.

[Bug ipa/91088] IPA-cp cost evaluation is too conservative for "if (f(param) cmp const_val)" condition

2019-11-13 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91088

Feng Xue  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Feng Xue  ---
Fixed and close it.

[Bug ipa/91089] IPA-cp does not setup proper cost model for switch default case in function versioning

2019-11-13 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91089

Feng Xue  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Feng Xue  ---
Fixed and close it.

[Bug ipa/92133] Support multi versioning on self recursive function

2019-11-03 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92133

--- Comment #5 from Feng Xue  ---
(In reply to Thomas Koenig from comment #4)
> Author: tkoenig
> Date: Sun Nov  3 22:33:53 2019
> New Revision: 277760
> 
> URL: https://gcc.gnu.org/viewcvs?rev=277760&root=gcc&view=rev
> Log:
> 2019-11-03  Thomas Koenig  
> 
> PR fortran/92133
> * trans-decl.c (gfc_get_symbol_decl): If __def_init actually
> contains a value, put it into  the read-only section.
> 
> 
> Modified:
> trunk/gcc/fortran/ChangeLog
> trunk/gcc/fortran/trans-decl.c

Your patch's tracker number should be 92113, not 92133.

[Bug ipa/92133] Support multi versioning on self recursive function

2019-10-22 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92133

--- Comment #2 from Feng Xue  ---
(In reply to Martin Liška from comment #1)
> Let me take a look.

I've created a patch
(https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01260.html), could you take a
time to review it?

[Bug ipa/92133] New: Support multi versioning on self recursive function

2019-10-17 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92133

Bug ID: 92133
   Summary: Support multi versioning on self recursive function
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

For recursive function, IPA does not allow constant propagation on parameter
that is used to control recursion. It is common that the parameter is inc/dec
(arithmetic op) a constant before entering next recursion. The following
example gives a general source code pattern.

int recur_fn (int i)
{
  if (i == 6)
{
  do_work ();
  return 0;
}

  do_prepare ();

  recur_fn (i + 1);

  do_post ();

  return 0;
}

int foo()
{
  ...
  recur_fn (1);
  ...
}

A straightforward optimization is to duplicate recur_fn () 6 times, which
constitutes a call chain of the specialized copies:

  recur_fn() -> recur_fn() -> ... -> recur_fn()

[Bug ipa/91682] New: IPA-cp can not propagate value for by-ref argument in form of *arg = param op constant

2019-09-06 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91682

Bug ID: 91682
   Summary: IPA-cp can not propagate value for by-ref argument in
form of *arg = param op constant
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Current IPA only supports CP on a by-ref argument assigned a constant. But if
the value of the argument is computed via a unary/binary operation on the
caller's parameter, it fails to do that. For example:

   struct S
 {
   int a, b;
 };

   void callee2 (struct S *s)
 {
// use s->a;
// use s->b; 
 }

   void callee1 (int a, int *p)
 {
   struct S s;

   s.a = a;
   s.b = *p + 1;
   callee2 (&s);
 }

   void caller ()
 {
int a = 1;
int b = 2;

callee1 (a, &b);
 } 

In callee1(), we know that s.a equals 1 and s.b equals 3, so it would be
possible to propagate these values to callee2().

[Bug tree-optimization/91571] New: TBAA does not work for ao_ref created by ao_ref_init_from_ptr_and_size()

2019-08-27 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91571

Bug ID: 91571
   Summary: TBAA does not work for ao_ref created by
ao_ref_init_from_ptr_and_size()
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Given an SSA_NAME pointer, its type information is discarded by
ao_ref_init_from_ptr_and_size() when building an ao_ref from the pointer, so
TBAA for this kind of ao_ref is effectively disabled. Related code snippet:

  else
{
  gcc_assert (POINTER_TYPE_P (TREE_TYPE (ptr)));
  ref->base = build2 (MEM_REF, char_type_node,
  ptr, null_pointer_node);
  ..
}
  .
  ref->ref_alias_set = 0;
  ref->base_alias_set = 0;

If it were enabled, some optimizations relying on this could produce better
results. Here is an example for IPA-CP, but it is not the only one (others
include dse and strlen-opt).

struct A
{
   int a;
};

struct B
{
   int b;
};

void f2 (struct A *pa, struct B *pb)
{
   ...
}

void f1(int v, struct A *pa, struct B *pb)
{
  pa->a = 1;
  pb->b = v;

  f2 (pa, pb);
}

The ao_ref for f2's argument "pa" is set up by ao_ref_init_from_ptr_and_size();
AA cannot disambiguate pa->a and pb->b due to the loss of the original type
information, so IPA-CP reaches the conservative conclusion that the constant
"1" assigned to pa->a might not reach f2(pa, pb), and misses a chance for
constant propagation.

Checking the uses of ao_ref_init_from_ptr_and_size(), it is mainly used for
alias analysis on function call arguments. It would be better to enable TBAA
in this scenario, but I am not sure whether there is some special
consideration behind this that I missed.

[Bug ipa/91508] New: Segfault due to referencing removed cgraph_node

2019-08-21 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91508

Bug ID: 91508
   Summary: Segfault due to referencing removed cgraph_node
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Caught a segfault when compiling pr63766.C with a debug build of gcc.

pr63766.C:48:1: internal compiler error: Segmentation fault
   48 | }
  | ^
0x14ace78 crash_signal
../../gcc/toplev.c:326
0x8fd7c5 tree_check(tree_node*, char const*, int, char const*, tree_code)
../../gcc/tree.h:3256
0x1342850 do_per_function_toporder(void (*)(function*, void*), void*)
../../gcc/passes.c:1703
0x1345471 execute_ipa_pass_list(opt_pass*)
../../gcc/passes.c:2919
0xe10dba ipa_passes
../../gcc/cgraphunit.c:2480
0xe1120c symbol_table::compile()
../../gcc/cgraphunit.c:2618
0xe117d2 symbol_table::finalize_compilation_unit()
../../gcc/cgraphunit.c:2868

Analysis points to do_per_function_toporder() in passes.c, where handling a
cgraph_node in the worklist might remove later, not-yet-processed nodes. So it
registers a removal hook to track removed nodes in a hashset, using the node's
uid as the key. But this trick has a problem: the node's uid becomes garbage
once the node has been removed. An alternative is to use the node's address as
the key.

 /* Function could be inlined and removed as unreachable.  */
 if (node == NULL || removed_nodes.contains (node->get_uid ()))
  ^^

[Bug ipa/91468] New: Suspicious codes in ipa-prop.c and ipa-cp.c

2019-08-15 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91468

Bug ID: 91468
   Summary: Suspicious codes in ipa-prop.c and ipa-cp.c
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Some might be a bug, and some might be redundant.

ipa-prop.c:

In function ipcp_modif_dom_walker::before_dom_children(),

  vce = false;
  t = rhs;
  while (handled_component_p (t))
{
  /* V_C_E can do things like convert an array of integers to one
 bigger integer and similar things we do not handle below.  */
  if (TREE_CODE (rhs) == VIEW_CONVERT_EXPR)
{
  vce = true;
  break;
}
  t = TREE_OPERAND (t, 0);
}
  if (vce)
continue;

Should "rhs" in "if (TREE_CODE (rhs) == VIEW_CONVERT_EXPR)" be "t"?


In function update_jump_functions_after_inlining(),

  if (dst->type == IPA_JF_ANCESTOR)
{
  ..

  if (src->type == IPA_JF_PASS_THROUGH
  && src->value.pass_through.operation == NOP_EXPR)
{
   ..
}
  else if (src->type == IPA_JF_PASS_THROUGH
   && TREE_CODE_CLASS (src->value.pass_through.operation) ==
tcc_unary)
{
  dst->value.ancestor.formal_id =
src->value.pass_through.formal_id;
  dst->value.ancestor.agg_preserved = false;
}
  ..   
}

If we suppose the pass_through operation is "negate_expr" (though that is not
a reasonable operation on a pointer type), the code might be incorrect. It
would be better to specify the expected unary operations here.


In function compute_complex_assign_jump_func(),

case GIMPLE_UNARY_RHS:
  if (is_gimple_assign (stmt)
  && gimple_assign_rhs_class (stmt) == GIMPLE_UNARY_RHS
  && ! CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt)))
ipa_set_jf_unary_pass_through (jfunc, index,
   gimple_assign_rhs_code (stmt));

The condition "is_gimple_assign (stmt)
  && gimple_assign_rhs_class (stmt) == GIMPLE_UNARY_RHS" seems to
be redundant and might be omitted.


ipa-cp.c:

In function merge_agg_lats_step(), 

  if (**aglat && (**aglat)->offset == offset)
{
  if ((**aglat)->size != val_size
  || ((**aglat)->next 
  && (**aglat)->next->offset < offset + val_size))
{
  set_agg_lats_to_bottom (dest_plats);
  return false;
}
  gcc_checking_assert (!(**aglat)->next
   || (**aglat)->next->offset >= offset + val_size);
  return true;
}

The condition "|| ((**aglat)->next && (**aglat)->next->offset < offset +
val_size))" seems to always be false, because an item should not overlap its
predecessor; that is exactly what merge_agg_lats_step() tries to ensure.

[Bug ipa/91089] IPA-cp does not setup proper cost model for switch default case in function versioning

2019-08-15 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91089

--- Comment #2 from Feng Xue  ---
I've already created a patch under review. Please give some comments. Here it
is: https://gcc.gnu.org/ml/gcc-patches/2019-07/msg00937.html

[Bug ipa/91088] IPA-cp cost evaluation is too conservative for "if (f(param) cmp const_val)" condition

2019-08-15 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91088

--- Comment #3 from Feng Xue  ---
I've already created a patch under review. Please give some comments. Here it
is: https://gcc.gnu.org/ml/gcc-patches/2019-07/msg00959.html

[Bug ipa/91194] A suspicious condition in recursive_inlining

2019-07-21 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91194

Feng Xue  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Feng Xue  ---
Fixed

[Bug ipa/91194] New: A suspicious condition in recursive_inlining

2019-07-18 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91194

Bug ID: 91194
   Summary: A suspicious condition in recursive_inlining
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

A piece of code in recursive_inlining() in ipa-inline.c:

  while (!heap.empty ())
{
  struct cgraph_edge *curr = heap.extract_min ();
  struct cgraph_node *cnode, *dest = curr->callee;

  if (!can_inline_edge_p (curr, true)
  || can_inline_edge_by_limits_p (curr, true))
continue;

The second condition does not seem logical; should it be
"!can_inline_edge_by_limits_p (curr, true)"?

[Bug tree-optimization/91090] New: A suspicious code in tree-ssa-dom.c

2019-07-04 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91090

Bug ID: 91090
   Summary: A suspicious code in tree-ssa-dom.c
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Find code snippet in simplify_stmt_for_jump_threading ():

  if (vr->kind () == VR_RANGE)
{
  size_t i, j;

      find_case_label_range (switch_stmt, vr->min (), vr->max (), &i, &j);

  if (i == j)
{
  tree label = gimple_switch_label (switch_stmt, i);
  tree singleton;

  if (CASE_HIGH (label) != NULL_TREE
  ? (tree_int_cst_compare (CASE_LOW (label), vr->min ()) <= 0
 && tree_int_cst_compare (CASE_HIGH (label), vr->max ()) >= 0)
  : (vr->singleton_p ()
 && tree_int_cst_equal (CASE_LOW (label), singleton)))
return label;

  if (i > j)
return gimple_switch_label (switch_stmt, 0);
}
}

The conditional "if (i > j)" should be outside of "if (i == j)"?

[Bug ipa/91089] New: IPA-cp does not set up proper cost model for switch default case in function versioning

2019-07-04 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91089

Bug ID: 91089
   Summary: IPA-cp does not set up proper cost model for switch
 default case in function versioning
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

IPA-cp always adds the execution cost of the switch default case into the cost
evaluation for the other switch cases, which might miss opportunities for
function versioning when only a certain case is proven to be executed after
constant propagation.

The following is an example, compiled with "-O3 -fno-inline".

It is reasonable to make a function version of callee when "v=1", but
actually that does not happen. And if we replace "default" with "case 4",
it clones the function as we expect.

In some situations, we can deduce a relatively small execution predicate
for the default case. If all switch cases lie in a contiguous value range, we
can compare the switch value against the range's low and high bounds to mark
the default switch case. Additionally, range analysis can also give us more
information to simplify the predicate.

int foo();

int callee(int v)
{
  int i = 0;

  switch (v)
   {
 case 1:
   i = 3;
   break;
 case 2:
   i = 9;
   break;

 default: 
   foo();
   foo();
   foo();
   foo();
   foo();
   foo();
   foo();
   foo();
   foo();
  }
  return i;
}

int caller()
{
  return callee (1);
}

[Bug ipa/91088] New: IPA-cp cost evaluation is too conservative for "if (f(param) cmp const_val)" condition

2019-07-04 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91088

Bug ID: 91088
   Summary: IPA-cp cost evaluation is too conservative for "if
(f(param) cmp const_val)" condition
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Current IPA-cp only detects conditional statements like "if (param cmp
const_val)" and gives a reasonable cost evaluation used in function
versioning. But beyond this form, any generalized form f(param), a function
of the parameter that involves extra operations and constant values, is
conservatively treated and has no use in cost computation.

The following is an example, compiled with "-O3 -fno-inline".

The value "5" is propagated into callee(). For "if (v < 3)", ipa-cp can
deduce that the "true branch" will not be executed, so the benefit of function
versioning is large. But for "if (v * 2 < 6)", ipa-cp cannot handle that, and
conservatively takes the cost of the "true branch" into account, so it reaches
a "no need to clone function" conclusion.

int foo();

int callee(int v)
{
  if (v < 3)          // will make a constprop copy with v = 5
  // if (v * 2 < 6)   // will not make a constprop copy
    {
      foo();
      foo();
      foo();
      foo();
      foo();
      foo();
      foo();
      foo();
      foo();
      return 0;
    }
  else
    return 1;
}

int caller()
{
  return callee (5);
}

[Bug ipa/90401] Missed propagation of by-ref constant argument to callee function

2019-06-13 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90401

Feng Xue  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Feng Xue  ---
Fixed

[Bug tree-optimization/89713] Optimize away an empty loop whose finiteness can not be analytically determined

2019-06-13 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89713

Feng Xue  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Feng Xue  ---
Fixed

[Bug ipa/90401] Missed propagation of by-ref constant argument to callee function

2019-05-09 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90401

--- Comment #2 from Feng Xue  ---
(In reply to Richard Biener from comment #1)
> Huh.  IPA-CP dump difference:
> 
> @@ -26,6 +26,8 @@
>   Unknown VR
>  callsite  int caller(int, int&)/2 -> int callee(int&)/1 : 
> param 0: UNKNOWN
> + Aggregate passed by reference:
> +   offset: 0, cst: 1
>   value: 0x0, mask: 0xfffc
>   VR  ~[0, 0]
>Jump functions of caller  int callee(int&)/1:

This jump function is for the 2nd call. For the 1st call, IPA-CP cannot deduce
that, so there is not enough benefit to convince IPA-CP to clone "callee()".

> 
> I guess somehow IPA-CP walks stmts for defs instead of virtual operands?

Yes, it is. IPA-CP only walks stmts in callsite basic block.

[Bug ipa/90401] New: Missed propagation of by-ref constant argument to callee function

2019-05-08 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90401

Bug ID: 90401
   Summary: Missed propagation of by-ref constant argument to
callee function
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: fxue at os dot amperecomputing.com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Created attachment 46318
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46318&action=edit
The case that ipa-cp fails on a by-ref constant argument

IPA-cp cannot identify a constant by-ref argument to a function if the value
definition of the argument is not in the same basic block as the callsite.
In Fortran, it is very common that a constant function argument is placed
in a local variable instead of a literal constant.

The following is a C++ case; the compile option is "-O3 -fno-inline", using
"-fno-inline" to give ipa-cp a chance on small functions.

If we move the statement "int a = 1" from inside the conditional block to the
beginning of the function, ipa-cp fails to do function versioning on "callee()".

int data;

int callee(int &v)
{
if (v < 2) {
return 0;
} else {
data = v;
return 1;
}
}

int caller(int c, int &r)
{
    // int a = 1;      /* if declared here: ipa-cp on callee(), failed */

    if (c) {
        int a = 1;     /* if declared here: ipa-cp on callee(), ok */

        return callee(a);
    } else {
        r = 1;
        return callee(r);
    }
}

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #9 from Feng Xue  ---
(In reply to Feng Xue from comment #5)
> >   I would say that top-down algorithm behaves better than bottom-up one.  I
> > implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years
> > ago and it behaved worse than the current one.  I think adding an additional
> Intuitively, it's better to give preference to regions with higher frequency
> to let them pick up registers earlier and with more freedom, which matches
> with bottom-up coloring strategy.
> 
> > pass over regions probably would solve the problem but it will complicate
> > already complicated RA (which is about 60K lines now). In any case I'll
> > think about this problem.
> > 
> >   The problem you are mentioning in this code is quite insignificant (one
> > memory access in the top most region).
> But it will be significant inside a multi-level loop. Actually, the problem is
> exposed by an 8-level loop in a real Fortran application.
> 
> >   
> >   I also checked the attachments.  What I see is GCC generates a better big
> > loop body (a loop with high register pressure).  GCC generated loop body has
> > 50 x86-64 insns with 22 memory accesses vs 56 with 26 memory access for 
> > LLVM.
> As far as spilling live ranges at loop boundaries is concerned, gcc is
> not very efficient. To take the focus off the loop body, when we remove two
> live-range variables in the original case, we get the same amount of memory
> accesses for gcc and llvm, both of which generate 9 register spills in the loop
> body. I attached the modified case and assembly files.

For the modified case, the allocation by llvm and gcc remains nearly the same
for the loop body, but as to 'lv0', llvm gets a better spill/split result than
gcc.

[Bug rtl-optimization/90168] context-sensitive local register allocation

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90168

--- Comment #4 from Feng Xue  ---
(In reply to Andrew Pinski from comment #3)
> >or to use float type to hold frequency?
> 
> This won't work correctly as floating point is different between hosts. 
> There has been some usage of floating point inside of GCC which was removed
> because of that issue.  I thought that was documented somewhere too.

How about adjusting REG_FREQ_MAX to be the same as BB_FREQ_MAX? Currently the
REG_FREQ_MAX/BB_FREQ_MAX ratio is 1/10.

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #8 from Feng Xue  ---
Created attachment 46294
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46294&action=edit
asm file 2 generated by llvm

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #7 from Feng Xue  ---
Created attachment 46293
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46293&action=edit
asm file 2 generated by gcc

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #6 from Feng Xue  ---
Created attachment 46292
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46292&action=edit
test case with less live ranges

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #5 from Feng Xue  ---
>   I would say that top-down algorithm behaves better than bottom-up one.  I
> implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years
> ago and it behaved worse than the current one.  I think adding an additional
Intuitively, it's better to give preference to regions with higher frequency to
let them pick up registers earlier and with more freedom, which matches with
bottom-up coloring strategy.

> pass over regions probably would solve the problem but it will complicate
> already complicated RA (which is about 60K lines now). In any case I'll
> think about this problem.
> 
>   The problem you are mentioning in this code is quite insignificant (one
> memory access in the top most region).
But it will be significant inside a multi-level loop. Actually, the problem is
exposed by an 8-level loop in a real Fortran application.

>   
>   I also checked the attachments.  What I see is GCC generates a better big
> loop body (a loop with high register pressure).  GCC generated loop body has
> 50 x86-64 insns with 22 memory accesses vs 56 with 26 memory access for LLVM.
As far as spilling live ranges at loop boundaries is concerned, gcc is not
very efficient. To take the focus off the loop body, when we remove two
live-range variables in the original case, we get the same amount of memory
accesses for gcc and llvm, both of which generate 9 register spills in the loop
body. I attached the modified case and assembly files.

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-04-24 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #3 from Feng Xue  ---
Created attachment 46237
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46237&action=edit
test case for aarch64

Add another case composed for aarch64.

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-04-21 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #2 from Feng Xue  ---
Created attachment 46219
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46219&action=edit
asm file generated by llvm

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-04-21 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #1 from Feng Xue  ---
Created attachment 46218
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46218&action=edit
asm file generated by gcc

[Bug rtl-optimization/90168] context-sensitive local register allocation

2019-04-19 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90168

--- Comment #2 from Feng Xue  ---
(In reply to Eric Botcazou from comment #1)
> > Suppose a function like the following, in which 'cond', 'S1' and 'S2' are
> > completely irrelevant, meaning they do not access the same variables (in
> > terms of RA, they own separate live range sets).
> > 
> >   f1()
> >   { 
> >   if (cond) {
> >   S1
> >   } else {
> >   S2
> >   }
> >   }
> > 
> > Ideally, we can expect that register allocation on 'S1' is totally
> > independent of 'S2'; with or without it makes no difference.
> 
> This seems a rather far-fetched assumption, to say the least.  This would
> essentially imply that no global optimization is applied to the function.
> 
> > Its result should be same as below function consisting of only 'S1':
> > 
> >   f2()
> >   {
> >   S1
> >   }
> > 
> > But we found gcc does not have this property. Strictly speaking, this is not
> > a bug, but it exposes some kind of instability in code generation, with an
> > undeterminable impact on some optimizations, such as inlining.
> 
> And do you know of any non-toy/production compiler that has the property?

llvm has, and icc nearly has (only a minor difference in register number, but
completely the same spills).

> 
> > Investigation shows this is related to integer-based frequency normalization
> > (REG_FREQ_FROM_BB) used by RA, which always rounds a small frequency
> > (less than 1) up to 1. In foo1(), the introduction of new code decreases the
> > profile counts of CODES, so that the impact of the frequency normalization
> > error becomes more significant and actually distorts the original proportion
> > of profile counts among basic blocks in CODES. For example, in foo(), two
> > blocks have counts of 3 and 100 respectively, and in foo1(), they become 0.3
> > and 10; after rounding up, they are 1 and 10, thus the proportion is changed
> > from (3 vs 100) to (1 vs 10).
> > 
> > Possible solution might be to adjust two scale factors used by
> > REG_FREQ_FROM_BB : REG_FREQ_MAX and BB_FREQ_MAX, or to use float type to
> > hold frequency?
> 
> Which version of the compiler are you using? This changed in GCC 8.

GCC trunk (9.0.1 20190325)
