Re: Check that passes do not forget to define profile

2023-08-24 Thread Jan Hubicka via Gcc-patches
> On Thu, Aug 24, 2023 at 3:15 PM Jan Hubicka via Gcc-patches
>  wrote:
> >
> > Hi,
> > this patch extends the verifier to check that all probabilities and counts
> > are initialized if the profile is supposed to be present.  This is a bit
> > complicated by the possibility that we inline a !flag_guess_branch_probability
> > function into a function with a defined profile, and in this case we need to
> > stop verification.  For this reason I added a flag to the cfg structure
> > tracking this.
> >
> > Bootstrapped/regtested x86_64-linux, committed.
> 
> Couldn't we have massaged profile_status to avoid extra full_profile?
> Aka add PROFILE_{READ,GUESSED}_PARTIAL?

I am working in the direction of removing profile_status.  We mostly use it
to determine whether the profile is reliable (or present at all).
That information is available locally in the profile quality info of
profile_count and profile_probability.

Most existing tests of that value go wrong when we inline functions
with one profile status into functions with another, so they should be
replaced by more local tests.

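For example, instead of testing profile_status_for_fn, a pass could query the
quality locally (a hypothetical sketch using the existing profile_count API,
not part of this patch):

  /* Sketch: decide whether BB's profile can be trusted without looking at
     the function-wide profile_status.  profile_count carries its own
     quality, so inlining a function with a different profile status does
     not confuse the check.  */
  static bool
  bb_profile_reliable_p (basic_block bb)
  {
    return bb->count.initialized_p () && bb->count.reliable_p ();
  }
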
Honza


Check that passes do not forget to define profile

2023-08-24 Thread Jan Hubicka via Gcc-patches
Hi,
this patch extends the verifier to check that all probabilities and counts are
initialized if the profile is supposed to be present.  This is a bit complicated
by the possibility that we inline a !flag_guess_branch_probability function
into a function with a defined profile, and in this case we need to stop
verification.  For this reason I added a flag to the cfg structure tracking this.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* cfg.h (struct control_flow_graph): New field full_profile.
* auto-profile.cc (afdo_annotate_cfg): Set full_profile to true.
* cfg.cc (init_flow): Set full_profile to false.
* graphite.cc (graphite_transform_loops): Set full_profile to false.
* lto-streamer-in.cc (input_cfg): Initialize full_profile flag.
* predict.cc (pass_profile::execute): Set full_profile to true.
* symtab-thunks.cc (expand_thunk): Set full_profile to true.
* tree-cfg.cc (gimple_verify_flow_info): Verify that profile is full
if full_profile is set.
* tree-inline.cc (initialize_cfun): Initialize full_profile.
(expand_call_inline): Combine full_profile.


diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc
index e3af3555e75..ff3b763945c 100644
--- a/gcc/auto-profile.cc
+++ b/gcc/auto-profile.cc
@@ -1578,6 +1578,7 @@ afdo_annotate_cfg (const stmt_set _stmts)
 }
   update_max_bb_count ();
   profile_status_for_fn (cfun) = PROFILE_READ;
+  cfun->cfg->full_profile = true;
   if (flag_value_profile_transformations)
 {
   gimple_value_profile_transformations ();
diff --git a/gcc/cfg.cc b/gcc/cfg.cc
index 9eb9916f61a..b7865f14e7f 100644
--- a/gcc/cfg.cc
+++ b/gcc/cfg.cc
@@ -81,6 +81,7 @@ init_flow (struct function *the_fun)
 = ENTRY_BLOCK_PTR_FOR_FN (the_fun);
   the_fun->cfg->edge_flags_allocated = EDGE_ALL_FLAGS;
   the_fun->cfg->bb_flags_allocated = BB_ALL_FLAGS;
+  the_fun->cfg->full_profile = false;
 }
 
 /* Helper function for remove_edge and free_cffg.  Frees edge structure
diff --git a/gcc/cfg.h b/gcc/cfg.h
index a0e944979c8..53e2553012c 100644
--- a/gcc/cfg.h
+++ b/gcc/cfg.h
@@ -78,6 +78,9 @@ struct GTY(()) control_flow_graph {
   /* Dynamically allocated edge/bb flags.  */
   int edge_flags_allocated;
   int bb_flags_allocated;
+
+  /* Set if the profile is computed on every edge and basic block.  */
+  bool full_profile;
 };
 
 
diff --git a/gcc/graphite.cc b/gcc/graphite.cc
index 19f8975ffa2..2b387d5b016 100644
--- a/gcc/graphite.cc
+++ b/gcc/graphite.cc
@@ -512,6 +512,8 @@ graphite_transform_loops (void)
 
   if (changed)
 {
+  /* FIXME: Graphite does not update profile meaningfully currently.  */
+  cfun->cfg->full_profile = false;
   cleanup_tree_cfg ();
   profile_status_for_fn (cfun) = PROFILE_ABSENT;
   release_recorded_exits (cfun);
diff --git a/gcc/lto-streamer-in.cc b/gcc/lto-streamer-in.cc
index 0cce14414ca..d3128fcebe4 100644
--- a/gcc/lto-streamer-in.cc
+++ b/gcc/lto-streamer-in.cc
@@ -1030,6 +1030,7 @@ input_cfg (class lto_input_block *ib, class data_in 
*data_in,
   basic_block p_bb;
   unsigned int i;
   int index;
+  bool full_profile = false;
 
   init_empty_tree_cfg_for_function (fn);
 
@@ -1071,6 +1072,8 @@ input_cfg (class lto_input_block *ib, class data_in 
*data_in,
  data_in->location_cache.input_location_and_block (>goto_locus,
, ib, data_in);
  e->probability = profile_probability::stream_in (ib);
+ if (!e->probability.initialized_p ())
+   full_profile = false;
 
}
 
@@ -1145,6 +1148,7 @@ input_cfg (class lto_input_block *ib, class data_in 
*data_in,
 
   /* Rebuild the loop tree.  */
   flow_loops_find (loops);
+  cfun->cfg->full_profile = full_profile;
 }
 
 
diff --git a/gcc/predict.cc b/gcc/predict.cc
index 5a1a561cc24..396746cbfd1 100644
--- a/gcc/predict.cc
+++ b/gcc/predict.cc
@@ -4131,6 +4131,7 @@ pass_profile::execute (function *fun)
 scev_initialize ();
 
   tree_estimate_probability (false);
+  cfun->cfg->full_profile = true;
 
   if (nb_loops > 1)
 scev_finalize ();
diff --git a/gcc/symtab-thunks.cc b/gcc/symtab-thunks.cc
index 4c04235c41b..23ead0d2138 100644
--- a/gcc/symtab-thunks.cc
+++ b/gcc/symtab-thunks.cc
@@ -648,6 +648,7 @@ expand_thunk (cgraph_node *node, bool output_asm_thunks,
  ? PROFILE_READ : PROFILE_GUESSED;
   /* FIXME: C++ FE should stop setting TREE_ASM_WRITTEN on thunks.  */
   TREE_ASM_WRITTEN (thunk_fndecl) = false;
+  cfun->cfg->full_profile = true;
   delete_unreachable_blocks ();
   update_ssa (TODO_update_ssa);
   checking_verify_flow_info ();
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 272d5ce321e..ffab7518b15 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -5684,6 +5684,26 @@ gimple_verify_flow_info (void)
error ("fallthru to exit from bb %d", e->src->index);
err = true;
   }
+  if (cfun->cfg->full_profile
+  && 
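
A check of this kind would look roughly like the following sketch (hypothetical,
written against the existing CFG iteration and profile APIs, not necessarily the
committed hunk):

  /* Sketch: with full_profile set, every BB count and every edge
     probability must be initialized.  */
  if (cfun->cfg->full_profile)
    {
      basic_block bb;
      FOR_EACH_BB_FN (bb, cfun)
	{
	  edge e;
	  edge_iterator ei;
	  if (!bb->count.initialized_p ())
	    {
	      error ("count of bb %d is uninitialized", bb->index);
	      err = true;
	    }
	  FOR_EACH_EDGE (e, ei, bb->succs)
	    if (!e->probability.initialized_p ())
	      {
		error ("probability of edge %i->%i is uninitialized",
		       e->src->index, e->dest->index);
		err = true;
	      }
	}
    }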

Fix profile update in tree-ssa-reassoc

2023-08-23 Thread Jan Hubicka via Gcc-patches
Hi,
this patch adds the missing profile update to maybe_optimize_range_tests.
Jakub, I hope I got the code right: I think it basically analyzes the
chain of conditionals, finds the basic blocks involved in the range
testing and then puts all the tests into the first BB.

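For illustration, the kind of rewrite this range-test machinery performs is
(hand-written example, not the testcase below):

  /* Before: a chain of equality tests spread over several conditionals.  */
  if (a == 43 || a == 44 || a == 45 || a == 46)
    return 1;

  /* After: a single range test placed in the first BB; the conditionals in
     the later BBs become always-false and their outgoing probabilities have
     to be updated accordingly.  */
  if ((unsigned) a - 43 <= 3)
    return 1;
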
The patch fixes a profile misupdate in gcc.dg/tree-ssa/update-threading.c on
PowerPC.  Curiously enough, the code is produced differently for x86_64.
I tried to find a testcase for x86_64 and found that

testsuite/gcc.dg/tree-ssa/reassoc-33.c
testsuite/gcc.dg/tree-ssa/reassoc-37.c
testsuite/gcc.dg/tree-ssa/reassoc-43.c

are testing this function.  However, sadly, neither of these testcases seems
to work as expected.  For example, in testsuite/gcc.dg/tree-ssa/reassoc-33.c
we turn

;; basic block 3, loop depth 0, count 708669600 (estimated locally, freq 
0.6600), maybe hot
;;  prev block 2, next block 4, flags: (NEW, VISITED)
;;  pred:   2 [66.0% (guessed)]  count:708669600 (estimated locally, freq 
0.6600) (FALSE_VALUE,EXECUTABLE)
_4 = a_14(D) == 44;
_5 = a_14(D) == 78;
_30 = 0;
_6 = _4 | _5;
if (_30 != 0)
  goto ; [34.00%]
else
  goto ; [66.00%]
;;  succ:   7 [34.0% (guessed)]  count:240947667 (estimated locally, freq 
0.2244) (TRUE_VALUE,EXECUTABLE)
;;  4 [66.0% (guessed)]  count:467721933 (estimated locally, freq 
0.4356) (FALSE_VALUE,EXECUTABLE)

to

;; basic block 2, loop depth 0, count 1073741824 (estimated locally, freq 
1.), maybe hot
;;  prev block 0, next block 3, flags: (NEW, VISITED)
;;  pred:   ENTRY [always]  count:1073741824 (estimated locally, freq 
1.) (FALLTHRU,EXECUTABLE)
_18 = (unsigned int) a_14(D);
_19 = _18 + 4294967253;
_24 = (unsigned int) a_14(D);
_25 = _24 + 4294967253;
_26 = _25 & 4294967260;
_27 = _26 == 0;
_20 = _19 <= 3;
_1 = a_14(D) == 43;
_21 = (unsigned int) a_14(D);
_22 = _21 + 4294967221;
_23 = _22 <= 3;
_2 = a_14(D) == 75;
_31 = _27;
_3 = _1 | _2;
if (_31 != 0)
  goto ; [34.00%]
else
  goto ; [66.00%]

which replaces later tests

;; basic block 4, loop depth 0, count 467721934 (estimated locally, freq 
0.4356), maybe hot
;;  prev block 3, next block 5, flags: (NEW, VISITED)
;;  pred:   3 [66.0% (guessed)]  count:467721933 (estimated locally, freq 
0.4356) (FALSE_VALUE,EXECUTABLE)
_7 = a_14(D) == 77;
_8 = a_14(D) == 46;
_29 = 0;
_9 = _7 | _8;
if (_29 != 0)
  goto ; [34.00%]
else
  goto ; [66.00%]

;; basic block 5, loop depth 0, count 308696475 (estimated locally, freq 
0.2875), maybe hot
;;  prev block 4, next block 6, flags: (NEW, VISITED)
;;  pred:   4 [66.0% (guessed)]  count:308696475 (estimated locally, freq 
0.2875) (FALSE_VALUE,EXECUTABLE)
_10 = a_14(D) == 76;
_11 = a_14(D) == 45;
_28 = 0;
_12 = _10 | _11;
if (_28 != 0)
  goto ; [50.00%]
else
  goto ; [50.00%]
;;  succ:   7 [50.0% (guessed)]  count:154348238 (estimated locally, freq 
0.1437) (TRUE_VALUE,EXECUTABLE)
;;  6 [50.0% (guessed)]  count:154348238 (estimated locally, freq 
0.1437) (FALSE_VALUE,EXECUTABLE)

However BB4 and BB5 are not updated to be unconditional by the tree-ssa-reassoc
pass, and we thus miss the profile update.

That happens later, in forwprop, but at that time it is too late to update the
probabilities.  So we get:

;;   basic block 2, loop depth 0, count 1073741824 (estimated locally, freq 
1.), maybe hot
;;prev block 0, next block 3, flags: (NEW, VISITED)
;;pred:   ENTRY [always]  count:1073741824 (estimated locally, freq 
1.) (FALLTHRU,EXECUTABLE)
  _24 = (unsigned int) a_14(D);
  _25 = _24 + 4294967253;
  _26 = _25 & 4294967260;
  _27 = _26 == 0;
  if (_26 == 0)
goto ; [34.00%]
  else
goto ; [66.00%]
;;succ:   4 [34.0% (guessed)]  count:365072224 (estimated locally, freq 
0.3400) (TRUE_VALUE,EXECUTABLE)
;;3 [66.0% (guessed)]  count:708669600 (estimated locally, freq 
0.6600) (FALSE_VALUE,EXECUTABLE)

;;   basic block 3, loop depth 0, count 154348237 (estimated locally, freq 
0.1437), maybe hot
;;   Invalid sum of incoming counts 708669600 (estimated locally, freq 0.6600), 
should be 154348237 (estimated locally, freq 0.1437)
;;prev block 2, next block 4, flags: (NEW, VISITED)
;;pred:   2 [66.0% (guessed)]  count:708669600 (estimated locally, freq 
0.6600) (FALSE_VALUE,EXECUTABLE)
;;succ:   4 [always]  count:154348237 (estimated locally, freq 0.1437) 
(FALLTHRU,EXECUTABLE) c.c:12:12

;;   basic block 4, loop depth 0, count 1073741824 (estimated locally, freq 
1.), maybe hot
;;   Invalid sum of incoming counts 519420461 (estimated locally, freq 0.4837), 
should be 1073741824 (estimated locally, freq 1.)
;;prev block 3, next block 1, flags: (NEW, VISITED)
;;pred:   2 [34.0% (guessed)]  count:365072224 (estimated locally, freq 
0.3400) (TRUE_VALUE,EXECUTABLE)
;;3 [always]  count:154348237 (estimated locally, freq 0.1437) 
(FALLTHRU,EXECUTABLE) c.c:12:12
  # _13 = PHI 
  return _13;

Jakub, it seems that the code is originally yours.  Any idea why those are not 
turned to
constant true 

Re: Loop-ch improvements, part 3

2023-08-23 Thread Jan Hubicka via Gcc-patches
> We seem to peel one iteration for no good reason.  The loop is
> a do-while loop already.  The key is we see the first iteration
> exit condition is known not taken and then:
Hi,
this is a patch fixing the wrong return value in should_duplicate_loop_header_p.
Doing so uncovered suboptimal decisions on some jump threading testcases
where we chose to stop duplicating just before a basic block that has zero
cost, where duplicating it would always be a win.

This is because the heuristics trying to choose the right point to duplicate
all winning blocks and to turn the loop into a do-while did not account for
zero-cost blocks in all cases.  The patch simplifies the logic by
simply remembering zero-cost blocks and handling them last, after
the right stopping point is chosen.

Bootstrapped/regtested x86_64-linux, OK?

gcc/ChangeLog:

* tree-ssa-loop-ch.cc (enum ch_decision): Fix comment.
(should_duplicate_loop_header_p): Fix return value for static exits.
(ch_base::copy_headers): Improve handling of ch_possible_zero_cost.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/copy-headers-9.c: Update template.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c 
b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
index b49d1fc9576..11ee29458a2 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
@@ -13,7 +13,6 @@ void test (int m, int n)
}
while (i<10);
 }
-/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win" 2 "ch2" } } */
-/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win. it has zero" 
1 "ch2" } } */
+/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win" 1 "ch2" } } */
 /* { dg-final { scan-tree-dump-times "Will duplicate bb" 2 "ch2" } } */
 /* { dg-final { scan-tree-dump "is now do-while loop" "ch2" } } */
diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index 6cdb87a762f..461416e4086 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -176,7 +176,7 @@ enum ch_decision
   ch_impossible,
   /* We can copy it if it enables wins.  */
   ch_possible,
-  /* We can "cop" it if it enables wins and doing
+  /* We can "copy" it if it enables wins and doing
  so will introduce no new code.  */
   ch_possible_zero_cost,
   /* We want to copy.  */
@@ -464,7 +464,7 @@ should_duplicate_loop_header_p (basic_block header, class 
loop *loop,
  TODO: Even if duplication costs some size we may opt to do so in case
  exit probability is significant enough (do partial peeling).  */
   if (static_exit)
-return code_size_cost ? ch_possible_zero_cost : ch_win;
+return !code_size_cost ? ch_possible_zero_cost : ch_possible;
 
   /* We was not able to prove that conditional will be eliminated.  */
   int insns = estimate_num_insns (last, _size_weights);
@@ -824,6 +824,7 @@ ch_base::copy_headers (function *fun)
   int last_win_nheaders = 0;
   bool last_win_invariant_exit = false;
   ch_decision ret;
+  auto_vec  decision;
   hash_set  *invariant_exits = new hash_set ;
   hash_set  *static_exits = new hash_set ;
   while ((ret = should_duplicate_loop_header_p (header, loop, ranger,
@@ -833,6 +834,7 @@ ch_base::copy_headers (function *fun)
 != ch_impossible)
{
  nheaders++;
+ decision.safe_push (ret);
  if (ret >= ch_win)
{
  last_win_nheaders = nheaders;
@@ -841,20 +843,6 @@ ch_base::copy_headers (function *fun)
fprintf (dump_file, "Duplicating bb %i is a win\n",
 header->index);
}
- /* Duplicate BB if has zero cost but be sure it will not
-imply duplication of other BBs.  */
- else if (ret == ch_possible_zero_cost
-  && (last_win_nheaders == nheaders - 1
-  || (last_win_nheaders == nheaders - 2
-  && last_win_invariant_exit)))
-   {
- last_win_nheaders = nheaders;
- last_win_invariant_exit = false;
- if (dump_file && (dump_flags & TDF_DETAILS))
-   fprintf (dump_file,
-"Duplicating bb %i is a win; it has zero cost\n",
-header->index);
-   }
  else
if (dump_file && (dump_flags & TDF_DETAILS))
  fprintf (dump_file, "May duplicate bb %i\n", header->index);
@@ -884,6 +872,16 @@ ch_base::copy_headers (function *fun)
fprintf (dump_file,
 "Duplicating header BB to obtain do-while loop\n");
}
+  /* "Duplicate" all BBs with zero cost following last basic blocks we
+decided to copy.  */
+  while (last_win_nheaders < (int)decision.length ()
+&& decision[last_win_nheaders] == ch_possible_zero_cost)
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file,
+"

Re: Loop-ch improvements, part 3

2023-08-22 Thread Jan Hubicka via Gcc-patches
> 
> We seem to peel one iteration for no good reason.  The loop is
> a do-while loop already.  The key is we see the first iteration
> exit condition is known not taken and then:
> 
>  Registering value_relation (path_oracle) (iter.24_6 > iter.24_5) (root: 
> bb2)
> Stmt is static (constant in the first iteration)
>   Analyzing: if (iter.24_6 != 16)
>  Registering killing_def (path_oracle) iter.24_6
>  Registering value_relation (path_oracle) (iter.24_6 > iter.24_5) (root: 
> bb2)
> Will eliminate peeled conditional in bb 3.
> Duplicating bb 3 is a win; it has zero cost
>   Not duplicating bb 5: it is single succ.
> Copying headers of loop 1
> Will duplicate bb 3
> Duplicating header of the loop 1 up to edge 3->4
> Loop 1 is do-while loop
> Loop 1 is now do-while loop.
> Exit count: 0 (estimated locally)
> Entry count: 10631108 (estimated locally)
> Peeled all exits: decreased number of iterations of loop 1 by 1.
> 
> and that's because of
> 
>   /* If the static exit fully optimize out, it is win to "duplicate"
>  it.
> 
>  TODO: Even if duplication costs some size we may opt to do so in case
>  exit probability is significant enough (do partial peeling).  */
>   if (static_exit)
> return code_size_cost ? ch_possible_zero_cost : ch_win;
> 
> IMHO we're over aggressively apply early peeling here.  That holds
> generally, not only for OMP simd loops (which we could identify).
> 
> Why are we doing this game for single-block do-while loops?

It seems I just wrongly updated the old conditional. Sorry for that.
It should be:
diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index 6cdb87a762f..8142add4bec 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -464,7 +464,7 @@ should_duplicate_loop_header_p (basic_block header, class 
loop *loop,
  TODO: Even if duplication costs some size we may opt to do so in case
  exit probability is significant enough (do partial peeling).  */
   if (static_exit)
-return code_size_cost ? ch_possible_zero_cost : ch_win;
+return !code_size_cost ? ch_possible_zero_cost : ch_possible;
 
   /* We was not able to prove that conditional will be eliminated.  */
   int insns = estimate_num_insns (last, _size_weights);

So the heuristics knows that if no code is produced, "peeling" is a
good idea since it eliminates one conditional for free.  Otherwise it
should know that peeling is possible, but only do it if it produces a
do-while loop.

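A minimal example of the zero-cost case (hand-written, corresponding to the
dump quoted above; body () is just a placeholder):

  /* The loop is already do-while and the latch test is known true in the
     first iteration (i == 1 there), so copying the header duplicates no
     real code: the peeled exit check folds away.  */
  int i = 0;
  do
    {
      body (i);
      i++;
    }
  while (i != 16);
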
As the TODO says, it would make sense to also duplicate when the exit likely
avoids entering the loop (which would be cheaper than peeling the full first
iteration), but that can be done incrementally.

I am testing the fix.

Honza


Re: [PATCH] tree-optimization/110991 - unroll size estimate after vectorization

2023-08-14 Thread Jan Hubicka via Gcc-patches
> The following testcase shows that we are bad at identifying inductions
> that will be optimized away after vectorizing them because SCEV doesn't
> handle vectorized defs.  The following rolls a simpler identification
> of SSA cycles covering a PHI and an assignment with a binary operator
> with a constant second operand.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> Note, I also have a more general approach (will reply to this mail
> with an RFC).

Looks good to me.  This can clearly be generalized to more complicated
expressions, so is that what you plan to do next?

Honza
> 
> Any comments on this particular change?
> 
>   PR tree-optimization/110991
>   * tree-ssa-loop-ivcanon.cc (constant_after_peeling): Handle
>   VIEW_CONVERT_EXPR , handle more simple IV-like SSA cycles
>   that will end up constant.
> 
>   * gcc.dg/tree-ssa/cunroll-16.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/cunroll-16.c | 17 
>  gcc/tree-ssa-loop-ivcanon.cc   | 46 +-
>  2 files changed, 62 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/cunroll-16.c
> 
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/cunroll-16.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/cunroll-16.c
> new file mode 100644
> index 000..9bb66ff8299
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/cunroll-16.c
> @@ -0,0 +1,17 @@
> +/* PR/110991 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-cunroll-details -fdump-tree-optimized" } */
> +
> +static unsigned char a;
> +static signed char b;
> +void foo(void);
> +int main() {
> +  a = 25;
> +  for (; a > 13; --a)
> +b = a > 127 ?: a << 3;
> +  if (!b)
> +foo();
> +}
> +
> +/* { dg-final { scan-tree-dump "optimized: loop with \[0-9\]\+ iterations 
> completely unrolled" "cunroll" } } */
> +/* { dg-final { scan-tree-dump-not "foo" "optimized" } } */
> diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
> index a895e8e65be..99e50ee2efe 100644
> --- a/gcc/tree-ssa-loop-ivcanon.cc
> +++ b/gcc/tree-ssa-loop-ivcanon.cc
> @@ -166,6 +166,11 @@ constant_after_peeling (tree op, gimple *stmt, class 
> loop *loop)
>if (CONSTANT_CLASS_P (op))
>  return true;
>  
> +  /* Get at the actual SSA operand.  */
> +  if (handled_component_p (op)
> +  && TREE_CODE (TREE_OPERAND (op, 0)) == SSA_NAME)
> +op = TREE_OPERAND (op, 0);
> +
>/* We can still fold accesses to constant arrays when index is known.  */
>if (TREE_CODE (op) != SSA_NAME)
>  {
> @@ -198,7 +203,46 @@ constant_after_peeling (tree op, gimple *stmt, class 
> loop *loop)
>tree ev = analyze_scalar_evolution (loop, op);
>if (chrec_contains_undetermined (ev)
>|| chrec_contains_symbols (ev))
> -return false;
> +{
> +  if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (op)))
> + {
> +   gassign *ass = nullptr;
> +   gphi *phi = nullptr;
> +   if (is_a  (SSA_NAME_DEF_STMT (op)))
> + {
> +   ass = as_a  (SSA_NAME_DEF_STMT (op));
> +   if (TREE_CODE (gimple_assign_rhs1 (ass)) == SSA_NAME)
> + phi = dyn_cast 
> + (SSA_NAME_DEF_STMT (gimple_assign_rhs1  (ass)));
> + }
> +   else if (is_a  (SSA_NAME_DEF_STMT (op)))
> + {
> +   phi = as_a  (SSA_NAME_DEF_STMT (op));
> +   if (gimple_bb (phi) == loop->header)
> + {
> +   tree def = gimple_phi_arg_def_from_edge
> + (phi, loop_latch_edge (loop));
> +   if (TREE_CODE (def) == SSA_NAME
> +   && is_a  (SSA_NAME_DEF_STMT (def)))
> + ass = as_a  (SSA_NAME_DEF_STMT (def));
> + }
> + }
> +   if (ass && phi)
> + {
> +   tree rhs1 = gimple_assign_rhs1 (ass);
> +   if (gimple_assign_rhs_class (ass) == GIMPLE_BINARY_RHS
> +   && CONSTANT_CLASS_P (gimple_assign_rhs2 (ass))
> +   && rhs1 == gimple_phi_result (phi)
> +   && gimple_bb (phi) == loop->header
> +   && (gimple_phi_arg_def_from_edge (phi, loop_latch_edge (loop))
> +   == gimple_assign_lhs (ass))
> +   && (CONSTANT_CLASS_P (gimple_phi_arg_def_from_edge
> +  (phi, loop_preheader_edge (loop)
> + return true;
> + }
> + }
> +  return false;
> +}
>return true;
>  }
>  
> -- 
> 2.35.3


Avoid division by zero in fold_loop_internal_call

2023-08-14 Thread Jan Hubicka via Gcc-patches
Hi,
My patch to fix the profile after folding an internal call is missing a check
for the case where the profile was already zero before if-conversion.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

PR gcov-profile/110988
* tree-cfg.cc (fold_loop_internal_call): Avoid division by zero.

diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index ab1f8067c54..105f4b1c953 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -7734,11 +7734,14 @@ fold_loop_internal_call (gimple *g, tree value)
 test.  This should not happen as the guarded code should
 start with pre-header.  */
  gcc_assert (single_pred_edge (taken_edge->dest));
- taken_edge->dest->count
-   = taken_edge->dest->count.apply_scale (new_count,
-  old_count);
- scale_strictly_dominated_blocks (taken_edge->dest,
-  new_count, old_count);
+ if (old_count.nonzero_p ())
+   {
+ taken_edge->dest->count
+   = taken_edge->dest->count.apply_scale (new_count,
+  old_count);
+ scale_strictly_dominated_blocks (taken_edge->dest,
+  new_count, old_count);
+   }
}
}
 }


Fix division by zero in tree-ssa-loop-split

2023-08-10 Thread Jan Hubicka via Gcc-patches
Hi,
The profile update I added to tree-ssa-loop-split can divide by zero in a
situation where the conditional is predicted with 0 probability, which
is triggered by the jump threading update in the testcase.

gcc/ChangeLog:

PR middle-end/110923
* tree-ssa-loop-split.cc (split_loop): Watch for division by zero.

gcc/testsuite/ChangeLog:

PR middle-end/110923
* gcc.dg/tree-ssa/pr110923.c: New test.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr110923.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr110923.c
new file mode 100644
index 000..8f5720a5e9e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr110923.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */ 
+/* { dg-options "-O3 -fdump-tree-lsplit-details-blocks" } */
+int a, b, c, d;
+int main() {
+  for (a = 0; a < 2; a++) {
+if (b > 2)
+  c = 0;
+if (b > a)
+  d = 0;
+  }
+  return 0;
+}
+/* { dg-final { scan-tree-dump-times "loop split" 1 "lsplit" } } */
+/* { dg-final { scan-tree-dump-not "Invalid sum" "lsplit" } } */
diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index 2f7918c6e65..64464802c1e 100644
--- a/gcc/tree-ssa-loop-split.cc
+++ b/gcc/tree-ssa-loop-split.cc
@@ -703,7 +703,7 @@ split_loop (class loop *loop1)
   split between of the two new loops.  Keep orignal estimate since
   it is likely better then completely dropping it.
 
-  TODO: If we know that onle of the new loops has constant
+  TODO: If we know that one of the new loops has constant
   number of iterations, we can do better.  We could also update
   upper bounds.  */
if (loop1->any_estimate
@@ -713,11 +713,15 @@ split_loop (class loop *loop1)
  ? true_edge->probability.to_sreal () : (sreal)1;
sreal scale2 = false_edge->probability.reliable_p ()
  ? false_edge->probability.to_sreal () : (sreal)1;
+   sreal div1 = loop1_prob.to_sreal ();
/* +1 to get header interations rather than latch iterations and 
then
   -1 to convert back.  */
-   loop1->nb_iterations_estimate
- = MAX sreal)loop1->nb_iterations_estimate.to_shwi () + 1) * 
scale
-/ loop1_prob.to_sreal ()).to_nearest_int () - 1, 0);
+   if (div1 != 0)
+ loop1->nb_iterations_estimate
+   = MAX sreal)loop1->nb_iterations_estimate.to_shwi () + 1)
+  * scale / div1).to_nearest_int () - 1, 0);
+   else
+ loop1->any_estimate = false;
loop2->nb_iterations_estimate
  = MAX sreal)loop2->nb_iterations_estimate.to_shwi () + 1) * 
scale2
 / profile_probability::very_likely ().to_sreal ())


Fix profile update in duplicate_loop_body_to_header_edge for loops with 0 count_in

2023-08-10 Thread Jan Hubicka via Gcc-patches
Hi,
this patch makes duplicate_loop_body_to_header_edge not drop profile counts to
uninitialized when count_in is 0.  This happens because a profile_probability
computed in a zero count is undefined.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* cfgloopmanip.cc (duplicate_loop_body_to_header_edge): Special case 
loops with
0 iteration count.

diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index b237ad4e8ac..a2ed54a23bb 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -1296,6 +1296,16 @@ duplicate_loop_body_to_header_edge (class loop *loop, 
edge e,
}
   profile_probability prob_pass_wont_exit =
  new_count_le.probability_in (count_in);
+  /* If profile count is 0, the probability will be uninitialized.
+We can set probability to any initialized value to avoid
+precision loss.  If profile is sane, all counts will be 0 anyway.  */
+  if (!count_in.nonzero_p ())
+   {
+ prob_pass_thru
+ = profile_probability::always ().apply_scale (1, 2);
+ prob_pass_wont_exit
+ = profile_probability::always ().apply_scale (1, 2);
+   }
 
   scale_step = XNEWVEC (profile_probability, ndupl);
 
@@ -1306,7 +1316,9 @@ duplicate_loop_body_to_header_edge (class loop *loop, 
edge e,
 
   /* Complete peeling is special as the probability of exit in last
 copy becomes 1.  */
-  if (flags & DLTHE_FLAG_COMPLETTE_PEEL)
+  if (!count_in.nonzero_p ())
+   ;
+  else if (flags & DLTHE_FLAG_COMPLETTE_PEEL)
{
  profile_count wanted_count = e->count ();
 


Fix profile updating bug in tree-ssa-threadupdate

2023-08-10 Thread Jan Hubicka via Gcc-patches
Hi,
ssa_fix_duplicate_block_edges later calls update_profile to correct the profile
after threading.  In the testcase this does not work since we lose track of the
duplicated edge.  This happens because redirect_edge_and_branch returns NULL if
the edge already has the correct destination, which is the case here.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* tree-ssa-threadupdate.cc (ssa_fix_duplicate_block_edges): Fix profile 
update.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/phi_on_compare-1.c: Check profile consistency.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-1.c
index be504ddb11a..923497f7896 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Ofast -fdump-tree-dom2" } */
+/* { dg-options "-Ofast -fdump-tree-dom2 -fdump-tree-optimized-details-blocks" 
} */
 
 void g (int);
 void g1 (int);
@@ -33,3 +33,4 @@ f (long a, long b, long c, long d, long x)
optimization in the backward threader before killing the forward
threader.  Similarly for the other phi_on_compare-*.c tests.  */
 /* { dg-final { scan-tree-dump-times "Removing basic block" 1 "dom2" } } */
+/* { dg-final { scan-tree-dump-not "Invalid sum" "optimized" } } */
diff --git a/gcc/tree-ssa-threadupdate.cc b/gcc/tree-ssa-threadupdate.cc
index d5416b21a78..a5b9a002a8a 100644
--- a/gcc/tree-ssa-threadupdate.cc
+++ b/gcc/tree-ssa-threadupdate.cc
@@ -1059,14 +1059,19 @@ ssa_fix_duplicate_block_edges (struct redirection_data 
*rd,
 threading path.  */
  if (!any_remaining_duplicated_blocks (path, i))
{
- e2 = redirect_edge_and_branch (victim, elast->dest);
- /* If we redirected the edge, then we need to copy PHI arguments
-at the target.  If the edge already existed (e2 != victim
-case), then the PHIs in the target already have the correct
-arguments.  */
- if (e2 == victim)
-   copy_phi_args (e2->dest, elast, e2,
-  path, multi_incomings ? 0 : i);
+ if (victim->dest != elast->dest)
+   {
+ e2 = redirect_edge_and_branch (victim, elast->dest);
+ /* If we redirected the edge, then we need to copy PHI 
arguments
+at the target.  If the edge already existed (e2 != victim
+case), then the PHIs in the target already have the correct
+arguments.  */
+ if (e2 == victim)
+   copy_phi_args (e2->dest, elast, e2,
+  path, multi_incomings ? 0 : i);
+   }
+ else
+   e2 = victim;
}
  else
{


Fix undefined behaviour in profile_count::differs_from_p

2023-08-10 Thread Jan Hubicka via Gcc-patches
Hi,
This patch avoids overflow in profile_count::differs_from_p and also makes it
return true when exactly one of the values is undefined (previously it returned
false whenever either value was undefined).

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* profile-count.cc (profile_count::differs_from_p): Fix overflow and
handling of undefined values.

diff --git a/gcc/profile-count.cc b/gcc/profile-count.cc
index e63c9432388..a14f379db8f 100644
--- a/gcc/profile-count.cc
+++ b/gcc/profile-count.cc
@@ -128,13 +128,14 @@ profile_count::differs_from_p (profile_count other) const
 {
   gcc_checking_assert (compatible_p (other));
   if (!initialized_p () || !other.initialized_p ())
-return false;
+return initialized_p () != other.initialized_p ();
   if ((uint64_t)m_val - (uint64_t)other.m_val < 100
   || (uint64_t)other.m_val - (uint64_t)m_val < 100)
 return false;
   if (!other.m_val)
 return true;
-  int64_t ratio = (int64_t)m_val * 100 / other.m_val;
+  uint64_t ratio;
+  safe_scale_64bit (m_val, 100, other.m_val, );
   return ratio < 99 || ratio > 101;
 }
 


Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Jan Hubicka via Gcc-patches
> On Thu, Aug 10, 2023 at 9:42 AM Uros Bizjak  wrote:
> >
> > On Thu, Aug 10, 2023 at 9:40 AM Richard Biener
> >  wrote:
> > >
> > > On Thu, Aug 10, 2023 at 3:13 AM liuhongt  wrote:
> > > >
> > > > Currently we have 3 different independent tunes for gather
> > > > "use_gather,use_gather_2parts,use_gather_4parts",
> > > > similar for scatter, there're
> > > > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> > > >
> > > > The patch support 2 standardizing options to enable/disable
> > > > vectorization for all gather/scatter instructions. The options is
> > > > interpreted by driver to 3 tunes.
> > > >
> > > > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > > > Ok for trunk?
> > >
> > > I think -mgather/-mscatter are too close to -mfma suggesting they
> > > enable part of an ISA but they won't disable the use of intrinsics
> > > or enable gather/scatter on CPUs where the ISA doesn't have them.
> > >
> > > May I suggest to invent a more generic "short-cut" to
> > > -mtune-ctrl=^X, maybe -mdisable=X?  And for gather/scatter
> > > tunables add ^use_gather_any to cover all cases?  (or
> > > change what use_gather controls - it seems we changed its
> > > meaning before, and instead add use_gather_8parts and
> > > use_gather_16parts)
> > >
> > > That is, what's the point of this?
> >
> > https://www.phoronix.com/review/downfall
> >
> > that caused:
> >
> > https://www.phoronix.com/review/intel-downfall-benchmarks
> 
> Yes, I know.  But there's -mtune-ctl= doing the trick.
> GCC 11 had only 'use_gather', covering all number of lanes.  I suggest
> to resurrect that behavior and add use_gather_8+parts (or two, IIRC
> gather works only on SI/SFmode or larger).
> 
> Then -mtune-ctl=^use_gather works which I think is nice enough?

-mtune-ctrl is really intended for GCC developers.  It is not backward
compatible or fully documented, and bad sets of values may trigger ICEs.
If gathers became very slow, I think normal users may want to disable
them, and in such a situation a specialized command-line option makes sense
to me.

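For users the intended spelling would then be something along these lines
(assuming the option names from the patch under discussion):

  # disable gather/scatter vectorization as a Downfall-style mitigation
  gcc -O3 -mno-gather -mno-scatter foo.c

  # roughly what the driver would expand it to via the existing tunables
  gcc -O3 -mtune-ctrl=^use_gather,^use_gather_2parts,^use_gather_4parts foo.c
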
Honza
> 
> Richard.
> 
> > Uros.


Fix profile update after versioning ifconverted loop

2023-08-06 Thread Jan Hubicka via Gcc-patches
Hi,
If a loop is if-converted and later versioned by the vectorizer, the vectorizer
will reuse the scalar loop produced by if-conversion.  Curiously enough it does
not seem to do so for versions produced by loop distribution, even though for
loop distribution this matters (since both ldist versions survive to the final
code), while after ifcvt it does not (since we remove the non-vectorized path).

This patch fixes the associated profile update.  Here it is necessary to scale
both arms of the conditional according to the runtime checks inserted.  We got
the loop body partly right, but not the preheader block and the block after the
exit.  The former is particularly bad since it changes loop iteration estimates.

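A sketch of the kind of update needed (hypothetical helper with assumed names;
the real change lives in the vectorizer's versioning code):

  /* After versioning, PROB is the probability that the runtime checks pass
     and the vectorized copy is entered.  Both copies, including their
     preheaders, need to be scaled, otherwise iteration estimates derived
     from the preheader counts become wrong.  */
  static void
  scale_versioned_copies (class loop *vect_loop, class loop *scalar_loop,
			  profile_probability prob)
  {
    scale_loop_profile (vect_loop, prob, -1);
    scale_loop_profile (scalar_loop, prob.invert (), -1);
    basic_block vect_pre = loop_preheader_edge (vect_loop)->src;
    basic_block scalar_pre = loop_preheader_edge (scalar_loop)->src;
    vect_pre->count = vect_pre->count.apply_probability (prob);
    scalar_pre->count = scalar_pre->count.apply_probability (prob.invert ());
  }
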
So we now turn 4 original loops:
  loop 1: iterations by profile: 473.497707 (reliable) entry count:84821 
(precise, freq 0.9979)
  loop 2: iterations by profile: 100.00 (reliable) entry count:39848881 
(precise, freq 468.8104)
  loop 3: iterations by profile: 100.00 (reliable) entry count:39848881 
(precise, freq 468.8104)
  loop 4: iterations by profile: 100.999596 (reliable) entry count:84167 
(precise, freq 0.9902)

Into following loops
  iterations by profile: 5.312499 (unreliable, maybe flat) entry count:12742188 
(guessed, freq 149.9081)
 vectorized and split loop 1, peeled
  iterations by profile: 0.009496 (unreliable, maybe flat) entry count:374798 
(guessed, freq 4.4094)
 split loop 1 (last iteration), peeled
  iterations by profile: 100.08 (unreliable) entry count:3945039 (guessed, 
freq 46.4122)
 scalar version of loop 1
  iterations by profile: 100.07 (unreliable) entry count:7101070 (guessed, 
freq 83.5420)
 redundant scalar version of loop 1 which we could eliminate if vectorizer 
understood ldist
  iterations by profile: 100.00 (unreliable) entry count:35505353 (guessed, 
freq 417.7100)
 unvectorized loop 2
  iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, 
freq 300.7512)
 vectorized loop 2, not peeled (hits max-peel-insns)
  iterations by profile: 100.07 (unreliable) entry count:7101070 (guessed, 
freq 83.5420)
 unvectorized loop 3
  iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, 
freq 300.7512)
 vectorized loop 3, not peeled (hits max-peel-insns)
  iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 
0.9979)
 loop 1
  iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 
0.9902)
 loop 4

With this change we are at 0 profile errors on the hmmer benchmark:

Pass dump id |dynamic mismatch  |overall  |
 |in count  |size|time|
172t ch_vect |0 |  996   | 385812023346   |
173t ifcvt   | 71010686+71010686| 1021  +2.5%| 468361969416 +21.4%|
174t vect|210830784   +139820098| 1497 +46.6%| 216073467874 -53.9%|
175t dce |210830784 | 1387  -7.3%| 205273170281  -5.0%|
176t pcom|210830784 | 1387   | 201722634966  -1.7%|
177t cunroll |0   -210830784| 1443  +4.0%| 180441501289 -10.5%|
182t ivopts  |0 | 1385  -4.0%| 136412345683 -24.4%|
183t lim |0 | 1389  +0.3%| 135093950836  -1.0%|
192t reassoc |0 | 1381  -0.6%| 134778347700  -0.2%|
193t slsr|0 | 1380  -0.1%| 134738100330  -0.0%|
195t tracer  |0 | 1521 +10.2%| 134738179146  +0.0%|
196t fre |  2680654 +2680654| 1489  -2.1%| 134659672725  -0.1%|
198t dom |  5361308 +2680654| 1473  -1.1%| 134449553658  -0.2%|
201t vrp |  5361308 | 1474  +0.1%| 134489004050  +0.0%|
202t ccp |  5361308 | 1472  -0.1%| 134440752274  -0.0%|
204t dse |  5361308 | 1444  -1.9%| 133802300525  -0.5%|
206t forwprop|  5361308 | 1433  -0.8%| 133542828370  -0.2%|
207t sink|  5361308 | 1431  -0.1%| 133542658728  -0.0%|
211t store-me|  5361308 | 1430  -0.1%| 133542573728  -0.0%|
212t cddce   |  5361308 | 1428  -0.1%| 133541776728  -0.0%|
258r expand  |  5361308 |||
260r into_cfg|  5361308 | 9334  -0.8%| 885820707913  -0.6%|
261r jump|  5361308 | 9330  -0.0%| 885820367913  -0.0%|
265r fwprop1 |  5361308 | 9206  -1.3%| 876756504385  -1.0%|
267r rtl pre |  5361308 | 9210  +0.0%| 876914305953  +0.0%|
269r cprop   |  5361308 | 9202  -0.1%| 876756165101  -0.0%|
271r cse_loca|  5361308 | 9198  -0.0%| 876727760821  -0.0%|
272r ce1 |  5361308 | 9126  -0.8%| 875726815885  -0.1%|
276r loop2_in|  5361308 | 

Fix profile update after peeled epilogues

2023-08-06 Thread Jan Hubicka via Gcc-patches
Hi,
Epilogue peeling expects the scalar loop to have the same number of executions
as the vector loop, which is true at the beginning of vectorization.  However,
if the epilogues are vectorized, this is no longer the case.  In this situation
the loop preheader is replaced by new guard code with a correct profile, however
the loop body is left unscaled.  This leads to a loop that exits more often than
it is entered.

This patch adds logic to scale the frequencies down and also to fix the profile
of the original preheader where necessary.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* tree-vect-loop-manip.cc (vect_do_peeling): Fix profile update of 
peeled epilogues.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-bitfield-read-1.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-2.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-3.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-4.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-5.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-6.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-7.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-write-1.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-write-2.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-write-3.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-write-4.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-write-5.c: Check profile consistency.
* gcc.dg/vect/vect-epilogues-2.c: Check profile consistency.
* gcc.dg/vect/vect-epilogues.c: Check profile consistency.
* gcc.dg/vect/vect-mask-store-move-1.c: Check profile consistency.

diff --git a/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-1.c 
b/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-1.c
index 42e50d9f0c8..147c959568d 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-1.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-1.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-fdump-tree-optimized-details-blocks" } */
 /* { dg-require-effective-target vect_int } */
 /* { dg-require-effective-target vect_shift } */
 
@@ -39,3 +40,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-not "Invalid sum" "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-2.c 
b/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-2.c
index a9aeefcd72c..982e6a7967b 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-2.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-2.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-fdump-tree-optimized-details-blocks" } */
 /* { dg-require-effective-target vect_shift } */
 /* { dg-require-effective-target vect_long_long } */
 
@@ -42,3 +43,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-not "Invalid sum" "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-3.c 
b/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-3.c
index c7d0fd26bad..f2a43c39f50 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-3.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-3.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-fdump-tree-optimized-details-blocks" } */
 /* { dg-require-effective-target vect_int } */
 /* { dg-require-effective-target vect_shift } */
 
@@ -43,3 +44,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-not "Invalid sum" "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-4.c 
b/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-4.c
index 6a3ed8c0c6f..9f6f0220664 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-4.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-4.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-fdump-tree-optimized-details-blocks" } */
 /* { dg-require-effective-target vect_shift } */
 /* { dg-require-effective-target vect_long_long } */
 
@@ -44,3 +45,4 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 
+/* { dg-final { scan-tree-dump-not "Invalid sum" "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-5.c 
b/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-5.c
index b2889df8a0a..662aed104cf 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-5.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-bitfield-read-5.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-fdump-tree-optimized-details-blocks" } */
 /* { dg-require-effective-target vect_int } */
 /* { dg-require-effective-target vect_shift } */
 
@@ -41,3 +42,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-not "Invalid sum" "optimized" } } */
diff --git 

Re: Disable loop distribution for loops with estimated iterations 0

2023-08-04 Thread Jan Hubicka via Gcc-patches
> On Fri, Aug 4, 2023 at 9:16 AM Jan Hubicka via Gcc-patches
>  wrote:
> >
> > Hi,
> > this prevents useless loop distribution produced in hmmer.  With FDO we now
> > correctly work out that the loop created for the last iteration is not going
> > to iterate; however, loop distribution still produces a versioned loop that
> > has no chance to survive the loop vectorizer, since we only keep distributed
> > loops when loop vectorization succeeds, and it requires the number of
> > (header) iterations to exceed the vectorization factor.
> >
> > Bootstrapped/regtested x86_64-linux, OK?
> 
> OK.
> 
> But why does optimize_loop_for_speed_p () return true when the loop
> isn't expected to iterate?  Wouldn't that be a better place to fix this
> and similar issues in other places then?

optimize_loop_for_speed_p checks whether the loop header is considered
hot, so we want to get it running fast.  I think it is up to each loop
transform to decide whether it helps loops with low iteration counts, or
high iteration counts, or both.  Loop peeling and copy-header are passes
that do help low iteration count loops.  I think we have more.

For example, I wondered if I should also disable splitting, but I think
moving the conditional out of the loop will likely help even if the loop
has a small trip count.

I briefly looked at which passes already have a cost model based on the
iteration estimate.  I guess we should also tame down invariant motion and
perhaps others.

Honza
> 
> Thanks,
> Richard.
> 
> > gcc/ChangeLog:
> >
> > * tree-loop-distribution.cc (loop_distribution::execute): Disable
> > distribution for loops with estimated iterations 0.
> >
> > diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
> > index cf7c197aaf7..8ff2108f284 100644
> > --- a/gcc/tree-loop-distribution.cc
> > +++ b/gcc/tree-loop-distribution.cc
> > @@ -3871,10 +3871,20 @@ loop_distribution::execute (function *fun)
> >
> >   bool destroy_p;
> >   int nb_generated_loops, nb_generated_calls;
> > + bool only_patterns = !optimize_loop_for_speed_p (loop)
> > +  || !flag_tree_loop_distribution;
> > + /* do not try to distribute loops that are not expected to 
> > iterate.  */
> > + if (!only_patterns)
> > +   {
> > + HOST_WIDE_INT iterations = estimated_loop_iterations_int 
> > (loop);
> > + if (iterations < 0)
> > +   iterations = likely_max_loop_iterations_int (loop);
> > + if (!iterations)
> > +   only_patterns = true;
> > +   }
> >   nb_generated_loops
> > = distribute_loop (loop, work_list, cd, _generated_calls,
> > -  _p, (!optimize_loop_for_speed_p 
> > (loop)
> > -   || 
> > !flag_tree_loop_distribution));
> > +  _p, only_patterns);
> >   if (destroy_p)
> > loops_to_be_destroyed.safe_push (loop);
> >


Re: Fix profile update after vectorizer peeling

2023-08-04 Thread Jan Hubicka via Gcc-patches
Hi,
so I found the problem.  We duplicate multiple paths and end up with:

;; basic block 6, loop depth 0, count 365072224 (estimated locally, freq 0.3400)
;;  prev block 12, next block 7, flags: (NEW, REACHABLE, VISITED)
;;  pred:   4 [never (guessed)]  count:0 (estimated locally, freq 0.) 
(TRUE_VALUE,EXECUTABLE)
;;  10 [always]  count:365072224 (estimated locally, freq 0.3400) 
(FALLTHRU,EXECUTABLE)
# _18 = PHI <0(4), 0(10)>
# d_39 = PHI 
if (_18 == 0)
  goto ; [97.06%]
else
  goto ; [2.94%]
;;  succ:   8 [97.1% (guessed)]  count:354334801 (estimated locally, freq 
0.3300) (TRUE_VALUE,EXECUTABLE)
;;  7 [2.9% (guessed)]  count:10737423 (estimated locally, freq 
0.0100) (FALSE_VALUE,EXECUTABLE)

Here the goto to bb 7 is never taken but the profile is wrong.

Before threading we have a chain of conditionals:

  __asm__("pushf{l|d}
pushf{l|d}
pop{l}  %0
mov{l}  {%0, %1|%1, %0}
xor{l}  {%2, %0|%0, %2}
push{l} %0
popf{l|d}
pushf{l|d}
pop{l}  %0
popf{l|d}
" : "=" __eax_19, "=" __ebx_20 : "i" 2097152);
  _21 = __eax_19 ^ __ebx_20;
  _22 = _21 & 2097152;
  if (_22 == 0)
goto ; [34.00%]
  else
goto ; [66.00%]
  
   [local count: 708669602 freq: 0.66]:
  __asm__ __volatile__("cpuid
" : "=a" __eax_24, "=b" __ebx_25, "=c" __ecx_26, "=d" __edx_27 : "0" 0);
  
   [local count: 1073741826 freq: 1.00]:
  # _33 = PHI <0(2), __eax_24(3)> 
  _16 = _33 == 0;
  if (_33 == 0)
goto ; [34.00%]
  else
goto ; [66.00%]

   [local count: 708669600 freq: 0.66]:
  __asm__ __volatile__("cpuid
" : "=a" a_44, "=b" b_45, "=c" c_46, "=d" d_47 : "0" 1, "2" 0);

   [local count: 1073741824 freq: 1.00]:
  # _18 = PHI <0(4), 1(5)>
  # d_39 = PHI 
  if (_18 == 0)
goto ; [33.00%]
  else
goto ; [67.00%]


If the first _22 == 0, then also _33 == 0 and _18 == 0, but the last case has
probability 33% while the first has 34%, so the profile guess is not
consistent with the threaded path.  So threading is right to end up with a
profile inconsistency, but it should print the reason for doing it.

One option is to disable the optimization for the check.  The other option is
to get the first conditional predicted right.
Would this be OK?

gcc/ChangeLog:

* config/i386/cpuid.h: Mark CPUs not supporting cpuid as unlikely.

diff --git a/gcc/config/i386/cpuid.h b/gcc/config/i386/cpuid.h
index 03fd6fc9478..9c768ac0b6d 100644
--- a/gcc/config/i386/cpuid.h
+++ b/gcc/config/i386/cpuid.h
@@ -295,7 +295,7 @@ __get_cpuid_max (unsigned int __ext, unsigned int *__sig)
   : "i" (0x0020));
 #endif
 
-  if (!((__eax ^ __ebx) & 0x0020))
+  if (__builtin_expect (!((__eax ^ __ebx) & 0x0020), 0))
 return 0;
 #endif
 


Disable loop distribution for loops with estimated iterations 0

2023-08-04 Thread Jan Hubicka via Gcc-patches
Hi,
this prevents useless loop distribution produced in hmmer.  With FDO we now
correctly work out that the loop created for the last iteration is not going to
iterate; however, loop distribution still produces a versioned loop that has no
chance to survive the loop vectorizer, since we only keep distributed loops
when loop vectorization succeeds, and it requires the number of (header)
iterations to exceed the vectorization factor.

Bootstrapped/regtested x86_64-linux, OK?

gcc/ChangeLog:

* tree-loop-distribution.cc (loop_distribution::execute): Disable
distribution for loops with estimated iterations 0.

diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
index cf7c197aaf7..8ff2108f284 100644
--- a/gcc/tree-loop-distribution.cc
+++ b/gcc/tree-loop-distribution.cc
@@ -3871,10 +3871,20 @@ loop_distribution::execute (function *fun)
 
  bool destroy_p;
  int nb_generated_loops, nb_generated_calls;
+ bool only_patterns = !optimize_loop_for_speed_p (loop)
+  || !flag_tree_loop_distribution;
+ /* do not try to distribute loops that are not expected to iterate.  
*/
+ if (!only_patterns)
+   {
+ HOST_WIDE_INT iterations = estimated_loop_iterations_int (loop);
+ if (iterations < 0)
+   iterations = likely_max_loop_iterations_int (loop);
+ if (!iterations)
+   only_patterns = true;
+   }
  nb_generated_loops
= distribute_loop (loop, work_list, cd, _generated_calls,
-  _p, (!optimize_loop_for_speed_p (loop)
-   || !flag_tree_loop_distribution));
+  _p, only_patterns);
  if (destroy_p)
loops_to_be_destroyed.safe_push (loop);
 


Re: Fix profile update after vectorizer peeling

2023-08-04 Thread Jan Hubicka via Gcc-patches
> >
> > A couple cycles ago I separated most of code to distinguish between the
> > back and forward threaders.  There is class jt_path_registry that is
> > common to both, and {fwd,back}_jt_path_registry for the forward and
> > backward threaders respectively.  It's not perfect, but it's a start.
> 
> Yep, it's back_jt_path_registry::update_cfg / duplicate_thread_path
> that lacks the updates.

duplicate_thread_path has a profile update (using
profile_bb_update_for_threading and
scale_bbs_frequencies_profile_count).  It will however silently keep the
profile misupdated if the CFG was originally inconsistent with the
threaded path (in which case it is intended to keep the profile
inconsistent, but we should have it logged so we know it is "okay after
all").  I will add logging, the same as in profile_bb_update_for_threading, so
these things are easier to figure out.

What happens in the test is that we have a __builtin_constant_p that
blocks early threading, and we thread only after the profile is constructed.
I did not check by hand whether the original profile is guessed
inconsistently.

Honza
> 
> Richard.
> 
> > Aldy
> >


Update estimated iteration counts after splitting

2023-08-03 Thread Jan Hubicka via Gcc-patches
Hi,
Hmmer's internal function has 4 loops.  The following is the profile at start:

  loop 1:
  estimate 472
  iterations by profile: 473.497707 (reliable) count in:84821 (precise, freq 
0.9979)

loop 2:
estimate 99
iterations by profile: 100.00 (reliable) count in:39848881 (precise, 
freq 468.8104)

loop 3:
estimate 99
iterations by profile: 100.00 (reliable) count in:39848881 (precise, 
freq 468.8104)

  loop 4:
  estimate 100
  iterations by profile: 100.999596 (reliable) execution count:84167 (precise, 
freq 0.9902)

So the first loop is the outer loop and the second/third loops are nested.
The fourth loop is not critical.
Precise iteration counts are unknown (473 and 100 come from the profile).
The nested loop has the following form:

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1]   + tpmm[k-1];
  if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
  if ((sc = xmb  + bp[k]) > mc[k])  mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;

  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
ic[k] = mpp[k] + tpmi[k];
if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
ic[k] += is[k];
if (ic[k] < -INFTY) ic[k] = -INFTY;
  }

We do quite some belly dancing here.
 1) loop-ch slightly misupdates the profile, so the estimate of 99
does not match the profile estimate of 100.
 2) loop-split splits on if (k < M) and produces two loops (see the
sketch after this list).  It fails to notice that the second loop
never iterates.  It used to misupdate the profile a lot, which later
caused the internal loop to become cold.  This is fixed now.
 3) loop-dist introduces runtime aliasing checks for both loops
 4) tree vectorizer vectorizes some of the copies of the loop produces
and drops expected iteration counts
 5) loop peeling peels the loops with expected low iteration counts
 6) complete loop unrolling kills some loops in prologues/epilogues.
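
Roughly what step 2 does to the loop above (hand-written sketch of the split,
not compiler output):

  /* First loop: runs while the split condition k < M still holds,
     i.e. for k = 1 .. M-1, with the full body including the ic[k] part.  */
  for (k = 1; k < M; k++)
    {
      /* ... mc[k], dc[k] and ic[k] updates ... */
    }
  /* Second loop: the remaining iterations with k >= M, i.e. only k == M,
     so it never iterates more than once; the guarded ic[k] part is gone.
     This is the copy whose profile used to be misupdated.  */
  for (; k <= M; k++)
    {
      /* ... mc[k] and dc[k] updates only ... */
    }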

We end up with quite many loops and run out of registers:

  iterations by profile: 5.312499 (unreliable, maybe flat)
this is vectorized internal loops after loop peeling

  iterations by profile: 0.009495 (unreliable, maybe flat)
  iterations by profile: 0.009495 (unreliable, maybe flat)
  iterations by profile: 0.009495 (unreliable, maybe flat)
  iterations by profile: 0.009495 (unreliable, maybe flat)
Those are all versioned/peeled and vectorized variants of the loop never 
looping

  iterations by profile: 100.08 (unreliable)
  iterations by profile: 100.00 (unreliable)
Those are variants with failed aliasing checks

  iterations by profile: 9.662853 (unreliable, maybe flat)
  iterations by profile: 4.646072 (unreliable)
  iterations by profile: 100.07 (unreliable)
  iterations by profile: 5.312500 (unreliable)
  iterations by profile: 473.497707 (reliable)
This is loop 1

  iterations by profile: 100.999596 (reliable)
This is the loop 4.

This patch fixes loop iteration estimate update after loop split so we get:

  iterations by profile: 5.312499 (unreliable, maybe flat) entry count:12742188 
(guessed, freq 149.9081)
This is the remainder of the peeled vectorized loop 2.  It is missing an
estimate, which is correct since after peeling it 6 times it is essentially
impossible to tell what the remaining loop profile is (without histograms).

  iterations by profile: 0.009496 (unreliable, maybe flat) entry count:374801 
(guessed, freq 4.4094)
Peeled split part of loop 2 (one that never loops).  We ought to work this 
out
but at least w

  estimate 99
  iterations by profile: 100.08 (unreliable) entry count:3945039 (guessed, 
freq 46.4122)
  estimate 99
  iterations by profile: 100.00 (unreliable) entry count:35505353 (guessed, 
freq 417.7100)

  estimate 99
  iterations by profile: 9.662853 (unreliable, maybe flat) entry count:35505353 
(guessed, freq 417.7100)
Profile here mismatches estimate - I will need to work out why.

  estimate 5
  iterations by profile: 4.646072 (unreliable) entry count:31954818 (guessed, 
freq 375.9390)
This is vectorized but not peeled loop 3
  estimate 99
  iterations by profile: 100.07 (unreliable) entry count:7101070 (guessed, 
freq 83.5420)
Unvectorized variant of loop 3
  estimate 5
  iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, 
freq 300.7512)
Another vectorized variant of loop 3
  estimate 472
  iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 
0.9979)
Outer loop

  estimate 100
  iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 
0.9902)
loop 4, not vectorized/peeled

So there is still work to do on this testcase, but with the patch we prevent 3 
useless loops.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* tree-ssa-loop-split.cc 

Fix profiledbootstrap

2023-08-03 Thread Jan Hubicka via Gcc-patches
Hi,
Profiledbootstrap fails with an ICE in update_loop_exit_probability_scale_dom_bbs
called from loop unrolling.
The reason is that in relatively rare situations we may run into a case where a
loop has multiple exits and all are considered likely, but then we scale down
the profile and one of the exits becomes unlikely.

We pass around unadjusted_exit_count to scale the exit probability correctly.
In this case we may end up using an uninitialized value, and the profile_count
type intentionally bombs on that.

Profiledbootstrapped x86_64-linux, committed.

gcc/ChangeLog:

PR bootstrap/110857
* cfgloopmanip.cc (scale_loop_profile): (Un)initialize
unadjusted_exit_count.

diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index 86360b5f380..b237ad4e8ac 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -742,7 +742,7 @@ scale_loop_profile (class loop *loop, profile_probability p,
   /* In a consistent profile unadjusted_exit_count should be same as count_in,
  however to preserve as much of the original info, avoid recomputing
  it.  */
-  profile_count unadjusted_exit_count;
+  profile_count unadjusted_exit_count = profile_count::uninitialized ();
   if (exit_edge)
 unadjusted_exit_count = exit_edge->count ();
   scale_loop_frequencies (loop, scale_prob);


Re: Fix profile update after vectorizer peeling

2023-08-03 Thread Jan Hubicka via Gcc-patches
> > Jeff, an help would be appreciated here :)
> > 
> > I will try to debug this.  One option would be to disable branch
> > prediciton on vect_check for time being - it is not inlined anyway
> Not a lot of insight.  The backwards threader uses a totally different API
> for the CFG/SSA updates and that API I don't think has made any significant
> effort to keep the profile up-to-date.

OK, at least some hints on where the missing profile update should be would
be good.  There is update_profile in tree-ssa-threadupdate, and
understanding what is missing would be nice.
In general it would be nice to mind the profile when updating the CFG :)

Honza
> 
> Jeff


Re: [PATCH] ipa-sra: Don't consider CLOBBERS as writes preventing splitting

2023-08-03 Thread Jan Hubicka via Gcc-patches
> On Mon, Jul 31, 2023 at 7:05 PM Martin Jambor  wrote:
> >
> > Hi,
> >
> > when IPA-SRA detects whether a parameter passed by reference is
> > written to, it does not special case CLOBBERs which means it often
> > bails out unnecessarily, especially when dealing with C++ destructors.
> > Fixed by the obvious continue in the two relevant loops.
> >
> > The (slightly) more complex testcases in the PR need surprisingly more
> > effort but the simple one can be fixed now easily by this patch and I'll
> > work on the others incrementally.
> >
> > Bootstrapped and currently undergoing testsuite run on x86_64-linux.  OK
> > if it passes too?
> 
> LGTM, btw - how are the clobbers handled during transform?

Looks good to me too.  I was also wondering if we want to preserve
something about the clobber.  If SRA fully succeeds it would not be
needed, but if the original location is not fully SRAed we may
theoretically lose information.  We put an additional clobber after the
destructor call, so one would need to wrap it in a non-destructor and be
sure that ipa-modref understands the clobber in order to obtain a
testcase.

Honza


Re: Fix profile update after vectorizer peeling

2023-08-03 Thread Jan Hubicka via Gcc-patches
> > 
> > Note most of the profile consistency checks FAIL when testing with -m32 on
> > x86_64-unknown-linux-gnu ...
> > 
> > For example vect-11.c has
> > 
> > ;;   basic block 4, loop depth 0, count 719407024 (estimated locally,
> > freq 0.6700), maybe hot
> > ;;   Invalid sum of incoming counts 708669602 (estimated locally, freq
> > 0.6600), should be 719407024 (estimated locally, freq 0.6700)
> > ;;prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
> > ;;pred:   3 [always (guessed)]  count:708669602 (estimated
> > locally, freq 0.6600) (FALSE_VALUE,EXECUTABLE)
> >   __asm__ __volatile__("cpuid
> > " : "=a" a_44, "=b" b_45, "=c" c_46, "=d" d_47 : "0" 1, "2" 0);
> >   _3 = d_47 & 67108864;
> > 
> > so it looks like it's the check_vect () function that goes wrong
> > everywhere but only on i?86.
> > The first dump with the Invalid sum is 095t.fixup_cfg3 already.
> 
> Sorry for that, looks like missing/undetected noreturn.  I will take a look.

The mismatch at fixup_cfg3 is harmless since we repropagate frequencies
later now.  The misupdate is caused by jump threading:

vect-11.c.102t.adjust_alignment:;;   Invalid sum of incoming counts 354334800 
(estimated locally, freq 0.3300), should be 233860966 (estimated locally, freq 
0.2178)
vect-11.c.102t.adjust_alignment:;;   Invalid sum of incoming counts 354334800 
(estimated locally, freq 0.3300), should be 474808634 (estimated locally, freq 
0.4422)
vect-11.c.107t.rebuild_frequencies1
vect-11.c.116t.threadfull1:;;   Invalid sum of incoming counts 708669600 
(estimated locally, freq 0.6600), should be 719407024 (estimated locally, freq 
0.6700)

I know that there are problems left in the profile threading update.  It was the
main pass disturbing the profile until GCC 13 and now works for basic
testcases, but not always.  I already spent quite some time trying to
figure out what is wrong with profile threading (PR103680), so at least
this is a small testcase.

Jeff, any help would be appreciated here :)

I will try to debug this.  One option would be to disable branch
prediction on vect_check for the time being - it is not inlined anyway.

diff --git a/gcc/testsuite/gcc.dg/vect/tree-vect.h 
b/gcc/testsuite/gcc.dg/vect/tree-vect.h
index c4b81441216..544be31be78 100644
--- a/gcc/testsuite/gcc.dg/vect/tree-vect.h
+++ b/gcc/testsuite/gcc.dg/vect/tree-vect.h
@@ -20,7 +20,7 @@ sig_ill_handler (int sig)
   exit(0);
 }
 
-static void __attribute__((noinline))
+static void __attribute__((noinline,optimize(0)))
 check_vect (void)
 {
   signal(SIGILL, sig_ill_handler);

Honza


Re: Fix profile update after vectorizer peeling

2023-08-03 Thread Jan Hubicka via Gcc-patches
> 
> Note most of the profile consistency checks FAIL when testing with -m32 on
> x86_64-unknown-linux-gnu ...
> 
> For example vect-11.c has
> 
> ;;   basic block 4, loop depth 0, count 719407024 (estimated locally,
> freq 0.6700), maybe hot
> ;;   Invalid sum of incoming counts 708669602 (estimated locally, freq
> 0.6600), should be 719407024 (estimated locally, freq 0.6700)
> ;;prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
> ;;pred:   3 [always (guessed)]  count:708669602 (estimated
> locally, freq 0.6600) (FALSE_VALUE,EXECUTABLE)
>   __asm__ __volatile__("cpuid
> " : "=a" a_44, "=b" b_45, "=c" c_46, "=d" d_47 : "0" 1, "2" 0);
>   _3 = d_47 & 67108864;
> 
> so it looks like it's the check_vect () function that goes wrong
> everywhere but only on i?86.
> The first dump with the Invalid sum is 095t.fixup_cfg3 already.

Sorry for that, it looks like a missing/undetected noreturn.  I will take a look.

Honza


Re: [PATCH] Swap loop splitting and final value replacement

2023-08-03 Thread Jan Hubicka via Gcc-patches
> The following swaps the loop splitting pass and the final value
> replacement pass to avoid keeping the IV of the earlier loop
> live when not necessary.  The existing gcc.target/i386/pr87007-5.c
> testcase shows that we otherwise fail to elide an empty loop
> later.  I don't see any good reason why loop splitting would need
> final value replacement, all exit values honor the constraints
> we place on loop header PHIs automatically.
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu, I plan
> to install this if testing succeeds.
Thanks!  I was just looking into the same thing.  This should let us turn the
split loop into a non-loop for hmmer.

Honza


Fix profile update after cancelled loop distribution

2023-08-02 Thread Jan Hubicka via Gcc-patches
Hi,
Loop distribution and ifcvt introduce versions of loops which may be removed
later if vectorization fails.  Ifcvt does this by temporarily breaking the profile
and producing a conditional that has two arms with 100% probability because we
know one of the versions will be removed.

Loop distribution is trickier, since it introduces a test for alignment that
either survives to the final code if vectorization succeeds or is removed if it
fails.

Here we need to assign some reasonable probabilities for the case vectorization
goes well, so this code adds logic to scale the profile back in case we remove the
call.

This is not perfect since we drop precise BB counts to guessed.  It is not a big
deal since we do not rely much on the reliability of bb counts after this point.
Another option would be to apply the scale only if vectorization succeeds, which
however needs a bit more work on the tree-loop-distribution side and would need all
the code in this patch with the small change that fold_loop_internal_call would have
to know how to adjust if the conditional stays.  I decided to go for the easier
solution for now.
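
As a quick sanity check of the rescaling arithmetic used below (a standalone
sketch with made-up counts, not GCC code):

#include <stdio.h>

int
main (void)
{
  /* Made-up example: the guard BB has count 300 and the edge towards the
     distributed loop carried count 200 (probability 2/3).  */
  double src_count = 300.0, taken_edge_count = 200.0;
  /* Once the internal-function call is folded, the edge becomes always
     taken, so blocks dominated by its destination are scaled by
     new_count / old_count = src_count / taken_edge_count.  */
  double scale = src_count / taken_edge_count;
  printf ("dominated blocks scaled by %.2f\n", scale);   /* prints 1.50 */
  return 0;
}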

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* cfg.cc (scale_strictly_dominated_blocks): New function.
* cfg.h (scale_strictly_dominated_blocks): Declare.
* tree-cfg.cc (fold_loop_internal_call): Fixup CFG profile.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr98308.c: Check that profile is consistent.

diff --git a/gcc/cfg.cc b/gcc/cfg.cc
index 0de6d6b9e71..9eb9916f61a 100644
--- a/gcc/cfg.cc
+++ b/gcc/cfg.cc
@@ -1195,3 +1195,27 @@ get_loop_copy (class loop *loop)
   else
 return NULL;
 }
+
+/* Scales the frequencies of all basic blocks that are strictly
+   dominated by BB by NUM/DEN.  */
+
+void
+scale_strictly_dominated_blocks (basic_block bb,
+profile_count num, profile_count den)
+{
+  basic_block son;
+
+  if (!den.nonzero_p () && !(num == profile_count::zero ()))
+return;
+  auto_vec <basic_block> worklist;
+  worklist.safe_push (bb);
+
+  while (!worklist.is_empty ())
+for (son = first_dom_son (CDI_DOMINATORS, worklist.pop ());
+son;
+son = next_dom_son (CDI_DOMINATORS, son))
+  {
+   son->count = son->count.apply_scale (num, den);
+   worklist.safe_push (son);
+  }
+}
diff --git a/gcc/cfg.h b/gcc/cfg.h
index 4bf4263ebfc..a0e944979c8 100644
--- a/gcc/cfg.h
+++ b/gcc/cfg.h
@@ -127,6 +127,8 @@ extern void set_bb_copy (basic_block, basic_block);
 extern basic_block get_bb_copy (basic_block);
 void set_loop_copy (class loop *, class loop *);
 class loop *get_loop_copy (class loop *);
+void scale_strictly_dominated_blocks (basic_block,
+ profile_count, profile_count);
 
 /* Generic RAII class to allocate a bit from storage of integer type T.
The allocated bit is accessible as mask with the single bit set
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index c65af8cc800..c158454946c 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -7703,6 +7703,44 @@ fold_loop_internal_call (gimple *g, tree value)
   FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
SET_USE (use_p, value);
   update_stmt (use_stmt);
+  /* If we turn conditional to constant, scale profile counts.
+We know that the conditional was created by loop distribution
+and all basic blocks dominated by the taken edge are part of
+the loop distributed.  */
+  if (gimple_code (use_stmt) == GIMPLE_COND)
+   {
+ edge true_edge, false_edge;
+ extract_true_false_edges_from_block (gimple_bb (use_stmt),
+  &true_edge, &false_edge);
+ edge taken_edge = NULL, other_edge = NULL;
+ if (gimple_cond_true_p (as_a <gcond *> (use_stmt)))
+   {
+ taken_edge = true_edge;
+ other_edge = false_edge;
+   }
+ else if (gimple_cond_false_p (as_a <gcond *> (use_stmt)))
+   {
+ taken_edge = false_edge;
+ other_edge = true_edge;
+   }
+ if (taken_edge
+ && !(taken_edge->probability == profile_probability::always ()))
+   {
+ profile_count old_count = taken_edge->count ();
+ profile_count new_count = taken_edge->src->count;
+ taken_edge->probability = profile_probability::always ();
+ other_edge->probability = profile_probability::never ();
+ /* If we have multiple predecessors, we can't use the dominance
+test.  This should not happen as the guarded code should
+start with pre-header.  */
+ gcc_assert (single_pred_edge (taken_edge->dest));
+ taken_edge->dest->count
+   = taken_edge->dest->count.apply_scale (new_count,
+  old_count);
+ scale_strictly_dominated_blocks (taken_edge->dest,
+  new_count, old_count);
+   }
+   }

Fix profile update after vectorizer peeling

2023-08-01 Thread Jan Hubicka via Gcc-patches
Hi,
This patch fixes the profile update after constant peeling in the prologue.  We have
now reached 0 profile update bugs on tramp3d vectorization and also on quite a few
testcases, so I am enabling the testsuite checks so we do not regress again.

Bootstrapped/regtested x86_64, committed.

Honza

gcc/ChangeLog:

* tree-vect-loop-manip.cc (vect_do_peeling): Fix profile update after
constant prologue peeling.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-1-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-1.c: Check profile consistency.
* gcc.dg/vect/vect-10-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-10.c: Check profile consistency.
* gcc.dg/vect/vect-100.c: Check profile consistency.
* gcc.dg/vect/vect-103.c: Check profile consistency.
* gcc.dg/vect/vect-104.c: Check profile consistency.
* gcc.dg/vect/vect-105-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-105.c: Check profile consistency.
* gcc.dg/vect/vect-106.c: Check profile consistency.
* gcc.dg/vect/vect-107.c: Check profile consistency.
* gcc.dg/vect/vect-108.c: Check profile consistency.
* gcc.dg/vect/vect-109.c: Check profile consistency.
* gcc.dg/vect/vect-11.c: Check profile consistency.
* gcc.dg/vect/vect-110.c: Check profile consistency.
* gcc.dg/vect/vect-112-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-112.c: Check profile consistency.
* gcc.dg/vect/vect-113.c: Check profile consistency.
* gcc.dg/vect/vect-114.c: Check profile consistency.
* gcc.dg/vect/vect-115.c: Check profile consistency.
* gcc.dg/vect/vect-116.c: Check profile consistency.
* gcc.dg/vect/vect-117.c: Check profile consistency.
* gcc.dg/vect/vect-118.c: Check profile consistency.
* gcc.dg/vect/vect-119.c: Check profile consistency.
* gcc.dg/vect/vect-11a.c: Check profile consistency.
* gcc.dg/vect/vect-12.c: Check profile consistency.
* gcc.dg/vect/vect-120.c: Check profile consistency.
* gcc.dg/vect/vect-121.c: Check profile consistency.
* gcc.dg/vect/vect-122.c: Check profile consistency.
* gcc.dg/vect/vect-123.c: Check profile consistency.
* gcc.dg/vect/vect-124.c: Check profile consistency.
* gcc.dg/vect/vect-126.c: Check profile consistency.
* gcc.dg/vect/vect-13.c: Check profile consistency.
* gcc.dg/vect/vect-14.c: Check profile consistency.
* gcc.dg/vect/vect-15-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-15.c: Check profile consistency.
* gcc.dg/vect/vect-17.c: Check profile consistency.
* gcc.dg/vect/vect-18.c: Check profile consistency.
* gcc.dg/vect/vect-19.c: Check profile consistency.
* gcc.dg/vect/vect-2-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-2.c: Check profile consistency.
* gcc.dg/vect/vect-20.c: Check profile consistency.
* gcc.dg/vect/vect-21.c: Check profile consistency.
* gcc.dg/vect/vect-22.c: Check profile consistency.
* gcc.dg/vect/vect-23.c: Check profile consistency.
* gcc.dg/vect/vect-24.c: Check profile consistency.
* gcc.dg/vect/vect-25.c: Check profile consistency.
* gcc.dg/vect/vect-26.c: Check profile consistency.
* gcc.dg/vect/vect-27.c: Check profile consistency.
* gcc.dg/vect/vect-28.c: Check profile consistency.
* gcc.dg/vect/vect-29.c: Check profile consistency.
* gcc.dg/vect/vect-3.c: Check profile consistency.
* gcc.dg/vect/vect-30.c: Check profile consistency.
* gcc.dg/vect/vect-31-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-31.c: Check profile consistency.
* gcc.dg/vect/vect-32-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-32-chars.c: Check profile consistency.
* gcc.dg/vect/vect-32.c: Check profile consistency.
* gcc.dg/vect/vect-33-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-33.c: Check profile consistency.
* gcc.dg/vect/vect-34-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-34.c: Check profile consistency.
* gcc.dg/vect/vect-35-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-35.c: Check profile consistency.
* gcc.dg/vect/vect-36-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-36.c: Check profile consistency.
* gcc.dg/vect/vect-38.c: Check profile consistency.
* gcc.dg/vect/vect-4.c: Check profile consistency.
* gcc.dg/vect/vect-40.c: Check profile consistency.
* gcc.dg/vect/vect-42.c: Check profile consistency.
* gcc.dg/vect/vect-44.c: Check profile consistency.
* gcc.dg/vect/vect-46.c: Check profile consistency.
* gcc.dg/vect/vect-48.c: Check profile consistency.
* gcc.dg/vect/vect-5.c: 

Fix profile update after loop versioning in vectorizer

2023-07-29 Thread Jan Hubicka via Gcc-patches
Hi,
Vectorizer while loop versioning produces a versioned loop
guarded with two conditionals of the form

  if (cond1)
    goto scalar_loop
  else
    goto next_bb
next_bb:
  if (cond2)
    goto scalar_loop
  else
    goto vector_loop

It wants the combined test to have probability prob (which is set to likely)
and uses profile_probability::split to determine the probabilities
of cond1 and cond2.

However, splitting turns:

 if (cond)
   goto lab; // ORIG probability

into

 if (cond1)
   goto lab; // FIRST = ORIG * CPROB probability
 if (cond2)
   goto lab; // SECOND probability

which is an OR rather than an AND of the conditions.  As a result we get a pretty
low probability of entering the vectorized loop.

The patch fixes this by introducing sqrt on profile_probability (which is the
correct way to split this) and also adds pow, which is needed elsewhere.

During loop versioning I now produce code as if there were only one combined
conditional and then update the probability of the conditional produced (containing
cond1).  Later the edge is split and the new conditional is added.  At that time
it is necessary to update the probability of the BB containing the second conditional
so everything matches.
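
To see why sqrt is the right split here, a standalone sketch (made-up numbers,
not GCC code): both guards must fall through for the vectorized loop to be
entered, so if the vector path should keep probability q, each guard needs
fall-through probability sqrt(q).

#include <math.h>
#include <stdio.h>

int
main (void)
{
  double prob_scalar = 0.1;       /* combined probability of branching to the
                                     scalar loop (the "likely" complement)   */
  double q = 1.0 - prob_scalar;   /* intended probability of the vector path */
  double per_guard = sqrt (q);    /* fall-through probability of each guard  */
  /* Chaining the two guards reproduces the intended combined probability.  */
  printf ("per guard %.4f  combined %.4f  intended %.4f\n",
          per_guard, per_guard * per_guard, q);
  return 0;
}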

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* profile-count.cc (profile_probability::sqrt): New member function.
(profile_probability::pow): Likewise.
* profile-count.h: (profile_probability::sqrt): Declare
(profile_probability::pow): Likewise.
* tree-vect-loop-manip.cc (vect_loop_versioning): Fix profile update.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/vect-profile-upate-2.c: New test.

diff --git a/gcc/profile-count.cc b/gcc/profile-count.cc
index eaf0f0d787e..e63c9432388 100644
--- a/gcc/profile-count.cc
+++ b/gcc/profile-count.cc
@@ -471,3 +471,60 @@ profile_probability::to_sreal () const
   gcc_checking_assert (initialized_p ());
   return ((sreal)m_val) >> (n_bits - 2);
 }
+
+/* Compute square root.  */
+
+profile_probability
+profile_probability::sqrt () const
+{
+  if (!initialized_p () || *this == never () || *this == always ())
+return *this;
+  profile_probability ret = *this;
+  ret.m_quality = MIN (ret.m_quality, ADJUSTED);
+  uint32_t min_range = m_val;
+  uint32_t max_range = max_probability;
+  if (!m_val)
+max_range = 0;
+  if (m_val == max_probability)
+min_range = max_probability;
+  while (min_range != max_range)
+{
+  uint32_t val = (min_range + max_range) / 2;
+  uint32_t val2 = RDIV ((uint64_t)val * val, max_probability);
+  if (val2 == m_val)
+   min_range = max_range = m_val;
+  else if (val2 > m_val)
+   max_range = val - 1;
+  else if (val2 < m_val)
+   min_range = val + 1;
+}
+  ret.m_val = min_range;
+  return ret;
+}
+
+/* Compute n-th power of THIS.  */
+
+profile_probability
+profile_probability::pow (int n) const
+{
+  if (n == 1 || !initialized_p ())
+return *this;
+  if (!n)
+return profile_probability::always ();
+  if (!nonzero_p ()
+  || !(profile_probability::always () - *this).nonzero_p ())
+return *this;
+  profile_probability ret = profile_probability::always ();
+  profile_probability v = *this;
+  int p = 1;
+  while (true)
+{
+  if (n & p)
+   ret = ret * v;
+  p <<= 1;
+  if (p > n)
+   break;
+  v = v * v;
+}
+  return ret;
+}
diff --git a/gcc/profile-count.h b/gcc/profile-count.h
index 88a6431c21a..002bcb83481 100644
--- a/gcc/profile-count.h
+++ b/gcc/profile-count.h
@@ -650,6 +650,12 @@ public:
   return *this;
 }
 
+  /* Compute n-th power.  */
+  profile_probability pow (int) const;
+
+  /* Compute square root.  */
+  profile_probability sqrt () const;
+
   /* Get the value of the count.  */
   uint32_t value () const { return m_val; }
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vect-profile-upate-2.c 
b/gcc/testsuite/gcc.dg/tree-ssa/vect-profile-upate-2.c
new file mode 100644
index 000..4a5f6bc4e23
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vect-profile-upate-2.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized-details-blocks" } */
+void
+test (int *a, int *b, int n)
+{
+   for (int i = 0; i < n; i++)
+   a[i]+=b[i];
+}
+/* { dg-final { scan-tree-dump-not "Invalid sum" "optimized"} } */
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 30baac6db44..e53a99e7c3c 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -3784,7 +3784,7 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
 }
 
   tree cost_name = NULL_TREE;
-  profile_probability prob2 = profile_probability::uninitialized ();
+  profile_probability prob2 = profile_probability::always ();
   if (cond_expr
   && EXPR_P (cond_expr)
   && (version_niter
@@ -3797,7 +3797,7 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
  is_gimple_val, NULL_TREE);
   /* Split prob () into two so that the overall 

Re: Loop-split improvements, part 3

2023-07-28 Thread Jan Hubicka via Gcc-patches
> On Fri, Jul 28, 2023 at 2:57 PM Jan Hubicka via Gcc-patches
>  wrote:
> >
> > Hi,
> > This patch extends tree-ssa-loop-split to understand test of the form
> >  if (i==0)
> > and
> >  if (i!=0)
> > which triggers only during the first iteration.  Naturally we should
> > also be able to trigger last iteration or split into 3 cases if
> > the test indeed can fire in the middle of the loop.
> >
> > Last iteration is bit trickier pattern matching so I want to do it
> > incrementally, but I implemented easy case using value range that handled
> > loops with constant iterations.
> >
> > The testcase gets misupdated profile, I will also fix that incrementally.
> >
> > Bootstrapped/regtested x86_64-linux, OK?
> 
> OK, though I think we can handle more loops by simply conservatively peeling
> one iteration at the beginning/end with such conditions and would be not 
> subject
> to all other limitations the loop splitting pass has?

I was also thinking of extending the loop peeling heuristics by this.
Loop-ch can already handle the case where the static test exits the loop, so we
could get this if I figure out how to merge the analysis.

To handle the last iteration (like in hmmer), we would need to extend loop
peeling to support that.

Even with that, tree-ssa-loop-split has a chance to be more informed and
have a better cost model.  Let me see how many restrictions can be dropped
from it.

Honza


Loop-split improvements, part 3

2023-07-28 Thread Jan Hubicka via Gcc-patches
Hi,
This patch extends tree-ssa-loop-split to understand tests of the form
 if (i==0)
and
 if (i!=0)
which trigger only during the first iteration.  Naturally we should
also be able to handle the last iteration or split into 3 cases if
the test can indeed fire in the middle of the loop.

The last iteration is a bit trickier to pattern match, so I want to do it
incrementally, but I implemented the easy case using value ranges, which
handles loops with constant iteration counts.

The testcase gets a misupdated profile; I will also fix that incrementally.

Bootstrapped/regtested x86_64-linux, OK?

gcc/ChangeLog:

PR middle-end/77689
* tree-ssa-loop-split.cc: Include value-query.h.
(split_at_bb_p): Analyze cases where EQ/NE can be turned
into LT/LE/GT/GE; return updated guard code.
(split_loop): Use guard code.

gcc/testsuite/ChangeLog:

PR middle-end/77689
* g++.dg/tree-ssa/loop-split-1.C: New test.

diff --git a/gcc/testsuite/g++.dg/tree-ssa/loop-split-1.C 
b/gcc/testsuite/g++.dg/tree-ssa/loop-split-1.C
new file mode 100644
index 000..9581438b536
--- /dev/null
+++ b/gcc/testsuite/g++.dg/tree-ssa/loop-split-1.C
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-lsplit-details -std=c++11" } */
+#include <vector>
+#include <cmath>
+
+constexpr unsigned s = 1;
+
+int main()
+{
+    std::vector<double> a, b, c;
+    a.reserve(s);
+    b.reserve(s);
+    c.reserve(s);
+
+    for(unsigned i = 0; i < s; ++i)
+    {
+        if(i == 0)
+            a[i] = b[i] * c[i];
+        else
+            a[i] = (b[i] + c[i]) * c[i-1] * std::log(i);
+    }
+}
+/* { dg-final { scan-tree-dump-times "loop split" 1 "lsplit" } } */
diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index 70cd0aaefa7..641346cba70 100644
--- a/gcc/tree-ssa-loop-split.cc
+++ b/gcc/tree-ssa-loop-split.cc
@@ -42,6 +42,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "gimplify-me.h"
 #include "print-tree.h"
+#include "value-query.h"
 
 /* This file implements two kinds of loop splitting.
 
@@ -75,7 +76,8 @@ along with GCC; see the file COPYING3.  If not see
point in *BORDER and the comparison induction variable in IV.  */
 
 static tree
-split_at_bb_p (class loop *loop, basic_block bb, tree *border, affine_iv *iv)
+split_at_bb_p (class loop *loop, basic_block bb, tree *border, affine_iv *iv,
+  enum tree_code *guard_code)
 {
   gcond *stmt;
   affine_iv iv2;
@@ -87,19 +89,6 @@ split_at_bb_p (class loop *loop, basic_block bb, tree 
*border, affine_iv *iv)
 
   enum tree_code code = gimple_cond_code (stmt);
 
-  /* Only handle relational comparisons, for equality and non-equality
- we'd have to split the loop into two loops and a middle statement.  */
-  switch (code)
-{
-  case LT_EXPR:
-  case LE_EXPR:
-  case GT_EXPR:
-  case GE_EXPR:
-   break;
-  default:
-   return NULL_TREE;
-}
-
   if (loop_exits_from_bb_p (loop, bb))
 return NULL_TREE;
 
@@ -129,6 +118,56 @@ split_at_bb_p (class loop *loop, basic_block bb, tree 
*border, affine_iv *iv)
   if (!iv->no_overflow)
 return NULL_TREE;
 
+  /* Only handle relational comparisons, for equality and non-equality
+ we'd have to split the loop into two loops and a middle statement.  */
+  switch (code)
+{
+  case LT_EXPR:
+  case LE_EXPR:
+  case GT_EXPR:
+  case GE_EXPR:
+   break;
+  case NE_EXPR:
+  case EQ_EXPR:
+   /* If the test checks for the first iteration, we can handle NE/EQ
+  with only one split loop.  */
+   if (operand_equal_p (iv->base, iv2.base, 0))
+ {
+   if (code == EQ_EXPR)
+ code = !tree_int_cst_sign_bit (iv->step) ? LE_EXPR : GE_EXPR;
+   else
+ code = !tree_int_cst_sign_bit (iv->step) ? GT_EXPR : LT_EXPR;
+   break;
+ }
+   /* Similarly when the test checks for minimal or maximal
+  value range.  */
+   else
+ {
+   int_range<2> r;
+   get_global_range_query ()->range_of_expr (r, op0, stmt);
+   if (!r.varying_p () && !r.undefined_p ()
+   && TREE_CODE (op1) == INTEGER_CST)
+ {
+   wide_int val = wi::to_wide (op1);
+   if (known_eq (val, r.lower_bound ()))
+ {
+   code = (code == EQ_EXPR) ? LE_EXPR : GT_EXPR;
+   break;
+ }
+   else if (known_eq (val, r.upper_bound ()))
+ {
+   code = (code == EQ_EXPR) ? GE_EXPR : LT_EXPR;
+   break;
+ }
+ }
+ }
+   /* TODO: We can compare with exit condition; it seems that testing for
+  last iteration is common case.  */
+   return NULL_TREE;
+  default:
+   return NULL_TREE;
+}
+
   if (dump_file && (dump_flags & TDF_DETAILS))
 {
   fprintf (dump_file, "Found potential split point: 

Re: Loop-split improvements, part 2

2023-07-28 Thread Jan Hubicka via Gcc-patches
> On Fri, Jul 28, 2023 at 9:58 AM Jan Hubicka via Gcc-patches
>  wrote:
> >
> > Hi,
> > this patch fixes profile update in the first case of loop splitting.
> > The pass still gives up on very basic testcases:
> >
> > __attribute__ ((noinline,noipa))
> > void test1 (int n)
> > {
> >   if (n <= 0 || n > 10)
> > return;
> >   for (int i = 0; i <= n; i++)
> > {
> >   if (i < n)
> > do_something ();
> >   if (a[i])
> > do_something2();
> > }
> > }
> > Here I needed to do the conditoinal that enforces sane value range of n.
> > The reason is that it gives up on:
> >   !number_of_iterations_exit (loop1, exit1, , false, true)
> > and without the conditonal we get assumption that n>=0 and not INT_MAX.
> > I think from overflow we shold derive that INT_MAX test is not needed and 
> > since
> > the loop does nothing for n<0 it is also just an paranoia.
> 
> I only get n != 2147483647 (loop header copying does the n >= 0).  Indeed
> this test looks odd.  It's because we turn i <= n into i < n + 1 and analyze
> that (our canonical test is LT_EXPR), for this to work n may not be INT_MAX.

Yep, I can't think of how that could disturb loop splitting.  The loop
above is similar to one in hmmer, so people do write loops like that.
We should be able to use the fact that i cannot overflow to get rid of
this assumption, but I am not that familiar with that code...

I think it would help elsewhere too?
> 
> In principle it could just look at the scalar evolution for the IV in
> the exit test.
> Aka use simple_iv () and check ->no_overflow?

Yep, I think that should be enough.  It already uses simple_iv to analyze the
in-loop conditionals.  I will look into that.
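
A rough sketch of what that might look like (hypothetical, not part of any
patch here; it reuses affine_iv, simple_iv and tree_int_cst_sign_bit the way
this pass already does, and exit_op stands for the IV operand of the exit
test):

  /* Hypothetical: derive only the direction of the control IV instead of
     requiring a full number_of_iterations_exit analysis.  */
  affine_iv ctrl_iv;
  if (!simple_iv (loop1, loop1, exit_op, &ctrl_iv, false)
      || !ctrl_iv.no_overflow)
    return false;
  /* A non-overflowing "!=" controlled loop behaves like a "<" loop for an
     upward step and like a ">" loop for a downward step; cmp then drives
     the existing LT/LE/GT/GE logic.  */
  enum tree_code cmp
    = tree_int_cst_sign_bit (ctrl_iv.step) ? GT_EXPR : LT_EXPR;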

Honza


Loop-split improvements, part 2

2023-07-28 Thread Jan Hubicka via Gcc-patches
Hi,
this patch fixes profile update in the first case of loop splitting.
The pass still gives up on very basic testcases:

__attribute__ ((noinline,noipa))
void test1 (int n)
{
  if (n <= 0 || n > 10)
return; 
  for (int i = 0; i <= n; i++)
{
  if (i < n)
do_something ();
  if (a[i])
do_something2();
}
}

Here I needed to add the conditional that enforces a sane value range of n.
The reason is that the pass gives up on:
  !number_of_iterations_exit (loop1, exit1, &niter, false, true)
and without the conditional we get the assumption that n>=0 and n is not INT_MAX.
I think from the no-overflow property we should derive that the INT_MAX test is not
needed, and since the loop does nothing for n<0 it is also just paranoia.

I am not sure how to fix this though :(.  In general the pass does not really
need to compute the iteration count.  It only needs to know what direction the IVs
go so it can detect tests that fire in the first part of the iteration space.

Rich, any idea what the correct test should be?

In the testcase:
  for (int i = 0; i < 200; i++)
    if (i < 150)
      do_something ();
    else
      do_something2 ();
the old code did a wrong update of the exit condition probabilities.
We know that the first loop iterates 150 times and the second loop 50 times
(200 iterations scaled by the 75% probability of i < 150, and by the remaining 25%),
and we get that by simply scaling the loop body by the probability of the inner test.

With the patch we now get:

   [count: 1000]:

   [count: 15]:<- loop 1 correctly iterates 149 times
  # i_10 = PHI 
  do_something ();
  i_7 = i_10 + 1;
  if (i_7 <= 149)
goto ; [99.33%]
  else
goto ; [0.67%]

   [count: 149000]:
  goto ; [100.00%]

   [count: 1000]:
  # i_15 = PHI 

   [count: 49975]:<- loop 2 should iterate 50 times but
   we are slightly wrong
  # i_3 = PHI 
  do_something2 ();
  i_14 = i_3 + 1;
  if (i_14 != 200)
goto ; [98.00%]
  else
goto ; [2.00%]

   [count: 48975]:
  goto ; [100.00%]

   [count: 1000]:   <- this test is always true because it is
  reached from bb 3
  # i_18 = PHI 
  if (i_18 != 200)
goto ; [99.95%]
  else
goto ; [0.05%]

   [count: 1000]:
  return;

The reason why we are slightly wrong is the condition in bb 17 that
is always true, but the pass does not know it.

Rich, any idea how to do that?  I think connect_loops should work out
the case where the loop exit condition is never satisfied at the time
the split condition fails for the first time.

Also we do not update loop iteration expectancies.  If we were able to
work out that one of the loops has a constant iteration count, we could do it
perfectly.

Before the patch, on hmmer we get a lot of mismatches.  The profile report claims:

dump id |static mismatch|dynamic mismatch          |time                  |
        |in count       |in count                  |                      |
lsplit  |  5+5|   8151850567  +8151850567| 531506481006   +57.9%| 
ldist   |  9+4|  15345493501  +7193642934| 606848841056   +14.2%| 
ifcvt   | 10+1|  15487514871   +142021370| 689469797790   +13.6%| 
vect| 35   +25|  17558425961  +2070911090| 517375405715   -25.0%| 
cunroll | 42+7|  16898736178   -659689783| 452445796198-4.9%|  
loopdone| 33-9|   2678017188 -14220718990| 330969127663 |   

tracer  | 34+1|   2678018710+1522| 330613415364+0.0%|  
fre | 33-1|   2676980249 -1038461| 330465677073-0.0%|  
expand  | 28-5|   2497468467   -179511782|--|

With patch

lsplit  |  0  |0 | 328723360744-2.3%|
ldist   |  0  |0 | 396193562452   +20.6%|
ifcvt   |  1+1| 71010686+71010686| 478743508522   +20.8%|
vect| 14   +13|697518955   +626508269| 299398068323   -37.5%|
cunroll | 13-1|489349408   -208169547| 25839725   -10.5%|
loopdone| 11-2|402558559-86790849| 201010712702 |
tracer  | 13+2|402977200  +418641| 200651036623+0.0%|
fre | 13  |402622146  -355054| 200344398654-0.2%|
expand  | 11-2|333608636-69013510|--|

So there are no mismatches for lsplit and ldist, and lsplit now thinks it improves
speed by 2.3% rather than regressing it by 57%.

The update is still not perfect since we do not work out that the second loop
never iterates.  Also ldist is still wrong since the time should not go up.

Ifcvt wrecks the profile by design since it inserts conditionals with both arms 100%
that will be eliminated later after vect.  It is not clear to me what happens
in vect though.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

PR middle-end/106293
* tree-ssa-loop-split.cc (connect_loops): Change probability
of the test preconditioning second loop to very_likely.
(fix_loop_bb_probability): Handle correctly 

loop-split improvements, part 1

2023-07-28 Thread Jan Hubicka via Gcc-patches
Hi,
while looking at the profile misupdate on hmmer I noticed that the loop splitting pass
is not able to handle the loop it gives as an example of what it should apply to:

   One transformation of loops like:

   for (i = 0; i < 100; i++)
     {
       if (i < 50)
         A;
       else
         B;
     }

   into:

   for (i = 0; i < 50; i++)
     {
       A;
     }
   for (; i < 100; i++)
     {
       B;
     }

The problem is that ivcanon turns the test into i != 100 and the pass
explicitly gives up on any loop ending with a != test.  It needs to know
the direction of the induction variable in order to derive the right conditions,
but that can also be done from the step.

It turns out that there are no testcases for basic loop splitting.  I will add
some with the profile update fix.

There are other issues; for example VRP will turn i < 99 into i == 99 based on the
value range, which also makes the pass give up.

Bootstrapped/regtested x86_64-linux, OK?

gcc/ChangeLog:

* tree-ssa-loop-split.cc (split_loop): Also support NE driven
loops when IV test is not overflowing.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/ifc-12.c: Disable loop splitting.
* gcc.target/i386/avx2-gather-6.c: Likewise.
* gcc.target/i386/avx2-vect-aggressive.c: Likewise.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ifc-12.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ifc-12.c
index 9468c070489..bedf29c7dbc 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ifc-12.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ifc-12.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Ofast -fdump-tree-ifcvt-stats-blocks-details" } */
+/* { dg-options "-Ofast -fdump-tree-ifcvt-stats-blocks-details 
-fno-split-loops" } */
 /* { dg-require-visibility "" } */
 
 struct st
diff --git a/gcc/testsuite/gcc.target/i386/avx2-gather-6.c 
b/gcc/testsuite/gcc.target/i386/avx2-gather-6.c
index b9119581ae2..47a95dbe989 100644
--- a/gcc/testsuite/gcc.target/i386/avx2-gather-6.c
+++ b/gcc/testsuite/gcc.target/i386/avx2-gather-6.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O3 -mavx2 -fno-common -fdump-tree-vect-details 
-mtune=skylake" } */
+/* { dg-options "-O3 -mavx2 -fno-common -fdump-tree-vect-details 
-mtune=skylake -fno-split-loops" } */
 
 #include "avx2-gather-5.c"
 
diff --git a/gcc/testsuite/gcc.target/i386/avx2-vect-aggressive.c 
b/gcc/testsuite/gcc.target/i386/avx2-vect-aggressive.c
index 57192791857..fa336e70e84 100644
--- a/gcc/testsuite/gcc.target/i386/avx2-vect-aggressive.c
+++ b/gcc/testsuite/gcc.target/i386/avx2-vect-aggressive.c
@@ -1,6 +1,6 @@
 /* { dg-do run } */
 /* { dg-require-effective-target avx2 } */
-/* { dg-options "-mavx2 -O3 -fopenmp-simd -fdump-tree-vect-details 
-fdisable-tree-thread1" } */
+/* { dg-options "-mavx2 -O3 -fopenmp-simd -fdump-tree-vect-details 
-fdisable-tree-thread1 -fno-split-loops" } */
 
 #include "avx2-check.h"
 #define N 64
diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index b41b5e614c2..27780370d85 100644
--- a/gcc/tree-ssa-loop-split.cc
+++ b/gcc/tree-ssa-loop-split.cc
@@ -540,10 +545,17 @@ split_loop (class loop *loop1)
   || !empty_block_p (loop1->latch)
   || !easy_exit_values (loop1)
   || !number_of_iterations_exit (loop1, exit1, &niter, false, true)
-  || niter.cmp == ERROR_MARK
-  /* We can't yet handle loops controlled by a != predicate.  */
-  || niter.cmp == NE_EXPR)
+  || niter.cmp == ERROR_MARK)
 return false;
+  if (niter.cmp == NE_EXPR)
+{
+  if (!niter.control.no_overflow)
+   return false;
+  if (tree_int_cst_sign_bit (niter.control.step) > 0)
+   niter.cmp = GT_EXPR;
+  else
+   niter.cmp = LT_EXPR;
+}
 
   bbs = get_loop_body (loop1);
 


Make store likely in optimize_mask_stores

2023-07-27 Thread Jan Hubicka via Gcc-patches
Hi,
as discussed with Richard, we want the store to be likely in
optimize_mask_stores.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* tree-vect-loop.cc (optimize_mask_stores): Make store
likely.

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 2561552fe6e..a83952aff60 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -11741,7 +11741,7 @@ optimize_mask_stores (class loop *loop)
   e->flags = EDGE_TRUE_VALUE;
   efalse = make_edge (bb, store_bb, EDGE_FALSE_VALUE);
   /* Put STORE_BB to likely part.  */
-  efalse->probability = profile_probability::unlikely ();
+  efalse->probability = profile_probability::likely ();
   e->probability = efalse->probability.invert ();
   store_bb->count = efalse->count ();
   make_single_succ_edge (store_bb, join_bb, EDGE_FALLTHRU);


Fix profile update after RTL unrolling

2023-07-27 Thread Jan Hubicka via Gcc-patches
This patch fixes the profile update after RTL unrolling, which is now done the same
way as in the tree unroller.  We still produce a (slightly) corrupted profile for
multiple-exit loops, which I can try to fix incrementally.

I also updated the testcases to look for profile mismatches so they do not creep
back in again.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* cfgloop.h (single_dom_exit): Declare.
* cfgloopmanip.h (update_exit_probability_after_unrolling): Declare.
* cfgrtl.cc (struct cfg_hooks): Fix comment.
* loop-unroll.cc (unroll_loop_constant_iterations): Update exit edge.
* tree-ssa-loop-ivopts.h (single_dom_exit): Do not declare it here.
* tree-ssa-loop-manip.cc (update_exit_probability_after_unrolling):
Break out from ...
(tree_transform_and_unroll_loop): ... here;

gcc/testsuite/ChangeLog:

* gcc.dg/tree-prof/peel-1.c: Test for profile mismatches.
* gcc.dg/tree-prof/unroll-1.c: Test for profile mismatches.
* gcc.dg/tree-ssa/peel1.c: Test for profile mismatches.
* gcc.dg/unroll-1.c: Test for profile mismatches.
* gcc.dg/unroll-3.c: Test for profile mismatches.
* gcc.dg/unroll-4.c: Test for profile mismatches.
* gcc.dg/unroll-5.c: Test for profile mismatches.
* gcc.dg/unroll-6.c: Test for profile mismatches.

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 22293e1c237..c4622d4b853 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -921,6 +921,7 @@ extern bool get_estimated_loop_iterations (class loop 
*loop, widest_int *nit);
 extern bool get_max_loop_iterations (const class loop *loop, widest_int *nit);
 extern bool get_likely_max_loop_iterations (class loop *loop, widest_int *nit);
 extern int bb_loop_depth (const_basic_block);
+extern edge single_dom_exit (class loop *);
 
 /* Converts VAL to widest_int.  */
 
diff --git a/gcc/cfgloopmanip.h b/gcc/cfgloopmanip.h
index af6a29f70c4..dab7b31c1e7 100644
--- a/gcc/cfgloopmanip.h
+++ b/gcc/cfgloopmanip.h
@@ -68,5 +68,6 @@ class loop * loop_version (class loop *, void *,
 void adjust_loop_info_after_peeling (class loop *loop, int npeel, bool 
precise);
 void scale_dominated_blocks_in_loop (class loop *loop, basic_block bb,
 profile_count num, profile_count den);
+void update_exit_probability_after_unrolling (class loop *loop, edge new_exit);
 
 #endif /* GCC_CFGLOOPMANIP_H */
diff --git a/gcc/cfgrtl.cc b/gcc/cfgrtl.cc
index 36e43d0d737..abcb472e2a2 100644
--- a/gcc/cfgrtl.cc
+++ b/gcc/cfgrtl.cc
@@ -5409,7 +5409,7 @@ struct cfg_hooks cfg_layout_rtl_cfg_hooks = {
   rtl_flow_call_edges_add,
   NULL, /* execute_on_growing_pred */
   NULL, /* execute_on_shrinking_pred */
-  duplicate_loop_body_to_header_edge, /* duplicate loop for trees */
+  duplicate_loop_body_to_header_edge, /* duplicate loop for rtl */
   rtl_lv_add_condition_to_bb, /* lv_add_condition_to_bb */
   NULL, /* lv_adjust_loop_header_phi*/
   rtl_extract_cond_bb_edges, /* extract_cond_bb_edges */
diff --git a/gcc/loop-unroll.cc b/gcc/loop-unroll.cc
index 9d8ba11..bbfa6ccc770 100644
--- a/gcc/loop-unroll.cc
+++ b/gcc/loop-unroll.cc
@@ -487,6 +487,7 @@ unroll_loop_constant_iterations (class loop *loop)
   bool exit_at_end = loop_exit_at_end_p (loop);
   struct opt_info *opt_info = NULL;
   bool ok;
+  bool flat = maybe_flat_loop_profile (loop);
 
   niter = desc->niter;
 
@@ -603,9 +604,14 @@ unroll_loop_constant_iterations (class loop *loop)
   ok = duplicate_loop_body_to_header_edge (
 loop, loop_latch_edge (loop), max_unroll, wont_exit, desc->out_edge,
 &remove_edges,
-DLTHE_FLAG_UPDATE_FREQ | (opt_info ? DLTHE_RECORD_COPY_NUMBER : 0));
+DLTHE_FLAG_UPDATE_FREQ | (opt_info ? DLTHE_RECORD_COPY_NUMBER : 0)
+| (flat ? DLTHE_FLAG_FLAT_PROFILE : 0));
   gcc_assert (ok);
 
+  edge new_exit = single_dom_exit (loop);
+  if (new_exit)
+update_exit_probability_after_unrolling (loop, new_exit);
+
   if (opt_info)
 {
   apply_opt_in_copies (opt_info, max_unroll, true, true);
diff --git a/gcc/profile-count.h b/gcc/profile-count.h
index 88a6431c21a..e860c5db540 100644
--- a/gcc/profile-count.h
+++ b/gcc/profile-count.h
@@ -650,6 +650,9 @@ public:
   return *this;
 }
 
+  /* Compute n-th power.  */
+  profile_probability pow (int) const;
+
   /* Get the value of the count.  */
   uint32_t value () const { return m_val; }
 
diff --git a/gcc/testsuite/gcc.dg/tree-prof/peel-1.c 
b/gcc/testsuite/gcc.dg/tree-prof/peel-1.c
index 7245b68c1ee..32ecccb16da 100644
--- a/gcc/testsuite/gcc.dg/tree-prof/peel-1.c
+++ b/gcc/testsuite/gcc.dg/tree-prof/peel-1.c
@@ -1,4 +1,4 @@
-/* { dg-options "-O3 -fdump-tree-cunroll-details -fno-unroll-loops 
-fpeel-loops" } */
+/* { dg-options "-O3 -fdump-tree-cunroll-details-blocks 
-fdump-tree-optimized-details-blocks -fno-unroll-loops -fpeel-loops" } */
 void abort();
 
 int a[1000];
@@ -21,3 +21,5 @@ main()
   return 0;
 }
 /* { dg-final-use { scan-tree-dump "Peeled loop ., 1 times" "cunroll" } 

Fix profile update in tree_transform_and_unroll_loop

2023-07-27 Thread Jan Hubicka via Gcc-patches
Hi,
This patch fixes the profile update in tree_transform_and_unroll_loop, which is used
by predictive commoning.  I started by attempting to fix
gcc.dg/tree-ssa/update-unroll-1.c, which I xfailed last week, but it turned out to be
a harder job.

Unrolling was never updated for the changes in duplicate_loop_body_to_header_edge,
which is now smarter about getting the profile right when some exits are eliminated.
A lot of the manual profile updating can thus now be done using existing infrastructure.

I also noticed that scale_dominated_blocks_in_loop does a job identical
to a loop I wrote in scale_loop_profile, so I commonized the implementation
and removed the recursion.

I also extended duplicate_loop_body_to_header_edge to handle flat profiles the same
way as we do in the vectorizer.  Without it we end up with a less-than-0 iteration
count in gcc.dg/tree-ssa/update-unroll-1.c (it is unrolled 32 times but predicted
to iterate fewer times).  I also added the missing code to update loop_info.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* cfgloopmanip.cc (scale_dominated_blocks_in_loop): Move here from
tree-ssa-loop-manip.cc and avoid recursion.
(scale_loop_profile): Use scale_dominated_blocks_in_loop.
(duplicate_loop_body_to_header_edge): Add DLTHE_FLAG_FLAT_PROFILE
flag.
* cfgloopmanip.h (DLTHE_FLAG_FLAT_PROFILE): Define.
(scale_dominated_blocks_in_loop): Declare.
* predict.cc (dump_prediction): Do not ICE on uninitialized probability.
(change_edge_frequency): Remove.
* predict.h (change_edge_frequency): Remove.
* tree-ssa-loop-manip.cc (scale_dominated_blocks_in_loop): Move to
cfgloopmanip.cc.
(niter_for_unrolled_loop): Remove.
(tree_transform_and_unroll_loop): Fix profile update.

gcc/testsuite/ChangeLog:

* gcc.dg/pr102385.c: Check for no profile mismatches.
* gcc.dg/pr96931.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-1.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-2.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-3.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-4.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-5.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-7.c: Check for one profile mismatch.
* gcc.dg/tree-ssa/predcom-8.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-1.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-10.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-11.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-12.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-2.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-3.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-4.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-5.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-6.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-7.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-8.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/predcom-dse-9.c: Check for no profile mismatches.
* gcc.dg/tree-ssa/update-unroll-1.c: Unxfail.

diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index 3012a8d60f7..c3d292d0dd4 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -499,6 +499,32 @@ scale_loop_frequencies (class loop *loop, 
profile_probability p)
   free (bbs);
 }
 
+/* Scales the frequencies of all basic blocks in LOOP that are strictly
+   dominated by BB by NUM/DEN.  */
+
+void
+scale_dominated_blocks_in_loop (class loop *loop, basic_block bb,
+   profile_count num, profile_count den)
+{
+  basic_block son;
+
+  if (!den.nonzero_p () && !(num == profile_count::zero ()))
+return;
+  auto_vec <basic_block> worklist;
+  worklist.safe_push (bb);
+
+  while (!worklist.is_empty ())
+for (son = first_dom_son (CDI_DOMINATORS, worklist.pop ());
+son;
+son = next_dom_son (CDI_DOMINATORS, son))
+  {
+   if (!flow_bb_inside_loop_p (loop, son))
+ continue;
+   son->count = son->count.apply_scale (num, den);
+   worklist.safe_push (son);
+  }
+}
+
 /* Scale profile in LOOP by P.
If ITERATION_BOUND is not -1, scale even further if loop is predicted
to iterate too many times.
@@ -649,19 +675,9 @@ scale_loop_profile (class loop *loop, profile_probability 
p,
   if (other_edge && other_edge->dest == loop->latch)
loop->latch->count -= new_exit_count - old_exit_count;
   else
-   {
- basic_block *body = get_loop_body (loop);
- profile_count new_count = exit_edge->src->count - new_exit_count;
- profile_count old_count = exit_edge->src->count - old_exit_count;
-
- for (unsigned int i = 

Fix profile update in tree-ssa-loop-im.cc

2023-07-27 Thread Jan Hubicka via Gcc-patches
Hi,
this fixes two bugs in tree-ssa-loop-im.cc.  The first is that the cap probability
is not reliable, yet it was constructed with ADJUSTED quality.  The second is that
sometimes the conditional has a wrong joiner BB count.  This is visible on
testsuite/gcc.dg/pr102385.c; however, the testcase triggers another profile
update bug in pcom, so I will update it in a followup patch.

gcc/ChangeLog:

* tree-ssa-loop-im.cc (execute_sm_if_changed): Turn cap probability
to guessed; fix count of new_bb.

diff --git a/gcc/tree-ssa-loop-im.cc b/gcc/tree-ssa-loop-im.cc
index f5b01e986ae..268f466bdc9 100644
--- a/gcc/tree-ssa-loop-im.cc
+++ b/gcc/tree-ssa-loop-im.cc
@@ -2059,7 +2059,8 @@ execute_sm_if_changed (edge ex, tree mem, tree tmp_var, 
tree flag,
nbbs++;
 }
 
-  profile_probability cap = profile_probability::always ().apply_scale (2, 3);
+  profile_probability cap
+ = profile_probability::guessed_always ().apply_scale (2, 3);
 
   if (flag_probability.initialized_p ())
 ;
@@ -2103,6 +2104,8 @@ execute_sm_if_changed (edge ex, tree mem, tree tmp_var, 
tree flag,
 
   old_dest = ex->dest;
   new_bb = split_edge (ex);
+  if (append_cond_position)
+new_bb->count += last_cond_fallthru->count ();
   then_bb = create_empty_bb (new_bb);
   then_bb->count = new_bb->count.apply_probability (flag_probability);
   if (irr)


Fix profile_count::apply_probability

2023-07-27 Thread Jan Hubicka via Gcc-patches
Hi,
profile_count::apply_probability misses a check for an uninitialized
probability, which leads to completely random results when applying an
uninitialized probability to an initialized count.  This can make a
difference when e.g. inlining a -fno-guess-branch-probability function into
a -fguess-branch-probability one.
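
A minimal standalone sketch of the intended semantics (std::optional stands in
for the initialized_p tracking of the real profile_count/profile_probability
classes; this is not GCC code):

#include <cstdio>
#include <optional>

/* std::nullopt plays the role of an uninitialized count or probability.  */
static std::optional<double>
apply_probability (std::optional<double> count, std::optional<double> prob)
{
  if (count && *count == 0.0)
    return 0.0;                  /* a zero count stays zero                 */
  if (prob && *prob == 0.0)
    return 0.0;                  /* scaling by "never" gives a zero count   */
  if (prob && *prob == 1.0)
    return count;                /* scaling by "always" keeps the count     */
  if (!count || !prob)
    return std::nullopt;         /* either side unknown -> result unknown   */
  return *count * *prob;
}

int
main ()
{
  auto r = apply_probability (1000.0, std::nullopt);
  std::printf ("%s\n", r ? "initialized" : "uninitialized"); /* uninitialized */
}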

Bootstrapped/regtested x86_64-linux, committed.
gcc/ChangeLog:

* profile-count.h (profile_count::apply_probability): Fix
handling of uninitialized probabilities, optimize scaling
by probability 1.

diff --git a/gcc/profile-count.h b/gcc/profile-count.h
index bf1136782a3..e860c5db540 100644
--- a/gcc/profile-count.h
+++ b/gcc/profile-count.h
@@ -1129,11 +1132,11 @@ public:
   /* Scale counter according to PROB.  */
   profile_count apply_probability (profile_probability prob) const
 {
-  if (*this == zero ())
+  if (*this == zero () || prob == profile_probability::always ())
return *this;
   if (prob == profile_probability::never ())
return zero ();
-  if (!initialized_p ())
+  if (!initialized_p () || !prob.initialized_p ())
return uninitialized ();
   profile_count ret;
   uint64_t tmp;


Fix profile_count::to_sreal_scale

2023-07-26 Thread Jan Hubicka via Gcc-patches
Hi,
this patch makes profile_count::to_sreal_scale consider the scale
unknown when IN is 0.  This fixes the case where a loop has 0 executions
in the profile feedback and thus we cannot determine its trip count.

Bootstrapped/regtested x86_64-linux, comitted.

Honza

gcc/ChangeLog:

* profile-count.cc (profile_count::to_sreal_scale): Value is not known
if we divide by zero.

diff --git a/gcc/profile-count.cc b/gcc/profile-count.cc
index 2c07ebc5942..eaf0f0d787e 100644
--- a/gcc/profile-count.cc
+++ b/gcc/profile-count.cc
@@ -345,7 +345,7 @@ profile_count::to_sreal_scale (profile_count in, bool 
*known) const
   return 1;
 }
   if (known)
-*known = true;
+*known = in.m_val != 0;
   if (*this == in)
 return 1;
   gcc_checking_assert (compatible_p (in));


Re: Fix optimize_mask_stores profile update

2023-07-21 Thread Jan Hubicka via Gcc-patches
> On Mon, Jul 17, 2023 at 12:36 PM Jan Hubicka via Gcc-patches
>  wrote:
> >
> > Hi,
> > While looking into sphinx3 regression I noticed that vectorizer produces
> > BBs with overall probability count 120%.  This patch fixes it.
> > Richi, I don't know how to create a testcase, but having one would
> > be nice.
> >
> > Bootstrapped/regtested x86_64-linux, commited last night (sorry for
> > late email)
> 
> This should trigger with sth like
> 
>   for (i)
> if (cond[i])
>   out[i] = 1.;
> 
> so a masked store and then using AVX2+.  ISTR we disable AVX masked
> stores on zen (but not AVX512).

Richard,
if we know the probability of if (cond[i]) to be p,
then we know that the combined conditional is somewhere between
  low = p           (the strategy packing trues and falses into VF-sized blocks)
and
  high = min (p*vf, 1)
                    (the strategy doing only one true per block if possible)
The likely value is

  likely = 1 - pow (1-p, vf)
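
A quick standalone check of those bounds (illustrative values only, not GCC
code):

#include <math.h>
#include <stdio.h>

int
main (void)
{
  double p = 0.1;                          /* probability of cond[i]           */
  int vf = 8;                              /* vectorization factor             */
  double low = p;                          /* trues packed densely into blocks */
  double high = p * vf < 1 ? p * vf : 1;   /* at most one true per block       */
  double likely = 1 - pow (1 - p, vf);     /* independent cond[i] in each lane */
  printf ("low %.2f  high %.2f  likely %.2f\n", low, high, likely);
  /* Prints: low 0.10  high 0.80  likely 0.57  */
  return 0;
}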

I wonder if we can work out p at least in common cases.
Making the store unlikely, as we do right now, will place it out of line with an
extra jump.  Making it likely is better unless p is very small.

I think if p is close to 0 or 1, which may be the common case, the analysis
above may be useful.  If the range [low...high] is small, we can use likely
and keep it as reliable.
If it is wide, we can probably just end up with a guessed value close to but
above 50% so the store stays inline.

Honza


Re: [PATCH]AArch64 fix regexp for live_1.c sve test

2023-07-21 Thread Jan Hubicka via Gcc-patches
Avoid scaling flat loop profiles of vectorized loops

As discussed, when vectorizing a loop with a static profile, it is not always a good
idea to divide the header frequency by the vectorization factor, because the profile
may not realistically represent the expected number of iterations.  Since in such
cases we default to relatively low iteration counts (based on the average for
spec2k17), this would make the vectorized loop body look cold.

This patch makes the vectorizer look for flat profiles and only possibly reduce the
profile by the known upper bound on the iteration count.

Bootstrap/regtest of x86_64-linux in progress.  I intend to commit this after
the testers pick up the other profile-related changes from today.
Tamar, Richard, it would be nice to know if it fixes the testcase you were
looking at, and possibly to turn it into a testcase?

gcc/ChangeLog:

* tree-vect-loop.cc (scale_profile_for_vect_loop): Avoid scaling flat
profiles by vectorization factor.
(vect_transform_loop): Check for flat profiles.

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index b44fb9c7712..d036a7d4480 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10837,11 +10837,25 @@ vect_get_loop_len (loop_vec_info loop_vinfo, 
gimple_stmt_iterator *gsi,
 }
 
 /* Scale profiling counters by estimation for LOOP which is vectorized
-   by factor VF.  */
+   by factor VF.
+   If FLAT is true, the loop we started with had unrealistically flat
+   profile.  */
 
 static void
-scale_profile_for_vect_loop (class loop *loop, unsigned vf)
+scale_profile_for_vect_loop (class loop *loop, unsigned vf, bool flat)
 {
+  /* For flat profiles do not scale down proportionally by VF and only
+ cap by known iteration count bounds.  */
+  if (flat)
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file,
+"Vectorized loop profile seems flat; not scaling iteration "
+"count down by the vectorization factor %i\n", vf);
+  scale_loop_profile (loop, profile_probability::always (),
+ get_likely_max_loop_iterations_int (loop));
+  return;
+}
   /* Loop body executes VF fewer times and exit increases VF times.  */
   edge exit_e = single_exit (loop);
   profile_count entry_count = loop_preheader_edge (loop)->count ();
@@ -10852,7 +10866,13 @@ scale_profile_for_vect_loop (class loop *loop, 
unsigned vf)
   while (vf > 1
 && loop->header->count > entry_count
 && loop->header->count < entry_count * vf)
-vf /= 2;
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file,
+"Vectorization factor %i seems too large for profile "
+"prevoiusly believed to be consistent; reducing.\n", vf);
+  vf /= 2;
+}
 
   if (entry_count.nonzero_p ())
 set_edge_probability_and_rescale_others
@@ -11184,6 +11204,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
*loop_vectorized_call)
   gimple *stmt;
   bool check_profitability = false;
   unsigned int th;
+  bool flat = maybe_flat_loop_profile (loop);
 
   DUMP_VECT_SCOPE ("vec_transform_loop");
 
@@ -11252,7 +11273,6 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
*loop_vectorized_call)
  _vector, _vector_mult_vf, th,
  check_profitability, niters_no_overflow,
  );
-
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
   && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
 scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -11545,7 +11565,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
*loop_vectorized_call)
  assumed_vf) - 1
 : wi::udiv_floor (loop->nb_iterations_estimate + bias_for_assumed,
   assumed_vf) - 1);
-  scale_profile_for_vect_loop (loop, assumed_vf);
+  scale_profile_for_vect_loop (loop, assumed_vf, flat);
 
   if (dump_enabled_p ())
 {


Fix gcc.dg/tree-ssa/copy-headers-9.c and gcc.dg/tree-ssa/dce-1.c failures

2023-07-21 Thread Jan Hubicka via Gcc-patches
Hi,
this patch fixes the templates in the two testcases so they match the output
correctly.  I did not re-test after the last changes in the previous patch,
sorry for that.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/copy-headers-9.c: Fix template for 
tree-ssa-loop-ch.cc changes.
* gcc.dg/tree-ssa/dce-1.c: Likewise.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c 
b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
index 7cc162ca94d..b49d1fc9576 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
@@ -13,8 +13,7 @@ void test (int m, int n)
}
while (i<10);
 }
-/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win" 1 "ch2" } } */
-/* { dg-final { scan-tree-dump-times "May duplicate bb" 1 "ch2" } } */
-/* { dg-final { scan-tree-dump-times "Duplicating additional BB to obtain 
do-while loop" 1 "ch2" } } */
+/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win" 2 "ch2" } } */
+/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win. it has zero" 
1 "ch2" } } */
 /* { dg-final { scan-tree-dump-times "Will duplicate bb" 2 "ch2" } } */
 /* { dg-final { scan-tree-dump "is now do-while loop" "ch2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/dce-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/dce-1.c
index 91c3bcd6c1c..3ebfa988503 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/dce-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/dce-1.c
@@ -13,6 +13,6 @@ int foo (int b, int j)
 }
 /* Check that empty loop is eliminated in this case.  We should no longer have
the exit condition after the loop.  */
-/* { dg-final { scan-tree-dump-not "999)" "cddce1"} } */
-/* { dg-final { scan-tree-dump-not "1000)" "cddce1"} } */
+/* { dg-final { scan-tree-dump-not "999\\)" "cddce1"} } */
+/* { dg-final { scan-tree-dump-not "1000\\)" "cddce1"} } */
 


Implement flat loop profile detection

2023-07-21 Thread Jan Hubicka via Gcc-patches
Hi,
this patch adds maybe_flat_loop_profile, which can be used in loop profile
updates to detect situations where the profile may be unrealistically flat
and should not be downscaled after vectorizing, unrolling and other
transforms that assume that the loop has a high iteration count even if the
CFG profile says otherwise.

The profile is flat if it was statically estimated and at that time we had
no idea about the actual number of iterations, or we artificially capped
them.  So the function considers flat all profiles that have guessed or
lower reliability in their count and no nb_iterations bound/estimate which
would prove that the profile iteration count is high enough.
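
To illustrate the idea (a hypothetical example, not taken from the patch):

void
test (int *a, int n)
{
  /* The trip count depends on n and is unknown at compile time: the
     guessed CFG profile predicts only a few iterations and there is no
     recorded iteration estimate or bound showing that the profile count
     is high enough, so the profile is considered possibly flat and
     transforms like vectorization should not scale it down further.  */
  for (int i = 0; i < n; i++)
    a[i] = i;
}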

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* cfgloop.h (maybe_flat_loop_profile): Declare
* cfgloopanal.cc (maybe_flat_loop_profile): New function.
* tree-cfg.cc (print_loop_info): Print info about flat profiles.

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 269694c7962..22293e1c237 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -407,6 +407,7 @@ gcov_type expected_loop_iterations_unbounded (const class 
loop *,
 extern bool expected_loop_iterations_by_profile (const class loop *loop,
 sreal *ret,
 bool *reliable = NULL);
+extern bool maybe_flat_loop_profile (const class loop *);
 extern unsigned expected_loop_iterations (class loop *);
 extern rtx doloop_condition_get (rtx_insn *);
 
diff --git a/gcc/cfgloopanal.cc b/gcc/cfgloopanal.cc
index c86a537f024..d8923b27e5d 100644
--- a/gcc/cfgloopanal.cc
+++ b/gcc/cfgloopanal.cc
@@ -303,6 +303,67 @@ expected_loop_iterations_by_profile (const class loop 
*loop, sreal *ret,
   return true;
 }
 
+/* Return true if loop CFG profile may be unrealistically flat.
+   This is a common case, since average loops iterate only about 5 times.
+   In the case we do not have profile feedback or do not know real number of
+   iterations during profile estimation, we are likely going to predict it with
+   similar low iteration count.  For static loop profiles we also artificially
+   cap profile of loops with known large iteration count so they do not appear
+   significantly more hot than other loops with unknown iteration counts.
+
+   For loop optimization heuristics we ignore CFG profile and instead
+   use get_estimated_loop_iterations API which returns estimate
+   only when it is realistic.  For unknown counts some optimizations,
+   like vectorizer or unroller make guess that iteration count will
+   be large.  In this case we need to avoid scaling down the profile
+   after the loop transform.  */
+
+bool
+maybe_flat_loop_profile (const class loop *loop)
+{
+  bool reliable;
+  sreal ret;
+
+  if (!expected_loop_iterations_by_profile (loop, , ))
+return true;
+
+  /* Reliable CFG estimates ought never be flat.  Sanity check with
+ nb_iterations_estimate.  If those differ, it is a bug in profile
+ updating code.  */
+  if (reliable)
+{
+  int64_t intret = ret.to_nearest_int ();
+  if (loop->any_estimate
+ && (wi::ltu_p (intret * 2, loop->nb_iterations_estimate)
+ || wi::gtu_p (intret, loop->nb_iterations_estimate * 2)))
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file,
+   "Loop %i has inconsistent iterations estimates: "
+   "reliable CFG based iteration estimate is %f "
+   "while nb_iterations_estimate is %i\n",
+   loop->num,
+   ret.to_double (),
+   (int)loop->nb_iterations_estimate.to_shwi ());
+ return true;
+   }
+  return false;
+}
+
+  /* Allow some margin of error and see if we are close to known bounds.
+ sreal (9,-3) is 9/8  */
+  int64_t intret = (ret * sreal (9, -3)).to_nearest_int ();
+  if (loop->any_upper_bound && wi::geu_p (intret, 
loop->nb_iterations_upper_bound))
+return false;
+  if (loop->any_likely_upper_bound
+  && wi::geu_p (intret, loop->nb_iterations_likely_upper_bound))
+return false;
+  if (loop->any_estimate
+  && wi::geu_p (intret, loop->nb_iterations_estimate))
+return false;
+  return true;
+}
+
 /* Returns expected number of iterations of LOOP, according to
measured or guessed profile.
 
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index a6c97a04662..c65af8cc800 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -8523,8 +8523,11 @@ print_loop_info (FILE *file, const class loop *loop, 
const char *prefix)
   bool reliable;
   sreal iterations;
   if (loop->num && expected_loop_iterations_by_profile (loop, , 
))
-fprintf (file, "\n%siterations by profile: %f %s", prefix,
-iterations.to_double (), reliable ? "(reliable)" : "(unreliable)");
+{
+  fprintf (file, "\n%siterations by profile: %f (%s%s)", prefix,
+  iterations.to_double (), reliable ? "reliable" : 

Fix sreal::to_int and implement sreal::to_nearest_int

2023-07-21 Thread Jan Hubicka via Gcc-patches
Fix sreal::to_int and implement sreal::to_nearest_int

while exploring the new loop estimate dumps, I noticed that a loop iterating
1.8 times by profile is estimated as iterating once instead of 2 by
nb_estimate.  While nb_estimate should really be an sreal and I will convert
it incrementally, I found the problem is in the previous patch doing:

+ *nit = (snit + 0.5).to_int ();

this does not work for sreal because it only has a constructor from integer,
so first 0.5 is rounded to 0 and then added to snit.

Some code uses sreal (1, -1), which produces 0.5, but that requires an
unnecessary addition, so I decided to add to_nearest_int.  Testing it I
noticed that to_int is buggy:
  (sreal(3)/2).to_int () == 1
while
  (sreal(-3)/2).to_int () == -2
Probably not a big deal in practice as we do not do conversions on
negative values.

The fix is easy: we need to do the shift on the positive (absolute) value.
This patch fixes it and adds the to_nearest_int alternative.
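
The truncation issue can be seen on a simplified model of the conversion
(a sketch only, not the actual sreal class; value = sig * 2^exp, exp < 0):

/* Shifting the absolute value and re-applying the sign truncates toward
   zero for both signs; shifting a negative significand directly rounds
   toward -infinity, which is what made (sreal(-3)/2).to_int () return -2.  */
static long long
to_int_model (long long sig, int exp)
{
  long long a = sig < 0 ? -sig : sig;
  return (sig < 0 ? -1 : 1) * (a >> -exp);
}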

Bootstrapped/regtested x86_64-linux, will commit it shortly.

gcc/ChangeLog:

* sreal.cc (sreal::to_nearest_int): New.
(sreal_verify_basics): Verify also to_nearest_int.
(verify_aritmetics): Likewise.
(sreal_verify_conversions): New.
(sreal_cc_tests): Call sreal_verify_conversions.
* sreal.h: (sreal::to_nearest_int): Declare

diff --git a/gcc/sreal.cc b/gcc/sreal.cc
index 8e99d871420..606a571e339 100644
--- a/gcc/sreal.cc
+++ b/gcc/sreal.cc
@@ -116,7 +116,26 @@ sreal::to_int () const
   if (m_exp > 0)
 return sign * (SREAL_ABS ((int64_t)m_sig) << m_exp);
   if (m_exp < 0)
-return m_sig >> -m_exp;
+return sign * (SREAL_ABS ((int64_t)m_sig) >> -m_exp);
+  return m_sig;
+}
+
+/* Return nearest integer value of *this.  */
+
+int64_t
+sreal::to_nearest_int () const
+{
+  int64_t sign = SREAL_SIGN (m_sig);
+
+  if (m_exp <= -SREAL_BITS)
+return 0;
+  if (m_exp >= SREAL_PART_BITS)
+return sign * INTTYPE_MAXIMUM (int64_t);
+  if (m_exp > 0)
+return sign * (SREAL_ABS ((int64_t)m_sig) << m_exp);
+  if (m_exp < 0)
+return sign * ((SREAL_ABS ((int64_t)m_sig) >> -m_exp)
+  + ((SREAL_ABS (m_sig) >> (-m_exp - 1)) & 1));
   return m_sig;
 }
 
@@ -286,6 +305,8 @@ sreal_verify_basics (void)
 
   ASSERT_EQ (INT_MIN/2, minimum.to_int ());
   ASSERT_EQ (INT_MAX/2, maximum.to_int ());
+  ASSERT_EQ (INT_MIN/2, minimum.to_nearest_int ());
+  ASSERT_EQ (INT_MAX/2, maximum.to_nearest_int ());
 
   ASSERT_FALSE (minus_two < minus_two);
   ASSERT_FALSE (seven < seven);
@@ -315,6 +336,10 @@ verify_aritmetics (int64_t a, int64_t b)
   ASSERT_EQ (a - b, (sreal (a) - sreal (b)).to_int ());
   ASSERT_EQ (b + a, (sreal (b) + sreal (a)).to_int ());
   ASSERT_EQ (b - a, (sreal (b) - sreal (a)).to_int ());
+  ASSERT_EQ (a + b, (sreal (a) + sreal (b)).to_nearest_int ());
+  ASSERT_EQ (a - b, (sreal (a) - sreal (b)).to_nearest_int ());
+  ASSERT_EQ (b + a, (sreal (b) + sreal (a)).to_nearest_int ());
+  ASSERT_EQ (b - a, (sreal (b) - sreal (a)).to_nearest_int ());
 }
 
 /* Verify arithmetics for interesting numbers.  */
@@ -377,6 +402,33 @@ sreal_verify_negative_division (void)
   ASSERT_EQ (sreal (1234567) / sreal (-1234567), sreal (-1));
 }
 
+static void
+sreal_verify_conversions (void)
+{
+  ASSERT_EQ ((sreal (11) / sreal (3)).to_int (), 3);
+  ASSERT_EQ ((sreal (11) / sreal (3)).to_nearest_int (), 4);
+  ASSERT_EQ ((sreal (10) / sreal (3)).to_int (), 3);
+  ASSERT_EQ ((sreal (10) / sreal (3)).to_nearest_int (), 3);
+  ASSERT_EQ ((sreal (9) / sreal (3)).to_int (), 3);
+  ASSERT_EQ ((sreal (9) / sreal (3)).to_nearest_int (), 3);
+  ASSERT_EQ ((sreal (-11) / sreal (3)).to_int (), -3);
+  ASSERT_EQ ((sreal (-11) / sreal (3)).to_nearest_int (), -4);
+  ASSERT_EQ ((sreal (-10) / sreal (3)).to_int (), -3);
+  ASSERT_EQ ((sreal (-10) / sreal (3)).to_nearest_int (), -3);
+  ASSERT_EQ ((sreal (-3)).to_int (), -3);
+  ASSERT_EQ ((sreal (-3)).to_nearest_int (), -3);
+  for (int i = -10 ; i < 10; i += 123)
+for (int j = -1 ; j < 10; j += 71)
+  if (j != 0)
+   {
+ sreal sval = ((sreal)i) / (sreal)j;
+ double val = (double)i / (double)j;
+ ASSERT_EQ ((fabs (sval.to_double () - val) < 0.1), true);
+ ASSERT_EQ (sval.to_int (), (int)val);
+ ASSERT_EQ (sval.to_nearest_int (), lround (val));
+   }
+}
+
 /* Run all of the selftests within this file.  */
 
 void sreal_cc_tests ()
@@ -385,6 +437,7 @@ void sreal_cc_tests ()
   sreal_verify_arithmetics ();
   sreal_verify_shifting ();
   sreal_verify_negative_division ();
+  sreal_verify_conversions ();
 }
 
 } // namespace selftest
diff --git a/gcc/sreal.h b/gcc/sreal.h
index 8700807a131..4dbb83c3005 100644
--- a/gcc/sreal.h
+++ b/gcc/sreal.h
@@ -51,6 +51,7 @@ public:
 
   void dump (FILE *) const;
   int64_t to_int () const;
+  int64_t to_nearest_int () const;
   double to_double () const;
   void stream_out (struct output_block *);
   static sreal stream_in (class lto_input_block *);


Re: loop-ch improvements, part 5

2023-07-21 Thread Jan Hubicka via Gcc-patches
> > The patch requires bit of testsuite changes
> >  - I disabled ch in loop-unswitch-17.c since it tests unswitching of
> >loop invariant conditional.
> >  - pr103079.c needs ch disabled to trigger vrp situation it tests for
> >(otherwise we optimize stuff earlier and better)
> >  - copy-headers-7.c now gets only 2 basic blocks duplicated since
> >last conditional does not seem to benefit from duplicating,
> >so I reordered them.
> > copy-headers-9 tests the new logic.
> >
> > Bootstrapped/regtested x86_64-linux, OK?
> 
> OK.  In case the size heuristics are a bit too optimistic we could avoid the
Thanks!
> peeling in the -Os case?  Did you do any stats on TUs to see whether code
> actually increases in the end?

I only did stats on tramp3d and some GCC source files with -O2, where the
new heuristics actually tend to duplicate fewer BBs overall because of
the logic stopping the duplication chain after the last winning header,
while the previous implementation keeps duplicating more of the loop.  The
difference is small (sub 1%) since most loops are very simple and have only
one header BB to duplicate.  We however handle more loops overall and
produce more do-whiles.

I think there is some potential in making the heuristics more speculative
now and allowing more partial peeling, but the code right now is still
on the safe side.

For -Os we set the code growth limit to 0, so we only duplicate if we know
that one of the two copies will be optimized out.  This is stricter
than what we did previously and I need to get more stats on this - we may
want to bump up the limit or at least increase it to account for the extra
jump saved by the while -> do-while conversion.

Honza


loop-ch improvements, part 5

2023-07-21 Thread Jan Hubicka via Gcc-patches
Hi,
currently loop-ch skips all do-while loops.  But when a loop is not a
do-while, in addition to the original goal of turning it into a do-while,
header copying can do additional things:
 1) move out loop invariant computations
 2) duplicate loop invariant conditionals and eliminate them in the loop body
 3) prove that some exits are always true in the first iteration
    and can be skipped

Most of the time 1 can be done by lim (the exception is when the invariant
computation is conditional).  For 2 we however don't really have another
place doing it except for loop unswitching, which is more expensive (it will
duplicate the loop and then optimize one path to a non-loop).
3 can be done by loop peeling, but that is also more expensive since it
duplicates the full loop body.
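
As a concrete illustration, the copy-headers-7.c testcase updated in the
patch below exercises the last two cases:

int
is_sorted (int *a, int n, int m, int k)
{
  if (k > 0)
    for (int i = 0; k > i && m && i < n - 1; i++)
      if (a[i] > a[i + 1])
        return 0;
  return 1;
}

Here the m test is loop invariant (case 2) and, since k > 0 on entry, the
k > i exit cannot be taken in the first iteration (case 3), so both can be
handled by duplicating the header blocks instead of peeling the whole body.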

This patch improves the heuristics by not giving up on do-while loops and by
trying to find a sequence of BBs to duplicate to obtain one of the goals:
 - turn the loop into a do-while
 - eliminate an invariant conditional in the loop body
 - do partial "peeling" as long as the code optimizes enough so this does not
   increase code size.
This can be improved upon, but I think this patch should finally get the
heuristics into a shape where they do not do weird things.

The patch requires bit of testsuite changes
 - I disabled ch in loop-unswitch-17.c since it tests unswitching of
   loop invariant conditional.
 - pr103079.c needs ch disabled to trigger vrp situation it tests for
   (otherwise we optimize stuff earlier and better)
 - copy-headers-7.c now gets only 2 basic blocks duplicated since
   last conditional does not seem to benefit from duplicating,
   so I reordered them.
copy-headers-9 tests the new logic.

Bootstrapped/regtested x86_64-linux, OK?

gcc/ChangeLog:

* tree-ssa-loop-ch.cc (enum ch_decision): New enum.
(should_duplicate_loop_header_p): Return info on profitability.
(do_while_loop_p): Watch for constant conditionals.
(update_profile_after_ch): Do not sanity check that all
static exits are taken.
(ch_base::copy_headers): Run on all loops.
(pass_ch::process_loop_p): Improve heuristics by handling also
do_while loop and duplicating shortest sequence containing all
winning blocks.

gcc/testsuite/ChangeLog:

* gcc.dg/loop-unswitch-17.c: Disable ch.
* gcc.dg/pr103079.c: Disable ch.
* gcc.dg/tree-ssa/copy-headers-7.c: Update so ch behaves
as expected.
* gcc.dg/tree-ssa/copy-headers.c: Update template.
* gcc.dg/tree-ssa/copy-headers-9.c: New test.

diff --git a/gcc/testsuite/gcc.dg/loop-unswitch-17.c 
b/gcc/testsuite/gcc.dg/loop-unswitch-17.c
index 8655e09a51c..4b806c475b1 100644
--- a/gcc/testsuite/gcc.dg/loop-unswitch-17.c
+++ b/gcc/testsuite/gcc.dg/loop-unswitch-17.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -funswitch-loops -fdump-tree-unswitch-optimized" } */
+/* { dg-options "-O2 -funswitch-loops -fdump-tree-unswitch-optimized 
-fno-tree-ch" } */
 
 int foo (int a)
 {
diff --git a/gcc/testsuite/gcc.dg/pr103079.c b/gcc/testsuite/gcc.dg/pr103079.c
index 7f6632fc669..7b107544725 100644
--- a/gcc/testsuite/gcc.dg/pr103079.c
+++ b/gcc/testsuite/gcc.dg/pr103079.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Os -fdump-tree-vrp2" } */
+/* { dg-options "-Os -fdump-tree-vrp2 -fno-tree-ch" } */
 
 int a, b = -2;
 int main() {
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c 
b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
index e2a6c75f2e9..b3df3b6398e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
@@ -4,7 +4,7 @@
 int is_sorted(int *a, int n, int m, int k)
 {
   if (k > 0)
-for (int i = 0; i < n - 1 && m && k > i; i++)
+for (int i = 0; k > i && m && i < n - 1 ; i++)
   if (a[i] > a[i + 1])
return 0;
   return 1;
@@ -17,5 +17,4 @@ int is_sorted(int *a, int n, int m, int k)
 /* { dg-final { scan-tree-dump-times "Conditional combines static and 
invariant" 0 "ch2" } } */
 /* { dg-final { scan-tree-dump-times "Will elliminate invariant exit" 1 "ch2" 
} } */
 /* { dg-final { scan-tree-dump-times "Will eliminate peeled conditional" 1 
"ch2" } } */
-/* { dg-final { scan-tree-dump-times "Not duplicating bb .: condition based on 
non-IV loop variant." 1 "ch2" } } */
 /* { dg-final { scan-tree-dump-times "Will duplicate bb" 3 "ch2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c 
b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
new file mode 100644
index 000..7cc162ca94d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ch-details" } */
+int a[100];
+void test (int m, int n)
+{
+   int i = 0;
+   do
+   {
+   if (m)
+   break;
+   i++;
+   a[i]=0;
+   }
+   while (i<10);
+}
+/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win" 1 "ch2" } } */
+/* { dg-final { 

finite_loop_p tweak

2023-07-21 Thread Jan Hubicka via Gcc-patches
Hi,
we have a finite_p flag in the loop structure.  finite_loop_p already knows
to use it, but we may also set the flag when we prove a loop to be finite by
SCEV analysis, to avoid duplicated work.

Bootstrapped/regtested x86_64-linux, OK?

gcc/ChangeLog:

* tree-ssa-loop-niter.cc (finite_loop_p): Reorder to do cheap
tests first; update finite_p flag.

diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
index 3c4e66291fb..e5985bee235 100644
--- a/gcc/tree-ssa-loop-niter.cc
+++ b/gcc/tree-ssa-loop-niter.cc
@@ -3338,24 +3338,6 @@ finite_loop_p (class loop *loop)
   widest_int nit;
   int flags;
 
-  flags = flags_from_decl_or_type (current_function_decl);
-  if ((flags & (ECF_CONST|ECF_PURE)) && !(flags & ECF_LOOPING_CONST_OR_PURE))
-{
-  if (dump_file && (dump_flags & TDF_DETAILS))
-   fprintf (dump_file, "Found loop %i to be finite: it is within pure or 
const function.\n",
-loop->num);
-  return true;
-}
-
-  if (loop->any_upper_bound
-  || max_loop_iterations (loop, ))
-{
-  if (dump_file && (dump_flags & TDF_DETAILS))
-   fprintf (dump_file, "Found loop %i to be finite: upper bound found.\n",
-loop->num);
-  return true;
-}
-
   if (loop->finite_p)
 {
   unsigned i;
@@ -3368,11 +3350,36 @@ finite_loop_p (class loop *loop)
  {
if (dump_file)
  fprintf (dump_file, "Assume loop %i to be finite: it has an exit "
-  "and -ffinite-loops is on.\n", loop->num);
+  "and -ffinite-loops is on or loop was
+  " previously finite.\n",
+  loop->num);
return true;
  }
 }
 
+  flags = flags_from_decl_or_type (current_function_decl);
+  if ((flags & (ECF_CONST|ECF_PURE)) && !(flags & ECF_LOOPING_CONST_OR_PURE))
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file,
+"Found loop %i to be finite: it is within "
+"pure or const function.\n",
+loop->num);
+  loop->finite_p = true;
+  return true;
+}
+
+  if (loop->any_upper_bound
+  /* Loop with no normal exit will not pass max_loop_iterations.  */
+  || (!loop->finite_p && max_loop_iterations (loop, )))
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, "Found loop %i to be finite: upper bound found.\n",
+loop->num);
+  loop->finite_p = true;
+  return true;
+}
+
   return false;
 }
 


Improve loop dumping

2023-07-21 Thread Jan Hubicka via Gcc-patches
Hi,
we have flow_loop_dump and print_loop.  While print_loop was extended to dump
the stuff from the loop structure we added over the years (loop info),
flow_loop_dump was not.
-fdump-tree-all files contain flow_loop_dump output, which makes it hard to
see what metadata we have attached to a loop.

This patch unifies dumping of these fields from both functions.  For example 
for:
int a[100];
main()
{
for (int i = 0;  i < 10; i++)
a[i]=i;
}
we now print:
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 3 4 5
;;
;; Loop 1
;;  header 4, latch 3
;;  depth 1, outer 0, finite_p
;;  upper_bound 10
;;  likely_upper_bound 10
;;  estimate 10
;;  iterations by profile: 10.001101 (unreliable)

finite_p, upper_bound, likely_upper_bound, estimate and iterations by profile
are new.

Bootstrap/regtest on x86_64 in progress. OK if it passes?

Honza

gcc/ChangeLog:

* cfgloop.cc (flow_loop_dump): Use print_loop_info.
* cfgloop.h (print_loop_info): Declare.
* tree-cfg.cc (print_loop_info): Break out from ...; add
printing of missing fields and profile
(print_loop): ... here.

diff --git a/gcc/cfgloop.cc b/gcc/cfgloop.cc
index 020e5734d95..9ca85e648a7 100644
--- a/gcc/cfgloop.cc
+++ b/gcc/cfgloop.cc
@@ -135,17 +135,12 @@ flow_loop_dump (const class loop *loop, FILE *file,
   fprintf (file, "\n");
 }
 
-  fprintf (file, ";;  depth %d, outer %ld\n",
+  fprintf (file, ";;  depth %d, outer %ld",
   loop_depth (loop), (long) (loop_outer (loop)
  ? loop_outer (loop)->num : -1));
+  print_loop_info (file, loop, ";;  ");
 
-  bool reliable;
-  sreal iterations;
-  if (loop->num && expected_loop_iterations_by_profile (loop, , 
))
-fprintf (file, ";;  profile-based iteration count: %f %s\n",
-iterations.to_double (), reliable ? "(reliable)" : "(unreliable)");
-
-  fprintf (file, ";;  nodes:");
+  fprintf (file, "\n;;  nodes:");
   bbs = get_loop_body (loop);
   for (i = 0; i < loop->num_nodes; i++)
 fprintf (file, " %d", bbs[i]->index);
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 4d2fd4b6af5..269694c7962 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -411,6 +411,7 @@ extern unsigned expected_loop_iterations (class loop *);
 extern rtx doloop_condition_get (rtx_insn *);
 
 void mark_loop_for_removal (loop_p);
+void print_loop_info (FILE *file, const class loop *loop, const char *);
 
 /* Induction variable analysis.  */
 
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 7ccc2a5a5a7..a6c97a04662 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -8479,6 +8479,55 @@ print_loops_bb (FILE *file, basic_block bb, int indent, 
int verbosity)
 }
 }
 
+/* Print loop information.  */
+
+void
+print_loop_info (FILE *file, const class loop *loop, const char *prefix)
+{
+  if (loop->can_be_parallel)
+fprintf (file, ", can_be_parallel");
+  if (loop->warned_aggressive_loop_optimizations)
+fprintf (file, ", warned_aggressive_loop_optimizations");
+  if (loop->dont_vectorize)
+fprintf (file, ", dont_vectorize");
+  if (loop->force_vectorize)
+fprintf (file, ", force_vectorize");
+  if (loop->in_oacc_kernels_region)
+fprintf (file, ", in_oacc_kernels_region");
+  if (loop->finite_p)
+fprintf (file, ", finite_p");
+  if (loop->unroll)
+fprintf (file, "\n%sunroll %d", prefix, loop->unroll);
+  if (loop->nb_iterations)
+{
+  fprintf (file, "\n%sniter ", prefix);
+  print_generic_expr (file, loop->nb_iterations);
+}
+
+  if (loop->any_upper_bound)
+{
+  fprintf (file, "\n%supper_bound ", prefix);
+  print_decu (loop->nb_iterations_upper_bound, file);
+}
+  if (loop->any_likely_upper_bound)
+{
+  fprintf (file, "\n%slikely_upper_bound ", prefix);
+  print_decu (loop->nb_iterations_likely_upper_bound, file);
+}
+
+  if (loop->any_estimate)
+{
+  fprintf (file, "\n%sestimate ", prefix);
+  print_decu (loop->nb_iterations_estimate, file);
+}
+  bool reliable;
+  sreal iterations;
+  if (loop->num && expected_loop_iterations_by_profile (loop, , 
))
+fprintf (file, "\n%siterations by profile: %f %s", prefix,
+iterations.to_double (), reliable ? "(reliable)" : "(unreliable)");
+
+}
+
 static void print_loop_and_siblings (FILE *, class loop *, int, int);
 
 /* Pretty print LOOP on FILE, indented INDENT spaces.  Following
@@ -8511,27 +8560,7 @@ print_loop (FILE *file, class loop *loop, int indent, 
int verbosity)
 fprintf (file, ", latch = %d", loop->latch->index);
   else
 fprintf (file, ", multiple latches");
-  fprintf (file, ", niter = ");
-  print_generic_expr (file, loop->nb_iterations);
-
-  if (loop->any_upper_bound)
-{
-  fprintf (file, ", upper_bound = ");
-  print_decu (loop->nb_iterations_upper_bound, file);
-}
-  if (loop->any_likely_upper_bound)
-{
-  fprintf (file, ", likely_upper_bound = ");
-  print_decu (loop->nb_iterations_likely_upper_bound, file);
-   

Cleanup code determining number of iterations from cfg profile

2023-07-20 Thread Jan Hubicka via Gcc-patches
Hi,
this patch cleans up the API for determining expected loop iterations from
the profile.  We started with expected_loop_iterations, whose only source was
the integer-represented BB counts.  It did some work on guessing the number
of iterations if the profile was absent or bogus.  Later we introduced
loop_info and added get_estimated_loop_iterations, which made
expected_loop_iterations useful mostly when doing profile updates and not for
loop optimization heuristics.  The naming is a bit ambiguous, so this
difference is not clear.  Even later we introduced precision tracking to the
profile and extended the API to return the reliability of the result, but did
not update all uses to do reasonable stuff with it.  There is also some
confusion about +-1s concerning latch execution counts versus header
execution counts.
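For example, if the preheader edge count is 100 and the header count is
1000, the header runs 10 times per entry into the loop while the latch runs
only 9 times; the loop_info iteration count refers to the latter.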

This patch aims to obsolete expected_loop_iterations and
expected_loop_iterations_unbounded (and "succeeds" modulo one use of each of
the two).  It adds expected_loop_iterations_by_profile, which computes an
sreal and does correct precision/presence tracking.

Unlike the old code, it is based on the CFG profile only, does not attempt
to provide a fake answer when info is missing, and does not check sanity
against loop_info.

We now define iterations consistently as latch executions in loop_info, so I
use that here too.

I converted almost all calls to the new API: dumps, the code producing
loop_info from the CFG profile, and profile updating.  The remaining uses are
in loop unrolling and prefetching, which need more TLC I will do
incrementally.

There are some improvements possible which I can play with incrementally.
 - for simple loops with one exit dominating the latch we can use the exit
   probability, which makes it easier to preserve the loop iteration info.
   This is probably not too critical since all estimates should be recorded
   in loop_info and it would help mostly if a new loop is constructed or an
   old loop is lost and rediscovered.
 - We may want to avoid trusting the profile if it is obviously inconsistent
   on the header.

Bootstrapped/regtested x86_64-linux, plan to commit it later today if
there are no complaints.

Honza

gcc/ChangeLog:

* cfgloop.cc: Include sreal.h.
(flow_loop_dump): Dump sreal iteration estimate.
(get_estimated_loop_iterations): Update.
* cfgloop.h (expected_loop_iterations_by_profile): Declare.
* cfgloopanal.cc (expected_loop_iterations_by_profile): New function.
(expected_loop_iterations_unbounded): Use new API.
* cfgloopmanip.cc (scale_loop_profile): Use
expected_loop_iterations_by_profile
* predict.cc (pass_profile::execute): Likewise.
* profile.cc (branch_prob): Likewise.
* tree-ssa-loop-niter.cc: Include sreal.h.
(estimate_numbers_of_iterations): Likewise

diff --git a/gcc/cfgloop.cc b/gcc/cfgloop.cc
index ccda7415d70..11336ea45c0 100644
--- a/gcc/cfgloop.cc
+++ b/gcc/cfgloop.cc
@@ -33,6 +33,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "dumpfile.h"
 #include "tree-ssa.h"
 #include "tree-pretty-print.h"
+#include "sreal.h"
 
 static void flow_loops_cfg_dump (FILE *);
 
@@ -138,14 +139,11 @@ flow_loop_dump (const class loop *loop, FILE *file,
   loop_depth (loop), (long) (loop_outer (loop)
  ? loop_outer (loop)->num : -1));
 
-  if (loop->latch)
-{
-  bool read_profile_p;
-  gcov_type nit = expected_loop_iterations_unbounded (loop, 
_profile_p);
-  if (read_profile_p && !loop->any_estimate)
-   fprintf (file, ";;  profile-based iteration count: %" PRIu64 "\n",
-(uint64_t) nit);
-}
+  bool reliable;
+  sreal iterations;
+  if (expected_loop_iterations_by_profile (loop, , ))
+fprintf (file, ";;  profile-based iteration count: %f %s\n",
+iterations.to_double (), reliable ? "(reliable)" : "(unreliable)");
 
   fprintf (file, ";;  nodes:");
   bbs = get_loop_body (loop);
@@ -2014,10 +2012,12 @@ get_estimated_loop_iterations (class loop *loop, 
widest_int *nit)
  profile.  */
   if (!loop->any_estimate)
 {
-  if (loop->header->count.reliable_p ())
+  sreal snit;
+  bool reliable;
+  if (expected_loop_iterations_by_profile (loop, , )
+ && reliable)
{
-  *nit = gcov_type_to_wide_int
-  (expected_loop_iterations_unbounded (loop) + 1);
+ *nit = (snit + 0.5).to_int ();
  return true;
}
   return false;
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index e7ac2b5f3db..4d2fd4b6af5 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -403,7 +403,10 @@ extern void verify_loop_structure (void);
 /* Loop analysis.  */
 extern bool just_once_each_iteration_p (const class loop *, const_basic_block);
 gcov_type expected_loop_iterations_unbounded (const class loop *,
- bool *read_profile_p = NULL, bool 
by_profile_only = false);
+ bool *read_profile_p = NULL);
+extern bool 

Re: [PATCH]AArch64 fix regexp for live_1.c sve test

2023-07-20 Thread Jan Hubicka via Gcc-patches
> Tamar Christina  writes:
> > Hi All,
> >
> > The resulting predicate register of a whilelo is not
> > restricted to the lower half of the predicate register file.
> >
> > As such these tests started failing after recent changes
> > because the whilelo outside the loop is getting assigned p15.
> 
> It's the whilelo in the loop for me.  We go from:
> 
> .L3:
> ld1bz31.b, p7/z, [x4, x3]
> movprfx z30, z31
> mul z30.b, p5/m, z30.b, z29.b
> st1bz30.b, p7, [x4, x3]
> mov p6.b, p7.b
> add x3, x3, x0
> whilelo p7.b, w3, w1
> b.any   .L3
> 
> to:
> 
> .L3:
> ld1bz31.b, p7/z, [x3, x2]
> movprfx z29, z31
> mul z29.b, p6/m, z29.b, z30.b
> st1bz29.b, p7, [x3, x2]
> add x2, x2, x0
> whilelo p15.b, w2, w1
> b.any   .L4
> [...]
> .p2align 2,,3
> .L4:
> mov p7.b, p15.b
> b   .L3
> 
> This adds an extra (admittedly unconditional) branch to every non-final
> vector iteration, which seems unfortunate.  I don't think we'd see
> p8-p15 otherwise, since the result of the whilelo is used as a
> governing predicate by the next iteration of the loop.
> 
> This happens because the scalar loop is given an 89% chance of iterating.
> Previously we gave the vector loop an 83.33% chance of iterating, whereas
> after 061f74c06735e1fa35b910ae we give it a 12% chance.  0.89^16 == 15.50%,
> so the new probabilities definitely preserve the original probabilities
> more closely.  But for purely heuristic probabilities like these, I'm
> not sure we should lean so heavily into the idea that the vector
> latch is unlikely.
> 
> Honza, Richi, any thoughts?  Just wanted to double-check that this
> was operating as expected before making the tests accept the (arguably)
> less efficient code.  It looks like the commit was more aimed at fixing
> the profile counts for the epilogues, rather than the main loop.

You are right that we should not scale down static profiles in case they
are artificially flat.  It is nice to have an actual testcase.
Old code used to test:

  /* Without profile feedback, loops for which we do not know a better estimate
 are assumed to roll 10 times.  When we unroll such loop, it appears to
 roll too little, and it may even seem to be cold.  To avoid this, we
 ensure that the created loop appears to roll at least 5 times (but at
 most as many times as before unrolling).  Don't do adjustment if profile
 feedback is present.  */
  if (new_est_niter < 5 && !profile_p)
{
  if (est_niter < 5)
new_est_niter = est_niter;
  else 
new_est_niter = 5;
} 

This is not right when profile feedback is around and also when we
managed to determine the precise number of iterations at branch prediction
time and did not cap it.

So I replaced it with the test that the adjusted header count is not smaller
than the preheader edge count.  However this will happily let the loop
iteration count get close to 0.

It is a bit hard to figure out if the profile is realistic:

Sometimes we do
   profile_status_for_fn (cfun) != PROFILE_READ
I am trying to get rid of this test.  With LTO or when comdat profile is
lost we inline together functions with and without profile.

We can test whether the quality of the loop header count is precise or
adjusted.  However, at the time the vectorizer is modifying the loop profile
we have already adjusted it for the initial conditional for the
profitability threshold and dropped it to GUESSED.  Even with profile
feedback we do not know the outcome probability of that one (Ondrej
Kubanek's histograms will help here).

So I think we want to check if we have a loop iteration estimate recorded
(that should be true both for profile feedback and for loops with known trip
count) and, if so, compare it with what the profile says; if they more or
less match, consider the profile realistic.  This needs to be done before
the vectorizer starts tampering with the loop.

I will try to make a patch for that.
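
A rough sketch of that check (illustration only; the function name is made
up, and the maybe_flat_loop_profile patch elsewhere in this series
implements the idea properly):

/* Sketch: consider the CFG profile realistic only if its iteration
   estimate roughly agrees with the recorded nb_iterations_estimate.  */
static bool
profile_seems_realistic_p (const class loop *loop)
{
  sreal it;
  bool reliable;
  if (!expected_loop_iterations_by_profile (loop, &it, &reliable)
      || !loop->any_estimate)
    return false;
  int64_t prof = it.to_nearest_int ();
  /* Accept up to a factor-of-two mismatch in either direction.  */
  return !(wi::ltu_p (prof * 2, loop->nb_iterations_estimate)
	   || wi::gtu_p (prof, loop->nb_iterations_estimate * 2));
}
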
Honza
> 
> Thanks,
> Richard
> 
> > This widens the regexp.
> >
> > Tested on aarch64-none-linux-gnu and passes again.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/aarch64/sve/live_1.c: Update assembly.
> >
> > --- inline copy of patch -- 
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/live_1.c 
> > b/gcc/testsuite/gcc.target/aarch64/sve/live_1.c
> > index 
> > 80ee176d1807bf628ad47551d69ff5d84deda79e..2db6c3c209a9514646e92628f3d2dd58d466539c
> >  100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/live_1.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/live_1.c
> > @@ -27,10 +27,10 @@
> >  
> >  TEST_ALL (EXTRACT_LAST)
> >  
> > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].b, } 2 } } */
> > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].h, } 4 } } */
> > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].s, } 4 } } */
> > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].d, } 4 

loop-ch improvements, part 3

2023-07-20 Thread Jan Hubicka via Gcc-patches
Hi,
this patch makes tree-ssa-loop-ch understand if-combined conditionals (which
are quite common) and removes the IV-derived heuristics.  That heuristic is
quite dubious because every variable with a PHI in the header of integral or
pointer type is seen as an IV, so in the first basic block we match all loop
invariants as invariants and everything that changes in the loop as IV-like.

I think the heuristic was mostly there to make header duplication happen when
the exit conditional is constant false in the first iteration, and with
ranger we can work this out with good enough precision.

The patch adds notion of "combined exit" which has conditional that is
and/or/xor of loop invariant exit and exit known to be false in first
iteration.  Copying these is a win since the loop conditional will simplify
in both copies.
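
A typical combined exit (the same loop as used in the earlier part 3
posting of this series) is:

void test2 (void);

void
test (int n)
{
  /* The exit conditional combines the loop invariant n != 0 test with
     the i <= 9 test, which cannot cause an exit in the first iteration;
     duplicating the header lets both copies simplify.  */
  for (int i = 0; n && i < 10; i++)
    test2 ();
}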

It seems that those are usually bit or/and/xor, and the code size accounting
is true only when the values have at most one bit set or when the static
constant and invariant versions are simple (such as all zeros).  I am not
testing this, so the code may be optimistic here.  I think it is not common
enough to matter and I cannot think of a correct condition that is not quite
complex.

I also improved the code size estimate to not account non-conditionals that
are known to be constant in the peeled copy, and improved the debug output.

This requires testsuite compensation.  uninit-pred-loop-1_c.C does:

/* { dg-do compile } */
/* { dg-options "-Wuninitialized -O2 -std=c++98" } */

extern int bar();
int foo(int n, int m)
{
 for (;;) {
   int err = ({int _err; 
 for (int i = 0; i < 16; ++i) {
   if (m+i > n)
  break;
   _err = 17;
   _err = bar();
 }
 _err; 
   }); 

   if (err == 0) return 17;
}

Before the patch we duplicate
   if (m+i > n)
which makes the maybe-uninitialized warning not be output.  I do not quite
see why copying this out would be a win, since it won't simplify.  Also I
think the warning is correct: if m>n the loop will bail out before
initializing _err and it will be used uninitialized.  I think it is a bug
elsewhere that header duplication suppresses this.

copy-headers-7.c does:
int is_sorted(int *a, int n, int m, int k)
{
  for (int i = 0; i < n - 1 && m && k > i; i++)
if (a[i] > a[i + 1])
  return 0;
  return 1;
}

It tests that all three for-statement conditionals are duplicated.  With the
patch we no longer duplicate k>i since it is not going to simplify.  So I
added a test ensuring that k is positive.  Also the test requires disabling
if-combining and vrp to avoid conditionals becoming combined ones.  So I
added a new version of the test checking that we now behave correctly also
with if-combine.

ivopt_mult_2.c and ivopt_mult_1.c seem to require loop header duplication for
ivopts to behave in a particular way, so I also ensured by value range that
the header is duplicated.

Bootstrapped/regtested x86_64-linux, OK?

gcc/ChangeLog:

* tree-ssa-loop-ch.cc (edge_range_query): Rename to ...
(get_range_query): ... this one; do 
(static_loop_exit): Add query parameter, turn ranger to reference.
(loop_static_stmt_p): New function.
(loop_static_op_p): New function.
(loop_iv_derived_p): Remove.
(loop_combined_static_and_iv_p): New function.
(should_duplicate_loop_header_p): Discover combined conditionals;
do not track iv derived; improve dumps.
(pass_ch::execute): Fix whitespace.

gcc/testsuite/ChangeLog:

* g++.dg/uninit-pred-loop-1_c.C: Allow warning.
* gcc.dg/tree-ssa/copy-headers-7.c: Add tests so exit condition is
static; update template.
* gcc.dg/tree-ssa/ivopt_mult_1.c: Add test so exit condition is static.
* gcc.dg/tree-ssa/ivopt_mult_2.c: Add test so exit condition is static.
* gcc.dg/tree-ssa/copy-headers-8.c: New test.

diff --git a/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C 
b/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C
index 711812aae1b..1ee1615526f 100644
--- a/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C
+++ b/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C
@@ -15,7 +15,7 @@ int foo(int n, int m)
  _err; 
}); 
 
-   if (err == 0) return 17;
+   if (err == 0) return 17;/* { dg-warning "uninitialized" "warning" } */
  }
 
  return 18;
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c 
b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
index 3c9b3807041..e2a6c75f2e9 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
@@ -3,9 +3,10 @@
 
 int is_sorted(int *a, int n, int m, int k)
 {
-  for (int i = 0; i < n - 1 && m && k > i; i++)
-if (a[i] > a[i + 1])
-  return 0;
+  if (k > 0)
+for (int i = 0; i < n - 1 && m && k > i; i++)
+  if (a[i] > a[i + 1])
+   return 0;
   return 1;
 }
 
@@ -13,4 +14,8 @@ int is_sorted(int *a, int n, int m, int k)
the invariant test, not the alternate exit test.  */
 
 /* { dg-final { scan-tree-dump "is now do-while loop" "ch2" } } */
+/* { dg-final { 

Re: Fix optimize_mask_stores profile update

2023-07-17 Thread Jan Hubicka via Gcc-patches
> On Mon, Jul 17, 2023 at 12:36 PM Jan Hubicka via Gcc-patches
>  wrote:
> >
> > Hi,
> > While looking into sphinx3 regression I noticed that vectorizer produces
> > BBs with overall probability count 120%.  This patch fixes it.
> > Richi, I don't know how to create a testcase, but having one would
> > be nice.
> >
> > Bootstrapped/regtested x86_64-linux, commited last night (sorry for
> > late email)
> 
> This should trigger with sth like
> 
>   for (i)
> if (cond[i])
>   out[i] = 1.;
> 
> so a masked store and then using AVX2+.  ISTR we disable AVX masked
> stores on zen (but not AVX512).

OK, let me see if I can get a testcase out of that.
> >efalse = make_edge (bb, store_bb, EDGE_FALSE_VALUE);
> >/* Put STORE_BB to likely part.  */
> >efalse->probability = profile_probability::unlikely ();
> > +  e->probability = efalse->probability.invert ();
> >store_bb->count = efalse->count ();
> 
> isn't the count also wrong?  Or rather efalse should be likely().   We're
> testing doing
> 
>   if (!mask all zeros)
> masked-store
> 
> because a masked store with all zero mask can end up invoking COW page fault
> handling multiple times (because it doesn't actually write).

Hmm, I only fixed the profile, efalse was already set to unlikely, but
indeed I think it should be likely. Maybe we can compute some bound on
actual probability by knowing if(cond[i]) probability.
If the loop always does factor many ones or zeros, the probability would
remain the same.
If that is p and they are all independent, the outcome would be
(1-p)^factor

so we know the conditional should be in the range (1-p)^factor ... (1-p),
right?
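For example (illustrative numbers), with p = 50% and factor 8 that would
put the all-zero-mask probability anywhere between 0.5^8 ~ 0.4% and 50%.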

Honza

> 
> Note -Ofast allows store data races and thus does RMW instead of a masked 
> store.
> 
> >make_single_succ_edge (store_bb, join_bb, EDGE_FALLTHRU);
> >if (dom_info_available_p (CDI_DOMINATORS))


Avoid double profile update in try_peel_loop

2023-07-17 Thread Jan Hubicka via Gcc-patches
Hi,
try_peel_loop uses gimple_duplicate_loop_body_to_header_edge, which subtracts
the profile from the original loop.  However, it then tries to scale the
profile in a wrong way (it forces the header count to be the entry count).

This eliminates two profile misupdates in the internal loop of sphinx3.

gcc/ChangeLog:

PR middle-end/110649
* tree-ssa-loop-ivcanon.cc (try_peel_loop): Avoid double profile update.

diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
index 0117dbfc91b..bdb738af7a8 100644
--- a/gcc/tree-ssa-loop-ivcanon.cc
+++ b/gcc/tree-ssa-loop-ivcanon.cc
@@ -1152,6 +1152,7 @@ try_peel_loop (class loop *loop,
 }
   if (may_be_zero)
 bitmap_clear_bit (wont_exit, 1);
+
   if (!gimple_duplicate_loop_body_to_header_edge (
loop, loop_preheader_edge (loop), npeel, wont_exit, exit,
_to_remove, DLTHE_FLAG_UPDATE_FREQ))
@@ -1168,18 +1169,6 @@ try_peel_loop (class loop *loop,
   adjust_loop_info_after_peeling (loop, npeel, true);
   profile_count entry_count = profile_count::zero ();
 
-  edge e;
-  edge_iterator ei;
-  FOR_EACH_EDGE (e, ei, loop->header->preds)
-if (e->src != loop->latch)
-  {
-   if (e->src->count.initialized_p ())
- entry_count += e->src->count;
-   gcc_assert (!flow_bb_inside_loop_p (loop, e->src));
-  }
-  profile_probability p;
-  p = entry_count.probability_in (loop->header->count);
-  scale_loop_profile (loop, p, -1);
   bitmap_set_bit (peeled_loops, loop->num);
   return true;
 }


Fix profile update in scale_profile_for_vect_loop

2023-07-17 Thread Jan Hubicka via Gcc-patches
Hi,
when vectorizing 4 times, we sometimes do
  for
<4x vectorized body>
  for
<2x vectorized body>
  for
<1x vectorized body>

Here the second two fors handling the epilogue never iterate.
Currently the vectorizer thinks that the middle for iterates twice.
This turns out to be caused by scale_profile_for_vect_loop, which uses
niter_for_unrolled_loop.

At that time we know the epilogue will iterate at most 2 times,
but niter_for_unrolled_loop does not know that the last iteration
will be taken by the epilogue-of-epilogue and thus it thinks
that the loop may iterate once and exit in the middle of the second
iteration.
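For example (illustrative numbers), with a scalar trip count of 11 and VF 4
the 4x loop runs twice (covering 8 iterations), the 2x epilogue runs at most
once and the 1x epilogue picks up the remainder, so the 2x loop can never
iterate twice.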

We already do a correct job updating the niter bounds and this is
just an ordering issue.  This patch makes us first update
the bounds and then do the updating of the loop.  I re-implemented
the function more correctly and precisely.

The loop reducing the iteration factor for overly flat profiles is a bit
funny, but the only other method I can think of is to compute an sreal
scale, which I think would have similar overhead.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

PR middle-end/110649
* tree-vect-loop.cc (scale_profile_for_vect_loop):
(vect_transform_loop):
(optimize_mask_stores):

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 7d917bfd72c..b44fb9c7712 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10842,31 +10842,30 @@ vect_get_loop_len (loop_vec_info loop_vinfo, 
gimple_stmt_iterator *gsi,
 static void
 scale_profile_for_vect_loop (class loop *loop, unsigned vf)
 {
-  edge preheader = loop_preheader_edge (loop);
-  /* Reduce loop iterations by the vectorization factor.  */
-  gcov_type new_est_niter = niter_for_unrolled_loop (loop, vf);
-  profile_count freq_h = loop->header->count, freq_e = preheader->count ();
-
-  if (freq_h.nonzero_p ())
-{
-  profile_probability p;
-
-  /* Avoid dropping loop body profile counter to 0 because of zero count
-in loop's preheader.  */
-  if (!(freq_e == profile_count::zero ()))
-freq_e = freq_e.force_nonzero ();
-  p = (freq_e * (new_est_niter + 1)).probability_in (freq_h);
-  scale_loop_frequencies (loop, p);
-}
-
+  /* Loop body executes VF fewer times and exit increases VF times.  */
   edge exit_e = single_exit (loop);
-  exit_e->probability = profile_probability::always () / (new_est_niter + 1);
-
-  edge exit_l = single_pred_edge (loop->latch);
-  profile_probability prob = exit_l->probability;
-  exit_l->probability = exit_e->probability.invert ();
-  if (prob.initialized_p () && exit_l->probability.initialized_p ())
-scale_bbs_frequencies (>latch, 1, exit_l->probability / prob);
+  profile_count entry_count = loop_preheader_edge (loop)->count ();
+
+  /* If we have unreliable loop profile avoid dropping entry
+ count below header count.  This can happen since loops
+ have unrealistically low trip counts.  */
+  while (vf > 1
+&& loop->header->count > entry_count
+&& loop->header->count < entry_count * vf)
+vf /= 2;
+
+  if (entry_count.nonzero_p ())
+set_edge_probability_and_rescale_others
+   (exit_e,
+entry_count.probability_in (loop->header->count / vf));
+  /* Avoid producing very large exit probability when we do not have
+ sensible profile.  */
+  else if (exit_e->probability < profile_probability::always () / (vf * 2))
+set_edge_probability_and_rescale_others (exit_e, exit_e->probability * vf);
+  loop->latch->count = single_pred_edge (loop->latch)->count ();
+
+  scale_loop_profile (loop, profile_probability::always () / vf,
+ get_likely_max_loop_iterations_int (loop));
 }
 
 /* For a vectorized stmt DEF_STMT_INFO adjust all vectorized PHI
@@ -11476,7 +11475,6 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
*loop_vectorized_call)
   niters_vector_mult_vf, !niters_no_overflow);
 
   unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
-  scale_profile_for_vect_loop (loop, assumed_vf);
 
   /* True if the final iteration might not handle a full vector's
  worth of scalar iterations.  */
@@ -11547,6 +11545,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
*loop_vectorized_call)
  assumed_vf) - 1
 : wi::udiv_floor (loop->nb_iterations_estimate + bias_for_assumed,
   assumed_vf) - 1);
+  scale_profile_for_vect_loop (loop, assumed_vf);
 
   if (dump_enabled_p ())
 {


Fix optimize_mask_stores profile update

2023-07-17 Thread Jan Hubicka via Gcc-patches
Hi,
While looking into sphinx3 regression I noticed that vectorizer produces
BBs with overall probability count 120%.  This patch fixes it.
Richi, I don't know how to create a testcase, but having one would
be nice.

Bootstrapped/regtested x86_64-linux, committed last night (sorry for the
late email)

gcc/ChangeLog:

PR tree-optimization/110649
* tree-vect-loop.cc (optimize_mask_stores): Set correctly
probability of the if-then-else construct.

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 7d917bfd72c..b44fb9c7712 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -11680,6 +11679,7 @@ optimize_mask_stores (class loop *loop)
   efalse = make_edge (bb, store_bb, EDGE_FALSE_VALUE);
   /* Put STORE_BB to likely part.  */
   efalse->probability = profile_probability::unlikely ();
+  e->probability = efalse->probability.invert ();
   store_bb->count = efalse->count ();
   make_single_succ_edge (store_bb, join_bb, EDGE_FALLTHRU);
   if (dom_info_available_p (CDI_DOMINATORS))


Turn TODO_rebuild_frequencies to a pass

2023-07-14 Thread Jan Hubicka via Gcc-patches
Hi,
currently we rebuild profile_counts from profile_probability after inlining,
because there is a chance that producing large loop nests may result in
unrealistically large profile_count values.  This is much less of a concern
since we switched to the new profile_count representation a while back.

This propagation can also compensate for profile inconsistencies caused by
optimization passes.  Since the inliner is followed by basic cleanup passes
that do not use the profile, we get a more realistic profile by delaying the
recomputation until the basic optimizations exposed by inlining are finished.

This does not fit into the TODO machinery, so I turn the rebuilding into a
stand-alone pass and schedule it before the first consumer of the profile in
the optimization queue.

I also added logic that avoids repropagating when the CFG profile is good and
not too close to overflow.  Propagation visits every basic block loop_depth
times, so it is not linear and avoiding it may help a bit.

On tramp3d we get 14 functions repropagated and 916 are OK.  The repropagated
functions are RB tree ones where we produce crazy loop nests by recursive
inlining.  This is something to fix independently.

Bootstrapped/regtested x86_64-linux.  Plan to commit it later today
if there are no complaints.

Honza

gcc/ChangeLog:

* passes.cc (execute_function_todo): Remove
TODO_rebuild_frequencies
* passes.def: Add rebuild_frequencies pass.
* predict.cc (estimate_bb_frequencies): Drop
force parameter.
(tree_estimate_probability): Update call of
estimate_bb_frequencies.
(rebuild_frequencies): Turn into a pass; verify CFG profile consistency
first and do not rebuild if not necessary.
(class pass_rebuild_frequencies): New.
(make_pass_rebuild_frequencies): New.
* profile-count.h: Add profile_count::very_large_p.
* tree-inline.cc (optimize_inline_calls): Do not return
TODO_rebuild_frequencies
* tree-pass.h (TODO_rebuild_frequencies): Remove.
(make_pass_rebuild_frequencies): Declare.

diff --git a/gcc/passes.cc b/gcc/passes.cc
index 2f0e378b8b2..d7b0ad271a1 100644
--- a/gcc/passes.cc
+++ b/gcc/passes.cc
@@ -2075,9 +2075,6 @@ execute_function_todo (function *fn, void *data)
   if (flags & TODO_remove_unused_locals)
 remove_unused_locals ();
 
-  if (flags & TODO_rebuild_frequencies)
-rebuild_frequencies ();
-
   if (flags & TODO_rebuild_cgraph_edges)
 cgraph_edge::rebuild_edges ();
 
diff --git a/gcc/passes.def b/gcc/passes.def
index faa5208b26b..f2893ae8a8b 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -206,6 +206,10 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_post_ipa_warn);
   /* Must run before loop unrolling.  */
   NEXT_PASS (pass_warn_access, /*early=*/true);
+  /* Profile count may overflow as a result of inlinining very large
+ loop nests.  This pass should run before any late pass that makes
+use of profile.  */
+  NEXT_PASS (pass_rebuild_frequencies);
   NEXT_PASS (pass_complete_unrolli);
   NEXT_PASS (pass_backprop);
   NEXT_PASS (pass_phiprop);
@@ -395,6 +399,10 @@ along with GCC; see the file COPYING3.  If not see
  to forward object-size and builtin folding results properly.  */
   NEXT_PASS (pass_copy_prop);
   NEXT_PASS (pass_dce);
+  /* Profile count may overflow as a result of inlinining very large
+ loop nests.  This pass should run before any late pass that makes
+use of profile.  */
+  NEXT_PASS (pass_rebuild_frequencies);
   NEXT_PASS (pass_sancov);
   NEXT_PASS (pass_asan);
   NEXT_PASS (pass_tsan);
diff --git a/gcc/predict.cc b/gcc/predict.cc
index 1aa4c25eb70..26f9f3f6a88 100644
--- a/gcc/predict.cc
+++ b/gcc/predict.cc
@@ -89,7 +89,7 @@ static void predict_paths_leading_to_edge (edge, enum 
br_predictor,
 static bool can_predict_insn_p (const rtx_insn *);
 static HOST_WIDE_INT get_predictor_value (br_predictor, HOST_WIDE_INT);
 static void determine_unlikely_bbs ();
-static void estimate_bb_frequencies (bool force);
+static void estimate_bb_frequencies ();
 
 /* Information we hold about each branch predictor.
Filled using information from predict.def.  */
@@ -3169,8 +3169,9 @@ tree_estimate_probability (bool dry_run)
   delete bb_predictions;
   bb_predictions = NULL;
 
-  if (!dry_run)
-estimate_bb_frequencies (false);
+  if (!dry_run
+  && profile_status_for_fn (cfun) != PROFILE_READ)
+estimate_bb_frequencies ();
   free_dominance_info (CDI_POST_DOMINATORS);
   remove_fake_exit_edges ();
 }
@@ -3923,103 +3924,97 @@ determine_unlikely_bbs ()
 }
 
 /* Estimate and propagate basic block frequencies using the given branch
-   probabilities.  If FORCE is true, the frequencies are used to estimate
-   the counts even when there are already non-zero profile counts.  */
+   probabilities.  */
 
 static void
-estimate_bb_frequencies (bool force)
+estimate_bb_frequencies ()
 {
   

Loop-ch improvements, part 3

2023-07-14 Thread Jan Hubicka via Gcc-patches
Hi,
loop-ch currently does an analysis using ranger for all loops to identify
candidates, followed by a phase where headers are duplicated (which breaks
SSA and ranger).  The second stage does more analysis (to see how many BBs we
want to duplicate) but can't use ranger and thus misses information about
static conditionals.

This patch pushes all analysis into the first stage.  We record how many BBs
to duplicate, and the second stage just duplicates as it is told.  This makes
it possible to extend the range query also to basic blocks that are not
headers.  This is easy to do, since we already do a path-specific query, so
we only need to extend the path by the headers we decided to duplicate
earlier.

This makes it possible to track situations where an exit is always false in
the first iteration even for tests not in the original loop header.  Doing so
lets us update the profile better and do better heuristics.  In particular I
changed the logic as follows:
  1) should_duplicate_loop_header_p counts the size of the duplicated region.
     When we know that a given conditional will be constant true or constant
     false either in the duplicated region, by range query, or in the loop
     body after duplication (since it is loop invariant), we do not account
     it to the code size costs
  2) we don't need to account loop invariant computations that will be
     duplicated, as they will become fully invariant
     (maybe we want to have some cap for register pressure eventually?)
  3) optimize_size logic is now different.  Originally we started duplicating
     iff the first conditional was known to be true by ranger query, but then
     we used the same limits as for -O2.

     I now simply lower the limits to 0.  This means that every conditional
     in the duplicated sequence must be either loop invariant or constant
     when duplicated, and we only duplicate statements computing loop
     invariants, which we account to 0 size anyway.

This makes the code IMO more streamlined (and hopefully will let us merge it
with the loop peeling logic), but makes little difference in practice.
The problem is that in the loop:

void test2();
void test(int n)
{
  for (int i = 0; n && i < 10; i++)
  test2();
}

We produce:
   [local count: 1073741824 freq: 9.090909]:
  # i_4 = PHI <0(2), i_9(3)>
  _1 = n_7(D) != 0;
  _2 = i_4 <= 9;
  _3 = _1 & _2;
  if (_3 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

and do not understand that the final conditional is a combination of a
conditional that is always true in the first iteration and a conditional
that is loop invariant.

This is also the case of
void test2();
void test(int n)
{
  for (int i = 0; n; i++)
{
  if (i > 10)
break;
  test2();
}
}
which we turn into the earlier case in ifcombine.

With ifcombine disabled things however work as expected.  This is something
I plan to handle incrementally.  However, extending the loop-ch and peeling
passes to understand such combined conditionals is still not good enough: at
the time ifcombine merged the two conditionals we lost the profile
information on how often n is 0, so we can't recover a correct profile or
know the expected number of iterations after the transform.

Bootstrapped/regtested x86_64-linux, OK?

Honza


gcc/ChangeLog:

* tree-ssa-loop-ch.cc (edge_range_query): Take loop argument; be ready
for queries not in headers.
(static_loop_exit): Add basic block parameter; update use of
edge_range_query.
(should_duplicate_loop_header_p): Add ranger and static_exits
parameter.  Do not account statements that will be optimized
out after duplication in overall size.  Add ranger query to
find static exits.
(update_profile_after_ch): Take static_exits as a set instead of
a single eliminated_edge.
(ch_base::copy_headers): Do all analysis in the first pass;
remember invariant_exits and static_exits.

diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index 24e7fbc805a..e0139cb432c 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
the range of the solved conditional in R.  */
 
 static void
-edge_range_query (irange , edge e, gcond *cond, gimple_ranger )
+edge_range_query (irange , class loop *loop, gcond *cond, gimple_ranger 
)
 {
-  auto_vec path (2);
-  path.safe_push (e->dest);
-  path.safe_push (e->src);
+  auto_vec path;
+  for (basic_block bb = gimple_bb (cond); bb != loop->header; bb = 
single_pred_edge (bb)->src)
+path.safe_push (bb);
+  path.safe_push (loop->header);
+  path.safe_push (loop_preheader_edge (loop)->src);
   path_range_query query (ranger, path);
   if (!query.range_of_stmt (r, cond))
 r.set_varying (boolean_type_node);
@@ -63,17 +65,16 @@ edge_range_query (irange , edge e, gcond *cond, 
gimple_ranger )
and NULL otherwise.  */
 
 static edge
-static_loop_exit (class loop *l, gimple_ranger *ranger)
+static_loop_exit 

Loop-ch improvements, part 2

2023-07-12 Thread Jan Hubicka via Gcc-patches
Hi,
as discussed, this patch moves profile updating to tree-ssa-loop-ch.cc since
it is now quite ch-specific.  There are no functional changes.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* tree-cfg.cc (gimple_duplicate_sese_region): Rename to ...
(gimple_duplicate_seme_region): ... this; break out profile updating
code to ...
* tree-ssa-loop-ch.cc (update_profile_after_ch): ... here.
(ch_base::copy_headers): Update.
* tree-cfg.h (gimple_duplicate_sese_region): Rename to ...
(gimple_duplicate_seme_region): ... this.

diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 7dad7b4ac72..7ccc2a5a5a7 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -6662,25 +6662,19 @@ add_phi_args_after_copy (basic_block *region_copy, 
unsigned n_region,
The function returns false if it is unable to copy the region,
true otherwise.
 
-   ELIMINATED_EDGE is an edge that is known to be removed in the dupicated
-   region.  ORIG_ELIMINATED_EDGES, if non-NULL is set of edges known to be
-   removed from the original region.  */
+   It is callers responsibility to update profile.  */
 
 bool
-gimple_duplicate_sese_region (edge entry, edge exit,
+gimple_duplicate_seme_region (edge entry, edge exit,
  basic_block *region, unsigned n_region,
  basic_block *region_copy,
- bool update_dominance,
- edge eliminated_edge,
- hash_set <edge> *orig_eliminated_edges)
+ bool update_dominance)
 {
   unsigned i;
   bool free_region_copy = false, copying_header = false;
   class loop *loop = entry->dest->loop_father;
   edge exit_copy;
   edge redirected;
-  profile_count total_count = profile_count::uninitialized ();
-  profile_count entry_count = profile_count::uninitialized ();
 
   if (!can_copy_bbs_p (region, n_region))
 return false;
@@ -6733,144 +6727,10 @@ gimple_duplicate_sese_region (edge entry, edge exit,
  inside.  */
   auto_vec doms;
   if (update_dominance)
-{
-  doms = get_dominated_by_region (CDI_DOMINATORS, region, n_region);
-}
-
-  if (entry->dest->count.initialized_p ())
-{
-  total_count = entry->dest->count;
-  entry_count = entry->count ();
-  /* Fix up corner cases, to avoid division by zero or creation of negative
-frequencies.  */
-  if (entry_count > total_count)
-   entry_count = total_count;
-}
+doms = get_dominated_by_region (CDI_DOMINATORS, region, n_region);
 
   copy_bbs (region, n_region, region_copy, , 1, _copy, loop,
split_edge_bb_loc (entry), update_dominance);
-  if (total_count.initialized_p () && entry_count.initialized_p ())
-{
-  if (!eliminated_edge
- && (!orig_eliminated_edges || orig_eliminated_edges->is_empty ()))
-   {
- scale_bbs_frequencies_profile_count (region, n_region,
-  total_count - entry_count,
-  total_count);
- scale_bbs_frequencies_profile_count (region_copy, n_region,
-  entry_count, total_count);
-   }
-  else
-   {
- /* We only support only case where eliminated_edge is one and it
-exists first BB.  We also assume that the duplicated region is
-acyclic.  So we expect the following:
-
-  // region_copy_start entry will be scaled to entry_count
-if (cond1) <- this condition will become false
-  and we update probabilities
-  goto loop_exit;
-if (cond2) <- this condition is loop invariant
-  goto loop_exit;
-goto loop_header   <- this will be redirected to loop.
-  // region_copy_end
-loop:
-  
-  // region start
-loop_header:
-  if (cond1)   <- we need to update probabbility here
-goto loop_exit;
-  if (cond2)   <- and determine scaling factor here.
-  moreover cond2 is now always true
-goto loop_exit;
-  else
-goto loop;
-  // region end
-
-Adding support for more exits can be done similarly,
-but only consumer so far is tree-ssa-loop-ch and it uses only this
-to handle the common case of peeling headers which have
-conditionals known to be always true upon entry.  */
- gcc_checking_assert (copying_header);
- for (unsigned int i = 0; i < n_region; i++)
-   {
- edge exit_e, exit_e_copy, e, e_copy;
- if (EDGE_COUNT (region[i]->succs) == 1)
-   {
- 

Re: [x86-64] RFC: Add nosse abi attribute

2023-07-11 Thread Jan Hubicka via Gcc-patches
> > > When a function doesn't contain calls to
> > > unknown functions we can be a bit more lenient: we can make it so that
> > > GCC simply doesn't touch xmm8-15 at all, then no save/restore is
> > > necessary.

One may also take into account that the first 8 registers are cheaper to
encode than the later 8, so perhaps we may want to choose a range that
contains both.

Honza


Loop-ch improvements, part 1

2023-07-11 Thread Jan Hubicka via Gcc-patches
Hi,
this patch improves the profile update in loop-ch to handle the situation where the
duplicated header has a loop invariant test.  In this case we know that all of the
count of the exit edge belongs to the duplicated loop header edge and we can update
probabilities accordingly.
Since we also do all the work to track this information from analysis to duplication,
I also added code to turn those conditionals into constants so we do not need a later
jump threading pass to clean up.

This made me work out that the propagation was buggy in a few aspects
 1) it handled every PHI as a PHI in the header and incorrectly assigned some PHIs
    to be IV-like when they are not
 2) it did not check for novops calls that are not required to return the same
    value on every invocation.
 3) I also added a check for asm statements since those are not necessarily
    reproducible either.

I would like to do more changes, but tried to prevent this patch from
snowballing.  The analysis of what statements will remain after duplication can
be improved.  I think we should use the ranger query for basic blocks other than
the first one, too, and possibly drop the IV heuristics then.  Also it seems that
a lot of this logic is pretty much the same as the analysis in the peeling pass,
so unifying them would be nice.

I also think I should move the profile update out of
gimple_duplicate_sese_region (it is now very specific to ch) and rename it,
since those regions are single entry multiple exit.

Bootstrapped/regtested x86_64-linux, OK?

Honza

gcc/ChangeLog:

* tree-cfg.cc (gimple_duplicate_sese_region): Add ORIG_ELIMINATED_EDGES
parameter and rewrite profile updating code to handle edges elimination.
* tree-cfg.h (gimple_duplicate_sese_region): Update prototype.
* tree-ssa-loop-ch.cc (loop_invariant_op_p): New function.
(loop_iv_derived_p): New function.
(should_duplicate_loop_header_p): Track invariant exit edges; fix 
handling
of PHIs and propagation of IV derived variables.
(ch_base::copy_headers): Pass around the invariant edges hash set.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/loop-ch-profile-1.c: Remove xfail.

diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 4989906706c..3879fb7c4c1 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -6661,14 +6661,16 @@ add_phi_args_after_copy (basic_block *region_copy, 
unsigned n_region,
true otherwise.
 
ELIMINATED_EDGE is an edge that is known to be removed in the dupicated
-   region.  */
+   region.  ORIG_ELIMINATED_EDGES, if non-NULL is set of edges known to be
+   removed from the original region.  */
 
 bool
 gimple_duplicate_sese_region (edge entry, edge exit,
  basic_block *region, unsigned n_region,
  basic_block *region_copy,
  bool update_dominance,
- edge eliminated_edge)
+ edge eliminated_edge,
+ hash_set <edge> *orig_eliminated_edges)
 {
   unsigned i;
   bool free_region_copy = false, copying_header = false;
@@ -6747,7 +6749,8 @@ gimple_duplicate_sese_region (edge entry, edge exit,
split_edge_bb_loc (entry), update_dominance);
   if (total_count.initialized_p () && entry_count.initialized_p ())
 {
-  if (!eliminated_edge)
+  if (!eliminated_edge
+ && (!orig_eliminated_edges || orig_eliminated_edges->is_empty ()))
{
  scale_bbs_frequencies_profile_count (region, n_region,
   total_count - entry_count,
@@ -6765,7 +6768,7 @@ gimple_duplicate_sese_region (edge entry, edge exit,
 if (cond1) <- this condition will become false
   and we update probabilities
   goto loop_exit;
-if (cond2)
+if (cond2) <- this condition is loop invariant
   goto loop_exit;
 goto loop_header   <- this will be redirected to loop.
   // region_copy_end
@@ -6776,6 +6779,7 @@ gimple_duplicate_sese_region (edge entry, edge exit,
   if (cond1)   <- we need to update probabbility here
 goto loop_exit;
   if (cond2)   <- and determine scaling factor here.
+  moreover cond2 is now always true
 goto loop_exit;
   else
 goto loop;
@@ -6785,53 +6789,84 @@ gimple_duplicate_sese_region (edge entry, edge exit,
 but only consumer so far is tree-ssa-loop-ch and it uses only this
 to handle the common case of peeling headers which have
 conditionals known to be always true upon entry.  */
- gcc_assert (eliminated_edge->src == region[0]
- && EDGE_COUNT (region[0]->succs) == 2
- && copying_header);
-
- edge e, e_copy, 

Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-11 Thread Jan Hubicka via Gcc-patches
> > By now we did CCP and FRE so we likely optimized out most of constant
> > conditionals exposed by inline.
> 
> So maybe we should simply delay re-propagation of the profile?  I
> think cunrolli doesn't so much care about the profile - cunrolli
> is (was) about abstraction removal.  Jump threading should be
> the first pass to care.

That is what I was thinking too.  After inlining the profile counts may
be in quite bad shape.  If you inline together loops like in exchange, which
has a large loop nest, we will definitely end up capping counts to avoid
overflow.

cunrolli does:

 ret = tree_unroll_loops_completely (optimize >= 3, false);

which sets may_increase_size to true for -O3 and then

 may_increase_size && optimize_loop_nest_for_speed_p (loop)

which seems like a reasonable guard, and it may get random answers on a capped
profile.  It is not a big deal to try propagating before cunrolli and then
again before threading and see how much potential this idea has.
I guess I should also double check that the other passes are indeed
safe, but I think it is quite obvious they should be.

Honza
> 
> Richard.


Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-11 Thread Jan Hubicka via Gcc-patches
> 
> What I saw most wrecking the profile is when passes turn
> if (cond) into if (0/1) leaving the CFG adjustment to CFG cleanup
> which then simply deletes one of the outgoing edges without doing
> anything to the (guessed) profile.

Yep, I agree that this is disturbing.  At the cfg cleanup time one can
hardly do anything useful, since the knowledge of the transform that caused
the profile inconsistency is forgotten.  However, I think it is not a complete
disaster.

With profile feedback the most common case of this happening is a
situation where we duplicated code (by inlining, unrolling etc.) into a
context where it behaves differently than the typical behaviour
represented by the profile.

So if one ends up zapping an edge with large probability, one also knows
that the code being optimized does not exhibit the typical behaviour from
the training run and thus is not very hot.  So the profile inconsistency should
not affect performance that much.

So doing nothing may IMO end up being safer than trying to get the
in/out counts right without really knowing what is going on.

This is mostly about the scenario "constant propagated this conditional
and profile disagrees with me".  There are other cases where the update is
IMO important, e.g. the vectorizer forgetting to cap the number of iterations of
the epilogue may cause issues, since the epilogue loop then looks more frequent
than the main vectorized loop and it may cause IRA to insert spilling
into it or so.

When we duplicate we have a chance to figure out the profile updates.
Also we may try to get as much as possible done early.
I think we should again do loop header copying that does not expand code
at early opts.  I have some more plans on cleaning up loop-ch and
then we can give it a try.

With a guessed profile we always have the option to re-do the propagation.
There is TODO_rebuild_frequencies for that, which we do after inlining.
This is mostly to handle possible overflows on large loop nests
constructed by the inliner.

We can re-propagate once again after late cleanup passes. Looking at the
queue, we have:

  NEXT_PASS (pass_remove_cgraph_callee_edges);
  /* Initial scalar cleanups before alias computation.
 They ensure memory accesses are not indirect wherever possible.  */
  NEXT_PASS (pass_strip_predict_hints, false /* early_p */);
  NEXT_PASS (pass_ccp, true /* nonzero_p */);
  /* After CCP we rewrite no longer addressed locals into SSA
 form if possible.  */
  NEXT_PASS (pass_object_sizes);
  NEXT_PASS (pass_post_ipa_warn);
  /* Must run before loop unrolling.  */
  NEXT_PASS (pass_warn_access, /*early=*/true);
  NEXT_PASS (pass_complete_unrolli);
 here we care about profile
  NEXT_PASS (pass_backprop);
  NEXT_PASS (pass_phiprop);
  NEXT_PASS (pass_forwprop);
  /* pass_build_alias is a dummy pass that ensures that we
 execute TODO_rebuild_alias at this point.  */
  NEXT_PASS (pass_build_alias);
  NEXT_PASS (pass_return_slot);
  NEXT_PASS (pass_fre, true /* may_iterate */);
  NEXT_PASS (pass_merge_phi);
  NEXT_PASS (pass_thread_jumps_full, /*first=*/true);
 here

By now we did CCP and FRE so we likely optimized out most of constant
conditionals exposed by inline.
Honza


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-11 Thread Jan Hubicka via Gcc-patches
> > > FWIW, this particular patch was regstrapped on x86-64-linux
> > > with trunk from a week ago (and sniff-tested on current trunk).
> >
> > This looks really cool.
> 
> The biggest benefit might be from IPA with LTO where we'd carefully place 
> those
> attributes at WPA time (at that time tieing our hands for later).

Within single partition IRA already propagates the knowledge about
callee-clobbered registers.

Across partitions we already automatically enable regparm with -m32;
see ix86_function_regparm and the tests for target->local and
can_change_attribute.

Enabling SSE at the same spot should be easy.

Honza


Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-10 Thread Jan Hubicka via Gcc-patches
> On Fri, 7 Jul 2023, Jan Hubicka wrote:
> 
> > > 
> > > Looks good, but I wonder what we can do to at least make the
> > > multiple exit case behave reasonably?  The vectorizer keeps track
> > 
> > > of a "canonical" exit, would it be possible to pass in the main
> > > exit edge and use that instead of single_exit (), would other
> > > exits then behave somewhat reasonable or would we totally screw
> > > things up here?  That is, the "canonical" exit would be the
> > > counting exit while the other exits are on data driven conditions
> > > and thus wouldn't change probability when we reduce the number
> > > of iterations(?)
> > 
> > I can add canonical_exit parameter and make the function to direct flow
> > to it if possible.  However overall I think fixup depends on what
> > transformation led to the change.
> 
> I think the vectorizer knows there's a single counting IV and all
> other exits are dependent on data processed, so the scaling the
> vectorizer just changes the counting IV.  So I think it makes
> sense to pass that exit to the function in all cases.

It really seems to me that a vectorized loop is like N loop iterations happening
in parallel, so the probabilities of the alternative exits grow as well.
But the canonical exit is the right thing to do for prologues - here we really
add extra conditions to the iteration-counting exit.
> 
> > Assuming that vectorizer did no prologues and apilogues and we
> > vectorized with factor N, then I think the update could be done more
> > specifically as follows.
> > 
> > We know that header block count dropped by 4. So we can start from that
> > and each time we reach basic block with exit edge, we know the original
> > count of the edge.  This count is unchanged, so one can rescale
> > probabilities out of that BB accordingly.  If loop has no inner loops,
> > we can just walk the body in RPO and propagate scales downwards and we
> > sould arrive to right result
> 
> That should work for alternate exits as well, no?
Yes, I think it could mostly work for acyclic bodies.  I ended up
implementing a special case of this for loop-ch in order to handle
loop invariant conditionals correctly.  Will send the patch after some
cleanups.  (There seem to be more loop invariant conditionals in real
code than I would have thought.)

Tampering only with loop exit probabilities is not always enough.
If you have:
  while (1)
if (test1)
  {
if (test2)
  break;
  }
increasing the count of the exit may require increasing the probability of the outer
conditional.  Do we support this in vectorization at all, and if so, do
we know something here?
For example, since test1 triggers if it is true in any one of the
iterations packed together, its probability also increases with the
vectorization factor.
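
A minimal sketch of that effect (my illustration, not code from the patch),
assuming the iterations packed into one vector iteration are independent:

  #include <math.h>

  /* Probability that a conditional with scalar probability P fires in at
     least one of VF iterations executed together.  For small P this is
     roughly P * VF, which is the scaling-up described above.  */
  static double
  vectorized_probability (double p, int vf)
  {
    return 1.0 - pow (1.0 - p, vf);
  }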

We run into this in peeling, i.e. when we prove that test1 will trigger
undefined behaviour after one or two iterations but the original
estimated profile believes in a higher iteration count.  I added a special
case for this yesterday to avoid turning if (test2) into 100% in this case,
as that triggers strange codegen in some of the Fortran testcases.

We also can have
  while (1)
while (test1)
  {
if (test2)
  break;
  }
Which is harder because changing the probability of test2 affects the number
of iterations of the inner loop.  So I am giving up on this.
I think currently it happens mostly with unlooping.
> 
> > I originally added the bound parameter to handle prologues/epilogues
> > which gets new artificial bound.  In prologue I think you are right that
> > the flow will be probably directed to the conditional counting
> > iterations.
> 
> I suppose we'd need to scale both main and epilogue together since
> the epilogue "steals" from the main loop counts.  Likewise if there's
> a skip edge around the vector loop.  I think currently we simply
> set the edge probability of those skip conds rather than basing
> this off the niter values they work on.  Aka if (niter < VF) goto
> epilogue; do {} while (niter / VF); epilogue: do {} while (niter);
> 
> There's also the cost model which might require niter > VF to enter
> the main loop body.

I think I mostly understand this, since we were playing with it with Ondra's
histograms (which can be used to get some of the unknowns in the
transformation right).  The unknowns (how many times we end up jumping to the
epilogue, for instance) probably can't be reasonably well guessed if we
do not know the loop histogram, which currently we know only if we prove
that the loop has a constant number of iterations.  So I am trying to get
at least this case right first.

The theoretically correct approach would be to first determine the entry counts
of the prologue and epilogue, then produce what we believe to be the correct
profile of those and subtract it from the main loop profile, also updating
probabilities in basic blocks where we did nontrivial changes while
updating prologues/epilogues.  Finally, scale down the main loop profile and
increase the exit probabilities.

Honza


Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-10 Thread Jan Hubicka via Gcc-patches
Hi,
over the weekend I found that the vectorizer is missing scale_loop_profile for
epilogues.  It already adjusts loop_info to set max iterations, so
adding it was easy.  However, it now predicts the first loop to iterate at
most once (which is too much; I suppose it forgets to divide by the epilogue
unrolling factor) and the second never.
> 
> The -O2 cost model doesn't want to do epilogues:
> 
>   /* If using the "very cheap" model. reject cases in which we'd keep
>  a copy of the scalar code (even if we might be able to vectorize it).  
> */
>   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
>   && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
>   || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>   || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> {
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>  "some scalar iterations would need to be 
> peeled\n");
>   return 0;
> }
> 
> it's because of the code size increase.

I know; however, -O2 is not -Os, and here the performance/code size tradeoff
seems a lot better than for other code-expanding things we do at -O2
(such as unrolling 3 times).
I think we set the very cheap cost model very conservatively in order to
get -ftree-vectorize enabled with -O2, and there is some room for finding the
right balance.

I get:

jan@localhost:~> cat t.c
int a[99];
__attribute((noipa, weak))
void
test()
{
for (int i = 0 ; i < 99; i++)
a[i]++;
}
void
main()
{
for (int j = 0; j < 1000; j++)
test();
}
jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out

real0m0.529s
user0m0.528s
sys 0m0.000s

jan@localhost:~> gcc -O2 t.c ; time ./a.out

real0m0.427s
user0m0.426s
sys 0m0.000s
jan@localhost:~> gcc -O3 t.c ; time ./a.out

real0m0.136s
user0m0.135s
sys 0m0.000s
jan@localhost:~> clang -O2 t.c ; time ./a.out


real0m0.116s
user0m0.116s
sys 0m0.000s

Code size (of function test):
 gcc -O2 -fno-unroll-loops 17  bytes
 gcc -O2   29  bytes
 gcc -O3   50  bytes
 clang -O2 510 bytes

So unrolling is 70% code size growth for a 23% speedup.
Vectorizing is 294% code size growth for a 388% speedup.
Clang does 3000% code size growth for a 456% speedup.
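
(Reading those percentages as ratios against the 17-byte / 0.529s
-fno-unroll-loops baseline above: unrolling is ~1.7x the size for ~1.24x the
speed, vectorizing is ~2.9x for ~3.9x, and clang is ~30x for ~4.6x.)
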
> 
> That's clearly much larger code.  On x86 we're also fighting with
> large instruction encodings here, in particular EVEX for AVX512 is
> "bad" here.  We hardly get more than two instructions decoded per
> cycle due to their size.

Agreed, I found it surprising that clang does that much complete unrolling
at -O2.  However, vectorizing and not unrolling here seems like it may be
a better default for -O2 than what we do currently...

Honza
> 
> Richard.


Improve dumping of profile_count

2023-07-09 Thread Jan Hubicka via Gcc-patches
Hi,
dumps of profile_count are quite hard to interpret since they are 64-bit fixed
point values.  In many cases one looks at a single function and it is better to
think of basic block frequency, that is, how many times the block is executed per
invocation.  This patch makes CFG dumps also print this info.

For example:
main()
{
for (int i = 0; i < 10; i++)
t();
}

the -fdump-tree-optimized-blocks-details now prints:
int main ()
{
  unsigned int ivtmp_1;
  unsigned int ivtmp_2;

;;   basic block 2, loop depth 0, count 97603128 (estimated locally, freq 
1.), maybe hot
;;prev block 0, next block 3, flags: (NEW, VISITED)
;;pred:   ENTRY [always]  count:97603128 (estimated locally, freq 
1.) (FALLTHRU,EXECUTABLE)
;;succ:   3 [always]  count:97603128 (estimated locally, freq 1.) 
(FALLTHRU,EXECUTABLE)

;;   basic block 3, loop depth 1, count 976138697 (estimated locally, freq 
10.0011), maybe hot
;;prev block 2, next block 4, flags: (NEW, VISITED)
;;pred:   3 [90.0% (guessed)]  count:878535568 (estimated locally, freq 
9.0011) (TRUE_VALUE,EXECUTABLE)
;;2 [always]  count:97603128 (estimated locally, freq 1.) 
(FALLTHRU,EXECUTABLE)
  # ivtmp_2 = PHI 
  t ();
  ivtmp_1 = ivtmp_2 + 4294967295;
  if (ivtmp_1 != 0)
goto ; [90.00%]
  else
goto ; [10.00%]
;;succ:   3 [90.0% (guessed)]  count:878535568 (estimated locally, freq 
9.0011) (TRUE_VALUE,EXECUTABLE)
;;4 [10.0% (guessed)]  count:97603129 (estimated locally, freq 
1.) (FALSE_VALUE,EXECUTABLE)

;;   basic block 4, loop depth 0, count 97603128 (estimated locally, freq 
1.), maybe hot
;;prev block 3, next block 1, flags: (NEW, VISITED)
;;pred:   3 [10.0% (guessed)]  count:97603129 (estimated locally, freq 
1.) (FALSE_VALUE,EXECUTABLE)
  return 0;
;;succ:   EXIT [always]  count:97603128 (estimated locally, freq 
1.) (EXECUTABLE)

}

Which makes it easier to see that the inner bb is executed 10 times per
invocation.
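
(Indeed, dividing the loop body count by the entry count gives
976138697 / 97603128 ~ 10.0011, which is exactly the "freq" printed for
basic block 3 above.)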


gcc/ChangeLog:

* cfg.cc (check_bb_profile): Dump counts with relative frequency.
(dump_edge_info): Likewise.
(dump_bb_info): Likewise.
* profile-count.cc (profile_count::dump): Add comma between quality and
freq.

gcc/testsuite/ChangeLog:

* gcc.dg/predict-22.c: Update template.

diff --git a/gcc/cfg.cc b/gcc/cfg.cc
index 740d4f3581d..0de6d6b9e71 100644
--- a/gcc/cfg.cc
+++ b/gcc/cfg.cc
@@ -475,9 +475,9 @@ check_bb_profile (basic_block bb, FILE * file, int indent)
{
  fprintf (file, ";; %sInvalid sum of incoming counts ",
   s_indent);
- sum.dump (file);
+ sum.dump (file, fun);
  fprintf (file, ", should be ");
- bb->count.dump (file);
+ bb->count.dump (file, fun);
  fprintf (file, "\n");
}
 }
@@ -525,7 +525,7 @@ dump_edge_info (FILE *file, edge e, dump_flags_t flags, int 
do_succ)
   if (e->count ().initialized_p () && do_details)
 {
   fputs (" count:", file);
-  e->count ().dump (file);
+  e->count ().dump (file, cfun);
 }
 
   if (e->flags && do_details)
@@ -808,7 +808,7 @@ dump_bb_info (FILE *outf, basic_block bb, int indent, 
dump_flags_t flags,
  if (bb->count.initialized_p ())
{
  fputs (", count ", outf);
- bb->count.dump (outf);
+ bb->count.dump (outf, cfun);
}
  if (maybe_hot_bb_p (fun, bb))
fputs (", maybe hot", outf);
diff --git a/gcc/profile-count.cc b/gcc/profile-count.cc
index 6bf9700d8db..2c07ebc5942 100644
--- a/gcc/profile-count.cc
+++ b/gcc/profile-count.cc
@@ -94,7 +94,7 @@ profile_count::dump (char *buffer, struct function *fun) const
   else if (fun && initialized_p ()
   && fun->cfg
   && ENTRY_BLOCK_PTR_FOR_FN (fun)->count.initialized_p ())
-sprintf (buffer, "%" PRId64 " (%s freq %.4f)", m_val,
+sprintf (buffer, "%" PRId64 " (%s, freq %.4f)", m_val,
 profile_quality_display_names[m_quality],
 to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (fun)->count).to_double ());
   else
diff --git a/gcc/testsuite/gcc.dg/predict-22.c 
b/gcc/testsuite/gcc.dg/predict-22.c
index f14c2b68861..1aed03fae79 100644
--- a/gcc/testsuite/gcc.dg/predict-22.c
+++ b/gcc/testsuite/gcc.dg/predict-22.c
@@ -55,5 +55,5 @@ foo (int x, int y, int z)
   baz ();
 }
 /* { dg-final { scan-tree-dump-times "Invalid sum" 0 "optimized"} } */
-/* { dg-final { scan-tree-dump-times "count 0 .precise.," 1 "optimized"} } */
+/* { dg-final { scan-tree-dump-times "count 0 .precise" 1 "optimized"} } */
 /* { dg-final { scan-rtl-dump-times "COLD_PARTITION" 1 "bbpart"} } */


Add missing dump_file check

2023-07-08 Thread Jan Hubicka via Gcc-patches
Hi,
I forgot to check dump_file being non-NULL before writing to it.
It is somewhat odd that this does not trigger more often - I will take a
deeper look tomorrow, but I am checking this in as obvious to avoid the ICE.

Honza

gcc/ChangeLog:

PR tree-optimization/110600
* cfgloopmanip.cc (scale_loop_profile): Add missing profile_dump check.

gcc/testsuite/ChangeLog:

PR tree-optimization/110600
* gcc.c-torture/compile/pr110600.c: New test.

diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index 52732420787..5c0065b2f5a 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -582,9 +582,10 @@ scale_loop_profile (class loop *loop, profile_probability 
p,
 
   if (exit_edge && exit_edge->src->loop_father != loop)
 {
-  fprintf (dump_file,
-  ";; Loop exit is in inner loop;"
-  " will leave exit probabilities inconsistent\n");
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file,
+";; Loop exit is in inner loop;"
+" will leave exit probabilities inconsistent\n");
 }
   else if (exit_edge)
 {
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr110600.c 
b/gcc/testsuite/gcc.c-torture/compile/pr110600.c
new file mode 100644
index 000..4b126f74e43
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr110600.c
@@ -0,0 +1,6 @@
+int a(int b, int c) { return (b ^ c) < 0 ? b : b - c; }
+int main() {
+  for (int e = 0; e != -1; e = a(e, 1))
+;
+  return 0;
+}


Fix profile update in tree-ssa/update-cunroll.c

2023-07-08 Thread Jan Hubicka via Gcc-patches
Fix tree-ssa/update-cunroll.c

In this testcase the profile is misupdated because the loop has two exits.
The first exit is the one eliminated by complete unrolling while the second exit
remains.  We remove the first exit but forget that the source BB of the other
exit will then have a higher frequency, making the other exit more likely.

This patch fixes that in duplicate_loop_body_to_header_edge.
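
For illustration (this is not the exact testcase, just the shape of the
problem), consider a loop with one counted exit and one data-dependent exit:

  int a[4];

  int
  f (void)
  {
    /* The i < 4 test is the counted exit that complete unrolling removes;
       the if (a[i]) test is the data-dependent exit kept in every copy.  */
    for (int i = 0; i < 4; i++)
      if (a[i])
        return i;
    return -1;
  }

After complete unrolling the counted exit disappears, so the count that
previously left through it now flows into the remaining copies of the
data-dependent test, and their exit probabilities must be scaled up.
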
While looking into the resulting profiles I also noticed that in some cases
scale_loop_profile may drop probabilities to 0 incorrectly, either when
trying to update an exit from a nested loop (which has a similar problem) or when
the profile was inconsistent, as described in the comment below.

With the patch I now get on tramp3d with -O3:

Profile consistency report:

Pass dump id and name|static mismat|dynamic mismatch  
 |in count |in count  
127t ch  | 12   +10|0 
131t dom | 20+8|0 
134t reassoc | 22+2|0 
136t forwprop| 26+4|   185250  +185250
159t cddce   | 38   +12|   213412   +28162
161t ldist   | 39+1|   213412 
172t ifcvt   | 41+2|   369692  +156280
173t vect|108   +67|  9508861 +9139169
176t cunroll |102-6| 11603578 +2094717
183t loopdone|101-1| 11547143   -56435
197t dom |100-1| 12641109 +1093966
199t threadfull  |102+2| 12849084  +207975
200t vrp |104+2| 13047253  +198169
204t dce |102-2| 12973989   -73264
206t sink| 98-4| 12959537   -14452
211t cddce   |102+4| 12973989   +14452
255t optimized   | 98-4| 12959537   -14452
258r into_cfglayout  | 97-1| 12960039 
259r jump| 98+1| 12960039 
262r cse1| 97-1| 12960039 
275r loop2_unroll| 99+2| 16090384 +3130345
312r pro_and_epilogue|119   +20| 16191103  +100720
323r bbro|118-1| 15877546  -313557

So 118 instead of 160 mismatches.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

PR middle-end/110590
* cfgloopmanip.cc (scale_loop_profile): Avoid scaling exits within
inner loops and be more careful about inconsistent profiles.
(duplicate_loop_body_to_header_edge): Fix profile update when eliminated
exit is followed by other exit.

gcc/testsuite/ChangeLog:

PR middle-end/110590
* gcc.dg/tree-prof/update-cunroll-2.c: Remove xfail.
* gcc.dg/tree-ssa/update-cunroll.c: Likewise.

diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index f56a9b87d1c..52732420787 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -580,13 +580,47 @@ scale_loop_profile (class loop *loop, profile_probability 
p,
 unadjusted_exit_count = exit_edge->count ();
   scale_loop_frequencies (loop, scale_prob);
 
-  if (exit_edge)
+  if (exit_edge && exit_edge->src->loop_father != loop)
+{
+  fprintf (dump_file,
+  ";; Loop exit is in inner loop;"
+  " will leave exit probabilities inconsistent\n");
+}
+  else if (exit_edge)
 {
   profile_count old_exit_count = exit_edge->count ();
   profile_probability new_probability;
   if (iteration_bound > 0)
-   new_probability
- = unadjusted_exit_count.probability_in (exit_edge->src->count);
+   {
+ /* It may happen that the source basic block of the exit edge is
+inside in-loop condition:
+
+   +-> header
+   ||
+   |   B1
+   |  /  \
+   | |   B2--exit_edge-->
+   |  \  /
+   |   B3
+   +__/
+
+ If B2 count is smaller than desired exit edge count
+ the profile was inconsistent with the newly discovered upper 
bound.
+ Probablity of edge B1->B2 is too low.  We do not attempt to fix
+ that (as it is hard in general) but we want to avoid dropping
+ count of edge B2->B3 to zero may confuse later optimizations.  */
+ if (unadjusted_exit_count.apply_scale (7, 8) > exit_edge->src->count)
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file,
+";; Source basic block of loop exit count is too 
small;"
+  

Cleanup force_edge_cold

2023-07-07 Thread Jan Hubicka via Gcc-patches
Hi,
we can use the new set_edge_probability_and_rescale_others here.

Bootstrapped/regtested x86_64-linux, committed.

Honza

gcc/ChangeLog:

* predict.cc (force_edge_cold): Use
set_edge_probability_and_rescale_others; improve dumps.

diff --git a/gcc/predict.cc b/gcc/predict.cc
index d65c26f0ab2..1aa4c25eb70 100644
--- a/gcc/predict.cc
+++ b/gcc/predict.cc
@@ -4421,21 +4421,16 @@ force_edge_cold (edge e, bool impossible)
  there.  */
   else if (prob_sum > profile_probability::never ())
 {
-  if (!(e->probability < goal))
-   e->probability = goal;
-
-  profile_probability prob_comp = prob_sum / e->probability.invert ();
-
   if (dump_file && (dump_flags & TDF_DETAILS))
-   fprintf (dump_file, "Making edge %i->%i %s by redistributing "
-"probability to other edges.\n",
-e->src->index, e->dest->index,
-impossible ? "impossible" : "cold");
-  FOR_EACH_EDGE (e2, ei, e->src->succs)
-   if (e2 != e)
- {
-   e2->probability /= prob_comp;
- }
+   {
+ fprintf (dump_file, "Making edge %i->%i %s by redistributing "
+  "probability to other edges. Original probability: ",
+  e->src->index, e->dest->index,
+  impossible ? "impossible" : "cold");
+ e->probability.dump (dump_file);
+ fprintf (dump_file, "\n");
+   }
+  set_edge_probability_and_rescale_others (e, goal);
   if (current_ir_type () != IR_GIMPLE
  && e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun))
update_br_prob_note (e->src);


Fix some profile consistency testcases

2023-07-07 Thread Jan Hubicka via Gcc-patches
Hi,
Information about profile mismatches has been printed only with -details-blocks
for some time.
I think it should be printed even by default to make it easier to spot when
someone introduces a new transform that breaks the profile, but I will send a
separate RFC for that.

This patch enables details in all testcases that grep for Invalid sum.  There
are 4 testcases which fail:
  gcc.dg/tree-ssa/loop-ch-profile-1.c
 here the problem is that loop header duplication introduces a loop invariant
 conditional that is later updated by tree-ssa-dom, but dom does not take care
 of updating the profile.
 Since loop-ch knows when it duplicates a loop invariant, we may be able to
 get this right.

 The test is still useful since it tests that right after ch the profile is
 consistent.
  gcc.dg/tree-prof/update-cunroll-2.c
 This is about the profile updating code in duplicate_loop_body_to_header_edge
 being wrong when the optimized-out exit is not the last one in the loop.  In
 that case the probability of the later exits needs to be accounted for.
 I will think about making this better - in general this does not seem to
 have an easy solution, but for the special case of chained tests we can
 definitely account for the later exits.
  gcc.dg/tree-ssa/update-unroll-1.c
 This fails after aprefetch-invoked unrolling.  I did not look into details
 yet.
  gcc.dg/tree-prof/update-unroll-2.c
 This one seems similar to the previous one.
I decided to xfail these tests, deal with them incrementally, and filed
PR110590.

Regtested x86_64-linux, will commit it shortly.
Honza
gcc/testsuite/ChangeLog:

* g++.dg/tree-prof/indir-call-prof.C: Add block-details to dump flags.
* gcc.dg/pr43864-2.c: Likewise.
* gcc.dg/pr43864-3.c: Likewise.
* gcc.dg/pr43864-4.c: Likewise.
* gcc.dg/pr43864.c: Likewise.
* gcc.dg/tree-prof/cold_partition_label.c: Likewise.
* gcc.dg/tree-prof/indir-call-prof.c: Likewise.
* gcc.dg/tree-prof/update-cunroll-2.c: Likewise.
* gcc.dg/tree-prof/update-tailcall.c: Likewise.
* gcc.dg/tree-prof/val-prof-1.c: Likewise.
* gcc.dg/tree-prof/val-prof-2.c: Likewise.
* gcc.dg/tree-prof/val-prof-3.c: Likewise.
* gcc.dg/tree-prof/val-prof-4.c: Likewise.
* gcc.dg/tree-prof/val-prof-5.c: Likewise.
* gcc.dg/tree-ssa/fnsplit-1.c: Likewise.
* gcc.dg/tree-ssa/loop-ch-profile-2.c: Likewise.
* gcc.dg/tree-ssa/update-threading.c: Likewise.
* gcc.dg/tree-ssa/update-unswitch-1.c: Likewise.
* gcc.dg/unroll-7.c: Likewise.
* gcc.dg/unroll-8.c: Likewise.
* gfortran.dg/pr25623-2.f90: Likewise.
* gfortran.dg/pr25623.f90: Likewise.
* gcc.dg/tree-ssa/loop-ch-profile-1.c: Likewise; xfail.
* gcc.dg/tree-ssa/update-cunroll.c: Likewise; xfail.
* gcc.dg/tree-ssa/update-unroll-1.c: Likewise; xfail.

diff --git a/gcc/testsuite/g++.dg/tree-prof/indir-call-prof.C 
b/gcc/testsuite/g++.dg/tree-prof/indir-call-prof.C
index b45417106d0..1f7404600ed 100644
--- a/gcc/testsuite/g++.dg/tree-prof/indir-call-prof.C
+++ b/gcc/testsuite/g++.dg/tree-prof/indir-call-prof.C
@@ -1,4 +1,4 @@
-/* { dg-options "-O2 -fdump-tree-optimized -fdump-ipa-profile-optimized 
-fdump-ipa-afdo-optimized" } */
+/* { dg-options "-O2 -fdump-tree-optimized-blocks-details 
-fdump-ipa-profile-optimized -fdump-ipa-afdo-optimized" } */
 
 struct A {
   A () {}
diff --git a/gcc/testsuite/gcc.dg/pr43864-2.c b/gcc/testsuite/gcc.dg/pr43864-2.c
index 6393144ccf7..102295ad4e8 100644
--- a/gcc/testsuite/gcc.dg/pr43864-2.c
+++ b/gcc/testsuite/gcc.dg/pr43864-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-tail-merge -fdump-tree-pre" } */
+/* { dg-options "-O2 -ftree-tail-merge -fdump-tree-pre-details-blocks" } */
 
 int
 f (int c, int b, int d)
diff --git a/gcc/testsuite/gcc.dg/pr43864-3.c b/gcc/testsuite/gcc.dg/pr43864-3.c
index 24b59a14b96..f70b8764cc8 100644
--- a/gcc/testsuite/gcc.dg/pr43864-3.c
+++ b/gcc/testsuite/gcc.dg/pr43864-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-tail-merge -fdump-tree-pre" } */
+/* { dg-options "-O2 -fdump-tree-pre-blocks-details" } */
 
 /* Commutative case.  */
 
diff --git a/gcc/testsuite/gcc.dg/pr43864-4.c b/gcc/testsuite/gcc.dg/pr43864-4.c
index 8a25b0fd8ef..3de71fccfa2 100644
--- a/gcc/testsuite/gcc.dg/pr43864-4.c
+++ b/gcc/testsuite/gcc.dg/pr43864-4.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-tail-merge -fdump-tree-pre" } */
+/* { dg-options "-O2 -ftree-tail-merge -fdump-tree-pre-details-blocks" } */
 
 /* Different stmt order.  */
 
diff --git a/gcc/testsuite/gcc.dg/pr43864.c b/gcc/testsuite/gcc.dg/pr43864.c
index ed69a737d3f..7bd1ba69281 100644
--- a/gcc/testsuite/gcc.dg/pr43864.c
+++ b/gcc/testsuite/gcc.dg/pr43864.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-tail-merge -fdump-tree-pre" } */
+/* { dg-options "-O2 -fdump-tree-pre-details-blocks" } */
 
 

Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-07 Thread Jan Hubicka via Gcc-patches
> Hi Both,
> 
> Thanks for all the reviews/patches so far 
> 
> > >
> > > Looks good, but I wonder what we can do to at least make the multiple
> > > exit case behave reasonably?  The vectorizer keeps track
> > 
> > > of a "canonical" exit, would it be possible to pass in the main exit
> > > edge and use that instead of single_exit (), would other exits then
> > > behave somewhat reasonable or would we totally screw things up here?
> > > That is, the "canonical" exit would be the counting exit while the
> > > other exits are on data driven conditions and thus wouldn't change
> > > probability when we reduce the number of iterations(?)
> > 
> > I can add canonical_exit parameter and make the function to direct flow to 
> > it if
> > possible.  However overall I think fixup depends on what transformation led 
> > to
> > the change.
> > 
> > Assuming that vectorizer did no prologues and apilogues and we vectorized
> > with factor N, then I think the update could be done more specifically as
> > follows.
> > 
> 
> If it helps, how this patch series addresses multiple exits by forcing a 
> scalar
> epilogue, all non canonical_exits would have been redirected to this scalar
> epilogue, so the remaining scalar iteration count will be at most VF.

It looks like the profile update after vectorization needs quite some TLC.
My student Ondrej Kubanek also implemented loop histogram profiling,
which gives a better idea of how commonly prologues/epilogues are needed,
and it would also be nice to handle that.
> > ;;   basic block 12, loop depth 0, count 10737416 (estimated locally), maybe
> > hot
> > ;;prev block 9, next block 13, flags: (NEW, VISITED)
> > ;;pred:   8 [50.0% (adjusted)]  count:10737418 (estimated locally)
> > (FALSE_VALUE,EXECUTABLE)
> > ;;succ:   13 [always]  count:10737416 (estimated locally) (FALLTHRU)
> > 
> > ;;   basic block 13, loop depth 1, count 1063004409 (estimated locally),
> > maybe hot
> > ;;prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
> > ;;pred:   14 [always]  count:1052266996 (estimated locally)
> > (FALLTHRU,DFS_BACK,EXECUTABLE)
> > ;;12 [always]  count:10737416 (estimated locally) (FALLTHRU)
> >   # i_30 = PHI 
> >   # ivtmp_32 = PHI 
> >   _33 = a[i_30];
> >   _34 = _33 + 1;
> >   a[i_30] = _34;
> >   i_36 = i_30 + 1;
> >   ivtmp_37 = ivtmp_32 - 1;
> >   if (ivtmp_37 != 0)
> > goto ; [98.99%]
> >   else
> > goto ; [1.01%]

Actually it seems that the scalar epilogue loop keeps the original profile
(predicted to iterate 99 times), which is quite wrong.
Looking at the statistics for yesterday's patch, on tramp3d we got an 86%
reduction in cumulative profile mismatches after the whole optimization
pipeline.  More interestingly, however, the overall time estimate
dropped by 18%, so it seems that the profile adjustments done by cunroll
are affecting the profile a lot.

I think the fact that the iteration counts of epilogues are not capped is one
of the main problems.

We seem to call scale_loop_profile 3 times:

   scale_loop_profile (loop, prob_vector, -1);

This seems to account for the probability that control flow is
redirected to the prologue/epilogue later.  So it only scales down the profile
but is not responsible for capping the iteration count.

   scale_loop_profile (prolog, prob_prolog, bound_prolog - 1);

This one does the prologue and sets the bound.

   scale_loop_profile (epilog, prob_epilog, -1);

This scales the epilogue but does not set the bound at all.
I think the information is available since we update the loop_info
datastructures.

Honza


Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-07 Thread Jan Hubicka via Gcc-patches
> 
> Looks good, but I wonder what we can do to at least make the
> multiple exit case behave reasonably?  The vectorizer keeps track

> of a "canonical" exit, would it be possible to pass in the main
> exit edge and use that instead of single_exit (), would other
> exits then behave somewhat reasonable or would we totally screw
> things up here?  That is, the "canonical" exit would be the
> counting exit while the other exits are on data driven conditions
> and thus wouldn't change probability when we reduce the number
> of iterations(?)

I can add a canonical_exit parameter and make the function direct flow
to it if possible.  However, overall I think the fixup depends on what
transformation led to the change.

Assuming that the vectorizer did no prologues and epilogues and we
vectorized with factor N, then I think the update could be done more
specifically as follows.

We know that the header block count dropped by 4 (the vectorization factor in
this example).  So we can start from that, and each time we reach a basic block
with an exit edge, we know the original count of the edge.  This count is
unchanged, so one can rescale the probabilities out of that BB accordingly.  If
the loop has no inner loops, we can just walk the body in RPO, propagate the
scales downwards, and we should arrive at the right result.
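
A minimal sketch of the per-block rule (my illustration in plain doubles, not
the profile_count/profile_probability API, and ignoring rounding and capping):
the exit edge keeps its absolute count while its source block's count was
scaled down, so its probability grows correspondingly and the in-loop edge
gets the rest.

  /* EXIT_COUNT is the unchanged absolute count of the exit edge;
     NEW_BB_COUNT is the already rescaled count of its source block.
     Returns the new probability of the exit edge; 1 - result goes to
     the edge that stays in the loop.  */
  static double
  rescaled_exit_probability (double exit_count, double new_bb_count)
  {
    if (new_bb_count <= 0.0)
      return 1.0;
    double p = exit_count / new_bb_count;
    return p > 1.0 ? 1.0 : p;
  }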

I originally added the bound parameter to handle prologues/epilogues,
which get a new artificial bound.  In the prologue I think you are right that
the flow will probably be directed to the conditional counting
iterations.

In the epilogue we add no artificial iteration cap, so maybe it is more
realistic to simply scale up the probability of all exits?

To see what is going on I tried the following testcase:

int a[99];
test()
{
  for (int i = 0; i < 99; i++)
  a[i]++;
}

What surprises me is that the vectorizer at -O2 does nothing and we end up
unrolling the loop:

L2:
addl$1, (%rax)
addl$1, 4(%rax)
addl$1, 8(%rax)
addq$12, %rax
cmpq$a+396, %rax

Which seems a silly thing to do.  A vectorized loop with an epilogue doing 2 and
1 additions would be better.

With -O3 we vectorize it:


.L2:
movdqa  (%rax), %xmm0
addq$16, %rax
paddd   %xmm1, %xmm0
movaps  %xmm0, -16(%rax)
cmpq%rax, %rdx
jne .L2
movqa+384(%rip), %xmm0
addl$1, a+392(%rip)
movq.LC1(%rip), %xmm1
paddd   %xmm1, %xmm0
movq%xmm0, a+384(%rip)


and correctly drop the vectorized loop body to 24 iterations.  However the
epilogue has a loop for vector size 2 predicted to iterate once (it won't).

;;   basic block 7, loop depth 0, count 10737416 (estimated locally), maybe hot 
;;prev block 5, next block 8, flags: (NEW, VISITED) 
;;pred:   3 [4.0% (adjusted)]  count:10737416 (estimated locally) 
(FALSE_VALUE,EXECUTABLE)
;;succ:   8 [always]  count:10737416 (estimated locally) 
(FALLTHRU,EXECUTABLE)

;;   basic block 8, loop depth 1, count 21474835 (estimated locally), maybe hot 
;;prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)  
;;pred:   9 [always]  count:10737417 (estimated locally) 
(FALLTHRU,DFS_BACK,EXECUTABLE)
;;7 [always]  count:10737416 (estimated locally) 
(FALLTHRU,EXECUTABLE)
  # i_9 = PHI   
  # ivtmp_13 = PHI   
  # vectp_a.14_40 = PHI  [(void *) + 
384B](7)>
  # vectp_a.18_46 = PHI  [(void *) + 
384B](7)>
  # ivtmp_49 = PHI   
  vect__14.16_42 = MEM  [(int *)vectp_a.14_40];  
  _14 = a[i_9]; 
  vect__15.17_44 = vect__14.16_42 + { 1, 1 };   
  _15 = _14 + 1;
  MEM  [(int *)vectp_a.18_46] = vect__15.17_44;  
  i_17 = i_9 + 1;   
  ivtmp_18 = ivtmp_13 - 1;  
  vectp_a.14_41 = vectp_a.14_40 + 8;
  vectp_a.18_47 = vectp_a.18_46 + 8;
  ivtmp_50 = ivtmp_49 + 1;  
  if (ivtmp_50 < 1) 
goto ; [50.00%]   
  else  
goto ; [50.00%]  

and finally the scalar copy

;;   basic block 12, loop depth 0, count 10737416 (estimated locally), maybe hot
;;prev block 9, next block 13, flags: (NEW, VISITED)
;;pred:   8 [50.0% (adjusted)]  count:10737418 (estimated locally) 

Fix profile update after loop-ch and cunroll

2023-07-06 Thread Jan Hubicka via Gcc-patches
Hi,
this patch makes loop-ch and loop unrolling fix the profile in case the loop is
known to not iterate at all (or to iterate only a few times) while the profile
claims it iterates more.  While this is kind of a symptomatic fix, it is the best
we can do in case the profile was originally estimated incorrectly.

In the testcase the problematic loop is produced by the vectorizer, and I think
the vectorizer should know and account in its costs that the vectorized loop
and/or epilogue is not going to loop after the transformation.  So it would be
nice to fix it on that side, too.

The patch avoids about half of profile mismatches caused by cunroll.

Pass dump id and name|static mismatch|dynamic mismatch
 |in count |in count
107t cunrolli|  3+3|17251   +17251
115t threadfull  |  3  |14376-2875
116t vrp |  5+2|30908   +16532
117t dse |  5  |30908
118t dce |  3-2|17251   -13657
127t ch  | 13   +10|17251
131t dom | 39   +26|17251
133t isolate-paths   | 47+8|17251
134t reassoc | 49+2|17251
136t forwprop| 53+4|   202501  +185250
159t cddce   | 61+8|   216211   +13710
161t ldist   | 62+1|   216211
172t ifcvt   | 66+4|   373711  +157500
173t vect|143   +77|  9802097 +9428386
176t cunroll |221   +78| 15639591 +5837494
183t loopdone|218-3| 15577640   -61951
195t fre |214-4| 15577640
197t dom |213-1| 16671606 +1093966
199t threadfull  |215+2| 16879581  +207975
200t vrp |217+2| 17077750  +198169
204t dce |215-2| 17004486   -73264
206t sink|213-2| 17004486
211t cddce   |219+6| 17005926+1440
255t optimized   |217-2| 17005926
256r expand  |210-7| 19571573 +2565647
258r into_cfglayout  |208-2| 19571573
275r loop2_unroll|212+4| 22992432 +3420859
291r ce2 |210-2| 23011838
312r pro_and_epilogue|230   +20| 23073776   +61938
315r jump2   |236+6| 27110534 +4036758
323r bbro|229-7| 21826835 -5283699


W/o the patch cunroll does:

176t cunroll |294  +151|126548439   +116746342

and we end up with 291 mismatches at bbro.

Bootstrapped/regtested x86_64-linux. Plan to commit it after the 
scale_loop_frequency patch.

gcc/ChangeLog:

PR middle-end/25623
* tree-ssa-loop-ch.cc (ch_base::copy_headers): Scale loop frequency to 
maximal number
of iterations determined.
* tree-ssa-loop-ivcanon.cc (try_unroll_loop_completely): Likewise.

gcc/testsuite/ChangeLog:

PR middle-end/25623
* gfortran.dg/pr25623-2.f90: New test.

diff --git a/gcc/testsuite/gfortran.dg/pr25623-2.f90 
b/gcc/testsuite/gfortran.dg/pr25623-2.f90
new file mode 100644
index 000..57679e0d6ed
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr25623-2.f90
@@ -0,0 +1,19 @@
+! { dg-do compile }
+! { dg-options "-fdump-tree-optimized-blocks -O3" }
+
+SUBROUTINE S42(a,b,c,N)
+ IMPLICIT NONE
+ integer :: N
+ real*8  :: a(N),b(N),c(N),tmp,tmp2,tmp4
+ real*8, parameter :: p=1.0D0/3.0D0
+ integer :: i
+ c=0.0D0
+ DO i=1,N
+   tmp=a(i)**p ! could even be done with a cube root
+   tmp2=tmp*tmp
+   tmp4=tmp2*tmp2
+   b(i)=b(i)+tmp4
+   c(i)=c(i)+tmp2
+ ENDDO
+END SUBROUTINE
+! { dg-final { scan-tree-dump-not "Invalid sum" "optimized" } }
diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index 291f2dbcab9..72792cec21f 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -422,6 +422,7 @@ ch_base::copy_headers (function *fun)
{
  if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file, "Loop %d never loops.\n", loop->num);
+ scale_loop_profile (loop, profile_probability::always (), 0);
  loops_to_unloop.safe_push (loop);
  loops_to_unloop_nunroll.safe_push (0);
  continue;
@@ -666,6 +667,7 @@ ch_base::copy_headers (function *fun)
{
  if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file, "Loop %d no longer loops.\n", loop->num);
+ scale_loop_profile (loop, profile_probability::always (), 0);
  loops_to_unloop.safe_push (loop);
  

Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-06 Thread Jan Hubicka via Gcc-patches
Hi,
The original scale_loop_profile was implemented to only handle very simple loops
produced by the vectorizer at that time (basically loops with only one exit and no
subloops).  It also has not been updated to the new profile-count API very carefully.
Since I want to use it from loop peeling and unlooping, I need the
function to at least not make the profile worse on general loops.

The function does two things
 1) scales down the loop profile by a given probability.
This is useful, for example, to scale down the profile after peeling, when the loop
body is executed less often than before
 2) after scaling is done, and if the profile indicates a too large iteration
count, updates the profile to cap the iteration count by the ITERATION_BOUND parameter.

Step 1 is easy and unchanged.

I changed ITERATION_BOUND to be the actual bound on the number of iterations as
used elsewhere (i.e. the number of executions of the latch edge) rather than the
number of iterations + 1 as it was before.

To do 2) one needs to do the following
  a) scale the loop's own profile so the frequency of the header is at most
     the sum of in-edge counts * (iteration_bound + 1)
  b) update the loop exit probabilities so their count is the same
     as before scaling.
  c) reduce the frequencies of basic blocks after the loop exit

The old code did b) by setting the probability to 1 / iteration_bound, which is
correct only if the basic block containing the exit executes precisely once per
iteration (i.e. it is not inside another conditional or an inner loop).  This is
fixed now by using set_edge_probability_and_rescale_others.

Also, c) was implemented only for the special case when the exit was just before
the latch basic block.  I now use dominance info to get some of the additional
cases right.
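
Roughly, reading a) and b) above as formulas (my paraphrase, not code from the
patch): with count_in the sum of the counts on the edges entering the header and
B the new iteration bound,

  scale                = count_in * (B + 1) / old_header_count
  new_exit_probability = old_exit_count / (scale * old_exit_src_count)

so the header runs at most B + 1 times per invocation and the exit edge keeps
its absolute count.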

I still did not try to do anything for multiple exit loops, though the
implementation could be generalized.

Bootstrapped/regtested x86_64-linux.  Plan to commit it tonight if there
are no complaints.

gcc/ChangeLog:

* cfgloopmanip.cc (scale_loop_profile): Rewrite exit edge
probability update to be safe on loops with subloops.
Make bound parameter to be iteration bound.
* tree-ssa-loop-ivcanon.cc (try_peel_loop): Update call
of scale_loop_profile.
* tree-vect-loop-manip.cc (vect_do_peeling): Likewise.

diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index 6e09dcbb0b1..524b979a546 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -499,7 +499,7 @@ scale_loop_frequencies (class loop *loop, 
profile_probability p)
 }
 
 /* Scale profile in LOOP by P.
-   If ITERATION_BOUND is non-zero, scale even further if loop is predicted
+   If ITERATION_BOUND is not -1, scale even further if loop is predicted
to iterate too many times.
Before caling this function, preheader block profile should be already
scaled to final count.  This is necessary because loop iterations are
@@ -510,106 +510,123 @@ void
 scale_loop_profile (class loop *loop, profile_probability p,
gcov_type iteration_bound)
 {
-  edge e, preheader_e;
-  edge_iterator ei;
-
-  if (dump_file && (dump_flags & TDF_DETAILS))
+  if (!(p == profile_probability::always ()))
 {
-  fprintf (dump_file, ";; Scaling loop %i with scale ",
-  loop->num);
-  p.dump (dump_file);
-  fprintf (dump_file, " bounding iterations to %i\n",
-  (int)iteration_bound);
-}
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   {
+ fprintf (dump_file, ";; Scaling loop %i with scale ",
+  loop->num);
+ p.dump (dump_file);
+ fprintf (dump_file, "\n");
+   }
 
-  /* Scale the probabilities.  */
-  scale_loop_frequencies (loop, p);
+  /* Scale the probabilities.  */
+  scale_loop_frequencies (loop, p);
+}
 
-  if (iteration_bound == 0)
+  if (iteration_bound == -1)
 return;
 
   gcov_type iterations = expected_loop_iterations_unbounded (loop, NULL, true);
+  if (iterations == -1)
+return;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
 {
-  fprintf (dump_file, ";; guessed iterations after scaling %i\n",
-  (int)iterations);
+  fprintf (dump_file,
+  ";; guessed iterations of loop %i:%i new upper bound %i:\n",
+  loop->num,
+  (int)iterations,
+  (int)iteration_bound);
 }
 
   /* See if loop is predicted to iterate too many times.  */
   if (iterations <= iteration_bound)
 return;
 
-  preheader_e = loop_preheader_edge (loop);
-
-  /* We could handle also loops without preheaders, but bounding is
- currently used only by optimizers that have preheaders constructed.  */
-  gcc_checking_assert (preheader_e);
-  profile_count count_in = preheader_e->count ();
+  /* Compute number of invocations of the loop.  */
+  profile_count count_in = profile_count::zero ();
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, loop->header->preds)
+count_in += e->count ();
 
-  if (count_in > profile_count::zero ()
-  && loop->header->count.initialized_p 

update_bb_profile_for_threading TLC

2023-07-06 Thread Jan Hubicka via Gcc-patches
Hi,
this patch applies some TLC to update_bb_profile_for_threading.  The function
rescales probabilities by:
   FOR_EACH_EDGE (c, ei, bb->succs)
c->probability /= prob;
which is correct, but in case prob is 0 (the newly constructed path took all
the execution counts), this leads to undefined results which do not sum to 100%.

In several other places we need to change the probability of one edge and rescale
the remaining ones to sum to 100%, so I decided to break this off into a helper
function, set_edge_probability_and_rescale_others.

For jump threading the probability of the edge is always reduced, so division is
the right update; however, in the general case we may also want to increase the
probability of the edge, which needs different scaling.  This is a bit hard to do
while staying with probabilities in the range 0...1 for all temporaries.

For this reason I decided to add profile_probability::apply_scale, which is
symmetric to what we already have in profile_count::apply_scale and does the
right thing in both directions.
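
As a small worked example of the rescaling (numbers picked for illustration
only): if a block has successor probabilities {50%, 30%, 20%} and the first
edge is set to 90%, the remaining edges are scaled by
(100% - 90%) / (100% - 50%) = 1/5, giving {90%, 6%, 4%}, so the outgoing
probabilities still sum to 100% whether the chosen edge grew or shrank.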

Finally, I added a few early exits so we do not produce confusing dumps when the
profile is missing, and special-cased the common situation where there are
precisely two edges out of the BB.  In this case we can set the other edge to the
inverted probability and not try to scale (which would drop the probability
quality from PRECISE to ADJUSTED).

Bootstrapped/regtested x86_64-linux.  The patch has no effect on the in-count
mismatches in the tramp3d build and improves the out-count.  Will commit it
shortly.

gcc/ChangeLog:

* cfg.cc (set_edge_probability_and_rescale_others): New function.
(update_bb_profile_for_threading): Use it; simplify the rest.
* cfg.h (set_edge_probability_and_rescale_others): Declare.
* profile-count.h (profile_probability::apply_scale): New.

diff --git a/gcc/cfg.cc b/gcc/cfg.cc
index 57b40110960..740d4f3581d 100644
--- a/gcc/cfg.cc
+++ b/gcc/cfg.cc
@@ -901,6 +901,67 @@ brief_dump_cfg (FILE *file, dump_flags_t flags)
 }
 }
 
+/* Set probability of E to NEW_PROB and rescale other edges
+   from E->src so their sum remains the same.  */
+
+void
+set_edge_probability_and_rescale_others (edge e, profile_probability new_prob)
+{
+  edge e2;
+  edge_iterator ei;
+  if (e->probability == new_prob)
+return;
+  /* If we made E unconditional, drop other frequencies to 0.  */
+  if (new_prob == profile_probability::always ())
+{
+  FOR_EACH_EDGE (e2, ei, e->src->succs)
+   if (e2 != e)
+ e2->probability = profile_probability::never ();
+}
+  else
+{
+  int n = 0;
+  edge other_e = NULL;
+
+  /* See how many other edges are leaving exit_edge->src.  */
+  FOR_EACH_EDGE (e2, ei, e->src->succs)
+   if (e2 != e && !(e2->flags & EDGE_FAKE))
+ {
+   other_e = e2;
+   n++;
+ }
+  /* If there is only one other edge with non-zero probability we do not
+need to scale which drops quality of profile from precise
+to adjusted.  */
+  if (n == 1)
+   other_e->probability = new_prob.invert ();
+  /* Nothing to do if there are no other edges.  */
+  else if (!n)
+   ;
+  /* Do scaling if possible.  */
+  else if (e->probability.invert ().nonzero_p ())
+   {
+ profile_probability num = new_prob.invert (),
+ den = e->probability.invert ();
+ FOR_EACH_EDGE (e2, ei, e->src->succs)
+   if (e2 != e && !(e2->flags & EDGE_FAKE))
+ e2->probability = e2->probability.apply_scale (num, den);
+   }
+  else
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file,
+";; probability of edge %i->%i set reduced from 1."
+" The remaining edges are left inconsistent.\n",
+e->src->index, e->dest->index);
+ FOR_EACH_EDGE (e2, ei, e->src->succs)
+   if (e2 != e && !(e2->flags & EDGE_FAKE))
+ e2->probability = new_prob.invert ().guessed () / n;
+   }
+}
+  e->probability = new_prob;
+}
+
 /* An edge originally destinating BB of COUNT has been proved to
leave the block by TAKEN_EDGE.  Update profile of BB such that edge E can be
redirected to destination of TAKEN_EDGE.
@@ -912,62 +973,57 @@ void
 update_bb_profile_for_threading (basic_block bb, 
 profile_count count, edge taken_edge)
 {
-  edge c;
-  profile_probability prob;
-  edge_iterator ei;
+  gcc_assert (bb == taken_edge->src);
+
+  /* If there is no profile or the threaded path is never executed
+ we don't need to upate.  */
+  if (!bb->count.initialized_p ()
+  || count == profile_count::zero ())
+return;
 
   if (bb->count < count)
 {
   if (dump_file)
fprintf (dump_file, "bb %i count became negative after threading",
 bb->index);
+  /* If probabilities looks very off, scale down and reduce to guesses
+to avoid dropping the other path close to zero.  */
+  if 

Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-04 Thread Jan Hubicka via Gcc-patches
> On Wed, 28 Jun 2023, Tamar Christina wrote:
> 
> > Hi All,
> > 
> > There's an existing bug in loop frequency scaling where the if statement 
> > checks
> > to see if there's a single exit, and records an dump file note but then
> > continues.
> > 
> > It then tries to access the null pointer, which of course fails.
> > 
> > For multiple loop exists it's not really clear how to scale the exit
> > probablities as it's really unknown which exit is most probably.
> > 
> > For that reason I ignore the exit edges during scaling but still adjust the
> > loop body.
> > 
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > 
> > Ok for master?
> 
> I can't really make sense of
> 
>   /* If latch exists, change its count, since we changed
>  probability of exit.  Theoretically we should update everything 
> from
>  source of exit edge to latch, but for vectorizer this is enough.  
> */
>   if (loop->latch && loop->latch != e->src)
> loop->latch->count += count_delta;
> 
> since with simple latches the latch itself is an empty forwarder and
> e->src is the block with the conditional eventually exiting the block.
> That means this condition is always true.
> 
> So I think for exits the idea is to "remove" them by redirecting
> them "count-wise" back into the loop.  So turn
> 
>if (cond) --(exit-count)-- exit
>  |
>  | in-loop-count
>  |
>latch
> 
> into
> 
>[cond-blk-count]
>if (cond) -- (zero count) -- exit
>  |
>  | in-loop-cound + exit-count (== cond-blk-count)
>  |
>latch (now with cond-blk-count)

This is the opposite situation.  You have a loop predicted to iterate 10 times,
but you found it actually iterates at most twice.  So you want to
 1) scale down the profile of every BB in the loop
    so the header is 2*sum_of_counts_of_entry_edges
    instead of 10*
 2) reduce the probability of the loopback edge and instead increase the
    probability of the exit.

The code attempts to get right only the case where the loop has one exit
and instead of
  if (cond) -- (original-wrong-exit-probability) -- exit
it does
  if (cond) -- (exit-probability=1/#iterations) -- exit
Now it should adjust the in-loop count for every path from the source of the
exit to the latch edge.  It just assumes that there is one basic block that is
the latch and does it there.
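
A purely illustrative numeric sketch of that update (not taken from the
patch):

  /* Loop entered 100 times; the stale profile says 10 iterations per
     entry, so header->count == 1000 and the exit edge probability is
     1/10.  After proving at most 2 iterations we want:
       - every BB count in the body scaled by 2/10 (header becomes 200),
       - the exit edge probability raised to 1/2, so that
         header->count * exit_probability still equals the 100 exits.  */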

I was just looking into using this for the profile update when loop-ch or
complete unrolling proves that the loop iterates fewer times than the
profile says.  I can clean up the function - it was originally written for the
old representation of probabilities and counts and I did not do a very
good job of updating it to the new code.

Honza
> 
> and the comment correctly suggests all blocks following from here
> would need similar adjustment (and on in-loop branches the delta would be
> distributed according to probabilities).
> 
> Given the code is quite imperfect I would suggest to change the
> updating of the latch block count to read
> 
>   profile_count count_delta = profile_count::zero ();
>   if (loop->latch
>   && single_pred_p (loop->latch)
>   && loop_exits_from_bb_p (single_pred (loop->latch)))
> {
>   count_delta = single_pred (loop->latch)->count - loop->latch->count;
>   loop->latch->count = single_pred (loop->latch)->count;
> }
> 
>scale_loop_frequencies (loop, p);
> 
>   if (count_delta != 0)
> loop->latch->count -= count_delta;
> 
> which should exactly preserve the exit-before-latch behavior independent
> on the number of exits of the loop.
> 
> Please leave Honza a chance to comment here.
> 
> Thanks,
> Richard.
> 
> 
> > Thanks,
> > Tamar
> > 
> > gcc/ChangeLog:
> > 
> > * cfgloopmanip.cc (scale_loop_frequencies): Fix typo.
> > (scale_loop_profile): Don't access null pointer.
> > 
> > --- inline copy of patch -- 
> > diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
> > index 
> > 6e09dcbb0b1864bc64ffd570a4b923f50c3819b5..b10ef3d2be82902ccd74e52a4318217b2db13bcb
> >  100644
> > --- a/gcc/cfgloopmanip.cc
> > +++ b/gcc/cfgloopmanip.cc
> > @@ -501,7 +501,7 @@ scale_loop_frequencies (class loop *loop, 
> > profile_probability p)
> >  /* Scale profile in LOOP by P.
> > If ITERATION_BOUND is non-zero, scale even further if loop is predicted
> > to iterate too many times.
> > -   Before caling this function, preheader block profile should be already
> > +   Before calling this function, preheader block profile should be already
> > scaled to final count.  This is necessary because loop iterations are
> > determined by comparing header edge count to latch ege count and thus
> > they need to be scaled synchronously.  */
> > @@ -597,14 +597,14 @@ scale_loop_profile (class loop *loop, 
> > profile_probability p,
> >/* If latch exists, change its count, since we changed
> >  probability of exit.  Theoretically we should update everything from
> >  source of exit edge to latch, but for vectorizer this is enough.  */
> > -  if (loop->latch && loop->latch != 

Re: [PATCH] value-prof.cc: Correct edge prob calculation.

2023-07-04 Thread Jan Hubicka via Gcc-patches
> The mod-subtract optimization with ncounts==1 produced incorrect edge
> probabilities due to incorrect conditional probability calculation. This
> patch fixes the calculation.
> 
> gcc/ChangeLog:
> 
>   * value-prof.cc (gimple_mod_subtract_transform): Correct edge
> prob calculation.

OK,
thanks!
Honza


Fix profile updates in copy-header

2023-07-01 Thread Jan Hubicka via Gcc-patches
Hi,
the most common source of profile mismatches is now the copy-header pass.  The
reason is that in the common case the duplicated header condition will become
constant true, and that requires changes to the loop exit condition probability.

While this could be done by jump threading, it is not, since jump threading
gives up on loops.  The copy-header pass now has logic to prove that the first
exit will become true, so this patch adds the necessary plumbing to the profile
updating.
This is done in gimple_duplicate_sese_region in a way that is specific to this
particular case.  I think the general case is kind of unsolvable and loop-ch is
the only user of the infrastructure.  If we later invent some new users, maybe
we can export the region and region_copy arrays and let the user do the update.
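
A minimal illustrative example (not from the testsuite) of the situation
being handled:

/* After loop-ch duplicates the header test, the copy on the preheader
   path is dominated by the n <= 0 early return, so 0 < n is statically
   true there and the duplicated exit edge should get probability 0;
   the in-loop copy of the test keeps its usual exit probability.  */
void
clear (int *a, int n)
{
  if (n <= 0)
    return;
  for (int i = 0; i < n; i++)
    a[i] = 0;
}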

With the patch we now get:

Pass dump id and name|static mismatch|dynamic mismatch
                     |in count       |in count
107t cunrolli|  3+3|19237   +19237
127t ch  | 13   +10|19237 
131t dom | 39   +26|19237 
133t isolate-paths   | 47+8|19237 
134t reassoc | 49+2|19237 
136t forwprop| 53+4|   226943  +207706
159t cddce   | 61+8|   24   +15279
161t ldist   | 62+1|   24 
172t ifcvt   | 66+4|   415472  +173250
173t vect|143   +77| 10859784+10444312
176t cunroll |294  +151|150357763   +139497979
183t loopdone|291-3|150289533   -68230
194t tracer  |322   +31|153230990 +2941457
195t fre |317-5|153230990 
197t dom |286   -31|154448079 +1217089
199t threadfull  |293+7|154724763  +276684
200t vrp |297+4|155042448  +317685
204t dce |294-3|155017073   -25375
206t sink|292-2|155017073 
211t cddce   |298+6|155018657+1584
255t optimized   |296-2|155018657 
256r expand  |273   -23|154592622  -426035
258r into_cfglayout  |268-5|154592661  +39
275r loop2_unroll|272+4|159701866 +5109205
291r ce2 |270-2|159723509 
312r pro_and_epilogue|290   +20|159792505   +68996
315r jump2   |296+6|164234016 +4441511
323r bbro|294-2|159385430 -4848586

So ch now introduces 10 new mismatches while originally it introduced 308.  At
bbro the number of mismatches dropped from 432 to 294.
The biggest offender is now the cunroll pass.  I think it is the case where a
loop has multiple exits and one of the exits becomes false in all but the last
peeled iteration.

This is another case where a non-trivial loop update is needed.
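
A minimal illustrative example (not from the testsuite) of such a loop:

  /* After cunroll completely peels the loop for a known n, the early
     exit i == n - 1 is false in every peeled copy except the last one,
     so its probability in those copies should drop to 0.  */
  for (int i = 0; i < n; i++)
    {
      if (i == n - 1)
        break;
      work (i);   /* work is a placeholder */
    }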

Honza

gcc/ChangeLog:

* tree-cfg.cc (gimple_duplicate_sese_region): Add elliminated_edge
parmaeter; update profile.
* tree-cfg.h (gimple_duplicate_sese_region): Update prototype.
* tree-ssa-loop-ch.cc (entry_loop_condition_is_static): Rename to ...
(static_loop_exit): ... this; return the edge to be elliminated.
(ch_base::copy_headers): Handle profile updating for eliminated exits.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/ifc-20040816-1.c: Reduce number of mismatches
from 2 to 1.
* gcc.dg/tree-ssa/loop-ch-profile-1.c: New test.
* gcc.dg/tree-ssa/loop-ch-profile-2.c: New test.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c
index f8a6495cbaa..b55a533e374 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c
@@ -39,4 +39,4 @@ int main1 ()
which is folded by vectorizer.  Both outgoing edges must have probability
100% so the resulting profile match after folding.  */
 /* { dg-final { scan-tree-dump-times "Invalid sum of outgoing probabilities 
200.0" 1 "ifcvt" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum of incoming counts" 2 
"ifcvt" } } */
+/* { dg-final { scan-tree-dump-times "Invalid sum of incoming counts" 1 
"ifcvt" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-ch-profile-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/loop-ch-profile-1.c
new file mode 100644
index 000..e8bab62b0d9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-ch-profile-1.c
@@ 

Fix update_bb_profile_for_threading

2023-07-01 Thread Jan Hubicka via Gcc-patches
Hi,
this patch fixes some of the profile mismatches caused by profile updating.
It seems that I misupdated update_bb_profile_for_threading in 2017, which
results in invalid updates from rtl threading and threadbackwards.
update_bb_profile_for_threading knows that some paths to BB are being
redirected elsewhere and those paths will exit from BB with E.  So it needs to
determine the probability of the duplicated path and redistribute probabilities.
For some reason, however, the conditional probability of the redirected path is
computed after its count is subtracted, which is wrong and often results in a
probability greater than 100%.
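
A purely illustrative numeric example of the ordering problem (not from the
patch):

  /* BB executed 100 times; the threaded path takes 60 of those
     executions out through edge E.  The conditional probability of the
     duplicated path must be computed from the original count:
       prob = 60 / 100 = 60%.
     Computing it only after bb->count -= count has been done yields
       60 / 40 = 150%,
     i.e. a probability greater than 100%.  */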

I also fixed the error message.  Compiling tramp3d I now get the following
passes producing mismatches:
Pass dump id and name|static mismatch|dynamic mismatch
                     |in count       |in count
113t fre |  2+2|0 
114t mergephi|  2  |0 
115t threadfull  |  2  |0 
116t vrp |  2  |0 
127t ch  |307  +305|347194302   +347194302
130t thread  |313+6|347221478   +27176
131t dom |321+8|346841121  -380357
134t reassoc |323+2|346841121 
136t forwprop|327+4|347026371  +185250
144t pre |326-1|347040926   +14555
172t ifcvt   |338+2|347218249  +156280
173t vect|409   +71|356357418 +9139169
176t cunroll |377   -32|126071925   -230285493
183t loopdone|376-1|126015489   -56436
194t tracer  |379+3|127258199 +1242710
197t dom |375-4|128352165 +1093966
199t threadfull  |379+4|128526112  +173947
200t vrp |381+2|128724673  +198561
204t dce |374-7|128632495   -92178
206t sink|370-4|128618043   -14452
211t cddce   |372+2|128632495   +14452
248t ehcleanup   |370-2|128618755   -13740
255t optimized   |362-8|128576810   -41945
256r expand  |356-6|128899768  +322958
258r into_cfglayout  |353-3|129051765  +151997
259r jump|354+1|129051765 
262r cse1|353-1|129051765 
275r loop2_unroll|355+2|132182110 +3130345
277r loop2_done  |354-1|132182109   -1
312r pro_and_epilogue|371   +17|13324   +40215
323r bbro|375+4|132095926  -126398

Without the patch, at jump2 time we get over 432 mismatches, so this is a 15%
improvement.  Some of the mismatches are unavoidable.

I think the ch mismatches are mostly due to loop header copying where the
header condition constant-propagates.  The most common case should be
threadable in early optimizations, and we could also do better on profile
updating here.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

PR tree-optimization/103680
* cfg.cc (update_bb_profile_for_threading): Fix profile update;
make message clearer.

gcc/testsuite/ChangeLog:

PR tree-optimization/103680
* gcc.dg/tree-ssa/pr103680.c: New test.
* gcc.dg/tree-prof/cmpsf-1.c: Un-xfail.

Fix predictions of conditionals with __builtin_expect

2023-06-30 Thread Jan Hubicka via Gcc-patches
Hi,
while looking into the std::vector _M_realloc_insert codegen I noticed that the
call to __throw_bad_alloc is predicted with 10% probability.  This is because
the conditional guarding it has __builtin_expect (cond, 0) on it.  This
incorrectly takes precedence over the more reliable heuristic predicting that a
call to a cold noreturn function is likely not going to happen.

So I reordered the predictors so that __builtin_expect_with_probability comes
first, after the predictors that never make a mistake (so the user can use it
to always specify the outcome by hand).  I also downgraded the malloc predictor
since I do think user-defined malloc functions & new operators may behave in
funny ways, and moved the usual __builtin_expect after the noreturn-cold
predictor.
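
A minimal illustrative example of the pattern in question (not taken from
libstdc++ verbatim):

  /* __throw_bad_alloc is a cold noreturn function, so the noreturn-cold
     predictor should decide this branch rather than the generic 10%
     outcome attached to __builtin_expect.  */
  if (__builtin_expect (__n > __max_size, 0))
    std::__throw_bad_alloc ();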

This triggered a latent bug in expr_expected_value_1 where

  if (*predictor < predictor2)
*predictor = predictor2;

should be:

  if (predictor2 < *predictor)
*predictor = predictor2;

which eventually triggered an ICE on combining heuristics.  This made me notice
that we can do slightly better while combining expected values in the case
where only one of the operands (such as in a*b when we expect a==0) can
determine the overall result.

Note that the new code may pick the weaker heuristic in case both values are
predicted.  I am not sure this scenario is worth the extra CPU time: there is
no correct way to combine the probabilities anyway, since we do not know if
the predictions are independent, so I think users should not rely on it.
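
A minimal illustrative example of the single-operand case (not from the
patch):

  /* a is expected to be 0, so a * b is expected to be 0 no matter what
     b is: the expected value of one operand alone determines the result
     of the multiplication.  */
  if (__builtin_expect (a, 0) * b)
    do_something ();   /* do_something is a placeholder */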

Fixing this issue uncovered another problem.  In 2018 Martin Liska added
code predicting that MALLOC returns non-NULL, but instead of that he predicts
that it returns true (boolean 1).  This sort of works for a testcase testing
 malloc (10) != NULL
but, for example, we will predict
 malloc (10) == malloc (10)
as true, which is not right, and such a comparison may happen in real code.

I think the proper way is to update expr_expected_value_1 to work with value
ranges, but that needs greater surgery, so I decided to postpone this and
only add a FIXME and file PR110499.

Bootstrapped/regtested x86_64-linux.  Will commit it shortly.

gcc/ChangeLog:

PR middle-end/109849
* predict.cc (estimate_bb_frequencies): Turn to static function.
(expr_expected_value_1): Fix handling of binary expressions with
predicted values.
* predict.def (PRED_MALLOC_NONNULL): Move later in the priority queue.
(PRED_BUILTIN_EXPECT_WITH_PROBABILITY): Move to almost top of the 
priority
queue.
* predict.h (estimate_bb_frequencies): No longer declare it.

gcc/testsuite/ChangeLog:

PR middle-end/109849
* gcc.dg/predict-18.c: Improve testcase.

diff --git a/gcc/predict.cc b/gcc/predict.cc
index 5e3c1d69ca4..688c0970f1c 100644
--- a/gcc/predict.cc
+++ b/gcc/predict.cc
@@ -89,6 +90,7 @@ static void predict_paths_leading_to_edge (edge, enum 
br_predictor,
 static bool can_predict_insn_p (const rtx_insn *);
 static HOST_WIDE_INT get_predictor_value (br_predictor, HOST_WIDE_INT);
 static void determine_unlikely_bbs ();
+static void estimate_bb_frequencies (bool force);
 
 /* Information we hold about each branch predictor.
Filled using information from predict.def.  */
@@ -2485,7 +2487,11 @@ expr_expected_value_1 (tree type, tree op0, enum 
tree_code code,
{
  if (predictor)
*predictor = PRED_MALLOC_NONNULL;
- return boolean_true_node;
+ /* FIXME: This is wrong and we need to convert the logic
+to value ranges.  This makes predictor to assume that
+malloc always returns (size_t)1 which is not the same
+as returning non-NULL.  */
+ return fold_convert (type, boolean_true_node);
}
 
  if (DECL_BUILT_IN_CLASS (decl) == BUILT_IN_NORMAL)
@@ -2563,7 +2569,9 @@ expr_expected_value_1 (tree type, tree op0, enum 
tree_code code,
  case BUILT_IN_REALLOC:
if (predictor)
  *predictor = PRED_MALLOC_NONNULL;
-   return boolean_true_node;
+   /* FIXME: This is wrong and we need to convert the logic
+  to value ranges.  */
+   return fold_convert (type, boolean_true_node);
  default:
break;
}
@@ -2575,18 +2583,43 @@ expr_expected_value_1 (tree type, tree op0, enum 
tree_code code,
   if (get_gimple_rhs_class (code) == GIMPLE_BINARY_RHS)
 {
   tree res;
+  tree nop0 = op0;
+  tree nop1 = op1;
+  if (TREE_CODE (op0) != INTEGER_CST)
+   {
+ /* See if expected value of op0 is good enough to determine the 
result.  */
+ nop0 = expr_expected_value (op0, visited, predictor, probability);
+ if (nop0
+ && (res = fold_build2 (code, type, nop0, op1)) != NULL
+ && TREE_CODE (res) == INTEGER_CST)
+   return res;
+ if (!nop0)
+   nop0 = op0;
+}
  

Extend ipa-fnsummary to skip __builtin_expect

2023-06-29 Thread Jan Hubicka via Gcc-patches
Compute ipa-predicates for conditionals involving __builtin_expect_p

std::vector allocator looks as follows:

__attribute__((nodiscard))
struct pair * std::__new_allocator 
>::allocate (struct __new_allocator * const this, size_type __n, const void * 
D.27753)
{
  bool _1;
  long int _2;
  long int _3;
  long unsigned int _5;
  struct pair * _9;

   [local count: 1073741824]:
  _1 = __n_7(D) > 1152921504606846975;
  _2 = (long int) _1;
  _3 = __builtin_expect (_2, 0);
  if (_3 != 0)
goto ; [10.00%]
  else
goto ; [90.00%]

   [local count: 107374184]:
  if (__n_7(D) > 2305843009213693951)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 53687092]:
  std::__throw_bad_array_new_length ();

   [local count: 53687092]:
  std::__throw_bad_alloc ();

   [local count: 966367641]:
  _5 = __n_7(D) * 8;
  _9 = operator new (_5);
  return _9;
}


So there is a check for the allocated block size being greater than max_size,
which is wrapped in __builtin_expect.  This makes ipa-fnsummary give up
analyzing predicates, and it will miss the fact that the two different calls to
__throw will be optimized out if __n is already smaller than
1152921504606846975, which it is after _M_check_len.

This patch extends ipa-fnsummary to understand functions that return their
parameter.

We still do not get the value range propagated, since _M_check_len is not
inlined early and ipa-prop misses functions returning their argument, but we
get closer :)

Bootstrapped/regtested x86_64-linux, committed.


gcc/ChangeLog:

PR tree-optimization/109849
* ipa-fnsummary.cc (decompose_param_expr): Skip
functions returning its parameter.
(set_cond_stmt_execution_predicate): Return early
if predicate was constructed.

gcc/testsuite/ChangeLog:

PR tree-optimization/109849
* gcc.dg/ipa/pr109849.c: New test.

diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
index 78cbb60d056..a09f6305c63 100644
--- a/gcc/ipa-fnsummary.cc
+++ b/gcc/ipa-fnsummary.cc
@@ -1516,6 +1516,19 @@ decompose_param_expr (struct ipa_func_body_info *fbi,
 
   if (TREE_CODE (expr) != SSA_NAME || SSA_NAME_IS_DEFAULT_DEF (expr))
break;
+  stmt = SSA_NAME_DEF_STMT (expr);
+
+  if (gcall *call = dyn_cast  (stmt))
+   {
+ int flags = gimple_call_return_flags (call);
+ if (!(flags & ERF_RETURNS_ARG))
+   goto fail;
+ int arg = flags & ERF_RETURN_ARG_MASK;
+ if (arg >= (int)gimple_call_num_args (call))
+   goto fail;
+ expr = gimple_call_arg (stmt, arg);
+ continue;
+   }
 
   if (!is_gimple_assign (stmt = SSA_NAME_DEF_STMT (expr)))
break;
@@ -1664,6 +1677,7 @@ set_cond_stmt_execution_predicate (struct 
ipa_func_body_info *fbi,
}
}
   vec_free (param_ops);
+  return;
 }
 
   if (TREE_CODE (op) != SSA_NAME)
diff --git a/gcc/testsuite/gcc.dg/ipa/pr109849.c 
b/gcc/testsuite/gcc.dg/ipa/pr109849.c
new file mode 100644
index 000..09b62f90d70
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/ipa/pr109849.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -fdump-ipa-inline-details" } */
+void bad (void);
+void
+test(int a)
+{
+   if (__builtin_expect (a>3, 0))
+   {
+   bad ();
+   bad ();
+   bad ();
+   bad ();
+   bad ();
+   bad ();
+   bad ();
+   bad ();
+   }
+}
+void
+foo (int a)
+{
+   if (a>0)
+   __builtin_unreachable ();
+   test (a);
+}
+/* { dg-final { scan-ipa-dump "Inlined 2 calls" "inline"  } } */
+/* { dg-final { scan-ipa-dump "Inlining test" "inline"  } } */


Enable early inlining into always_inline functions

2023-06-28 Thread Jan Hubicka via Gcc-patches
Hi,
the early inliner currently skips always_inline functions, and moreover we
ignore calls from always_inline functions in ipa_reverse_postorder.  This leads
to disabling most of the propagation done by early optimization, which is quite
bad when always_inline functions are not leaf functions, as is now quite common
in libstdc++.

Instead of fully disabling the inlining, this patch checks the calls in the
callee.  I am quite conservative about what can be inlined, as this patch is a
bit touchy anyway.  To avoid problems with always_inline functions being
optimized after early inlining, I extended inline_always_inline_functions to
lazily compute the fnsummary when needed.

Bootstrapped/regtested x86_64-linux, will commit it shortly.

gcc/ChangeLog:

PR middle-end/110334
* ipa-fnsummary.h (ipa_fn_summary): Add
safe_to_inline_to_always_inline.
* ipa-inline.cc (can_early_inline_edge_p): ICE
if SSA is not built; do cycle checking for
always_inline functions.
(inline_always_inline_functions): Be recrusive;
watch for cycles; do not updat overall summary.
(early_inliner): Do not give up on always_inlines.
* ipa-utils.cc (ipa_reverse_postorder): Do not skip
always inlines.

gcc/testsuite/ChangeLog:

PR middle-end/110334
* g++.dg/opt/pr66119.C: Disable early inlining.
* gcc.c-torture/compile/pr110334.c: New test.
* gcc.dg/tree-ssa/pr110334.c: New test.

diff --git a/gcc/ipa-fnsummary.h b/gcc/ipa-fnsummary.h
index fcc01167d0d..0c5a81e2dca 100644
--- a/gcc/ipa-fnsummary.h
+++ b/gcc/ipa-fnsummary.h
@@ -126,8 +126,8 @@ public:
   ipa_fn_summary ()
 : min_size (0),
   inlinable (false), single_caller (false),
-  fp_expressions (false), target_info (0),
-  estimated_stack_size (false),
+  fp_expressions (false), safe_to_inline_to_always_inline (0),
+  target_info (0), estimated_stack_size (false),
   time (0), conds (NULL),
   size_time_table (), call_size_time_table (vNULL),
   loop_iterations (NULL), loop_strides (NULL),
@@ -165,6 +165,8 @@ public:
   unsigned int single_caller : 1;
   /* True if function contains any floating point expressions.  */
   unsigned int fp_expressions : 1;
+  /* Cache for analysis of can_early_inline_edge_p.  */
+  unsigned int safe_to_inline_to_always_inline : 2;
   /* Like fp_expressions field above, but it's to hold some target specific
  information, such as some target specific isa flags.  Note that for
  offloading target compilers, this field isn't streamed.  */
diff --git a/gcc/ipa-inline.cc b/gcc/ipa-inline.cc
index efc8df7d4e0..71a1c6ca68e 100644
--- a/gcc/ipa-inline.cc
+++ b/gcc/ipa-inline.cc
@@ -680,28 +680,60 @@ can_early_inline_edge_p (struct cgraph_edge *e)
   e->inline_failed = CIF_BODY_NOT_AVAILABLE;
   return false;
 }
-  /* In early inliner some of callees may not be in SSA form yet
- (i.e. the callgraph is cyclic and we did not process
- the callee by early inliner, yet).  We don't have CIF code for this
- case; later we will re-do the decision in the real inliner.  */
-  if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (e->caller->decl))
-  || !gimple_in_ssa_p (DECL_STRUCT_FUNCTION (callee->decl)))
-{
-  if (dump_enabled_p ())
-   dump_printf_loc (MSG_MISSED_OPTIMIZATION, e->call_stmt,
-"  edge not inlinable: not in SSA form\n");
-  return false;
-}
-  else if (profile_arc_flag
-  && ((lookup_attribute ("no_profile_instrument_function",
-DECL_ATTRIBUTES (caller->decl)) == NULL_TREE)
-  != (lookup_attribute ("no_profile_instrument_function",
-DECL_ATTRIBUTES (callee->decl)) == 
NULL_TREE)))
+  gcc_assert (gimple_in_ssa_p (DECL_STRUCT_FUNCTION (e->caller->decl))
+ && gimple_in_ssa_p (DECL_STRUCT_FUNCTION (callee->decl)));
+  if (profile_arc_flag
+  && ((lookup_attribute ("no_profile_instrument_function",
+   DECL_ATTRIBUTES (caller->decl)) == NULL_TREE)
+ != (lookup_attribute ("no_profile_instrument_function",
+   DECL_ATTRIBUTES (callee->decl)) == NULL_TREE)))
 return false;
 
   if (!can_inline_edge_p (e, true, true)
   || !can_inline_edge_by_limits_p (e, true, false, true))
 return false;
+  /* When inlining regular functions into always-inline functions
+ during early inlining watch for possible inline cycles.  */
+  if (DECL_DISREGARD_INLINE_LIMITS (caller->decl)
+  && lookup_attribute ("always_inline", DECL_ATTRIBUTES (caller->decl))
+  && (!DECL_DISREGARD_INLINE_LIMITS (callee->decl)
+ || !lookup_attribute ("always_inline", DECL_ATTRIBUTES 
(callee->decl
+{
+  /* If there are indirect calls, inlining may produce direct call.
+TODO: We may lift this restriction if we avoid errors on formely
+indirect calls to always_inline functions.  Taking address
+of 

Re: [PATCH] libstdc++: Use RAII in std::vector::_M_realloc_insert

2023-06-28 Thread Jan Hubicka via Gcc-patches
> I think the __throw_bad_alloc() and __throw_bad_array_new_length()
> functions should always be rare, so marking them cold seems fine (users who
> define their own allocators that want to throw bad_alloc "often" will
> probably throw it directly, they shouldn't be using our __throw_bad_alloc()
> function anyway). I don't think __throw_bad_exception is ever used, so that
> doesn't matter (we could remove it from the header and just keep its
> definition in the library, but there's no big advantage to doing that).
> Others like __throw_length_error() should also be very very rare, and could
> be marked cold.
> 
> Maybe we should just mark everything in  as cold. If
> users want to avoid the cost of calls to those functions they can do so by
> checking function preconditions/arguments to avoid the exceptions. There
> are very few places where a throwing libstdc++ API doesn't have a way to
> avoid the exception. The only one that isn't easily avoidable is
> __throw_bad_alloc but OOM should be rare.

Hi,
this marks everything in functexcept.h as cold, and I also noticed that
we probably want to mark terminate as such too.

With the fix for PR 110436 and -O3 we now inline _M_realloc_insert, yay :)

tested on x86_64-linux, OK?

* include/bits/c++config (std::__terminate): Mark cold.
* include/bits/functexcept.h: Mark everything as cold.
* libsupc++/exception: Mark terminate and unexpected as cold.
diff --git a/libstdc++-v3/include/bits/c++config 
b/libstdc++-v3/include/bits/c++config
index 009a017b048..dd47f274d5f 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -320,7 +320,7 @@ namespace std
   extern "C++" __attribute__ ((__noreturn__, __always_inline__))
   inline void __terminate() _GLIBCXX_USE_NOEXCEPT
   {
-void terminate() _GLIBCXX_USE_NOEXCEPT __attribute__ ((__noreturn__));
+void terminate() _GLIBCXX_USE_NOEXCEPT __attribute__ 
((__noreturn__,__cold__));
 terminate();
   }
 #pragma GCC visibility pop
diff --git a/libstdc++-v3/include/bits/functexcept.h 
b/libstdc++-v3/include/bits/functexcept.h
index 2765f7865df..a2e97413661 100644
--- a/libstdc++-v3/include/bits/functexcept.h
+++ b/libstdc++-v3/include/bits/functexcept.h
@@ -57,61 +57,61 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
   // Helper for exception objects in 
   void
-  __throw_bad_cast(void) __attribute__((__noreturn__));
+  __throw_bad_cast(void) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_bad_typeid(void) __attribute__((__noreturn__));
+  __throw_bad_typeid(void) __attribute__((__noreturn__,__cold__));
 
   // Helpers for exception objects in 
   void
-  __throw_logic_error(const char*) __attribute__((__noreturn__));
+  __throw_logic_error(const char*) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_domain_error(const char*) __attribute__((__noreturn__));
+  __throw_domain_error(const char*) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_invalid_argument(const char*) __attribute__((__noreturn__));
+  __throw_invalid_argument(const char*) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_length_error(const char*) __attribute__((__noreturn__));
+  __throw_length_error(const char*) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_out_of_range(const char*) __attribute__((__noreturn__));
+  __throw_out_of_range(const char*) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_out_of_range_fmt(const char*, ...) __attribute__((__noreturn__))
+  __throw_out_of_range_fmt(const char*, ...) 
__attribute__((__noreturn__,__cold__))
 __attribute__((__format__(__gnu_printf__, 1, 2)));
 
   void
-  __throw_runtime_error(const char*) __attribute__((__noreturn__));
+  __throw_runtime_error(const char*) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_range_error(const char*) __attribute__((__noreturn__));
+  __throw_range_error(const char*) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_overflow_error(const char*) __attribute__((__noreturn__));
+  __throw_overflow_error(const char*) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_underflow_error(const char*) __attribute__((__noreturn__));
+  __throw_underflow_error(const char*) __attribute__((__noreturn__,__cold__));
 
   // Helpers for exception objects in 
   void
-  __throw_ios_failure(const char*) __attribute__((__noreturn__));
+  __throw_ios_failure(const char*) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_ios_failure(const char*, int) __attribute__((__noreturn__));
+  __throw_ios_failure(const char*, int) __attribute__((__noreturn__,__cold__));
 
   // Helpers for exception objects in 
   void
-  __throw_system_error(int) __attribute__((__noreturn__));
+  __throw_system_error(int) __attribute__((__noreturn__,__cold__));
 
   // Helpers for exception objects in 
   void
-  __throw_future_error(int) __attribute__((__noreturn__));
+  __throw_future_error(int) __attribute__((__noreturn__,__cold__));
 
   // 

Re: Enable ranger for ipa-prop

2023-06-28 Thread Jan Hubicka via Gcc-patches
> 
> On 6/27/23 12:24, Jan Hubicka wrote:
> > > On 6/27/23 09:19, Jan Hubicka wrote:
> > > > Hi,
> > > > as shown in the testcase (which would eventually be useful for
> > > > optimizing std::vector's push_back), ipa-prop can use context dependent 
> > > > ranger
> > > > queries for better value range info.
> > > > 
> > > > Bootstrapped/regtested x86_64-linux, OK?
> > > Quick question.
> > > 
> > > When you call enable_ranger(), its gives you a ranger back, but it also 
> > > sets
> > > the range query for the specified context to that same instance.  So from
> > > that point forward  all existing calls to get_range_query(fun) will now 
> > > use
> > > the context ranger
> > > 
> > > enable_ranger (struct function *fun, bool use_imm_uses)
> > > <...>
> > >    gcc_checking_assert (!fun->x_range_query);
> > >    r = new gimple_ranger (use_imm_uses);
> > >    fun->x_range_query = r;
> > >    return r;
> > > 
> > > So you probably dont have to pass a ranger around?  or is that ranger you
> > > are passing for a different context?
> > I don't need passing ranger around - I just did not know that.  I tought
> > the default one is the context insensitive one, I will simplify the
> > patch.  I need to look more into how ranger works.
> > 
> > 
> No need. Its magic!
Cool, thanks for the explanation!
I pushed the following simplified version of the patch.  Now back to
getting push_back faster :)

gcc/ChangeLog:

PR tree-optimization/110377
* ipa-prop.cc (ipa_compute_jump_functions_for_edge): Pass statement to
the ranger query.
(ipa_analyze_node): Enable ranger.

gcc/testsuite/ChangeLog:

PR tree-optimization/110377
* gcc.dg/ipa/pr110377.c: New test.

diff --git a/gcc/ipa-prop.cc b/gcc/ipa-prop.cc
index 41c812194ca..33bda8288fc 100644
--- a/gcc/ipa-prop.cc
+++ b/gcc/ipa-prop.cc
@@ -2386,7 +2386,7 @@ ipa_compute_jump_functions_for_edge (struct 
ipa_func_body_info *fbi,
 
  if (TREE_CODE (arg) == SSA_NAME
  && param_type
- && get_range_query (cfun)->range_of_expr (vr, arg)
+ && get_range_query (cfun)->range_of_expr (vr, arg, cs->call_stmt)
  && vr.nonzero_p ())
addr_nonzero = true;
  else if (tree_single_nonzero_warnv_p (arg, _overflow))
@@ -2408,7 +2408,7 @@ ipa_compute_jump_functions_for_edge (struct 
ipa_func_body_info *fbi,
  && Value_Range::supports_type_p (param_type)
  && irange::supports_p (TREE_TYPE (arg))
  && irange::supports_p (param_type)
- && get_range_query (cfun)->range_of_expr (vr, arg)
+ && get_range_query (cfun)->range_of_expr (vr, arg, cs->call_stmt)
  && !vr.undefined_p ())
{
  Value_Range resvr (vr);
@@ -3190,7 +3190,9 @@ ipa_analyze_node (struct cgraph_node *node)
   bi->cg_edges.safe_push (cs);
 }
 
+  enable_ranger (cfun, false);
   analysis_dom_walker ().walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+  disable_ranger (cfun);
 
   ipa_release_body_info ();
   free_dominance_info (CDI_DOMINATORS);
diff --git a/gcc/testsuite/gcc.dg/ipa/pr110377.c 
b/gcc/testsuite/gcc.dg/ipa/pr110377.c
new file mode 100644
index 000..63120a97bd0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/ipa/pr110377.c
@@ -0,0 +1,16 @@
+/* { dg-do compile */
+/* { dg-options "-O2 -fdump-ipa-cp" } */
+int test3(int);
+__attribute__ ((noinline))
+void test2(int a)
+{
+   test3(a);
+}
+void
+test(int n)
+{
+if (n > 5)
+  __builtin_unreachable ();
+test2(n);
+}
+/* { dg-final { scan-ipa-dump "-INF, 5" "cp" } }  */


Re: Enable ranger for ipa-prop

2023-06-27 Thread Jan Hubicka via Gcc-patches
> 
> On 6/27/23 09:19, Jan Hubicka wrote:
> > Hi,
> > as shown in the testcase (which would eventually be useful for
> > optimizing std::vector's push_back), ipa-prop can use context dependent 
> > ranger
> > queries for better value range info.
> > 
> > Bootstrapped/regtested x86_64-linux, OK?
> 
> Quick question.
> 
> When you call enable_ranger(), its gives you a ranger back, but it also sets
> the range query for the specified context to that same instance.  So from
> that point forward  all existing calls to get_range_query(fun) will now use
> the context ranger
> 
> enable_ranger (struct function *fun, bool use_imm_uses)
> <...>
>   gcc_checking_assert (!fun->x_range_query);
>   r = new gimple_ranger (use_imm_uses);
>   fun->x_range_query = r;
>   return r;
> 
> So you probably dont have to pass a ranger around?  or is that ranger you
> are passing for a different context?

I don't need to pass the ranger around - I just did not know that.  I thought
the default one was the context-insensitive one; I will simplify the
patch.  I need to look more into how ranger works.

Honza
> 
> 
> Andrew
> 
> 


Enable ranger for ipa-prop

2023-06-27 Thread Jan Hubicka via Gcc-patches
Hi,
as shown in the testcase (which would eventually be useful for
optimizing std::vector's push_back), ipa-prop can use context dependent ranger
queries for better value range info.

Bootstrapped/regtested x86_64-linux, OK?

Honza

gcc/ChangeLog:

PR middle-end/110377
* ipa-prop.cc (ipa_compute_jump_functions_for_edge): Add ranger
parameter; use ranger instance for rnage queries.
(ipa_compute_jump_functions_for_bb): Pass around ranger.
(analysis_dom_walker::before_dom_children): Enable ranger.

gcc/testsuite/ChangeLog:

PR middle-end/110377
* gcc.dg/tree-ssa/pr110377.c: New test.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr110377.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr110377.c
new file mode 100644
index 000..cbe3441caea
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr110377.c
@@ -0,0 +1,16 @@
+/* { dg-do compile */
+/* { dg-options "-O2 -fdump-ipa-fnsummary" } */
+int test3(int);
+__attribute__ ((noinline))
+void test2(int a)
+{
+   test3(a);
+}
+void
+test(int n)
+{
+if (n > 5)
+  __builtin_unreachable ();
+test2(n);
+}
+/* { dg-final { scan-tree-dump "-INF, 5-INF" "fnsummary" } }  */
diff --git a/gcc/ipa-prop.cc b/gcc/ipa-prop.cc
index 41c812194ca..693d4805d93 100644
--- a/gcc/ipa-prop.cc
+++ b/gcc/ipa-prop.cc
@@ -2341,7 +2341,8 @@ ipa_set_jfunc_vr (ipa_jump_func *jf, const ipa_vr )
 
 static void
 ipa_compute_jump_functions_for_edge (struct ipa_func_body_info *fbi,
-struct cgraph_edge *cs)
+struct cgraph_edge *cs,
+gimple_ranger *ranger)
 {
   ipa_node_params *info = ipa_node_params_sum->get (cs->caller);
   ipa_edge_args *args = ipa_edge_args_sum->get_create (cs);
@@ -2386,7 +2387,7 @@ ipa_compute_jump_functions_for_edge (struct 
ipa_func_body_info *fbi,
 
  if (TREE_CODE (arg) == SSA_NAME
  && param_type
- && get_range_query (cfun)->range_of_expr (vr, arg)
+ && get_range_query (cfun)->range_of_expr (vr, arg, cs->call_stmt)
  && vr.nonzero_p ())
addr_nonzero = true;
  else if (tree_single_nonzero_warnv_p (arg, _overflow))
@@ -2408,7 +2409,7 @@ ipa_compute_jump_functions_for_edge (struct 
ipa_func_body_info *fbi,
  && Value_Range::supports_type_p (param_type)
  && irange::supports_p (TREE_TYPE (arg))
  && irange::supports_p (param_type)
- && get_range_query (cfun)->range_of_expr (vr, arg)
+ && ranger->range_of_expr (vr, arg, cs->call_stmt)
  && !vr.undefined_p ())
{
  Value_Range resvr (vr);
@@ -2517,7 +2518,8 @@ ipa_compute_jump_functions_for_edge (struct 
ipa_func_body_info *fbi,
from BB.  */
 
 static void
-ipa_compute_jump_functions_for_bb (struct ipa_func_body_info *fbi, basic_block 
bb)
+ipa_compute_jump_functions_for_bb (struct ipa_func_body_info *fbi, basic_block 
bb,
+  gimple_ranger *ranger)
 {
   struct ipa_bb_info *bi = ipa_get_bb_info (fbi, bb);
   int i;
@@ -2536,7 +2538,7 @@ ipa_compute_jump_functions_for_bb (struct 
ipa_func_body_info *fbi, basic_block b
  && !gimple_call_fnspec (cs->call_stmt).known_p ())
continue;
}
-  ipa_compute_jump_functions_for_edge (fbi, cs);
+  ipa_compute_jump_functions_for_edge (fbi, cs, ranger);
 }
 }
 
@@ -3110,19 +3112,27 @@ class analysis_dom_walker : public dom_walker
 {
 public:
   analysis_dom_walker (struct ipa_func_body_info *fbi)
-: dom_walker (CDI_DOMINATORS), m_fbi (fbi) {}
+: dom_walker (CDI_DOMINATORS), m_fbi (fbi)
+  {
+m_ranger = enable_ranger (cfun, false);
+  }
+  ~analysis_dom_walker ()
+  {
+disable_ranger (cfun);
+  }
 
   edge before_dom_children (basic_block) final override;
 
 private:
   struct ipa_func_body_info *m_fbi;
+  gimple_ranger *m_ranger;
 };
 
 edge
 analysis_dom_walker::before_dom_children (basic_block bb)
 {
   ipa_analyze_params_uses_in_bb (m_fbi, bb);
-  ipa_compute_jump_functions_for_bb (m_fbi, bb);
+  ipa_compute_jump_functions_for_bb (m_fbi, bb, m_ranger);
   return NULL;
 }
 


Re: [PATCH] Improve DSE to handle stores before __builtin_unreachable ()

2023-06-26 Thread Jan Hubicka via Gcc-patches
Hi,
playing with testcases for path isolation and const functions, I noticed
that we do not seem to even try to isolate out-of-range array accesses:
int a[3]={0,1,2};
test(int i)
{
   if (i > 3)
 return test2(a[i]);
   return a[i];
}

Here the call to test2 is dead, since a[i] will access memory past the end of
the array.  We produce a warning:

t.c:5:24: warning: array subscript 4 is above array bounds of ‘int[3]’ 
[-Warray-bounds=]

but we still keep the call:

test:
.LFB0:
.cfi_startproc
movslq  %edi, %rax
movla(,%rax,4), %eax
cmpl$3, %edi
jg  .L4
ret
.p2align 4,,10
.p2align 3
.L4:
movl%eax, %edi
xorl%eax, %eax
jmp test2

We eventually move the load before the conditional, but at path-isolation
time it is still quite obvious that the conditional being true invokes
undefined behaviour:

int test (int i)
{
  int _1;
  int _2;
  int _6;
  int _8;
  
   [local count: 1073741824]:
  if (i_4(D) > 3)
goto ; [20.24%]
  else
goto ; [79.76%]

   [local count: 217325344]:
  _1 = a[i_4(D)];
  _8 = test2 (_1);
  goto ; [100.00%]

   [local count: 856416481]:
  _6 = a[i_4(D)];
  
   [local count: 1073741824]:
  # _2 = PHI <_8(3), _6(4)>
  return _2;
} 

Curiously, adjusting the testcase:

const int a[3]={0,1,2};
test(int i)
{
if (i == 3)
return test2(a[i]);
return a[i];
}
no longer has undefined behaviour visible at isolate-paths
int test (int i)
{
  int _1;
  int _5;
  int _7;

   [local count: 1073741824]:
  if (i_3(D) == 3)
goto ; [11.56%]
  else
goto ; [88.44%]

   [local count: 124124552]:
  _7 = test2 (0);
  goto ; [100.00%]

   [local count: 949617273]:
  _5 = a[i_3(D)];

   [local count: 1073741824]:
  # _1 = PHI <_7(3), _5(4)>
  return _1;
}
since we fold the load to 0.  It would perhaps help optimizers to keep info on 
undefined behaviour happening there.

Honza


Fix profile of forwarders produced by cd-dce

2023-06-26 Thread Jan Hubicka via Gcc-patches
Hi,
compiling the testcase from PR109849 (which uses a std::vector-based stack to
drive a loop) with profile feedback leads to profile mismatches introduced by
tree-ssa-dce.  The culprit is the new code to produce unified forwarder blocks
for PHIs.

I am not including the testcase itself, since checking it for "Invalid sum"
messages is probably going to be too fragile and this should show up in our LNT
testers.  The patch however fixes the mismatch.

Bootstrapped/regtested x86_64-linux and plan to commit it shortly.

gcc/ChangeLog:

PR tree-optimization/109849
* tree-ssa-dce.cc (make_forwarders_with_degenerate_phis): Fix profile
count of newly constructed forwarder block.

diff --git a/gcc/tree-ssa-dce.cc b/gcc/tree-ssa-dce.cc
index 2949957f883..f0b02456132 100644
--- a/gcc/tree-ssa-dce.cc
+++ b/gcc/tree-ssa-dce.cc
@@ -1865,12 +1865,15 @@ make_forwarders_with_degenerate_phis (function *fn)
}
  free_dominance_info (fn, CDI_DOMINATORS);
  basic_block forwarder = split_edge (args[start].first);
+ profile_count count = profile_count::zero ();
  for (unsigned j = start + 1; j < i; ++j)
{
  edge e = args[j].first;
  redirect_edge_and_branch_force (e, forwarder);
  redirect_edge_var_map_clear (e);
+ count += e->count ();
}
+ forwarder->count = count;
  if (vphi)
{
  tree def = copy_ssa_name (vphi_args[0]);


Re: [PATCH] Improve DSE to handle stores before __builtin_unreachable ()

2023-06-25 Thread Jan Hubicka via Gcc-patches
> > Also as discussed some time ago, the volatile loads between traps has
> > effect of turning previously pure/const functions into non-const which
> > is somewhat sad, so it is still on my todo list to change it this stage1
> > to something more careful.   We discussed internal functions trap_store
> > and trap_load which will expand to load/store + trap but will make it
> > clear that side effect does not count to modref.
> It's been a long time since I looked at this code -- isn't it the case that
> we already must have had a load/store and that all we've done is change its
> form (to enable more DCE) and added the volatile marker?
> 
> Meaning that it couldn't have been pure/cost before, could it?  Or is it the
> case that you want to not have the erroneous path be the sole reason to
> spoil pure/const detection -- does that happen often in practice?

I noticed this while looking into cases during GCC bootstrap where
function analysis got worse after inlining than before, so it happens
in practice.

modref can now analyse independently whether a function reads/writes
memory (and what memory) and whether a function has other side effects or
is non-deterministic.  We can now DSE a function call if it has no side
effects and all of its memory stores are understood and dead.
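
A minimal illustrative example of a call that becomes removable (names are
placeholders):

  /* modref can prove that setter only writes through its argument and
     has no other side effects.  */
  static void setter (int *p) { *p = 123; }

  int f (void)
  {
    int tmp;
    setter (&tmp);   /* the only store is to tmp, which is never read,
                        so the store is dead and the call can be DSEd */
    return 0;
  }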

The problem is that split paths turns undefined behaviour into side effects
and blocks this optimization.

Honza


Re: [PATCH] libstdc++: Use RAII in std::vector::_M_realloc_insert

2023-06-23 Thread Jan Hubicka via Gcc-patches
> I intend to push this to trunk once testing finishes.
> 
> I generated the diff with -b so the whitespace changes aren't shown,
> because there was some re-indenting that makes the diff look larger than
> it really is.
> 
> Honza, I don't think this is likely to make much difference for the PR
> 110287 testcases, but I think it simplifies the code and so is an
> improvement in terms of maintenance and readability.

Thanks for cleaning it up :)
The new version seems slightly smaller than the original in inliner
metrics.

I started to look at whether we can break useful parts out of
_M_realloc_insert to make it smaller so it fits the -O3 inline limits.
ipa-fnsplit does some "useful" work, like breaking out:

   [local count: 107374184]:
  if (__n_9(D) > 2305843009213693951)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 53687092]:
  std::__throw_bad_array_new_length ();

   [local count: 53687092]:
  std::__throw_bad_alloc ();

from std::__new_allocator ::allocate
into a separate function, which saves another 4 instructions in the estimate.

It is fun to notice that both checks are dead with the default predictor,
but we do not know that, because _M_check_len is not inlined and we do
not have return-value value range propagation, which I will add.  With
your proposed change to _M_check_len we will also need to solve PR110377
and actually notice the value range implied by __builtin_unreachable
early enough.

What also goes wrong, however, is that after splitting we decide to inline
the split part back before we consider inlining _M_realloc_insert, so the
savings do not help.  The reason is that the profile is estimated as:

  _4 = __builtin_expect (_3, 0);
  if (_4 != 0)
goto ; [10.00%]
  else
goto ; [90.00%]

so we expect that with 10% probability the allocation will exceed the 64-bit
address space.  The reason is that __builtin_expect is defined to have a
10% miss rate, which we can't change, since it is used in algorithms where
the probability of the unlikely value really is non-zero.

There is __builtin_expect_with_probability, which makes it possible to
set the probability to 0 or 100% and may be better in such a situation;
however, here it is useless.  If a code path leads to a noreturn function,
we predict it as noreturn.  This heuristic has lower precedence than
__builtin_expect so it is not applied, but it would do the same work.
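
A minimal illustrative sketch of what that builtin allows (hypothetical use,
not something proposed for the library here):

  /* Expect the condition to be false with probability 1.0, instead of
     the fixed 90%/10% split implied by plain __builtin_expect.  */
  if (__builtin_expect_with_probability (__n > __max_size, 0, 1.0))
    std::__throw_bad_alloc ();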

To work out that the code path is really very unlikely and should be
offloaded to a cold section, we however need:

diff --git a/libstdc++-v3/include/bits/functexcept.h 
b/libstdc++-v3/include/bits/functexcept.h
index 89972baf2c9..2765f7865df 100644
--- a/libstdc++-v3/include/bits/functexcept.h
+++ b/libstdc++-v3/include/bits/functexcept.h
@@ -46,14 +46,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 #if _GLIBCXX_HOSTED
   // Helper for exception objects in 
   void
-  __throw_bad_exception(void) __attribute__((__noreturn__));
+  __throw_bad_exception(void) __attribute__((__noreturn__,__cold__));
 
   // Helper for exception objects in 
   void
-  __throw_bad_alloc(void) __attribute__((__noreturn__));
+  __throw_bad_alloc(void) __attribute__((__noreturn__,__cold__));
 
   void
-  __throw_bad_array_new_length(void) __attribute__((__noreturn__));
+  __throw_bad_array_new_length(void) __attribute__((__noreturn__,__cold__));
 
   // Helper for exception objects in 
   void

This makes us drop the count to profile_count::zero, which indicates that
the code is very likely not executed at all during the run of the program.

The reason why we can't take such a strong hint from the unreachable
attribute is twofold.  First, most programs do call "exit (0)", so taking
this as a strong hint may make us optimize the whole program for size.
Second, we consider the possibility that insane developers may make
EH delivery relatively common.

Would it be possible to annotate the throw functions in libstdc++ which are
very unlikely to be reached by a working program as __cold__, and possibly drop
the redundant __builtin_expect?

I will reorder the predictors so that __builtin_cold_noreturn and
__builtin_expect_with_probability take precedence over
__builtin_expect.

It is fun to see how many things can go wrong in such a simple use of
libstdc++ :)

Honza


Re: Tiny phiprop compile time optimization

2023-06-23 Thread Jan Hubicka via Gcc-patches
Hi,
here is the updated version with TODO_update_ssa_only_virtuals.
Bootstrapped/regtested x86_64-linux.  OK?

gcc/ChangeLog:

* tree-ssa-phiprop.cc (propagate_with_phi): Compute post dominators on
demand.
(pass_phiprop::execute): Do not compute it here; return
update_ssa_only_virtuals if something changed.
(pass_data_phiprop): Remove TODO_update_ssa from todos.


diff --git a/gcc/tree-ssa-phiprop.cc b/gcc/tree-ssa-phiprop.cc
index 8c9ce903472..b01ef4495c2 100644
--- a/gcc/tree-ssa-phiprop.cc
+++ b/gcc/tree-ssa-phiprop.cc
@@ -340,6 +340,9 @@ propagate_with_phi (basic_block bb, gphi *phi, struct 
phiprop_d *phivn,
   gimple *def_stmt;
   tree vuse;
 
+  if (!dom_info_available_p (cfun, CDI_POST_DOMINATORS))
+   calculate_dominance_info (CDI_POST_DOMINATORS);
+
   /* Only replace loads in blocks that post-dominate the PHI node.  That
  makes sure we don't end up speculating loads.  */
   if (!dominated_by_p (CDI_POST_DOMINATORS,
@@ -485,7 +488,7 @@ const pass_data pass_data_phiprop =
   0, /* properties_provided */
   0, /* properties_destroyed */
   0, /* todo_flags_start */
-  TODO_update_ssa, /* todo_flags_finish */
+  0, /* todo_flags_finish */
 };
 
 class pass_phiprop : public gimple_opt_pass
@@ -513,7 +516,6 @@ pass_phiprop::execute (function *fun)
   size_t n;
 
   calculate_dominance_info (CDI_DOMINATORS);
-  calculate_dominance_info (CDI_POST_DOMINATORS);
 
   n = num_ssa_names;
   phivn = XCNEWVEC (struct phiprop_d, n);
@@ -539,7 +541,7 @@ pass_phiprop::execute (function *fun)
 
   free_dominance_info (CDI_POST_DOMINATORS);
 
-  return 0;
+  return did_something ? TODO_update_ssa_only_virtuals : 0;
 }
 
 } // anon namespace


Re: Do not account __builtin_unreachable guards in inliner

2023-06-23 Thread Jan Hubicka via Gcc-patches
> 
> So you need to feed it with extra info on the optimized out stmts because
> as-is it will not remove __builtin_unreachable ().  That means you're

My plan was to add an entry point to tree-ssa-dce that will take a
set of stmts declared dead by external force and will do the usual mark
stage, bypassing mark_stmt_if_necessary if the stmt is in the set of
dead ones.

> doing the find_obviously_necessary_stmts manually, skipping the
> conditional and all stmts it controls to the __builtin_unreachable () path?
> 
> I also think you want something cheaper than non-cd-dce mark, you also don't
> want to bother with stores/loads?

You are probably right.  cd-dce marking became a bit of a monster and I do
not want to care about memory.
One could add an extra flag to avoid processing of memory, but the code I
would re-use is quite small.

I can do my own mark stage, just considering phis, pre-identified
conditionals and basic gimple_assigns with no side effects as possibly
unnecessary stmts.  I can completely ignore debug stmts.

So it should be one pass through the statements to populate the worklist
and a simple walk of the SSA graph to propagate it.

> 
> Also when you only do this conditional how do you plan to use the result?

Well, the analysis is a loop that walks all basic blocks and then all
stmts.  I can keep track of whether the computation of live stmts was done
and in that case query the flag, assuming it is true otherwise.

Honza


Re: Ping [PATCH v4] Add condition coverage profiling

2023-06-23 Thread Jan Hubicka via Gcc-patches
> > 
> > gcc/ChangeLog:
> > 
> > * builtins.cc (expand_builtin_fork_or_exec): Check
> > profile_condition_flag.
> > * collect2.cc (main): Add -fno-profile-conditions to OBSTACK.
> > * common.opt: Add new options -fprofile-conditions and
> > * doc/gcov.texi: Add --conditions documentation.
> > * doc/invoke.texi: Add -fprofile-conditions documentation.
> > * gcc.cc: Link gcov on -fprofile-conditions.
> > * gcov-counter.def (GCOV_COUNTER_CONDS): New.
> > * gcov-dump.cc (tag_conditions): New.
> > * gcov-io.h (GCOV_TAG_CONDS): New.
> > (GCOV_TAG_CONDS_LENGTH): Likewise.
> > (GCOV_TAG_CONDS_NUM): Likewise.
> > * gcov.cc (class condition_info): New.
> > (condition_info::condition_info): New.
> > (condition_info::popcount): New.
> > (struct coverage_info): New.
> > (add_condition_counts): New.
> > (output_conditions): New.
> > (print_usage): Add -g, --conditions.
> > (process_args): Likewise.
> > (output_intermediate_json_line): Output conditions.
> > (read_graph_file): Read conditions counters.
> > (read_count_file): Read conditions counters.
> > (file_summary): Print conditions.
> > (accumulate_line_info): Accumulate conditions.
> > (output_line_details): Print conditions.
> > * ipa-inline.cc (can_early_inline_edge_p): Check
> > profile_condition_flag.
> > * ipa-split.cc (pass_split_functions::gate): Likewise.
> > * passes.cc (finish_optimization_passes): Likewise.
> > * profile.cc (find_conditions): New declaration.
> > (cov_length): Likewise.
> > (cov_blocks): Likewise.
> > (cov_masks): Likewise.
> > (cov_free): Likewise.
> > (instrument_decisions): New.
> > (read_thunk_profile): Control output to file.
> > (branch_prob): Call find_conditions, instrument_decisions.
> > (init_branch_prob): Add total_num_conds.
> > (end_branch_prob): Likewise.
> > * tree-profile.cc (struct conds_ctx): New.
> > (CONDITIONS_MAX_TERMS): New.
> > (EDGE_CONDITION): New.
> > (cmp_index_map): New.
> > (index_of): New.
> > (block_conditional_p): New.
> > (edge_conditional_p): New.
> > (single): New.
> > (single_edge): New.
> > (contract_edge): New.
> > (contract_edge_up): New.
> > (ancestors_of): New.
> > (struct outcomes): New.
> > (conditional_succs): New.
> > (condition_index): New.
> > (masking_vectors): New.
> > (cond_reachable_from): New.
> > (neighborhood): New.
> > (isolate_expression): New.
> > (emit_bitwise_op): New.
> > (make_index_map_visit): New.
> > (make_index_map): New.
> > (collect_conditions): New.
> > (yes): New.
> > (struct condcov): New.
> > (cov_length): New.
> > (cov_blocks): New.
> > (cov_masks): New.
> > (cov_free): New.
> > (find_conditions): New.
> > (instrument_decisions): New.
> > (tree_profiling): Check profile_condition_flag.
> > (pass_ipa_tree_profile::gate): Likewise.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > * lib/gcov.exp: Add condition coverage test function.
> > * g++.dg/gcov/gcov-18.C: New test.
> > * gcc.misc-tests/gcov-19.c: New test.
> > * gcc.misc-tests/gcov-20.c: New test.
> > * gcc.misc-tests/gcov-21.c: New test.
> > ---
> > 
> > v2:
> > * Moved the docs to rst/sphinx
> > * Output and message uses the 'conditions outcomes' vocabulary
> > * Fixed errors reported by contrib/style-check. Note that a few
> >   warnings persist but are either in comments (ascii art) or because
> >   the surrounding code (typically lists) are formatted the same way
> > v3:
> > * Revert docs from rst/sphinx to texinfo
> > v4:
> > * Rebased on trunk, removed @gol from texi
> > 
> >  gcc/builtins.cc|2 +-
> >  gcc/collect2.cc|7 +-
> >  gcc/common.opt |8 +
> >  gcc/doc/gcov.texi  |   37 +
> >  gcc/doc/invoke.texi|   19 +
> >  gcc/gcc.cc |4 +-
> >  gcc/gcov-counter.def   |3 +
> >  gcc/gcov-dump.cc   |   24 +
> >  gcc/gcov-io.h  |3 +
> >  gcc/gcov.cc|  200 +++-
> >  gcc/ipa-inline.cc  |2 +-
> >  gcc/ipa-split.cc   |3 +-
> >  gcc/passes.cc  |3 +-
> >  gcc/profile.cc |   84 +-
> >  gcc/testsuite/g++.dg/gcov/gcov-18.C|  234 +
> >  gcc/testsuite/gcc.misc-tests/gcov-19.c | 1250 
> >  gcc/testsuite/gcc.misc-tests/gcov-20.c |   22 +
> >  gcc/testsuite/gcc.misc-tests/gcov-21.c |   16 +
> >  gcc/testsuite/lib/gcov.exp |  191 +++-
> >  gcc/tree-profile.cc| 1048 +++-
> >  libgcc/libgcov-merge.c |5 +
> >  21 files changed, 3137 insertions(+), 28 

Re: Do not account __builtin_unreachable guards in inliner

2023-06-23 Thread Jan Hubicka via Gcc-patches
> On Mon, Jun 19, 2023 at 12:15 PM Jan Hubicka  wrote:
> >
> > > On Mon, Jun 19, 2023 at 9:52 AM Jan Hubicka via Gcc-patches
> > >  wrote:
> > > >
> > > > Hi,
> > > > this was suggested earlier somewhere, but I cannot find the thread.
> > > > C++ has the assume attribute that expands into
> > > >   if (conditional)
> > > >     __builtin_unreachable ()
> > > > We do not want to account the conditional in inline heuristics since
> > > > we know that it is going to be optimized out.
> > > >
> > > > Bootstrapped/regtested x86_64-linux, will commit it later today if
> > > > there are no complaints.
> > >
> > > I think we also had the request to not account the condition feeding
> > > stmts (if they only feed it and have no side-effects).  libstdc++ has
> > > complex range comparisons here.  Also ...
> >
> > I was thinking of this: it depends on how smart we want to get.
> > We also have dead conditionals guarding clobbers, predicts and other
> > stuff.  In general we can use the mark phase of cd-dce, telling it to
> > ignore those statements, and then use its result in the analysis.
> 
> Hmm, possible but a bit heavy-handed.  There's simple_dce_from_worklist
> which might be a way to do this (of course we cannot use that 1:1).  Also
> then consider
> 
>  a = a + 1;
>  if (a > 10)
>__builtin_unreachable ();
>  if (a < 5)
>__builtin_unreachable ();
> 
> and a has more than one use but both are going away.  So indeed a
> more global analysis would be needed to get the full benefit.

I was looking into simple_dce_from_worklist and, if I understand it
right, it simply walks a list of SSA names which probably lost some uses
in the consuming pass.  If they have zero non-debug uses and the defining
statement has no side effects, then they are removed.

I think this does not really fit the bill here, since the example above
is likely to be common, and if we want a conditional fed by a single
assignment optimized out, we probably want to handle the case with
multiple assignments as well.  What about
 1) walk the function body and see if there are conditionals we know will
be optimized out (at the beginning those can only be the ones which have
one arm reaching __builtin_unreachable),
 2) if there are none, just proceed with fnsummary construction,
 3) if there are some, do a non-cd-dce mark stage which will skip the
dead conditionals identified in 1
and proceed to fnsummary construction with an additional bitmap of
marked stmts.

This should be cheaper than unconditionally doing cd-dce and should
handle common cases?
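
For reference, a minimal source-level example of the guard pattern being
discussed (a hedged illustration only; the function and variable names are
made up, not taken from the patch).  Both the comparison and the statements
that exist only to feed it should be folded away, so fnsummary should not
charge them to the caller:

// Hypothetical example: the checks below only carry the assumption (the
// assume-style expansion mentioned above), so once the guard is folded
// away none of this code survives.
inline int at (const int *buf, int len, int i)
{
  bool lo_ok = i >= 0;          // feeds only the guard
  bool hi_ok = i < len;         // feeds only the guard
  if (!(lo_ok && hi_ok))
    __builtin_unreachable ();   // conditional with one arm reaching unreachable
  return buf[i];
}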
Honza


Re: [PATCH] Improve DSE to handle stores before __builtin_unreachable ()

2023-06-22 Thread Jan Hubicka via Gcc-patches
> 
> 
> On 6/22/23 00:31, Richard Biener wrote:
> > I think there's a difference in that __builtin_trap () is observable
> > while __builtin_unreachable () is not and reaching __builtin_unreachable
> > () invokes undefined behavior while reaching __builtin_trap () does not.
> > 
> > So the isolation code marking the trapping code volatile should be
> > enough and the trap () is just there to end the basic block
> > (and maybe be on the safe side to really trap).
> Agreed WRT observability -- but that's not really the point of the trap and
> if we wanted we could change that behavior.
> 
> The trap is there to halt execution immediately rather than letting it keep
> running.  That was a design decision from a security standpoint. If we've
> detected that we're executing undefined behavior, stop rather than
> potentially letting a malicious actor turn a bug into an exploit.

Also, as discussed some time ago, the volatile loads between traps have
the effect of turning previously pure/const functions into non-const ones,
which is somewhat sad, so it is still on my todo list to change this during
stage1 to something more careful.  We discussed internal functions trap_store
and trap_load which would expand to a load/store + trap but make it
clear that the side effect does not count for modref.
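
For context, a hedged sketch of what the isolated path looks like today,
written as equivalent source (the function is made up; only the
volatile-access-plus-trap shape comes from the discussion above):

// Roughly what path isolation leaves behind for a statically detected null
// dereference: a volatile load, kept alive precisely because it is volatile,
// followed by __builtin_trap () to halt immediately.  The volatile access is
// a side effect, which is what keeps the caller from being pure/const.
int value_or_die (const int *p)
{
  if (p == nullptr)
    {
      *(volatile const int *) 0;   // isolated trapping load
      __builtin_trap ();           // stop execution rather than continue
    }
  return *p;
}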

I wanted to give it some time to see if I can come up with something
better, but have not so far.

Honza


Re: [PATCH] Improve DSE to handle stores before __builtin_unreachable ()

2023-06-21 Thread Jan Hubicka via Gcc-patches
> 
> If I manually add a __builtin_unreachable () to the above case
> I see the *(int *)0 = 0; store DSEd.  Maybe we should avoid
> removing stores that might trap here?  POSIX wise such a trap
> could be a way to jump out of the path leading to unreachable ()
> via siglongjmp ...

I am not sure how much POSIX actually promises here.
I don't think we are supposed to keep such undefined behaviours in
their original order.  We compile:

int test (int *a, int *b, int c)
{
  int res = *a;
  return res + *b / c;
}

to:

test:
.LFB0:
        .cfi_startproc
        movl    (%rsi), %eax
        movl    %edx, %ecx
        cltd
        idivl   %ecx
        addl    (%rdi), %eax
        ret

So we read *b before *a.  Passing a==NULL, b non-null and c==0 and
using a SIGSEGV handler to recover the program before the division by 0
will not work with optimization.

Reaching unreachable is always undefined behaviour so I think we are
safe to reorder it with a load.
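
A minimal example of the transformation under discussion (a hedged
illustration, not taken from the patch or its testsuite):

// The store below may be deleted by DSE: every path through f () reaches
// __builtin_unreachable (), so no conforming execution can observe the
// store, even though it might trap at run time.
void f (int *p)
{
  *p = 0;                     // dead store before unreachable
  __builtin_unreachable ();
}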
Honza


> 
> Thanks,
> Richard.


Re: [libstdc++] Improve M_check_len

2023-06-20 Thread Jan Hubicka via Gcc-patches
> > >
> > >   size_type
> > >   _M_check_len(size_type __n, const char* __s) const
> > >   {
> > > const size_type __size = size();
> > > const size_type __max_size = max_size();
> > >
> > > if (__is_same(allocator_type, allocator<_Tp>)
> > >   && __size > __max_size / 2)
> > >
> > 
> > This check is wrong for C++17 and older standards, because max_size()
> > changed value in C++20.
> > 
> > In C++17 it was PTRDIFF_MAX / sizeof(T) but in C++20 it's SIZE_MAX /
> > sizeof(T). So on 32-bit targets using C++17, it's possible a std::vector
> > could use PTRDIFF_MAX/2 bytes, and then the size <= max_size/2 assumption
> > would not hold.
> 
> Can we go with this perhaps only for 64bit targets?
> I am not sure how completely safe this idea is in the 32bit world: I guess
> one can have an OS that lets you allocate half of the address space as one
> allocation.

Perhaps something like:
  __size > std::min ((uint64_t)__max_size, ((uint64_t)1 << 62) / sizeof (_Tp))
is safe for all allocators and on 32bit, so we won't need the __is_same test
nor the test for 64bit?
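
Spelled out as a standalone helper, the suggested test would look roughly
like this (a sketch only; the helper name is invented, the bound is exactly
the one proposed above):

#include <algorithm>
#include <cstddef>
#include <cstdint>

// Sketch: allow the fast, non-throwing growth path only while the current
// size is below both the allocator's max_size() and 2^62 / sizeof(_Tp).
template<typename _Tp>
bool __growth_fast_path_ok(std::size_t __size, std::uint64_t __max_size)
{
  const std::uint64_t __bound
    = std::min<std::uint64_t>(__max_size,
                              (std::uint64_t(1) << 62) / sizeof(_Tp));
  return __size <= __bound;
}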

Honza
> 
> Thanks!
> Honza


Re: [libstdc++] Improve M_check_len

2023-06-20 Thread Jan Hubicka via Gcc-patches
> >
> >   size_type
> >   _M_check_len(size_type __n, const char* __s) const
> >   {
> > const size_type __size = size();
> > const size_type __max_size = max_size();
> >
> > if (__is_same(allocator_type, allocator<_Tp>)
> >   && __size > __max_size / 2)
> >
> 
> This check is wrong for C++17 and older standards, because max_size()
> changed value in C++20.
> 
> In C++17 it was PTRDIFF_MAX / sizeof(T) but in C++20 it's SIZE_MAX /
> sizeof(T). So on 32-bit targets using C++17, it's possible a std::vector
> could use PTRDIFF_MAX/2 bytes, and then the size <= max_size/2 assumption
> would not hold.

Can we go with this perhaps only for 64bit targets?
I am not sure how completely safe this idea is in the 32bit world: I guess
one can have an OS that lets you allocate half of the address space as one
allocation.

Thanks!
Honza


Re: [libstdc++] Improve M_check_len

2023-06-19 Thread Jan Hubicka via Gcc-patches
> On Mon, 19 Jun 2023 at 12:20, Jakub Jelinek wrote:
> 
> > On Mon, Jun 19, 2023 at 01:05:36PM +0200, Jan Hubicka via Gcc-patches
> > wrote:
> > > - if (max_size() - size() < __n)
> > > -   __throw_length_error(__N(__s));
> > > + const size_type __max_size = max_size();
> > > + // On 64bit systems vectors can not reach overflow by growing
> > > + // by small sizes; before this happens, we will run out of memory.
> > > + if (__builtin_constant_p(__n)
> > > + && __builtin_constant_p(__max_size)
> > > + && sizeof(ptrdiff_t) >= 8
> > > + && __max_size * sizeof(_Tp) >= ((ptrdiff_t)1 << 60)
> >
> > Isn't there a risk of overflow in the __max_size * sizeof(_Tp) computation?
> >
> 
> For std::allocator, no, because max_size() is size_t(-1) / sizeof(_Tp). But
> for a user-defined allocator that has a silly max_size(), yes, that's
> possible.
> 
> I still don't really understand why any change is needed here. The PR says
> that the current _M_check_len brings in the EH code, but how/why does that
> happen? The __throw_length_error function is not inline, it's defined in
> libstdc++.so, so why isn't it just an extern call? Is the problem that it

It is really a quite interesting performance problem which does affect real
code.  An extra extern call counts (especially since it is seen as
3 calls by the inliner).

This is _M_check_len after early optimizations (so as seen by inline
heuristics):

  <bb 2> [local count: 1073741824]:
  _15 = this_7(D)->D.26656._M_impl.D.25963._M_finish;
  _14 = this_7(D)->D.26656._M_impl.D.25963._M_start;
  _13 = _15 - _14;
  _10 = _13 /[ex] 8;
  _8 = (long unsigned int) _10;
  _1 = 1152921504606846975 - _8;
  __n.3_2 = __n;
  if (_1 < __n.3_2)
    goto <bb 3>; [0.04%]
  else
    goto <bb 4>; [99.96%]

  <bb 3> [local count: 429496]:
  std::__throw_length_error (__s_16(D));

  <bb 4> [local count: 1073312329]:
  D.27696 = _8;
  if (__n.3_2 > _8)
    goto <bb 5>; [34.00%]
  else
    goto <bb 6>; [66.00%]

  <bb 5> [local count: 364926196]:

  <bb 6> [local count: 1073312330]:
  # _18 = PHI <&D.27696(4), &__n(5)>
  _3 = *_18;
  __len_11 = _3 + _8;
  D.27696 ={v} {CLOBBER(eol)};
  if (_8 > __len_11)
    goto <bb 8>; [35.00%]
  else
    goto <bb 7>; [65.00%]

  <bb 7> [local count: 697653013]:
  _5 = MIN_EXPR <__len_11, 1152921504606846975>;

  <bb 8> [local count: 1073312330]:
  # iftmp.4_4 = PHI <1152921504606846975(6), _5(7)>
  return iftmp.4_4;

So a lot of code that is essentially semantically equivalent to:

   return __size + MAX_EXPR (__n, __size)

at least with the default allocator.

The early inliner decides that it is not a good idea to early inline.
At this stage we inline mostly calls where we expect code to get
smaller after inlining, and since the function contains another
uninlinable call, this does not seem likely.

With -O3 we will inline it later at the IPA stage, but only when the code
is considered hot.
With -O2 we decide to keep it offline if the unit contains multiple
calls to the function; otherwise we inline it, since it wins in the code
size estimation model.

The problem is that _M_check_len is used by _M_realloc_insert, which later
feeds the result to the allocator.  There is extra redundancy since the
allocator can call std::__throw_bad_array_new_length and
std::__throw_bad_alloc for bad sizes, but _M_check_len will not produce
them, which is something we won't work out before inlining it.

As a result _M_realloc_insert is seen as a very large function by the
inliner heuristics (71 instructions).  Functions that are not
declared inline are inlined if smaller than 15 instructions with -O2
and 30 instructions with -O3, so we don't inline.

This hurts common loops that use a vector as a stack and call push_back
in the inner loop.  Not inlining prevents SRA and we end up saving and
loading the end-of-vector pointer on every iteration of the loop.

The following testcase:

#include <vector>
#include <utility>

typedef unsigned int uint32_t;
std::pair<uint32_t, uint32_t> pair;
void
test()
{
    std::vector<std::pair<uint32_t, uint32_t>> stack;
    stack.push_back (pair);
    while (!stack.empty()) {
        std::pair<uint32_t, uint32_t> cur = stack.back();
        stack.pop_back();
        if (!cur.first)
        {
            cur.second++;
            stack.push_back (cur);
        }
        if (cur.second > 1)
            break;
    }
}
int
main()
{
    for (int i = 0; i < 1; i++)
        test();
}

Runs for me in 0.5s with _M_realloc_insert not inlined and 0.048s with
_M_realloc_insert inlined.  Clang inlines it even at -O2 and does
0.063s.  I believe it is the reason why the jpegxl library is slower
when built with GCC, and since such loops are quite common in, say, a
DFS walk, I think it is a frequent problem.
> makes _M_check_len potentially-throwing? Because that's basically the
> entire point of _M_check_len: to thr

Re: [libstdc++] Improve M_check_len

2023-06-19 Thread Jan Hubicka via Gcc-patches
> > -   if (max_size() - size() < __n)
> > - __throw_length_error(__N(__s));
> > +   // On 64bit systems vectors of small sizes can not
> > +   // reach overflow by growing by small sizes; before
> > +   // this happens, we will run out of memory.
> > +   if (__builtin_constant_p (sizeof (_Tp))
> >
> 
> This shouldn't be here, of course sizeof is a constant.
OK :)
> 
> No space before the opening parens, libstdc++ doesn't follow GNU style.
Fixed.
> 
> 
> 
> > +   && __builtin_constant_p (__n)
> > +   && sizeof (ptrdiff_t) >= 8
> > +   && __n < max_size () / 2)
> >
> 
> This check is not OK. As I said in Bugzilla just now, max_size() depends on
> the allocator, which could return something much smaller than PTRDIFF_MAX.
> You can't make this assumption for all specializations of std::vector.
> 
> If Alloc::max_size() == 100 and this->size() == 100 then this function
> needs to throw length_error for *any* n. In the general case you cannot
> remove size() from this condition.
> 
> For std::allocator it's safe to assume that max_size() is related to
> PTRDIFF_MAX/sizeof(T), but this patch would apply to all allocators.

Here is an updated version.  I simply apply __builtin_constant_p to max_size
and test that it is large enough.  For that we need to copy it into a
temporary variable, since we fold __builtin_constant_p (function (x))
early, before the function gets inlined.

I also added __builtin_unreachable to determine the return value range,
as discussed in the PR.
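
A hedged illustration of the folding pitfall mentioned above (the functions
are made up; the behaviour shown is the one described in the paragraph
above, not a general guarantee):

// Applied directly to a call, __builtin_constant_p gets folded before
// inlining can expose the constant; applied to a local copy of the value,
// it still has a chance to become true once the call is inlined.
inline unsigned long cap () { return 1024; }

bool direct ()   { return __builtin_constant_p (cap ()); }  // folded too early

bool via_copy ()
{
  const unsigned long __c = cap ();
  return __builtin_constant_p (__c);   // can still fold to true after inlining
}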

Honza

diff --git a/libstdc++-v3/include/bits/stl_vector.h b/libstdc++-v3/include/bits/stl_vector.h
index 70ced3d101f..7a1966405ca 100644
--- a/libstdc++-v3/include/bits/stl_vector.h
+++ b/libstdc++-v3/include/bits/stl_vector.h
@@ -1895,11 +1895,29 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   size_type
   _M_check_len(size_type __n, const char* __s) const
   {
-   if (max_size() - size() < __n)
- __throw_length_error(__N(__s));
+   const size_type __max_size = max_size();
+   // On 64bit systems vectors can not reach overflow by growing
+   // by small sizes; before this happens, we will run out of memory.
+   if (__builtin_constant_p(__n)
+   && __builtin_constant_p(__max_size)
+   && sizeof(ptrdiff_t) >= 8
+   && __max_size * sizeof(_Tp) >= ((ptrdiff_t)1 << 60)
+   && __n < __max_size / 2)
+ {
+   const size_type __len = size() + (std::max)(size(), __n);
+   // let compiler know that __len has sane value range.
+   if (__len < __n || __len >= __max_size)
+ __builtin_unreachable();
+   return __len;
+ }
+   else
+ {
+   if (__max_size - size() < __n)
+ __throw_length_error(__N(__s));
 
-   const size_type __len = size() + (std::max)(size(), __n);
-   return (__len < size() || __len > max_size()) ? max_size() : __len;
+   const size_type __len = size() + (std::max)(size(), __n);
+   return (__len < size() || __len > __max_size) ? __max_size : __len;
+ }
   }
 
   // Called by constructors to check initial size.

