Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-06-19 Thread Teresa Johnson
Ping.
Teresa

On Fri, May 18, 2012 at 7:21 AM, Teresa Johnson tejohn...@google.com wrote:
 Ping?
 Teresa

 On Fri, May 11, 2012 at 6:11 AM, Teresa Johnson tejohn...@google.com wrote:
 Ping?
 Teresa

 On Fri, May 4, 2012 at 3:41 PM, Teresa Johnson tejohn...@google.com wrote:

 On David's suggestion, I have removed the changes that rename niter_desc
 to
 loop_desc from this patch to focus the patch on the unrolling changes. I
 can
 submit a cleanup patch to do the renaming as soon as this one goes in.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Ok for trunk?

 Thanks,
 Teresa

 Here is the new description of improvements from the original patch:

 Improved patch based on feedback. Main changes are:

 1) Improve efficiency by caching loop analysis results in the loop
 auxiliary
 info structure hanging off the loop structure. Added a new routine,
 analyze_loop_insns, to fill in information about the average and total
 number
 of branches, as well as whether there are any floating point set and call
 instructions in the loop. The new routine is invoked when we first create
 a
 loop's niter_desc struct, and the caller (get_simple_loop_desc) has been
 modified to handle creating a niter_desc for the fake outermost loop.

 2) Improvements to max_unroll_with_branches:
 - Treat the fake outermost loop (the procedure body) as we would a hot
 outer
 loop, i.e. compute the max unroll looking at its nested branches, instead
 of
 shutting off unrolling when we reach the fake outermost loop.
 - Pull the checks previously done in the caller into the routine (e.g.
 whether the loop iterates frequently or contains fp instructions).
 - Fix a bug in the previous version that sometimes caused overflow in the
 new unroll factor.

 3) Remove float variables, and use integer computation to compute the
 average number of branches in the loop.

 4) Detect more types of floating point computations in the loop by walking
 all set instructions, not just single sets.

 2012-05-04   Teresa Johnson  tejohn...@google.com

        * doc/invoke.texi: Update the documentation with new params.
        * loop-unroll.c (max_unroll_with_branches): New function.
        (decide_unroll_constant_iterations,
 decide_unroll_runtime_iterations):
        Add heuristic to avoid increasing branch mispredicts when
 unrolling.
        (decide_peel_simple, decide_unroll_stupid): Retrieve number of
        branches from niter_desc instead of via function that walks loop.
        * loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
        function, and add guards to enable this function to work for the
        outermost loop.
        * cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
        (num_loop_branches): Remove.
        * cfgloop.h (struct loop_desc): Added new fields to cache
 additional
        loop analysis information.
        (num_loop_branches): Remove.
        (analyze_loop_insns): Declare.
        * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
        (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

 Index: doc/invoke.texi
 ===
 --- doc/invoke.texi     (revision 187013)
 +++ doc/invoke.texi     (working copy)
 @@ -8842,6 +8842,12 @@ The maximum number of insns of an unswitched loop.
  @item max-unswitch-level
  The maximum number of branches unswitched in a single loop.

 +@item min-iter-unroll-with-branches
 +Minimum iteration count to ignore branch effects when unrolling.
 +
 +@item unroll-outer-loop-branch-budget
 +Maximum number of branches allowed in hot outer loop region after unroll.
 +
  @item lim-expensive
  The minimum cost of an expensive expression in the loop invariant motion.

 Index: loop-unroll.c
 ===
 --- loop-unroll.c       (revision 187013)
 +++ loop-unroll.c       (working copy)
 @@ -152,6 +152,99 @@ static void combine_var_copies_in_loop_exit (struc
                                             basic_block);
  static rtx get_expansion (struct var_to_expand *);

 +/* Compute the maximum number of times LOOP can be unrolled without
 exceeding
 +   a branch budget, which can increase branch mispredictions. The number
 of
 +   branches is computed by weighting each branch with its expected
 execution
 +   probability through the loop based on profile data. If no profile
 feedback
 +   data exists, simply return the current NUNROLL factor.  */
 +
 +static unsigned
 +max_unroll_with_branches(struct loop *loop, unsigned nunroll)
 +{
 +  struct loop *outer;
 +  struct niter_desc *outer_desc = 0;
 +  int outer_niters = 1;
 +  int frequent_iteration_threshold;
 +  unsigned branch_budget;
 +  struct niter_desc *desc = get_simple_loop_desc (loop);
 +
 +  /* Ignore loops with FP computation as these tend to benefit much more
 +     consistently from unrolling.  */
 +  if (desc-has_fp)
 +    return nunroll;
 +
 +  frequent_iteration_threshold = PARAM_VALUE

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-05-18 Thread Teresa Johnson
Ping?
Teresa

On Fri, May 11, 2012 at 6:11 AM, Teresa Johnson tejohn...@google.com wrote:
 Ping?
 Teresa

 On Fri, May 4, 2012 at 3:41 PM, Teresa Johnson tejohn...@google.com wrote:

 On David's suggestion, I have removed the changes that rename niter_desc
 to
 loop_desc from this patch to focus the patch on the unrolling changes. I
 can
 submit a cleanup patch to do the renaming as soon as this one goes in.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Ok for trunk?

 Thanks,
 Teresa

 Here is the new description of improvements from the original patch:

 Improved patch based on feedback. Main changes are:

 1) Improve efficiency by caching loop analysis results in the loop
 auxiliary
 info structure hanging off the loop structure. Added a new routine,
 analyze_loop_insns, to fill in information about the average and total
 number
 of branches, as well as whether there are any floating point set and call
 instructions in the loop. The new routine is invoked when we first create
 a
 loop's niter_desc struct, and the caller (get_simple_loop_desc) has been
 modified to handle creating a niter_desc for the fake outermost loop.

 2) Improvements to max_unroll_with_branches:
 - Treat the fake outermost loop (the procedure body) as we would a hot
 outer
 loop, i.e. compute the max unroll looking at its nested branches, instead
 of
 shutting off unrolling when we reach the fake outermost loop.
 - Pull the checks previously done in the caller into the routine (e.g.
 whether the loop iterates frequently or contains fp instructions).
 - Fix a bug in the previous version that sometimes caused overflow in the
 new unroll factor.

 3) Remove float variables, and use integer computation to compute the
 average number of branches in the loop.

 4) Detect more types of floating point computations in the loop by walking
 all set instructions, not just single sets.

 2012-05-04   Teresa Johnson  tejohn...@google.com

        * doc/invoke.texi: Update the documentation with new params.
        * loop-unroll.c (max_unroll_with_branches): New function.
        (decide_unroll_constant_iterations,
 decide_unroll_runtime_iterations):
        Add heuristic to avoid increasing branch mispredicts when
 unrolling.
        (decide_peel_simple, decide_unroll_stupid): Retrieve number of
        branches from niter_desc instead of via function that walks loop.
        * loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
        function, and add guards to enable this function to work for the
        outermost loop.
        * cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
        (num_loop_branches): Remove.
        * cfgloop.h (struct loop_desc): Added new fields to cache
 additional
        loop analysis information.
        (num_loop_branches): Remove.
        (analyze_loop_insns): Declare.
        * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
        (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

 Index: doc/invoke.texi
 ===
 --- doc/invoke.texi     (revision 187013)
 +++ doc/invoke.texi     (working copy)
 @@ -8842,6 +8842,12 @@ The maximum number of insns of an unswitched loop.
  @item max-unswitch-level
  The maximum number of branches unswitched in a single loop.

 +@item min-iter-unroll-with-branches
 +Minimum iteration count to ignore branch effects when unrolling.
 +
 +@item unroll-outer-loop-branch-budget
 +Maximum number of branches allowed in hot outer loop region after unroll.
 +
  @item lim-expensive
  The minimum cost of an expensive expression in the loop invariant motion.

 Index: loop-unroll.c
 ===
 --- loop-unroll.c       (revision 187013)
 +++ loop-unroll.c       (working copy)
 @@ -152,6 +152,99 @@ static void combine_var_copies_in_loop_exit (struc
                                             basic_block);
  static rtx get_expansion (struct var_to_expand *);

 +/* Compute the maximum number of times LOOP can be unrolled without
 exceeding
 +   a branch budget, which can increase branch mispredictions. The number
 of
 +   branches is computed by weighting each branch with its expected
 execution
 +   probability through the loop based on profile data. If no profile
 feedback
 +   data exists, simply return the current NUNROLL factor.  */
 +
 +static unsigned
 +max_unroll_with_branches(struct loop *loop, unsigned nunroll)
 +{
 +  struct loop *outer;
 +  struct niter_desc *outer_desc = 0;
 +  int outer_niters = 1;
 +  int frequent_iteration_threshold;
 +  unsigned branch_budget;
 +  struct niter_desc *desc = get_simple_loop_desc (loop);
 +
 +  /* Ignore loops with FP computation as these tend to benefit much more
 +     consistently from unrolling.  */
 +  if (desc-has_fp)
 +    return nunroll;
 +
 +  frequent_iteration_threshold = PARAM_VALUE
 (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES);
 +  if (expected_loop_iterations (loop) = (unsigned)

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-05-11 Thread Teresa Johnson
Ping?
Teresa

On Fri, May 4, 2012 at 3:41 PM, Teresa Johnson tejohn...@google.com wrote:

 On David's suggestion, I have removed the changes that rename niter_desc
 to
 loop_desc from this patch to focus the patch on the unrolling changes. I
 can
 submit a cleanup patch to do the renaming as soon as this one goes in.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Ok for trunk?

 Thanks,
 Teresa

 Here is the new description of improvements from the original patch:

 Improved patch based on feedback. Main changes are:

 1) Improve efficiency by caching loop analysis results in the loop
 auxiliary
 info structure hanging off the loop structure. Added a new routine,
 analyze_loop_insns, to fill in information about the average and total
 number
 of branches, as well as whether there are any floating point set and call
 instructions in the loop. The new routine is invoked when we first create
 a
 loop's niter_desc struct, and the caller (get_simple_loop_desc) has been
 modified to handle creating a niter_desc for the fake outermost loop.

 2) Improvements to max_unroll_with_branches:
 - Treat the fake outermost loop (the procedure body) as we would a hot
 outer
 loop, i.e. compute the max unroll looking at its nested branches, instead
 of
 shutting off unrolling when we reach the fake outermost loop.
 - Pull the checks previously done in the caller into the routine (e.g.
 whether the loop iterates frequently or contains fp instructions).
 - Fix a bug in the previous version that sometimes caused overflow in the
 new unroll factor.

 3) Remove float variables, and use integer computation to compute the
 average number of branches in the loop.

 4) Detect more types of floating point computations in the loop by walking
 all set instructions, not just single sets.

 2012-05-04   Teresa Johnson  tejohn...@google.com

        * doc/invoke.texi: Update the documentation with new params.
        * loop-unroll.c (max_unroll_with_branches): New function.
        (decide_unroll_constant_iterations,
 decide_unroll_runtime_iterations):
        Add heuristic to avoid increasing branch mispredicts when
 unrolling.
        (decide_peel_simple, decide_unroll_stupid): Retrieve number of
        branches from niter_desc instead of via function that walks loop.
        * loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
        function, and add guards to enable this function to work for the
        outermost loop.
        * cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
        (num_loop_branches): Remove.
        * cfgloop.h (struct loop_desc): Added new fields to cache
 additional
        loop analysis information.
        (num_loop_branches): Remove.
        (analyze_loop_insns): Declare.
        * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
        (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

 Index: doc/invoke.texi
 ===
 --- doc/invoke.texi     (revision 187013)
 +++ doc/invoke.texi     (working copy)
 @@ -8842,6 +8842,12 @@ The maximum number of insns of an unswitched loop.
  @item max-unswitch-level
  The maximum number of branches unswitched in a single loop.

 +@item min-iter-unroll-with-branches
 +Minimum iteration count to ignore branch effects when unrolling.
 +
 +@item unroll-outer-loop-branch-budget
 +Maximum number of branches allowed in hot outer loop region after unroll.
 +
  @item lim-expensive
  The minimum cost of an expensive expression in the loop invariant motion.

 Index: loop-unroll.c
 ===
 --- loop-unroll.c       (revision 187013)
 +++ loop-unroll.c       (working copy)
 @@ -152,6 +152,99 @@ static void combine_var_copies_in_loop_exit (struc
                                             basic_block);
  static rtx get_expansion (struct var_to_expand *);

 +/* Compute the maximum number of times LOOP can be unrolled without
 exceeding
 +   a branch budget, which can increase branch mispredictions. The number
 of
 +   branches is computed by weighting each branch with its expected
 execution
 +   probability through the loop based on profile data. If no profile
 feedback
 +   data exists, simply return the current NUNROLL factor.  */
 +
 +static unsigned
 +max_unroll_with_branches(struct loop *loop, unsigned nunroll)
 +{
 +  struct loop *outer;
 +  struct niter_desc *outer_desc = 0;
 +  int outer_niters = 1;
 +  int frequent_iteration_threshold;
 +  unsigned branch_budget;
 +  struct niter_desc *desc = get_simple_loop_desc (loop);
 +
 +  /* Ignore loops with FP computation as these tend to benefit much more
 +     consistently from unrolling.  */
 +  if (desc-has_fp)
 +    return nunroll;
 +
 +  frequent_iteration_threshold = PARAM_VALUE
 (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES);
 +  if (expected_loop_iterations (loop) = (unsigned)
 frequent_iteration_threshold)
 +    return nunroll;
 +
 +  /* If there was no profile 

[PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-05-04 Thread Teresa Johnson
On David's suggestion, I have removed the changes that rename niter_desc to
loop_desc from this patch to focus the patch on the unrolling changes. I can
submit a cleanup patch to do the renaming as soon as this one goes in.

Bootstrapped and tested on x86_64-unknown-linux-gnu.  Ok for trunk?

Thanks,
Teresa

Here is the new description of improvements from the original patch:

Improved patch based on feedback. Main changes are:

1) Improve efficiency by caching loop analysis results in the loop auxiliary
info structure hanging off the loop structure. Added a new routine,
analyze_loop_insns, to fill in information about the average and total number
of branches, as well as whether there are any floating point set and call
instructions in the loop. The new routine is invoked when we first create a
loop's niter_desc struct, and the caller (get_simple_loop_desc) has been
modified to handle creating a niter_desc for the fake outermost loop.

2) Improvements to max_unroll_with_branches:
- Treat the fake outermost loop (the procedure body) as we would a hot outer
loop, i.e. compute the max unroll looking at its nested branches, instead of
shutting off unrolling when we reach the fake outermost loop.
- Pull the checks previously done in the caller into the routine (e.g.
whether the loop iterates frequently or contains fp instructions).
- Fix a bug in the previous version that sometimes caused overflow in the
new unroll factor.

3) Remove float variables, and use integer computation to compute the
average number of branches in the loop.

4) Detect more types of floating point computations in the loop by walking
all set instructions, not just single sets.

2012-05-04   Teresa Johnson  tejohn...@google.com

* doc/invoke.texi: Update the documentation with new params.
* loop-unroll.c (max_unroll_with_branches): New function.
(decide_unroll_constant_iterations, decide_unroll_runtime_iterations):
Add heuristic to avoid increasing branch mispredicts when unrolling.
(decide_peel_simple, decide_unroll_stupid): Retrieve number of
branches from niter_desc instead of via function that walks loop.
* loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
function, and add guards to enable this function to work for the
outermost loop.
* cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
(num_loop_branches): Remove.
* cfgloop.h (struct loop_desc): Added new fields to cache additional
loop analysis information.
(num_loop_branches): Remove.
(analyze_loop_insns): Declare.
* params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
(PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

Index: doc/invoke.texi
===
--- doc/invoke.texi (revision 187013)
+++ doc/invoke.texi (working copy)
@@ -8842,6 +8842,12 @@ The maximum number of insns of an unswitched loop.
 @item max-unswitch-level
 The maximum number of branches unswitched in a single loop.
 
+@item min-iter-unroll-with-branches
+Minimum iteration count to ignore branch effects when unrolling.
+
+@item unroll-outer-loop-branch-budget
+Maximum number of branches allowed in hot outer loop region after unroll.
+
 @item lim-expensive
 The minimum cost of an expensive expression in the loop invariant motion.
 
Index: loop-unroll.c
===
--- loop-unroll.c   (revision 187013)
+++ loop-unroll.c   (working copy)
@@ -152,6 +152,99 @@ static void combine_var_copies_in_loop_exit (struc
 basic_block);
 static rtx get_expansion (struct var_to_expand *);
 
+/* Compute the maximum number of times LOOP can be unrolled without exceeding
+   a branch budget, which can increase branch mispredictions. The number of
+   branches is computed by weighting each branch with its expected execution
+   probability through the loop based on profile data. If no profile feedback
+   data exists, simply return the current NUNROLL factor.  */
+
+static unsigned
+max_unroll_with_branches(struct loop *loop, unsigned nunroll)
+{
+  struct loop *outer;
+  struct niter_desc *outer_desc = 0;
+  int outer_niters = 1;
+  int frequent_iteration_threshold;
+  unsigned branch_budget;
+  struct niter_desc *desc = get_simple_loop_desc (loop);
+
+  /* Ignore loops with FP computation as these tend to benefit much more
+ consistently from unrolling.  */
+  if (desc-has_fp)
+return nunroll;
+
+  frequent_iteration_threshold = PARAM_VALUE 
(PARAM_MIN_ITER_UNROLL_WITH_BRANCHES);
+  if (expected_loop_iterations (loop) = (unsigned) 
frequent_iteration_threshold)
+return nunroll;
+
+  /* If there was no profile feedback data, av_num_branches will be 0
+ and we won't limit unrolling. If the av_num_branches is at most 1,
+ also don't limit unrolling as the back-edge branch will not 

[PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-05-01 Thread Teresa Johnson
Improved patch based on feedback. Main changes are:

1) Improve efficiency by caching loop analysis results in the loop auxiliary
info structure hanging off the loop structure. Renamed this structure from
niter_desc to loop_desc to reflect the broader type of information cached
in the structure. Added a new routine, analyze_loop_insns, to fill in
information about the average and total number of branches, as well
as whether there are any floating point set and call instructions in the loop.
The new routine is invoked when we first create a loop's loop_desc struct,
and the caller (get_simple_loop_desc) has been modified to handle creating
a loop_desc for the fake outermost loop.

2) Improvements to max_unroll_with_branches:
- Treat the fake outermost loop (the procedure body) as we would a hot outer
loop, i.e. compute the max unroll looking at its nested branches, instead of
shutting off unrolling when we reach the fake outermost loop.
- Pull the checks previously done in the caller into the routine (e.g.
whether the loop iterates frequently or contains fp instructions).
- Fix a bug in the previous version that sometimes caused overflow in the 
new unroll factor.

3) Remove float variables, and use integer computation to compute the
average number of branches in the loop.

4) Detect more types of floating point computations in the loop by walking
all set instructions, not just single sets.

Bootstrapped and tested on x86_64-unknown-linux-gnu.  Ok for trunk?

Thanks,
Teresa

2012-05-01   Teresa Johnson  tejohn...@google.com

* doc/invoke.texi: Update the documentation with new params.
* loop-unroll.c (max_unroll_with_branches): New function.
(loop_exit_at_end_p, decide_peel_once_rolling): Rename niter_desc
to loop_desc.
(decide_peel_completely, peel_loop_completely): Ditto.
(unroll_loop_runtime_iterations, decide_peel_simple): Ditto.
(peel_loop_simple, decide_unroll_stupid, unroll_loop_stupid): Ditto.
(decide_unroll_constant_iterations, decide_unroll_runtime_iterations):
Ditto, and add heuristic to avoid increasing branch mispredicts when
unrolling.
* loop-doloop.c (doloop_valid_p): Rename niter_desc to loop_desc.
(doloop_modify, doloop_optimize): Ditto.
* loop-iv.c (simplify_using_initial_values): Ditto.
(canonicalize_iv_subregs, determine_max_iter): Ditto.
(iv_number_of_iterations, check_simple_exit): Ditto.
(find_simple_exit, free_simple_loop_desc): Ditto.
(get_simple_loop_desc): Ditto, invoke new analyze_loop_insns
function, and add guards to enable this function to work for the
outermost loop.
* cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
* cfgloop.h (struct loop_desc): Renamed from struct niter_desc
and added new fields to cache additional loop analysis information.
(find_simple_exit, get_simple_loop_desc, simple_loop_desc): Rename
niter_desc to loop_desc.
(analyze_loop_insns): Declare.
* params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
(PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.


Index: doc/invoke.texi
===
--- doc/invoke.texi (revision 187013)
+++ doc/invoke.texi (working copy)
@@ -8842,6 +8842,12 @@ The maximum number of insns of an unswitched loop.
 @item max-unswitch-level
 The maximum number of branches unswitched in a single loop.
 
+@item min-iter-unroll-with-branches
+Minimum iteration count to ignore branch effects when unrolling.
+
+@item unroll-outer-loop-branch-budget
+Maximum number of branches allowed in hot outer loop region after unroll.
+
 @item lim-expensive
 The minimum cost of an expensive expression in the loop invariant motion.
 
Index: loop-unroll.c
===
--- loop-unroll.c   (revision 187013)
+++ loop-unroll.c   (working copy)
@@ -152,6 +152,99 @@ static void combine_var_copies_in_loop_exit (struc
 basic_block);
 static rtx get_expansion (struct var_to_expand *);
 
+/* Compute the maximum number of times LOOP can be unrolled without exceeding
+   a branch budget, which can increase branch mispredictions. The number of
+   branches is computed by weighting each branch with its expected execution
+   probability through the loop based on profile data. If no profile feedback
+   data exists, simply return the current NUNROLL factor.  */
+
+static unsigned
+max_unroll_with_branches(struct loop *loop, unsigned nunroll)
+{
+  struct loop *outer;
+  struct loop_desc *outer_desc = 0;
+  int outer_niters = 1;
+  int frequent_iteration_threshold;
+  unsigned branch_budget;
+  struct loop_desc *desc = get_simple_loop_desc (loop);
+
+  /* Ignore loops with FP computation as these tend to benefit much more
+ consistently from unrolling.  */
+  if 

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-27 Thread Igor Zamyatin
Are you sure that tree-level unrollers are turned on at O2? My
impression was that they work only at O3 or with f[unroll,peel]-loops
flags.

On Tue, Apr 24, 2012 at 6:13 PM, Andi Kleen a...@firstfloor.org wrote:
 tejohn...@google.com (Teresa Johnson) writes:

 This patch adds heuristics to limit unrolling in loops with branches
 that may increase branch mispredictions. It affects loops that are
 not frequently iterated, and that are nested within a hot region of code 
 that already contains many branch instructions.

 Performance tested with both internal benchmarks and with SPEC
 2000/2006 on a variety of Intel systems (Core2, Corei7, SandyBridge) and a 
 couple of different AMD Opteron systems.
 This improves performance of an internal search indexing benchmark by
 close to 2% on all the tested Intel platforms.  It also consistently
 improves 445.gobmk (with FDO feedback where unrolling kicks in) by
 close to 1% on AMD Opteron. Other performance effects are neutral.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

 One problem with any unrolling heuristics is currently that gcc has
 both the tree level and the rtl level unroller. The tree one is even
 on at -O3.  So if you tweak anything for one you have to affect both,
 otherwise the other may still do the wrong thing(tm).

Tree level unrollers (cunrolli and cunroll) do complete unroll. At O2,
both of them are turned on, but gcc does not allow any code growth --
which makes them pretty useless at O2 (very few loops qualify). The
default max complete peel iteration is also too low compared with both
icc and llvm.  This needs to be tuned.

David


 For some other tweaks I looked into a shared cost model some time ago.
 May be still needed.

 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-27 Thread Xinliang David Li
On Fri, Apr 27, 2012 at 12:07 AM, Igor Zamyatin izamya...@gmail.com wrote:
 Are you sure that tree-level unrollers are turned on at O2? My
 impression was that they work only at O3 or with f[unroll,peel]-loops
 flags.

yes they are on but only have effect on tiny loops with very small
trip count. With O3 or with -funroll,peel-loops, the size is allowed
to grow.

David


 On Tue, Apr 24, 2012 at 6:13 PM, Andi Kleen a...@firstfloor.org wrote:
 tejohn...@google.com (Teresa Johnson) writes:

 This patch adds heuristics to limit unrolling in loops with branches
 that may increase branch mispredictions. It affects loops that are
 not frequently iterated, and that are nested within a hot region of code 
 that already contains many branch instructions.

 Performance tested with both internal benchmarks and with SPEC
 2000/2006 on a variety of Intel systems (Core2, Corei7, SandyBridge) and a 
 couple of different AMD Opteron systems.
 This improves performance of an internal search indexing benchmark by
 close to 2% on all the tested Intel platforms.  It also consistently
 improves 445.gobmk (with FDO feedback where unrolling kicks in) by
 close to 1% on AMD Opteron. Other performance effects are neutral.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

 One problem with any unrolling heuristics is currently that gcc has
 both the tree level and the rtl level unroller. The tree one is even
 on at -O3.  So if you tweak anything for one you have to affect both,
 otherwise the other may still do the wrong thing(tm).

 Tree level unrollers (cunrolli and cunroll) do complete unroll. At O2,
 both of them are turned on, but gcc does not allow any code growth --
 which makes them pretty useless at O2 (very few loops qualify). The
 default max complete peel iteration is also too low compared with both
 icc and llvm.  This needs to be tuned.

 David


 For some other tweaks I looked into a shared cost model some time ago.
 May be still needed.

 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-25 Thread Andi Kleen
 Tree level unrollers (cunrolli and cunroll) do complete unroll. At O2,
 both of them are turned on, but gcc does not allow any code growth --
 which makes them pretty useless at O2 (very few loops qualify). The
 default max complete peel iteration is also too low compared with both
 icc and llvm.  This needs to be tuned.

I found that at -O3 (where tree unroll is on by default) there is quite a 
bit of useless unrolling. I got somewhat irritated that my printf debug loops 
were commonly unrolled.

-Andi



Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-25 Thread Richard Guenther
On Tue, Apr 24, 2012 at 11:26 PM, Teresa Johnson tejohn...@google.com wrote:
 This patch adds heuristics to limit unrolling in loops with branches that may 
 increase
 branch mispredictions. It affects loops that are not frequently iterated, and 
 that are
 nested within a hot region of code that already contains many branch 
 instructions.

 Performance tested with both internal benchmarks and with SPEC 2000/2006 on a 
 variety
 of Intel systems (Core2, Corei7, SandyBridge) and a couple of different AMD 
 Opteron systems.
 This improves performance of an internal search indexing benchmark by close 
 to 2% on
 all the tested Intel platforms.  It also consistently improves 445.gobmk 
 (with FDO feedback
 where unrolling kicks in) by close to 1% on AMD Opteron. Other performance 
 effects are
 neutral.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

 Thanks,
 Teresa

 2012-04-24   Teresa Johnson  tejohn...@google.com

        * loop-unroll.c (loop_has_call): New function.
        (loop_has_FP_comp): Ditto.
        (compute_weighted_branches): Ditto.
        (max_unroll_with_branches): Ditto.
        (decide_unroll_constant_iterations): Add heuristic to avoid
        increasing branch mispredicts when unrolling.
        (decide_unroll_runtime_iterations): Ditto.
        * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
        (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

 Index: loop-unroll.c
 ===
 --- loop-unroll.c       (revision 186783)
 +++ loop-unroll.c       (working copy)
 @@ -152,6 +152,180 @@ static void combine_var_copies_in_loop_exit (struc
                                             basic_block);
  static rtx get_expansion (struct var_to_expand *);

 +/* Determine whether LOOP contains call.  */
 +static bool
 +loop_has_call(struct loop *loop)
 +{
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);

You repeatedly do this and walk over all blocks.  Please think about
compile-time
issues when writing code.

This all looks sort-of target specific to me and I don't see why this
very specialized
patch is a good idea when unrolling does a very poor job deciding what and how
much to unroll generally.

Richard.

 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          if (CALL_P (insn))
 +            {
 +              free (body);
 +              return true;
 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}
 +
 +/* Determine whether LOOP contains floating-point computation.  */
 +static bool
 +loop_has_FP_comp(struct loop *loop)
 +{
 +  rtx set, dest;
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          set = single_set (insn);
 +          if (!set)
 +            continue;
 +
 +          dest = SET_DEST (set);
 +          if (FLOAT_MODE_P (GET_MODE (dest)))
 +            {
 +              free (body);
 +              return true;
 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}
 +
 +/* Compute the number of branches in LOOP, weighted by execution counts.  */
 +static float
 +compute_weighted_branches(struct loop *loop)
 +{
 +  int header_count = loop-header-count;
 +  unsigned i;
 +  float n;
 +  basic_block * body;
 +
 +  /* If no profile feedback data exists, don't limit unrolling  */
 +  if (header_count == 0)
 +    return 0.0;
 +
 +  gcc_assert (loop-latch != EXIT_BLOCK_PTR);
 +
 +  body = get_loop_body (loop);
 +  n = 0.0;
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      if (EDGE_COUNT (body[i]-succs) = 2)
 +        {
 +          /* If this block is executed less frequently than the header (loop
 +             entry), then it is weighted based on the ratio of times it is
 +             executed compared to the header.  */
 +          if (body[i]-count  header_count)
 +            n += ((float)body[i]-count)/header_count;
 +
 +          /* When it is executed more frequently than the header (i.e. it is
 +             in a nested inner loop), simply weight the branch at 1.0.  */
 +          else
 +            n += 1.0;
 +        }
 +    }
 +  free (body);
 +
 +  return n;
 +}
 +
 +/* Compute the maximum number of times LOOP can be unrolled without exceeding
 +   a branch budget, which can increase branch mispredictions. The number of
 +   branches is computed by weighting each branch with its expected execution
 +   probability through the loop based on profile data. If no profile feedback
 +   data exists, simply return the current NUNROLL factor.  */
 +static unsigned
 +max_unroll_with_branches(struct loop *loop, unsigned nunroll)
 +{
 +  struct loop *outer;
 +  struct niter_desc *outer_desc;
 +  int outer_niters = 1;
 +  float weighted_outer_branches = 

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-25 Thread Teresa Johnson
On Tue, Apr 24, 2012 at 4:38 PM, Steven Bosscher stevenb@gmail.com wrote:
 On Tue, Apr 24, 2012 at 11:26 PM, Teresa Johnson tejohn...@google.com wrote:

        * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
        (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

 You should add documentation for these new PARAMs to doc/invoke.texi.

Ok, will do.


 I don't really like these new PARAMs: All other loop PARAMs are based
 on the number of insns in a loop, or the maximum number of times a
 transformation is applied. Your new
 PARAM_MIN_ITER_UNROLL_WITH_BRANCHES is completely different, because
 it is a number of iterations. This makes the PARAM value feel even
 more arbitrary than all the other PARAMs to some extend already do...

That's true, they are different in what they are checking than some of
the other loop unrolling params. But I need some threshold for
determining when a loop is hot enough that its unrolled branches will
be executed frequently enough to train the branch predictor and also
where the impact on the branch prediction in the outer region of code
is less likely to matter overall. The defaults were chosen so that the
new unrolling limit should only kick in for loops that are not
iterating much anyway, and where the outer hot region has quite a few
branches.


 (The only other PARAM like that is PARAM_ALIGN_LOOP_ITERATIONS, and
 its default value also looks quite arbitrary...)


 Index: loop-unroll.c
 ===
 --- loop-unroll.c       (revision 186783)
 +++ loop-unroll.c       (working copy)
 @@ -152,6 +152,180 @@ static void combine_var_copies_in_loop_exit (struc
                                             basic_block);
  static rtx get_expansion (struct var_to_expand *);

 +/* Determine whether LOOP contains call.  */
 +static bool
 +loop_has_call(struct loop *loop)
 +{
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          if (CALL_P (insn))
 +            {
 +              free (body);
 +              return true;
 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}
 +
 +/* Determine whether LOOP contains floating-point computation.  */
 +static bool
 +loop_has_FP_comp(struct loop *loop)
 +{
 +  rtx set, dest;
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          set = single_set (insn);
 +          if (!set)
 +            continue;
 +
 +          dest = SET_DEST (set);
 +          if (FLOAT_MODE_P (GET_MODE (dest)))
 +            {
 +              free (body);
 +              return true;

 So you only detect single-set FP operations where some insns stores in
 a float mode. It wouldn't be very difficult to just walk over all sets
 and look for float modes. This is also necessary e.g. for x87 sincos,
 as well as various insns on other machines. Your comments say you
 don't want to apply the new heuristic to loops containing FP
 operations because these loops usually benefit more from unrolling.
 Therefore, you should IMHO look at non-single_set() insns also here,
 to avoid applying the heuristics to loops containing non-single_set()
 FP insns.

Ok, thanks for the suggestion, I will expand this for the next version
of the patch.



 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}

 Nit: You are calling loop_has_call and loop_has_FP_comp() twice on
 each loop (first for constant iterations and next for runtime
 iterations),

I don't think that is true for loop_has_FP_comp, since it is called in
decide_unroll_constant_iterations and decide_unroll_runtime_iterations
just after we have checked if the loop has a constant number of
iterations, and returned early depending on the result of this check
and which routine we are in. So each inner loop will only reach the
call to loop_has_FP_comp in one of these routines.

In the case of loop_has_call, which is only called for a hot outer
loop, it is true we could invoke that more than once. That would
happen if a hot outer loop contains more than one nested inner loop
with a small iteration count and branches that we attempt to unroll
(it is called at most once per inner loop that we attempt to unroll).
I thought about attempting to cache this info for the outer loop in
the structure returned by get_simple_loop_desc() as you also suggest
below. I was concerned that currently this returns an niter_desc
structure which holds info about the # iterations, and this
information doesn't fit into that category. However, I could go ahead
and add it to that structure and perhaps rename the structure to
something more generic like loop_desc. What do you think?

The other issue is that we don't need this 

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-25 Thread Teresa Johnson
On Tue, Apr 24, 2012 at 6:13 PM, Andi Kleen a...@firstfloor.org wrote:
 tejohn...@google.com (Teresa Johnson) writes:

 This patch adds heuristics to limit unrolling in loops with branches that 
 may increase
 branch mispredictions. It affects loops that are not frequently iterated, 
 and that are
 nested within a hot region of code that already contains many branch 
 instructions.

 Performance tested with both internal benchmarks and with SPEC 2000/2006 on 
 a variety
 of Intel systems (Core2, Corei7, SandyBridge) and a couple of different AMD 
 Opteron systems.
 This improves performance of an internal search indexing benchmark by close 
 to 2% on
 all the tested Intel platforms.  It also consistently improves 445.gobmk 
 (with FDO feedback
 where unrolling kicks in) by close to 1% on AMD Opteron. Other performance 
 effects are
 neutral.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

 One problem with any unrolling heuristics is currently that gcc has both
 the tree level and the rtl level unroller. The tree one is even on at
 -O3.  So if you tweak anything for one you have to affect both, otherwise the
 other may still do the wrong thing(tm).

It's true that the tree level unroller could benefit from taking
branch mispredict effects into account as well. But since that is only
performing full unrolling of constant trip count loops I suspect that
there will be additional things that need to be considered, such as
whether the full unrolling enables better optimization in the
surrounding code/loop. Hence I wanted to tackle that later.


 For some other tweaks I looked into a shared cost model some time ago.
 May be still needed.

Yes, I think it would be good to unify some of the profitability
checks between the two unrolling passes, or at least between the tree
and rtl level full unrollers/peelers.

Teresa


 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only



-- 
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413


Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-25 Thread Teresa Johnson
On Wed, Apr 25, 2012 at 2:03 AM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Tue, Apr 24, 2012 at 11:26 PM, Teresa Johnson tejohn...@google.com wrote:
 This patch adds heuristics to limit unrolling in loops with branches that 
 may increase
 branch mispredictions. It affects loops that are not frequently iterated, 
 and that are
 nested within a hot region of code that already contains many branch 
 instructions.

 Performance tested with both internal benchmarks and with SPEC 2000/2006 on 
 a variety
 of Intel systems (Core2, Corei7, SandyBridge) and a couple of different AMD 
 Opteron systems.
 This improves performance of an internal search indexing benchmark by close 
 to 2% on
 all the tested Intel platforms.  It also consistently improves 445.gobmk 
 (with FDO feedback
 where unrolling kicks in) by close to 1% on AMD Opteron. Other performance 
 effects are
 neutral.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

 Thanks,
 Teresa

 2012-04-24   Teresa Johnson  tejohn...@google.com

        * loop-unroll.c (loop_has_call): New function.
        (loop_has_FP_comp): Ditto.
        (compute_weighted_branches): Ditto.
        (max_unroll_with_branches): Ditto.
        (decide_unroll_constant_iterations): Add heuristic to avoid
        increasing branch mispredicts when unrolling.
        (decide_unroll_runtime_iterations): Ditto.
        * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
        (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

 Index: loop-unroll.c
 ===
 --- loop-unroll.c       (revision 186783)
 +++ loop-unroll.c       (working copy)
 @@ -152,6 +152,180 @@ static void combine_var_copies_in_loop_exit (struc
                                             basic_block);
  static rtx get_expansion (struct var_to_expand *);

 +/* Determine whether LOOP contains call.  */
 +static bool
 +loop_has_call(struct loop *loop)
 +{
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);

 You repeatedly do this and walk over all blocks.  Please think about
 compile-time
 issues when writing code.

See my response to Steven where I address this issue and mention some
approaches to reducing the loop body walks. Please let me know if you
have any feedback on that.


 This all looks sort-of target specific to me and I don't see why this
 very specialized
 patch is a good idea when unrolling does a very poor job deciding what and how
 much to unroll generally.

I am hoping this will improve upon the job the unroller does in
deciding when/how to unroll. I didn't think that it was too target
specific as branch mispredictions could affect many targets. Note that
there are already some much more basic checks for the branch
misprediction effects in both decide_peel_simple and
decide_unroll_stupid, for example:

  /* Do not simply peel loops with branches inside -- it increases number
 of mispredicts.  */
  if (num_loop_branches (loop)  1)
{
  if (dump_file)
fprintf (dump_file, ;; Not peeling, contains branches\n);
  return;
}

It is possible that both of these checks could be made less aggressive
using the approach in this patch, which affects many more loops and
hence I am trying to add some more intelligent checking of whether
branch mispredicts might be triggered.

 Thanks,
Teresa


 Richard.

 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          if (CALL_P (insn))
 +            {
 +              free (body);
 +              return true;
 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}
 +
 +/* Determine whether LOOP contains floating-point computation.  */
 +static bool
 +loop_has_FP_comp(struct loop *loop)
 +{
 +  rtx set, dest;
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          set = single_set (insn);
 +          if (!set)
 +            continue;
 +
 +          dest = SET_DEST (set);
 +          if (FLOAT_MODE_P (GET_MODE (dest)))
 +            {
 +              free (body);
 +              return true;
 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}
 +
 +/* Compute the number of branches in LOOP, weighted by execution counts.  */
 +static float
 +compute_weighted_branches(struct loop *loop)
 +{
 +  int header_count = loop-header-count;
 +  unsigned i;
 +  float n;
 +  basic_block * body;
 +
 +  /* If no profile feedback data exists, don't limit unrolling  */
 +  if (header_count == 0)
 +    return 0.0;
 +
 +  gcc_assert (loop-latch != EXIT_BLOCK_PTR);
 +
 +  body = get_loop_body (loop);
 +  n = 0.0;
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      if (EDGE_COUNT (body[i]-succs) = 2)
 +        {
 +          /* If this block is 

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-25 Thread Xinliang David Li
I think the general mechanism applies to most of the targets. What is
needed is target specific parameter (branch budget) tuning which can
be done separately -- there exist a way to do that already.

David

On Wed, Apr 25, 2012 at 2:03 AM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Tue, Apr 24, 2012 at 11:26 PM, Teresa Johnson tejohn...@google.com wrote:
 This patch adds heuristics to limit unrolling in loops with branches that 
 may increase
 branch mispredictions. It affects loops that are not frequently iterated, 
 and that are
 nested within a hot region of code that already contains many branch 
 instructions.

 Performance tested with both internal benchmarks and with SPEC 2000/2006 on 
 a variety
 of Intel systems (Core2, Corei7, SandyBridge) and a couple of different AMD 
 Opteron systems.
 This improves performance of an internal search indexing benchmark by close 
 to 2% on
 all the tested Intel platforms.  It also consistently improves 445.gobmk 
 (with FDO feedback
 where unrolling kicks in) by close to 1% on AMD Opteron. Other performance 
 effects are
 neutral.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

 Thanks,
 Teresa

 2012-04-24   Teresa Johnson  tejohn...@google.com

        * loop-unroll.c (loop_has_call): New function.
        (loop_has_FP_comp): Ditto.
        (compute_weighted_branches): Ditto.
        (max_unroll_with_branches): Ditto.
        (decide_unroll_constant_iterations): Add heuristic to avoid
        increasing branch mispredicts when unrolling.
        (decide_unroll_runtime_iterations): Ditto.
        * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
        (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

 Index: loop-unroll.c
 ===
 --- loop-unroll.c       (revision 186783)
 +++ loop-unroll.c       (working copy)
 @@ -152,6 +152,180 @@ static void combine_var_copies_in_loop_exit (struc
                                             basic_block);
  static rtx get_expansion (struct var_to_expand *);

 +/* Determine whether LOOP contains call.  */
 +static bool
 +loop_has_call(struct loop *loop)
 +{
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);

 You repeatedly do this and walk over all blocks.  Please think about
 compile-time
 issues when writing code.

 This all looks sort-of target specific to me and I don't see why this
 very specialized
 patch is a good idea when unrolling does a very poor job deciding what and how
 much to unroll generally.

 Richard.

 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          if (CALL_P (insn))
 +            {
 +              free (body);
 +              return true;
 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}
 +
 +/* Determine whether LOOP contains floating-point computation.  */
 +static bool
 +loop_has_FP_comp(struct loop *loop)
 +{
 +  rtx set, dest;
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          set = single_set (insn);
 +          if (!set)
 +            continue;
 +
 +          dest = SET_DEST (set);
 +          if (FLOAT_MODE_P (GET_MODE (dest)))
 +            {
 +              free (body);
 +              return true;
 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}
 +
 +/* Compute the number of branches in LOOP, weighted by execution counts.  */
 +static float
 +compute_weighted_branches(struct loop *loop)
 +{
 +  int header_count = loop-header-count;
 +  unsigned i;
 +  float n;
 +  basic_block * body;
 +
 +  /* If no profile feedback data exists, don't limit unrolling  */
 +  if (header_count == 0)
 +    return 0.0;
 +
 +  gcc_assert (loop-latch != EXIT_BLOCK_PTR);
 +
 +  body = get_loop_body (loop);
 +  n = 0.0;
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      if (EDGE_COUNT (body[i]-succs) = 2)
 +        {
 +          /* If this block is executed less frequently than the header (loop
 +             entry), then it is weighted based on the ratio of times it is
 +             executed compared to the header.  */
 +          if (body[i]-count  header_count)
 +            n += ((float)body[i]-count)/header_count;
 +
 +          /* When it is executed more frequently than the header (i.e. it is
 +             in a nested inner loop), simply weight the branch at 1.0.  */
 +          else
 +            n += 1.0;
 +        }
 +    }
 +  free (body);
 +
 +  return n;
 +}
 +
 +/* Compute the maximum number of times LOOP can be unrolled without 
 exceeding
 +   a branch budget, which can increase branch mispredictions. The number of
 +   branches is computed by weighting each branch with its expected execution
 +   probability through the loop based on 

[PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-24 Thread Teresa Johnson
This patch adds heuristics to limit unrolling in loops with branches that may 
increase
branch mispredictions. It affects loops that are not frequently iterated, and 
that are
nested within a hot region of code that already contains many branch 
instructions.

Performance tested with both internal benchmarks and with SPEC 2000/2006 on a 
variety
of Intel systems (Core2, Corei7, SandyBridge) and a couple of different AMD 
Opteron systems.
This improves performance of an internal search indexing benchmark by close to 
2% on
all the tested Intel platforms.  It also consistently improves 445.gobmk (with 
FDO feedback
where unrolling kicks in) by close to 1% on AMD Opteron. Other performance 
effects are
neutral.

Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

Thanks,
Teresa

2012-04-24   Teresa Johnson  tejohn...@google.com

* loop-unroll.c (loop_has_call): New function.
(loop_has_FP_comp): Ditto.
(compute_weighted_branches): Ditto.
(max_unroll_with_branches): Ditto.
(decide_unroll_constant_iterations): Add heuristic to avoid
increasing branch mispredicts when unrolling.
(decide_unroll_runtime_iterations): Ditto.
* params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
(PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

Index: loop-unroll.c
===
--- loop-unroll.c   (revision 186783)
+++ loop-unroll.c   (working copy)
@@ -152,6 +152,180 @@ static void combine_var_copies_in_loop_exit (struc
 basic_block);
 static rtx get_expansion (struct var_to_expand *);
 
+/* Determine whether LOOP contains call.  */
+static bool
+loop_has_call(struct loop *loop)
+{
+  basic_block *body, bb;
+  unsigned i;
+  rtx insn;
+
+  body = get_loop_body (loop);
+  for (i = 0; i  loop-num_nodes; i++)
+{
+  bb = body[i];
+
+  FOR_BB_INSNS (bb, insn)
+{
+  if (CALL_P (insn))
+{
+  free (body);
+  return true;
+}
+}
+}
+  free (body);
+  return false;
+}
+
+/* Determine whether LOOP contains floating-point computation.  */
+static bool
+loop_has_FP_comp(struct loop *loop)
+{
+  rtx set, dest;
+  basic_block *body, bb;
+  unsigned i;
+  rtx insn;
+
+  body = get_loop_body (loop);
+  for (i = 0; i  loop-num_nodes; i++)
+{
+  bb = body[i];
+
+  FOR_BB_INSNS (bb, insn)
+{
+  set = single_set (insn);
+  if (!set)
+continue;
+
+  dest = SET_DEST (set);
+  if (FLOAT_MODE_P (GET_MODE (dest)))
+{
+  free (body);
+  return true;
+}
+}
+}
+  free (body);
+  return false;
+}
+
+/* Compute the number of branches in LOOP, weighted by execution counts.  */
+static float
+compute_weighted_branches(struct loop *loop)
+{
+  int header_count = loop-header-count;
+  unsigned i;
+  float n;
+  basic_block * body;
+
+  /* If no profile feedback data exists, don't limit unrolling  */
+  if (header_count == 0)
+return 0.0;
+
+  gcc_assert (loop-latch != EXIT_BLOCK_PTR);
+
+  body = get_loop_body (loop);
+  n = 0.0;
+  for (i = 0; i  loop-num_nodes; i++)
+{
+  if (EDGE_COUNT (body[i]-succs) = 2)
+{
+  /* If this block is executed less frequently than the header (loop
+ entry), then it is weighted based on the ratio of times it is
+ executed compared to the header.  */
+  if (body[i]-count  header_count)
+n += ((float)body[i]-count)/header_count;
+
+  /* When it is executed more frequently than the header (i.e. it is
+ in a nested inner loop), simply weight the branch at 1.0.  */
+  else
+n += 1.0;
+}
+}
+  free (body);
+
+  return n;
+}
+
+/* Compute the maximum number of times LOOP can be unrolled without exceeding
+   a branch budget, which can increase branch mispredictions. The number of
+   branches is computed by weighting each branch with its expected execution
+   probability through the loop based on profile data. If no profile feedback
+   data exists, simply return the current NUNROLL factor.  */
+static unsigned
+max_unroll_with_branches(struct loop *loop, unsigned nunroll)
+{
+  struct loop *outer;
+  struct niter_desc *outer_desc;
+  int outer_niters = 1;
+  float weighted_outer_branches = 0.0;
+  float weighted_num_branches = compute_weighted_branches (loop);
+
+  /* If there was no profile feedback data, weighted_num_branches will be 0.0
+ and we won't limit unrolling. If the weighted_num_branches is at most 1.0,
+ also don't limit unrolling as the back-edge branch will not be 
duplicated.  */
+  if (weighted_num_branches = 1.0)
+return nunroll;
+
+  /* Walk up the loop tree until we find a hot outer loop in which the current
+ loop is nested. At that point we will compute the number 

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-24 Thread Andrew Pinski
On Tue, Apr 24, 2012 at 2:26 PM, Teresa Johnson tejohn...@google.com wrote:
 This patch adds heuristics to limit unrolling in loops with branches that may 
 increase
 branch mispredictions. It affects loops that are not frequently iterated, and 
 that are
 nested within a hot region of code that already contains many branch 
 instructions.

 Performance tested with both internal benchmarks and with SPEC 2000/2006 on a 
 variety
 of Intel systems (Core2, Corei7, SandyBridge) and a couple of different AMD 
 Opteron systems.
 This improves performance of an internal search indexing benchmark by close 
 to 2% on
 all the tested Intel platforms.  It also consistently improves 445.gobmk 
 (with FDO feedback
 where unrolling kicks in) by close to 1% on AMD Opteron. Other performance 
 effects are
 neutral.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

 Thanks,
 Teresa

 2012-04-24   Teresa Johnson  tejohn...@google.com

        * loop-unroll.c (loop_has_call): New function.
        (loop_has_FP_comp): Ditto.
        (compute_weighted_branches): Ditto.
        (max_unroll_with_branches): Ditto.
        (decide_unroll_constant_iterations): Add heuristic to avoid
        increasing branch mispredicts when unrolling.
        (decide_unroll_runtime_iterations): Ditto.
        * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
        (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

 Index: loop-unroll.c
 ===
 --- loop-unroll.c       (revision 186783)
 +++ loop-unroll.c       (working copy)
 @@ -152,6 +152,180 @@ static void combine_var_copies_in_loop_exit (struc
                                             basic_block);
  static rtx get_expansion (struct var_to_expand *);

 +/* Determine whether LOOP contains call.  */
 +static bool
 +loop_has_call(struct loop *loop)
 +{
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          if (CALL_P (insn))
 +            {
 +              free (body);
 +              return true;
 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}
 +
 +/* Determine whether LOOP contains floating-point computation.  */
 +static bool
 +loop_has_FP_comp(struct loop *loop)
 +{
 +  rtx set, dest;
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          set = single_set (insn);
 +          if (!set)
 +            continue;
 +
 +          dest = SET_DEST (set);
 +          if (FLOAT_MODE_P (GET_MODE (dest)))
 +            {
 +              free (body);
 +              return true;
 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}
 +
 +/* Compute the number of branches in LOOP, weighted by execution counts.  */
 +static float
 +compute_weighted_branches(struct loop *loop)
 +{
 +  int header_count = loop-header-count;
 +  unsigned i;
 +  float n;
 +  basic_block * body;
 +
 +  /* If no profile feedback data exists, don't limit unrolling  */
 +  if (header_count == 0)
 +    return 0.0;
 +
 +  gcc_assert (loop-latch != EXIT_BLOCK_PTR);
 +
 +  body = get_loop_body (loop);
 +  n = 0.0;
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      if (EDGE_COUNT (body[i]-succs) = 2)
 +        {
 +          /* If this block is executed less frequently than the header (loop
 +             entry), then it is weighted based on the ratio of times it is
 +             executed compared to the header.  */
 +          if (body[i]-count  header_count)
 +            n += ((float)body[i]-count)/header_count;

Please don't introduce more floating point usage into the compiler
since it could change between different hosts (sse vs x87 for an
example).
Maybe use a fixed point multiply of 1000 (note use a macro for this
special value though)  like what is used in the rest of the predict
code.


Thanks,
Andrew Pinski


 +
 +          /* When it is executed more frequently than the header (i.e. it is
 +             in a nested inner loop), simply weight the branch at 1.0.  */
 +          else
 +            n += 1.0;
 +        }
 +    }
 +  free (body);
 +
 +  return n;
 +}
 +
 +/* Compute the maximum number of times LOOP can be unrolled without exceeding
 +   a branch budget, which can increase branch mispredictions. The number of
 +   branches is computed by weighting each branch with its expected execution
 +   probability through the loop based on profile data. If no profile feedback
 +   data exists, simply return the current NUNROLL factor.  */
 +static unsigned
 +max_unroll_with_branches(struct loop *loop, unsigned nunroll)
 +{
 +  struct loop *outer;
 +  struct niter_desc *outer_desc;
 +  int outer_niters = 1;
 +  float weighted_outer_branches = 

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-24 Thread Teresa Johnson
Resending my response in plain text so it will go through to gcc-patches...

On Tue, Apr 24, 2012 at 2:36 PM, Teresa Johnson tejohn...@google.com wrote:



 On Tue, Apr 24, 2012 at 2:30 PM, Andrew Pinski pins...@gmail.com wrote:

 On Tue, Apr 24, 2012 at 2:26 PM, Teresa Johnson tejohn...@google.com wrote:
  This patch adds heuristics to limit unrolling in loops with branches that 
  may increase
  branch mispredictions. It affects loops that are not frequently iterated, 
  and that are
  nested within a hot region of code that already contains many branch 
  instructions.
 
  Performance tested with both internal benchmarks and with SPEC 2000/2006 
  on a variety
  of Intel systems (Core2, Corei7, SandyBridge) and a couple of different 
  AMD Opteron systems.
  This improves performance of an internal search indexing benchmark by 
  close to 2% on
  all the tested Intel platforms.  It also consistently improves 445.gobmk 
  (with FDO feedback
  where unrolling kicks in) by close to 1% on AMD Opteron. Other performance 
  effects are
  neutral.
 
  Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?
 
  Thanks,
  Teresa
 
  2012-04-24   Teresa Johnson  tejohn...@google.com
 
         * loop-unroll.c (loop_has_call): New function.
         (loop_has_FP_comp): Ditto.
         (compute_weighted_branches): Ditto.
         (max_unroll_with_branches): Ditto.
         (decide_unroll_constant_iterations): Add heuristic to avoid
         increasing branch mispredicts when unrolling.
         (decide_unroll_runtime_iterations): Ditto.
         * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
         (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.
 
  Index: loop-unroll.c
  ===
  --- loop-unroll.c       (revision 186783)
  +++ loop-unroll.c       (working copy)
  @@ -152,6 +152,180 @@ static void combine_var_copies_in_loop_exit (struc
                                              basic_block);
   static rtx get_expansion (struct var_to_expand *);
 
  +/* Determine whether LOOP contains call.  */
  +static bool
  +loop_has_call(struct loop *loop)
  +{
  +  basic_block *body, bb;
  +  unsigned i;
  +  rtx insn;
  +
  +  body = get_loop_body (loop);
  +  for (i = 0; i  loop-num_nodes; i++)
  +    {
  +      bb = body[i];
  +
  +      FOR_BB_INSNS (bb, insn)
  +        {
  +          if (CALL_P (insn))
  +            {
  +              free (body);
  +              return true;
  +            }
  +        }
  +    }
  +  free (body);
  +  return false;
  +}
  +
  +/* Determine whether LOOP contains floating-point computation.  */
  +static bool
  +loop_has_FP_comp(struct loop *loop)
  +{
  +  rtx set, dest;
  +  basic_block *body, bb;
  +  unsigned i;
  +  rtx insn;
  +
  +  body = get_loop_body (loop);
  +  for (i = 0; i  loop-num_nodes; i++)
  +    {
  +      bb = body[i];
  +
  +      FOR_BB_INSNS (bb, insn)
  +        {
  +          set = single_set (insn);
  +          if (!set)
  +            continue;
  +
  +          dest = SET_DEST (set);
  +          if (FLOAT_MODE_P (GET_MODE (dest)))
  +            {
  +              free (body);
  +              return true;
  +            }
  +        }
  +    }
  +  free (body);
  +  return false;
  +}
  +
  +/* Compute the number of branches in LOOP, weighted by execution counts.  
  */
  +static float
  +compute_weighted_branches(struct loop *loop)
  +{
  +  int header_count = loop-header-count;
  +  unsigned i;
  +  float n;
  +  basic_block * body;
  +
  +  /* If no profile feedback data exists, don't limit unrolling  */
  +  if (header_count == 0)
  +    return 0.0;
  +
  +  gcc_assert (loop-latch != EXIT_BLOCK_PTR);
  +
  +  body = get_loop_body (loop);
  +  n = 0.0;
  +  for (i = 0; i  loop-num_nodes; i++)
  +    {
  +      if (EDGE_COUNT (body[i]-succs) = 2)
  +        {
  +          /* If this block is executed less frequently than the header 
  (loop
  +             entry), then it is weighted based on the ratio of times it is
  +             executed compared to the header.  */
  +          if (body[i]-count  header_count)
  +            n += ((float)body[i]-count)/header_count;

 Please don't introduce more floating point usage into the compiler
 since it could change between different hosts (sse vs x87 for an
 example).
 Maybe use a fixed point multiply of 1000 (note use a macro for this
 special value though)  like what is used in the rest of the predict
 code.


 Ok, got it. I will address this in the next version of the patch.

 Thanks,
 Teresa




 Thanks,
 Andrew Pinski


  +
  +          /* When it is executed more frequently than the header (i.e. it 
  is
  +             in a nested inner loop), simply weight the branch at 1.0.  */
  +          else
  +            n += 1.0;
  +        }
  +    }
  +  free (body);
  +
  +  return n;
  +}
  +
  +/* Compute the maximum number of times LOOP can be unrolled without 
  exceeding
  +   a branch budget, which 

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-24 Thread Steven Bosscher
On Tue, Apr 24, 2012 at 11:26 PM, Teresa Johnson tejohn...@google.com wrote:

        * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
        (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.

You should add documentation for these new PARAMs to doc/invoke.texi.

I don't really like these new PARAMs: All other loop PARAMs are based
on the number of insns in a loop, or the maximum number of times a
transformation is applied. Your new
PARAM_MIN_ITER_UNROLL_WITH_BRANCHES is completely different, because
it is a number of iterations. This makes the PARAM value feel even
more arbitrary than all the other PARAMs to some extend already do...

(The only other PARAM like that is PARAM_ALIGN_LOOP_ITERATIONS, and
its default value also looks quite arbitrary...)


 Index: loop-unroll.c
 ===
 --- loop-unroll.c       (revision 186783)
 +++ loop-unroll.c       (working copy)
 @@ -152,6 +152,180 @@ static void combine_var_copies_in_loop_exit (struc
                                             basic_block);
  static rtx get_expansion (struct var_to_expand *);

 +/* Determine whether LOOP contains call.  */
 +static bool
 +loop_has_call(struct loop *loop)
 +{
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          if (CALL_P (insn))
 +            {
 +              free (body);
 +              return true;
 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}
 +
 +/* Determine whether LOOP contains floating-point computation.  */
 +static bool
 +loop_has_FP_comp(struct loop *loop)
 +{
 +  rtx set, dest;
 +  basic_block *body, bb;
 +  unsigned i;
 +  rtx insn;
 +
 +  body = get_loop_body (loop);
 +  for (i = 0; i  loop-num_nodes; i++)
 +    {
 +      bb = body[i];
 +
 +      FOR_BB_INSNS (bb, insn)
 +        {
 +          set = single_set (insn);
 +          if (!set)
 +            continue;
 +
 +          dest = SET_DEST (set);
 +          if (FLOAT_MODE_P (GET_MODE (dest)))
 +            {
 +              free (body);
 +              return true;

So you only detect single-set FP operations where some insns stores in
a float mode. It wouldn't be very difficult to just walk over all sets
and look for float modes. This is also necessary e.g. for x87 sincos,
as well as various insns on other machines. Your comments say you
don't want to apply the new heuristic to loops containing FP
operations because these loops usually benefit more from unrolling.
Therefore, you should IMHO look at non-single_set() insns also here,
to avoid applying the heuristics to loops containing non-single_set()
FP insns.


 +            }
 +        }
 +    }
 +  free (body);
 +  return false;
 +}

Nit: You are calling loop_has_call and loop_has_FP_comp() twice on
each loop (first for constant iterations and next for runtime
iterations), maybe you can fuse the functions and cache the results
(e.g. with two bitmaps, or put it in the loop description and retrieve
it with get_simple_loop_desc). Actually num_loop_branches()
could/should also be cached. I realize that the loop body walks are
probably not very expensive (and compile time probably isn't a concern
if you're using profile driven optimizations) but they do all add
up...


 +/* Compute the number of branches in LOOP, weighted by execution counts.  */
 +static float
 +compute_weighted_branches(struct loop *loop)

The floating point thing was already mentioned by Andrew. You can use
integer math instead (for examples, look for BB_FREQ_MAX e.g. in
average_num_loop_insns()).


 +  while (loop_outer(outer))
 +    {
 +      outer_desc = get_simple_loop_desc (outer);
 +
 +      if (outer_desc-const_iter)
 +        outer_niters *= outer_desc-niter;
 +      else if (outer-header-count)
 +        outer_niters *= expected_loop_iterations (outer);
 +
 +      weighted_outer_branches = compute_weighted_branches (outer);

Can you delay this computation of weighted_outer_branches call to ...

 +      /* Should have been checked by caller.  */
 +      gcc_assert(PARAM_VALUE (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES) != -1);

Should never even happen. You have set the minimum acceptable value to
0. If you managed to test this code with
PARAM_MIN_ITER_UNROLL_WITH_BRANCHES==-1, I'd like to know how (if you
can do it from the command line, there is a bug in the handling of
acceptable PARAM values :-)


 +      /* If the outer loop has enough iterations to be considered hot, then
 +         we can stop our upwards loop tree traversal and examine the current
 +         outer loop.  */
 +      if (outer_niters = PARAM_VALUE (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES))
 +        {
 +          /* Assume that any call will cause the branch budget to be 
 exceeded,
 +             and that we can't unroll the current loop without increasing
 +             mispredicts.  */
 +   

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-24 Thread Andi Kleen
tejohn...@google.com (Teresa Johnson) writes:

 This patch adds heuristics to limit unrolling in loops with branches that may 
 increase
 branch mispredictions. It affects loops that are not frequently iterated, and 
 that are
 nested within a hot region of code that already contains many branch 
 instructions.

 Performance tested with both internal benchmarks and with SPEC 2000/2006 on a 
 variety
 of Intel systems (Core2, Corei7, SandyBridge) and a couple of different AMD 
 Opteron systems.
 This improves performance of an internal search indexing benchmark by close 
 to 2% on
 all the tested Intel platforms.  It also consistently improves 445.gobmk 
 (with FDO feedback
 where unrolling kicks in) by close to 1% on AMD Opteron. Other performance 
 effects are
 neutral.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

One problem with any unrolling heuristics is currently that gcc has both
the tree level and the rtl level unroller. The tree one is even on at
-O3.  So if you tweak anything for one you have to affect both, otherwise the
other may still do the wrong thing(tm).

For some other tweaks I looked into a shared cost model some time ago.
May be still needed.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-24 Thread Xinliang David Li
On Tue, Apr 24, 2012 at 6:13 PM, Andi Kleen a...@firstfloor.org wrote:
 tejohn...@google.com (Teresa Johnson) writes:

 This patch adds heuristics to limit unrolling in loops with branches that 
 may increase
 branch mispredictions. It affects loops that are not frequently iterated, 
 and that are
 nested within a hot region of code that already contains many branch 
 instructions.

 Performance tested with both internal benchmarks and with SPEC 2000/2006 on 
 a variety
 of Intel systems (Core2, Corei7, SandyBridge) and a couple of different AMD 
 Opteron systems.
 This improves performance of an internal search indexing benchmark by close 
 to 2% on
 all the tested Intel platforms.  It also consistently improves 445.gobmk 
 (with FDO feedback
 where unrolling kicks in) by close to 1% on AMD Opteron. Other performance 
 effects are
 neutral.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

 One problem with any unrolling heuristics is currently that gcc has both
 the tree level and the rtl level unroller. The tree one is even on at
 -O3.  So if you tweak anything for one you have to affect both, otherwise the
 other may still do the wrong thing(tm).

Tree level unrollers (cunrolli and cunroll) do complete unroll. At O2,
both of them are turned on, but gcc does not allow any code growth --
which makes them pretty useless at O2 (very few loops qualify). The
default max complete peel iteration is also too low compared with both
icc and llvm.  This needs to be tuned.

David


 For some other tweaks I looked into a shared cost model some time ago.
 May be still needed.

 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only