Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-28 Thread Segher Boessenkool
Hi!

On Tue, Jul 28, 2020 at 07:55:58PM +0100, Andrea Corallo wrote:
> > [ Btw, the mailing list archive will not show your attachments (just lets
> > me download them); naming the files *.txt probably works, but you can
> > also use a sane mime type (like, text/plain) ].
> 
> [ Sure, can do, np; I'm just not sure text/x-diff is such an
> insane mime type for mailing list software :) ]

Well, first and foremost, it does not *work* with our current mailing
list setup.  Also, anything x- only makes sense if you know the software
the receiver runs (and of course you don't, this is public email, with
hundreds or thousands of readers).

> > Can you share a geomean improvement as well?  Also, is it something
> > like 0.4%, as it sounds, or is it more?
> 
> After my first measurement a colleague suggested a less noisy system
> to benchmark on and a more reproducible methodology.  I repeated the
> tests on N1 with the following results:
> 
> | Benchmark  | Est. Peak Rate ratio  |
> ||diluted / baseline |
> |+---|
> | 400.perlbench  |1.018x |
> | 401.bzip2  |1.004x |
> | 403.gcc|0.987x |
> | 429.mcf|1.000x |
> | 445.gobmk  |0.998x |
> | 456.hmmer  |1.000x |
> | 458.sjeng  |1.008x |
> | 462.libquantum |1.014x |
> | 464.h264ref|1.004x |
> | 471.omnetpp|1.017x |
> | 473.astar  |1.007x |
> | 483.xalancbmk  |0.998x |

Cool.  Well, gcc has a pretty big drop, what's up with that?  Everything
else looks just great :-)

> I was told xalancbmk tends to be very noisy because it is memory
> bound, which explains the difference; not sure why sjeng looks less good.

Sjeng is very jumpy code, so of course your patch will influence it a
lot.  No idea why it is less positive this run.

> The
> overall ratio comparing SPEC rates is +~0.44%.

Yup, a nice healthy improvement :-)


Segher


Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-28 Thread Andrea Corallo
Segher Boessenkool  writes:

> Hi!

Hi Segher,

> [ Btw, the mailing list archive will not show your attachments (just lets
> me download them); naming the files *.txt probably works, but you can
> also use a sane mime type (like, text/plain) ].

[ Sure, can do, np; I'm just not sure text/x-diff is such an
insane mime type for mailing list software :) ]

> On Wed, Jul 22, 2020 at 12:09:08PM +0200, Andrea Corallo wrote:
>> this second patch implements the AArch64 specific back-end pass
>> 'branch-dilution' controllable by the following command-line options:
>> 
>> -mbranch-dilution
>> 
>> --param=aarch64-branch-dilution-granularity={num}
>> 
>> --param=aarch64-branch-dilution-max-branches={num}
>
> That sounds like something that would be useful generically, even?
> Targets that do not want to use it can just skip it (that probably should
> be the default then), via setting the granularity to 0 for example.
> 
>> Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
>> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
>> for all the testsuite proved to be ~0.4%.
>
> Can you share a geomean improvement as well?  Also, is it something
> like 0.4%, as it sounds, or is it more?

After my first measurement a colleague suggested a less noisy system
to benchmark on and a more reproducible methodology.  I repeated the
tests on N1 with the following results:

| Benchmark  | Est. Peak Rate ratio  |
||diluted / baseline |
|+---|
| 400.perlbench  |1.018x |
| 401.bzip2  |1.004x |
| 403.gcc|0.987x |
| 429.mcf|1.000x |
| 445.gobmk  |0.998x |
| 456.hmmer  |1.000x |
| 458.sjeng  |1.008x |
| 462.libquantum |1.014x |
| 464.h264ref|1.004x |
| 471.omnetpp|1.017x |
| 473.astar  |1.007x |
| 483.xalancbmk  |0.998x |

I was told xalancbmk tends to be very noisy because it is memory bound,
which explains the difference; not sure why sjeng looks less good.  The
overall ratio comparing SPEC rates is +~0.44%.
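For reference, the geomean asked about above works out directly from the
table.  A quick Python sketch (the dict literal just transcribes the
per-benchmark ratios from the table):

```python
import math

# Per-benchmark Est. Peak Rate ratios (diluted / baseline) from the table.
ratios = {
    "400.perlbench": 1.018, "401.bzip2": 1.004, "403.gcc": 0.987,
    "429.mcf": 1.000, "445.gobmk": 0.998, "456.hmmer": 1.000,
    "458.sjeng": 1.008, "462.libquantum": 1.014, "464.h264ref": 1.004,
    "471.omnetpp": 1.017, "473.astar": 1.007, "483.xalancbmk": 0.998,
}

# Geometric mean = exp of the arithmetic mean of the logs.
geomean = math.exp(sum(map(math.log, ratios.values())) / len(ratios))
print(f"geomean ratio: {geomean:.4f}")  # ~1.0045, consistent with +~0.44%
```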

>> - Unconditional branches must be the end of a basic block, and nops
>>   cannot be outside of a basic block.  Thus the need for FILLER_INSN,
>>   which allows placement outside of a basic block
>
> But the new rtx class is recognised in just one location, so it could
> recognise it on any other characteristic easily.

Makes sense; unless this changes in the future, it's probably not worth
taking into account now.

>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/branch-dilution-off.c
>> @@ -0,0 +1,57 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O1 -mcpu=cortex-a72 --param case-values-threshold=50" } */
>> +/* { dg-final { scan-assembler-not  "\\s*b.*\n\\snop\n" } } */
>
> You can write this simpler as
>
> /* { dg-final { scan-assembler-not {\s*b.*\n\snop\n} } } */
>
> which shows a problem in this more clearly: . will match newlines as
> well.  Also, starting a RE with (anything)* does nothing, "anything" is
> allowed to match 0 times after all.  You probably meant the "b" should
> start a mnemonic?
>
> /* { dg-final { scan-assembler-not {(?n)\mb.*\n\snop\n} } } */
>
> (\m is a zero-width start-of-word match, like \< in grep; (?n) means .
> does not match newlines (if you know Perl, it turns /m on and /s off --
> the opposite of the defaults for Tcl).
>
> (or you could do [^\n]* or even just \S* , no (?n) needed then).
>
>
> Segher

Thanks for the feedback!

  Andrea


Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-24 Thread Segher Boessenkool
Hi!

[ Btw, the mailing list archive will not show your attachments (just lets
me download them); naming the files *.txt probably works, but you can
also use a sane mime type (like, text/plain) ].

On Wed, Jul 22, 2020 at 12:09:08PM +0200, Andrea Corallo wrote:
> this second patch implements the AArch64 specific back-end pass
> 'branch-dilution' controllable by the following command-line options:
> 
> -mbranch-dilution
> 
> --param=aarch64-branch-dilution-granularity={num}
> 
> --param=aarch64-branch-dilution-max-branches={num}

That sounds like something that would be useful generically, even?
Targets that do not want to use it can just skip it (that probably should
be the default then), via setting the granularity to 0 for example.

> Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
> for all the testsuite proved to be ~0.4%.

Can you share a geomean improvement as well?  Also, is it something
like 0.4%, as it sounds, or is it more?

> - Unconditional branches must be the end of a basic block, and nops
>   cannot be outside of a basic block.  Thus the need for FILLER_INSN,
>   which allows placement outside of a basic block

But the new rtx class is recognised in just one location, so it could
recognise it on any other characteristic easily.

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/branch-dilution-off.c
> @@ -0,0 +1,57 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -mcpu=cortex-a72 --param case-values-threshold=50" } */
> +/* { dg-final { scan-assembler-not  "\\s*b.*\n\\snop\n" } } */

You can write this simpler as

/* { dg-final { scan-assembler-not {\s*b.*\n\snop\n} } } */

which shows a problem in this more clearly: . will match newlines as
well.  Also, starting a RE with (anything)* does nothing, "anything" is
allowed to match 0 times after all.  You probably meant the "b" should
start a mnemonic?

/* { dg-final { scan-assembler-not {(?n)\mb.*\n\snop\n} } } */

(\m is a zero-width start-of-word match, like \< in grep; (?n) means .
does not match newlines (if you know Perl, it turns /m on and /s off --
the opposite of the defaults for Tcl).

(or you could do [^\n]* or even just \S* , no (?n) needed then).
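To make the pitfall concrete, a small Python sketch of the two patterns
(Python's `re` needs DOTALL to mimic Tcl's default where `.` matches
newlines, and `\b` stands in for Tcl's `\m`; the assembly snippet is made
up purely for illustration):

```python
import re

# Branch-free AArch64-style snippet: no branch at all, let alone one
# followed by a nop.
asm = "foo:\n\tsub\tw0, w0, #1\n\tlsl\tw1, w0, #2\n\tnop\n"

# The original pattern: the leading \s* can match zero characters, and
# the unanchored `b` happily matches the middle of the `sub` mnemonic,
# while `.` (DOTALL) then crosses line boundaries to reach the nop.
loose = re.compile(r"\s*b.*\n\snop\n", re.DOTALL)
assert loose.search(asm)  # false positive on branch-free code

# A word-boundary anchor plus a single-line body fixes both problems.
tight = re.compile(r"\bb[^\n]*\n\snop\n")
assert tight.search(asm) is None           # no false positive
assert tight.search("\tb\t.L43\n\tnop\n")  # still catches a real b + nop
```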


Segher


Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-24 Thread Andrea Corallo
Segher Boessenkool  writes:

> Hi!
>
> On Fri, Jul 24, 2020 at 09:01:33AM +0200, Andrea Corallo wrote:
>> Segher Boessenkool  writes:
>> >> Correct, it's a sliding window only because the real load address is not
>> >> known to the compiler and the algorithm is conservative.  I believe we
>> >> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (al
>> >> least) the granule size, then we should be able to insert 'nop aligned
>> >> labels' precisely.
>> >
>> > Yeah, we have similar issues on Power...  Our "granule" (fetch group
>> > size, in our terminology) is 32 typically, but we align functions to
>> > just 16.  This is causing some problems, but aligning to bigger
>> > boundaries isn't a very happy alternative either.  WIP...
>> 
>> Interesting, I was expecting other CPUs to have a similar mechanism.
>
> On old cpus (like the 970) there were at most two branch predictions per
> cycle.  Nowadays, all branches are predicted; not sure when this changed,
> it is pretty long ago already.
>
>> > (We don't have this exact same problem, because our non-ancient cores
>> > can just predict *all* branches in the same cycle).
>> >
>> >> My main fear is that, given new cores tend to have big granules, code
>> >> size would blow up.  One advantage of the implemented algorithm is that,
>> >> even if slightly conservative, it impacts code size only where a high
>> >> branch density shows up.
>> >
>> > What is "big granules" for you?
>> 
>> N1 is 8 instructions so 32 bytes as well, I guess this may grow further
>> (my speculation).
>
> It has to sooner rather than later, yeah.  Or the mechanism has to change
> more radically.  Interesting times ahead, I guess :-)

Indeed :)

> About your patch itself.  The basic idea seems fine (I didn't look too
> closely), but do you really need a new RTX class for this?  That is not
> very appetising...

Agreed; OTOH I'm not sure about the other options on the table and
their impact.  The advantage of this one is that its impact is
relatively contained.

  Andrea


Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-24 Thread Segher Boessenkool
Hi!

On Fri, Jul 24, 2020 at 09:01:33AM +0200, Andrea Corallo wrote:
> Segher Boessenkool  writes:
> >> Correct, it's a sliding window only because the real load address is not
> >> known to the compiler and the algorithm is conservative.  I believe we
> >> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
> >> least) the granule size, then we should be able to insert 'nop aligned
> >> labels' precisely.
> >
> > Yeah, we have similar issues on Power...  Our "granule" (fetch group
> > size, in our terminology) is 32 typically, but we align functions to
> > just 16.  This is causing some problems, but aligning to bigger
> > boundaries isn't a very happy alternative either.  WIP...
> 
> Interesting, I was expecting other CPUs to have a similar mechanism.

On old cpus (like the 970) there were at most two branch predictions per
cycle.  Nowadays, all branches are predicted; not sure when this changed,
it is pretty long ago already.

> > (We don't have this exact same problem, because our non-ancient cores
> > can just predict *all* branches in the same cycle).
> >
> >> My main fear is that, given new cores tend to have big granules, code
> >> size would blow up.  One advantage of the implemented algorithm is that,
> >> even if slightly conservative, it impacts code size only where a high
> >> branch density shows up.
> >
> > What is "big granules" for you?
> 
> N1 is 8 instructions so 32 bytes as well, I guess this may grow further
> (my speculation).

It has to sooner rather than later, yeah.  Or the mechanism has to change
more radically.  Interesting times ahead, I guess :-)


About your patch itself.  The basic idea seems fine (I didn't look too
closely), but do you really need a new RTX class for this?  That is not
very appetising...


Segher


Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-24 Thread Andrea Corallo
Segher Boessenkool  writes:

> On Wed, Jul 22, 2020 at 09:45:08PM +0200, Andrea Corallo wrote:
>> > Should that actually be a sliding window, or should there actually just
>> > not be more than N branches per aligned block of machine code?  Like,
>> > per fetch group.
>> >
>> > Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
>> > even) then?  GCC has infrastructure for that, already.
>> 
>> Correct, it's a sliding window only because the real load address is not
>> known to the compiler and the algorithm is conservative.  I believe we
>> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
>> least) the granule size, then we should be able to insert 'nop aligned
>> labels' precisely.
>
> Yeah, we have similar issues on Power...  Our "granule" (fetch group
> size, in our terminology) is 32 typically, but we align functions to
> just 16.  This is causing some problems, but aligning to bigger
> boundaries isn't a very happy alternative either.  WIP...

Interesting, I was expecting other CPUs to have a similar mechanism.

> (We don't have this exact same problem, because our non-ancient cores
> can just predict *all* branches in the same cycle).
>
>> My main fear is that, given new cores tend to have big granules, code
>> size would blow up.  One advantage of the implemented algorithm is that,
>> even if slightly conservative, it impacts code size only where a high
>> branch density shows up.
>
> What is "big granules" for you?

N1 is 8 instructions so 32 bytes as well, I guess this may grow further
(my speculation).

  Andrea


Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-23 Thread Segher Boessenkool
On Wed, Jul 22, 2020 at 09:45:08PM +0200, Andrea Corallo wrote:
> > Should that actually be a sliding window, or should there actually just
> > not be more than N branches per aligned block of machine code?  Like,
> > per fetch group.
> >
> > Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
> > even) then?  GCC has infrastructure for that, already.
> 
> Correct, it's a sliding window only because the real load address is not
> known to the compiler and the algorithm is conservative.  I believe we
> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
> least) the granule size, then we should be able to insert 'nop aligned
> labels' precisely.

Yeah, we have similar issues on Power...  Our "granule" (fetch group
size, in our terminology) is 32 typically, but we align functions to
just 16.  This is causing some problems, but aligning to bigger
boundaries isn't a very happy alternative either.  WIP...

(We don't have this exact same problem, because our non-ancient cores
can just predict *all* branches in the same cycle).

> My main fear is that, given new cores tend to have big granules, code
> size would blow up.  One advantage of the implemented algorithm is that,
> even if slightly conservative, it impacts code size only where a high
> branch density shows up.

What is "big granules" for you?


Segher


Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-22 Thread Andrea Corallo
Segher Boessenkool  writes:

> Hi!
>
> On Wed, Jul 22, 2020 at 03:53:34PM +0200, Andrea Corallo wrote:
>> Andrew Pinski  writes:
>> > Can you give a simple example of what this patch does?
>>
>> Sure, this pass simply moves a sliding window over the insns trying to
>> make sure that we never have more than 'max_branch' branches for every
>> 'granule_size' insns.
>>
>> If too many branches are detected, nops are added where considered
>> least harmful to correct that.
>
> Should that actually be a sliding window, or should there actually just
> not be more than N branches per aligned block of machine code?  Like,
> per fetch group.
>
> Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
> even) then?  GCC has infrastructure for that, already.

Correct, it's a sliding window only because the real load address is not
known to the compiler and the algorithm is conservative.  I believe we
could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
least) the granule size, then we should be able to insert 'nop aligned
labels' precisely.

My main fear is that, given new cores tend to have big granules, code
size would blow up.  One advantage of the implemented algorithm is that,
even if slightly conservative, it impacts code size only where a high
branch density shows up.

  Andrea


Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-22 Thread Segher Boessenkool
Hi!

On Wed, Jul 22, 2020 at 03:53:34PM +0200, Andrea Corallo wrote:
> Andrew Pinski  writes:
> > Can you give a simple example of what this patch does?
> 
> Sure, this pass simply moves a sliding window over the insns trying to
> make sure that we never have more than 'max_branch' branches for every
> 'granule_size' insns.
> 
> If too many branches are detected, nops are added where considered
> least harmful to correct that.

Should that actually be a sliding window, or should there actually just
not be more than N branches per aligned block of machine code?  Like,
per fetch group.

Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
even) then?  GCC has infrastructure for that, already.


Segher


Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-22 Thread Andrea Corallo
Hi Andrew,

thanks for reviewing I'll work on your comments.  Just replying to the
high level questions.

Andrew Pinski  writes:

> On Wed, Jul 22, 2020 at 3:10 AM Andrea Corallo  wrote:
>>
>> Hi all,
>>
>> this second patch implements the AArch64 specific back-end pass
>> 'branch-dilution' controllable by the following command-line options:
>>
>> -mbranch-dilution
>>
>> --param=aarch64-branch-dilution-granularity={num}
>>
>> --param=aarch64-branch-dilution-max-branches={num}
>>
>> Some cores known to be able to benefit from this pass have been given
>> default tuning values for their granularity and max-branches.  Each
>> affected core has a very specific granule size and associated max-branch
>> limit.  This is a microarchitecture specific optimization.  Typical
>> usage should be -mbranch-dilution with a specified -mcpu.  Cores with a
>> granularity tuned to 0 will be ignored. Options are provided for
>> experimentation.
>
> Can you give a simple example of what this patch does?

Sure, this pass simply moves a sliding window over the insns trying to
make sure that we never have more than 'max_branch' branches for every
'granule_size' insns.

If too many branches are detected, nops are added where considered
least harmful to correct that.

There are obviously many scenarios where the compiler can generate
branch-dense pieces of code, but say we have the equivalent of:


.L389:
bl  foo
b   .L43
.L388:
bl  foo
b   .L42
.L387:
bl  foo
b   .L41
.L386:
bl  foo
b   .L40


Assuming granule size 4 and max branches 2, this will be transformed
into the equivalent of:


.L389:
bl  foo
b   .L43
nop
nop
.L388:
bl  foo
b   .L42
nop
nop
.L387:
bl  foo
b   .L41
nop
nop
.L386:
bl  foo
b   .L40
nop
nop


> Also your testcases seem too sensitive to other optimizations which
> could happen.  E.g. the call to "branch (i)" could be pulled out of
> the switch statement.  Or even the "*i += N;" could be moved to one
> basic block and the switch becomes just one if statement.
>
>> Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
>> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
>> for all the testsuite proved to be ~0.4%.
>
> Also does this improve any non-SPEC benchmarks or has it only been
> benchmarked with SPEC?

So far I tried it only on SPEC 2006.  The transformation is not
benchmark specific though; other code may benefit from it.

Thanks

  Andrea


Re: [PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-22 Thread Andrew Pinski via Gcc-patches
On Wed, Jul 22, 2020 at 3:10 AM Andrea Corallo  wrote:
>
> Hi all,
>
> this second patch implements the AArch64 specific back-end pass
> 'branch-dilution' controllable by the following command-line options:
>
> -mbranch-dilution
>
> --param=aarch64-branch-dilution-granularity={num}
>
> --param=aarch64-branch-dilution-max-branches={num}
>
> Some cores known to be able to benefit from this pass have been given
> default tuning values for their granularity and max-branches.  Each
> affected core has a very specific granule size and associated max-branch
> limit.  This is a microarchitecture specific optimization.  Typical
> usage should be -mbranch-dilution with a specified -mcpu.  Cores with a
> granularity tuned to 0 will be ignored. Options are provided for
> experimentation.

Can you give a simple example of what this patch does?
Also your testcases seem too sensitive to other optimizations which
could happen.  E.g. the call to "branch (i)" could be pulled out of
the switch statement.  Or even the "*i += N;" could be moved to one
basic block and the switch becomes just one if statement.

> Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
> for all the testsuite proved to be ~0.4%.

Also does this improve any non-SPEC benchmarks or has it only been
benchmarked with SPEC?

A few comments about the patch itself:
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -321,7 +321,7 @@ aarch64*-*-*)
>  c_target_objs="aarch64-c.o"
>  cxx_target_objs="aarch64-c.o"
>  d_target_objs="aarch64-d.o"
> - extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o 
> aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o 
> aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o 
> falkor-tag-collision-avoidance.o aarch64-bti-insert.o"
> + extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o 
> aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o 
> aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o 
> falkor-tag-collision-avoidance.o aarch64-bti-insert.o 
> aarch64-branch-dilution.o"
>  target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.c 
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.h 
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
>  target_has_targetm_common=yes
>  ;;

I think it is time to change how extra_objs is done and split it
across a few lines; this should be done in a different patch, it will
simplify future additions later on.

+unsigned max_branch = 0;
+unsigned granule_size = 0;

These two really should be part of the insn_granule class.  Having
global variables is not a good idea with respect to threading of GCC.

+  dump_printf (MSG_NOTE,
+"%d. %s%s%s  > NEXT = (%d), PREV = (%d) -- UID: %d\n",
+insn->index, GET_RTX_NAME (GET_CODE (insn->rtx)),
+any_uncondjump_p (insn->rtx) ? " (ubranch)" : "",
+insn->is_nop ? " (nop)" : "",
+insn->next ? insn->next->index : -1,
+insn->prev ? insn->prev->index : -1, INSN_UID (insn->rtx));

This part should really be of a method of insn_info instead.

is_branch (insn->rtx)

Why not:
insn->is_branch ()

This simplifies the code and really shows what is being done.

+  if (is_branch (prev_real_nondebug_insn (insn->rtx))
+  && is_branch (next_real_nondebug_insn (insn->rtx)))

Maybe:
+  if (is_branch (insn->prev_real_nondebug_insn ())
+  && is_branch (insn->next_real_nondebug_insn ()))

+  while (current_insn && !current_insn->is_branch)
+{
+  current_insn = current_insn->next;
+}

Why not:
current_insn = current_insn->next_branch_insn ();

There are many more examples of where you can improve like the above;
that is the way you define insn_info can be improved and push some of
the implementation back into the insn_info definition.

Thanks,
Andrew

>
> * Algorithm and Heuristic
>
> The pass takes a very simple 'sliding window' approach to the problem.
> We crawl through each instruction (starting at the first branch) and
> keep track of the number of branches within the current "granule" (or
> window).  When this exceeds the max-branch value, the pass will dilute
> the current granule, inserting nops to push out some of the branches.
> The heuristic will favor unconditional branches (for performance
> reasons), or branches that are between two other branches (in order to
> decrease the likelihood of another dilution call being needed).
>
> Each branch type required a different method for nop insertion due to
> RTL/basic_block restrictions:
>
> - Returning calls do not end a basic block so can be handled by
>   emitting a generic nop.
>
> - Unconditional branches must be the end of a basic block, and nops
>   cannot be outside of a basic block.  Thus the need for FILLER_INSN,
>   which allows placement outside of a basic block
>
> - and translates to a nop.
>
> - For most conditional branches we've taken a simple approach and only
>   handle the fallthru edge for simplicity, which we do by inserting a
>   "nop block" of nops on the fallthru edge, mapping that back to the
>   original destination block.

[PATCH 2/2] Aarch64: Add branch diluter pass

2020-07-22 Thread Andrea Corallo
Hi all,

this second patch implements the AArch64 specific back-end pass
'branch-dilution' controllable by the following command-line options:

-mbranch-dilution

--param=aarch64-branch-dilution-granularity={num}

--param=aarch64-branch-dilution-max-branches={num}

Some cores known to be able to benefit from this pass have been given
default tuning values for their granularity and max-branches.  Each
affected core has a very specific granule size and associated max-branch
limit.  This is a microarchitecture specific optimization.  Typical
usage should be -mbranch-dilution with a specified -mcpu.  Cores with a
granularity tuned to 0 will be ignored. Options are provided for
experimentation.

Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
for all the testsuite proved to be ~0.4%.

* Algorithm and Heuristic

The pass takes a very simple 'sliding window' approach to the problem.
We crawl through each instruction (starting at the first branch) and
keep track of the number of branches within the current "granule" (or
window).  When this exceeds the max-branch value, the pass will dilute
the current granule, inserting nops to push out some of the branches.
The heuristic will favor unconditional branches (for performance
reasons), or branches that are between two other branches (in order to
decrease the likelihood of another dilution call being needed).

Each branch type required a different method for nop insertion due to
RTL/basic_block restrictions:

- Returning calls do not end a basic block so can be handled by
  emitting a generic nop.

- Unconditional branches must be the end of a basic block, and nops
  cannot be outside of a basic block.  Thus the need for FILLER_INSN,
  which allows placement outside of a basic block

- and translates to a nop.

- For most conditional branches we've taken a simple approach and only
  handle the fallthru edge for simplicity, which we do by inserting a
  "nop block" of nops on the fallthru edge, mapping that back to the
  original destination block.

- asm gotos and pcsets are going to be tricky to analyze from a
  dilution perspective so are ignored at present.
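For illustration only, the sliding-window heuristic above can be sketched
in plain Python (this is not the RTL pass: `is_branch` and `make_nop` are
stand-ins for the target-specific parts, and the choice of which branch
to push out is reduced to simple padding):

```python
def dilute_branches(insns, granule_size, max_branches, is_branch, make_nop):
    """Sliding-window branch dilution (sketch).

    Emit insns in order; before emitting a branch, insert nops until no
    window of `granule_size` consecutive output insns would hold more
    than `max_branches` branches.
    """
    out = []

    def branches_in_tail(n):
        # Count branches among the last n emitted insns.
        return sum(map(is_branch, out[-n:])) if n > 0 else 0

    for insn in insns:
        if is_branch(insn):
            # The new window would be the last (granule_size - 1) emitted
            # insns plus this branch; pad until it stays within budget.
            while branches_in_tail(granule_size - 1) >= max_branches:
                out.append(make_nop())
        out.append(insn)
    return out


# Granule size 4, max 2 branches, on a branch-dense bl/b sequence.
code = ["bl foo", "b .L43", "bl foo", "b .L42", "bl foo", "b .L41"]
diluted = dilute_branches(
    code, 4, 2,
    is_branch=lambda i: i.startswith(("b ", "bl ")),
    make_nop=lambda: "nop")
```

On this input each `bl`/`b` pair ends up separated from the next by two
nops, matching the nop placement shown in the assembly example discussed
elsewhere in the thread.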

* Testing

The two patches have been tested together on top of current master on
aarch64-unknown-linux-gnu as follows:

- Successful compilation of 3 stage bootstrap with the
  pass forced on (for stage 2, 3)

- No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)

- No 'make check' regressions


Regards

  Andrea

gcc/ChangeLog

2020-07-17  Andrea Corallo  
Carey Williams  

* config.gcc (extra_objs): Add aarch64-branch-dilution.o.
* config/aarch64/aarch64-branch-dilution.c: New file.
* config/aarch64/aarch64-passes.def (branch-dilution): Register
pass.
* config/aarch64/aarch64-protos.h (struct tune_params): Declare
tuning parameters bdilution_gsize and bdilution_maxb.
(make_pass_branch_dilution): New declaration.
* config/aarch64/aarch64.c (generic_tunings, cortexa35_tunings)
(cortexa53_tunings, cortexa57_tunings, cortexa72_tunings)
(cortexa73_tunings, exynosm1_tunings, thunderxt88_tunings)
(thunderx_tunings, tsv110_tunings, xgene1_tunings)
(qdf24xx_tunings, saphira_tunings, thunderx2t99_tunings)
(neoversen1_tunings): Provide default tunings for bdilution_gsize
and bdilution_maxb.
* config/aarch64/aarch64.md (filler_insn): Define new insn.
* config/aarch64/aarch64.opt (-mbranch-dilution)
(--param=aarch64-branch-dilution-granularity)
(--param=aarch64-branch-dilution-max-branches): Add new options.
* config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
for aarch64-branch-dilution.c.
* doc/invoke.texi (-mbranch-dilution)
(--param=aarch64-branch-dilution-granularity)
(--param=aarch64-branch-dilution-max-branches): Document branch
dilution options.

gcc/testsuite/ChangeLog

2020-07-17  Andrea Corallo  
Carey Williams  

* gcc.target/aarch64/branch-dilution-off.c: New file.
* gcc.target/aarch64/branch-dilution-on.c: New file.
From 386b3a3131d5f03a4c9fb8ee47b321009f17fab5 Mon Sep 17 00:00:00 2001
From: Andrea Corallo 
Date: Thu, 16 Jul 2020 09:24:33 +0100
Subject: [PATCH 2/2] Aarch64: Add branch diluter pass

gcc/ChangeLog

2020-07-17  Andrea Corallo  
	Carey Williams  

	* config.gcc (extra_objs): Add aarch64-branch-dilution.o.
	* config/aarch64/aarch64-branch-dilution.c: New file.
	* config/aarch64/aarch64-passes.def (branch-dilution): Register
	pass.
	* config/aarch64/aarch64-protos.h (struct tune_params): Declare
	tuning parameters bdilution_gsize and bdilution_maxb.
	(make_pass_branch_dilution): New declaration.
	* config/aarch64/aarch64.c (generic_tunings, cortexa35_tunings)
	(cortexa53_tunings, cortexa57_tun