Re: [PATCH 2/2] Aarch64: Add branch diluter pass
Hi!

On Tue, Jul 28, 2020 at 07:55:58PM +0100, Andrea Corallo wrote:
> > [ Btw, the mailing list archive will not show your attachments (just lets
> > me download them); naming the files *.txt probably works, but you can
> > also use a sane mime type (like, text/plain) ].
>
> [ Sure can do it np, I'm just less sure if text/x-diff is such an
> insane mime type for a mailing list software :) ]

Well, first and foremost, it does not *work* with our current mailing
list setup.  Also, anything x- only makes sense if you know the software
the receiver runs (and of course you don't, this is public email, with
hundreds or thousands of readers).

> > Can you share a geomean improvement as well?  Also, something like
> > 0.4% is what it sounds like, or is it more?
>
> After my first measurement a colleague suggested a less noisy system
> to benchmark on and a more reproducible methodology.  I repeated the
> tests on N1 with the following results:
>
> | Benchmark      | Est. Peak Rate ratio |
> |                | diluted / baseline   |
> |----------------+----------------------|
> | 400.perlbench  | 1.018x               |
> | 401.bzip2      | 1.004x               |
> | 403.gcc        | 0.987x               |
> | 429.mcf        | 1.000x               |
> | 445.gobmk      | 0.998x               |
> | 456.hmmer      | 1.000x               |
> | 458.sjeng      | 1.008x               |
> | 462.libquantum | 1.014x               |
> | 464.h264ref    | 1.004x               |
> | 471.omnetpp    | 1.017x               |
> | 473.astar      | 1.007x               |
> | 483.xalancbmk  | 0.998x               |

Cool.  Well, gcc has a pretty big drop, what's up with that?  Everything
else looks just great :-)

> I was told xalanc tends to be very noisy, being memory bound, so this
> explains the difference; not sure why sjeng looks less good.

Sjeng is very jumpy code, so of course your patch will influence it a
lot.  No idea why it is less positive this run.

> The overall ratio comparing SPEC rates is +~0.44%.

Yup, a nice healthy improvement :-)


Segher
Re: [PATCH 2/2] Aarch64: Add branch diluter pass
Segher Boessenkool writes:

> Hi!

Hi Segher,

> [ Btw, the mailing list archive will not show your attachments (just lets
> me download them); naming the files *.txt probably works, but you can
> also use a sane mime type (like, text/plain) ].

[ Sure can do it np, I'm just less sure if text/x-diff is such an
insane mime type for a mailing list software :) ]

> On Wed, Jul 22, 2020 at 12:09:08PM +0200, Andrea Corallo wrote:
>> this second patch implements the AArch64 specific back-end pass
>> 'branch-dilution' controllable by the following command line options:
>>
>> -mbranch-dilution
>>
>> --param=aarch64-branch-dilution-granularity={num}
>>
>> --param=aarch64-branch-dilution-max-branches={num}
>
> That sounds like something that would be useful generically, even?
> Targets that do not want to use it can just skip it (that probably should
> be the default then), via setting the granularity to 0 for example.
>
>> Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
>> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
>> for all the testsuite proved to be ~0.4%.
>
> Can you share a geomean improvement as well?  Also, something like 0.4%
> is what it sounds like, or is it more?

After my first measurement a colleague suggested a less noisy system
to benchmark on and a more reproducible methodology.  I repeated the
tests on N1 with the following results:

| Benchmark      | Est. Peak Rate ratio |
|                | diluted / baseline   |
|----------------+----------------------|
| 400.perlbench  | 1.018x               |
| 401.bzip2      | 1.004x               |
| 403.gcc        | 0.987x               |
| 429.mcf        | 1.000x               |
| 445.gobmk      | 0.998x               |
| 456.hmmer      | 1.000x               |
| 458.sjeng      | 1.008x               |
| 462.libquantum | 1.014x               |
| 464.h264ref    | 1.004x               |
| 471.omnetpp    | 1.017x               |
| 473.astar      | 1.007x               |
| 483.xalancbmk  | 0.998x               |

I was told xalanc tends to be very noisy, being memory bound, so this
explains the difference; not sure why sjeng looks less good.

The overall ratio comparing SPEC rates is +~0.44%.

>> - Unconditional branches must be the end of a basic block, and nops
>> cannot be outside of a basic block.  Thus the need for FILLER_INSN,
>> which allows placement outside of a basic block
>
> But the new rtx class is recognised in just one location, so it could
> recognise it on any other characteristic easily.

Makes sense; unless this changes in the future, but it's probably not
worth taking that into account now.

>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/branch-dilution-off.c
>> @@ -0,0 +1,57 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O1 -mcpu=cortex-a72 --param case-values-threshold=50" } */
>> +/* { dg-final { scan-assembler-not "\\s*b.*\n\\snop\n" } } */
>
> You can write this simpler as
>
> /* { dg-final { scan-assembler-not {\s*b.*\n\snop\n} } } */
>
> which shows a problem in this more clearly: . will match newlines as
> well.  Also, starting a RE with (anything)* does nothing, "anything" is
> allowed to match 0 times after all.  You probably meant the "b" to
> start a mnemonic?
>
> /* { dg-final { scan-assembler-not {(?n)\mb.*\n\snop\n} } } */
>
> (\m is a zero-width start-of-word match, like \< in grep; (?n) means .
> does not match newlines (if you know Perl, it turns /m on and /s off --
> the opposite of the defaults for Tcl).
>
> (or you could do [^\n]* or even just \S* , no (?n) needed then).
>
>
> Segher

Thanks for the feedback!

Andrea
Re: [PATCH 2/2] Aarch64: Add branch diluter pass
Hi!

[ Btw, the mailing list archive will not show your attachments (just lets
me download them); naming the files *.txt probably works, but you can
also use a sane mime type (like, text/plain) ].

On Wed, Jul 22, 2020 at 12:09:08PM +0200, Andrea Corallo wrote:
> this second patch implements the AArch64 specific back-end pass
> 'branch-dilution' controllable by the following command line options:
>
> -mbranch-dilution
>
> --param=aarch64-branch-dilution-granularity={num}
>
> --param=aarch64-branch-dilution-max-branches={num}

That sounds like something that would be useful generically, even?
Targets that do not want to use it can just skip it (that probably should
be the default then), via setting the granularity to 0 for example.

> Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
> for all the testsuite proved to be ~0.4%.

Can you share a geomean improvement as well?  Also, something like 0.4%
is what it sounds like, or is it more?

> - Unconditional branches must be the end of a basic block, and nops
> cannot be outside of a basic block.  Thus the need for FILLER_INSN,
> which allows placement outside of a basic block

But the new rtx class is recognised in just one location, so it could
recognise it on any other characteristic easily.

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/branch-dilution-off.c
> @@ -0,0 +1,57 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -mcpu=cortex-a72 --param case-values-threshold=50" } */
> +/* { dg-final { scan-assembler-not "\\s*b.*\n\\snop\n" } } */

You can write this simpler as

/* { dg-final { scan-assembler-not {\s*b.*\n\snop\n} } } */

which shows a problem in this more clearly: . will match newlines as
well.  Also, starting a RE with (anything)* does nothing, "anything" is
allowed to match 0 times after all.  You probably meant the "b" to
start a mnemonic?

/* { dg-final { scan-assembler-not {(?n)\mb.*\n\snop\n} } } */

(\m is a zero-width start-of-word match, like \< in grep; (?n) means .
does not match newlines (if you know Perl, it turns /m on and /s off --
the opposite of the defaults for Tcl).

(or you could do [^\n]* or even just \S* , no (?n) needed then).


Segher
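[Editor's note: to see concretely why a `.` that matches newlines makes the original pattern fire on unrelated code, here is a small illustration in Python's `re` module. DejaGnu scan patterns use Tcl regexes, so this is only an analogue: `re.DOTALL` stands in for Tcl's default newline-matching `.`, and `[^\n]*` is the fix Segher suggests.]

```python
import re

# Toy assembly: the only "b" mnemonic (bl) is NOT immediately followed
# by a nop, so a correct branch-then-nop check must not match here.
asm = "foo:\n\tbl\tbar\n\tadd\tx0, x0, 1\n\tnop\n"

# With . allowed to match newlines (Tcl's default, emulated via
# re.DOTALL), .* bridges unrelated lines and the pattern fires anyway.
loose = re.search(r"b.*\n\s*nop\n", asm, re.DOTALL)

# Keeping . away from newlines with [^\n]* only matches a branch line
# immediately followed by a nop line.
strict = re.search(r"\bb[^\n]*\n\s*nop\n", asm)

# A genuinely adjacent branch/nop pair still matches the strict form.
asm2 = "foo:\n\tb\t.L4\n\tnop\n"
strict2 = re.search(r"\bb[^\n]*\n\s*nop\n", asm2)
```

Here `loose` matches (spuriously) while `strict` does not, and `strict2` shows the precise pattern still catching a real branch-then-nop sequence.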
Re: [PATCH 2/2] Aarch64: Add branch diluter pass
Segher Boessenkool writes:

> Hi!
>
> On Fri, Jul 24, 2020 at 09:01:33AM +0200, Andrea Corallo wrote:
>> Segher Boessenkool writes:
>> >> Correct, it's a sliding window only because the real load address is
>> >> not known to the compiler and the algorithm is conservative.  I
>> >> believe we could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each
>> >> function to (at least) the granule size; then we should be able to
>> >> insert 'nop aligned labels' precisely.
>> >
>> > Yeah, we have similar issues on Power...  Our "granule" (fetch group
>> > size, in our terminology) is 32 typically, but we align functions to
>> > just 16.  This is causing some problems, but aligning to bigger
>> > boundaries isn't a very happy alternative either.  WIP...
>>
>> Interesting, I was expecting other CPUs to have a similar mechanism.
>
> On old cpus (like the 970) there were at most two branch predictions per
> cycle.  Nowadays, all branches are predicted; not sure when this changed,
> it is pretty long ago already.
>
>> > (We don't have this exact same problem, because our non-ancient cores
>> > can just predict *all* branches in the same cycle).
>> >
>> >> My main fear is that given new cores tend to have big granules, code
>> >> size would blow up.  One advantage of the implemented algorithm is
>> >> that even if slightly conservative it impacts code size only where a
>> >> high branch density shows up.
>> >
>> > What is "big granules" for you?
>>
>> N1 is 8 instructions so 32 bytes as well, I guess this may grow further
>> (my speculation).
>
> It has to sooner rather than later, yeah.  Or the mechanism has to change
> more radically.  Interesting times ahead, I guess :-)

Indeed :)

> About your patch itself.  The basic idea seems fine (I didn't look too
> closely), but do you really need a new RTX class for this?  That is not
> very appetising...

Agreed.  OTOH I'm not sure about the other options on the table and
their impact; the advantage of this one is that its impact is relatively
contained.

Andrea
Re: [PATCH 2/2] Aarch64: Add branch diluter pass
Hi!

On Fri, Jul 24, 2020 at 09:01:33AM +0200, Andrea Corallo wrote:
> Segher Boessenkool writes:
> >> Correct, it's a sliding window only because the real load address is
> >> not known to the compiler and the algorithm is conservative.  I
> >> believe we could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each
> >> function to (at least) the granule size; then we should be able to
> >> insert 'nop aligned labels' precisely.
> >
> > Yeah, we have similar issues on Power...  Our "granule" (fetch group
> > size, in our terminology) is 32 typically, but we align functions to
> > just 16.  This is causing some problems, but aligning to bigger
> > boundaries isn't a very happy alternative either.  WIP...
>
> Interesting, I was expecting other CPUs to have a similar mechanism.

On old cpus (like the 970) there were at most two branch predictions per
cycle.  Nowadays, all branches are predicted; not sure when this changed,
it is pretty long ago already.

> > (We don't have this exact same problem, because our non-ancient cores
> > can just predict *all* branches in the same cycle).
> >
> >> My main fear is that given new cores tend to have big granules, code
> >> size would blow up.  One advantage of the implemented algorithm is
> >> that even if slightly conservative it impacts code size only where a
> >> high branch density shows up.
> >
> > What is "big granules" for you?
>
> N1 is 8 instructions so 32 bytes as well, I guess this may grow further
> (my speculation).

It has to sooner rather than later, yeah.  Or the mechanism has to change
more radically.  Interesting times ahead, I guess :-)

About your patch itself.  The basic idea seems fine (I didn't look too
closely), but do you really need a new RTX class for this?  That is not
very appetising...


Segher
Re: [PATCH 2/2] Aarch64: Add branch diluter pass
Segher Boessenkool writes:

> On Wed, Jul 22, 2020 at 09:45:08PM +0200, Andrea Corallo wrote:
>> > Should that actually be a sliding window, or should there actually just
>> > not be more than N branches per aligned block of machine code?  Like,
>> > per fetch group.
>> >
>> > Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
>> > even) then?  GCC has infrastructure for that, already.
>>
>> Correct, it's a sliding window only because the real load address is not
>> known to the compiler and the algorithm is conservative.  I believe we
>> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
>> least) the granule size; then we should be able to insert 'nop aligned
>> labels' precisely.
>
> Yeah, we have similar issues on Power...  Our "granule" (fetch group
> size, in our terminology) is 32 typically, but we align functions to
> just 16.  This is causing some problems, but aligning to bigger
> boundaries isn't a very happy alternative either.  WIP...

Interesting, I was expecting other CPUs to have a similar mechanism.

> (We don't have this exact same problem, because our non-ancient cores
> can just predict *all* branches in the same cycle).
>
>> My main fear is that given new cores tend to have big granules, code
>> size would blow up.  One advantage of the implemented algorithm is that
>> even if slightly conservative it impacts code size only where a high
>> branch density shows up.
>
> What is "big granules" for you?

N1 is 8 instructions so 32 bytes as well; I guess this may grow further
(my speculation).

Andrea
Re: [PATCH 2/2] Aarch64: Add branch diluter pass
On Wed, Jul 22, 2020 at 09:45:08PM +0200, Andrea Corallo wrote:
> > Should that actually be a sliding window, or should there actually just
> > not be more than N branches per aligned block of machine code?  Like,
> > per fetch group.
> >
> > Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
> > even) then?  GCC has infrastructure for that, already.
>
> Correct, it's a sliding window only because the real load address is not
> known to the compiler and the algorithm is conservative.  I believe we
> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
> least) the granule size; then we should be able to insert 'nop aligned
> labels' precisely.

Yeah, we have similar issues on Power...  Our "granule" (fetch group
size, in our terminology) is 32 typically, but we align functions to
just 16.  This is causing some problems, but aligning to bigger
boundaries isn't a very happy alternative either.  WIP...

(We don't have this exact same problem, because our non-ancient cores
can just predict *all* branches in the same cycle).

> My main fear is that given new cores tend to have big granules, code
> size would blow up.  One advantage of the implemented algorithm is that
> even if slightly conservative it impacts code size only where a high
> branch density shows up.

What is "big granules" for you?


Segher
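[Editor's note: the alignment arithmetic behind this exchange is easy to make concrete. If functions were aligned to (at least) the granule size, the padding needed to reach the next fetch-granule boundary follows directly from the byte address. A hypothetical helper, not GCC code, assuming fixed 4-byte instructions as on AArch64:]

```python
def nops_to_align(addr, granule=32, insn_bytes=4):
    """Number of nops needed to advance from byte address `addr` to
    the next fetch-granule boundary (0 if already aligned).  Assumes
    fixed-width instructions of `insn_bytes` bytes."""
    return ((-addr) % granule) // insn_bytes
```

For example, with the 32-byte granule and 16-byte function alignment discussed above, a function starting mid-granule at offset 16 needs 4 nops to reach the next 32-byte boundary, which is the cost being weighed against aligning functions more aggressively.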
Re: [PATCH 2/2] Aarch64: Add branch diluter pass
Segher Boessenkool writes:

> Hi!
>
> On Wed, Jul 22, 2020 at 03:53:34PM +0200, Andrea Corallo wrote:
>> Andrew Pinski writes:
>> > Can you give a simple example of what this patch does?
>>
>> Sure, this pass simply moves a sliding window over the insns trying to
>> make sure that we never have more than 'max_branch' branches for every
>> 'granule_size' insns.
>>
>> If too many branches are detected, nops are added where considered less
>> harmful to correct that.
>
> Should that actually be a sliding window, or should there actually just
> not be more than N branches per aligned block of machine code?  Like,
> per fetch group.
>
> Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
> even) then?  GCC has infrastructure for that, already.

Correct, it's a sliding window only because the real load address is not
known to the compiler and the algorithm is conservative.  I believe we
could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
least) the granule size; then we should be able to insert 'nop aligned
labels' precisely.

My main fear is that given new cores tend to have big granules, code
size would blow up.  One advantage of the implemented algorithm is that
even if slightly conservative it impacts code size only where a high
branch density shows up.

Andrea
Re: [PATCH 2/2] Aarch64: Add branch diluter pass
Hi!

On Wed, Jul 22, 2020 at 03:53:34PM +0200, Andrea Corallo wrote:
> Andrew Pinski writes:
> > Can you give a simple example of what this patch does?
>
> Sure, this pass simply moves a sliding window over the insns trying to
> make sure that we never have more than 'max_branch' branches for every
> 'granule_size' insns.
>
> If too many branches are detected, nops are added where considered less
> harmful to correct that.

Should that actually be a sliding window, or should there actually just
not be more than N branches per aligned block of machine code?  Like,
per fetch group.

Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
even) then?  GCC has infrastructure for that, already.


Segher
Re: [PATCH 2/2] Aarch64: Add branch diluter pass
Hi Andrew,

thanks for reviewing, I'll work on your comments.  Just replying to the
high level questions.

Andrew Pinski writes:

> On Wed, Jul 22, 2020 at 3:10 AM Andrea Corallo wrote:
>>
>> Hi all,
>>
>> this second patch implements the AArch64 specific back-end pass
>> 'branch-dilution' controllable by the following command line options:
>>
>> -mbranch-dilution
>>
>> --param=aarch64-branch-dilution-granularity={num}
>>
>> --param=aarch64-branch-dilution-max-branches={num}
>>
>> Some cores known to be able to benefit from this pass have been given
>> default tuning values for their granularity and max-branches.  Each
>> affected core has a very specific granule size and associated max-branch
>> limit.  This is a microarchitecture specific optimization.  Typical
>> usage should be -mbranch-dilution with a specified -mcpu.  Cores with a
>> granularity tuned to 0 will be ignored.  Options are provided for
>> experimentation.
>
> Can you give a simple example of what this patch does?

Sure, this pass simply moves a sliding window over the insns trying to
make sure that we never have more than 'max_branch' branches for every
'granule_size' insns.

If too many branches are detected, nops are added where considered less
harmful to correct that.

There are obviously many scenarios where the compiler can generate
branch-dense pieces of code, but say we have the equivalent of:

.L389:
	bl foo
	b .L43
.L388:
	bl foo
	b .L42
.L387:
	bl foo
	b .L41
.L386:
	bl foo
	b .L40

Assuming granule size 4 and max branches 2, this will be transformed
into the equivalent of:

.L389:
	bl foo
	b .L43
	nop
	nop
.L388:
	bl foo
	b .L42
	nop
	nop
.L387:
	bl foo
	b .L41
	nop
	nop
.L386:
	bl foo
	b .L40
	nop
	nop

> Also your testcases seem too sensitive to other optimizations which
> could happen.  E.g. the call to "branch (i)" could be pulled out of
> the switch statement.  Or even the "*i += N;" could be moved to one
> basic block and the switch becomes just one if statement.
>
>> Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
>> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
>> for all the testsuite proved to be ~0.4%.
>
> Also does this improve any non-SPEC benchmarks or has it only been
> benchmarked with SPEC?

So far I tried it only on SPEC 2006.  The transformation is not
benchmark specific though; other code may benefit from it.

Thanks

Andrea
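[Editor's note: the dilution example in this message can be modeled in a few lines. This is only a toy sketch of the sliding-window idea, not the actual GCC pass: `dilute` is a hypothetical name, any mnemonic starting with "b" is treated as a branch, and the real heuristic's preference for unconditional branches as well as all RTL/basic-block constraints are ignored.]

```python
def dilute(insns, granule_size=4, max_branches=2):
    """Toy model of branch dilution: slide a window of `granule_size`
    insns over the stream; whenever it holds more than `max_branches`
    branches, insert nops before the newest branch to push it out."""
    def is_branch(insn):
        # Crude stand-in for a real branch test.
        return insn.split()[0].startswith("b")

    out = []
    for insn in insns:
        out.append(insn)
        # Pad until the trailing window is back under the limit.
        while sum(map(is_branch, out[-granule_size:])) > max_branches:
            out.insert(len(out) - 1, "nop")
    return out

# With granule 4 / max 2, each bl/b pair gets two nops of padding,
# matching the shape of the asm example above:
padded = dilute(["bl foo", "b .L43", "bl foo", "b .L42"])
# -> ["bl foo", "b .L43", "nop", "nop", "bl foo", "b .L42"]
```

After running this on any input, no window of `granule_size` consecutive output insns holds more than `max_branches` branches, which is the invariant the pass described here maintains.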
Re: [PATCH 2/2] Aarch64: Add branch diluter pass
On Wed, Jul 22, 2020 at 3:10 AM Andrea Corallo wrote:
>
> Hi all,
>
> this second patch implements the AArch64 specific back-end pass
> 'branch-dilution' controllable by the following command line options:
>
> -mbranch-dilution
>
> --param=aarch64-branch-dilution-granularity={num}
>
> --param=aarch64-branch-dilution-max-branches={num}
>
> Some cores known to be able to benefit from this pass have been given
> default tuning values for their granularity and max-branches.  Each
> affected core has a very specific granule size and associated max-branch
> limit.  This is a microarchitecture specific optimization.  Typical
> usage should be -mbranch-dilution with a specified -mcpu.  Cores with a
> granularity tuned to 0 will be ignored.  Options are provided for
> experimentation.

Can you give a simple example of what this patch does?

Also your testcases seem too sensitive to other optimizations which
could happen.  E.g. the call to "branch (i)" could be pulled out of
the switch statement.  Or even the "*i += N;" could be moved to one
basic block and the switch becomes just one if statement.

> Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
> for all the testsuite proved to be ~0.4%.

Also does this improve any non-SPEC benchmarks or has it only been
benchmarked with SPEC?

A few comments about the patch itself:

> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -321,7 +321,7 @@ aarch64*-*-*)
> 	c_target_objs="aarch64-c.o"
> 	cxx_target_objs="aarch64-c.o"
> 	d_target_objs="aarch64-d.o"
> -	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch64-bti-insert.o"
> +	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch64-bti-insert.o aarch64-branch-dilution.o"
> 	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.c \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> 	target_has_targetm_common=yes
> 	;;

I think it is time to change how extra_objs is done and split it across
a few lines; this should be done in a different patch, it will simplify
future additions later on.

+unsigned max_branch = 0;
+unsigned granule_size = 0;

These two really should be part of the insn_granule class.  Having
global variables is not a good idea with respect to threading of GCC.

+  dump_printf (MSG_NOTE,
+	       "%d. %s%s%s > NEXT = (%d), PREV = (%d) -- UID: %d\n",
+	       insn->index, GET_RTX_NAME (GET_CODE (insn->rtx)),
+	       any_uncondjump_p (insn->rtx) ? " (ubranch)" : "",
+	       insn->is_nop ? " (nop)" : "",
+	       insn->next ? insn->next->index : -1,
+	       insn->prev ? insn->prev->index : -1, INSN_UID (insn->rtx));

This part should really be a method of insn_info instead.

  is_branch (insn->rtx)

Why not:

  insn->is_branch ()

This simplifies the code and really shows what is being done.

+  if (is_branch (prev_real_nondebug_insn (insn->rtx))
+      && is_branch (next_real_nondebug_insn (insn->rtx)))

Maybe:

+  if (is_branch (insn->prev_real_nondebug_insn ())
+      && is_branch (insn->next_real_nondebug_insn ()))

+  while (current_insn && !current_insn->is_branch)
+    {
+      current_insn = current_insn->next;
+    }

Why not:

  current_insn = current_insn->next_branch_insn ();

There are many more examples of where you can improve like the above;
that is, the way you define insn_info can be improved to push some of
the implementation back into the insn_info definition.

Thanks,
Andrew

>
> * Algorithm and Heuristic
>
> The pass takes a very simple 'sliding window' approach to the problem.
> We crawl through each instruction (starting at the first branch) and
> keep track of the number of branches within the current "granule" (or
> window).  When this exceeds the max-branch value, the pass will dilute
> the current granule, inserting nops to push out some of the branches.
> The heuristic will favor unconditional branches (for performance
> reasons), or branches that are between two other branches (in order to
> decrease the likelihood of another dilution call being needed).
>
> Each branch type required a different method for nop insertion due to
> RTL/basic_block restrictions:
>
> - Returning calls do not end a basic block so can be handled by
>   emitting a generic nop.
>
> - Unconditional branches must be the end of a basic block, and nops
>   cannot be outside of a basic block.  Thus the need for FILLER_INSN,
>   which allows placement outside of a basic block - and translates to
>   a nop.
>
> - For most conditional branches we've taken a simple approach and only
>   handle the
[PATCH 2/2] Aarch64: Add branch diluter pass
Hi all,

this second patch implements the AArch64 specific back-end pass
'branch-dilution' controllable by the following command line options:

-mbranch-dilution

--param=aarch64-branch-dilution-granularity={num}

--param=aarch64-branch-dilution-max-branches={num}

Some cores known to be able to benefit from this pass have been given
default tuning values for their granularity and max-branches.  Each
affected core has a very specific granule size and associated max-branch
limit.  This is a microarchitecture specific optimization.  Typical
usage should be -mbranch-dilution with a specified -mcpu.  Cores with a
granularity tuned to 0 will be ignored.  Options are provided for
experimentation.

Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
for all the testsuite proved to be ~0.4%.

* Algorithm and Heuristic

The pass takes a very simple 'sliding window' approach to the problem.
We crawl through each instruction (starting at the first branch) and
keep track of the number of branches within the current "granule" (or
window).  When this exceeds the max-branch value, the pass will dilute
the current granule, inserting nops to push out some of the branches.
The heuristic will favor unconditional branches (for performance
reasons), or branches that are between two other branches (in order to
decrease the likelihood of another dilution call being needed).

Each branch type required a different method for nop insertion due to
RTL/basic_block restrictions:

- Returning calls do not end a basic block so can be handled by
  emitting a generic nop.

- Unconditional branches must be the end of a basic block, and nops
  cannot be outside of a basic block.  Thus the need for FILLER_INSN,
  which allows placement outside of a basic block - and translates to
  a nop.

- For most conditional branches we've taken a simple approach and only
  handle the fallthru edge for simplicity, which we do by inserting a
  "nop block" of nops on the fallthru edge, mapping that back to the
  original destination block.

- asm gotos and pcsets are going to be tricky to analyze from a
  dilution perspective so are ignored at present.

* Testing

The two patches have been tested together on top of current master on
aarch64-unknown-linux-gnu as follows:

- Successful compilation of 3 stage bootstrap with the pass forced on
  (for stage 2, 3)

- No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)

- No 'make check' regressions

Regards

Andrea

gcc/ChangeLog

2020-07-17  Andrea Corallo
	    Carey Williams

	* config.gcc (extra_objs): Add aarch64-branch-dilution.o.
	* config/aarch64/aarch64-branch-dilution.c: New file.
	* config/aarch64/aarch64-passes.def (branch-dilution): Register
	pass.
	* config/aarch64/aarch64-protos.h (struct tune_params): Declare
	tuning parameters bdilution_gsize and bdilution_maxb.
	(make_pass_branch_dilution): New declaration.
	* config/aarch64/aarch64.c (generic_tunings, cortexa35_tunings)
	(cortexa53_tunings, cortexa57_tunings, cortexa72_tunings)
	(cortexa73_tunings, exynosm1_tunings, thunderxt88_tunings)
	(thunderx_tunings, tsv110_tunings, xgene1_tunings)
	(qdf24xx_tunings, saphira_tunings, thunderx2t99_tunings)
	(neoversen1_tunings): Provide default tunings for bdilution_gsize
	and bdilution_maxb.
	* config/aarch64/aarch64.md (filler_insn): Define new insn.
	* config/aarch64/aarch64.opt (-mbranch-dilution)
	(--param=aarch64-branch-dilution-granularity)
	(--param=aarch64-branch-dilution-max-branches): Add new options.
	* config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
	for aarch64-branch-dilution.c.
	* doc/invoke.texi (-mbranch-dilution)
	(--param=aarch64-branch-dilution-granularity)
	(--param=aarch64-branch-dilution-max-branches): Document branch
	dilution options.

gcc/testsuite/ChangeLog

2020-07-17  Andrea Corallo
	    Carey Williams

	* gcc.target/aarch64/branch-dilution-off.c: New file.
	* gcc.target/aarch64/branch-dilution-on.c: New file.

From 386b3a3131d5f03a4c9fb8ee47b321009f17fab5 Mon Sep 17 00:00:00 2001
From: Andrea Corallo
Date: Thu, 16 Jul 2020 09:24:33 +0100
Subject: [PATCH 2/2] Aarch64: Add branch diluter pass

gcc/ChangeLog

2020-07-17  Andrea Corallo
	    Carey Williams

	* config.gcc (extra_objs): Add aarch64-branch-dilution.o.
	* config/aarch64/aarch64-branch-dilution.c: New file.
	* config/aarch64/aarch64-passes.def (branch-dilution): Register
	pass.
	* config/aarch64/aarch64-protos.h (struct tune_params): Declare
	tuning parameters bdilution_gsize and bdilution_maxb.
	(make_pass_branch_dilution): New declaration.
	* config/aarch64/aarch64.c (generic_tunings, cortexa35_tunings)
	(cortexa53_tunings, cortexa57_tun