[Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option

2024-04-08 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531

Rama Malladi  changed:

   What|Removed |Added

 CC||rvmallad at amazon dot com

--- Comment #8 from Rama Malladi  ---
Created attachment 57898
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57898&action=edit
Updated patch for `-finline-functions-aggressive` GCC option.

This updated patch adds a new GCC option: `-finline-functions-aggressive`. It
replaces the `-O3` inlining heuristics with an entry that implies
`OPT_finline_functions_aggressive` is enabled. It also adds an entry to
`invoke.texi` documenting that this option selects the same inlining
heuristics as `-O3`.

[Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option

2024-04-01 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531

--- Comment #7 from Rama Malladi  ---
(In reply to Rama Malladi from comment #5)
> (In reply to Andrew Pinski from comment #3)
> > Also do you have numbers with lto enabled? Or is these without lto?
> > 
> > Does LTO improve the situation for Envoy too?
> 
> These numbers are without lto. I haven't tried it but I can try and post an
> update.

I checked and found that the Envoy run was without LTO, but the SPEC CPU2017
intrate run was with LTO.

I tried building Envoy with LTO and the build failed. I need to debug that
issue further.

Below are the perf results without LTO, using gcc version 11.4.0 (Ubuntu
11.4.0-1ubuntu1~22.04).

copies=8          -O2     -Ofast   Gain w    -O2 + inlining   Gain w
                  noLTO   noLTO    Ofast     noLTO            inlining
500.perlbench_r   33.7    33.3      98.8%    33.2              98.5%
502.gcc_r         45.2    46.9     103.8%    46.3             102.4%
505.mcf_r         44.7    44.3      99.1%    44.6              99.8%
520.omnetpp_r     21.4    24.4     114.0%    21.3              99.5%
523.xalancbmk_r   41.6    45.5     109.4%    44               105.8%
525.x264_r        44.2    89       201.4%    43.9              99.3%
531.deepsjeng_r   32.8    32.8     100.0%    33.1             100.9%
541.leela_r       28.6    30.5     106.6%    30.3             105.9%
548.exchange2_r   64.1    64.6     100.8%    64.1             100.0%
557.xz_r          20.3    20.4     100.5%    20.3             100.0%
SPECrate..base    35.6    39.4     110.7%    36               101.1%
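
For anyone cross-checking the summary row: SPEC CPU2017 overall metrics are
geometric means of the per-benchmark values. The sketch below recomputes the
-O2 (noLTO) figure from the rates in the table above. The names `o2_rates` and
`geomean` are only illustrative and not part of any SPEC tooling.

```python
import math

# -O2 (noLTO) rates copied from the table above, one per benchmark.
o2_rates = {
    "500.perlbench_r": 33.7,
    "502.gcc_r": 45.2,
    "505.mcf_r": 44.7,
    "520.omnetpp_r": 21.4,
    "523.xalancbmk_r": 41.6,
    "525.x264_r": 44.2,
    "531.deepsjeng_r": 32.8,
    "541.leela_r": 28.6,
    "548.exchange2_r": 64.1,
    "557.xz_r": 20.3,
}


def geomean(values):
    """Geometric mean, computed in log space for numerical stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Compare the printed value against the SPECrate..base row of the table.
print(f"geomean of -O2 rates: {geomean(o2_rates.values()):.1f}")
```

The same helper can be applied to the other columns to check their summary
values.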

[Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option

2024-03-29 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531

--- Comment #5 from Rama Malladi  ---
(In reply to Andrew Pinski from comment #3)
> Also do you have numbers with lto enabled? Or is these without lto?
> 
> Does LTO improve the situation for Envoy too?

These numbers are without lto. I haven't tried it but I can try and post an
update.

[Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option

2024-03-29 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531

--- Comment #4 from Rama Malladi  ---
(In reply to Andrew Pinski from comment #1)
> Maybe we should figure out why the increase of the limits help and add extra
> code to get better heuristics rather than just tweaking the limits.
> 
> I know that there was some improvements for gcc 14 already for the
> heuristics for c++ code.

interesting... thank you.

[Bug driver/114531] New: Feature proposal for an `-finline-functions-aggressive` compiler option

2024-03-29 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531

Bug ID: 114531
   Summary: Feature proposal for an
`-finline-functions-aggressive` compiler option
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: driver
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rvmallad at amazon dot com
CC: rsandifo at gcc dot gnu.org
  Target Milestone: ---

Created attachment 57837
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57837&action=edit
patch to implement -finline-functions-aggressive option in GCC

This is a proposal for a user-visible GCC compiler option for aggressive
inlining that is currently only available at -O3 as internal inline parameters
(--param=early-inlining-insns=14 --param=inline-heuristics-hint-percent=600
--param=inline-min-speedup=15 --param=max-inline-insns-auto=30
--param=max-inline-insns-single=200).
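
Without such an option, the parameters have to be spelled out by hand at -O2.
As a rough sketch of what a build script would need to do (the parameter names
and values are the ones listed above; the helper name `inline_flags` is made
up for this example):

```python
# Inline parameters and values quoted in the proposal above.
O3_INLINE_PARAMS = {
    "early-inlining-insns": 14,
    "inline-heuristics-hint-percent": 600,
    "inline-min-speedup": 15,
    "max-inline-insns-auto": 30,
    "max-inline-insns-single": 200,
}


def inline_flags(params):
    """Turn a {name: value} mapping into GCC --param=name=value flags."""
    return [f"--param={name}={value}" for name, value in params.items()]


# -O2 plus the explicit inline parameters, ready to paste into CFLAGS.
cflags = ["-O2", *inline_flags(O3_INLINE_PARAMS)]
print(" ".join(cflags))
```

A single user-visible option would replace all five --param entries in such a
list.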

I collected perf data for Envoy (https://github.com/envoyproxy/envoy) and the
SPEC CPU2017 intrate benchmarks on a C7g.2xlarge instance with Ubuntu 22 and
gcc-11.4.0. We see perf gains of 2% - 5% when using these aggressive inline
parameters at -O2. Attached is a patch for this change.

We do not want to add these inline limits to -O2 itself, because one of the
SPEC CPU tests got slower with them. Also, more inline tuning at -O2 would make
some symbols unavailable for probing/debugging (symbols that are available
when these aggressive inline params are not used).

---
Envoy load_balancer_benchmark – using only 1 CPU – Iterations, higher better
$ bazel run -c opt //test/common/upstream:load_balancer_benchmark

bazel-envoy/external/local_config_cc/BUILD can be edited to add the inline
parameters/options.


Benchmark (iterations)                                   Baseline   O2 + Inline Params   Gain

benchmarkRoundRobinLoadBalancerBuild/500/50/50           1518       1596                 1.05x
benchmarkLeastRequestLoadBalancerChooseHost/100/3/1000   1465       1514                 1.03x
benchmarkRingHashLoadBalancerChooseHost/100/65536/10     33         34                   1.03x
benchmarkMaglevLoadBalancerWeighted/500/95/75/25/1       69         72                   1.04x


copies=8          "-O2"   "-Ofast"   Gain w    "-O2 +       Gain w
                                     Ofast     inlining"    inlining
500.perlbench_r   36.5    34.3        94.0%    34.4          94.2%
502.gcc_r         45.4    47.6       104.8%    47.5         104.6%
505.mcf_r         44.6    48.2       108.1%    44.3          99.3%
520.omnetpp_r     22.1    24.9       112.7%    21.9          99.1%
523.xalancbmk_r   43.8    46.3       105.7%    45.4         103.7%
525.x264_r        44.3    89         200.9%    43.8          98.9%
531.deepsjeng_r   36      37.3       103.6%    37.5         104.2%
541.leela_r       33.5    33.9       101.2%    34.2         102.1%
548.exchange2_r   65.4    76.6       117.1%    65.3          99.8%
557.xz_r          19.8    19.9       100.5%    19.9         100.5%
SPECrate..base    37.1    41.6       112.1%    37.3         100.5%
---
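
To make the "one test got slower" point concrete, the sketch below flags
benchmarks whose "-O2 + inlining" rate falls noticeably below the "-O2" rate in
the SPEC table above. The 2% cutoff is an assumption chosen for this example to
separate a clear slowdown from the sub-2% differences in the table, not a SPEC
rule, and `regressions` is a made-up helper name.

```python
# (benchmark, "-O2" rate, "-O2 + inlining" rate) copied from the table above.
rates = [
    ("500.perlbench_r", 36.5, 34.4),
    ("502.gcc_r", 45.4, 47.5),
    ("505.mcf_r", 44.6, 44.3),
    ("520.omnetpp_r", 22.1, 21.9),
    ("523.xalancbmk_r", 43.8, 45.4),
    ("525.x264_r", 44.3, 43.8),
    ("531.deepsjeng_r", 36.0, 37.5),
    ("541.leela_r", 33.5, 34.2),
    ("548.exchange2_r", 65.4, 65.3),
    ("557.xz_r", 19.8, 19.9),
]

# Assumed cutoff: report anything more than 2% slower than plain -O2.
THRESHOLD = 0.98


def regressions(rows, threshold=THRESHOLD):
    """Return benchmark names whose tuned/base ratio is below the threshold."""
    return [name for name, base, tuned in rows if tuned / base < threshold]


print(regressions(rates))
```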

[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized variables

2024-03-05 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696

--- Comment #5 from Rama Malladi  ---
Thank you Richard for this patch/ fix.

[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized variables

2024-01-29 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696

Rama Malladi  changed:

   What|Removed |Added

 CC||rvmallad at amazon dot com

--- Comment #2 from Rama Malladi  ---
Hi,
Can this be actioned/ fixed? We had a related issue and would like this fixed.
https://github.com/numpy/numpy/issues/25556

Thank you.
Rama

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2023-03-29 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #23 from Rama Malladi  ---
(In reply to Rama Malladi from comment #22)
> I will close this issue as we were unable to reproduce the perf drop going
> from gcc-7 to gcc-8 on a Graviton2 based instance. The performance of
> 519.lbm_r built with gcc-7.4 was same as that with gcc-8.5.

Can someone from the GCC dev/ regression team close this issue as I am unable
to find an option for the same? Thanks

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2023-03-29 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #22 from Rama Malladi  ---
I will close this issue as we were unable to reproduce the perf drop going from
gcc-7 to gcc-8 on a Graviton2 based instance. The performance of 519.lbm_r
built with gcc-7.4 was same as that with gcc-8.5.

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2023-02-24 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #21 from Rama Malladi  ---
I did another triage for perf loss on Graviton 2 processor (neoverse-n1) based
instance and found this commit: `a9a4edf0e71bbac9f1b5dcecdcf9250111d16889` to
be the reason. As I had indicated in my earlier reply, I was doing a triage of
perf loss going from gcc-7 to gcc-10.

The perf of 519.lbm_r 1-copy run improved 1.08x with the revert of commit:
`a9a4edf0e71bbac9f1b5dcecdcf9250111d16889` on gcc-mainline (
`2f1691be517fcdcabae9cd671ab511eb0e08b1d5`).

I am guessing that we don't see it on LNT/ Altra CPUs.

So, please look into a fix for this issue. Let me know if you have any queries.
Thanks.

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2023-02-20 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #20 from Rama Malladi  ---
@Martin J and @Sebastian P, Let me walk you through the perf data and my
triage.

First, my triage has been on Graviton 3 (neoverse-v1) processor based
instances. Next, I was looking for the perf delta going from gcc-7 to gcc-10. I
found 2 issues: one was reported in 107413
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413) and fixed (the perf delta
between gcc-7 and gcc-8 -- 215s vs. 266s); the other is the issue reported
here.

I did another triage and landed at the same commit that I reported earlier.

# first bad commit: [a9a4edf0e71bbac9f1b5dcecdcf9250111d16889] Update
max_bb_count in execute_fixup_cfg

Please let me know any further info/ studies you would like to see on this
report.

Thank you.

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2023-02-02 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #19 from Rama Malladi  ---
Thanks @Sebastian and @Martin J. I will get another bisect between GCC 7-and-8.

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2023-01-08 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #15 from Rama Malladi  ---
Hi, Can we review this issue and suggest next steps/ action please? Thanks.

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2022-12-12 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #14 from Rama Malladi  ---
(In reply to Martin Liška from comment #13)
> Note the mentioned revision is a fix and yes, sometimes these revisions can
> end up with a regression as profile estimation is a complex guess.

Yes, possibly. So, from my understanding, update_max_bb_count() tracks the
max basic block count, and the decision to inline or not in this case/run
depends on it. That is likely why we see a larger instruction count with this
function executed/enabled.

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2022-12-09 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #12 from Rama Malladi  ---
I found differences in the dumps at various stages of compilation between
mainline GCC and a build with update_max_bb_count() commented out. Here are
the details:

Mainline: Commit ID: 63a42ffc0833553fbcb84b50cf0fd2d867b8a92f

The dumps differed for these 2 stages:
"einline" and "earlydebug"

Since we use LTO for this build of 519.lbm_r, I also found differences in
these stages of the link-time optimizer:
"vect", "slp1", "ivopts", "earlydebug", "debug"

Also, this perf drop of 5%-6% with the update_max_bb_count() code was observed
only on ARM64 instances (Graviton3) and not on x86_64 instances (Intel Xeon).

I ran the other SPEC cpu2017_fprate benchmarks on ARM64 with this code
commented out on GCC mainline and I haven't observed any perf regression. So,
it may be worth a fix.

Thank you.

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2022-12-08 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #11 from Rama Malladi  ---
(In reply to Martin Liška from comment #10)
> @Honza ?

Just checking if this can be fixed/ implemented. Thanks.

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-12-01 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #19 from Rama Malladi  ---
(In reply to Wilco from comment #17)
> (In reply to Rama Malladi from comment #16)
> > (In reply to Wilco from comment #15)
> > > (In reply to Rama Malladi from comment #14)
> > > > This fix also improved performance of 538.imagick_r by 15%. Did you have a
> > > > similar observation? Thank you.
> > > 
> > > No, but I was using -mcpu=neoverse-n1 as my baseline. It's possible
> > > -mcpu=neoverse-v1 shows larger speedups, what gain do you get on the overall
> > > FP score?
> > 
> > I was using -mcpu=native and run on a Neoverse V1 arch (Graviton3). Here are
> > the scores I got (relative gains of latest mainline vs. an earlier mainline).
> > 
> > Latest mainline: 0976b012d89e3d819d83cdaf0dab05925b3eb3a0
> > Earlier mainline: f896c13489d22b30d01257bc8316ab97b3359d1c
> 
> Right that's about 3 weeks of changes, I think
> 1b9a5cc9ec08e9f239dd2096edcc447b7a72f64a has improved imagick_r.
> 
> > geomean 1.03
> 
> That's a nice gain in 3 weeks!

Hi Wilco, Could you backport the change to active release branches? Thanks.

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-12-01 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #18 from Rama Malladi  ---
(In reply to Wilco from comment #17)
> (In reply to Rama Malladi from comment #16)
> > (In reply to Wilco from comment #15)
> > > (In reply to Rama Malladi from comment #14)
> > > > This fix also improved performance of 538.imagick_r by 15%. Did you have a
> > > > similar observation? Thank you.
> > > 
> > > No, but I was using -mcpu=neoverse-n1 as my baseline. It's possible
> > > -mcpu=neoverse-v1 shows larger speedups, what gain do you get on the overall
> > > FP score?
> > 
> > I was using -mcpu=native and run on a Neoverse V1 arch (Graviton3). Here are
> > the scores I got (relative gains of latest mainline vs. an earlier mainline).
> > 
> > Latest mainline: 0976b012d89e3d819d83cdaf0dab05925b3eb3a0
> > Earlier mainline: f896c13489d22b30d01257bc8316ab97b3359d1c
> 
> Right that's about 3 weeks of changes, I think
> 1b9a5cc9ec08e9f239dd2096edcc447b7a72f64a has improved imagick_r.
> 
> > geomean 1.03
> 
> That's a nice gain in 3 weeks!

Yes, indeed :-) ... Thank you.

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2022-11-30 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #9 from Rama Malladi  ---
(In reply to Martin Liška from comment #3)
> Can you please share perf-profile before and after the revision?
> 
> Note I can't see it for Altra aarch64 CPU:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0.
> 1=683.477.0=664.477.0=648.477.0=618.477.0=605.
> 477.0=759.477.0=584.477.0&
> 
> However, there are huge changes in between GCC 6/7 and a newer releases.
> Note the benchmark is pretty small and very sensitive to instruction caches.

Hi, I got IPC data for baseline version of compiler and with this patch
reverted.

This is on Graviton3 processor machine, executing 1-copy rate run of 519.lbm_r.

Baseline: Compiler commit ID: f896c13489d22b30d01257bc8316ab97b3359d1c
Cycles:        148,489,372,938
Instructions:  382,748,379,257
IPC:           2.58

Baseline with code change in a9a4edf0e71bbac9f1b5dcecdcf9250111d16889 reverted.

$ git diff gcc/tree-cfg.cc
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index d982988048f..736432713fe 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -9984,7 +9984,7 @@ execute_fixup_cfg (void)
 }
   if (scale)
 {
-  update_max_bb_count ();
+//  update_max_bb_count ();
   compute_function_frequency ();
 }

Cycles:        140,937,228,769
Instructions:  368,881,714,982
IPC:           2.62

From the above, I see that more instructions are executed with the baseline
compiler code-gen than with the patch reverted. Can you please look into the
code-gen and let me know if you find some perf opportunity with this patch
revert? Thank you.
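
For reference, the IPC values and the relative deltas can be recomputed from
the raw counters reported above. This is only a small sketch; the dictionary
and function names are illustrative.

```python
# Counter values copied from the two 519.lbm_r runs reported above.
baseline = {"cycles": 148_489_372_938, "instructions": 382_748_379_257}
reverted = {"cycles": 140_937_228_769, "instructions": 368_881_714_982}


def ipc(run):
    """Instructions per cycle for one run."""
    return run["instructions"] / run["cycles"]


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for name, run in (("baseline", baseline), ("reverted", reverted)):
    print(f"{name}: IPC {ipc(run):.2f}")

for counter in ("instructions", "cycles"):
    delta = pct_change(baseline[counter], reverted[counter])
    print(f"{counter}: {delta:+.1f}% with the patch reverted")
```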

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-11-29 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #16 from Rama Malladi  ---
(In reply to Wilco from comment #15)
> (In reply to Rama Malladi from comment #14)
> > This fix also improved performance of 538.imagick_r by 15%. Did you have a
> > similar observation? Thank you.
> 
> No, but I was using -mcpu=neoverse-n1 as my baseline. It's possible
> -mcpu=neoverse-v1 shows larger speedups, what gain do you get on the overall
> FP score?

I was using -mcpu=native and run on a Neoverse V1 arch (Graviton3). Here are
the scores I got (relative gains of latest mainline vs. an earlier mainline).

Latest mainline: 0976b012d89e3d819d83cdaf0dab05925b3eb3a0
Earlier mainline: f896c13489d22b30d01257bc8316ab97b3359d1c

fp 1-copy rate    Ratio
503.bwaves_r      0.98
507.cactuBSSN_r   1.00
508.namd_r        0.97
510.parest_r      NA
511.povray_r      NA
519.lbm_r         1.16
521.wrf_r         1.00
526.blender_r     0.99
527.cam4_r        NA
538.imagick_r     1.17
544.nab_r         1.01
549.fotonik3d_r   NA
554.roms_r        1.00
geomean           1.03

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-11-29 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #14 from Rama Malladi  ---
This fix also improved performance of 538.imagick_r by 15%. Did you have a
similar observation? Thank you.

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-11-28 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #13 from Rama Malladi  ---
(In reply to CVS Commits from comment #12)
> The master branch has been updated by Wilco Dijkstra :
> 
> https://gcc.gnu.org/g:0c1b0a23f1fe7db6a2e391b7cb78cff90032
> 
> commit r13-4291-g0c1b0a23f1fe7db6a2e391b7cb78cff90032
> Author: Wilco Dijkstra 
> Date:   Wed Nov 23 17:27:19 2022 +
> 
> AArch64: Add fma_reassoc_width [PR107413]
> 
> Add a reassociation width for FMA in per-CPU tuning structures. Keep
> the existing setting of 1 for cores with 2 FMA pipes (this disables
> reassociation), and use 4 for cores with 4 FMA pipes.  This improves
> SPECFP2017 on Neoverse V1 by ~1.5%.
> 
> gcc/
> PR tree-optimization/107413
> * config/aarch64/aarch64.cc (struct tune_params): Add
> fma_reassoc_width to all CPU tuning structures.
> (aarch64_reassociation_width): Use fma_reassoc_width.
> * config/aarch64/aarch64-protos.h (struct tune_params): Add
> fma_reassoc_width.

Thank you for this code change/ fix. I will attempt a run with this change.
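
As a conceptual illustration only (this is not GCC code, and `dot_chain` is a
made-up name), the idea behind a reassociation width can be sketched as
splitting one serial multiply-add chain into `width` independent accumulators.
Python runs this sequentially, so the sketch shows only how the arithmetic is
regrouped, not any speedup.

```python
def dot_chain(a, b, width=1):
    """Sum of products using `width` independent accumulators.

    width=1 keeps a single serial dependency chain. Larger widths split the
    work into separate chains that a core with several FMA pipes could
    overlap.
    """
    acc = [0.0] * width
    for i, (x, y) in enumerate(zip(a, b)):
        acc[i % width] += x * y  # one multiply-add per element
    return sum(acc)  # final reduction of the partial sums


a = [float(i) for i in range(1, 9)]
b = [2.0] * 8

# Small integer-valued inputs are exact in floating point, so both groupings
# agree here. In general, regrouping floating-point sums can change rounding.
print(dot_chain(a, b, width=1), dot_chain(a, b, width=4))
```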

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-11-06 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #11 from Rama Malladi  ---
(In reply to Wilco from comment #10)
> I'm seeing about 1.5% gain on Neoverse V1 and 0.5% loss on Neoverse N1. I'll
> post a patch that allows per-CPU settings for FMA reassociation, so you'll
> get good performance with -mcpu=native. However reassociation really needs
> to be taught about the existence of FMAs.

Thank you very much Wilco.

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-11-02 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #9 from Rama Malladi  ---
(In reply to Rama Malladi from comment #8)
> (In reply to Wilco from comment #7)
> > The revert results in about 0.5% loss on Neoverse N1, so it looks like the
> > reassociation pass is still splitting FMAs into separate MUL and ADD (which
> > is bad for narrow cores).
> 
> Thank you for checking on N1. Did you happen to check on V1 too to reproduce
> the perf results I had? Any other experiments/ tests I can do to help on
> this filing? Thanks again for the debug/ fix.

I ran SPEC cpu2017 fprate 1-copy benchmark built with the patch reverted and
using option 'neoverse-n1' on the Graviton 3 processor (which has support for
SVE). The performance was up by 0.4%, primary contributor being 519.lbm_r which
was up 13%.

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-11-01 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #8 from Rama Malladi  ---
(In reply to Wilco from comment #7)
> The revert results in about 0.5% loss on Neoverse N1, so it looks like the
> reassociation pass is still splitting FMAs into separate MUL and ADD (which
> is bad for narrow cores).

Thank you for checking on N1. Did you happen to check on V1 too to reproduce
the perf results I had? Any other experiments/ tests I can do to help on this
filing? Thanks again for the debug/ fix.

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-28 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #6 from Rama Malladi  ---
The compilation options were: -Ofast -mcpu=native -flto

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-28 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #5 from Rama Malladi  ---
(In reply to Wilco from comment #2)
> That's interesting - if the reassociation pass has become a bit smarter in
> the last 5 years, we might no longer need this workaround. What is the
> effect on the overall SPECFP score? Did you try other values like
> fp_reassoc_width = 2 or 3?

Here is the SPEC cpu2017 fprate perf data for a 1-copy rate run. The runs were
done on a c7g.16xlarge AWS cloud instance.

Benchmark         w fix
-----------------------
503.bwaves_r      0.98
507.cactuBSSN_r   NA
508.namd_r        0.97
510.parest_r      NA
511.povray_r      1.01
519.lbm_r         1.16
521.wrf_r         1.00
526.blender_r     NA
527.cam4_r        1.00
538.imagick_r     0.99
544.nab_r         1.00
549.fotonik3d_r   NA
554.roms_r        1.00
geomean           1.01

The baseline was gcc version 12.2.0 (GCC) compiler. Fix was revert of code
change in commit: b5b33e113434be909e8a6d7b93824196fb6925c0.

So, looks like we aren't impacted much with this commit revert.

I haven't yet tried fp_reassoc_width. Will try shortly.

[Bug c++/107433] 510.parest_r, call of overloaded 'back_interpolate' is ambiguous

2022-10-27 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107433

--- Comment #2 from Rama Malladi  ---
(In reply to Martin Liška from comment #1)
> As mentioned slightly here:
> https://www.spec.org/cpu2017/Docs/benchmarks/510.parest_r.html
> please use -std=c++98 or something < c++17.

Thank you. I had it for C compiler. Will add it to C++ compiler command-line
too.

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-27 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #8 from Rama Malladi  ---
(In reply to Mark Wielaard from comment #7)
> The content of attachment 53773 [details] has been deleted for the following
> reason:
> 
> https://sourceware.org/pipermail/overseers/2022q4/019048.html

Thank you.

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-27 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #6 from Rama Malladi  ---
(In reply to Martin Liška from comment #5)
> Please try writing here: overse...@sourceware.org

I have asked for deletion. Thanks

[Bug c/107433] New: 510.parest_r, call of overloaded 'back_interpolate' is ambiguous

2022-10-27 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107433

Bug ID: 107433
   Summary: 510.parest_r, call of overloaded 'back_interpolate' is
ambiguous
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rvmallad at amazon dot com
  Target Milestone: ---

$ g++ -mabi=lp64 -c -o source/fe/fe_tools.o -DSPEC -DNDEBUG -Iinclude -I.
-DSPEC_AUTO_SUPPRESS_OPENMP -g -O3 -mcpu=native -fpermissive  -DSPEC_LP64
source/fe/fe_tools.cc

source/fe/fe_tools.cc:1301:21: error: call of overloaded
'back_interpolate(const dealii::DoFHandler<3, 3>&, const
dealii::BlockVector&, const dealii::FiniteElement<3, 3>&,
dealii::BlockVector&)' is ambiguous
 1301 | back_interpolate(dof1, u1, dof2.get_fe(), u1_interpolated);
  | ^~

$ /home/ubuntu/gccmainline/bin/g++  -v
Using built-in specs.
COLLECT_GCC=/home/ubuntu/gccmainline/bin/g++
COLLECT_LTO_WRAPPER=/home/ubuntu/gccmainline/libexec/gcc/aarch64-unknown-linux-gnu/13.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../configure --prefix=/home/ubuntu/gccmainline
--enable-languages=c,fortran
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 13.0.0 20221026 (experimental) (GCC)

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-27 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #4 from Rama Malladi  ---
Hi Martin,
Thanks for the guidance. Can we delete the attachment from this bug report?

Regards,
Rama

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-26 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #3 from Rama Malladi  ---
I will get the effect of this revert for the overall SPEC FP score. I haven't
tried experimenting with fp_reassoc_width values. Will try it and update.

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-26 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #1 from Rama Malladi  ---
$ /home/ubuntu/gccfixissue2/bin/gcc  -v
Using built-in specs.
COLLECT_GCC=/home/ubuntu/gccfixissue2/bin/gcc
COLLECT_LTO_WRAPPER=/home/ubuntu/gccfixissue2/libexec/gcc/aarch64-unknown-linux-gnu/13.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../configure --prefix=/home/ubuntu/gccfixissue2
--enable-languages=c,fortran
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 13.0.0 20221021 (experimental) (GCC)

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-26 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

--- Comment #1 from Rama Malladi  ---
$ /home/ubuntu/gccfixissue1/bin/gcc  -v
Using built-in specs.
COLLECT_GCC=/home/ubuntu/gccfixissue1/bin/gcc
COLLECT_LTO_WRAPPER=/home/ubuntu/gccfixissue1/libexec/gcc/aarch64-unknown-linux-gnu/13.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../configure --prefix=/home/ubuntu/gccfixissue1
--enable-languages=c,fortran
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 13.0.0 20221021 (experimental) (GCC)

[Bug tree-optimization/107413] New: Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-26 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

Bug ID: 107413
   Summary: Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rvmallad at amazon dot com
  Target Milestone: ---

Created attachment 53775
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53775&action=edit
Input and source files.

Below is some perf data from executing the 519.lbm_r benchmark on the aarch64
architecture (Graviton 3 processor). It compares the baseline perf (mainline
commit ID: f56d48b2471c388401174029324e1f4c4b84fcdb) against a fix for the
same (reverting the code change in commit ID:
b5b33e113434be909e8a6d7b93824196fb6925c0).

Steps to compile:
$ gcc -std=c99 -mabi=lp64 -g -Ofast -mcpu=native lbm.i main.i -lm -flto -o
519_lbm_r_base

$ time ./519_lbm_r_base 3000 reference.dat 0 0 100_100_130_ldc.of
real    2m50.946s

Reverting the code changes in commit ID:
b5b33e113434be909e8a6d7b93824196fb6925c0
$ time ./519_lbm_r_fix 3000 reference.dat 0 0 100_100_130_ldc.of
real    2m27.157s

The code change reverted was:
[AArch64] PR84114: Avoid reassociating FMA

Author: Wilco Dijkstra 
Date:   Mon Mar 5 14:40:55 2018 +

Please find attached the files to reproduce this issue and the fix.
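
The percentage in the summary follows from the two wall-clock times above. A
small sketch that parses the `time` output format and computes the delta (the
helper name `parse_real` is made up for this example):

```python
import re


def parse_real(text):
    """Convert a `time` value such as '2m50.946s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


base = parse_real("2m50.946s")  # 519_lbm_r_base
fix = parse_real("2m27.157s")   # 519_lbm_r_fix

print(f"run time reduced by {(base - fix) / base * 100:.1f}%")
print(f"speedup: {base / fix:.2f}x")
```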

[Bug tree-optimization/107409] New: Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-26 Thread rvmallad at amazon dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

Bug ID: 107409
   Summary: Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rvmallad at amazon dot com
  Target Milestone: ---

Created attachment 53773
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53773&action=edit
Input and source files.

Below is some perf data from executing the 519.lbm_r benchmark on the aarch64
architecture (Graviton 3 processor). It compares the baseline perf (mainline
commit ID: f56d48b2471c388401174029324e1f4c4b84fcdb) against a fix for the
same (reverting the code change in commit ID:
a9a4edf0e71bbac9f1b5dcecdcf9250111d16889).

Steps to compile:
$ gcc -std=c99 -mabi=lp64 -g -Ofast -mcpu=native lbm.i main.i -lm -flto -o
519_lbm_r_base

$ time ./519_lbm_r_base 3000 reference.dat 0 0 100_100_130_ldc.of
real    2m50.946s

Reverting the code changes in commit ID:
a9a4edf0e71bbac9f1b5dcecdcf9250111d16889
$ time ./519_lbm_r_fix 3000 reference.dat 0 0 100_100_130_ldc.of
real    2m42.091s

The code change reverted was in the following file:
* tree-cfg.c (execute_fixup_cfg): Update also max_bb_count when scaling happen.

Author: Jan Hubicka 
Date:   Sat Nov 30 22:25:24 2019 +0100

Please find attached the files to reproduce this issue and the fix.