[Bug target/113847] [14 Regression] 10% slowdown of 462.libquantum on AMD Ryzen 7700X and Ryzen 7900X

2024-03-07 law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113847

Jeffrey A. Law  changed:

   What|Removed |Added

 CC||law at gcc dot gnu.org
   Priority|P3  |P2

[Bug target/113847] [14 Regression] 10% slowdown of 462.libquantum on AMD Ryzen 7700X and Ryzen 7900X

2024-02-12 jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113847

--- Comment #6 from Martin Jambor  ---
(In reply to Richard Biener from comment #5)
> CCing also Martin who should know how/why IPA SRA doesn't reconstruct the
> component ref chain here

I have not had a look at this specific case (yet), but IPA-SRA just
doesn't (unlike intraprocedural SRA) and always creates MEM_REFs (in
callers).  I guess we could stream field offsets and/or array_ref
indices and attempt to reconstruct it for simple (non-union,
non-otherwise-overlapping) types, even if it would make the
ipa_adjusted_param type (and thus ipa_param_adjustments) slightly
bigger and add another vector.

> or why it chooses the dynamic type as it does
> (possibly local SRA when fully scalarizing an aggregate copy does the same).

That is unlikely.  Total scalarization in intraprocedural SRA just
follows the type of the decl whereas IPA-SRA (and intra-SRA too when
not totally scalarizing) takes all types from existing memory
accesses.
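
To make the caller-side effect concrete, here is a hand-written sketch in C
of roughly what the transformation amounts to.  It is not compiler output:
struct reg_like, collapse_orig, collapse_isra and call_via_isra are all
hypothetical names, and the struct is not the real quantum_reg layout.  The
point is only that the caller ends up loading each used field itself before
calling the clone, which is where the MEM_REFs show up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical aggregate, NOT the actual libquantum quantum_reg. */
struct reg_like {
  int width;
  int size;
  long node_count;
};

/* Original form: the callee receives the whole aggregate by value. */
long collapse_orig(int pos, struct reg_like reg)
{
  return (long)pos + reg.size + reg.node_count;
}

/* Hand-written stand-in for an ".isra" clone: only the fields that the
   body actually reads are passed, each as a separate scalar. */
long collapse_isra(int pos, int reg_size, long reg_node_count)
{
  return (long)pos + reg_size + reg_node_count;
}

/* Caller after the transformation: it performs the field loads itself.
   In GIMPLE these loads are what IPA-SRA emits as MEM_REFs. */
long call_via_isra(int pos, const struct reg_like *reg)
{
  assert(reg != NULL);
  int size = reg->size;
  long node_count = reg->node_count;
  return collapse_isra(pos, size, node_count);
}
```

Both forms compute the same value; the difference discussed in this bug is
only in how the caller-side loads are represented for alias analysis.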

[Bug target/113847] [14 Regression] 10% slowdown of 462.libquantum on AMD Ryzen 7700X and Ryzen 7900X

2024-02-12 rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113847

Richard Biener  changed:

   What|Removed |Added

 CC||jamborm at gcc dot gnu.org

--- Comment #5 from Richard Biener  ---
CCing also Martin who should know how/why IPA SRA doesn't reconstruct the
component ref chain here or why it chooses the dynamic type as it does
(possibly local SRA when fully scalarizing an aggregate copy does the same).

[Bug target/113847] [14 Regression] 10% slowdown of 462.libquantum on AMD Ryzen 7700X and Ryzen 7900X

2024-02-12 rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113847

Richard Biener  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
Hmm, the important one is actually MEM[ptr + CST] vs MEM[ptr].component.  But
those are not semantically equivalent, even when the same TBAA type is in
effect.

  _31 = MEM  [(struct quantum_reg *)reg_3(D)];
  _33 = MEM  [(struct quantum_reg *)reg_3(D) + 8B];
  _34 = MEM  [(struct quantum_reg *)reg_3(D) + 16B];
  _35 = MEM  [(struct quantum_reg *)reg_3(D) + 24B];
  out = quantum_state_collapse.isra (pos_1(D), result_22, _31, _32, _33, _34, _35); [return slot optimization]

this is from inlined quantum_state_collapse where IPA SRA is eventually
applied producing the above.

That we do produce those might hint that we can't really assume the
dynamic type quantum_reg is at offset 8, but that was the original intent.
What we are left with is the special-case where typeof (MEM[ptr + CST])
== typeof (alias-pointed-to-type) (with CST == 0).  For any other case
what we know is only that the access MEM[ptr + CST] is to somewhere
inside an object of dynamic type quantum_reg?

I'm not sure that's not less than we make use of in the alias-oracle,
esp. aliasing_component_refs_walk and friends?  We might be fine in
practice for "bare" MEM_REFs like the above, but if we ever fold only
part of the access path into the constant offset funny things may happen?

So I think IPA SRA does wrong here (and maybe GCC in other places as well),
possibly only pessimizing and possibly creating latent wrong-code.
Note quantum_state_collapse has

  reg$size_62 = reg.size;
  reg$node_75 = reg.node;
...

pre-IPA.
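
As a rough source-level analogue of the two access forms (a sketch only:
struct reg_like and both function names are hypothetical, the field layout
is an assumption and not taken from libquantum, and C cannot express the
TBAA type that a GIMPLE MEM_REF carries), the component form names the
field while the offset form only names a byte offset from the base pointer:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout chosen for illustration only. */
struct reg_like {
  int width;
  int size;
  int hashw;
  void *node;
  int *hash;
};

/* Analogue of MEM[ptr].component: the access path names the field. */
int load_component(const struct reg_like *reg)
{
  assert(reg != NULL);
  return reg->hashw;
}

/* Analogue of MEM[ptr + CST]: only a base pointer and a constant byte
   offset remain; the field path has been folded away.  memcpy keeps the
   C code well-defined regardless of aliasing rules. */
int load_offset(const struct reg_like *reg)
{
  int value;
  assert(reg != NULL);
  memcpy(&value, (const char *)reg + offsetof(struct reg_like, hashw),
         sizeof value);
  return value;
}
```

Both functions read the same bytes; what differs in the compiler is how
much of the access path is left for the alias oracle to reason about.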

Honza, any opinion?

[Bug target/113847] [14 Regression] 10% slowdown of 462.libquantum on AMD Ryzen 7700X and Ryzen 7900X

2024-02-12 rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113847

--- Comment #3 from Richard Biener  ---
I can't confirm a regression (testing r14-8925-g1e3f78dbb328a2 with the
offending rev reverted vs bare).

462.libquantum  20720   61.9   335 S   20720   62.6   331 *
462.libquantum  20720   62.2   333 *   20720   61.9   335 S
462.libquantum  20720   62.4   332 S   20720   62.7   330 S

so the "best" run with the change is faster than the best run with it reverted
while the worst runs are the same.
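
For reading the listing: assuming each half of a row is reference time, run
time and ratio (an assumption about the column layout, though the numbers
are consistent with it), the ratio is the reference time divided by the run
time, e.g. 20720 / 61.9 is about 335.  A minimal sketch of that arithmetic,
with hypothetical helper names:

```c
#include <assert.h>

/* Ratio as reference time divided by measured run time, both in seconds. */
double spec_ratio(double ref_seconds, double run_seconds)
{
  assert(run_seconds > 0.0);
  return ref_seconds / run_seconds;
}

/* Round to the nearest integer, as the ratios in the listing appear to be. */
long rounded_ratio(double ref_seconds, double run_seconds)
{
  return (long)(spec_ratio(ref_seconds, run_seconds) + 0.5);
}
```

A higher ratio means a faster run, so differences of a few points between
runs here are well under the 10% named in the bug title.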

There are only code-gen changes in quantum_bmeasure.part.0 and we can see
it's likely

{component_ref,mem_ref<0B>,reg_3(D)}@.MEM_166 (0030)

vs

{component_ref,mem_ref<0B>,reg_3(D)}@.MEM_9 (0022)

where once the size is 256 and once 64.  The types are

  constant 256>
unit-size  constant 32>

vs.

 
unit-size 

the former is subsetted by a COMPONENT_REF to eventually

 >
unsigned DI

so we have basically MEM vs. MEM.member-with-off.

That's indeed a case where we may want to avoid applying this fix, but
maybe only when strict-aliasing is in effect.

[Bug target/113847] [14 Regression] 10% slowdown of 462.libquantum on AMD Ryzen 7700X and Ryzen 7900X

2024-02-12 rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113847

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
   Last reconfirmed||2024-02-12
 Ever confirmed|0   |1

--- Comment #2 from Richard Biener  ---
I will try to investigate.  Note this was a correctness fix; it could be
relaxed a tiny bit, but behavior would then depend on the order of processing
of blocks not ordered by RPO.

[Bug target/113847] [14 Regression] 10% slowdown of 462.libquantum on AMD Ryzen 7700X and Ryzen 7900X

2024-02-10 fkastl at suse dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113847

Filip Kastl  changed:

   What|Removed |Added

   Keywords|needs-bisection |
 CC||rguenth at gcc dot gnu.org

--- Comment #1 from Filip Kastl  ---
Bisected to g:724b64304ff5c8ac08a913509afd6fde38d7b767 (I did the bisection on
a Ryzen 7900X).

[Bug target/113847] [14 Regression] 10% slowdown of 462.libquantum on AMD Ryzen 7700X and Ryzen 7900X

2024-02-09 rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113847

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0