[Bug target/113847] [14 Regression] 10% slowdown of 462.libquantum on AMD Ryzen 7700X and Ryzen 7900X
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113847

Jeffrey A. Law changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at gcc dot gnu.org
           Priority|P3                          |P2
--- Comment #6 from Martin Jambor ---
(In reply to Richard Biener from comment #5)
> CCing also Martin who should know how/why IPA SRA doesn't reconstruct the
> component ref chain here

I have not had a look at this specific case (yet), but IPA-SRA simply does
not reconstruct component ref chains (unlike intraprocedural SRA) and always
creates MEM_REFs (in callers).  I guess we could stream field offsets and/or
ARRAY_REF indices and attempt to reconstruct the chain for simple (non-union,
non-otherwise-overlapping) types, even if it would make the
ipa_adjusted_param type (and thus ipa_param_adjustments) slightly bigger and
add another vector.

> or why it chooses the dynamic type as it does
> (possibly local SRA when fully scalarizing an aggregate copy does the
> same).

That is unlikely.  Total scalarization in intraprocedural SRA just follows
the type of the decl, whereas IPA-SRA (and intra-SRA too, when not totally
scalarizing) takes all types from existing memory accesses.
Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org

--- Comment #5 from Richard Biener ---
CCing also Martin, who should know how/why IPA SRA doesn't reconstruct the
component ref chain here, or why it chooses the dynamic type as it does
(possibly local SRA when fully scalarizing an aggregate copy does the same).
Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #4 from Richard Biener ---
Hmm, the important one is actually MEM[ptr + CST] vs. MEM[ptr].component.
But those are not semantically equivalent, even when the same TBAA type is
in effect.

  _31 = MEM [(struct quantum_reg *)reg_3(D)];
  _33 = MEM [(struct quantum_reg *)reg_3(D) + 8B];
  _34 = MEM [(struct quantum_reg *)reg_3(D) + 16B];
  _35 = MEM [(struct quantum_reg *)reg_3(D) + 24B];
  out = quantum_state_collapse.isra (pos_1(D), result_22, _31, _32, _33,
                                     _34, _35); [return slot optimization]

This is from inlined quantum_state_collapse, where IPA SRA is eventually
applied, producing the above.  That we do produce those might hint that we
can't really assume the dynamic type quantum_reg is at offset 8, but that
was the original intent.

What we are left with is the special case where typeof (MEM[ptr + CST]) ==
typeof (alias-pointed-to-type) (with CST == 0).  For any other case, all we
know is that the access MEM[ptr + CST] is to somewhere inside an object of
dynamic type quantum_reg.  I'm not sure that isn't less than what we make
use of in the alias oracle, esp. aliasing_component_refs_walk and friends.
We might be fine in practice for "bare" MEM_REFs like the above, but if we
ever fold only part of the access path into the constant offset, funny
things may happen.

So I think IPA SRA does wrong here (and maybe GCC in other places as well),
possibly only pessimizing and possibly creating latent wrong-code.

Note quantum_state_collapse has

  reg$size_62 = reg.size;
  reg$node_75 = reg.node;
  ...

pre-IPA.

Honza, any opinion?
--- Comment #3 from Richard Biener ---
I can't confirm a regression (testing r14-8925-g1e3f78dbb328a2 with the
offending rev reverted vs. bare):

  462.libquantum  20720  61.9335  S    20720  62.6331  *
  462.libquantum  20720  62.2333  *    20720  61.9335  S
  462.libquantum  20720  62.4332  S    20720  62.7330  S

So the "best" run with the change is faster than the best run with it
reverted, while the worst runs are the same.

There are only code-gen changes in quantum_bmeasure.part.0, and we can see
it's likely

  {component_ref,mem_ref<0B>,reg_3(D)}@.MEM_166 (0030)

vs.

  {component_ref,mem_ref<0B>,reg_3(D)}@.MEM_9 (0022)

where once the size is 256 and once 64.  The types are a record of size
constant 256 (unit-size constant 32) vs. another type; the former is
subsetted by a COMPONENT_REF down to, eventually, an unsigned DI, so we have
basically MEM of the whole record vs. MEM.member-with-off.  That's indeed a
case where we'd maybe like to avoid applying this fix, but maybe only when
strict-aliasing is in effect.
Richard Biener changed:

           What    |Removed                      |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                  |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
   Last reconfirmed|                             |2024-02-12
     Ever confirmed|0                            |1

--- Comment #2 from Richard Biener ---
I will try to investigate.  Note this was a correctness fix; it could be
relaxed a tiny bit, but behavior would then depend on the order of
processing of blocks not ordered by RPO.
Filip Kastl changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|needs-bisection             |
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #1 from Filip Kastl ---
Bisected to g:724b64304ff5c8ac08a913509afd6fde38d7b767 (I did the bisection
on a Ryzen 7900X).
Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |14.0