[Bug tree-optimization/116463] [15 Regression] complex multiply vectorizer detection failures after r15-3087-gb07f8a301158e5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #35 from Tamar Christina ---
(In reply to Richard Biener from comment #34)
> Possibly first computing a lattice val for each SSA name whether its origin
> is a "real" or a "imag" component of a complex load could get us meta but
> even then the individual sorting which determines the initial association to
> SLP nodes would be only possible to adjust "cross-lane" (and to what? I
> guess combine real+imag parts? Hopefully of the same entity). Into vect we
> get with
>
>   _19 = REALPART_EXPR <*_3>;
>   _18 = IMAGPART_EXPR <*_3>;
>   _5 = a_14(D) + _2;
>   _23 = REALPART_EXPR <*_5>;  // real
>   _24 = IMAGPART_EXPR <*_5>;  // imag
>   _26 = b$real_11 * _23;      // real?
>   _27 = _24 * _53;            // imag?
>   _28 = _23 * _53;            // mixed? but fed into imag
>   _29 = b$real_11 * _24;      // mixed?
>   _7 = _18 - _28;             // mixed? or imag?
>   _22 = _27 - _26;            // mixed?
>   _32 = _19 + _22;            // mixed? or real?
>   _33 = _7 - _29;             // mixed? but fed into real?
>   REALPART_EXPR <*_3> = _32;
>   IMAGPART_EXPR <*_3> = _33;
>
> so not sure if that will help. That we'd like to have full load groups
> is unfortunately only visible a node deeper. We could also fill a lattice
> with group IDs but we'd need to know parts to identify lane duplicates vs.
> full groups. It's also a lot of hassle for not much gain and very special
> cases?

That should help, because all it's after is that the final loads be permuted.

The reason I'm keen to fix this is that it's not that niche. complex-add,
the operation being just +/- with a permute, is by far the most common one.
Not recognizing it costs e.g. 10% on fotonik in SPECCPU2017 FP, which is
also a regression I'm trying to fix.

I can try to reduce a testcase for that to see if maybe that specific one is
easier to fix. I'm just wondering if we can't do better in the future; e.g.
LLVM recognizes both fms180snd cases, for instance.

If it's easier, I could see if we can just have another pattern to discover
fmul + fcadd? Could maybe work and fix the SPEC regression... need to make a
testcase.
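For readers unfamiliar with the instruction the comment keeps coming back to: below is a scalar sketch of what one complex lane of an AArch64 FCADD with rotation #270 computes — the "just +/- with a permute" structure that makes complex-add the most common pattern. The struct and function names here are illustrative, not ISA or GCC code.

```cpp
#include <cassert>

// Illustrative scalar model of one complex element of AArch64 FCADD with
// rotation #270: the result's real lane adds the other operand's *imag*
// lane, and the imag lane subtracts the other operand's *real* lane --
// i.e. add/subtract combined with a lane permute.
struct cplx { double re, im; };

cplx fcadd_rot270(cplx n, cplx m)
{
    return { n.re + m.im,   // lane 0 reads the permuted (imag) lane of m
             n.im - m.re }; // lane 1 reads the permuted (real) lane of m
}
```

The #90 rotation variant flips the two signs; either way no multiply is involved, which is why missing the pattern only costs permutes, not arithmetic.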
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #34 from Richard Biener ---
Possibly first computing a lattice val for each SSA name whether its origin
is a "real" or a "imag" component of a complex load could get us meta but
even then the individual sorting which determines the initial association
to SLP nodes would be only possible to adjust "cross-lane" (and to what? I
guess combine real+imag parts? Hopefully of the same entity). Into vect we
get with

  _19 = REALPART_EXPR <*_3>;
  _18 = IMAGPART_EXPR <*_3>;
  _5 = a_14(D) + _2;
  _23 = REALPART_EXPR <*_5>;  // real
  _24 = IMAGPART_EXPR <*_5>;  // imag
  _26 = b$real_11 * _23;      // real?
  _27 = _24 * _53;            // imag?
  _28 = _23 * _53;            // mixed? but fed into imag
  _29 = b$real_11 * _24;      // mixed?
  _7 = _18 - _28;             // mixed? or imag?
  _22 = _27 - _26;            // mixed?
  _32 = _19 + _22;            // mixed? or real?
  _33 = _7 - _29;             // mixed? but fed into real?
  REALPART_EXPR <*_3> = _32;
  IMAGPART_EXPR <*_3> = _33;

so not sure if that will help. That we'd like to have full load groups
is unfortunately only visible a node deeper. We could also fill a lattice
with group IDs but we'd need to know parts to identify lane duplicates vs.
full groups. It's also a lot of hassle for not much gain and very special
cases?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #33 from Richard Biener ---
I see. Note it's SLP discovery's association code that figures out a SLP
graph; disabling this ends up with single-lane (store-lanes) from the start.
The association that "succeeds" first wins, and it's an unfortunate one (for
SLP pattern detection). The thing is that the re-association greedily
figures the best operand order as well. We start with

t.c:3:21: note: pre-sorted chains of plus_expr
  plus_expr _19 plus_expr _27 minus_expr _26
  plus_expr _18 minus_expr _29 minus_expr _28

and if we'd start with

  plus_expr _19 plus_expr _27 minus_expr _26
  plus_expr _18 minus_expr _28 minus_expr _29

instead we get the desired SLP pattern match but still store-lanes is
preferred it seems (not sure how we got away with no store-lanes in GCC 13).
We could simply refuse to override the SLP graph with load/store-lanes when
patterns were found:

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3892e1be3f2..4fb57a76f85 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -5064,7 +5065,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size,
      to cancel SLP when this applied to all instances in a loop but
      now we decide this per SLP instance.  It's important to do this
      only after SLP pattern recognition.  */
-  if (is_a (vinfo))
+  if (!pattern_found && is_a (vinfo))
     FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
       if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_store
           && !SLP_INSTANCE_TREE (instance)->ldst_lanes)

when starting with the swapped ops above we then get the desired code again.
I've hacked that in with

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3892e1be3f2..4fb57a76f85 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2275,6 +2275,7 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
   /* 1. pre-sort according to def_type and operation.  */
   for (unsigned lane = 0; lane < group_size; ++lane)
     chains[lane].stablesort (dt_sort_cmp, vinfo);
+  std::swap (chains[1][2], chains[1][1]);
   if (dump_enabled_p ())
     {
       dump_printf_loc (MSG_NOTE, vect_location,

it happens that in this specific case the optimal operand order matches stmt
order so the following produces that - but I'm not positively sure that's
always good (though the 'stablesort' also tries to not disturb order - but
in this case it's the DFS order collecting the scalar ops). In reality
there's not enough info on the op or its definition to locally decide a
better order for future pattern matching.

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3892e1be3f2..f21e8b909ff 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1684,7 +1684,12 @@ dt_sort_cmp (const void *op1_, const void *op2_, void *)
   auto *op2 = (const chain_op_t *) op2_;
   if (op1->dt != op2->dt)
     return (int)op1->dt - (int)op2->dt;
-  return (int)op1->code - (int)op2->code;
+  if ((int)op1->code != (int)op2->code)
+    return (int)op1->code - (int)op2->code;
+  if (TREE_CODE (op1->op) == SSA_NAME && TREE_CODE (op2->op) == SSA_NAME)
+    return (gimple_uid (SSA_NAME_DEF_STMT (op1->op))
+            - gimple_uid (SSA_NAME_DEF_STMT (op2->op)));
+  return 0;
 }

 /* Linearize the associatable expression chain at START with the

That said, I don't have a good idea on how to make this work better, not
even after re-doing SLP discovery. Maybe SLP patterns need to work on the
initial single-lane SLP graph? But then we'd have to find lane-matches on
two unconnected SLP sub-graphs which complicates the pattern matching part.
We basically form SLP nodes from two sets of (two lanes) plus/minus ops
(three each) but we of course try to avoid SLP build of all 3! permutations
possible and stop at the first one that succeeds.
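The dt_sort_cmp tie-breaker discussed above can be modeled standalone. In this sketch, chain_op stands in for chain_op_t and uid stands in for gimple_uid of the SSA name's defining statement (all names are simplified stand-ins, not GCC code): within equal (dt, code) groups, ordering by defining-statement UID makes the stable sort emit statement order, e.g. putting the op defined by _28 before _29.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified model of a chain operand: definition type, operation code
// (say 0 = plus_expr, 1 = minus_expr), and the program-order UID of the
// defining statement (stand-in for gimple_uid (SSA_NAME_DEF_STMT (op))).
struct chain_op {
    int dt;
    int code;
    int uid;
};

// Mirrors the proposed dt_sort_cmp: dt first, then code, then -- the new
// tie-breaker -- defining-statement order.
static bool dt_sort_less(const chain_op &a, const chain_op &b)
{
    if (a.dt != b.dt) return a.dt < b.dt;
    if (a.code != b.code) return a.code < b.code;
    return a.uid < b.uid;
}

std::vector<chain_op> sort_chain(std::vector<chain_op> v)
{
    // stable_sort models the vec::stablesort call in vect_build_slp_tree_2.
    std::stable_sort(v.begin(), v.end(), dt_sort_less);
    return v;
}
```

Without the uid comparison the two minus_expr entries keep their DFS collection order (_29 before _28); with it the chain comes out in statement order, which is what happened to match the pattern here.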
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #32 from Tamar Christina ---
(In reply to Richard Biener from comment #31)
> (In reply to Tamar Christina from comment #29)
> > (In reply to Tamar Christina from comment #27)
> > > > > We DO already impose any order on them, but the other operand is
> > > > > oddodd, so the overall order ends up being oddodd because any known
> > > > > permute overrides unknown ones.
> > > >
> > > > So what's the desired outcome? I guess PERM_UNKNOWN? I guess it's
> > > > the "other operand" of an add? What's the (bad) effect of classifying
> > > > it as ODDODD (optimistically)?
> > > >
> > > > > So the question is, can we not follow externals in a constructor to
> > > > > figure out if how they are used they all read from the same base
> > > > > and in which order?
> > > >
> > > > I don't see how it makes sense to do this. For the above example,
> > > > what's the testcase exhibiting this (and on which arch)?
> > >
> > > I've been working on a fix from a different angle for this, which also
> > > covers another GCC 14 regression that went unnoticed. I'll post after
> > > regressions finish.
> >
> > So I've formalized the handling of TOP a bit better. Which gets it to
> > recognize it again, however, it will be dropped as it's not profitable.
> >
> > The reason it's not profitable is the canonicalization issue mentioned
> > above. This has split the imaginary and real nodes into different
> > computations.
> >
> > So no matter what you do in the SLP tree, the attached digraph won't make
> > the loads of _5 linear. Are you ok with me trying that Richi?
>
> I can't make sense of that graph - the node feeding the store seems to have
> wrong scalar stmts?
>
> What's the testcase for this (and on what arch?).

void fms_elemconjsnd(_Complex TYPE a[restrict N], _Complex TYPE b,
                     _Complex TYPE c[restrict N])
{
  for (int i = 0; i < N; i++)
    c[i] -= a[i] * ~b;
}

compiled with -Ofast -march=armv8.3-a

#define TYPE double
#define I 1.0i
#define N 200

void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  for (int i=0; i < N; i++)
    c[i] -= a[i] * (b[i] * I * I);
}

void fms180snd_1 (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                  _Complex TYPE c[restrict N])
{
  _Complex TYPE t = I;
  for (int i=0; i < N; i++)
    c[i] -= a[i] * (b[i] * t * t);
}

is another one, where they are the same thing, but the first one is matched
and the second one isn't.

> But yes, the loads of *5 won't get linear here, but at least the
> permute node feeding the complex-add-rot270 can be elided, eventually
> even making the external _53, b$real_11 match the other with different
> order (though we don't model that, cost-wise).

But without the loads getting linearized the match will never work, as
multi-lane SLP will be immediately cancelled because it assumed load-lanes
is cheaper (it's not, but load-lanes doesn't get costed), and that's why
there's a load permute optimization step after complex pattern matching.

The point is however, that no permute is needed. *not even for the loads*.

GCC 13 generated:

fms_elemconjsnd:
        fneg    d1, d1
        mov     x2, 0
        dup     v4.2d, v0.d[0]
        dup     v3.2d, v1.d[0]
.L2:
        ldr     q1, [x0, x2]
        ldr     q0, [x1, x2]
        fmul    v2.2d, v3.2d, v1.2d
        fcadd   v0.2d, v0.2d, v2.2d, #270
        fmls    v0.2d, v4.2d, v1.2d
        str     q0, [x1, x2]
        add     x2, x2, 16
        cmp     x2, 3200
        bne     .L2
        ret

which was the optimal sequence.
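As a semantic cross-check of the testcase and the GCC 13 sequence above, here is a scalar reference (std::complex standing in for C's _Complex; the per-component expansion in the comments is what the fmul / fcadd #270 / fmls triple computes per vector lane):

```cpp
#include <cassert>
#include <complex>

// Scalar reference for fms_elemconjsnd: c[i] -= a[i] * conj(b).
// Per component, with b = br + bi*i:
//   c.re -= a.re*br + a.im*bi
//   c.im -= a.im*br - a.re*bi
// The GCC 13 loop computes exactly this: fmul by the broadcast -bi
// (fneg d1 + dup v3), an fcadd #270 folding that into c, then an fmls by
// the broadcast br (dup v4) -- no load or store permute anywhere.
std::complex<double> fms_conj_elem(std::complex<double> c,
                                   std::complex<double> a,
                                   std::complex<double> b)
{
    return c - a * std::conj(b);
}
```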
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #31 from Richard Biener ---
(In reply to Tamar Christina from comment #29)
> (In reply to Tamar Christina from comment #27)
> > > > We DO already impose any order on them, but the other operand is
> > > > oddodd, so the overall order ends up being oddodd because any known
> > > > permute overrides unknown ones.
> > >
> > > So what's the desired outcome? I guess PERM_UNKNOWN? I guess it's
> > > the "other operand" of an add? What's the (bad) effect of classifying
> > > it as ODDODD (optimistically)?
> > >
> > > > So the question is, can we not follow externals in a constructor to
> > > > figure out if how they are used they all read from the same base and
> > > > in which order?
> > >
> > > I don't see how it makes sense to do this. For the above example,
> > > what's the testcase exhibiting this (and on which arch)?
> >
> > I've been working on a fix from a different angle for this, which also
> > covers another GCC 14 regression that went unnoticed. I'll post after
> > regressions finish.
>
> So I've formalized the handling of TOP a bit better. Which gets it to
> recognize it again, however, it will be dropped as it's not profitable.
>
> The reason it's not profitable is the canonicalization issue mentioned
> above. This has split the imaginary and real nodes into different
> computations.
>
> So no matter what you do in the SLP tree, the attached digraph won't make
> the loads of _5 linear. Are you ok with me trying that Richi?

I can't make sense of that graph - the node feeding the store seems to have
wrong scalar stmts?

What's the testcase for this (and on what arch?).

But yes, the loads of *5 won't get linear here, but at least the
permute node feeding the complex-add-rot270 can be elided, eventually
even making the external _53, b$real_11 match the other with different
order (though we don't model that, cost-wise).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #30 from Tamar Christina ---
Created attachment 59779
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59779&action=edit
pattern.dot - digraph of resulting pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #29 from Tamar Christina ---
(In reply to Tamar Christina from comment #27)
> > > We DO already impose any order on them, but the other operand is
> > > oddodd, so the overall order ends up being oddodd because any known
> > > permute overrides unknown ones.
> >
> > So what's the desired outcome? I guess PERM_UNKNOWN? I guess it's
> > the "other operand" of an add? What's the (bad) effect of classifying
> > it as ODDODD (optimistically)?
> >
> > > So the question is, can we not follow externals in a constructor to
> > > figure out if how they are used they all read from the same base and
> > > in which order?
> >
> > I don't see how it makes sense to do this. For the above example,
> > what's the testcase exhibiting this (and on which arch)?
>
> I've been working on a fix from a different angle for this, which also
> covers another GCC 14 regression that went unnoticed. I'll post after
> regressions finish.

So I've formalized the handling of TOP a bit better. Which gets it to
recognize it again, however, it will be dropped as it's not profitable.

The reason it's not profitable is the canonicalization issue mentioned
above. This has split the imaginary and real nodes into different
computations.

So no matter what you do in the SLP tree, the attached digraph won't make
the loads of _5 linear. Are you ok with me trying that Richi?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #27 from Tamar Christina ---
> > We DO already impose any order on them, but the other operand is oddodd,
> > so the overall order ends up being oddodd because any known permute
> > overrides unknown ones.
>
> So what's the desired outcome? I guess PERM_UNKNOWN? I guess it's
> the "other operand" of an add? What's the (bad) effect of classifying
> it as ODDODD (optimistically)?
>
> > So the question is, can we not follow externals in a constructor to
> > figure out if how they are used they all read from the same base and in
> > which order?
>
> I don't see how it makes sense to do this. For the above example, what's
> the testcase exhibiting this (and on which arch)?

I've been working on a fix from a different angle for this, which also
covers another GCC 14 regression that went unnoticed. I'll post after
regressions finish.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #26 from GCC Commits ---
The releases/gcc-14 branch has been updated by Tamar Christina:

https://gcc.gnu.org/g:f01f01f0ebf8f5207096cb9650354210d890fe0d

commit r14-11053-gf01f01f0ebf8f5207096cb9650354210d890fe0d
Author: Tamar Christina
Date:   Thu Nov 21 15:10:24 2024 +

    middle-end: For multiplication try swapping operands when matching
    complex multiply [PR116463]

    This commit fixes the failures of complex.exp=fast-math-complex-mls-*.c
    on the GCC 14 branch and some of the ones on the master.

    The current matching just looks for one order for multiplication and was
    relying on canonicalization to always give the right order because of
    the TWO_OPERANDS.  However when it comes to the multiplication trying
    only one order is a bit fragile as they can be flipped.

    The failing tests on the branch are:

    void fms180snd(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                   _Complex TYPE c[restrict N]) {
      for (int i = 0; i < N; i++)
        c[i] -= a[i] * (b[i] * I * I);
    }

    void fms180fst(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                   _Complex TYPE c[restrict N]) {
      for (int i = 0; i < N; i++)
        c[i] -= (a[i] * I * I) * b[i];
    }

    The issue is just a small difference in commutative operations.  We look
    for {R,R} * {R,I} but found {R,I} * {R,R}.

    Since the DF analysis is cached, we should be able to swap operands and
    retry for multiply cheaply.

    There is a constraint being checked by vect_validate_multiplication for
    the data flow of the operands feeding the multiplications.  So e.g.
    between the nodes:

    note: node 0x4d1d210 (max_nunits=2, refcnt=3) vector(2) double
    note: op template: _27 = _10 * _25;
    note:   stmt 0 _27 = _10 * _25;
    note:   stmt 1 _29 = _11 * _25;
    note: node 0x4d1d060 (max_nunits=2, refcnt=2) vector(2) double
    note: op template: _26 = _11 * _24;
    note:   stmt 0 _26 = _11 * _24;
    note:   stmt 1 _28 = _10 * _24;

    we require the lanes to come from the same source, which
    vect_validate_multiplication checks.  As such it doesn't make sense to
    flip them individually because that would invalidate the earlier
    linear_loads_p checks which have validated that the arguments all come
    from the same datarefs.

    This patch thus flips the operands in unison to still maintain this
    invariant, but also honor the commutative nature of multiplication.

    gcc/ChangeLog:

            PR tree-optimization/116463
            * tree-vect-slp-patterns.cc (complex_mul_pattern::matches,
            complex_fms_pattern::matches): Try swapping operands on multiply.

    (cherry picked from commit a9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #25 from Richard Biener ---
(In reply to Tamar Christina from comment #24)
> (In reply to rguent...@suse.de from comment #23)
> > > On 23.11.2024 at 13:20, tnfchris at gcc dot gnu.org wrote:
> > >
> > > --- Comment #22 from Tamar Christina ---
> > > Ok, so the problem with the ones on trunk isn't necessarily the
> > > canonicalization itself but that our externals handling is a bit
> > > shallow.
> > >
> > > On externals we determine that we have no information on the DF and
> > > return TOP. This is because DR analysis doesn't try to handle
> > > externals since they're not part of the loop.
> > >
> > > However all we need to know for complex numbers is whether the
> > > externals are loaded from the same place and the order of them.
> > >
> > > concretely the loop pre-header is:
> > >
> > >   [local count: 10737416]:
> > >   b$real_11 = REALPART_EXPR <b>;
> > >   b$imag_10 = IMAGPART_EXPR <b>;
> > >   _53 = -b$imag_10;
> > >
> > > and the loop body:
> > >
> > >   [local count: 1063004408]:
> > >   ...
> > >   _23 = REALPART_EXPR <*_5>;
> > >   _24 = IMAGPART_EXPR <*_5>;
> > >   _27 = _24 * _53;
> > >   _28 = _23 * _53;
> > >
> > > codegen before:
> > >
> > >   {_24, _23} * { _53, _53 }
> > >
> > > and after:
> > >
> > >   { _24, _24 } * { _53, b$real_11 }
> > >
> > > Before we were able to easily tell that the order for the multiply
> > > would be IMAG, REAL. In the after (GCC 15) case that information is
> > > there, but requires us to follow the externals.
> > >
> > > Richi what do you think about extending externals handling in
> > > linear_loads_p to follow all external ops and if they load from the
> > > same memref to figure out the "virtual lane permute"?
> >
> > Externs do not have a permute as we build them from scalars. So any
> > permute can be trivially imposed on them - rather than TOP they should
> > be BOTTOM. Of course there’s also no advantage of imposing a permute on
> > them.
>
> But the scalars can access memory that we can tell what they are.
>
> My point with the above was that it doesn't make sense to me that we know
> that {a[0],a[1]} reads a linearly but that with
>
>   a1 = a[0]
>   a2 = a[1]
>
> {a1,a2} we say "sorry we know nothing about you".
>
> Yes they're externals but they have a defined order of use in the SLP
> tree. This isn't about imposing a permute. I said virtual permute since
> linear_load_p uses the lane permutes on loads to determine the memory
> access order.
>
> We DO already impose any order on them, but the other operand is oddodd,
> so the overall order ends up being oddodd because any known permute
> overrides unknown ones.

So what's the desired outcome? I guess PERM_UNKNOWN? I guess it's
the "other operand" of an add? What's the (bad) effect of classifying
it as ODDODD (optimistically)?

> So the question is, can we not follow externals in a constructor to
> figure out if how they are used they all read from the same base and in
> which order?

I don't see how it makes sense to do this. For the above example, what's
the testcase exhibiting this (and on which arch)?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #24 from Tamar Christina ---
(In reply to rguent...@suse.de from comment #23)
> > On 23.11.2024 at 13:20, tnfchris at gcc dot gnu.org wrote:
> >
> > --- Comment #22 from Tamar Christina ---
> > Ok, so the problem with the ones on trunk isn't necessarily the
> > canonicalization itself but that our externals handling is a bit
> > shallow.
> >
> > On externals we determine that we have no information on the DF and
> > return TOP. This is because DR analysis doesn't try to handle externals
> > since they're not part of the loop.
> >
> > However all we need to know for complex numbers is whether the
> > externals are loaded from the same place and the order of them.
> >
> > concretely the loop pre-header is:
> >
> >   [local count: 10737416]:
> >   b$real_11 = REALPART_EXPR <b>;
> >   b$imag_10 = IMAGPART_EXPR <b>;
> >   _53 = -b$imag_10;
> >
> > and the loop body:
> >
> >   [local count: 1063004408]:
> >   ...
> >   _23 = REALPART_EXPR <*_5>;
> >   _24 = IMAGPART_EXPR <*_5>;
> >   _27 = _24 * _53;
> >   _28 = _23 * _53;
> >
> > codegen before:
> >
> >   {_24, _23} * { _53, _53 }
> >
> > and after:
> >
> >   { _24, _24 } * { _53, b$real_11 }
> >
> > Before we were able to easily tell that the order for the multiply
> > would be IMAG, REAL. In the after (GCC 15) case that information is
> > there, but requires us to follow the externals.
> >
> > Richi what do you think about extending externals handling in
> > linear_loads_p to follow all external ops and if they load from the
> > same memref to figure out the "virtual lane permute"?
>
> Externs do not have a permute as we build them from scalars. So any
> permute can be trivially imposed on them - rather than TOP they should be
> BOTTOM. Of course there’s also no advantage of imposing a permute on
> them.

But the scalars can access memory that we can tell what they are.

My point with the above was that it doesn't make sense to me that we know
that {a[0],a[1]} reads a linearly but that with

  a1 = a[0]
  a2 = a[1]

{a1,a2} we say "sorry we know nothing about you".

Yes they're externals but they have a defined order of use in the SLP tree.
This isn't about imposing a permute. I said virtual permute since
linear_load_p uses the lane permutes on loads to determine the memory
access order.

We DO already impose any order on them, but the other operand is oddodd, so
the overall order ends up being oddodd because any known permute overrides
unknown ones.

So the question is, can we not follow externals in a constructor to figure
out if how they are used they all read from the same base and in which
order?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #23 from rguenther at suse dot de ---
> On 23.11.2024 at 13:20, tnfchris at gcc dot gnu.org wrote:
>
> --- Comment #22 from Tamar Christina ---
> Ok, so the problem with the ones on trunk isn't necessarily the
> canonicalization itself but that our externals handling is a bit shallow.
>
> On externals we determine that we have no information on the DF and
> return TOP. This is because DR analysis doesn't try to handle externals
> since they're not part of the loop.
>
> However all we need to know for complex numbers is whether the externals
> are loaded from the same place and the order of them.
>
> concretely the loop pre-header is:
>
>   [local count: 10737416]:
>   b$real_11 = REALPART_EXPR <b>;
>   b$imag_10 = IMAGPART_EXPR <b>;
>   _53 = -b$imag_10;
>
> and the loop body:
>
>   [local count: 1063004408]:
>   ...
>   _23 = REALPART_EXPR <*_5>;
>   _24 = IMAGPART_EXPR <*_5>;
>   _27 = _24 * _53;
>   _28 = _23 * _53;
>
> codegen before:
>
>   {_24, _23} * { _53, _53 }
>
> and after:
>
>   { _24, _24 } * { _53, b$real_11 }
>
> Before we were able to easily tell that the order for the multiply would
> be IMAG, REAL. In the after (GCC 15) case that information is there, but
> requires us to follow the externals.
>
> Richi what do you think about extending externals handling in
> linear_loads_p to follow all external ops and if they load from the same
> memref to figure out the "virtual lane permute"?

Externs do not have a permute as we build them from scalars. So any permute
can be trivially imposed on them - rather than TOP they should be BOTTOM.
Of course there’s also no advantage of imposing a permute on them.

> We can store the info in a new externals cache (to avoid re-walking
> externals we already walked, as perm_cache stores slp nodes) and the
> permute for the node in the perm_cache as we do for any cached lookups
> today?
>
> This would also fix the other tests Andrew added in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463#c4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #22 from Tamar Christina ---
Ok, so the problem with the ones on trunk isn't necessarily the
canonicalization itself but that our externals handling is a bit shallow.

On externals we determine that we have no information on the DF and return
TOP. This is because DR analysis doesn't try to handle externals since
they're not part of the loop.

However all we need to know for complex numbers is whether the externals
are loaded from the same place and the order of them.

concretely the loop pre-header is:

  [local count: 10737416]:
  b$real_11 = REALPART_EXPR <b>;
  b$imag_10 = IMAGPART_EXPR <b>;
  _53 = -b$imag_10;

and the loop body:

  [local count: 1063004408]:
  ...
  _23 = REALPART_EXPR <*_5>;
  _24 = IMAGPART_EXPR <*_5>;
  _27 = _24 * _53;
  _28 = _23 * _53;

codegen before:

  {_24, _23} * { _53, _53 }

and after:

  { _24, _24 } * { _53, b$real_11 }

Before we were able to easily tell that the order for the multiply would be
IMAG, REAL. In the after (GCC 15) case that information is there, but
requires us to follow the externals.

Richi what do you think about extending externals handling in
linear_loads_p to follow all external ops and if they load from the same
memref to figure out the "virtual lane permute"?

We can store the info in a new externals cache (to avoid re-walking
externals we already walked, as perm_cache stores slp nodes) and the
permute for the node in the perm_cache as we do for any cached lookups
today?

This would also fix the other tests Andrew added in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463#c4
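The "virtual lane permute" idea in this comment can be sketched standalone. This is a hypothetical model, not vectorizer code: the enum names are simplified stand-ins for the permute kinds linear_loads_p tracks, and the input encodes which component of the same complex object each scalar element of an external constructor was traced to (0 = REALPART, 1 = IMAGPART, anything else = unknown).

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: if every element of a two-lane external constructor
// can be traced to a component of the same complex object, assign it a
// lane permute instead of giving up with TOP.
enum lane_perm { PERM_EVENODD, PERM_ODDEVEN, PERM_EVENEVEN, PERM_ODDODD,
                 PERM_TOP };

lane_perm classify_external(const std::vector<int> &component)
{
    if (component.size() != 2)
        return PERM_TOP;                        // only two-lane nodes modeled
    int a = component[0], b = component[1];
    if (a == 0 && b == 1) return PERM_EVENODD;  // {real, imag}: linear order
    if (a == 1 && b == 0) return PERM_ODDEVEN;  // {imag, real}
    if (a == 0 && b == 0) return PERM_EVENEVEN; // lane duplicate
    if (a == 1 && b == 1) return PERM_ODDODD;   // lane duplicate
    return PERM_TOP;                            // mixed/unknown origins
}
```

Under this sketch the "before" constructor {_24, _23} (imag, real of *_5) would classify as an odd/even permute rather than TOP, so it could combine with the known permute of the other multiply operand instead of being overridden by it.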
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

Tamar Christina changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
         Status    |NEW                           |ASSIGNED
       Assignee    |unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org

--- Comment #21 from Tamar Christina ---
Will wait a week or so for backporting to GCC 14; looking at the complex add
failures on trunk now. Have an idea but need to work out on paper if it's
sound.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463 --- Comment #20 from GCC Commits --- The master branch has been updated by Tamar Christina : https://gcc.gnu.org/g:a9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8 commit r15-5585-ga9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8 Author: Tamar Christina Date: Thu Nov 21 15:10:24 2024 + middle-end:For multiplication try swapping operands when matching complex multiply [PR116463] This commit fixes the failures of complex.exp=fast-math-complex-mls-*.c on the GCC 14 branch and some of the ones on the master. The current matching just looks for one order for multiplication and was relying on canonicalization to always give the right order because of the TWO_OPERANDS. However when it comes to the multiplication trying only one order is a bit fragile as they can be flipped. The failing tests on the branch are: void fms180snd(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= a[i] * (b[i] * I * I); } void fms180fst(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= (a[i] * I * I) * b[i]; } The issue is just a small difference in commutative operations. we look for {R,R} * {R,I} but found {R,I} * {R,R}. Since the DF analysis is cached, we should be able to swap operands and retry for multiply cheaply. There is a constraint being checked by vect_validate_multiplication for the data flow of the operands feeding the multiplications. So e.g. between the nodes: note: node 0x4d1d210 (max_nunits=2, refcnt=3) vector(2) double note: op template: _27 = _10 * _25; note: stmt 0 _27 = _10 * _25; note: stmt 1 _29 = _11 * _25; note: node 0x4d1d060 (max_nunits=2, refcnt=2) vector(2) double note: op template: _26 = _11 * _24; note: stmt 0 _26 = _11 * _24; note: stmt 1 _28 = _10 * _24; we require the lanes to come from the same source which vect_validate_multiplication checks. 
    As such it doesn't make sense to flip them individually, because that
    would invalidate the earlier linear_loads_p checks which have validated
    that the arguments all come from the same datarefs.  This patch thus
    flips the operands in unison, to still maintain this invariant but also
    honor the commutative nature of multiplication.

    gcc/ChangeLog:

            PR tree-optimization/116463
            * tree-vect-slp-patterns.cc (complex_mul_pattern::matches,
            complex_fms_pattern::matches): Try swapping operands on
            multiply.
--- Comment #19 from Tamar Christina ---
The failing tests on the branch are:

#include <complex.h>

#define TYPE double
#define N 200

void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  for (int i = 0; i < N; i++)
    c[i] -= a[i] * (b[i] * I * I);
}

void fms180fst (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  for (int i = 0; i < N; i++)
    c[i] -= (a[i] * I * I) * b[i];
}

The issue is just a small difference in commutative operations: we look for
{R,R} * {R,I} but found {R,I} * {R,R}, which is a bit fragile to rely on.
Since the DF analysis is cached, we should be able to swap the operands and
retry the multiply match.  For the addition we can't, as it's part of the
TWO_OPERANDS and we require the operands to be in a particular order to get
the right values.

Testing a patch that fixes the tests on the gcc-14 branch.  Will work on the
one for master next.
--- Comment #18 from Tamar Christina ---
I can provide a GCC 14 fix during Stage 3 while I fix the GCC 15 one, if
that's easier.
--- Comment #17 from Jakub Jelinek ---
It doesn't.  With r14-10682 there are 2 "Found COMPLEX_FMS pattern in SLP
tree" and 2 "Found COMPLEX_ADD_ROT270 pattern in SLP tree" messages in
fast-math-complex-mls-double.c, plus 4 "Found COMPLEX_FMA pattern in SLP
tree" and 4 "Found COMPLEX_FMS_CONJ pattern in SLP tree".

With r14-10683 the 4 "Found COMPLEX_FMA pattern in SLP tree" messages are
gone; the rest remains the same.

With r14-10683 + backported r15-3128 the 4 "Found COMPLEX_FMA pattern in SLP
tree" messages are back, but the 2 COMPLEX_FMS and 2 COMPLEX_ADD_ROT270
messages are gone.
Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #15 from Jakub Jelinek ---
Those tests FAIL also on the 14 branch, starting with
r14-10683-g12c00048d9f3598e57b98ec7723f7356bd255d04.
Do we want to backport r15-3128 or just xfail the parts of the tests?
--- Comment #16 from Richard Biener ---
(In reply to Jakub Jelinek from comment #15)
> Those tests FAIL also on the 14 branch, starting with
> r14-10683-g12c00048d9f3598e57b98ec7723f7356bd255d04
> Do we want to backport r15-3128 or just xfail the parts of the tests?

I don't think r15-3128 on its own fixes the tests?  If it does, sure, we can
backport it.