[Bug target/51509] Inefficient neon intrinsic code sequence
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Christophe Lyon changed:

           What    |Removed                    |Added
           Assignee|cbaylis at gcc dot gnu.org |unassigned at gcc dot gnu.org

--- Comment #8 from Christophe Lyon ---
(In reply to Eric Gallager from comment #7)
> (In reply to Maxim Kuvyrkov from comment #5)
> > Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
> > armv7. Charles, would you please look at this?
>
> Should Charles still remain the assignee for this?

I'm afraid not: Charles no longer works with us.
[Bug target/51509] Inefficient neon intrinsic code sequence
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Eric Gallager changed:

           What    |Removed |Added
                 CC|        |egallager at gcc dot gnu.org
           See Also|        |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65375

--- Comment #7 from Eric Gallager ---
(In reply to Maxim Kuvyrkov from comment #5)
> Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
> armv7. Charles, would you please look at this?

Should Charles still remain the assignee for this?
[Bug target/51509] Inefficient neon intrinsic code sequence
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Allan Jensen changed:

           What    |Removed |Added
                 CC|        |linux at carewolf dot com

--- Comment #6 from Allan Jensen ---
I have run into a similar problem with vld3 and vst4.

    uint8x16x3_t tmp = vld3q_u8(src);
    vst4q_u8((uint8_t *)dst, {tmp.val[2], tmp.val[1], tmp.val[0], fullVector});

produces:

    70: 4cdf4061 ld3  {v1.16b-v3.16b}, [x3], #48
    74: 4e083c04 mov  x4, v0.d[0]
    78: 4e183c05 mov  x5, v0.d[1]
    7c: 6f000400 mvni v0.4s, #0x0
    80: 4e083c4a mov  x10, v2.d[0]
    84: 4e183c4b mov  x11, v2.d[1]
    88: aa0403e2 mov  x2, x4
    8c: aa0503e1 mov  x1, x5
    90: 4e083c24 mov  x4, v1.d[0]
    94: 4e183c25 mov  x5, v1.d[1]
    98: a90007e2 stp  x2, x1, [sp]
    9c: 3d800fe0 str  q0, [sp,#48]
    a0: a9012fea stp  x10, x11, [sp,#16]
    a4: aa0403e6 mov  x6, x4
    a8: a90217e6 stp  x6, x5, [sp,#32]
    ac: 4c4023e0 ld1  {v0.16b-v3.16b}, [sp]
    b0: 4c9f     st4  {v0.16b-v3.16b}, [x0], #64

But if I add -fno-split-wide-types it compiles to:

    68: 4cdf4064 ld3  {v4.16b-v6.16b}, [x3], #48
    6c: 4f000400 movi v0.4s, #0x0
    70: 6f000403 mvni v3.4s, #0x0
    74: 4ea51ca1 mov  v1.16b, v5.16b
    78: 4ea41c82 mov  v2.16b, v4.16b
    7c: 4c9f     st4  {v0.16b-v3.16b}, [x0], #64

This happens with both 4.9 and 5.1, the two versions I have tried.
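For anyone who wants to reproduce the vld3/vst4 case from comment #6, the two-line snippet can be wrapped in a compilable function roughly as follows. This is only a sketch: the function name rgb_to_bgra16 is hypothetical, fullVector is assumed to be an all-ones vector (which would match the mvni in the listing), and the scalar branch exists only so the file also builds on hosts without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1
#endif

/* Hypothetical helper (not from the report): converts 16 packed RGB pixels
   (48 bytes at src) into 16 BGRA pixels (64 bytes at dst). */
static void rgb_to_bgra16(const uint8_t *src, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
#ifdef HAVE_NEON_SKETCH
    /* De-interleaving load: val[0] holds all R, val[1] all G, val[2] all B. */
    uint8x16x3_t rgb = vld3q_u8(src);
    uint8x16x4_t bgra;
    bgra.val[0] = rgb.val[2];
    bgra.val[1] = rgb.val[1];
    bgra.val[2] = rgb.val[0];
    bgra.val[3] = vdupq_n_u8(0xff); /* assumed value of fullVector */
    /* Interleaving store: writes B, G, R, A for each of the 16 pixels. */
    vst4q_u8(dst, bgra);
#else
    /* Scalar fallback with the same result, for non-NEON hosts. */
    for (size_t i = 0; i < 16; i++) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xff;
    }
#endif
}
```

Compiling this for aarch64 at -O2 with and without -fno-split-wide-types, then comparing the disassembly, should show whether the round trip through the stack described above is still present in a given GCC version.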
[Bug target/51509] Inefficient neon intrinsic code sequence
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:

           What    |Removed                       |Added
                 CC|                              |clyon at gcc dot gnu.org,
                   |                              |mkuvyrkov at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org |kugan at gcc dot gnu.org

--- Comment #4 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
Kugan,

Would you please check if your patch for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65375 also affects this one?
[Bug target/51509] Inefficient neon intrinsic code sequence
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:

           What    |Removed                  |Added
           Assignee|kugan at gcc dot gnu.org |cbaylis at gcc dot gnu.org

--- Comment #5 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
armv7. Charles, would you please look at this?
[Bug target/51509] Inefficient neon intrinsic code sequence
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

--- Comment #3 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2012-06-15 00:51:26 UTC ---
With -fno-split-wide-types I can end up getting identical output to what is
expected in this case with FSF trunk.

I suspect this might be another of those costs with lower-subreg issues.

Ramana
[Bug target/51509] Inefficient neon intrinsic code sequence
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

--- Comment #1 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> 2011-12-13 09:07:38 UTC ---
At least part of the problem here is the uninitialised variable in the vld4
call. GCC tries to create a zero initialisation of x before the vld4, so that
the other lanes have defined values.

Obviously we could be doing that much better than we are, and perhaps we should
have some kind of special case so that uninitialised NEON vectors are never
zero-initialised (e.g. use a plain clobber instead). But uninitialised
variables aren't really ideal either way. Something like:

    x = vld4_dup_u8(src);
    y.val[0][0] = x.val[1][0];
    y.val[1][0] = x.val[2][0];
    vst2_lane_u8(dst, y, 0);

would be better in principle. Unfortunately, we don't generate good code for
that either. Part of the problem is introduced by lower-subreg, but it's not
good even with -fno-split-wide-types.
[Bug target/51509] Inefficient neon intrinsic code sequence
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> 2011-12-13 09:20:54 UTC ---
FWIW,

    uint8x8x4_t x;
    uint8x8x2_t y;
    x = vld4_dup_u8(src);
    y.val[0] = x.val[1];
    y.val[1] = x.val[2];
    vst2_lane_u8(dst, y, 0);

does give the expected output. I.e. the remaining inefficiency from comment #1
is in the uninitialised parts of y.

Richard
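The whole-vector form from comment #2 can be turned into a small test function along these lines. This is a sketch under assumptions: the name copy_middle_bytes is hypothetical, and the scalar branch is only there so the file builds on hosts without NEON. In the NEON branch every lane of y is defined by a whole-vector assignment, so nothing is left uninitialised before the lane store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH2 1
#endif

/* Hypothetical helper (not from the report): reads 4 bytes at src and
   writes bytes 1 and 2 of them to dst[0] and dst[1]. */
static void copy_middle_bytes(const uint8_t *src, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
#ifdef HAVE_NEON_SKETCH2
    /* Loads src[0..3] and replicates src[k] across all lanes of x.val[k]. */
    uint8x8x4_t x = vld4_dup_u8(src);
    uint8x8x2_t y;
    /* Whole-vector copies: all lanes of y are defined. */
    y.val[0] = x.val[1];
    y.val[1] = x.val[2];
    /* Stores lane 0 of y.val[0] and y.val[1] to dst[0] and dst[1]. */
    vst2_lane_u8(dst, y, 0);
#else
    /* Scalar fallback with the same result, for non-NEON hosts. */
    dst[0] = src[1];
    dst[1] = src[2];
#endif
}
```

Swapping the two whole-vector assignments for the lane-wise ones from comment #1 (which rely on GCC's vector subscripting) gives the variant that leaves parts of y uninitialised, so the two can be compared in the generated assembly.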
[Bug target/51509] Inefficient neon intrinsic code sequence
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed               |Added
             Target|arm-linux-androideabi |arm-linux-androideabi,
                   |                      |arm-linux-gnueabi
             Status|UNCONFIRMED           |NEW
           Keywords|                      |missed-optimization
   Last reconfirmed|                      |2011-12-12
                 CC|                      |ramana at gcc dot gnu.org,
                   |                      |rsandifo at gcc dot gnu.org
             Blocks|                      |47562
     Ever Confirmed|0                     |1