[Bug target/51509] Inefficient neon intrinsic code sequence

2018-12-12 Thread clyon at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Christophe Lyon changed:

           What    |Removed                       |Added

           Assignee|cbaylis at gcc dot gnu.org    |unassigned at gcc dot gnu.org

--- Comment #8 from Christophe Lyon ---
(In reply to Eric Gallager from comment #7)
> (In reply to Maxim Kuvyrkov from comment #5)
> > Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
> > armv7.  Charles, would you please look at this?
> 
> Should Charles still remain the assignee for this?

I'm afraid not: Charles no longer works with us.

[Bug target/51509] Inefficient neon intrinsic code sequence

2018-12-11 Thread egallager at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Eric Gallager changed:

           What    |Removed                       |Added

                 CC|                              |egallager at gcc dot gnu.org
           See Also|                              |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65375

--- Comment #7 from Eric Gallager ---
(In reply to Maxim Kuvyrkov from comment #5)
> Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
> armv7.  Charles, would you please look at this?

Should Charles still remain the assignee for this?

[Bug target/51509] Inefficient neon intrinsic code sequence

2015-11-26 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Allan Jensen changed:

           What    |Removed                       |Added

                 CC|                              |linux at carewolf dot com

--- Comment #6 from Allan Jensen ---
I have run into a similar problem with vld3 and vst4.

uint8x16x3_t tmp = vld3q_u8(src);
vst4q_u8((uint8_t *)dst, {tmp.val[2], tmp.val[1], tmp.val[0], fullVector});

produces:
  70:   4cdf4061    ld3     {v1.16b-v3.16b}, [x3], #48
  74:   4e083c04    mov     x4, v0.d[0]
  78:   4e183c05    mov     x5, v0.d[1]
  7c:   6f000400    mvni    v0.4s, #0x0
  80:   4e083c4a    mov     x10, v2.d[0]
  84:   4e183c4b    mov     x11, v2.d[1]
  88:   aa0403e2    mov     x2, x4
  8c:   aa0503e1    mov     x1, x5
  90:   4e083c24    mov     x4, v1.d[0]
  94:   4e183c25    mov     x5, v1.d[1]
  98:   a90007e2    stp     x2, x1, [sp]
  9c:   3d800fe0    str     q0, [sp,#48]
  a0:   a9012fea    stp     x10, x11, [sp,#16]
  a4:   aa0403e6    mov     x6, x4
  a8:   a90217e6    stp     x6, x5, [sp,#32]
  ac:   4c4023e0    ld1     {v0.16b-v3.16b}, [sp]
  b0:   4c9f        st4     {v0.16b-v3.16b}, [x0], #64


But if I add -fno-split-wide-types it compiles to:
  68:   4cdf4064    ld3     {v4.16b-v6.16b}, [x3], #48
  6c:   4f000400    movi    v0.4s, #0x0
  70:   6f000403    mvni    v3.4s, #0x0
  74:   4ea51ca1    mov     v1.16b, v5.16b
  78:   4ea41c82    mov     v2.16b, v4.16b
  7c:   4c9f        st4     {v0.16b-v3.16b}, [x0], #64

This happens with both GCC 4.9 and 5.1, the two versions I have tried.
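
For reference, the two-line snippet in comment #6 could be wrapped into a
self-contained function along the lines of the sketch below. The function
name swizzle3to4_x16 is hypothetical, and treating fullVector as an all-0xFF
vector is an assumption, since the comment does not show its definition. The
scalar branch is not part of the original report; it is only there so the
file builds on hosts without NEON, and it mirrors lane by lane what the
vld3q_u8/vst4q_u8 pair does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: converts 16 packed 3-byte pixels (48 bytes at src)
   into 16 4-byte pixels (64 bytes at dst), reversing the order of the three
   channels and setting the fourth byte to 0xFF (assumed fullVector value). */
void swizzle3to4_x16(const uint8_t *src, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* De-interleaving load: in.val[k][i] = src[3*i + k]. */
    uint8x16x3_t in = vld3q_u8(src);
    uint8x16x4_t out;

    out.val[0] = in.val[2];
    out.val[1] = in.val[1];
    out.val[2] = in.val[0];
    out.val[3] = vdupq_n_u8(0xFF);   /* assumption: fullVector is all ones */

    /* Interleaving store: dst[4*i + k] = out.val[k][i]. */
    vst4q_u8(dst, out);
#else
    /* Scalar model of the same data movement, for non-NEON hosts. */
    for (size_t i = 0; i < 16; i++) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xFF;
    }
#endif
}
```

Building this for an ARM target at -O2 with and without
-fno-split-wide-types and comparing the generated assembly should show
whether the stack round-trip reported above is still present; the exact
output will depend on the GCC version.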

[Bug target/51509] Inefficient neon intrinsic code sequence

2015-04-13 Thread mkuvyrkov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:

           What    |Removed                       |Added

                 CC|                              |clyon at gcc dot gnu.org,
                   |                              |mkuvyrkov at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org |kugan at gcc dot gnu.org

--- Comment #4 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
Kugan,

Would you please check if your patch for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65375 also affects this one?


[Bug target/51509] Inefficient neon intrinsic code sequence

2015-04-13 Thread mkuvyrkov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:

           What    |Removed                       |Added

           Assignee|kugan at gcc dot gnu.org      |cbaylis at gcc dot gnu.org

--- Comment #5 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
armv7.  Charles, would you please look at this?


[Bug target/51509] Inefficient neon intrinsic code sequence

2012-06-14 Thread ramana at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

--- Comment #3 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2012-06-15 00:51:26 UTC ---
With -fno-split-wide-types I get output identical to what is expected in
this case with FSF trunk. I suspect this is another of the lower-subreg
cost issues.


Ramana


[Bug target/51509] Inefficient neon intrinsic code sequence

2011-12-13 Thread rsandifo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

--- Comment #1 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> 2011-12-13 09:07:38 UTC ---
At least part of the problem here is the uninitialised
variable in the vld4 call.  GCC tries to create a zero
initialisation of x before the vld4, so that the other
lanes have defined values.  Obviously we could be doing
that much better than we are, and perhaps we should have
some kind of special case so that uninitialised NEON vectors
are never zero-initialised (e.g. use a plain clobber instead).
But uninitialised variables aren't really ideal either way.
Something like:

  x = vld4_dup_u8(src);

  y.val[0][0] = x.val[1][0];
  y.val[1][0] = x.val[2][0];

  vst2_lane_u8(dst, y, 0);

would be better in principle.  Unfortunately, we don't
generate good code for that either.  Part of the problem
is introduced by lower-subreg, but it's not good even
with -fno-split-wide-types.


[Bug target/51509] Inefficient neon intrinsic code sequence

2011-12-13 Thread rsandifo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> 2011-12-13 09:20:54 UTC ---
FWIW,

  uint8x8x4_t x;
  uint8x8x2_t y;

  x = vld4_dup_u8(src);

  y.val[0] = x.val[1];
  y.val[1] = x.val[2];

  vst2_lane_u8(dst, y, 0);

does give the expected output.  I.e. the remaining inefficiency
from comment #1 is in the uninitialised parts of y.

Richard
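
For reference, the fragment in comment #2 could be turned into a complete
function roughly as sketched below. The name copy_middle_pair is
hypothetical, and the scalar branch is an addition so that the file also
builds without NEON; it states the net effect of the intrinsic sequence,
which is to copy bytes 1 and 2 of the 4-byte source group to the
destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical wrapper: reads 4 bytes at src, writes 2 bytes at dst. */
void copy_middle_pair(const uint8_t *src, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8x4_t x;
    uint8x8x2_t y;

    /* Each of the four source bytes is broadcast to every lane of the
       corresponding x.val[n]. */
    x = vld4_dup_u8(src);

    /* Whole-vector copies, so no lane of y is left uninitialised. */
    y.val[0] = x.val[1];
    y.val[1] = x.val[2];

    /* Stores lane 0 of y.val[0] and y.val[1] to dst[0] and dst[1]. */
    vst2_lane_u8(dst, y, 0);
#else
    /* Scalar model of the same data movement, for non-NEON hosts. */
    dst[0] = src[1];
    dst[1] = src[2];
#endif
}
```

Replacing the two whole-vector assignments with the lane assignments from
comment #1 gives the variant that leaves the other lanes of y
uninitialised, which is the case described there as still generating poor
code.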


[Bug target/51509] Inefficient neon intrinsic code sequence

2011-12-12 Thread ramana at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                       |Added

             Target|arm-linux-androideabi         |arm-linux-androideabi,
                   |                              |arm-linux-gnueabi
             Status|UNCONFIRMED                   |NEW
           Keywords|                              |missed-optimization
   Last reconfirmed|                              |2011-12-12
                 CC|                              |ramana at gcc dot gnu.org,
                   |                              |rsandifo at gcc dot gnu.org
             Blocks|                              |47562
     Ever Confirmed|0                             |1