https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90424

            Bug ID: 90424
           Summary: memcpy into vector builtin not optimized
           Product: gcc
           Version: 9.1.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kretz at kde dot org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Testcase (cf. https://godbolt.org/z/LsKcii):

template <class T>
using V [[gnu::vector_size(16)]] = T;

template <class T, unsigned M = sizeof(V<T>)>
V<T> load(const void *p) {
  using W = V<T>;
  W r;
  __builtin_memcpy(&r, p, M);
  return r;
}

// movq or movsd
template V<char> load<char, 8>(const void *);     // bad
template V<short> load<short, 8>(const void *);   // bad
template V<int> load<int, 8>(const void *);       // bad
template V<long> load<long, 8>(const void *);     // good
template V<float> load<float, 8>(const void *);   // bad
template V<double> load<double, 8>(const void *); // good (movsd?)

// movd or movss
template V<char> load<char, 4>(const void *);   // bad
template V<short> load<short, 4>(const void *); // bad
template V<int> load<int, 4>(const void *);     // good
template V<float> load<float, 4>(const void *); // good

Each of these partial loads should be translated to a single mov[qd] or movs[sd]
instruction, but most of them (the instantiations marked "bad" above) are not.
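
For reference, here is a minimal sketch of the result a single movq/movd (or
movsd/movss from memory) would produce. Those instructions zero the upper part
of the destination register, so this variant zero-initializes the vector before
the memcpy; the testcase leaves those bytes indeterminate. The names
load_zeroed, partial_load_ok and self_check are hypothetical and not part of
the report. The report does not establish whether GCC emits a single
instruction for this variant; the sketch only pins down the expected value.

```cpp
#include <cassert>
#include <cstring>

template <class T>
using V [[gnu::vector_size(16)]] = T;

// Hypothetical variant of the testcase's load(): identical except that the
// vector is zero-initialized, so the bytes beyond M are well defined.
template <class T, unsigned M = sizeof(V<T>)>
V<T> load_zeroed(const void *p) {
  static_assert(M <= sizeof(V<T>), "cannot copy more than the vector holds");
  V<T> r = {};
  __builtin_memcpy(&r, p, M);
  return r;
}

// Hypothetical helper: true if the low M bytes of the loaded vector equal the
// source bytes and every remaining byte is zero.
template <class T, unsigned M>
bool partial_load_ok(const void *src) {
  V<T> v = load_zeroed<T, M>(src);
  unsigned char bytes[sizeof(v)];
  std::memcpy(bytes, &v, sizeof(v));
  if (std::memcmp(bytes, src, M) != 0) return false;
  for (unsigned i = M; i < sizeof(v); ++i)
    if (bytes[i] != 0) return false;
  return true;
}

// Runs the check for a few of the instantiations listed in the testcase.
inline void self_check() {
  const unsigned char src[16] = {1, 2,  3,  4,  5,  6,  7,  8,
                                 9, 10, 11, 12, 13, 14, 15, 16};
  assert((partial_load_ok<char, 8>(src)));
  assert((partial_load_ok<short, 8>(src)));
  assert((partial_load_ok<float, 8>(src)));
  assert((partial_load_ok<char, 4>(src)));
  assert((partial_load_ok<int, 4>(src)));
}
```

Calling self_check() only verifies the loaded values. To see whether each
instantiation collapses to one instruction, inspect the generated assembly
(for example with -O2 -S) as in the godbolt link above.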
