[Bug tree-optimization/92130] New: Missed vectorization for iteration dependent loads and simple multiplicative accumulators

witold.baryluk+gcc at gmail dot com Wed, 16 Oct 2019 11:51:43 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130


            Bug ID: 92130
           Summary: Missed vectorization for iteration dependent loads and
                    simple multiplicative accumulators
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

Created attachment 47051
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47051&action=edit
Perlin2D noise mesh generation

I have a fairly complex multi-level loop spread across many functions. All of
it can be vectorized, but in certain scenarios gcc 9.2.1 does not vectorize
it.

I am attaching somewhat simplified code with a few defines inside to play with.

The variant enabled by default presents the biggest challenge to gcc, even
though I am able to vectorize it manually.

I tested this on SSE2, AVX2 (cascadelake and znver2), AVX512 (-march=knm and
-march=skylake-avx512) and ARM SVE, all with the same result. I am using
associative math and the other flags listed at the top of the source file.

The high level overview is like this:

input: A, F, W, maxO, sufficiently aligned d.

foreach y:
  foreach x:
    float v = 0.0
    float a = 1.0
    float f = 1.0
    foreach o in [0, maxO):
      v += a * g(f * x, f * y, o, h(o, p))
      a *= A
      f *= F
    d[y*W + x] = v

where both g and h are pure functions (relatively complex, though) with no
control flow and no data-dependent flow.
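
As a rough C sketch of the loop nest above (not the attached code): g and h
here are trivial hypothetical stand-ins for the real noise functions, and the
height H and the parameter p are assumed names that the overview does not
spell out.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the report's pure functions g and h.
   The real ones are far more complex but equally branch-free. */
static inline float h(int o, int p)
{
    return (float)((o * 31 + p) & 7);
}

static inline float g(float fx, float fy, int o, float hv)
{
    return fx * 0.5f + fy * 0.25f + (float)o + hv;
}

/* Scalar reference version: a and f are multiplicative accumulators
   that depend only on the octave index o, not on x or y. */
void fill_mesh(float *d, int W, int H, int maxO, float A, float F, int p)
{
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            float v = 0.0f;
            float a = 1.0f;
            float f = 1.0f;
            for (int o = 0; o < maxO; ++o) {
                v += a * g(f * (float)x, f * (float)y, o, h(o, p));
                a *= A;
                f *= F;
            }
            d[(size_t)y * W + x] = v;
        }
    }
}
```

Compiling something of this shape with -O3 and -fopt-info-vec-missed is one
way to see which loops the vectorizer gives up on.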

In some situations, if a and f are replaced by precomputed tables of
coefficients for every o, and then used as v += a[o] * g(f[o] * x, f[o] * y,
o, h(o, p)), it does vectorize, but not always. h(o, p) could also be
precomputed, but I didn't bother, as it does not appear to have any bad effect
on the vectorizer.
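
A minimal sketch of that table variant, under the same assumptions as the
overview: g and h are hypothetical stand-ins, H and p are assumed names, and
MAX_OCTAVES is an arbitrary bound chosen only for this illustration.

```c
#include <stddef.h>

enum { MAX_OCTAVES = 16 };  /* hypothetical upper bound on maxO */

/* Hypothetical stand-ins for the report's pure functions g and h. */
static inline float h(int o, int p)
{
    return (float)((o * 31 + p) & 7);
}

static inline float g(float fx, float fy, int o, float hv)
{
    return fx * 0.5f + fy * 0.25f + (float)o + hv;
}

/* Table variant: the a and f sequences are computed once, so the
   innermost loop carries no dependency other than the sum into v. */
void fill_mesh_tables(float *d, int W, int H, int maxO,
                      float A, float F, int p)
{
    float a[MAX_OCTAVES];
    float f[MAX_OCTAVES];
    float ac = 1.0f;
    float fc = 1.0f;

    /* Assumption for this sketch only: clamp to the table size. */
    if (maxO > MAX_OCTAVES)
        maxO = MAX_OCTAVES;

    for (int o = 0; o < maxO; ++o) {
        a[o] = ac;
        f[o] = fc;
        ac *= A;
        fc *= F;
    }

    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            float v = 0.0f;
            for (int o = 0; o < maxO; ++o)
                v += a[o] * g(f[o] * (float)x, f[o] * (float)y, o, h(o, p));
            d[(size_t)y * W + x] = v;
        }
    }
}
```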

The vectorizer should vectorize along the 'foreach x' loop and compute
multiple x values, one per lane, completely independently. It is true that the
updated a and f need to be replicated in every lane, but that can be done
either by computing them in scalar code and then broadcasting, or by repeating
the same constant updates in each lane.
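
To illustrate the intended shape, here is a hand strip-mined sketch in plain
C. It is not what gcc emits and not the attached code: LANES is an arbitrary
width, g and h are hypothetical stand-ins, and H and p are assumed names. a
and f stay scalar and are implicitly broadcast to every lane, while each lane
keeps its own accumulator v.

```c
#include <stddef.h>

enum { LANES = 8 };  /* hypothetical vector width, in floats */

/* Hypothetical stand-ins for the report's pure functions g and h. */
static inline float h(int o, int p)
{
    return (float)((o * 31 + p) & 7);
}

static inline float g(float fx, float fy, int o, float hv)
{
    return fx * 0.5f + fy * 0.25f + (float)o + hv;
}

void fill_mesh_lanes(float *d, int W, int H, int maxO,
                     float A, float F, int p)
{
    for (int y = 0; y < H; ++y) {
        int x = 0;

        /* Main part: LANES consecutive x values per iteration. */
        for (; x + LANES <= W; x += LANES) {
            float v[LANES] = {0.0f};
            float a = 1.0f;  /* scalar, same value for all lanes */
            float f = 1.0f;  /* scalar, same value for all lanes */

            for (int o = 0; o < maxO; ++o) {
                float hv = h(o, p);  /* lane-invariant */
                for (int l = 0; l < LANES; ++l)
                    v[l] += a * g(f * (float)(x + l), f * (float)y, o, hv);
                a *= A;
                f *= F;
            }
            for (int l = 0; l < LANES; ++l)
                d[(size_t)y * W + x + l] = v[l];
        }

        /* Scalar tail for the remaining W % LANES columns. */
        for (; x < W; ++x) {
            float v = 0.0f;
            float a = 1.0f;
            float f = 1.0f;
            for (int o = 0; o < maxO; ++o) {
                v += a * g(f * (float)x, f * (float)y, o, h(o, p));
                a *= A;
                f *= F;
            }
            d[(size_t)y * W + x] = v;
        }
    }
}
```

The loop over l has no cross-lane dependency, so it maps directly onto one
vector operation per statement.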
