https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

            Bug ID: 92130
           Summary: Missed vectorization for iteration dependent loads and
                    simple multiplicative accumulators
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

Created attachment 47051
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47051&action=edit
Perlin2D noise mesh generation

So,

I do have pretty complex multi level loop spread across many functions, but it
can be all vectorized, but under certain scenarios gcc does not vectorize it
with gcc 9.2.1

I am attaching somehow simplified code with few defines inside to play with it.

The one exposed by default present the biggest challenge to gcc, despite me
able to vectorize it manually.

I tested this on SSE2, AVX2 (cascadelake and znver2), AVX512 (-march=knm and
-march=skylake-avx512) and ARM SVE, with all same effects. I am using
associative math and other flags mentioned in the sourcefile at the top.

The high level overview is like this:

input: A, F, W, maxO, sufficiently aligned d.

foreach y:
  foreach x:
    float v = 0.0
    float a = 1.0
    float f = 1.0
    foreach o in [0, maxO):
      v += a * g(f * x, f * y, o, h(o, p))
      a *= A
      f *= F
    d[y*W + x] = v

where both g and h are pure functions (relatively complex tho) with no control
flow or data dependent flow.

In some situations if a and f are replaced by a precomputed table of
coefficient for every o, and then used as v += a[o] * g(f[o] * x, f[o] * y,
h(o, p)), it does vectorize, but not always. h(o, p) could also be precomputed,
but I didn't bother as it appears to not have any bad effect on vectorizer.

Vectorizater should vectorize along the 'foreach x', and compute multiple x-s
per-lane completely independently. It is true that when updating a and f, each
lane need to be duplicated, but that can be done by computing it scalarly, and
then broadcasting, or by repeating same constants updates in each lane.

Reply via email to