https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130
Bug ID: 92130 Summary: Missed vectorization for iteration dependent loads and simple multiplicative accumulators Product: gcc Version: 9.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: --- Created attachment 47051 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47051&action=edit Perlin2D noise mesh generation So, I do have pretty complex multi level loop spread across many functions, but it can be all vectorized, but under certain scenarios gcc does not vectorize it with gcc 9.2.1 I am attaching somehow simplified code with few defines inside to play with it. The one exposed by default present the biggest challenge to gcc, despite me able to vectorize it manually. I tested this on SSE2, AVX2 (cascadelake and znver2), AVX512 (-march=knm and -march=skylake-avx512) and ARM SVE, with all same effects. I am using associative math and other flags mentioned in the sourcefile at the top. The high level overview is like this: input: A, F, W, maxO, sufficiently aligned d. foreach y: foreach x: float v = 0.0 float a = 1.0 float f = 1.0 foreach o in [0, maxO): v += a * g(f * x, f * y, o, h(o, p)) a *= A f *= F d[y*W + x] = v where both g and h are pure functions (relatively complex tho) with no control flow or data dependent flow. In some situations if a and f are replaced by a precomputed table of coefficient for every o, and then used as v += a[o] * g(f[o] * x, f[o] * y, h(o, p)), it does vectorize, but not always. h(o, p) could also be precomputed, but I didn't bother as it appears to not have any bad effect on vectorizer. Vectorizater should vectorize along the 'foreach x', and compute multiple x-s per-lane completely independently. It is true that when updating a and f, each lane need to be duplicated, but that can be done by computing it scalarly, and then broadcasting, or by repeating same constants updates in each lane.