[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-05-14 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #7 from mjr19 at cam dot ac.uk ---
Another manifestation of this issue in GCC 13.1 and 14.1 is that the loop

  do i=1,n
 c(i)=a(i)*c(i)*(0d0,1d0)
  enddo

takes about twice as long to run as

  do i=1,n
 c(i)=a(i)*(0d0,1d0)*c(i)
  enddo

when compiled with -Ofast -mavx2. In the second case the compiler manages to
merge its unnecessary desire to form separate vectors of real and imaginary
components (needed only for the sign flips when multiplying by i) with its
much more reasonable desire to form such vectors for the general
complex-complex multiplication.

One might also argue that, as the above expressions are mathematically
identical, at -Ofast the compiler ought to choose the faster form anyway.
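The mathematical identity is easy to check; a minimal C sketch (illustrative
function names, not from the report) of the two orderings, which agree to
rounding error:

```c
#include <assert.h>
#include <complex.h>
#include <math.h>

/* c(i) = a(i)*c(i)*(0d0,1d0): multiply first, then rotate by i. */
double _Complex mul_then_rotate(double _Complex a, double _Complex c)
{
    return a * c * I;
}

/* c(i) = a(i)*(0d0,1d0)*c(i): rotate a by i first, then multiply. */
double _Complex rotate_then_mul(double _Complex a, double _Complex c)
{
    return a * I * c;
}

/* The two orderings agree to rounding error, so a compiler invoked with
   -Ofast (which licenses reassociation) could legitimately pick whichever
   form generates the faster code. */
int orderings_agree(double _Complex a, double _Complex c)
{
    double _Complex d = mul_then_rotate(a, c) - rotate_then_mul(a, c);
    return fabs(creal(d)) < 1e-12 && fabs(cimag(d)) < 1e-12;
}
```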

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-19 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #6 from mjr19 at cam dot ac.uk ---
I was starting to wonder whether this issue might be related to that in bug
114324, which is a slightly more complicated example in which multiplication by
a purely imaginary number destroys vectorisation.

In 114324 the problem seems to arise from a refusal to alternate no-ops and
negs along a vector, which is pretty much the issue here too.

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-18 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #5 from Roger Sayle  ---
Another interesting (simpler) case of -ffast-math pessimization is:
void foo(_Complex double *c)
{
for (int i=0; i<16; i++)
  c[i] += __builtin_complex(1.0,0.0);
}

Again without -ffast-math we vectorize consecutive additions, but with
-ffast-math we (not so) cleverly avoid every second addition by producing
significantly larger code that shuffles the real/imaginary parts around.

This even suggests a missed-optimization for:
void bar(_Complex double *c, double x)
{
for (int i=0; i<16; i++)
  c[i] += x;
}

which may be more efficiently implemented (when safe) by:
void bar(_Complex double *c, double x)
{
for (int i=0; i<16; i++)
  c[i] += __builtin_complex(x,0.0);
}

i.e. insert/interleave a no-op zero addition, to simplify the vectorization.

The existence of a suitable identity operation (+0, *1.0, &~0, |0, ^0) can be
used to avoid shuffling/permuting values/lanes out of vectors, when it's
possible for the vector operation to leave the other values unchanged.
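The zero-imaginary padding in bar can be checked concretely; the sketch below
(illustrative function names, not from the report) also exposes the one case
where the padded addition is not a no-op, which is why it is only safe in the
absence of signed zeros:

```c
#include <assert.h>
#include <complex.h>
#include <math.h>

/* Scalar addend: only the real parts are touched; the imaginary parts
   are left completely alone. */
void add_scalar(double _Complex *c, int n, double x)
{
    for (int i = 0; i < n; i++)
        c[i] += x;
}

/* Identity-padded addend: the imaginary lanes add 0.0. That is a no-op
   for every value except -0.0 (because -0.0 + 0.0 == +0.0), so the
   rewrite needs -fno-signed-zeros to be licensed; in exchange both
   lanes now perform the same operation and vectorize without shuffles. */
void add_padded(double _Complex *c, int n, double x)
{
    for (int i = 0; i < n; i++)
        c[i] += __builtin_complex(x, 0.0);
}
```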

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-18 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #4 from mjr19 at cam dot ac.uk ---
An issue which I suspect is related is shown by

subroutine zradd(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)

  do i=1,n
 c(i)=c(i)+1d0
  enddo
end subroutine

If compiled with gfortran-14 and -O3 -mavx2 it all looks very sensible.

If one adds -ffast-math, it looks a lot less sensible, and takes over 70%
longer to run. I think it has changed from promoting 1d0 to (1d0,0d0) and then
adding that (which one might argue a strict interpretation of the Fortran
standard requires, though I am not certain that it does), to collecting all the
real parts in a vector, adding 1d0 to them, and avoiding adding 0d0 to the
imaginary parts. Unsurprisingly the gain in halving the number of additions is
more than offset by the required vperms and vshufs.

Ideally -ffast-math would have noticed that adding 0d0 to the imaginary part is
not necessary, but then concluded that doing so was faster than any alternative
method, and so done so anyway.

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-18 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot com

--- Comment #3 from Roger Sayle  ---
Richard has already changed this from "gfortran" to "tree-optimization", but
for the record, the C equivalent of this test case (with the same issue) is:

void scale_i(_Complex double *c, int n)
{
for (int i=0; i<n; i++)
  c[i] *= __builtin_complex(0.0, 1.0);
}

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-18 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #2 from mjr19 at cam dot ac.uk ---
Ah, I see. An inability to alternate negation with noop also means that
conjugation is treated suboptimally.

  do i=1,n
 c(i)=conjg(c(i))
  enddo

Here gfortran-13 and -14 are differently suboptimal, and again it appears to be
because they don't wish to alternate noops and negations along a single vector. 

In this case -14 fails to vectorise, and loads just the imaginary values,
negating them and storing them, moving with a stride of two. Not a bad answer,
but one which would not generalise well should the loop contain other
operations on c(i).

In practice almost every vector operation can be trivially alternated with a
noop, as +, -, *, xor, or, and, all have special values which reduce the
operation to a noop and which could be used to pad things. Is there no scope
for making the SLP build more flexible here, as otherwise shuffling and
permuting can cost both time and registers?
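As a concrete instance of the xor identity mentioned above, conjugation really
is an alternation of no-ops and sign flips at the bit level. A hypothetical
scalar sketch (illustrative name) of the single vector XOR a vectorizer could
emit:

```c
#include <assert.h>
#include <complex.h>
#include <stdint.h>
#include <string.h>

/* Conjugate by XORing the sign bit into the imaginary half; the real
   half is XORed with 0, the identity element for XOR. A vectorizer
   could apply the same idea to a whole vector with one vxorpd against
   the constant mask {0, 1<<63, 0, 1<<63}. C guarantees _Complex double
   is laid out as {real, imaginary}. */
double _Complex conj_by_xor(double _Complex z)
{
    uint64_t bits[2];
    memcpy(bits, &z, sizeof bits);
    bits[0] ^= 0;                      /* real lane: no-op       */
    bits[1] ^= UINT64_C(1) << 63;      /* imaginary lane: negate */
    memcpy(&z, bits, sizeof z);
    return z;
}
```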

I also note that this means that -ffast-math (or simply -fno-signed-zeros)
slows down these examples. This is unfortunate: -fno-signed-zeros often gives
a useful performance increase, but that gain can be reduced or reversed by
other loops, perhaps in the same source file, where it reduces performance.

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
   Last reconfirmed||2024-04-18
 Blocks||53947
  Component|fortran |tree-optimization
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
When you avoid -ffast-math you'll get

.L4:
        vmovupd   (%rax), %ymm0
        addq      $32, %rax
        vmulpd    %ymm2, %ymm0, %ymm1
        vpermilpd $5, %ymm0, %ymm0
        vaddsubpd %ymm0, %ymm1, %ymm1
        vmovupd   %ymm1, -32(%rax)
        cmpq      %rax, %rdx
        jne       .L4

this consumes the negation with the vaddsubpd but has a multiplication
by zero for the sake of preserving signed zeros.
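The signed-zero hazard is concrete: dropping the zero terms changes the sign
of a zero whenever the real part is -0.0. A small check (illustrative, not
from the thread) of the two strategies:

```c
#include <assert.h>
#include <complex.h>
#include <math.h>

/* Multiply by (0,1) with every term carried out, as the vectorized code
   above effectively does: (a+bi)*(0+1i) = (a*0 - b*1) + (a*1 + b*0)i. */
double _Complex mul_by_i_full(double _Complex z)
{
    double a = creal(z), b = cimag(z);
    return __builtin_complex(a * 0.0 - b * 1.0,
                             a * 1.0 + b * 0.0);
}

/* The swap-and-negate shortcut, valid only under -fno-signed-zeros:
   for z = -0.0 + 5.0i the imaginary result is -0.0 here, but the full
   product gives -0.0 + 5.0*0.0 == +0.0. */
double _Complex mul_by_i_fast(double _Complex z)
{
    return __builtin_complex(-cimag(z), creal(z));
}
```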

The main issue in the way of optimal code is that we lower the operations
early, arriving at

  _6 = REALPART_EXPR <(*c_10(D))[_1]>;
  _5 = IMAGPART_EXPR <(*c_10(D))[_1]>;
  _14 = -_5;
  REALPART_EXPR <(*c_10(D))[_1]> = _14;
  IMAGPART_EXPR <(*c_10(D))[_1]> = _6;

and vectorizing that fails to use SLP, instead introducing an interleaving
chain. Negation isn't supported as a "two-operators" op, as there's no
corresponding "no negation" operation and we lack a way to insert a noop
during SLP build.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations