[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #7 from mjr19 at cam dot ac.uk ---
Another manifestation of this issue in GCC 13.1 and 14.1 is that the loop

  do i=1,n
     c(i)=a(i)*c(i)*(0d0,1d0)
  enddo

takes about twice as long to run as

  do i=1,n
     c(i)=a(i)*(0d0,1d0)*c(i)
  enddo

when compiled with -Ofast -mavx2. In the second case the compiler manages to merge its unnecessary desire to form separate vectors of real and imaginary components (to perform the sign flips on multiplying by i) with its much more reasonable desire to form such vectors for the general complex-complex multiplication. One might also argue that, as the two expressions are mathematically identical, at -Ofast the compiler ought to choose the faster form anyway.
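That the two orderings agree exactly can be checked with a scalar C sketch (the function names here are illustrative, not part of the test case):

```c
#include <assert.h>
#include <complex.h>

/* c(i)=a(i)*c(i)*(0d0,1d0): multiply first, then rotate by i. */
static double complex order_a(double a, double complex c)
{
    return a * c * I;
}

/* c(i)=a(i)*(0d0,1d0)*c(i): rotate the real scalar first. */
static double complex order_b(double a, double complex c)
{
    return a * I * c;
}
```

Since only a sign flip and a swap of components are involved, the two orderings produce bit-identical results, so picking the cheaper one under -Ofast would be value-preserving.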
--- Comment #6 from mjr19 at cam dot ac.uk ---
I was starting to wonder whether this issue might be related to bug 114324, which is a slightly more complicated example in which multiplication by a purely imaginary number destroys vectorisation. In 114324 the problem seems to arise from a refusal to alternate no-ops and negations along a vector, which is pretty much the issue here too.
--- Comment #5 from Roger Sayle ---
Another interesting (simpler) case of -ffast-math pessimization is:

  void foo(_Complex double *c)
  {
    for (int i=0; i<16; i++)
      c[i] += __builtin_complex(1.0,0.0);
  }

Again, without -ffast-math we vectorize consecutive additions, but with -ffast-math we (not so) cleverly avoid every second addition by producing significantly larger code that shuffles the real/imaginary parts around.

This even suggests a missed optimization for:

  void bar(_Complex double *c, double x)
  {
    for (int i=0; i<16; i++)
      c[i] += x;
  }

which may be more efficiently implemented (when safe) as:

  void bar(_Complex double *c, double x)
  {
    for (int i=0; i<16; i++)
      c[i] += __builtin_complex(x,0.0);
  }

i.e. insert/interleave a no-op zero addition to simplify the vectorization. The existence of a suitable identity operation (+0, *1.0, &~0, |0, ^0) can be used to avoid shuffling/permuting values/lanes out of vectors, when it is possible for the vector operation to leave the other values unchanged.
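The claimed equivalence of the two bar variants is easy to check element-wise; a minimal sketch (with an explicit length parameter added for illustration, and assuming +0.0 on the imaginary part is safe, e.g. under -fno-signed-zeros):

```c
#include <assert.h>
#include <complex.h>

/* Adding a real scalar to a complex value only touches the real part. */
static void bar_real(_Complex double *c, double x, int n)
{
    for (int i = 0; i < n; i++)
        c[i] += x;                          /* real += x, imag untouched */
}

/* Same loop with an interleaved no-op +0.0 on the imaginary part,
   using GCC's __builtin_complex as in the comment above. */
static void bar_complex(_Complex double *c, double x, int n)
{
    for (int i = 0; i < n; i++)
        c[i] += __builtin_complex(x, 0.0);  /* real += x, imag += 0.0 */
}
```

The padded form keeps both lanes of each vector busy with the same operation, which is exactly the shape the vectorizer handles without any shuffles.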
--- Comment #4 from mjr19 at cam dot ac.uk ---
An issue which I suspect is related is shown by

  subroutine zradd(c,n)
    integer :: i,n
    complex(kind(1d0)) :: c(*)
    do i=1,n
       c(i)=c(i)+1d0
    enddo
  end subroutine

If compiled with gfortran-14 and -O3 -mavx2, the result looks very sensible. If one adds -ffast-math, it looks much less sensible, and takes over 70% longer to run. I think it has changed from promoting 1d0 to (1d0,0d0) and then adding that (which one might argue a strict interpretation of the Fortran standard requires, though I am not certain that it does), to collecting all the real parts in a vector, adding 1d0 to them, and avoiding adding 0d0 to the imaginary parts. Unsurprisingly, the gain from halving the number of additions is more than offset by the required vperms and vshufs. Ideally -ffast-math would have noticed that adding 0d0 to the imaginary part is not necessary, but then concluded that doing so was faster than any alternative, and done so anyway.
Roger Sayle changed:

  What |Removed |Added
  CC   |        |roger at nextmovesoftware dot com

--- Comment #3 from Roger Sayle ---
Richard has already changed this from "gfortran" to "tree-optimization", but for the record, the C equivalent of this test case (with the same issue) is:

  void scale_i(_Complex double *c, int n)
  {
    for (int i=0; i<n; i++)
      c[i] *= __builtin_complex(0.0, 1.0);
  }
--- Comment #2 from mjr19 at cam dot ac.uk ---
Ah, I see. An inability to alternate negation with a no-op also means that conjugation is treated suboptimally:

  do i=1,n
     c(i)=conjg(c(i))
  enddo

Here gfortran-13 and -14 are differently suboptimal, and again it appears to be because they do not wish to alternate no-ops and negations along a single vector. In this case gfortran-14 fails to vectorise, and instead loads just the imaginary values with a stride of two, negates them, and stores them back. Not a bad answer, but one which would not generalise well should the loop contain other operations on c(i).

In practice almost every vector operation can be trivially alternated with a no-op, as +, -, *, xor, or, and all have special values which reduce the operation to a no-op and which could be used as padding. Is there no scope for making the SLP build more flexible here, as otherwise the shuffling and permuting can cost both time and registers?

I also note that this means -ffast-math (or simply -fno-signed-zeros) slows down these examples. This is unfortunate, as there are cases in which -fno-signed-zeros gives a useful performance increase, but that gain can be reduced or reversed when there are other cases, perhaps in the same source file, where it reduces performance.
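Conjugation is exactly such a case: negating the imaginary part is a single xor of the sign bit, and the corresponding no-op on the real part is xor with zero. A scalar sketch of that bit trick (purely illustrative, not GCC's implementation):

```c
#include <assert.h>
#include <complex.h>
#include <stdint.h>
#include <string.h>

/* xor the bit pattern of a double with a mask: 1<<63 negates,
   0 is the identity (the "no-op" lane of the alternation). */
static double xor_sign(double x, uint64_t mask)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= mask;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* conj(z): no-op on the real part, sign flip on the imaginary part. */
static double complex conj_via_xor(double complex z)
{
    return __builtin_complex(xor_sign(creal(z), 0),
                             xor_sign(cimag(z), UINT64_C(1) << 63));
}
```

Vectorised, this is one xorpd of the whole vector against an alternating {0, signbit, 0, signbit, ...} mask, with no shuffles or strided accesses at all.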
Richard Biener changed:

  What             |Removed     |Added
  Keywords         |            |missed-optimization
  Last reconfirmed |            |2024-04-18
  Blocks           |            |53947
  Component        |fortran     |tree-optimization
  Status           |UNCONFIRMED |NEW
  Ever confirmed   |0           |1

--- Comment #1 from Richard Biener ---
When you avoid -ffast-math you'll get

  .L4:
        vmovupd   (%rax), %ymm0
        addq      $32, %rax
        vmulpd    %ymm2, %ymm0, %ymm1
        vpermilpd $5, %ymm0, %ymm0
        vaddsubpd %ymm0, %ymm1, %ymm1
        vmovupd   %ymm1, -32(%rax)
        cmpq      %rax, %rdx
        jne       .L4

This consumes the negation with the vaddsubpd but has a multiplication by zero for the sake of preserving signed zeros.

The main issue in the way of optimal code is that we lower the operations early, arriving at

  _6 = REALPART_EXPR <(*c_10(D))[_1]>;
  _5 = IMAGPART_EXPR <(*c_10(D))[_1]>;
  _14 = -_5;
  REALPART_EXPR <(*c_10(D))[_1]> = _14;
  IMAGPART_EXPR <(*c_10(D))[_1]> = _6;

and vectorizing that fails to use SLP, which introduces an interleaving chain. Negation isn't supported as a "two-operators" op, as there is no corresponding "no negation" operation and we lack a way to insert a no-op during SLP build.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
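The lowered GIMPLE above can be rendered as plain scalar C to show why SLP struggles (the function name is illustrative; it is the per-element body of the earlier scale_i loop):

```c
#include <assert.h>

/* Per element, multiplication by i means: the real slot receives the
   negated imaginary part and the imaginary slot receives the old real
   part.  One lane needs a negation, the other a plain copy -- and a
   plain copy is the "no negation" op that SLP currently cannot express. */
static void scale_i_lowered(double *re, double *im)
{
    double r  = *re;   /* _6  = REALPART_EXPR <...> */
    double i  = *im;   /* _5  = IMAGPART_EXPR <...> */
    double ni = -i;    /* _14 = -_5                 */
    *re = ni;          /* REALPART_EXPR <...> = _14 */
    *im = r;           /* IMAGPART_EXPR <...> = _6  */
}
```

With a two-operators negate/copy pair available, the whole element would reduce to one in-register swap plus one sign-bit xor, with no interleaving chain.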