https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81127
Bug ID: 81127
Summary: Complex division misses vectorisation opportunity
Product: gcc
Version: 7.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: drraph at gmail dot com
Target Milestone: ---

This report has two parts. The first is about complex float division and the
second about complex double division.

--- Part 1 ---

Consider:

    #include <complex.h>
    complex float f(complex float x, complex float y) {
      return x/y;
    }

In gcc 7.1 with -O3 -march=core-avx2 -ffast-math you get:

    f:
            vmovq   QWORD PTR [rsp-16], xmm1
            vmovss  xmm5, DWORD PTR [rsp-12]
            vmovss  xmm4, DWORD PTR [rsp-16]
            vmovq   QWORD PTR [rsp-8], xmm0
            vmovss  xmm0, DWORD PTR [rsp-4]
            vmovss  xmm3, DWORD PTR [rsp-8]
            vmulss  xmm2, xmm5, xmm5
            vmulss  xmm1, xmm0, xmm5
            vfmadd231ss     xmm2, xmm4, xmm4
            vfmadd231ss     xmm1, xmm3, xmm4
            vmulss  xmm3, xmm3, xmm5
            vdivss  xmm1, xmm1, xmm2
            vfmsub132ss     xmm0, xmm3, xmm4
            vdivss  xmm0, xmm0, xmm2
            vmovss  DWORD PTR [rsp-24], xmm1
            vmovss  DWORD PTR [rsp-20], xmm0
            vmovq   xmm0, QWORD PTR [rsp-24]
            ret

Note three calls to vmulss and two calls to vdivss.

ICC on the other hand gives:

    f:
            vcvtps2pd xmm2, xmm1                      #3.12
            vcvtps2pd xmm4, xmm0                      #3.12
            vmulpd    xmm8, xmm2, xmm2                #3.12
            vunpckhpd xmm3, xmm2, xmm2                #3.12
            vmulpd    xmm6, xmm3, xmm4                #3.12
            vmovddup  xmm7, xmm2                      #3.12
            vshufpd   xmm5, xmm4, xmm4, 1             #3.12
            vshufpd   xmm9, xmm8, xmm8, 1             #3.12
            vfmaddsub213pd xmm7, xmm5, xmm6           #3.12
            vaddpd    xmm11, xmm8, xmm9               #3.12
            vshufpd   xmm10, xmm7, xmm7, 1            #3.12
            vdivpd    xmm12, xmm10, xmm11             #3.12
            vcvtpd2ps xmm0, xmm12                     #3.12
            ret

Note two calls to vmulpd and one call to vdivpd.

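For reference, the instruction counts above follow from the textbook identity x/y = x*conj(y)/|y|^2. The following is a minimal sketch of that identity in C, not GCC's or ICC's actual expansion. The function name cdiv_textbook is hypothetical, and the code assumes finite inputs with y != 0 and no overflow handling, which is what -fcx-limited-range (implied by -ffast-math) allows.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical helper: textbook complex division,
   x / y = x * conj(y) / |y|^2.
   ASSUMPTION: finite inputs, y != 0, no overflow or NaN handling. */
float complex cdiv_textbook(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);

    float denom = c * c + d * d;   /* |y|^2, shared by both components */
    float re = a * c + b * d;      /* real part of x * conj(y) */
    float im = b * c - a * d;      /* imaginary part of x * conj(y) */

    /* Two scalar divisions by the same denominator: these are the
       candidates for a single packed division. */
    return re / denom + (im / denom) * I;
}
```

Both components are divided by the same denominator, so the pair (re, im) can sit in one vector register and be divided by a broadcast |y|^2 with a single packed instruction, which is what the ICC output does with vdivpd.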
Just for interest, if you increase the optimisation level (using -fp-model
fast=2) ICC also offers this alternative:

    f:
            vmovlhps  xmm2, xmm1, xmm1                #3.12
            vmulps    xmm8, xmm2, xmm2                #3.12
            vshufps   xmm9, xmm8, xmm8, 177           #3.12
            vmovlhps  xmm4, xmm0, xmm0                #3.12
            vaddps    xmm10, xmm8, xmm9               #3.12
            vrcpps    xmm11, xmm10                    #3.12
            vmovshdup xmm3, xmm2                      #3.12
            vaddps    xmm12, xmm11, xmm11             #3.12
            vmulps    xmm6, xmm4, xmm3                #3.12
            vmulps    xmm14, xmm11, xmm10             #3.12
            vmovsldup xmm7, xmm2                      #3.12
            vshufps   xmm5, xmm4, xmm4, 177           #3.12
            vfmaddsub213ps xmm7, xmm5, xmm6           #3.12
            vfnmadd213ps xmm14, xmm11, xmm12          #3.12
            vshufps   xmm13, xmm7, xmm7, 177          #3.12
            vmulps    xmm0, xmm13, xmm14              #3.12
            ret

Note one call to vrcpps, four calls to vmulps and no division instruction
(vdivps) at all.

--- Part 2 ---

Consider:

    #include <complex.h>
    complex double f(complex double x, complex double y) {
      return x/y;
    }

In gcc 7.1 with -O3 -march=core-avx2 -ffast-math you get:

    f:
            vmulsd  xmm4, xmm1, xmm3
            vmovapd xmm6, xmm0
            vmulsd  xmm5, xmm3, xmm3
            vmulsd  xmm6, xmm6, xmm3
            vfmadd231sd     xmm4, xmm0, xmm2
            vfmadd231sd     xmm5, xmm2, xmm2
            vfmsub132sd     xmm1, xmm6, xmm2
            vdivsd  xmm0, xmm4, xmm5
            vdivsd  xmm1, xmm1, xmm5
            ret

With ICC and -fp-model fast=2 you get:

    f:
            vunpcklpd xmm4, xmm2, xmm3                #2.54
            vunpcklpd xmm6, xmm0, xmm1                #2.54
            vunpckhpd xmm5, xmm4, xmm4                #3.12
            vmulpd    xmm10, xmm4, xmm4               #3.12
            vmulpd    xmm8, xmm5, xmm6                #3.12
            vmovddup  xmm9, xmm4                      #3.12
            vshufpd   xmm7, xmm6, xmm6, 1             #3.12
            vshufpd   xmm11, xmm10, xmm10, 1          #3.12
            vfmaddsub213pd xmm9, xmm7, xmm8           #3.12
            vaddpd    xmm13, xmm10, xmm11             #3.12
            vshufpd   xmm12, xmm9, xmm9, 1            #3.12
            vdivpd    xmm0, xmm12, xmm13              #3.12
            vunpckhpd xmm1, xmm0, xmm0                #3.12
            ret

Compared with the three vmulsd and two vdivsd in the gcc output, this again
reduces the number of multiply instructions to two (vmulpd) and the number of
divisions to one (vdivpd).

It would be great to have benchmarks for all of this but I don't have a copy
of ICC to test.
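As a side note on the vrcpps variant from Part 1: the vaddps / vmulps / vfnmadd213ps sequence there appears to compute 2r - (r*d)*r, which is one Newton-Raphson refinement of the reciprocal estimate r of d = |y|^2. Below is a scalar sketch of that idea in C, not a reproduction of ICC's code. All three function names are hypothetical, and approx_reciprocal is only a stand-in that fakes an estimate with roughly 12 bits of accuracy, since portable C has no equivalent of vrcpps.

```c
#include <assert.h>
#include <complex.h>

/* Stand-in for vrcpps (ASSUMPTION): a reciprocal that is deliberately
   off by a relative error of 2^-12, roughly the accuracy of a hardware
   estimate. Real code would use the instruction or an intrinsic. */
float approx_reciprocal(float d)
{
    return (1.0f / d) * (1.0f + 0x1p-12f);
}

/* One Newton-Raphson step for 1/d: r' = 2r - (r*d)*r = r*(2 - d*r).
   The relative error is roughly squared by each step. */
float refine_reciprocal(float d, float r)
{
    return (r + r) - (r * d) * r;
}

/* Complex division with no division instruction in the hot path:
   x / y = x * conj(y) * (1 / |y|^2).
   ASSUMPTION: finite inputs, y != 0, fast-math style accuracy. */
float complex cdiv_reciprocal(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);

    float denom = c * c + d * d;
    float inv = refine_reciprocal(denom, approx_reciprocal(denom));

    return (a * c + b * d) * inv + ((b * c - a * d) * inv) * I;
}
```

The result is not correctly rounded in general, which is presumably why ICC only emits this form under -fp-model fast=2.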