https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81127

            Bug ID: 81127
           Summary: Complex division misses vectorisation opportunity
           Product: gcc
           Version: 7.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: drraph at gmail dot com
  Target Milestone: ---

This report has two parts. The first is about complex float division and the
second about complex double division.

--- Part 1 ---

Consider:

#include <complex.h>
complex float f(complex float x, complex float y) {
  return x/y;
}

In gcc 7.1 with -O3 -march=core-avx2 -ffast-math you get:

f:
        vmovq   QWORD PTR [rsp-16], xmm1
        vmovss  xmm5, DWORD PTR [rsp-12]
        vmovss  xmm4, DWORD PTR [rsp-16]
        vmovq   QWORD PTR [rsp-8], xmm0
        vmovss  xmm0, DWORD PTR [rsp-4]
        vmovss  xmm3, DWORD PTR [rsp-8]
        vmulss  xmm2, xmm5, xmm5
        vmulss  xmm1, xmm0, xmm5
        vfmadd231ss     xmm2, xmm4, xmm4
        vfmadd231ss     xmm1, xmm3, xmm4
        vmulss  xmm3, xmm3, xmm5
        vdivss  xmm1, xmm1, xmm2
        vfmsub132ss     xmm0, xmm3, xmm4
        vdivss  xmm0, xmm0, xmm2
        vmovss  DWORD PTR [rsp-24], xmm1
        vmovss  DWORD PTR [rsp-20], xmm0
        vmovq   xmm0, QWORD PTR [rsp-24]
        ret

Note the three vmulss instructions and the two vdivss instructions.
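
For reference, the scalar sequence above corresponds to the textbook formula
(a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c*c+d*d). A minimal C sketch of that
expansion follows. The name div_naive is hypothetical, and the comments only
indicate which instruction each line roughly maps to under -ffast-math.

```c
#include <complex.h>

/* Textbook complex division with no overflow or NaN handling, roughly
   what the scalar listing computes. div_naive is a hypothetical name. */
float complex div_naive(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);

    float denom = d * d + c * c;         /* vmulss + vfmadd231ss */
    float re = (b * d + a * c) / denom;  /* vmulss + vfmadd231ss + vdivss */
    float im = (b * c - a * d) / denom;  /* vmulss + vfmsub132ss + vdivss */

    return re + im * I;
}
```

Both quotients share the same denominator, which is why a packed division (or
one reciprocal) can replace the two scalar divisions.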

ICC on the other hand gives:

f:
        vcvtps2pd xmm2, xmm1                                    #3.12
        vcvtps2pd xmm4, xmm0                                    #3.12
        vmulpd    xmm8, xmm2, xmm2                              #3.12
        vunpckhpd xmm3, xmm2, xmm2                              #3.12
        vmulpd    xmm6, xmm3, xmm4                              #3.12
        vmovddup  xmm7, xmm2                                    #3.12
        vshufpd   xmm5, xmm4, xmm4, 1                           #3.12
        vshufpd   xmm9, xmm8, xmm8, 1                           #3.12
        vfmaddsub213pd xmm7, xmm5, xmm6                         #3.12
        vaddpd    xmm11, xmm8, xmm9                             #3.12
        vshufpd   xmm10, xmm7, xmm7, 1                          #3.12
        vdivpd    xmm12, xmm10, xmm11                           #3.12
        vcvtpd2ps xmm0, xmm12                                   #3.12
        ret   

Note the two vmulpd instructions and the single vdivpd instruction.

Just for interest, if you relax the floating-point model further (using
-fp-model fast=2), ICC also offers this alternative:

f:
        vmovlhps  xmm2, xmm1, xmm1                              #3.12
        vmulps    xmm8, xmm2, xmm2                              #3.12
        vshufps   xmm9, xmm8, xmm8, 177                         #3.12
        vmovlhps  xmm4, xmm0, xmm0                              #3.12
        vaddps    xmm10, xmm8, xmm9                             #3.12
        vrcpps    xmm11, xmm10                                  #3.12
        vmovshdup xmm3, xmm2                                    #3.12
        vaddps    xmm12, xmm11, xmm11                           #3.12
        vmulps    xmm6, xmm4, xmm3                              #3.12
        vmulps    xmm14, xmm11, xmm10                           #3.12
        vmovsldup xmm7, xmm2                                    #3.12
        vshufps   xmm5, xmm4, xmm4, 177                         #3.12
        vfmaddsub213ps xmm7, xmm5, xmm6                         #3.12
        vfnmadd213ps xmm14, xmm11, xmm12                        #3.12
        vshufps   xmm13, xmm7, xmm7, 177                        #3.12
        vmulps    xmm0, xmm13, xmm14                            #3.12
        ret   

Note the single vrcpps instruction, the four vmulps instructions and the
absence of any division instruction.
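
The vrcpps, vaddps, vmulps and vfnmadd213ps instructions in the listing above
look like a reciprocal estimate refined by one Newton-Raphson step,
r1 = 2*r0 - (r0*d)*r0. Here is a scalar C sketch of that refinement. The
function name refine_reciprocal is hypothetical, and the initial estimate is
passed in rather than produced by a hardware instruction.

```c
/* One Newton-Raphson step for 1/d, starting from an estimate r0.
   refine_reciprocal is a hypothetical name used for illustration. */
float refine_reciprocal(float d, float r0)
{
    float two_r = r0 + r0;   /* corresponds to vaddps r, r */
    float rd = r0 * d;       /* corresponds to vmulps r, d */
    return two_r - rd * r0;  /* corresponds to vfnmadd213ps */
}
```

Each step roughly squares the relative error of the estimate, so one step on
top of a hardware estimate can stand in for a division under a relaxed
floating-point model.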

--- Part 2 ---

Consider:

#include <complex.h>
complex double f(complex double x, complex double y) {
  return x/y;
}

In gcc 7.1 with -O3 -march=core-avx2 -ffast-math you get:

f:
        vmulsd  xmm4, xmm1, xmm3
        vmovapd xmm6, xmm0
        vmulsd  xmm5, xmm3, xmm3
        vmulsd  xmm6, xmm6, xmm3
        vfmadd231sd     xmm4, xmm0, xmm2
        vfmadd231sd     xmm5, xmm2, xmm2
        vfmsub132sd     xmm1, xmm6, xmm2
        vdivsd  xmm0, xmm4, xmm5
        vdivsd  xmm1, xmm1, xmm5
        ret

In ICC you get with -fp-model fast=2:

f:
        vunpcklpd xmm4, xmm2, xmm3                              #2.54
        vunpcklpd xmm6, xmm0, xmm1                              #2.54
        vunpckhpd xmm5, xmm4, xmm4                              #3.12
        vmulpd    xmm10, xmm4, xmm4                             #3.12
        vmulpd    xmm8, xmm5, xmm6                              #3.12
        vmovddup  xmm9, xmm4                                    #3.12
        vshufpd   xmm7, xmm6, xmm6, 1                           #3.12
        vshufpd   xmm11, xmm10, xmm10, 1                        #3.12
        vfmaddsub213pd xmm9, xmm7, xmm8                         #3.12
        vaddpd    xmm13, xmm10, xmm11                           #3.12
        vshufpd   xmm12, xmm9, xmm9, 1                          #3.12
        vdivpd    xmm0, xmm12, xmm13                            #3.12
        vunpckhpd xmm1, xmm0, xmm0                              #3.12
        ret  

This again reduces the number of multiplications to two and the number of
divisions to one.
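
To illustrate the packed form, here is a sketch using the GCC vector
extension (vector_size), which clang also accepts. It mirrors the data flow of
the ICC listing above. The names v2df and div_packed are hypothetical, and
whether a given compiler actually emits two vmulpd and one vdivpd for this
source is an assumption I have not verified.

```c
#include <complex.h>

/* Two doubles in one 128-bit vector (GCC/clang vector extension). */
typedef double v2df __attribute__((vector_size(16)));

/* Complex division arranged so both quotients come from one packed
   division. div_packed is a hypothetical name. */
double complex div_packed(double complex x, double complex y)
{
    double a = creal(x), b = cimag(x);
    double c = creal(y), d = cimag(y);

    v2df yv = { c, d };
    v2df sq = yv * yv;                         /* [c*c, d*d] */
    v2df den = { sq[0] + sq[1], sq[1] + sq[0] };

    v2df cross = (v2df){ d, d } * (v2df){ a, b };  /* [a*d, b*d] */
    v2df t = (v2df){ c, c } * (v2df){ b, a };      /* [b*c, a*c] */
    v2df num = { t[1] + cross[1], t[0] - cross[0] };

    v2df q = num / den;                        /* single packed division */
    return q[0] + q[1] * I;
}
```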

It would be great to have benchmarks for all of this but I don't have a copy of
ICC to test.
