Re: Rather Bizarre slow downs using Complex!float with avx (ldc).

2021-10-02 Thread Guillaume Piolat via Digitalmars-d-learn

On Friday, 1 October 2021 at 08:32:14 UTC, james.p.leblanc wrote:


Does anyone have insight to what is happening?

Thanks,
James


Maybe something related to: https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774 ?


AVX is not always a clear win in terms of performance. Processing 8x floats at once may not do anything if you are memory-bound, etc.
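[Editor's note: a rough illustration of the memory-bound point above, with my own back-of-the-envelope figures, not measurements from the thread. A `Complex!float` dot-product step does roughly 8 flops while reading 16 bytes, which is a very low arithmetic intensity, so wider SIMD units can sit idle waiting on loads:]

```d
import std.stdio;

// Rough arithmetic-intensity estimate for a Complex!float dot product
// (illustrative numbers only, not from the thread).
double dotprodIntensity()
{
    enum flopsPerElement = 8;                    // 4 mul + 2 add per complex product, 2 add to accumulate
    enum bytesPerElement = 2 * 2 * float.sizeof; // one Complex!float read from each input array
    return cast(double) flopsPerElement / bytesPerElement;
}

void main()
{
    // ~0.5 flop/byte: far too low to keep 8-wide FMA units busy.
    writefln("arithmetic intensity: %.2f flops/byte", dotprodIntensity());
}
```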


Re: Rather Bizarre slow downs using Complex!float with avx (ldc).

2021-10-01 Thread james.p.leblanc via Digitalmars-d-learn

On Thursday, 30 September 2021 at 16:52:57 UTC, Johan wrote:

On Thursday, 30 September 2021 at 16:40:03 UTC, james.p.leblanc wrote:


Generally, for performance issues like this you need to study 
assembly output (`--output-s`) or LLVM IR (`--output-ll`).

First thing I would look out for is function inlining yes/no.

cheers,
  Johan


Johan,

Thanks kindly for your reply.  As suggested, I have looked at the assembly output.

Strangely, the fused multiply-adds are indeed there in the AVX version, but the example still runs slower for the **Complex!float** data type.

I have stripped the code down to a minimum which demonstrates the weird result:




```d

import ldc.attributes;  // with or without this line makes no difference

import std.stdio;
import std.datetime.stopwatch;
import std.complex;

alias T = Complex!float;
auto typestr = "COMPLEX FLOAT";
/* alias T = Complex!double; */
/* auto typestr = "COMPLEX DOUBLE"; */

auto alpha = cast(T) complex(0.1, -0.2);  // dummy values to fill arrays
auto beta  = cast(T) complex(-0.7, 0.6);

auto dotprod(T[] x, T[] y)
{
    auto sum = cast(T) 0;
    foreach (size_t i; 0 .. x.length)
        sum += x[i] * conj(y[i]);
    return sum;
}

void main()
{
   int nEle = 1000;
   int nIter = 2000;

   auto startTime = MonoTime.currTime;
   auto dur = cast(double) (MonoTime.currTime - startTime).total!"usecs";


   T[] x, y;
   x.length = nEle;
   y.length = nEle;
   T z;
   x[] = alpha;
   y[] = beta;

   startTime = MonoTime.currTime;
   foreach (i; 0 .. nIter) {
      foreach (j; 0 .. nIter) {
         z = dotprod(x, y);
      }
   }
   auto etime = cast(double) (MonoTime.currTime - startTime).total!"msecs" / 1.0e3;
   writef(" result:  % 5.2f%+5.2fi  comp time:  %5.2f \n", z.re, z.im, etime);

}
```
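
[Editor's note: as an aside, here is one common workaround sketch, not something tried in the thread: a structure-of-arrays layout stores the real and imaginary parts in separate `float` arrays, so the auto-vectorizer sees plain float loops and can usually emit packed instructions instead of scalar complex math:]

```d
import std.complex;

// Structure-of-arrays dot product (illustrative sketch only, not code
// from the thread). The re/im parts of each operand live in separate
// float slices instead of interleaved Complex!float elements.
Complex!float dotprodSoA(const float[] xre, const float[] xim,
                         const float[] yre, const float[] yim)
{
    float sre = 0, sim = 0;
    foreach (i; 0 .. xre.length)
    {
        // (a + bi) * conj(c + di) = (ac + bd) + (bc - ad)i
        sre += xre[i] * yre[i] + xim[i] * yim[i];
        sim += xim[i] * yre[i] - xre[i] * yim[i];
    }
    return complex(sre, sim);
}
```

With the same fill values as the benchmark above (alpha = 0.1 - 0.2i, beta = -0.7 + 0.6i, 1000 elements) this computes the same -190.00+80.00i result.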

For convenience, I include the bash script used to compile, run, generate assembly code, and grep:


```bash
echo
echo "With AVX:"
ldc2 -O3 -release question.d --ffast-math -mcpu=haswell
question
ldc2 -output-s -O3 -release question.d --ffast-math -mcpu=haswell
mv question.s question_with_avx.s

echo
echo "Without AVX"
ldc2 -O3 -release question.d
question
ldc2 -output-s -O3 -release question.d
mv question.s question_without_avx.s

echo
echo "fused multiply adds are found in avx code (as desired)"
grep vfmadd *.s /dev/null
```

Here is output when run on my machine:

```console
With AVX:
 result:  -190.00+80.00i  comp time:   6.45

Without AVX
 result:  -190.00+80.00i  comp time:   5.74

fused multiply adds are found in avx code (as desired)
question_with_avx.s:vfmadd231ss %xmm2, %xmm5, %xmm3
question_with_avx.s:vfmadd231ss %xmm0, %xmm2, %xmm3
question_with_avx.s:vfmadd231ss %xmm2, %xmm4, %xmm1
question_with_avx.s:vfmadd231ss %xmm3, %xmm5, %xmm1
question_with_avx.s:vfmadd231ss %xmm3, %xmm1, %xmm0

```
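
[Editor's note: one detail worth flagging in that grep output: the `ss` suffix on `vfmadd231ss` denotes a *scalar* single-precision FMA (one float per instruction); a genuinely vectorized loop would show the packed `ps` forms such as `vfmadd231ps`. A quick check for packed FMAs, using a stand-in `sample.s` file since the real assembly output is not reproduced here:]

```shell
# Stand-in for question_with_avx.s, using one of the lines grep found above.
printf 'vfmadd231ss %%xmm2, %%xmm5, %%xmm3\n' > sample.s

# Packed single-precision FMAs end in "ps"; scalar ones end in "ss".
if grep -q 'vfmadd[0-9]*ps' sample.s; then
    echo "packed FMAs present"
else
    echo "scalar FMAs only"
fi
```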

Repeating the experiment after changing the data type to Complex!double shows the AVX code to be twice as fast (perhaps more in line with expectations).

**I admit my confusion as to why Complex!float is misbehaving.**


Does anyone have insight to what is happening?

Thanks,
James




Re: Rather Bizarre slow downs using Complex!float with avx (ldc).

2021-09-30 Thread Johan via Digitalmars-d-learn
On Thursday, 30 September 2021 at 16:40:03 UTC, james.p.leblanc wrote:

D-Ers,

I have been getting counterintuitive results on avx/no-avx 
timing

experiments.


This could be a template instantiation culling problem. If the 
compiler is able to determine that `Complex!float` is already 
instantiated (codegen) inside Phobos, then it may decide not to 
codegen it again when you are compiling your code with 
AVX+fastmath enabled. This could explain why you don't see 
improvement for `Complex!float`, but do see improvement with 
`Complex!double`. This does not explain the worse performance 
with AVX+fastmath vs without it.


Generally, for performance issues like this you need to study 
assembly output (`--output-s`) or LLVM IR (`--output-ll`).

First thing I would look out for is function inlining yes/no.

cheers,
  Johan
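
[Editor's note: Johan's inlining hypothesis can be tested directly. The sketch below is my own, not code from the thread: `pragma(inline, true)` asks the compiler to inline the function at every call site (it is an error if it cannot), so re-running the benchmark with it rules inlining in or out as the variable. LDC additionally offers `@fastmath` from `ldc.attributes` to apply fast-math per function.]

```d
import std.complex;

// Force inlining of the hot function (illustrative sketch): with
// pragma(inline, true) the call overhead and any cross-module
// inlining uncertainty are removed from the comparison.
pragma(inline, true)
Complex!float dotprodInline(const Complex!float[] x, const Complex!float[] y)
{
    auto sum = complex(0.0f, 0.0f);
    foreach (i; 0 .. x.length)
        sum += x[i] * conj(y[i]);
    return sum;
}
```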



Rather Bizarre slow downs using Complex!float with avx (ldc).

2021-09-30 Thread james.p.leblanc via Digitalmars-d-learn

D-Ers,

I have been getting counterintuitive results on avx/no-avx timing
experiments.  Storyline to date (notes at end):

**Experiment #1)** Real float data type (i.e. non-complex numbers), speed comparison.
  a)  moving from non-avx --> avx shows a non-realistic speed-up of 15-25X.
  b)  this is weird, but the story continues ...

**Experiment #2)** Real double data type (non-complex numbers):
  a)  moving from non-avx --> avx again shows amazing gains, but the
      gains are about half of those seen in Experiment #1, so maybe
      this looks plausible?

**Experiment #3)** Complex!float data type:
  a)  now **going from non-avx to avx shows a serious performance LOSS**
      of 40%, to breaking even at best.  What is happening here?

**Experiment #4)** Complex!double:
  a)  non-avx --> avx shows performance gains, again about 2X (so the
      gains appear to be reasonable).


The main question I have is:

**"What is going on with the Complex!float performance?"**  One might expect floats to have better performance than doubles, as we saw with the real-valued data (because of vector packing, memory bandwidth, etc.).

But **Complex!float shows MUCH WORSE AVX performance than Complex!double (by a factor of almost 4).**

```d
// Table of Computation Times
//
//       self math                std math
//  explicit   no-explicit   explicit   no-explicit
//   align      align         align      align
//    0.12       0.21          0.15       0.21   ;  # Float with AVX
//    3.23       3.24          3.30       3.22   ;  # Float without AVX
//    0.31       0.42          0.31       0.42   ;  # Double with AVX
//    3.25       3.24          3.24       3.27   ;  # Double without AVX
//    6.42       6.62          6.61       6.59   ;  # Complex!float with AVX
//    4.04       4.17          6.68       5.82   ;  # Complex!float without AVX
//    1.67       1.69          1.73       1.71   ;  # Complex!double with AVX
//    3.34       3.42          3.28       3.31      # Complex!double without AVX
```

Notes:

1) Based on forum hints from LDC experts, I got good guidance
   on enabling AVX (i.e. compiling modules on the command line, using
   --ffast-math and -mcpu=haswell on the command line).

2) From Mir-glas experts I received hints to try implementing my own
   version of the complex math (this is what the "self math" column
   refers to).
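
[Editor's note: the poster's "self math" implementation is not shown in this message; the following is purely an illustrative guess at what a hand-rolled complex type might look like:]

```d
// Minimal hand-rolled complex float type (illustrative guess only;
// the poster's actual "self math" code is not shown in the thread).
struct Cfloat
{
    float re, im;

    Cfloat opBinary(string op : "*")(Cfloat b) const
    {
        // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
        return Cfloat(re * b.re - im * b.im, re * b.im + im * b.re);
    }

    void opOpAssign(string op : "+")(Cfloat b)
    {
        re += b.re;
        im += b.im;
    }
}

Cfloat conj(Cfloat a) { return Cfloat(a.re, -a.im); }
```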


I understand that details of the computations are not included here (I
can do that if there is interest, and if I figure out an effective way
to present them in a forum).

But I thought I might begin with a simple question: **"Is there some
well-known issue that I am missing here?"  Have others been down this
road as well?**


Thanks for any and all input.
Best Regards,
James

PS  Sorry for the inelegant table ... I do not believe there is a way
to include beautiful bar charts on this forum.  Please correct me
if there is a way...