Re: Rather Bizarre slow downs using Complex!float with avx (ldc).
On Friday, 1 October 2021 at 08:32:14 UTC, james.p.leblanc wrote:
> Does anyone have insight into what is happening? Thanks, James

Maybe something related to this: https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774 ?

AVX is not always a clear win in terms of performance. Processing 8 floats at once may not gain anything if you are memory-bound, etc.
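To illustrate the memory-bound point, here is a hypothetical micro-benchmark sketch (not from the thread; `streamSum` is an illustrative name): once the working set is much larger than the last-level cache, a streaming loop runs at memory speed, so wider SIMD barely changes the timing.

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

// Sums a large array; for arrays far beyond cache size, the bottleneck is
// loading the data from RAM, not the arithmetic, with or without AVX.
double streamSum(const double[] a)
{
    double s = 0.0;
    foreach (x; a)
        s += x;
    return s;
}

void main()
{
    auto a = new double[](1 << 24); // 128 MiB working set, well beyond cache
    a[] = 1.0;
    auto sw = StopWatch(AutoStart.yes);
    auto s = streamSum(a);
    sw.stop();
    writefln("sum = %s, time = %s ms", s, sw.peek.total!"msecs");
}
```

Comparing this loop with and without `-mcpu=haswell` would show whether the machine in question is bandwidth-limited at this size.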
Re: Rather Bizarre slow downs using Complex!float with avx (ldc).
On Thursday, 30 September 2021 at 16:52:57 UTC, Johan wrote:
> Generally, for performance issues like this you need to study the assembly output (`--output-s`) or LLVM IR (`--output-ll`). The first thing I would look out for is function inlining yes/no.
>
> cheers, Johan

Johan,

Thanks kindly for your reply. As suggested, I have looked at the assembly output. Strangely, the fused multiply-adds are indeed there in the AVX version, but the example still runs slower for the **Complex!float** data type.

I have stripped the code down to a minimum that demonstrates the weird result:

```d
import ldc.attributes; // with or without this line makes no difference
import std.stdio;
import std.datetime.stopwatch;
import std.complex;

alias T = Complex!float;
auto typestr = "COMPLEX FLOAT";

/* alias T = Complex!double; */
/* auto typestr = "COMPLEX DOUBLE"; */

auto alpha = cast(T) complex(0.1, -0.2); // dummy values to fill arrays
auto beta = cast(T) complex(-0.7, 0.6);

auto dotprod(T[] x, T[] y)
{
    auto sum = cast(T) 0;
    foreach (size_t i; 0 .. x.length)
        sum += x[i] * conj(y[i]);
    return sum;
}

void main()
{
    int nEle = 1000;
    int nIter = 2000;

    auto startTime = MonoTime.currTime;
    auto dur = cast(double) (MonoTime.currTime - startTime).total!"usecs";

    T[] x, y;
    x.length = nEle;
    y.length = nEle;
    T z;
    x[] = alpha;
    y[] = beta;

    startTime = MonoTime.currTime;
    foreach (i; 0 .. nIter)
    {
        foreach (j; 0 .. nIter)
        {
            z = dotprod(x, y);
        }
    }
    auto etime = cast(double) (MonoTime.currTime - startTime).total!"msecs" / 1.0e3;
    writef(" result: % 5.2f%+5.2fi comp time: %5.2f \n", z.re, z.im, etime);
}
```

For convenience, I include the bash script used to compile, run, generate the assembly code, and grep:

```bash
echo
echo "With AVX:"
ldc2 -O3 -release question.d --ffast-math -mcpu=haswell
./question
ldc2 -output-s -O3 -release question.d --ffast-math -mcpu=haswell
mv question.s question_with_avx.s

echo
echo "Without AVX"
ldc2 -O3 -release question.d
./question
ldc2 -output-s -O3 -release question.d
mv question.s question_without_avx.s

echo
echo "fused multiply adds are found in avx code (as desired)"
grep vfmadd *.s /dev/null
```

Here is the output when run on my machine:

```console
With AVX:
 result: -190.00+80.00i comp time:  6.45

Without AVX
 result: -190.00+80.00i comp time:  5.74

fused multiply adds are found in avx code (as desired)
question_with_avx.s:vfmadd231ss %xmm2, %xmm5, %xmm3
question_with_avx.s:vfmadd231ss %xmm0, %xmm2, %xmm3
question_with_avx.s:vfmadd231ss %xmm2, %xmm4, %xmm1
question_with_avx.s:vfmadd231ss %xmm3, %xmm5, %xmm1
question_with_avx.s:vfmadd231ss %xmm3, %xmm1, %xmm0
```

Repeating the experiment with the data type changed to Complex!double shows the AVX code to be twice as fast (perhaps more in line with expectations).

**I admit my confusion as to why Complex!float is misbehaving.**

Does anyone have insight into what is happening?

Thanks, James
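Worth noting when reading such grep output: the `ss` suffix means *scalar single precision*. A `vfmadd231ss` operates on one float at a time, so finding FMAs does not by itself prove the loop was vectorized; the packed (SIMD) forms end in `ps`/`pd`. A small self-contained sketch of the check, using a fabricated two-line listing (on real output one would grep `question_with_avx.s` instead):

```bash
# Scalar FMA mnemonics end in "ss"/"sd"; packed (vectorized) ones in "ps"/"pd".
# This sample listing stands in for a real ldc2-generated .s file.
printf 'vfmadd231ss %%xmm2, %%xmm5, %%xmm3\nvfmadd231ps %%ymm0, %%ymm1, %%ymm2\n' > sample.s
grep -c 'vfmadd...ps' sample.s   # packed FMAs -> real SIMD (prints 1 here)
grep -c 'vfmadd...ss' sample.s   # scalar FMAs -> FMA, but no vectorization (prints 1 here)
```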
Re: Rather Bizarre slow downs using Complex!float with avx (ldc).
On Thursday, 30 September 2021 at 16:40:03 UTC, james.p.leblanc wrote:
> D-Ers, I have been getting counterintuitive results in avx/no-avx timing experiments.

This could be a template instantiation culling problem. If the compiler is able to determine that `Complex!float` is already instantiated (codegen) inside Phobos, then it may decide not to codegen it again when you are compiling your code with AVX+fastmath enabled. This could explain why you don't see improvement for `Complex!float`, but do see improvement with `Complex!double`. It does not explain the worse performance with AVX+fastmath versus without it.

Generally, for performance issues like this you need to study the assembly output (`--output-s`) or LLVM IR (`--output-ll`). The first thing I would look out for is function inlining yes/no.

cheers, Johan
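One way to probe the culling hypothesis is to give the kernel a local, non-Phobos definition and mark it with LDC's `@fastmath` attribute, so its code generation must happen in the user's module with that module's target flags. A sketch, assuming LDC (`dotprodLocal` is an illustrative name, not from the thread):

```d
import ldc.attributes : fastmath;
import std.complex : Complex, conj;

// A local copy of the kernel: because this template is defined and
// instantiated here, its code cannot be pulled from a Phobos build
// that was compiled without AVX/fastmath.
@fastmath
T dotprodLocal(T)(T[] x, T[] y)
{
    T sum = T(0, 0);
    foreach (i; 0 .. x.length)
        sum += x[i] * conj(y[i]);
    return sum;
}
```

If this local version speeds up under `-mcpu=haswell` while the Phobos-backed one does not, that would support the instantiation-culling explanation.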
Rather Bizarre slow downs using Complex!float with avx (ldc).
D-Ers,

I have been getting counterintuitive results in avx/no-avx timing experiments. Storyline to date (notes at end):

**Experiment #1)** Real float data type (i.e. non-complex numbers), speed comparison:

a) moving from non-avx --> avx shows an unrealistic speed-up of 15-25X.
b) this is weird, but the story continues ...

**Experiment #2)** Real double data type (non-complex numbers):

a) moving from non-avx --> avx again shows amazing gains, but the gains are about half of those seen in Experiment #1, so maybe this looks plausible?

**Experiment #3)** Complex!float data type:

a) now **going from non-avx to avx shows a serious performance LOSS** of 40%, or breaking even at best. What is happening here?

**Experiment #4)** Complex!double:

a) non-avx --> avx shows performance gains of again about 2X (so the gains appear to be reasonable).

The main question I have is: **"What is going on with the Complex!float performance?"**

One might expect floats to have better performance than doubles, as we saw with the real-valued data (because of vector packing, memory bandwidth, etc.). But **Complex!float shows MUCH WORSE avx performance than Complex!double (by a factor of almost 4).**

```d
// Table of Computation Times
//
//        self math                std math
//   explicit  no-explicit    explicit  no-explicit
//     align      align         align      align
//
//     0.12       0.21          0.15       0.21  ;  # Float with AVX
//     3.23       3.24          3.30       3.22  ;  # Float without AVX
//     0.31       0.42          0.31       0.42  ;  # Double with AVX
//     3.25       3.24          3.24       3.27  ;  # Double without AVX
//     6.42       6.62          6.61       6.59  ;  # Complex!float with AVX
//     4.04       4.17          6.68       5.82  ;  # Complex!float without AVX
//     1.67       1.69          1.73       1.71  ;  # Complex!double with AVX
//     3.34       3.42          3.28       3.31  ;  # Complex!double without AVX
```

Notes:

1) Based on forum hints from ldc experts, I got good guidance on enabling avx (i.e. compiling modules on the command line, using --ffast-math and -mcpu=haswell).

2) From Mir-glas experts I received hints to try implementing my own version of the complex math (this is what the "self math" column refers to).

I understand that the details of the computations are not included here (I can provide them if there is interest, and if I figure out an effective way to present them in a forum). But I thought I might begin with a simple question: **"Is there some well-known issue that I am missing here? Have others been down this road as well?"**

Thanks for any and all input.

Best Regards,
James

PS: Sorry for the inelegant table ... I do not believe there is a way to include beautiful bar charts on this forum. Please correct me if there is a way.
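Regarding the "self math" mentioned in note 2: one common variant of hand-rolled complex arithmetic (a sketch, not benchmarked here; `dotprodSplit` is an illustrative name) is to store real and imaginary parts in separate arrays, so the compiler can keep packed registers full of one component at a time instead of shuffling interleaved re/im pairs.

```d
// Computes sum over k of x[k] * conj(y[k]), with the complex vectors
// stored as split re/im arrays (structure of arrays).
void dotprodSplit(const float[] xr, const float[] xi,
                  const float[] yr, const float[] yi,
                  out float re, out float im)
{
    re = 0.0f;
    im = 0.0f;
    foreach (k; 0 .. xr.length)
    {
        // (xr + i*xi) * (yr - i*yi) = (xr*yr + xi*yi) + i*(xi*yr - xr*yi)
        re += xr[k] * yr[k] + xi[k] * yi[k];
        im += xi[k] * yr[k] - xr[k] * yi[k];
    }
}
```

Each of the four products inside the loop is a plain float multiply-accumulate over contiguous memory, which is the access pattern auto-vectorizers handle best.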