Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-03-07 Thread tsbockman via Digitalmars-d-learn

On Sunday, 7 March 2021 at 22:54:32 UTC, tsbockman wrote:

...
result = diffSq[0];
static foreach(i; 0 .. 3)
result += diffSq[i];
...


Oops, that's supposed to say `i; 1 .. 3`. Fixed:

import std.meta : Repeat;
void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a, ref Repeat!(3, const(V)) b, ref V result)
if(is(V : __vector(float[length]), size_t length))
{
    Repeat!(3, V) diffSq = a;
    static foreach(i; 0 .. 3) {
        diffSq[i] -= b[i];
        diffSq[i] *= diffSq[i];
    }

    result = diffSq[0];
    static foreach(i; 1 .. 3)
        result += diffSq[i];

    version(LDC) { version(X86_64) {
        enum isSupportedPlatform = true;
        import ldc.llvmasm : __asm;
        result = __asm!V(`vsqrtps $1, $0`, `=x, x`, result);
    } }
    static assert(isSupportedPlatform);
}
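
For reference, a hedged usage sketch (variable names assumed): the
Repeat!(3, ...) parameter packs flatten into three separate arguments
per point, which is why the demangled signature below shows seven
parameters.

alias V = __vector(float[16]);
V ax, ay, az, bx, by, bz, result;
euclideanDistanceFixedSizeArray!V(ax, ay, az, bx, by, bz, result);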

Fixed asm:

pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref __vector(float[16])):

mov rax, qword ptr [rsp + 8]
vmovaps zmm0, zmmword ptr [rax]
vmovaps zmm1, zmmword ptr [r9]
vmovaps zmm2, zmmword ptr [r8]
vsubps  zmm0, zmm0, zmmword ptr [rcx]
vsubps  zmm1, zmm1, zmmword ptr [rdx]
vmulps  zmm1, zmm1, zmm1
vsubps  zmm2, zmm2, zmmword ptr [rsi]
vfmadd231ps zmm1, zmm0, zmm0
vfmadd231ps zmm1, zmm2, zmm2
vmovaps zmmword ptr [rdi], zmm1
vsqrtps zmm0, zmm1
vmovaps zmmword ptr [rdi], zmm0
vzeroupper
ret

(I really wish I could just edit my posts here...)


Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-03-07 Thread tsbockman via Digitalmars-d-learn

On Sunday, 7 March 2021 at 22:54:32 UTC, tsbockman wrote:

import std.meta : Repeat;
void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a, ref Repeat!(3, const(V)) b, out V result)
if(is(V : __vector(float[length]), size_t length))
...

Resulting asm with is(V == __vector(float[16])):

.LCPI1_0:
.long   0x7fc00000
pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), out __vector(float[16])):

mov rax, qword ptr [rsp + 8]
vbroadcastss zmm0, dword ptr [rip + .LCPI1_0]
...


Apparently the optimizer is too stupid to skip the redundant 
float.nan broadcast when result is an `out` parameter, so just 
make it `ref V result` instead for better code gen.
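
Concretely, the only change is the storage class of the result
parameter; a minimal sketch (simplified one-input signatures, for
illustration only):

// `out` must default-initialize the result on entry, which costs a
// broadcast of float.nan plus a store; `ref` carries no such
// obligation, so the redundant init store disappears.
void distOut(V)(ref const(V) a, out V result); // emits vbroadcastss + vmovaps first
void distRef(V)(ref const(V) a, ref V result); // goes straight to the real work

With `ref`, the generated code becomes: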


pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref __vector(float[16])):

mov rax, qword ptr [rsp + 8]
vmovaps zmm0, zmmword ptr [rax]
vmovaps zmm1, zmmword ptr [r9]
vmovaps zmm2, zmmword ptr [r8]
vsubps  zmm0, zmm0, zmmword ptr [rcx]
vmulps  zmm0, zmm0, zmm0
vsubps  zmm1, zmm1, zmmword ptr [rdx]
vsubps  zmm2, zmm2, zmmword ptr [rsi]
vaddps  zmm0, zmm0, zmm0
vfmadd231ps zmm0, zmm1, zmm1
vfmadd231ps zmm0, zmm2, zmm2
vmovaps zmmword ptr [rdi], zmm0
vsqrtps zmm0, zmm0
vmovaps zmmword ptr [rdi], zmm0
vzeroupper
ret



Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-03-07 Thread tsbockman via Digitalmars-d-learn

On Sunday, 7 March 2021 at 18:00:57 UTC, z wrote:

On Friday, 26 February 2021 at 03:57:12 UTC, tsbockman wrote:

  static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
  distance += a[i].abs;//abs required by the caller


(a * a) above is always positive for real numbers. You don't 
need the call to abs unless you're trying to guarantee that 
even nan values will have a clear sign bit.



I do not know why but the caller's performance nosedives


My way is definitely (slightly) better; something is going wrong 
in either the caller or the optimizer. Show me the code for the 
caller and maybe I can figure it out.


whenever there is no .abs at this particular line. (There's a 3x 
difference, no joke.)


Perhaps the compiler is performing a value range propagation 
(VRP) based optimization in the caller, but isn't smart enough to 
figure out that the value is already always positive without the 
`abs` call? I've run into that specific problem before.


Alternatively, sometimes trivial changes to the code that 
*shouldn't* matter make the difference between hitting a smart 
path in the optimizer, and a dumb one. Automatic SIMD 
optimization is quite sensitive and temperamental.


Either way, the problem can be fixed by figuring out what 
optimization the compiler is doing when it knows that distance is 
positive, and just doing it manually.
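
For example (purely hypothetical, since the caller's code isn't
shown): if the caller compares the distance against a threshold, the
optimization in question might be dropping the square root, which is
only legal when both operands are known to be non-negative:

// sqrt is monotonic on [0, inf), so when distanceSq (and the radius)
// are provably non-negative, the comparison can skip vsqrtss entirely.
bool withinRadius(float distanceSq, float radius)
{
    return distanceSq < radius * radius; // instead of: sqrt(distanceSq) < radius
}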


Same for assignment instead of addition, but with a 2x 
difference instead.


Did you fix the nan bug I pointed out earlier? More generally, 
are you actually verifying the correctness of the results in any 
way for each alternative implementation? Because you can get big 
speedups sometimes from buggy code when the compiler realizes 
that some later logic error makes earlier code irrelevant, but 
that doesn't mean the buggy code is better...


Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-03-07 Thread tsbockman via Digitalmars-d-learn

On Sunday, 7 March 2021 at 13:26:37 UTC, z wrote:

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
However, AVX512 support seems limited to being able to use the 
16 other YMM registers, rather than using the same code 
template changed to use ZMM registers and doubled offsets to 
take advantage of the new size.
Compiled with «-g -enable-unsafe-fp-math 
-enable-no-infs-fp-math -ffast-math -O -release -mcpu=skylake» :


You're not compiling with AVX512 enabled. You would need to use 
-mcpu=skylake-avx512.
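
For example (assuming an otherwise-typical LDC invocation):

ldc2 -O -release -mcpu=skylake-avx512 app.d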


However, LLVM's code generation for AVX512 seems to be pretty 
terrible still, so you'll need to either use some inline ASM, or 
stick with AVX2. Here's a structure of arrays style example:


import std.meta : Repeat;
void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a, ref Repeat!(3, const(V)) b, out V result)
if(is(V : __vector(float[length]), size_t length))
{
    Repeat!(3, V) diffSq = a;
    static foreach(i; 0 .. 3) {
        diffSq[i] -= b[i];
        diffSq[i] *= diffSq[i];
    }

    result = diffSq[0];
    static foreach(i; 0 .. 3)
        result += diffSq[i];

    version(LDC) { version(X86_64) {
        enum isSupportedPlatform = true;
        import ldc.llvmasm : __asm;
        result = __asm!V(`vsqrtps $1, $0`, `=x, x`, result);
    } }
    static assert(isSupportedPlatform);
}

Resulting asm with is(V == __vector(float[16])):

.LCPI1_0:
.long   0x7fc00000
pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), out __vector(float[16])):

mov rax, qword ptr [rsp + 8]
vbroadcastss zmm0, dword ptr [rip + .LCPI1_0]
vmovaps zmmword ptr [rdi], zmm0
vmovaps zmm0, zmmword ptr [rax]
vmovaps zmm1, zmmword ptr [r9]
vmovaps zmm2, zmmword ptr [r8]
vsubps  zmm0, zmm0, zmmword ptr [rcx]
vmulps  zmm0, zmm0, zmm0
vsubps  zmm1, zmm1, zmmword ptr [rdx]
vsubps  zmm2, zmm2, zmmword ptr [rsi]
vaddps  zmm0, zmm0, zmm0
vfmadd231ps zmm0, zmm1, zmm1
vfmadd231ps zmm0, zmm2, zmm2
vmovaps zmmword ptr [rdi], zmm0
vsqrtps zmm0, zmm0
vmovaps zmmword ptr [rdi], zmm0
vzeroupper
ret



Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-03-07 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 7 March 2021 at 14:15:58 UTC, z wrote:
On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
wrote:

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)


https://code.dlang.org/packages/intel-intrinsics


I'd try to use it, but the platform I'm building on requires AVX 
to get the most performance.


The code below might be worth a try on your AVX512 machine.

Unless you're looking for a combined result, you might need to 
separate out the memory access overhead by running multiple 
passes over a "known optimal for L2" data set.


Also note that I compiled with -preview=in.  I don't know if that 
matters.




import std.math : sqrt;

enum SIMDBits = 512; // 256 was tested, 512 was not
alias A = float[SIMDBits / (float.sizeof * 8)];

pragma(inline, true)
void soaEuclidean(ref A a0, in A a1, in A a2, in A a3, in A b1, in A b2, in A b3)
{
    alias V = __vector(A);

    static V vsqrt(V v)
    {
        A a = cast(A) v;
        static foreach (i; 0 .. A.length)
            a[i] = sqrt(a[i]);
        return cast(V) a;
    }

    static V sd(in A a, in A b)
    {
        V v = cast(V) b - cast(V) a;
        return v * v;
    }

    auto v = sd(a1, b1) + sd(a2, b2) + sd(a3, b3);
    a0[] = vsqrt(v)[];
}
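
A hedged timing sketch of the multi-pass idea above (all names and
sizes are assumptions, not from the post): size the working set to
stay resident in L2 so that, after the first pass, the measurement
reflects arithmetic rather than memory traffic.

import std.datetime.stopwatch : StopWatch;

void benchmark()
{
    enum chunks = 512; // 512 * 7 * 64 bytes ~= 224 KiB, resident in a 256 KiB+ L2
    // Initialize with real data in practice; nan inputs can distort timings.
    static A[chunks] res, x1, y1, z1, x2, y2, z2;

    StopWatch sw;
    sw.start();
    foreach (pass; 0 .. 100) // repeated passes over the same resident data
        foreach (i; 0 .. chunks)
            soaEuclidean(res[i], x1[i], y1[i], z1[i], x2[i], y2[i], z2[i]);
    sw.stop();
}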




Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-03-07 Thread z via Digitalmars-d-learn

On Friday, 26 February 2021 at 03:57:12 UTC, tsbockman wrote:

  static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
  distance += a[i].abs;//abs required by the caller


(a * a) above is always positive for real numbers. You don't 
need the call to abs unless you're trying to guarantee that 
even nan values will have a clear sign bit.


I do not know why, but the caller's performance nosedives whenever 
there is no .abs at this particular line. (There's a 3x 
difference, no joke.)
Same for assignment instead of addition, but with a 2x difference 
instead.


Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-03-07 Thread z via Digitalmars-d-learn
On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
wrote:

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)


https://code.dlang.org/packages/intel-intrinsics


I'd try to use it, but the platform I'm building on requires AVX 
to get the most performance.





Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-03-07 Thread z via Digitalmars-d-learn

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:

...
It seems that using static foreach with pointer parameters 
exclusively is the best way to "guide" LDC into optimizing 
code. (Using arr1[] += arr2[] syntax resulted in worse performance 
for me.)
However, AVX512 support seems limited to being able to use the 16 
other YMM registers, rather than using the same code template 
changed to use ZMM registers and doubled offsets to take 
advantage of the new size.
Compiled with «-g -enable-unsafe-fp-math -enable-no-infs-fp-math 
-ffast-math -O -release -mcpu=skylake» :

import std.math : sqrt;

alias simdf = float[8];
enum size_t simdlength = simdf.length; // assumed: not shown in the post, but clearly float[8].length
__gshared simdf init = [0f,0f,0f,0f,0f,0f,0f,0f];

extern(C) // with extern(D) (the default), the assembly output uses one register for two pointers.
void vEUCLIDpsptr_void(simdf* a0, simdf* a1, simdf* a2, simdf* a3, simdf* b1, simdf* b2, simdf* b3) {
    simdf amm0 = init; // returned simdf
    simdf amm1 = *a1;
    simdf amm2 = *a2;
    simdf amm3 = *a3;
    static foreach(size_t i; 0..simdlength) {
        // Needs to be interleaved like this, otherwise LDC generates worse code.
        amm1[i] -= (*b1)[i];
        amm1[i] *= amm1[i];
        amm2[i] -= (*b2)[i];
        amm2[i] *= amm2[i];
        amm3[i] -= (*b3)[i];
        amm3[i] *= amm3[i];
        amm0[i] += amm1[i];
        amm0[i] += amm2[i];
        amm0[i] += amm3[i];
        amm0[i] = sqrt(amm0[i]);
    }
    *a0 = amm0;
    return;
}



mov r10,qword ptr ss:[rsp+38]
mov r11,qword ptr ss:[rsp+30]
mov rax,qword ptr ss:[rsp+28]
vmovups ymm0,yword ptr ds:[rdx]
vmovups ymm1,yword ptr ds:[r8]
vsubps ymm0,ymm0,yword ptr ds:[rax]
vmovups ymm2,yword ptr ds:[r9]
vfmadd213ps ymm0,ymm0,yword ptr ds:[<_D12euclideandst4initG8f>]
vsubps ymm1,ymm1,yword ptr ds:[r11]
vfmadd213ps ymm1,ymm1,ymm0
vsubps ymm0,ymm2,yword ptr ds:[r10]
vfmadd213ps ymm0,ymm0,ymm1
vsqrtps ymm0,ymm0
vmovups yword ptr ds:[rcx],ymm0
vzeroupper
ret


The speed difference is nearly 400% for the same number of 
distances compared with the single-distance function example.
However, the generated assembly isn't the fastest: for example, 
removing vzeroupper and using the unused, known-zeroed YMM15 
register as the zero-filled operand for the first vfmadd213ps 
instruction improves performance by 10% (70 vs 78 ms for 256 
million distances).
The function can then be improved further to use pointer offsets 
and more registers; this is more efficient and results in a ~500% 
improvement:

extern(C)
void vEUCLIDpsptr_void_40(simdf* a0, simdf* a1, simdf* a2, simdf* a3, simdf* b1, simdf* b2, simdf* b3) {

simdf amm0 = init;
simdf amm1 = *a1;
simdf amm2 = *a2;
simdf amm3 = *a3;
simdf emm0 = init;
simdf emm1 = amm1;
simdf emm2 = amm2;//mirror AMM for positions
simdf emm3 = amm3;
simdf imm0 = init;
simdf imm1 = emm1;
simdf imm2 = emm2;
simdf imm3 = emm3;
simdf omm0 = init;
simdf omm1 = emm1;
simdf omm2 = emm2;
simdf omm3 = emm3;
simdf umm0 = init;
simdf umm1 = omm1;
simdf umm2 = omm2;
simdf umm3 = omm3;
	// Cascading assignment may not be the fastest way, especially compared to just loading from the pointer!


static foreach(size_t i; 0..simdlength) {
amm1[i] -= (b1[0])[i];
amm1[i] *= amm1[i];
amm0[i] += amm1[i];

amm2[i] -= (b2[0])[i];
amm2[i] *= amm2[i];
amm0[i] += amm2[i];

amm3[i] -= (b3[0])[i];
amm3[i] *= amm3[i];
amm0[i] += amm3[i];

amm0[i] = sqrt(amm0[i]);
//template
emm1[i] -= (b1[1])[i];
emm1[i] *= emm1[i];
emm0[i] += emm1[i];

emm2[i] -= (b2[1])[i];
emm2[i] *= emm2[i];
emm0[i] += emm2[i];

emm3[i] -= (b3[1])[i];
emm3[i] *= emm3[i];
emm0[i] += emm3[i];

emm0[i] = sqrt(emm0[i]);
//
imm1[i] -= (b1[2])[i];
imm1[i] *= imm1[i];
imm0[i] += imm1[i];

imm2[i] -= (b2[2])[i];
imm2[i] *= imm2[i];
imm0[i] += imm2[i];

imm3[i] -= (b3[2])[i];
imm3[i] *= imm3[i];
imm0[i] += imm3[i];

imm0[i] = sqrt(imm0[i]);
//
omm1[i] -= (b1[3])[i];
omm1[i] *= omm1[i];
omm0[i] += omm1[i];

omm2[i] -= (b2[3])[i];
omm2[i] *= omm2[i];
omm0[i] += omm2[i];

omm3[i] -= (b3[3])[i];
omm3[i] *= omm3[i];
omm0[i] += omm3[i];

omm0[i] = sqrt(omm0[i]);
//
 

Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-02-26 Thread Guillaume Piolat via Digitalmars-d-learn
On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
wrote:

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)


https://code.dlang.org/packages/intel-intrinsics


A bit of elaboration on why you might want to prefer 
intel-intrinsics:

- it supports all D compilers, including DMD with a 32-bit target
- it targets arm32 and arm64 with the same code (LDC only)
- core.simd just gives you the basic operators, but not, say, 
pmaddwd or any of the complex instructions; some instructions 
need very specific work to get them.
- at least with LLVM, the optimizations work reliably across 
subsequent versions of the compiler.
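
For a flavor of the API, a minimal sketch (it mirrors Intel's C
intrinsics; this uses the SSE module inteli.xmmintrin and assumes
the caller keeps the 4th lane zeroed):

import inteli.xmmintrin; // SSE: __m128, _mm_sub_ps, _mm_mul_ps, _mm_sqrt_ss, ...

float euclidean3(__m128 a, __m128 b)
{
    __m128 d  = _mm_sub_ps(a, b); // packed subtract
    __m128 sq = _mm_mul_ps(d, d); // packed multiply
    float sum = sq.array[0] + sq.array[1] + sq.array[2]; // horizontal add of 3 lanes
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(sum)));  // scalar sqrt of lane 0
}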




Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-02-25 Thread Bruce Carneal via Digitalmars-d-learn

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)
To give some context, this is a sample of one of the functions 
that could benefit from better SIMD usage :

float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
  float distance;
  a[] -= b[];
  a[] *= a[];
  static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
  distance += a[i].abs;//abs required by the caller
  }
  return sqrt(distance);
}
vmovsd xmm0,qword ptr ds:[rdx]
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real@7fff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret


I've tried to experiment with dynamic arrays of float[3] but 
the output assembly seemed to be worse.[1] (In short, it's 
calling internal D functions which use "vxxxss" instructions 
while performing many moves.)


Big thanks
[1] https://run.dlang.io/is/F3Xye3


If you are developing for deployment to a platform that has a 
GPU, you might consider going SIMT (dcompute) rather than SIMD.  
SIMT is a lot easier on the eyes.  More importantly, if you're 
targeting an SoC or console, or have relatively chunky 
computations that allow you to work around the PCIe transit 
costs, the path is open to very large performance improvements.  
I've only been using dcompute for a week or so but so far it's 
been great.


If your algorithms are very branchy, or you decide to stick with 
multi-core/SIMD for any of a number of other good reasons, here 
are a few things I learned before decamping to dcompute land that 
may help:


  1)  LDC is pretty good at auto vectorization as you have 
probably observed.  Definitely worth a few iterations to try and 
get the vectorizer engaged.


  2)  LDC auto vectorization was good but explicit __vector 
programming is more predictable and was, at least for my tasks, 
much faster. I couldn't persuade the auto vectorizer to "do the 
right thing" throughout the hot path but perhaps you'll have 
better luck.


  3)  LDC does a good job of going between T[N] <==> 
__vector(T[N]) so using the static array types as your 
input/output types and the __vector types as your compute types 
works out well whenever you have to interface with an unaligned 
world. LDC issues unaligned vector loads/stores for casts or full 
array assigns: v = cast(VT)sa[];  or v[] = sa[];  These are quite 
good on modern CPUs.  To calibrate note that Ethan recently 
talked about a 10% gain he experienced using full alignment, 
IIRC, so there's that.
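
A minimal sketch of that boundary pattern (assuming LDC):

alias VT = __vector(float[8]);

void scale8(ref float[8] sa, float s)
{
    VT v = cast(VT) sa[]; // unaligned vector load from the plain static array
    VT vs = s;            // scalar broadcast
    v *= vs;              // compute in SIMD registers
    sa[] = v[];           // unaligned vector store back
}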


  4) LDC also does a good job of discovering SIMD equivalents 
given static foreach unrolled loops with explicit compile-time 
indexing of vector element operands.  You can use those along 
with pragma(inline, true) to develop your own "intrinsics" that 
supplement other libs.
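
For instance, a sketch of such a homemade "intrinsic" (assuming LDC
pattern-matches the unrolled loop to a single packed instruction,
vsqrtps here, when the target supports it):

import std.math : sqrt;

alias V8 = __vector(float[8]);

pragma(inline, true)
V8 vsqrt8(V8 v)
{
    float[8] a = cast(float[8]) v;
    static foreach (i; 0 .. 8) // unrolled, compile-time indexed
        a[i] = sqrt(a[i]);
    return cast(V8) a;
}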


  5) If you adopt the __vector approach you'll have to handle the 
partials manually. (array length % vec length != 0 indicates a 
partial or tail fragment).  If the classic (copying/padding) 
approaches to such fragmentation don't work for you I'd suggest 
using nested static functions that take ref T[N] inputs and 
outputs.  The main loops become very simple and the tail handling 
reduces to loading stack allocated T[N] variables explicitly, 
calling the static function, and unloading.
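
A sketch of the suggested shape (names assumed; the kernel works on
full ref T[N] chunks and the tail is staged through a padded stack
copy):

void scaleAll(float[] data, float s)
{
    enum N = 8;
    alias VT = __vector(float[N]);

    static void kernel(ref float[N] io, float s)
    {
        VT v = cast(VT) io[];
        VT vs = s;
        v *= vs;
        io[] = v[];
    }

    size_t i = 0;
    for (; i + N <= data.length; i += N) // simple main loop over full vectors
        kernel(*cast(float[N]*) &data[i], s);

    if (i < data.length) // tail fragment: data.length % N != 0
    {
        float[N] tmp = 0; // stack-allocated, zero-padded
        tmp[0 .. data.length - i] = data[i .. $];
        kernel(tmp, s);
        data[i .. $] = tmp[0 .. data.length - i];
    }
}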


Good luck.




Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-02-25 Thread tsbockman via Digitalmars-d-learn

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:

float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {


Use __vector(float[4]), not float[3].


  float distance;


The default value for float is float.nan. You need to explicitly 
initialize it to 0.0f or something if you want this function to 
actually do anything useful.
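
That is:

float distance;    // == float.nan; nan + x is nan, so every later += is poisoned
distance = 0.0f;   // explicit initialization makes the accumulation meaningful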



  a[] -= b[];
  a[] *= a[];


With __vector types, this can be simplified (not optimized) to 
just:

a -= b;
a *= a;


  static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
  distance += a[i].abs;//abs required by the caller


(a * a) above is always positive for real numbers. You don't need 
the call to abs unless you're trying to guarantee that even nan 
values will have a clear sign bit.


Also, there is no point to adding the first component to zero, 
and copying element [0] from a SIMD register into a scalar is 
free, so this can become:


float distance = a[0];
static foreach(size_t i; 1 .. 3)
distance += a[i];


  }
  return sqrt(distance);
}


Final assembly output (ldc 1.24.0 with -release -O3 
-preview=intpromote -preview=dip1000 -m64 -mcpu=haswell 
-fp-contract=fast -enable-cross-module-inlining):


vsubps  xmm0, xmm1, xmm0
vmulps  xmm0, xmm0, xmm0
vmovshdup   xmm1, xmm0
vaddss  xmm1, xmm0, xmm1
vpermilpd   xmm0, xmm0, 1
vaddss  xmm0, xmm0, xmm1
vsqrtss xmm0, xmm0, xmm0
ret


Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-02-25 Thread tsbockman via Digitalmars-d-learn

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)
To give some context, this is a sample of one of the functions 
that could benefit from better SIMD usage :

float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {


You need to use __vector(float[4]) instead of float[3] to tell 
the compiler to pack multiple elements per SIMD register. Right 
now your data lacks proper alignment for SIMD load/stores.


Beyond that, SIMD code is rather difficult to optimize. Code 
written in ignorance or in a rush is unlikely to be meaningfully 
faster than ordinary scalar code, unless the data flow is very 
simple. You will probably get a bigger speedup for less effort 
and pain by first minimizing heap allocations, maximizing 
locality of reference, minimizing indirections, and minimizing 
memory use. (And, of course, it should go without saying that 
choosing an asymptotically efficient high-level algorithm is more 
important than any micro-optimization for large data sets.) 
Nevertheless, if you are up to the challenge, SIMD can sometimes 
provide a final 2-3x speed boost.


Your algorithms will need to be designed to minimize mixing of 
data between SIMD channels, as this forces the generation of lots 
of extra instructions to swizzle the data, or worse to unpack and 
repack it. Something like a Cartesian dot product or cross 
product will benefit much less from SIMD than vector addition, 
for example. Sometimes the amount of swizzling can be greatly 
reduced with a little algebra, other times you might need to 
refactor an array of structures into a structure of arrays.
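
Schematically (hypothetical types):

// Array of structures: x, y, z interleaved in memory, so filling a
// SIMD register with only x components requires gather/shuffle work.
struct PointAoS { float x, y, z; }

// Structure of arrays: each component contiguous, so a whole register
// of x components loads with a single vmovups and lanes never mix.
struct PointsSoA { float[] x, y, z; }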


Per-element conditional branches are very bad, and often 
completely defeat the benefits of SIMD. For very short segments 
of code (like conditional assignment), replace them with a SIMD 
conditional move (vcmp and vblend). Bit-twiddling is your friend.
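
A branch-free sketch of that idea (assuming LDC's D vector
operations; this clamps negative lanes to zero without any
per-element branch):

import core.simd : float4, int4;

float4 clampNegativesToZero(float4 v)
{
    int4 bits = cast(int4) v;
    int4 negMask = bits >> 31;            // arithmetic shift: all-ones where the sign bit is set
    return cast(float4)(bits & ~negMask); // zero the negative lanes, keep the rest
}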


Finally, do not trust the compiler or the optimizer. People love 
to make the claim that "The Compiler" is always better than 
humans at micro-optimizations, but this is not at all the case 
for SIMD code with current systems. I have found even LLVM to 
produce quite bad SIMD code for complex algorithms, unless I 
carefully structure my code to make it as easy as possible for 
the optimizer to get to the final assembly I want. A sprinkling 
of manual assembly code (directly, or via a library) is also 
necessary to fill in certain instructions that the compiler 
doesn't know when to use at all.


Resources I have found very helpful:

Matt Godbolt's Compiler Explorer online visual disassembler 
(supports D):

https://godbolt.org/

Felix Cloutier's x86 and amd64 instruction reference:
https://www.felixcloutier.com/x86/

Agner Fog's optimization guide (especially the instruction 
tables):

https://agner.org/optimize/


Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-02-25 Thread Guillaume Piolat via Digitalmars-d-learn

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)


https://code.dlang.org/packages/intel-intrinsics




Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-02-25 Thread z via Digitalmars-d-learn
How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)
To give some context, this is a sample of one of the functions 
that could benefit from better SIMD usage :

float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
  float distance;
  a[] -= b[];
  a[] *= a[];
  static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
  distance += a[i].abs;//abs required by the caller
  }
  return sqrt(distance);
}
vmovsd xmm0,qword ptr ds:[rdx]
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real@7fff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret


I've tried to experiment with dynamic arrays of float[3] but the 
output assembly seemed to be worse.[1] (In short, it's calling 
internal D functions which use "vxxxss" instructions while 
performing many moves.)


Big thanks
[1] https://run.dlang.io/is/F3Xye3