Re: Godbolt.org: mir-algorithm was added

2017-09-21 Thread rikki cattermole via Digitalmars-d-announce

On 22/09/2017 5:36 AM, Arun Chandrasekaran wrote:

On Friday, 22 September 2017 at 03:51:36 UTC, Ilya Yaroshenko wrote:
Mir Algorithm and Mir GLAS (glas is experimental) was added to 
https://d.godbolt.org

by Johan Engelen. Thanks you, Johan!

[...]


Honestly, how do you guys understand these assembly instructions that's 
further optimized by the complier? Am I alone here?


By reading the damn manual!

(It can take a while to understand x86, it is a right royal mess with 
layers on top of layers in it).


If you do intend on reading the manual, I recommend AMD's. They created 
x86_64, not Intel. They you can actually learn from.


http://developer.amd.com/resources/developer-guides-manuals/

You want: http://support.amd.com/TechDocs/24594.pdf


Re: Godbolt.org: mir-algorithm was added

2017-09-21 Thread Arun Chandrasekaran via Digitalmars-d-announce
On Friday, 22 September 2017 at 03:51:36 UTC, Ilya Yaroshenko 
wrote:
Mir Algorithm and Mir GLAS (glas is experimental) was added to 
https://d.godbolt.org

by Johan Engelen. Thanks you, Johan!

[...]


Honestly, how do you guys understand these assembly instructions 
that's further optimized by the complier? Am I alone here?


Re: Godbolt.org: mir-algorithm was added

2017-09-21 Thread Ilya Yaroshenko via Digitalmars-d-announce
On Friday, 22 September 2017 at 03:51:36 UTC, Ilya Yaroshenko 
wrote:
Mir Algorithm and Mir GLAS (glas is experimental) was added to 
https://d.godbolt.org


 Proper link is https://d.godbolt.org/beta/



Godbolt.org: mir-algorithm was added

2017-09-21 Thread Ilya Yaroshenko via Digitalmars-d-announce
Mir Algorithm and Mir GLAS (glas is experimental) was added to 
https://d.godbolt.org

by Johan Engelen. Thanks you, Johan!

Try it:
1. Select mir-algorithm 0.6.13 from the libraries list (after 
Intel button)

2. Select LDC 1.4.0
3. Add compiler flags: -O -release -mcpu=cannonlake 
-linkonce-templates -betterC

4. Add o code
5. Enjoy AVX512 instructions with fused math :)
---
// Euclidian norm
import mir.ndslice;
import mir.math.common;

@fastmath double norm2(ContiguousVector!double x) {
return 0.0.reduce!"a + b * b"(x).sqrt;
}


Output:

double example.norm2(mir.ndslice.slice.Slice!(2, [1], 
double*).Slice):

  mov rax, qword ptr [rsp + 8]
  test rax, rax
  je .LBB0_1
  lea rcx, [rsp + 8]
  mov rcx, qword ptr [rcx + 8]
  vxorpd xmm0, xmm0, xmm0
  cmp rax, 32
  jb .LBB0_12
  mov r8, rax
  and r8, -32
  mov rsi, rax
  and rsi, -32
  je .LBB0_12
  lea rdi, [rsi - 32]
  mov rdx, rdi
  shr rdx, 5
  bt edi, 5
  jb .LBB0_5
  vmovupd zmm0, zmmword ptr [rcx]
  vmovupd zmm1, zmmword ptr [rcx + 64]
  vmovupd zmm2, zmmword ptr [rcx + 128]
  vmovupd zmm3, zmmword ptr [rcx + 192]
  vmulpd zmm0, zmm0, zmm0
  vmulpd zmm1, zmm1, zmm1
  vmulpd zmm2, zmm2, zmm2
  vmulpd zmm3, zmm3, zmm3
  mov r9d, 32
  test rdx, rdx
  jne .LBB0_8
  jmp .LBB0_10
.LBB0_1:
  vxorps xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  ret
.LBB0_5:
  vxorpd zmm0, zmm0, zmm0
  xor r9d, r9d
  vxorpd zmm1, zmm1, zmm1
  vxorpd zmm2, zmm2, zmm2
  vxorpd zmm3, zmm3, zmm3
  test rdx, rdx
  je .LBB0_10
.LBB0_8:
  mov rdi, rsi
  sub rdi, r9
  lea rdx, [rcx + 8*r9 + 448]
.LBB0_9:
  vmovupd zmm4, zmmword ptr [rdx - 448]
  vmovupd zmm5, zmmword ptr [rdx - 384]
  vmovupd zmm6, zmmword ptr [rdx - 320]
  vmovupd zmm7, zmmword ptr [rdx - 256]
  vfmadd213pd zmm4, zmm4, zmm0
  vfmadd213pd zmm5, zmm5, zmm1
  vfmadd213pd zmm6, zmm6, zmm2
  vfmadd213pd zmm7, zmm7, zmm3
  vmovupd zmm0, zmmword ptr [rdx - 192]
  vmovupd zmm1, zmmword ptr [rdx - 128]
  vmovupd zmm2, zmmword ptr [rdx - 64]
  vmovupd zmm3, zmmword ptr [rdx]
  vfmadd213pd zmm0, zmm0, zmm4
  vfmadd213pd zmm1, zmm1, zmm5
  vfmadd213pd zmm2, zmm2, zmm6
  vfmadd213pd zmm3, zmm3, zmm7
  add rdx, 512
  add rdi, -64
  jne .LBB0_9
.LBB0_10:
  vaddpd zmm0, zmm0, zmm2
  vaddpd zmm1, zmm1, zmm3
  vaddpd zmm0, zmm0, zmm1
  vshuff64x2 zmm1, zmm0, zmm0, 14
  vaddpd zmm0, zmm0, zmm1
  vpermpd zmm1, zmm0, 238
  vaddpd zmm0, zmm0, zmm1
  vpermilpd zmm1, zmm0, 1
  vaddpd zmm0, zmm0, zmm1
  cmp rax, rsi
  je .LBB0_13
  sub rax, r8
  lea rcx, [rcx + 8*rsi]
.LBB0_12:
  vmovsd xmm1, qword ptr [rcx]
  vfmadd231sd xmm0, xmm1, xmm1
  add rcx, 8
  add rax, -1
  jne .LBB0_12
.LBB0_13:
  vsqrtsd xmm0, xmm0, xmm0
  ret

Bet regards,
Ilya