Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

2019-10-27 Thread J. Gareth Moreton

The following passes everything through XMM0:

#include <immintrin.h>
#include <math.h>

double Mod(__m128d z)
{
    return sqrt((z[0] * z[0]) + (z[1] * z[1]));
}

int main()
{
    __m128d z;
    z[0] = 0; z[1] = 1;
    double d = Mod(z);
}

I will admit that it's very fiddly to get right.  All of my attempts to 
map an anonymous struct to __m128d via a union (so you could call z.re 
and z.im rather than access the array elements) were unsuccessful.  C++ 
is not very friendly with vector types and you have to go out of your 
way to get the compiler to be efficient with them, but the System V ABI 
does support utilising the full vector registers.


It took me a while to work out how passing a record type with two 
single-precision elements into just XMM0 is correct, but this is because 
the record type as a whole has a size of eight bytes, and gets passed as 
a single argument of class SSE.  If the function parameters are instead 
two separate arguments, then they get passed individually through XMM0 
and XMM1.  It seems you have to interpret this document very literally 
to get it right: https://www.uclibc.org/docs/psABI-x86_64.pdf
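A minimal pair illustrating the two cases (the function and type names are made up for the example):

/* The 8-byte struct is a single SSE-class eightbyte, so both fields
   arrive packed in XMM0; passing the floats separately uses XMM0 and
   XMM1.  Names are illustrative only. */
typedef struct { float re, im; } pair;

float mod_struct(pair p)           /* p packed into XMM0     */
{
    return p.re * p.re + p.im * p.im;
}

float mod_args(float re, float im) /* re in XMM0, im in XMM1 */
{
    return re * re + im * im;
}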


Gareth aka. Kit

On 27/10/2019 08:13, Florian Klämpfl wrote:

On 23.10.19 at 22:36, J. Gareth Moreton wrote:
So I did a bit of reading after finding the "mpx-linux64-abi.pdf" 
document.  As I suspected, the System V ABI is like vectorcall when 
it comes to using the XMM registers... only the types __m128, 
__float128 and __Decimal128 use the "SSEUP" class and hence use the 
entire register.  The types are opaque, but both their size and 
alignment are 16 bytes, so I think anything that abides by those 
rules can be considered equivalent.


If the complex type is unaligned, the two fields get their own XMM 
register.  If aligned, they both go into %xmm0.  At least that is 
what I gathered from reading the document - it's a little unclear 
sometimes.


I briefly tested with Godbolt (https://godbolt.org/): records of two 
doubles are passed in two xmm registers regardless of the alignment, and 
two floats (so single) are passed in one xmm register.





___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

2019-10-27 Thread Florian Klämpfl

On 23.10.19 at 22:36, J. Gareth Moreton wrote:
So I did a bit of reading after finding the "mpx-linux64-abi.pdf" 
document.  As I suspected, the System V ABI is like vectorcall when it 
comes to using the XMM registers... only the types __m128, __float128 
and __Decimal128 use the "SSEUP" class and hence use the entire 
register.  The types are opaque, but both their size and alignment are 
16 bytes, so I think anything that abides by those rules can be 
considered equivalent.


If the complex type is unaligned, the two fields get their own XMM 
register.  If aligned, they both go into %xmm0.  At least that is what I 
gathered from reading the document - it's a little unclear sometimes.


I briefly tested with Godbolt (https://godbolt.org/): records of two 
doubles are passed in two xmm registers regardless of the alignment, and 
two floats (so single) are passed in one xmm register.
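A minimal pair to reproduce this on Godbolt (type and function names are made up):

/* Sketch for godbolt.org: under the System V x86-64 ABI a record of two
   doubles is two SSE eightbytes (XMM0 and XMM1), while a record of two
   singles is one 8-byte SSE eightbyte (packed into XMM0). */
typedef struct { double re, im; } cdouble;
typedef struct { float  re, im; } csingle;

double mod2_d(cdouble z) { return z.re * z.re + z.im * z.im; }
float  mod2_s(csingle z) { return z.re * z.re + z.im * z.im; }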

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

2019-10-23 Thread J. Gareth Moreton
In the meantime, if everything seems present and correct, 
https://bugs.freepascal.org/view.php?id=36202 contains the alignment and 
vectorcall modifiers for uComplex.  It shouldn't affect anything outside 
of x86_64 but should still keep the unit very lightweight, which I 
believe was the original intent.


Gareth aka. Kit



___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

2019-10-23 Thread J. Gareth Moreton
Hmmm, that is unfortunate if the horizontal operations are inefficient.  
I had a look at them at 
https://www.agner.org/optimize/instruction_tables.pdf - you are right in 
that HADDPS has a surprisingly high latency (approximately how many 
cycles it takes to execute), although HADDPD isn't as bad, probably 
because it's only dealing with 2 Doubles instead of 4 Singles, and it 
seems mostly equivalent in speed to the multiplication instructions.


Using just SSE2:

mulpd %xmm0,%xmm0
shufpd $1, %xmm0, %xmm1
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0

Ultimately it's not much better than what you have:

shufpd $1, %xmm0, %xmm1 { Only needed if both fields are in %xmm0 }
mulsd %xmm0,%xmm0
mulsd %xmm1,%xmm1
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0

If you measure the dependencies between the instructions (shufpd and the 
first mulsd can run simultaneously, or equivalently, the two mulsd 
instructions), it still amounts to 4 cycles, assuming each instruction 
takes an equal amount of time to execute (which they don't, but it's a 
reasonable approximation).  The subroutines are also probably too small 
to get accurate timing metrics on them.  It might be something to 
experiment on though - I would hope at the very least that the 
horizontal operations have improved in later years.


I know, though, that vectorising instructions is, by and large, a net 
gain.  For example, take the simpler case of adding two complex numbers 
together:


  operator + (z1, z2 : complex) z : complex; vectorcall;
  {$ifdef TEST_INLINE}
  inline;
  {$endif TEST_INLINE}
    { addition : z := z1 + z2 }
    begin
      z.re := z1.re + z2.re;
      z.im := z1.im + z2.im;
    end;

No horizontal adds here - just a simple packed addition that stores the 
result in %xmm0, as opposed to two scalar additions followed by combining 
the result in whatever way is demanded (if aligned, it's all in %xmm0; 
if unaligned, I think %xmm0 and %xmm1 are supposed to be used).  Mind 
you, in this case the function is inlined, so the parameter passing 
doesn't always apply.
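Written with C intrinsics, the packed form of that addition boils down to a single instruction (an illustrative sketch; the function name is made up):

#include <immintrin.h>

/* Sketch only: packed addition of two complex numbers held as (re, im)
   pairs in XMM registers - one addpd instead of two addsd. */
static inline __m128d complex_add(__m128d z1, __m128d z2)
{
    return _mm_add_pd(z1, z2);
}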


Once again though, I was surprised at how inefficient HADDPS is once you 
pointed it out.  The double-precision versions aren't nearly as bad 
though, so maybe they can still be used.


Gareth aka. Kit

P.S. As far as 128-bit aligned vector types are concerned, vectorcall 
and the System V ABI can be considered equivalent. Vectorcall can use 
more MM registers for return values and more complex aggregates as 
parameters, but in our examples, we don't have to worry about that yet.



On 23/10/2019 21:20, Florian Klämpfl wrote:

On 22.10.19 at 05:01, J. Gareth Moreton wrote:


mulpd    %xmm0, %xmm0 { Calculates "re * re" and "im * im" 
simultaneously }
haddpd    %xmm0, %xmm0 { Adds the above multiplications together 
(horizontal add) }


Unfortunately, those horizontal operations are normally not very 
efficient IIRC.





___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

2019-10-23 Thread J. Gareth Moreton
So I did a bit of reading after finding the "mpx-linux64-abi.pdf" 
document.  As I suspected, the System V ABI is like vectorcall when it 
comes to using the XMM registers... only the types __m128, __float128 
and __Decimal128 use the "SSEUP" class and hence use the entire 
register.  The types are opaque, but both their size and alignment are 
16 bytes, so I think anything that abides by those rules can be 
considered equivalent.


If the complex type is unaligned, the two fields get their own XMM 
register.  If aligned, they both go into %xmm0.  At least that is what I 
gathered from reading the document - it's a little unclear sometimes.


Gareth aka. Kit

On 23/10/2019 06:59, Florian Klämpfl wrote:

On 23 October 2019 at 01:14:03, "J. Gareth Moreton" wrote:


That's definitely a marked improvement.  Under the System V ABI and
vectorcall, both fields of a complex type would be passed through xmm0.
Splitting it up into two separate registers would require something like:


shufpd $3, %xmm0, %xmm1 { Copy the high-order Double into the low-order
position - an immediate operand of "1" will also work, since we're not
concerned with the upper 64 bits of %xmm1 }


After which your compiled code will work correctly (since it looks like
%xmm1 was undefined before):

The code is correct; on x86_64-linux, vectorcall is ignored.  Supporting 
vectorcall with my approach would be more difficult.







___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

2019-10-23 Thread Florian Klämpfl

On 22.10.19 at 05:01, J. Gareth Moreton wrote:


mulpd    %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
haddpd    %xmm0, %xmm0 { Adds the above multiplications together 
(horizontal add) }


Unfortunately, those horizontal operations are normally not very 
efficient IIRC.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

2019-10-23 Thread Florian Klämpfl
On 23 October 2019 at 01:14:03, "J. Gareth Moreton" wrote:

> That's definitely a marked improvement.  Under the System V ABI and
> vectorcall, both fields of a complex type would be passed through xmm0.
> Splitting it up into two separate registers would require something like:
>
>
> shufpd $3, %xmm0, %xmm1 { Copy the high-order Double into the low-order
> position - an immediate operand of "1" will also work, since we're not
> concerned with the upper 64 bits of %xmm1 }
>
>
> After which your compiled code will work correctly (since it looks like
> %xmm1 was undefined before):

The code is correct; on x86_64-linux, vectorcall is ignored.  Supporting 
vectorcall with my approach would be more difficult.



___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

2019-10-22 Thread J. Gareth Moreton
That's definitely a marked improvement.  Under the System V ABI and 
vectorcall, both fields of a complex type would be passed through xmm0.  
Splitting it up into two separate registers would require something like:


shufpd    $3, %xmm0, %xmm1 { Copy the high-order Double into the low-order 
position - an immediate operand of "1" will also work, since we're not 
concerned with the upper 64 bits of %xmm1 }


After which your compiled code will work correctly (since it looks like 
%xmm1 was undefined before):


mulsd    %xmm0,%xmm0
mulsd    %xmm1,%xmm1
addsd    %xmm0,%xmm1 { In terms of register usage, the most optimal 
combination of instructions here would be "addsd %xmm1,%xmm0" then 
"sqrtsd %xmm0,%xmm0", since %xmm1 is released for other purposes one 
instruction sooner }

sqrtsd    %xmm1,%xmm0
ret

Otherwise you'd have to load in the data from reference (%rcx under 
win64, and %rdi under other x86_64 platforms) - for example:


movsd    (%rcx),%xmm0
movsd    8(%rcx),%xmm1

I would be interested to see the patch when it's ready.

Under SSE2 (no horizontal add), I think the most optimal set of 
instructions (assuming the entirety of the parameter is passed through 
%xmm0) is:


mulpd    %xmm0,%xmm0
shufpd    $3, %xmm0, %xmm1
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
ret
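The same idea expressed with C intrinsics, for comparison (an illustrative sketch with a made-up name; SSE2 only, no horizontal add):

#include <immintrin.h>
#include <math.h>

/* Sketch only: square both components, bring the high product down to
   the low lane, add, then take the scalar square root. */
double cmod_sse2(__m128d z)
{
    __m128d sq = _mm_mul_pd(z, z);        /* re*re | im*im         */
    __m128d hi = _mm_unpackhi_pd(sq, sq); /* im*im in the low lane */
    return sqrt(_mm_cvtsd_f64(_mm_add_sd(sq, hi)));
}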

The main motivation in my eyes is that it removes one of the 
multiplication instructions.  Mind you, on a modern processor, a pair of 
"mulsd" instructions working on independent data will execute 
simultaneously, so the only time a cycle-counting improvement becomes 
visible is if the core is hyperthreaded and another thread is using the 
ALUs.  Of course, a sufficiently skilled assembler programmer will be 
able to beat the compiler in many cases, but it's still a target to 
strive for.


Gareth aka. Kit

On 22/10/2019 22:03, Florian Klämpfl wrote:

On 22.10.19 at 05:01, J. Gareth Moreton wrote:



Bigger challenges would be optimising the modulus of a complex number:

  function cmod (z : complex): real; vectorcall;
    { module : r = |z| }
    begin
      with z do
        cmod := sqrt((re * re) + (im * im));
    end;

A perfect compiler with permission to use SSE3 (for haddpd) should 
generate the following (note that no stack frame is required):


mulpd    %xmm0, %xmm0 { Calculates "re * re" and "im * im" 
simultaneously }
haddpd    %xmm0, %xmm0 { Adds the above multiplications together 
(horizontal add) }

sqrtsd    %xmm0,%xmm0
ret

Currently, with vectorcall, the routine compiles into this:

leaq    -24(%rsp),%rsp
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret

And without vectorcall (or an unaligned record type):

leaq    -24(%rsp),%rsp
movq    %rcx,%rax
movq    (%rax),%rdx
movq    %rdx,(%rsp)
movq    8(%rax),%rax
movq    %rax,8(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret



With a few additions to the compiler (the git patch is less than 500 
lines; it is not ready for committing yet), I get:


.section .text.n_p$program_$$_cmod$complex$$real,"ax"
.balign 16,0x90
.globl    P$PROGRAM_$$_CMOD$COMPLEX$$REAL
.type    P$PROGRAM_$$_CMOD$COMPLEX$$REAL,@function
P$PROGRAM_$$_CMOD$COMPLEX$$REAL:
.Lc2:
# Var $result located in register xmm0
# Var z located in register xmm0
# [test.pp]
# [20] begin
# [22] cmod := sqrt((re * re) + (im * im));
mulsd    %xmm0,%xmm0
mulsd    %xmm1,%xmm1
addsd    %xmm0,%xmm1
sqrtsd    %xmm1,%xmm0
# Var $result located in register xmm0
.Lc3:
# [23] end;
ret
.Lc1:
.Le0:
.size    P$PROGRAM_$$_CMOD$COMPLEX$$REAL, .Le0 - 
P$PROGRAM_$$_CMOD$COMPLEX$$REAL


It mainly keeps records in mm registers.  I am not sure about the right 
approach yet, but allocating one register to each field of suitable 
records seems a reasonable approach.





___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

2019-10-22 Thread Florian Klämpfl

On 22.10.19 at 05:01, J. Gareth Moreton wrote:



Bigger challenges would be optimising the modulus of a complex number:

  function cmod (z : complex): real; vectorcall;
    { module : r = |z| }
    begin
      with z do
        cmod := sqrt((re * re) + (im * im));
    end;

A perfect compiler with permission to use SSE3 (for haddpd) should 
generate the following (note that no stack frame is required):


mulpd    %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
haddpd    %xmm0, %xmm0 { Adds the above multiplications together 
(horizontal add) }

sqrtsd    %xmm0,%xmm0
ret

Currently, with vectorcall, the routine compiles into this:

leaq    -24(%rsp),%rsp
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret

And without vectorcall (or an unaligned record type):

leaq    -24(%rsp),%rsp
movq    %rcx,%rax
movq    (%rax),%rdx
movq    %rdx,(%rsp)
movq    8(%rax),%rax
movq    %rax,8(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret



With a few additions to the compiler (the git patch is less than 500 
lines; it is not ready for committing yet), I get:


.section .text.n_p$program_$$_cmod$complex$$real,"ax"
.balign 16,0x90
.globl  P$PROGRAM_$$_CMOD$COMPLEX$$REAL
.type   P$PROGRAM_$$_CMOD$COMPLEX$$REAL,@function
P$PROGRAM_$$_CMOD$COMPLEX$$REAL:
.Lc2:
# Var $result located in register xmm0
# Var z located in register xmm0
# [test.pp]
# [20] begin
# [22] cmod := sqrt((re * re) + (im * im));
mulsd   %xmm0,%xmm0
mulsd   %xmm1,%xmm1
addsd   %xmm0,%xmm1
sqrtsd  %xmm1,%xmm0
# Var $result located in register xmm0
.Lc3:
# [23] end;
ret
.Lc1:
.Le0:
	.size	P$PROGRAM_$$_CMOD$COMPLEX$$REAL, .Le0 - 
P$PROGRAM_$$_CMOD$COMPLEX$$REAL


It mainly keeps records in mm registers.  I am not sure about the right 
approach yet, but allocating one register to each field of suitable 
records seems a reasonable approach.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

2019-10-21 Thread J. Gareth Moreton

This is a long read, so strap in!

Well, I finally got it to work - the required type definition was as follows:

{$push}
{$codealign RECORDMIN=16}
{$PACKRECORDS C}
  { This record forces "complex" to be aligned to a 16-byte boundary }
  type align_dummy = record
    filler: array[0..1] of real;
  end;
{$pop}

  type complex = record
    case Byte of
      0: (
        alignment: align_dummy;
      );
      1: (
        re : real;
        im : real;
      );
  end;

It is so, so easy to get wrong because if align_dummy's field is 1, 2, 4 
or 8 bytes in size, it is classed as an integer under Windows, and that 
overrides the Double-type in the union, causing the entire record to 
still be passed by reference.  Additionally, the dummy field has to be 
of type Single or Double (or Real); if it is an integral type (e.g. 
"array[0..15] of Byte"), it is once again classified as an integer and 
overrides the Double type as per the rules of System V ABI parameter 
classification (in other words, the entire thing would get passed by 
reference under both x86_64-win64 and x86_64-linux etc.).  Long story 
short, this is an absolute minefield!!
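To illustrate the classification rule on the System V side (a hedged sketch with made-up type names; it says nothing about the Win64 rules, which are stricter):

/* Sketch only (System V x86-64).  A record whose eightbytes are all
   floating point is classed SSE and travels in XMM registers; an
   integer field changes the class of its eightbyte and pulls that part
   into a general-purpose register instead. */
typedef struct { double re, im; } c_fp;         /* SSE + SSE     -> XMM0, XMM1 */
typedef struct { long tag; double re; } c_mix;  /* INTEGER + SSE -> RDI, XMM0  */

double sum_fp (c_fp  z) { return z.re + z.im; }
double sum_mix(c_mix z) { return (double)z.tag + z.re; }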


I still seriously think that having an alignment attribute or some such 
would make life much easier for third-party developers who may not 
know the exact quirks of how x86_64 classifies its parameters.  To me, 
this trick feels incredibly hacky and very hard to get right.


Compiled code isn't perfect though - for example, when moving parameters 
to and from the relevant xmm registers, the "movdqa" instruction is used 
instead of "movapd", which causes a performance penalty because the 
internal CPU state has to switch between double-precision and integer 
(this is why, for example, there are separate VINSERTF128 and 
VINSERTI128 instructions, even though they superficially do the same 
thing).  Additionally, inlined vectorcall routines still seem to fall 
back onto using movq to transfer 8 bytes at a time between a function 
result and wherever it is to be stored, but this is because everything 
is decomposed at the node level and the compiler currently lacks any 
decent vectorisation algorithms.


Nevertheless, I think I'm ready to prepare a patch for uComplex for 
evaluation, and it's given me some things to play with to see if the 
compiler can be made to work with packed data better.  I figure the 
uComplex unit is a good place to start because it's an array of 2 
Doubles internally and a lot of the operations like addition are 
component-wise.


Bigger challenges would be optimising the modulus of a complex number:

  function cmod (z : complex): real; vectorcall;
    { module : r = |z| }
    begin
      with z do
        cmod := sqrt((re * re) + (im * im));
    end;

A perfect compiler with permission to use SSE3 (for haddpd) should 
generate the following (note that no stack frame is required):


mulpd    %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
haddpd    %xmm0, %xmm0 { Adds the above multiplications together 
(horizontal add) }

sqrtsd    %xmm0,%xmm0
ret

Currently, with vectorcall, the routine compiles into this:

leaq    -24(%rsp),%rsp
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret

And without vectorcall (or an unaligned record type):

leaq    -24(%rsp),%rsp
movq    %rcx,%rax
movq    (%rax),%rdx
movq    %rdx,(%rsp)
movq    8(%rax),%rax
movq    %rax,8(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret
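For cross-checking against C compilers, the same computation written with intrinsics (an illustrative sketch; the name is made up and it requires SSE3 for the horizontal add):

#include <immintrin.h>
#include <math.h>

/* Sketch only: mirrors the ideal mulpd + haddpd sequence above. */
double cmod_intrin(__m128d z)
{
    __m128d sq  = _mm_mul_pd(z, z);    /* re*re | im*im  */
    __m128d sum = _mm_hadd_pd(sq, sq); /* horizontal add */
    return sqrt(_mm_cvtsd_f64(sum));
}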

Maybe I'm in the minority here, and definitely getting ahead of myself, 
but seeing ways of improving the compiled assembly language excites me!  
Even without vectorcall, I want to see if I can get my deep optimiser 
into a workable form, because a sequence like "movq %rsp,%rax" followed 
by nothing but reads from %rax is completely unnecessary.  Also, things 
like this:


...
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
...

Just... why?!  Just do "movsd %xmm0,%xmm1"!!  The peephole optimiser may 
struggle to spot this anyway because of the inefficient mixing of 
integer and floating-point XMM instructions - of course, it might be the 
case that the original contents of %xmm0 are needed later, which is 
where my deep optimiser or some other form of data-flow analysis would 
come into play.  Playing the logical flow through in my head, I can see 
it optimising the triplet as follows:


1. Notice that %rax = %rsp and change the movsd instruction to minimise 
a pipeline stall (the later "movsd 8(%rax),%xmm0" instruction would get 
changed too):


...
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rsp),%xmm1
...

2. Notice that %rax is