Re: [fpc-devel] vmul commutative optimization?

2019-11-17 Thread Florian Klämpfl

Am 17.11.19 um 16:46 schrieb Marco van de Voort:


Op 2019-11-17 om 15:49 schreef Florian Klämpfl:


This was an easy one :) Fixed in r43509


Thanks, here is a harder one:

https://bugs.freepascal.org/view.php?id=36324

(  :-) )


Yes, FPC cannot keep record elements in registers. I have a partial 
patch for it, though not finished.


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] vmul commutative optimization?

2019-11-17 Thread Marco van de Voort


Op 2019-11-17 om 15:49 schreef Florian Klämpfl:


This was an easy one :) Fixed in r43509


Thanks, here is a harder one:

https://bugs.freepascal.org/view.php?id=36324

(  :-) )



Re: [fpc-devel] vmul commutative optimization?

2019-11-17 Thread Florian Klämpfl

Am 12.11.19 um 13:22 schrieb Marco van de Voort:

I compiled some bits with avx, and noticed that when you do

asingle:=someconstant*othersingle;

then that generates something like

     vmovss    TC_$FFTS_$$_C31(%rip),%xmm2
     vmulss    %xmm0,%xmm2,%xmm0

while if you do

asingle:=othersingle*someconstant;

it generates

     vmulss    TC_$FFTS_$$_C32(%rip),%xmm2,%xmm2


I assume the reason is that only the first param can be an address, and 
the second a register. But the compiler isn't smart enough to exchange 
them.


This was an easy one :) Fixed in r43509


Re: [fpc-devel] vmul commutative optimization?

2019-11-15 Thread J. Gareth Moreton
I'll double-check that one with __m64 and a Homogeneous Float Aggregate.
That sounds like a bug. I'll do some reading up on the documentation too.

Gareth aka. Kit


Re: [fpc-devel] vmul commutative optimization?

2019-11-15 Thread Marco van de Voort


Op 14/11/2019 om 01:14 schreef J. Gareth Moreton:



I guess that means testing with VS?


Testing with Visual Studio or even GCC under Windows is a good idea if 
you want to be sure how particular record types are transferred.  The 
example given in that article has two fields of type __m128, even 
though it looks like only one of the four vector elements is used 
initially.  Regardless, under the default Microsoft calling 
convention, that would be passed by reference, just like a record of 
two Doubles.  A (packed) record of two Singles would be passed by 
value in an integer register, just to cause trouble with conversions!


To be clear: I meant whether, under vectorcall, a 64-bit vector of 2 
singles is passed in an XMM register instead of in integer registers.


It was meant more as a research point; I don't need it anymore. After 
realizing that I need either autovectorization or intrinsics, I simply 
started on a naive 1:1 translation to assembler (but with a complex 
kept as two singles in an XMM register). It took a bit of fiddling to 
define multiplication by j in XMM assembler (doing a NOT on one of the 
two singles), but otherwise it was simple.
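
For reference, multiplying a complex number by j maps (re, im) to (-im, re), which is why the assembler version comes down to swapping the two singles and changing the sign of one of them. A plain Pascal sketch of that identity follows; `TComplexSingle` and `MulJ` are illustrative names and are not taken from the actual FFT code.

```pascal
program mulj;

{$mode objfpc}

type
  { Hypothetical 8-byte complex type: two singles, as discussed above. }
  TComplexSingle = record
    re, im: Single;
  end;

{ (re + j*im) * j = -im + j*re: swap the parts and negate the new real part. }
function MulJ(const c: TComplexSingle): TComplexSingle;
begin
  Result.re := -c.im;
  Result.im := c.re;
end;

var
  c, r: TComplexSingle;
begin
  c.re := 1.0;
  c.im := 2.0;
  r := MulJ(c);
  WriteLn(r.re:0:1, ' ', r.im:0:1);
end.
```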


I finished the first stage (the radix functions for the radices that I 
use: 4, 5 and 10) and got things working, and both run time and 
instruction count dropped by a factor of 3 (not 100% logical, since the 
asm version has relatively more complex instructions).


Under vectorcall, a record of two Singles would be treated as a 
Homogeneous Float Aggregate and pass the two fields in XMM0 and XMM1


Afaik FPC doesn't do that yet; it passes the record in an integer 
register. Pity. Passed as an __m64 in a register, it would have been 
nice for complex-with-singles.


, and the same thing happens with an unaligned record of two Doubles.  
If a record of two Doubles is aligned to a 16-byte boundary though, or 
is otherwise a union with a __m128 type (with the two Doubles aliased 
to the lower and upper 64 bits respectively), then it can be passed in 
its entirety through XMM0.


Some things are a little bit messy and opaque with __m128 though, and 
just making an aligned array of 4 Singles or 2 Doubles doesn't always 
work - it needs to be typecast through __m128 in some way - but I 
think that's mostly because C++ wasn't really designed with alignment 
in mind.  In Free Pascal, you have to make a bit of a messy union to 
ensure everything works; for example:


I already use that union, copied from your patch but changed to 
singles. It doesn't do much, though.




Re: [fpc-devel] vmul commutative optimization?

2019-11-13 Thread J. Gareth Moreton


On 13/11/2019 16:03, Marco van de Voort wrote:


Op 2019-11-12 om 20:46 schreef J. Gareth Moreton:


The Microsoft ABI is a bit restrictive when it comes to record types; 
as described here 
, 
"Structs and unions of size 8, 16, 32, or 64 bits, and __m64 types, 
are passed as if they were integers of the same size." So 
unfortunately, a single-precision complex number is treated as a 
64-bit structure and passed as an integer.  The System V ABI, on the 
other hand, would pass the two entries through the lower 64 bits of 
XMM0.  Vectorcall, theoretically, should put the two components into 
XMM0 and XMM1, because the complex type would be considered a 
"homogeneous vector aggregate" (with floats as 1-dimensional vectors).


I've found refs like 
https://devblogs.microsoft.com/cppblog/introducing-vector-calling-convention/#comments 
so the question is whether partial vectors (and especially 8-byte 
vectors of 2 singles, since there are various special SSE opcodes to 
deal with them) occupy one register or not. The references I found 
usually talk about "vector types like __m128 and __m256", but don't 
really give an exhaustive list.



I guess that means testing with VS?


Testing with Visual Studio or even GCC under Windows is a good idea if 
you want to be sure how particular record types are transferred.  The 
example given in that article has two fields of type __m128, even though 
it looks like only one of the four vector elements is used initially.  
Regardless, under the default Microsoft calling convention, that would 
be passed by reference, just like a record of two Doubles.  A (packed) 
record of two Singles would be passed by value in an integer register, 
just to cause trouble with conversions!


But to give a clear answer, under fastcall (the default Microsoft ABI), 
a record of two Singles will be passed by value through an integer 
register (RCX if it's the first parameter), and a record of two Doubles 
will be passed by reference (pointer in RCX if it's the first 
parameter).  Under vectorcall, a record of two Singles would be treated 
as a Homogeneous Float Aggregate and pass the two fields in XMM0 and 
XMM1, and the same thing happens with an unaligned record of two 
Doubles.  If a record of two Doubles is aligned to a 16-byte boundary 
though, or is otherwise a union with a __m128 type (with the two Doubles 
aliased to the lower and upper 64 bits respectively), then it can be 
passed in its entirety through XMM0.


Some things are a little bit messy and opaque with __m128 though, and 
just making an aligned array of 4 Singles or 2 Doubles doesn't always 
work - it needs to be typecast through __m128 in some way - but I think 
that's mostly because C++ wasn't really designed with alignment in 
mind.  In Free Pascal, you have to make a bit of a messy union to ensure 
everything works; for example:


{$push}
{$codealign RECORDMIN=16}
{$PACKRECORDS C}
type
  align_dummy = record
    filler: array[0..1] of Double;
  end;
{$pop}

type
  complex = record
    case Byte of
      0: (
        alignment: align_dummy;
      );
      1: (
        re: real;
        im: real;
      );
  end;

Trying to apply RECORDMIN=16 to complex directly just puts each field on 
a 16-byte boundary.  It's why I proposed allowing "align 16" between 
"end" and the semicolon so it can be done with relative ease, since it's 
so easy to get wrong and it's not obvious that it's wrong until you 
measure performance benchmarks and look at the disassembly.
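
Since the mistake is easy to miss, a quick check of the resulting layout may help. The sketch below repeats the declarations from above and prints the record size and the address of one variable modulo 16. The program name and the variable `c` are made up for illustration, and what gets printed depends on the compiler version and target, so treat it as a starting point and still confirm in the disassembly.

```pascal
program aligncheck;

{$mode objfpc}

{$push}
{$codealign RECORDMIN=16}
{$PACKRECORDS C}
type
  align_dummy = record
    filler: array[0..1] of Double;
  end;
{$pop}

type
  complex = record
    case Byte of
      0: (alignment: align_dummy);
      1: (re: real; im: real);
  end;

var
  c: complex;
begin
  WriteLn('SizeOf(complex) = ', SizeOf(complex));
  { 0 means this variable starts on a 16-byte boundary. }
  WriteLn('address mod 16  = ', PtrUInt(@c) mod 16);
end.
```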


Gareth aka. Kit



Re: [fpc-devel] vmul commutative optimization?

2019-11-13 Thread Marco van de Voort


Op 2019-11-12 om 20:46 schreef J. Gareth Moreton:


The Microsoft ABI is a bit restrictive when it comes to record types; 
as described here 
, 
"Structs and unions of size 8, 16, 32, or 64 bits, and __m64 types, 
are passed as if they were integers of the same size." So 
unfortunately, a single-precision complex number is treated as a 
64-bit structure and passed as an integer.  The System V ABI, on the 
other hand, would pass the two entries through the lower 64 bits of 
XMM0.  Vectorcall, theoretically, should put the two components into 
XMM0 and XMM1, because the complex type would be considered a 
"homogeneous vector aggregate" (with floats as 1-dimensional vectors).


I've found refs like 
https://devblogs.microsoft.com/cppblog/introducing-vector-calling-convention/#comments 
so the question is whether partial vectors (and especially 8-byte 
vectors of 2 singles, since there are various special SSE opcodes to 
deal with them) occupy one register or not. The references I found 
usually talk about "vector types like __m128 and __m256", but don't 
really give an exhaustive list.



I guess that means testing with VS?




Re: [fpc-devel] vmul commutative optimization?

2019-11-12 Thread J. Gareth Moreton
The Microsoft ABI is a bit restrictive when it comes to record types; as 
described here 
, 
"Structs and unions of size 8, 16, 32, or 64 bits, and __m64 types, are 
passed as if they were integers of the same size." So unfortunately, a 
single-precision complex number is treated as a 64-bit structure and 
passed as an integer.  The System V ABI, on the other hand, would pass 
the two entries through the lower 64 bits of XMM0.  Vectorcall, 
theoretically, should put the two components into XMM0 and XMM1, because 
the complex type would be considered a "homogeneous vector aggregate" 
(with floats as 1-dimensional vectors).


I think the overhead that comes with issues such as this is the reason 
why vectorcall was developed in the first place.
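
The size rule quoted above can be checked directly. This sketch uses two made-up record types and only prints their sizes: the single-precision record is 8 bytes (64 bits), so it falls under the "passed as an integer" rule, while the double-precision record is 16 bytes and does not. How a given compiler actually passes them still has to be confirmed in the disassembly.

```pascal
program recsize;

{$mode objfpc}

type
  { Hypothetical complex types, named here for illustration only. }
  TComplexSingle = record
    re, im: Single;
  end;

  TComplexDouble = record
    re, im: Double;
  end;

begin
  WriteLn('TComplexSingle: ', SizeOf(TComplexSingle), ' bytes');
  WriteLn('TComplexDouble: ', SizeOf(TComplexDouble), ' bytes');
end.
```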


Gareth aka. Kit

On 12/11/2019 16:05, Marco van de Voort wrote:


Op 12/11/2019 om 16:08 schreef J. Gareth Moreton:
It's true.  With VMULSS, only the first parameter (third parameter 
under Intel notation) can be an address (source: Intel(R) 64 and 
IA-32 Architectures Software Development Manual, Volume 2B, Page 4-154).


I'll see if I can work in that optimisation for the commutative 
operations (+ and *) at some point from the node side.


Thanks.

Another tidbit I noticed while playing with (elements of) the complex 
patch is that if I set the element size to double (re:double;im:double), 
vectorcall loads all data into registers.


However if I make it single, (iow the tcomplex is 8-byte), the records 
are loaded into integer registers, and the compiler stores them to the 
stack and then reloads them.


This matters less for me since it won't vectorize anyway (see the inline 
and philosophy thread). I think I'll change this routine to assembler, 
accepting a pointer and loading and storing through that.




Re: [fpc-devel] vmul commutative optimization?

2019-11-12 Thread Marco van de Voort


Op 12/11/2019 om 16:08 schreef J. Gareth Moreton:
It's true.  With VMULSS, only the first parameter (third parameter 
under Intel notation) can be an address (source: Intel(R) 64 and IA-32 
Architectures Software Development Manual, Volume 2B, Page 4-154).


I'll see if I can work in that optimisation for the commutative 
operations (+ and *) at some point from the node side.


Thanks.

Another tidbit I noticed while playing with (elements of) the complex 
patch is that if I set the element size to double (re:double;im:double), 
vectorcall loads all data into registers.


However if I make it single, (iow the tcomplex is 8-byte), the records 
are loaded into integer registers, and the compiler stores them to the 
stack and then reloads them.


This matters less for me since it won't vectorize anyway (see the inline 
and philosophy thread). I think I'll change this routine to assembler, 
accepting a pointer and loading and storing through that.




Re: [fpc-devel] vmul commutative optimization?

2019-11-12 Thread J. Gareth Moreton
It's true.  With VMULSS, only the first parameter (third parameter under 
Intel notation) can be an address (source: Intel(R) 64 and IA-32 
Architectures Software Development Manual, Volume 2B, Page 4-154).


I'll see if I can work in that optimisation for the commutative 
operations (+ and *) at some point from the node side.
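
The idea can be illustrated with a toy model. This is only a sketch under stated assumptions: `TOperandLoc`, `TBinaryOp` and `PreferMemoryOnRight` are hypothetical names and have nothing to do with FPC's real node classes. It assumes, based on the listings quoted below, that the right-hand operand is the one that can be folded into the instruction as a memory operand.

```pascal
program swapops;

{$mode objfpc}

type
  { Toy model only; these are not FPC's real node or location types. }
  TOperandLoc = (olRegister, olMemory);

  TBinaryOp = record
    Commutative: Boolean;   { true for + and * }
    Left, Right: TOperandLoc;
  end;

{ If the left operand sits in memory and the right one in a register,
  swap them so the memory operand ends up where the instruction accepts it. }
procedure PreferMemoryOnRight(var op: TBinaryOp);
var
  tmp: TOperandLoc;
begin
  if op.Commutative and (op.Left = olMemory) and (op.Right = olRegister) then
  begin
    tmp := op.Left;
    op.Left := op.Right;
    op.Right := tmp;
  end;
end;

var
  op: TBinaryOp;
begin
  { someconstant * othersingle: the constant in memory is on the left. }
  op.Commutative := True;
  op.Left := olMemory;
  op.Right := olRegister;
  PreferMemoryOnRight(op);
  if op.Right = olMemory then
    WriteLn('swapped')
  else
    WriteLn('unchanged');
end.
```

The swap is only safe for operations where operand order does not change the result, which is why it is limited to + and * here.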


Gareth aka. Kit


On 12/11/2019 12:22, Marco van de Voort wrote:

I compiled some bits with avx, and noticed that when you do

asingle:=someconstant*othersingle;

then that generates something like

    vmovss    TC_$FFTS_$$_C31(%rip),%xmm2
    vmulss    %xmm0,%xmm2,%xmm0

while if you do

asingle:=othersingle*someconstant;

it generates

    vmulss    TC_$FFTS_$$_C32(%rip),%xmm2,%xmm2


I assume the reason is that only the first param can be an address, 
and the second a register. But the compiler isn't smart enough to 
exchange them.




[fpc-devel] vmul commutative optimization?

2019-11-12 Thread Marco van de Voort

I compiled some bits with avx, and noticed that when you do

asingle:=someconstant*othersingle;

then that generates something like

    vmovss    TC_$FFTS_$$_C31(%rip),%xmm2
    vmulss    %xmm0,%xmm2,%xmm0

while if you do

asingle:=othersingle*someconstant;

it generates

    vmulss    TC_$FFTS_$$_C32(%rip),%xmm2,%xmm2


I assume the reason is that only the first param can be an address, and 
the second a register. But the compiler isn't smart enough to exchange them.
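
A small test program along these lines should reproduce the two cases. It is only a sketch: the function names `ConstFirst` and `ConstSecond` are made up for illustration, and the command line in the header comment (`-CfAVX` to select AVX, `-al` to keep the assembler listing) is an assumption about the setup, so adjust it to your compiler version.

```pascal
program vmulorder;

{ Assumed build command: fpc -O2 -CfAVX -al vmulorder.pas }

{$mode objfpc}

const
  { A typed constant lives in memory, so it can show up as a memory operand. }
  someconstant: Single = 0.5;

{ Constant on the left: the case reported to need an extra vmovss. }
function ConstFirst(othersingle: Single): Single;
begin
  Result := someconstant * othersingle;
end;

{ Constant on the right: the case reported to fold into one vmulss. }
function ConstSecond(othersingle: Single): Single;
begin
  Result := othersingle * someconstant;
end;

begin
  { Both orders must give the same value; only the generated code differs. }
  if ConstFirst(3.0) = ConstSecond(3.0) then
    WriteLn('same result')
  else
    WriteLn('different result');
end.
```

Comparing the two function bodies in the generated `.s` file shows whether the compiler in use still emits the extra load.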

