Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
The following passes everything through XMM0:

/* The header names were elided in the original message; <emmintrin.h> (for
   __m128d) and <math.h> (for sqrt) are the obvious candidates. */
#include <emmintrin.h>
#include <math.h>

double Mod(__m128d z) {
    return sqrt((z[0] * z[0]) + (z[1] * z[1]));
}

int main() {
    __m128d z;
    z[0] = 0;
    z[1] = 1;
    double d = Mod(z);
}

I will admit that it's very fiddly to get right. All of my attempts to map an anonymous struct to __m128d via a union (so you could write z.re and z.im rather than access the array elements) were unsuccessful. C++ is not very friendly with vector types and you have to go out of your way to get the compiler to be efficient with them, but the System V ABI does support utilising the full vector registers.

It took me a while to work out why a record type with two single-precision elements is passed in just XMM0, but this is because the record as a whole has a size of eight bytes and is therefore passed as a single argument of class SSE. If the two fields are instead passed as two separate arguments, they go individually through XMM0 and XMM1. It seems you have to interpret this document very literally to get it right: https://www.uclibc.org/docs/psABI-x86_64.pdf

Gareth aka. Kit

On 27/10/2019 08:13, Florian Klämpfl wrote:
> On 23/10/2019 22:36, J. Gareth Moreton wrote:
>> So I did a bit of reading after finding the "mpx-linux64-abi.pdf" document. As I suspected, the System V ABI is like vectorcall when it comes to using the XMM registers... only the types __m128, __float128 and __Decimal128 use the "SSEUP" class and hence use the entire register. The types are opaque, but both their size and alignment are 16 bytes, so I think anything that abides by those rules can be considered equivalent. If the complex type is unaligned, the two fields get their own XMM register. If aligned, they both go into %xmm0. At least that is what I gathered from reading the document - it's a little unclear sometimes.
>
> I briefly tested with godbolt (https://godbolt.org/): records of two doubles are passed in two xmm registers regardless of the alignment, two floats (so single) are passed in one xmm register.
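For the Pascal side of the same contrast, a rough, untested sketch (names invented purely for illustration; whether FPC actually generates register-packed code for the record case is what the rest of this thread is about):

type
  TSinglePair = record
    re, im: Single;  { 8 bytes in total - one SSE-class argument under System V }
  end;

{ Under the System V rules described above, the whole record should arrive packed in %xmm0. }
function ModPair(z: TSinglePair): Single;
begin
  ModPair := Sqrt((z.re * z.re) + (z.im * z.im));
end;

{ Passed as two separate scalar arguments instead: %xmm0 and %xmm1. }
function ModScalars(re, im: Single): Single;
begin
  ModScalars := Sqrt((re * re) + (im * im));
end;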
Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
On 23/10/2019 22:36, J. Gareth Moreton wrote:
> So I did a bit of reading after finding the "mpx-linux64-abi.pdf" document. As I suspected, the System V ABI is like vectorcall when it comes to using the XMM registers... only the types __m128, __float128 and __Decimal128 use the "SSEUP" class and hence use the entire register. The types are opaque, but both their size and alignment are 16 bytes, so I think anything that abides by those rules can be considered equivalent. If the complex type is unaligned, the two fields get their own XMM register. If aligned, they both go into %xmm0. At least that is what I gathered from reading the document - it's a little unclear sometimes.

I briefly tested with godbolt (https://godbolt.org/): records of two doubles are passed in two xmm registers regardless of the alignment, two floats (so single) are passed in one xmm register.
Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
In the meantime, if everything seems present and correct, https://bugs.freepascal.org/view.php?id=36202 contains the alignment and vectorcall modifiers for uComplex. It shouldn't affect anything outside of x86_64, and it should still keep the unit very lightweight, which I believe was the original intent.

Gareth aka. Kit
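For anyone skimming, the general shape of the change is something like the sketch below. This is a simplified, hypothetical illustration only, not the actual patch (which is attached to the bug report): the vectorcall modifier is applied on x86_64 only, so every other target compiles the unit exactly as before.

function cmod(z: complex): real; {$ifdef CPUX86_64} vectorcall; {$endif}
{ modulus : r = |z| }
begin
  cmod := sqrt((z.re * z.re) + (z.im * z.im));
end;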
Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
Hmmm, that is unfortunate if the horizontal operations are inefficient. I had a look at them in https://www.agner.org/optimize/instruction_tables.pdf - you are right in that HADDPS has a surprisingly high latency (roughly, the number of cycles before its result is available), although HADDPD isn't as bad, probably because it's only dealing with 2 Doubles instead of 4 Singles, and it seems mostly equivalent in speed to the multiplication instructions.

Using just SSE2:

mulpd %xmm0,%xmm0
shufpd %xmm0,%xmm1,1
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0

Ultimately it's not much better than what you have:

shufpd %xmm0,%xmm1,1 { Only needed if both fields are in %xmm0 }
mulsd %xmm0,%xmm0
mulsd %xmm1,%xmm1
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0

If you trace the dependencies between the instructions (shufpd and the first mulsd can run simultaneously, or equivalently, the two mulsd instructions), it still amounts to 4 cycles, assuming each instruction takes an equal amount of time to execute (which they don't, but it's a reasonable approximation). The subroutines are also probably too small to get accurate timing metrics on them. It might be something to experiment on though - I would hope at the very least that the horizontal operations have improved in later years.

I do know that vectorising instructions is, by and large, a net gain. For example, let's go to a simpler example of adding two complex numbers together:

operator + (z1, z2 : complex) z : complex; vectorcall; {$ifdef TEST_INLINE} inline; {$endif TEST_INLINE}
{ addition : z := z1 + z2 }
begin
  z.re := z1.re + z2.re;
  z.im := z1.im + z2.im;
end;

No horizontal adds here, just a simple packed addition that stores the result in %xmm0, as opposed to two scalar additions and then combining the result in whatever way is demanded (if aligned, it's all in %xmm0; if unaligned, I think %xmm0 and %xmm1 are supposed to be used). Mind you, in this case the function is inlined, so the parameter passing doesn't always apply. Once again though, I was surprised at how inefficient HADDPS is once you pointed it out. The double-precision versions aren't nearly as bad though, so maybe they can still be used.

Gareth aka. Kit

P.S. As far as 128-bit aligned vector types are concerned, vectorcall and the System V ABI can be considered equivalent. Vectorcall can use more MM registers for return values and more complex aggregates as parameters, but in our examples, we don't have to worry about that yet.

On 23/10/2019 21:20, Florian Klämpfl wrote:
> On 22/10/2019 05:01, J. Gareth Moreton wrote:
>> mulpd %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
>> haddpd %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }
> Unfortunately, those horizontal operations are normally not very efficient IIRC.
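If anyone wants to measure the variants rather than count cycles on paper, a crude harness along these lines would do (an untested sketch; the iteration count is arbitrary, and as noted above the routine is so small that the numbers should be taken with a pinch of salt):

program timecmod;
{ Accumulating into "acc" keeps a data dependency between iterations so the
  calls cannot simply be discarded by the optimiser. }
uses
  SysUtils, uComplex;
var
  z: complex;
  acc: real;
  i: LongInt;
  t0: QWord;
begin
  z.re := 3.0;
  z.im := 4.0;
  acc := 0.0;
  t0 := GetTickCount64;
  for i := 1 to 100000000 do
    acc := acc + cmod(z);
  WriteLn('cmod: ', GetTickCount64 - t0, ' ms (acc = ', acc:0:2, ')');
end.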
Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
So I did a bit of reading after finding the "mpx-linux64-abi.pdf" document. As I suspected, the System V ABI is like vectorcall when it comes to using the XMM registers... only the types __m128, __float128 and __Decimal128 use the "SSEUP" class and hence use the entire register. The types are opaque, but both their size and alignment are 16 bytes, so I think anything that abides by those rules can be considered equivalent. If the complex type is unaligned, the two fields get their own XMM register. If aligned, they both go into %xmm0. At least that is what I gathered from reading the document - it's a little unclear sometimes.

Gareth aka. Kit

On 23/10/2019 06:59, Florian Klämpfl wrote:
> On 23 October 2019 at 01:14:03, "J. Gareth Moreton" wrote:
>> That's definitely a marked improvement. Under the System V ABI and vectorcall, both fields of a complex type would be passed through xmm0. Splitting it up into two separate registers would require something like:
>>
>> shufpd %xmm0,%xmm1,3 { Copy the high-order Double into the low-order position - an immediate operand of "1" will also work, since we're not concerned with the upper 64 bits of %xmm1 }
>>
>> After which your compiled code will work correctly (since it looks like %xmm1 was undefined before):
> The code is correct; on x86_64-linux, vectorcall is ignored. Supporting vectorcall with my approach would be more difficult.
Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
On 22/10/2019 05:01, J. Gareth Moreton wrote:
> mulpd %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
> haddpd %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }

Unfortunately, those horizontal operations are normally not very efficient, IIRC.
Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
On 23 October 2019 at 01:14:03, "J. Gareth Moreton" wrote:
> That's definitely a marked improvement. Under the System V ABI and vectorcall, both fields of a complex type would be passed through xmm0. Splitting it up into two separate registers would require something like:
>
> shufpd %xmm0,%xmm1,3 { Copy the high-order Double into the low-order position - an immediate operand of "1" will also work, since we're not concerned with the upper 64 bits of %xmm1 }
>
> After which your compiled code will work correctly (since it looks like %xmm1 was undefined before):

The code is correct; on x86_64-linux, vectorcall is ignored. Supporting vectorcall with my approach would be more difficult.
Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
That's definitely a marked improvement. Under the System V ABI and vectorcall, both fields of a complex type would be passed through xmm0. Splitting it up into two separate registers would require something like:

shufpd %xmm0,%xmm1,3 { Copy the high-order Double into the low-order position - an immediate operand of "1" will also work, since we're not concerned with the upper 64 bits of %xmm1 }

After which your compiled code will work correctly (since it looks like %xmm1 was undefined before):

mulsd %xmm0,%xmm0
mulsd %xmm1,%xmm1
addsd %xmm0,%xmm1 { In terms of register usage, the best combination of instructions here would be "addsd %xmm1,%xmm0" followed by "sqrtsd %xmm0,%xmm0", since %xmm1 is released for other purposes one instruction sooner }
sqrtsd %xmm1,%xmm0
ret

Otherwise you'd have to load the data in from a reference (%rcx under win64, and %rdi under other x86_64 platforms) - for example:

movsd (%rcx),%xmm0
movsd 8(%rcx),%xmm1

I would be interested to see the patch when it's ready. Under SSE2 (no horizontal add), I think the best set of instructions (assuming the entirety of the parameter is passed through %xmm0) is:

mulpd %xmm0,%xmm0
shufpd %xmm0,%xmm1,3
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0
ret

The main motivation in my eyes is that it removes one of the multiplication instructions - mind you, on a modern processor, a pair of "mulsd" instructions working on independent data will be executed simultaneously, in which case the only time a cycle-counting improvement becomes visible is if the core is hyperthreaded and another thread is using the ALUs. Of course, a sufficiently skilled assembler programmer will be able to beat the compiler in many cases, but it's still a target to strive for.

Gareth aka. Kit

On 22/10/2019 22:03, Florian Klämpfl wrote:
> On 22/10/2019 05:01, J. Gareth Moreton wrote:
>> Bigger challenges would be optimising the modulus of a complex number:
>>
>> function cmod (z : complex): real; vectorcall;
>> { module : r = |z| }
>> begin
>>   with z do
>>     cmod := sqrt((re * re) + (im * im));
>> end;
>>
>> A perfect compiler with permission to use SSE3 (for haddpd) should generate the following (note that no stack frame is required):
>>
>> mulpd %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
>> haddpd %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }
>> sqrtsd %xmm0, %xmm0
>> ret
>>
>> Currently, with vectorcall, the routine compiles into this:
>>
>> leaq -24(%rsp),%rsp
>> movdqa %xmm0,(%rsp)
>> movq %rsp,%rax
>> movsd (%rax),%xmm1
>> mulsd %xmm1,%xmm1
>> movsd 8(%rax),%xmm0
>> mulsd %xmm0,%xmm0
>> addsd %xmm1,%xmm0
>> sqrtsd %xmm0,%xmm0
>> leaq 24(%rsp),%rsp
>> ret
>>
>> And without vectorcall (or an unaligned record type):
>>
>> leaq -24(%rsp),%rsp
>> movq %rcx,%rax
>> movq (%rax),%rdx
>> movq %rdx,(%rsp)
>> movq 8(%rax),%rax
>> movq %rax,8(%rsp)
>> movq %rsp,%rax
>> movsd (%rax),%xmm1
>> mulsd %xmm1,%xmm1
>> movsd 8(%rax),%xmm0
>> mulsd %xmm0,%xmm0
>> addsd %xmm1,%xmm0
>> sqrtsd %xmm0,%xmm0
>> leaq 24(%rsp),%rsp
>> ret
>
> With a few additions (the git patch is less than 500 lines) in the compiler I get (it is not ready for committing yet):
>
> .section .text.n_p$program_$$_cmod$complex$$real,"ax"
> .balign 16,0x90
> .globl P$PROGRAM_$$_CMOD$COMPLEX$$REAL
> .type P$PROGRAM_$$_CMOD$COMPLEX$$REAL,@function
> P$PROGRAM_$$_CMOD$COMPLEX$$REAL:
> .Lc2:
> # Var $result located in register xmm0
> # Var z located in register xmm0
> # [test.pp]
> # [20] begin
> # [22] cmod := sqrt((re * re) + (im * im));
>         mulsd %xmm0,%xmm0
>         mulsd %xmm1,%xmm1
>         addsd %xmm0,%xmm1
>         sqrtsd %xmm1,%xmm0
> # Var $result located in register xmm0
> .Lc3:
> # [23] end;
>         ret
> .Lc1:
> .Le0:
>         .size P$PROGRAM_$$_CMOD$COMPLEX$$REAL, .Le0 - P$PROGRAM_$$_CMOD$COMPLEX$$REAL
>
> It mainly keeps records in mm registers. I am not sure about the right approach yet. But allocating one register to each field of suitable records seems to be a reasonable approach.
Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
On 22/10/2019 05:01, J. Gareth Moreton wrote:
> Bigger challenges would be optimising the modulus of a complex number:
>
> function cmod (z : complex): real; vectorcall;
> { module : r = |z| }
> begin
>   with z do
>     cmod := sqrt((re * re) + (im * im));
> end;
>
> A perfect compiler with permission to use SSE3 (for haddpd) should generate the following (note that no stack frame is required):
>
> mulpd %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
> haddpd %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }
> sqrtsd %xmm0, %xmm0
> ret
>
> Currently, with vectorcall, the routine compiles into this:
>
> leaq -24(%rsp),%rsp
> movdqa %xmm0,(%rsp)
> movq %rsp,%rax
> movsd (%rax),%xmm1
> mulsd %xmm1,%xmm1
> movsd 8(%rax),%xmm0
> mulsd %xmm0,%xmm0
> addsd %xmm1,%xmm0
> sqrtsd %xmm0,%xmm0
> leaq 24(%rsp),%rsp
> ret
>
> And without vectorcall (or an unaligned record type):
>
> leaq -24(%rsp),%rsp
> movq %rcx,%rax
> movq (%rax),%rdx
> movq %rdx,(%rsp)
> movq 8(%rax),%rax
> movq %rax,8(%rsp)
> movq %rsp,%rax
> movsd (%rax),%xmm1
> mulsd %xmm1,%xmm1
> movsd 8(%rax),%xmm0
> mulsd %xmm0,%xmm0
> addsd %xmm1,%xmm0
> sqrtsd %xmm0,%xmm0
> leaq 24(%rsp),%rsp
> ret

With a few additions (the git patch is less than 500 lines) in the compiler I get (it is not ready for committing yet):

.section .text.n_p$program_$$_cmod$complex$$real,"ax"
.balign 16,0x90
.globl P$PROGRAM_$$_CMOD$COMPLEX$$REAL
.type P$PROGRAM_$$_CMOD$COMPLEX$$REAL,@function
P$PROGRAM_$$_CMOD$COMPLEX$$REAL:
.Lc2:
# Var $result located in register xmm0
# Var z located in register xmm0
# [test.pp]
# [20] begin
# [22] cmod := sqrt((re * re) + (im * im));
        mulsd %xmm0,%xmm0
        mulsd %xmm1,%xmm1
        addsd %xmm0,%xmm1
        sqrtsd %xmm1,%xmm0
# Var $result located in register xmm0
.Lc3:
# [23] end;
        ret
.Lc1:
.Le0:
        .size P$PROGRAM_$$_CMOD$COMPLEX$$REAL, .Le0 - P$PROGRAM_$$_CMOD$COMPLEX$$REAL

It mainly keeps records in mm registers. I am not sure about the right approach yet. But allocating one register to each field of suitable records seems to be a reasonable approach.
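"Suitable" is left open above; presumably it means records whose fields are all floating-point scalars that each fit in an mm register, as opposed to mixed records where an integer field forces INTEGER classification. For example (an illustrative sketch only, not taken from the patch):

type
  { Presumably suitable: every field is a floating-point scalar. }
  complex = record
    re, im : real;
  end;

  { Presumably not: the integer tag means at least part of the record belongs
    in general-purpose registers or in memory. }
  tagged_value = record
    tag   : longint;
    value : real;
  end;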
Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
This is a long read, so strap in!

Well, I finally got it to work - the required type definition was as follows:

{$push}
{$codealign RECORDMIN=16}
{$PACKRECORDS C}
{ This record forces "complex" to be aligned to a 16-byte boundary }
type
  align_dummy = record
    filler: array[0..1] of real;
  end;
{$pop}

type
  complex = record
    case Byte of
      0: ( alignment: align_dummy; );
      1: (
        re : real;
        im : real;
      );
  end;

It is so, so easy to get wrong, because if align_dummy's field is 1, 2, 4 or 8 bytes in size, it is classed as an integer under Windows, and that overrides the Double type in the union, causing the entire record to still be passed by reference. Additionally, the dummy field has to be of type Single or Double (or Real); if it is an integral type (e.g. "array[0..15] of Byte"), it is once again classified as an integer and overrides the Double type as per the rules of System V ABI parameter classification (in other words, the entire thing would get passed by reference under both x86_64-win64 and x86_64-linux etc.).

Long story short, this is an absolute minefield!! I still seriously think that having an alignment attribute or some such would make life much easier for third-party developers who may not know the exact quirks of how x86_64 classifies its parameters. To me, this trick feels incredibly hacky and very hard to get right.

The compiled code isn't perfect though - for example, when moving parameters to and from the relevant xmm registers, the "movdqa" instruction is used instead of "movapd", which causes a performance penalty because the internal CPU state has to switch between double-precision and integer (this is why, for example, there are separate VINSERTF128 and VINSERTI128 instructions, even though they superficially do the same thing). Additionally, inlined vectorcall routines still seem to fall back on using movq to transfer 8 bytes at a time between a function result and wherever it is to be stored, but this is because everything is decomposed at the node level and the compiler currently lacks any decent vectorisation algorithms.

Nevertheless, I think I'm ready to prepare a patch for uComplex for evaluation, and it's given me some things to play with to see if the compiler can be made to work with packed data better. I figure the uComplex unit is a good place to start because it's an array of 2 Doubles internally and a lot of the operations like addition are component-wise.
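As a quick sanity check (an untested sketch, not part of the planned patch), the definition above can be verified at run time: the record should report a size of 16 bytes, and instances should land on 16-byte boundaries.

var
  z: complex;
begin
  WriteLn('SizeOf(complex) = ', SizeOf(complex));          { expected: 16 }
  WriteLn('16-byte aligned: ', (PtrUInt(@z) and 15) = 0);  { expected: TRUE }
end.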
Bigger challenges would be optimising the modulus of a complex number:

function cmod (z : complex): real; vectorcall;
{ module : r = |z| }
begin
  with z do
    cmod := sqrt((re * re) + (im * im));
end;

A perfect compiler with permission to use SSE3 (for haddpd) should generate the following (note that no stack frame is required):

mulpd %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
haddpd %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }
sqrtsd %xmm0, %xmm0
ret

Currently, with vectorcall, the routine compiles into this:

leaq -24(%rsp),%rsp
movdqa %xmm0,(%rsp)
movq %rsp,%rax
movsd (%rax),%xmm1
mulsd %xmm1,%xmm1
movsd 8(%rax),%xmm0
mulsd %xmm0,%xmm0
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0
leaq 24(%rsp),%rsp
ret

And without vectorcall (or an unaligned record type):

leaq -24(%rsp),%rsp
movq %rcx,%rax
movq (%rax),%rdx
movq %rdx,(%rsp)
movq 8(%rax),%rax
movq %rax,8(%rsp)
movq %rsp,%rax
movsd (%rax),%xmm1
mulsd %xmm1,%xmm1
movsd 8(%rax),%xmm0
mulsd %xmm0,%xmm0
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0
leaq 24(%rsp),%rsp
ret

Maybe I'm in the minority here, and definitely getting ahead of myself, but seeing ways of improving the compiled assembly language excites me! Even without vectorcall, I want to see if I can get my deep optimiser into a workable form, because things like "movq %rsp,%rax" followed by merely reading from %rax are completely unnecessary. Also, things like this:

...
movdqa %xmm0,(%rsp)
movq %rsp,%rax
movsd (%rax),%xmm1
...

Just... why?! Just do "movsd %xmm0,%xmm1"!! The peephole optimiser may struggle to spot this anyway because of the inefficient mixing of integer and floating-point XMM instructions - of course, it might be the case that the original contents of %xmm0 are needed later - this is where my deep optimiser or some other form of data-flow analysis would come into play. Playing the logical flow through in my head, I can see it optimising the triplet as follows:

1. Notice that %rax = %rsp and change the movsd instruction to minimise a pipeline stall (the later "movsd 8(%rax),%xmm0" instruction would get changed too):

...
movdqa %xmm0,(%rsp)
movq %rsp,%rax
movsd (%rsp),%xmm1
...

2. Notice that %rax is