Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
It's why I like going for optimisations that try to reduce code size without sacrificing speed: they reduce the number of 16-byte or 32-byte sections. Anyhow, back to work with optimising!

Gareth aka. Kit

On 04/01/2022 19:33, Martin Frb via fpc-devel wrote:
> [...] => reducing the byte size of the loop code. That way there are
> fewer 32-byte sections.

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
On 04/01/2022 18:43, Jonas Maebe via fpc-devel wrote:
> It's very likely unrelated to anything the optimiser does or does not
> do, and more about random differences in code layout. [...] Choice
> quote: code layout and environment variables can produce up to 40%
> differences in performance, which is more than what even the best
> optimizing compilers can achieve in most cases.

Interesting... And yes, see my previous post. It seems to be a question of which "sub-section" of a loop falls into a 32-byte-aligned 32-byte block. It's not even the entire loop (it is not about the beginning of the loop), but a certain code block within it.

This also goes along with one optimisation that (even though still chance) in my test improved the timing (both worst and best time, though those are only *my* worst/best) => reducing the byte size of the loop code. That way there are fewer 32-byte sections.
Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
On 03/01/2022 12:54, Martin Frb via fpc-devel wrote:
> not sure if this is of interest to you, but I see you do a lot on the
> optimizer

It's very likely unrelated to anything the optimiser does or does not do, and more about random differences in code layout. Charlie posted the following video on core just yesterday, and it touches on exactly this subject: https://www.youtube.com/watch?v=r-TLSBdHe1A

Choice quote: code layout and environment variables can produce up to 40% differences in performance, which is more than what even the best optimizing compilers can achieve in most cases.

Jonas
Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
On 4-1-2022 17:15, J. Gareth Moreton via fpc-devel wrote:
> I neglected to include -Cpcoreavx, that was my bad. I'll try again.
> [...] POPCNT returns zero for a zero input.

Ok, that's what I thought.

I played a bit with adding code alignments to loops in the SSE code, but it only seems to slow down the core loop rather than accelerate it (aligning before the branch location and/or the branch target).

Did you have any thoughts about moving up the NOT instruction?
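As a side note, the POPCNT zero-input behaviour settled above is easy to double-check from C; a minimal sketch, assuming GCC or Clang (`__builtin_popcountll` compiles to the POPCNT instruction when the target CPU supports it, e.g. with -mpopcnt):

```c
#include <stdint.h>

/* Unlike BSF/BSR, POPCNT's result is well defined for a zero input: it
 * is simply 0 (and the instruction sets ZF). The builtin has the same
 * semantics regardless of whether the hardware instruction is used. */
static unsigned popcount64(uint64_t x) {
    return (unsigned)__builtin_popcountll(x);
}
```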
Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
I neglected to include -Cpcoreavx, that was my bad. I'll try again.

According to the Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 2B, page 4-391, the zero flag is set if the source is zero, and cleared otherwise. Regarding an undefined result, I got confused with the BSF and BSR instructions, sorry. I guess I was more tired than I thought! POPCNT returns zero for a zero input.

Gareth aka. Kit

On 04/01/2022 16:03, Marco van de Voort via fpc-devel wrote:
> I looked up popcnt and found nothing about the result not being set if
> the input is zero. (E.g. https://www.felixcloutier.com/x86/popcnt)
> [...]
Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:
>> Weird, as mine is inlined with -Cpcoreavx -O4, with no special
>> handling for 0. But that does put some things on shaky ground. Maybe
>> zero the result beforehand?
> Same here.

I looked up popcnt and found nothing about the result not being set if the input is zero. (E.g. https://www.felixcloutier.com/x86/popcnt)

I meanwhile also ran on my Ryzen 4800H laptop and updated the version on the web with the stats. The stats for the long string are about as fast as on my i7-3770 (laptop vs desktop memory bandwidth? The Ryzen should be faster in every way?!?), but the short one (40 bytes) is significantly faster.

What I don't get is why the assembler version seems systematically faster even for the short code. The generated asm is nearly the same.

Also notable: on this machine with popcnt (-Cpcoreavx), the popcnt version is as fast as the add function within error margins, so probably the popcnt instruction is faster (lower latency) and thus less of a bottleneck on this machine. Note that the POP() function is half the size, so that makes it better for newer machines.

Note that I test on Windows, so it might be that the "two times load" difference is somehow due to different code generation on Windows.

> About UTF8LengthFast() [...]
> Alignment: 16 vs 32 bytes. Can that make a difference?

According to https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache it seems to be a problem of the Skylake and later archs, which I no longer have. The i7 is too old, and the others are AMD.
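Which 32-byte block a routine starts in can be checked directly from its entry address; a minimal C sketch (the helper name is mine, not from the thread):

```c
#include <stdint.h>

/* Offset of an address within its aligned 32-byte block: 0 means the
 * routine starts exactly on a 32-byte boundary. The two entry points
 * quoted elsewhere in the thread (0x11860 and 0x11870) differ by 16. */
static unsigned offset_in_32byte_block(uintptr_t addr) {
    return (unsigned)(addr & 31u);
}
```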
Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
@Marco: I haven't played with popcnt => it could benefit from the "const to var" change too.

So I played around a bit... Of course, all this is Intel only.

1)
   var Mask8, Mask1: qword;
   ...
   Mask8 := EIGHTYMASK;
   Mask1 := ONEMASK;

   The constant is no longer assigned inside the loop. This also makes the
   loop shorter. => improves speed

2)
   //for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
   //for i := (ByteCount-cnt) div sizeof(PtrInt) - 1 downto 0 do
   i := (ByteCount-cnt) div sizeof(PtrInt);
   repeat
     dec(i);
   until i = 0;

   The original: for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
   // r9 is reserved to hold the upper bound
   .Lj26:
           addq $1,%r10
           ...
           cmpq %r10,%r9
           jnle .Lj26

   Since the counter var "i" is not accessed in the loop, its value does not
   matter. So

   for i := (ByteCount-cnt) div sizeof(PtrInt) - 1 downto 0 do
   // no longer needs to store an upper bound, but still has an extra "test"
   // since the "subq" is at the start of the loop
   .Lj26:
           subq $1,%r10
           ...
           testq %r10,%r10
           jnle .Lj26

   With "repeat", there no longer is a "test".

   And that again reduced the loop size, and apparently just below a critical
   point, as times get a little better again - WITH the constants moved to
   vars:

   orig for: 547   downto: 547   repeat: 516
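For readers more at home in C, the loop shapes compared above map onto roughly the following (a sketch; function names are illustrative): in the do/while ("repeat..until") form the decrement itself sets the flags for the branch, so the compiler needs neither a live upper bound nor a separate test inside the loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Counting up: the compiler keeps the upper bound n live in a register
 * and emits a cmp + conditional jump at the bottom of the loop. */
static uint64_t sum_for_up(const uint64_t *p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* repeat..until equivalent: "--n" both decrements the counter and
 * provides the loop condition (dec + jne), with no separate cmp/test. */
static uint64_t sum_repeat(const uint64_t *p, size_t n) {
    uint64_t s = 0;
    if (n == 0)
        return 0;
    do {
        s += *p++;
    } while (--n != 0);
    return s;
}
```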
Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
On 04/01/2022 10:31, Marco van de Voort via fpc-devel wrote:
> Weird, as mine is inlined with -Cpcoreavx -O4, with no special handling
> for 0. But that does put some things on shaky ground. Maybe zero the
> result beforehand?

Same here.

About UTF8LengthFast():

Well, before I get to this, I noticed something weird. 2 runs, compiled with the same compiler (3.2.3), and the same settings, with the only difference: -gw3 or not => and the speed differed: 600 (with dwarf) vs 700 (no dwarf) / reproducible.

So then:
=> I compiled the app with the same 3.2.3 (and I compiled WITHOUT -a, though with -al I get the same results)
=> I compiled once with -gw3, and once without dwarf info
=> I used objdump to disassemble the exe
=> I diffed (and searched for 0x101010, which is only used in the ...Fast code)

--> The assembler is identical. Yet one is faster (the one WITH dwarf).

The calling code in the main body is the same too (as far as I could see), except that the address of the callee is different (but it is just 20 calls per measurement).

I did those runs OUTSIDE the IDE, so no debugger in the background. Win64 / 64-bit Core i7 8600K.

Using 3.3.1 the speed is equal, never mind whether dwarf is generated or not. (I did not compare the asm for that...)

Clinging to straws, there is one (maybe) diff in the 3.2.3 with/without dwarf assembler.

*** I am totally out of my depth here ***

Alignment: 16 vs 32 bytes. Can that make a difference? According to https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache:

   The Decoded ICache consists of 32 sets. Each set contains eight Ways.
   Each Way can hold up to six micro-ops. All micro-ops in a Way represent
   instructions which are statically contiguous in the code and have their
   EIPs within *the same aligned 32-byte region*.
So the alignment of the 2 procedures differs by 16 bytes. The proc entry is at:

   with dwarf:    11870
   without dwarf: 11860   // actually this one is 32-byte aligned (but slower)

Yet maybe it matters which statements in the big loop happen to fall into the same 32-byte block???

The loop starting with "for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do":

With DWARF (faster):

   118f0: 49 83 c2 01            add    $0x1,%r10
   118f4: 4c 8b 19               mov    (%rcx),%r11
   118f7: 4d 89 d8               mov    %r11,%r8
   118fa: 48 bf 80 80 80 80 80   movabs $0x8080808080808080,%rdi
   11901: 80 80 80
   11904: 49 21 f8               and    %rdi,%r8
   11907: 49 c1 e8 07            shr    $0x7,%r8
   1190b: 49 f7 d3               not    %r11
   1190e: 49 c1 eb 06            shr    $0x6,%r11
   11912: 4d 21 d8               and    %r11,%r8
   11915: 4c 89 c3               mov    %r8,%rbx
   11918: 49 bb 01 01 01 01 01   movabs $0x101010101010101,%r11
   1191f: 01 01 01
   11922: 4d 0f af c3            imul   %r11,%r8
   11926: 49 c1 e8 38            shr    $0x38,%r8
   1192a: 4c 01 c6               add    %r8,%rsi
   1192d: 48 83 c1 08            add    $0x8,%rcx
   11931: 4d 39 d1               cmp    %r10,%r9
   11934: 7f ba                  jg     118f0

WITHOUT:

   118e0: 49 83 c2 01            add    $0x1,%r10
   118e4: 4c 8b 19               mov    (%rcx),%r11
   118e7: 4d 89 d8               mov    %r11,%r8
   118ea: 48 bf 80 80 80 80 80   movabs $0x8080808080808080,%rdi
   118f1: 80 80 80
   118f4: 49 21 f8               and    %rdi,%r8
   118f7: 49 c1 e8 07            shr    $0x7,%r8
   118fb: 49 f7 d3               not    %r11
   118fe: 49 c1 eb 06            shr    $0x6,%r11
   11902: 4d 21 d8               and    %r11,%r8
   11905: 4c 89 c3               mov    %r8,%rbx
   11908: 49 bb 01 01 01 01 01   movabs $0x101010101010101,%r11
   1190f: 01 01 01
   11912: 4d 0f af c3            imul   %r11,%r8
   11916: 49 c1 e8 38            shr    $0x38,%r8
   1191a: 4c 01 c6               add    %r8,%rsi
   1191d: 48 83 c1 08            add    $0x8,%rcx
   11921: 4d 39 d1               cmp    %r10,%r9
   11924: 7f ba                  jg     0x118e0

*** written before I got into the above ***

About UTF8LengthFast():

I notice 3 differences. Though I only compare the -O4 result, and have no idea what is pre-peephole and post-peephole.

1) As you say: different registers (but the same statements, in the same order). No idea if that affects the CPU.
2) One extra statement in 3.3.1:

   movq  %r10,%r8              <<< not in 3.2.3
   movq  $72340172838076673,%r10
   imulq %r10,%r8

3) I do have "dwarf" enabled.
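The constants visible in the disassembly above (0x8080808080808080 and 0x0101010101010101) implement a SWAR continuation-byte count. Here is a C sketch of the same scheme; the function name and the scalar tail loop are mine, not FPC's actual code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Per byte b: ((b & 0x80) >> 7) & (~b >> 6) is 1 exactly when b is a
 * UTF-8 continuation byte (0b10xxxxxx). Multiplying by 0x0101...01 sums
 * all eight byte lanes into the top byte (no overflow: each lane is 0
 * or 1). Codepoint count = byte count - continuation bytes. */
static size_t utf8_length(const char *s, size_t bytecount) {
    size_t continuations = 0, i = 0;
    for (; i + 8 <= bytecount; i += 8) {
        uint64_t x;
        memcpy(&x, s + i, 8);                 /* unaligned-safe load */
        uint64_t m = ((x & 0x8080808080808080ULL) >> 7) & (~x >> 6);
        continuations += (m * 0x0101010101010101ULL) >> 56;
    }
    for (; i < bytecount; i++)                /* scalar tail */
        continuations += ((unsigned char)s[i] & 0xC0) == 0x80;
    return bytecount - continuations;
}
```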
Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
On 4-1-2022 01:06, J. Gareth Moreton via fpc-devel wrote:
> Prepare for a lot of technical rambling! This is just an analysis of
> the compilation of utf8lentest.lpr, not any of the System units.
> Notably, POPCNT isn't called directly, but instead goes through the
> System unit via "call fpc_popcnt_qword" on both 3.2.x and 3.3.1. A
> future study of "fpc_popcnt_qword" might yield some interesting
> information.

Did you pass -Cpcoreavx?

   # [381] Result += PopCnt(qword(nx));
           popcntq %r10,%r8
           addq    %r8,%rbx

> In this situation, one can replace %r11 with %r13 in the 3rd and 4th
> instructions so the massive constant isn't assigned twice:

The large constant(s) should be hoisted out of the loop. This is done in the assembler code:

   .Lj11:
           movq    (%rcx),%r8
           movq    %r8,%r10
           andq    %rdx,%r10
           notq    %r8
           shrq    $7,%r10
           shrq    $6,%r8
           andq    %r8,%r10
           subq    %r8,%rax    // subtract from rax = length(s). Iow the
                               // algorithm is slightly different: don't sum a
                               // correction but directly subtract from the
                               // length. Saves bytecount from being a live
                               // value till the end.
           popcntq %r10,%r8
           addq    $8,%rcx
           decq    %r11
           jne     .Lj11

As far as I can see, the assembler version is about 25-33% faster for low counts (40 bytes). For larger strings, the SSE2 code kicks in, and that is a solid factor 2 (270 vs 538 for the fastest "add"; see the table in the comments of the source). Note that I use pop/fast because they are simpler and I only use them for low counts. The SSE2 version is more like the ADD variant.

> For actual reduction of pipeline stalls, I'm not quite sure how smart
> the Intel CPU firmware is and if rearranging instructions would
> actually improve throughput or not. For example, on 3.2.x, there are
> these four instructions that configure parameters and preserve %rax:

I tried the above code on godbolt; attached is the analysis (both for the asm loop, the popcnt loop from my code, and the popcnt loop with the movabs removed to test hoisting). Note my 3.3.1 (from 29 dec) doesn't seem to generate two loads.
With my limited assembler abilities I'd say it is not the hoisting that makes the manual asm loop faster, as I first thought, but moving the NOT up to where it is.

> Also, POPCNT is handled in the System unit and is not inlined, mainly
> because of how an input of zero is handled (the function has to return
> 255 in the result, whereas Intel's POPCNT instruction sets the zero
> flag and leaves the result undefined).

Weird, as mine is inlined with -Cpcoreavx -O4, with no special handling for 0. But that does put some things on shaky ground. Maybe zero the result beforehand?

manual asm:

   .Lj47:
           add    rsi,1
           mov    r8,qword [rcx]
           mov    r10,r8
           mov    rdi,-9187201950435737472
           and    r10,rdi
           shr    r10,7
           not    r8
           shr    r8,6
           and    r10,r8
           mov    r11,r10
           popcnt r8,r10
           add    rbx,r8
           add    rcx,8
           cmp    rsi,r9
           jnge   .Lj47

llvm-mca: -mcpu=ivybridge

   Iterations:        100
   Instructions:      1200
   Total Cycles:      378
   Total uOps:        1200
   Dispatch Width:    4
   uOps Per Cycle:    3.17
   IPC:               3.17
   Block RThroughput: 3.0

Instruction Info:
[1]: #uOps  [2]: Latency  [3]: RThroughput  [4]: MayLoad  [5]: MayStore  [6]: HasSideEffects (U)

   [1]  [2]  [3]   [4]  [5]  [6]  Instructions:
    1    5   0.50   *             mov    r8, qword ptr [rcx]
    1    1   0.33                 mov    r10, r8
    1    1   0.33                 and    r10, rdx
    1    1   0.33                 not    r8
    1    1   0.50                 shr    r10, 7
    1    1   0.50                 shr    r8, 6
    1    1   0.33                 and    r10, r8
    1    1   0.33                 sub    rax, r8
    1    3   1.00                 popcnt r8, r10
    1    1   0.33                 add    rcx, 8
    1    1   0.33                 dec    r11
    1    1   1.00                 jne    .Lrestqwords

Resources:
   [0]   - SBDivider
   [1]   - SBFPDivider
   [2]   - SBPort0
   [3]   - SBPort1
   [4]   - SBPort4
   [5]   - SBPort5
   [6.0] - SBPort23
   [6.1] - SBPort23

Resource pressure per iteration:
   [0]  [1]  [2]   [3]   [4]  [5]   [6.0] [6.1]
    -    -   3.66  3.67   -   3.67  0.50  0.50

Resource pressure by instruction:
   [0]  [1]  [2]   [3]   [4]  [5]   [6.0] [6.1]  Instructions:
    -    -    -     -     -    -    0.50  0.50   mov    r8, qword ptr [rcx]
    -    -   0.21  0.63   -   0.16   -     -     mov    r10, r8
    -    -   0.35  0.61   -   0.04   -     -     and    r10, rdx
    -    -