Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread J. Gareth Moreton via fpc-devel
It's why I like going for optimisations that try to reduce code size without sacrificing speed, because doing so reduces the number of 16-byte or 32-byte sections. Anyhow, back to work on optimising! Gareth aka. Kit On 04/01/2022 19:33, Martin Frb via fpc-devel wrote: On 04/01/2022 18:43, Jonas

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Martin Frb via fpc-devel
On 04/01/2022 18:43, Jonas Maebe via fpc-devel wrote: On 03/01/2022 12:54, Martin Frb via fpc-devel wrote: not sure if this is of interest to you, but I see you do a lot on the optimizer It's very likely unrelated to anything the optimiser does or does not do, and more regarding random di

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Jonas Maebe via fpc-devel
On 03/01/2022 12:54, Martin Frb via fpc-devel wrote: not sure if this is of interest to you, but I see you do a lot on the optimizer It's very likely unrelated to anything the optimiser does or does not do, and more regarding random differences in code layout. Charlie posted the following

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Marco van de Voort via fpc-devel
On 4-1-2022 17:15, J. Gareth Moreton via fpc-devel wrote: I neglected to include -Cpcoreavx, that was my bad. I'll try again. According to the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol 2B, page 4-391, the zero flag is set if the source is zero, and cleared otherwise.  R

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread J. Gareth Moreton via fpc-devel
I neglected to include -Cpcoreavx, that was my bad. I'll try again. According to the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol 2B, page 4-391, the zero flag is set if the source is zero, and cleared otherwise. Regarding an undefined result, I got confused with the BSF a

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Marco van de Voort via fpc-devel
On 4-1-2022 16:31, Martin Frb via fpc-devel wrote: Weird as mine is inlined with -Cpcoreavx -O4, with no special handling for 0. But that does put some things on shaky ground. Maybe zero the result beforehand? Same here. I looked up popcnt and found nothing about not setting if zero. (E.g

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Martin Frb via fpc-devel
@Marco: haven't played with popcnt => it could benefit from the "const to var" too. So I played around a bit... Of course, all this is Intel only.
1)
var
  Mask8, Mask1: qword;
  Mask8 := EIGHTYMASK;
  Mask1 := ONEMASK;
And the constant is no longer assigned inside the loop. Also makes

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Martin Frb via fpc-devel
On 04/01/2022 10:31, Marco van de Voort via fpc-devel wrote: Weird as mine is inlined with -Cpcoreavx -O4, with no special handling for 0. But that does put some things on shaky ground. Maybe zero the result beforehand? Same here. About UTF8LengthF

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Marco van de Voort via fpc-devel
On 4-1-2022 01:06, J. Gareth Moreton via fpc-devel wrote: Prepare for a lot of technical rambling! This is just an analysis of the compilation of utf8lentest.lpr, not any of the System units.  Notably, POPCNT isn't called directly, but instead goes through the System unit via "call fpc_popcnt

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-03 Thread J. Gareth Moreton via fpc-devel
Prepare for a lot of technical rambling! This is just an analysis of the compilation of utf8lentest.lpr, not any of the System units.  Notably, POPCNT isn't called directly, but instead goes through the System unit via "call fpc_popcnt_qword" on both 3.2.x and 3.3.1.  A future study of "fpc_po

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-03 Thread J. Gareth Moreton via fpc-devel
Interesting - thank you. It will be interesting to study the assembler output to see what's going on. I'm honoured that I've become the go-to person where optimisation is concerned! Gareth aka. Kit On 03/01/2022 11:54, Martin Frb via fpc-devel wrote: Hi Gareth, not sure if this is of interest

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-03 Thread Marco van de Voort via fpc-devel
On 3-1-2022 12:54, Martin Frb via fpc-devel wrote:
  fpc 3.2.3 /   fpc 3.3.1
  fst 594       fst 688
  fst 578       fst 703
  fst 578       fst 687
  fst 562       fst 688
Fyi, the latest asm version (+fst/pop/add/naieve) is at http://www.stack.nl/~marcov/utf8lentest.lpr

[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-03 Thread Martin Frb via fpc-devel
Hi Gareth, not sure if this is of interest to you, but I see you do a lot on the optimizer. While testing the attached, I found that one of the functions was notably slower when compiled with 3.3.1 (compared to 3.2.3). So maybe something you are interested in looking at? The Code in "Utf