Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread J. Gareth Moreton via fpc-devel
It's why I like going for optimisations that try to reduce code size 
without sacrificing speed, because they reduce the number of 16-byte or 
32-byte sections.  Anyhow, back to work optimising!


Gareth aka. Kit

On 04/01/2022 19:33, Martin Frb via fpc-devel wrote:

On 04/01/2022 18:43, Jonas Maebe via fpc-devel wrote:

On 03/01/2022 12:54, Martin Frb via fpc-devel wrote:
not sure if this is of interest to you, but I see you do a lot on 
the optimizer


It's very likely unrelated to anything the optimiser does or does not 
do, and more likely due to random differences in code layout. Charlie 
posted the following video on core just yesterday, and it touches on 
exactly this subject: https://www.youtube.com/watch?v=r-TLSBdHe1A


Choice quote: code layout and environment variables can produce up to 
40% differences in performance, which is more than what even the best 
optimizing compilers can achieve in most cases.


Interesting...

And yes, see my previous post. It seems to depend on which "sub-section" of a 
loop falls into a 32-byte-aligned 32-byte block.
It's not even about the entire loop (nor about the beginning of the loop), 
but about a certain code block within it.


This also goes along with one optimization that (even though still 
subject to chance) in my test improved the timing (both worst and best 
time, though those are only *my* worst/best):

=> reducing the byte size of the loop code.
That way there are fewer 32-byte sections.





Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Martin Frb via fpc-devel

On 04/01/2022 18:43, Jonas Maebe via fpc-devel wrote:

On 03/01/2022 12:54, Martin Frb via fpc-devel wrote:
not sure if this is of interest to you, but I see you do a lot on the 
optimizer


It's very likely unrelated to anything the optimiser does or does not 
do, and more likely due to random differences in code layout. Charlie 
posted the following video on core just yesterday, and it touches on 
exactly this subject: https://www.youtube.com/watch?v=r-TLSBdHe1A


Choice quote: code layout and environment variables can produce up to 
40% differences in performance, which is more than what even the best 
optimizing compilers can achieve in most cases.


Interesting...

And yes, see my previous post. It seems to depend on which "sub-section" of a 
loop falls into a 32-byte-aligned 32-byte block.
It's not even about the entire loop (nor about the beginning of the loop), 
but about a certain code block within it.


This also goes along with one optimization that (even though still 
subject to chance) in my test improved the timing (both worst and best 
time, though those are only *my* worst/best):

=> reducing the byte size of the loop code.
That way there are fewer 32-byte sections.


Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Jonas Maebe via fpc-devel

On 03/01/2022 12:54, Martin Frb via fpc-devel wrote:
not sure if this is of interest to you, but I see you do a lot on the 
optimizer


It's very likely unrelated to anything the optimiser does or does not 
do, and more likely due to random differences in code layout. Charlie 
posted the following video on core just yesterday, and it touches on 
exactly this subject: https://www.youtube.com/watch?v=r-TLSBdHe1A


Choice quote: code layout and environment variables can produce up to 
40% differences in performance, which is more than what even the best 
optimizing compilers can achieve in most cases.



Jonas


Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Marco van de Voort via fpc-devel


On 4-1-2022 17:15, J. Gareth Moreton via fpc-devel wrote:

I neglected to include -Cpcoreavx, that was my bad.  I'll try again.

According to the Intel® 64 and IA-32 Architectures Software Developer’s 
Manual, Vol. 2B, page 4-391, the zero flag is set if the source is 
zero, and cleared otherwise.  Regarding an undefined result, I got 
confused with the BSF and BSR instructions, sorry.  I guess I was more 
tired than I thought!  POPCNT returns zero for a zero input.


Ok, that's what I thought.

I played a bit with adding code alignment to loops in the SSE code, but 
it only seems to slow down the core loop rather than accelerate it 
(aligning before the branch location and/or the branch target).


Did you have any thoughts about moving the NOT instruction up?


Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread J. Gareth Moreton via fpc-devel

I neglected to include -Cpcoreavx, that was my bad.  I'll try again.

According to the Intel® 64 and IA-32 Architectures Software Developer’s 
Manual, Vol. 2B, page 4-391, the zero flag is set if the source is 
zero, and cleared otherwise.  Regarding an undefined result, I got 
confused with the BSF and BSR instructions, sorry.  I guess I was more 
tired than I thought!  POPCNT returns zero for a zero input.


Gareth aka. Kit

On 04/01/2022 16:03, Marco van de Voort via fpc-devel wrote:

On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:


Weird, as mine is inlined with -Cpcoreavx -O4, with no special 
handling for 0. But that does put some things on shaky ground. Maybe 
zero the result beforehand?


Same here.


I looked up popcnt and found nothing about the result not being set for a 
zero input (e.g. https://www.felixcloutier.com/x86/popcnt ).


Meanwhile I also ran it on my Ryzen 4800H laptop and updated the version 
on the web with the stats. The stats for the long string are about as 
fast as on my i7-3770 (laptop vs desktop memory bandwidth? The Ryzen 
should be faster in every way?!), but the short one (40 bytes) is 
significantly faster. What I don't get is why the assembler version 
seems systematically faster even for the short code. The generated asm 
is nearly the same.


Also notable is that on this machine with popcnt (-Cpcoreavx), the 
popcnt version is as fast as the add function within error margins, so 
probably the popcnt instruction is faster (lower latency) and thus less 
of a bottleneck on this machine.  Note that the POP() function is half 
the size, which makes it better for newer machines.


-

Note that I test on Windows, so it might be that the "two times load" 
difference is somehow due to different code generation on Windows.





About UTF8LengthFast()

Well, before I get to this, I noted something weird.

Two runs, compiled with the same compiler (3.2.3) and the same 
settings, with the only difference being -gw3 or no -gw3
=> And the speed differed: 600 (with DWARF) vs 700 (no DWARF), 
reproducibly.


I have also seen this while working on the code, and indeed mainly with 
the "fast" one. It also explains why the assembler version is always 
consistent: it suffers less from small code changes (and thus different 
alignment) when I e.g. update FPC from git (assuming that the section 
starts are always aligned).



Alignment. 16 vs 32 bytes. Can that make a difference?
According to: 
https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache


Seems to be a problem of the Skylake and later archs, which I no 
longer have. The i7 is too old, and the others are AMD.








Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Marco van de Voort via fpc-devel

On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:


Weird, as mine is inlined with -Cpcoreavx -O4, with no special 
handling for 0. But that does put some things on shaky ground. Maybe 
zero the result beforehand?


Same here.


I looked up popcnt and found nothing about the result not being set for a 
zero input (e.g. https://www.felixcloutier.com/x86/popcnt ).


Meanwhile I also ran it on my Ryzen 4800H laptop and updated the version 
on the web with the stats. The stats for the long string are about as 
fast as on my i7-3770 (laptop vs desktop memory bandwidth? The Ryzen 
should be faster in every way?!), but the short one (40 bytes) is 
significantly faster. What I don't get is why the assembler version 
seems systematically faster even for the short code. The generated asm 
is nearly the same.


Also notable is that on this machine with popcnt (-Cpcoreavx), the 
popcnt version is as fast as the add function within error margins, so 
probably the popcnt instruction is faster (lower latency) and thus less 
of a bottleneck on this machine.  Note that the POP() function is half 
the size, which makes it better for newer machines.


-

Note that I test on Windows, so it might be that the "two times load" 
difference is somehow due to different code generation on Windows.





About UTF8LengthFast()

Well, before I get to this, I noted something weird.

Two runs, compiled with the same compiler (3.2.3) and the same 
settings, with the only difference being -gw3 or no -gw3
=> And the speed differed: 600 (with DWARF) vs 700 (no DWARF), 
reproducibly.


I have also seen this while working on the code, and indeed mainly with 
the "fast" one. It also explains why the assembler version is always 
consistent: it suffers less from small code changes (and thus different 
alignment) when I e.g. update FPC from git (assuming that the section 
starts are always aligned).



Alignment. 16 vs 32 bytes. Can that make a difference?
According to: 
https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache


Seems to be a problem of the Skylake and later archs, which I no longer 
have. The i7 is too old, and the others are AMD.





Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Martin Frb via fpc-devel
@Marco: haven't played with popcnt => it could benefit from the "const to 
var" change too.


So I played around a bit...

Of course, all this is Intel only

1)
var
  Mask8, Mask1: qword;

  Mask8 := EIGHTYMASK;
  Mask1 := ONEMASK;

And the constants are no longer assigned inside the loop.
This also makes the loop shorter.

=> improves speed
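
To make 1) concrete, here is a small self-contained sketch of such a counting 
loop with the two masks held in locals (the function name, parameter shape 
and the exact mask handling are my assumptions; the real UTF8LengthFast also 
deals with alignment and the tail bytes):

function CountContinuationBytes(p: pqword; Words: ptrint): ptrint;
const
  EIGHTYMASK = qword($8080808080808080);
  ONEMASK    = qword($0101010101010101);
var
  Mask8, Mask1, nx: qword;
  cnt, i: ptrint;
begin
  Mask8 := EIGHTYMASK;  // assigned once, before the loop
  Mask1 := ONEMASK;
  cnt := 0;
  {$push}{$overflowchecks off}  // nx * Mask1 intentionally wraps
  for i := 1 to Words do
  begin
    // low bit of each byte becomes 1 for UTF-8 continuation bytes (10xxxxxx)
    nx := ((p^ and Mask8) shr 7) and ((not p^) shr 6);
    // horizontal byte sum via multiply; the total ends up in the top byte
    cnt := cnt + (nx * Mask1) shr ((sizeof(qword) - 1) * 8);
    inc(p);
  end;
  {$pop}
  CountContinuationBytes := cnt;
end;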

2)
  //for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
  //for i := (ByteCount-cnt) div sizeof(PtrInt) - 1 downto 0 do
  i := (ByteCount-cnt) div sizeof(PtrInt);
  repeat
    
  dec(i);
  until i = 0;


The orig:
  for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
// r9 is reserved to hold the upper bound
.Lj26:
    addq    $1,%r10
...
    cmpq    %r10,%r9
    jnle    .Lj26

Since the counter var "i" is not accessed in the loop, its value does 
not matter.

So
  for i := (ByteCount-cnt) div sizeof(PtrInt) - 1 downto 0 do
// no longer needs to store an upper bound, but still has an extra 
"test" since the "subq" is at the start of the loop

.Lj26:
    subq    $1,%r10
...
    testq    %r10,%r10
    jnle    .Lj26

// "repeat " , and there no longer is a "test"
And that again reduced the loop size.
And apparently just below a critical point, as time get a little better 
again
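
For reference, a sketch of the loop tail the repeat form should produce 
(register choice assumed to match the listings above): the subq now sits 
directly before the branch, so its flags feed the jump and no separate 
testq is needed.

.Lj26:
...
    subq    $1,%r10
    jne    .Lj26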


-
WITH the constants moved to vars:
  orig for: 547
  downto:   547
  repeat:   516



Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Martin Frb via fpc-devel

On 04/01/2022 10:31, Marco van de Voort via fpc-devel wrote:



Weird, as mine is inlined with -Cpcoreavx -O4, with no special 
handling for 0. But that does put some things on shaky ground. Maybe 
zero the result beforehand?


Same here.


About UTF8LengthFast()

Well, before I get to this, I noted something weird.

Two runs, compiled with the same compiler (3.2.3) and the same 
settings, with the only difference being -gw3 or no -gw3
=> And the speed differed: 600 (with DWARF) vs 700 (no DWARF), 
reproducibly.


So then
=> I compiled the app, with the same 3.2.3  (and I compiled WITHOUT -a  
/ though with -al I get the same results)

=> I compiled once with -gw3 , and once without dwarf info
=> I used objdump to dis-assemble the exe.
=> I diffed (and searched for 0x101010 (only used in the ...Fast code)
--> _*The assembler is identical.*_
Yet one is faster. (the one WITH dwarf)
The calling code in the main body is the same too (as far as I could 
see), except that the address of the callee is different (but it is just 
20 calls per measurement)


I did those runs OUTSIDE the IDE.
So no debugger in the background.

Win64 / 64 bit
Core I7 8600K

Using 3.3.1 the speed is equal, regardless of whether DWARF is generated or not.
(I did not compare the asm for that...)

-
Clutching at straws, there is one (maybe) difference between the 3.2.3 
with/without DWARF assembler.

*** I am totally out of my depth here 

Alignment. 16 vs 32 bytes. Can that make a difference?
According to: 
https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache
    The Decoded ICache consists of 32 sets. Each set contains eight 
Ways. Each Way can hold up to six micro-ops.


    All micro-ops in a Way represent instructions which are statically 
contiguous in the code and have their EIPs within _*the same aligned 
32-byte region*_.


So the alignment of the two procedures differs by 16 bytes.
The proc entry is at:
  with DWARF:    11870
  without DWARF: 11860  // actually this one is 32-byte aligned (but slower)

Yet, maybe it matters which statements in the big loop happen to fall 
into the same 32byte block???


The loop starting with
   for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do

With DWARF (faster):
   118f0:    49 83 c2 01      add    $0x1,%r10
   118f4:    4c 8b 19     mov    (%rcx),%r11
   118f7:    4d 89 d8     mov    %r11,%r8
   118fa:    48 bf 80 80 80 80 80     movabs $0x8080808080808080,%rdi
   11901:    80 80 80
   11904:    49 21 f8     and    %rdi,%r8
   11907:    49 c1 e8 07      shr    $0x7,%r8
   1190b:    49 f7 d3     not    %r11
   1190e:    49 c1 eb 06      shr    $0x6,%r11
   11912:    4d 21 d8     and    %r11,%r8
   11915:    4c 89 c3     mov    %r8,%rbx
   11918:    49 bb 01 01 01 01 01     movabs $0x101010101010101,%r11
   1191f:    01 01 01
   11922:    4d 0f af c3      imul   %r11,%r8
   11926:    49 c1 e8 38      shr    $0x38,%r8
   1192a:    4c 01 c6     add    %r8,%rsi
   1192d:    48 83 c1 08      add    $0x8,%rcx
   11931:    4d 39 d1     cmp    %r10,%r9
   11934:    7f ba        jg 118f0 



WITHOUT:
   118e0:    49 83 c2 01      add    $0x1,%r10
   118e4:    4c 8b 19     mov    (%rcx),%r11
   118e7:    4d 89 d8     mov    %r11,%r8
   118ea:    48 bf 80 80 80 80 80     movabs $0x8080808080808080,%rdi
   118f1:    80 80 80
   118f4:    49 21 f8     and    %rdi,%r8
   118f7:    49 c1 e8 07      shr    $0x7,%r8
   118fb:    49 f7 d3     not    %r11
   118fe:    49 c1 eb 06      shr    $0x6,%r11
   11902:    4d 21 d8     and    %r11,%r8
   11905:    4c 89 c3     mov    %r8,%rbx
   11908:    49 bb 01 01 01 01 01     movabs $0x101010101010101,%r11
   1190f:    01 01 01
   11912:    4d 0f af c3      imul   %r11,%r8
   11916:    49 c1 e8 38      shr    $0x38,%r8
   1191a:    4c 01 c6     add    %r8,%rsi
   1191d:    48 83 c1 08      add    $0x8,%rcx
   11921:    4d 39 d1     cmp    %r10,%r9
   11924:    7f ba        jg 0x118e0
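
A quick way to see which aligned 32-byte region an instruction falls into 
is to clear the low 5 bits of its address; a small hypothetical helper 
(name is mine):

function Region32(Addr: PtrUInt): PtrUInt;
begin
  // Clear the low 5 bits: $118F0 -> $118E0, $11934 -> $11920, etc.
  Region32 := Addr and not PtrUInt($1F);
end;

Going by that, the with-DWARF loop ($118F0..$11934) and the without-DWARF 
loop ($118E0..$11924) both span the same three regions ($118E0, $11900, 
$11920), just with the instructions distributed differently across them.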



*** written before I got into the above..
---

About UTF8LengthFast()

I notice 3 differences.
Though I only compare the -O4 result, and have no idea what is 
pre-peephole and what is post-peephole.


1) As you say: different registers (but the same statements, in the same 
order).

No idea if that affects the CPU.

2) One extra statement in 3.3.1

    movq    %r10,%r8  <<< not in 3.2.3
    movq    $72340172838076673,%r10
    imulq    %r10,%r8

3) I do have "dwarf" enabled. 

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Marco van de Voort via fpc-devel


On 4-1-2022 01:06, J. Gareth Moreton via fpc-devel wrote:

Prepare for a lot of technical rambling!

This is just an analysis of the compilation of utf8lentest.lpr, not 
any of the System units.  Notably, POPCNT isn't called directly, but 
instead goes through the System unit via "call fpc_popcnt_qword" on 
both 3.2.x and 3.3.1.  A future study of "fpc_popcnt_qword" might 
yield some interesting information.


Did you pass -Cpcoreavx?

# [381] Result += PopCnt(qword(nx));
    popcntq    %r10,%r8
    addq    %r8,%rbx



In this situation, one can replace %r11 with %r13 in the 3rd and 4th 
instructions so the massive constant isn't assigned twice:


The large constant(s) should be hoisted out of the loop. This is done in 
the assembler code:


.Lj11:
    movq    (%rcx),%r8
    movq    %r8,%r10
    andq    %rdx,%r10
    notq    %r8
    shrq    $7,%r10
    shrq    $6,%r8
    andq    %r8,%r10
    subq    %r8,%rax   // subtract from rax=length(s). IOW the algorithm is
                       // slightly different: don't sum a correction, but
                       // directly subtract from the length. Saves bytecount
                       // from being a live value till the end.


    popcntq    %r10,%r8
    addq    $8,%rcx
    decq    %r11
    jne    .Lj11

As far as I can see the assembler version is about 25-33% faster for low 
counts (40 bytes). For larger strings, the SSE2 code kicks in and that 
is a solid factor of 2 (270 vs 538 for the fastest "add"; see the table 
in the comments of the source).


Note that I use pop/fast because they are simpler and I only use them for 
low counts. The SSE2 version is more like the ADD variant.


For actual reduction of pipeline stalls, I'm not quite sure how smart 
the Intel CPU firmware is and if rearranging instructions would 
actually improve throughput or not.  For example, on 3.2.x, there are 
these four instructions that configure parameters and preserve %rax:


I tried the above code on Godbolt; attached is the analysis (both for 
the asm loop, the popcnt loop from my code, and the popcnt loop with the 
movabs removed to test hoisting).


Note my 3.3.1 (from 29 dec) doesn't seem to generate two loads. With my 
limited assembler abilities I'd say it is not the hoisting that makes 
the manual asm loop faster as I first thought, but moving the NOT up to 
where it is.


Also, POPCNT is handled in the System unit and is not inlined, mainly 
because of how an input of zero is handled (the function has to return 
255 in the result, whereas Intel's POPCNT instruction sets the zero 
flag and leaves the result undefined).


Weird, as mine is inlined with -Cpcoreavx -O4, with no special 
handling for 0. But that does put some things on shaky ground. Maybe 
zero the result beforehand?


manual asm :
--

.Lj47:
add rsi,1
mov r8,qword [rcx]
mov r10,r8
mov rdi,-9187201950435737472
and r10,rdi
shr r10,7
not r8
shr r8,6
and r10,r8
mov r11,r10
popcnt  r8,r10
add rbx,r8
add rcx,8
cmp rsi,r9
jnge    .Lj47

llvm-mca: -mcpu=ivybridge

Iterations:        100
Instructions:      1200
Total Cycles:      378
Total uOps:        1200

Dispatch Width:    4
uOps Per Cycle:    3.17
IPC:               3.17
Block RThroughput: 3.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      5     0.50    *                   mov    r8, qword ptr [rcx]
 1      1     0.33                        mov    r10, r8
 1      1     0.33                        and    r10, rdx
 1      1     0.33                        not    r8
 1      1     0.50                        shr    r10, 7
 1      1     0.50                        shr    r8, 6
 1      1     0.33                        and    r10, r8
 1      1     0.33                        sub    rax, r8
 1      3     1.00                        popcnt r8, r10
 1      1     0.33                        add    rcx, 8
 1      1     0.33                        dec    r11
 1      1     1.00                        jne    .Lrestqwords

Resources:
[0]   - SBDivider
[1]   - SBFPDivider
[2]   - SBPort0
[3]   - SBPort1
[4]   - SBPort4
[5]   - SBPort5
[6.0] - SBPort23
[6.1] - SBPort23

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]
 -      -     3.66   3.67    -     3.67   0.50   0.50

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  Instructions:
 -      -      -      -      -      -     0.50   0.50   mov    r8, qword ptr [rcx]
 -      -     0.21   0.63    -     0.16    -      -     mov    r10, r8
 -      -     0.35   0.61    -     0.04    -      -     and    r10, rdx
 -  -