Re: [fpc-devel] Experimentation: "Branch stitching"

2022-11-28 Thread Martin Frb via fpc-devel

On 28/11/2022 16:37, Martin Frb via fpc-devel wrote:


"11.3μop cache"


Apart from the μop cache there is the normal loading into the cache.

I must admit I am not sure on the exact workings, but wasn't there 
something like loading entire cachelines?  If that is so (not sure), 
then of course moving other code in between means potentially pushing 
the current code out of the same cache line?


Then the decision needs information about which code is executed more 
often/likely.
(Also, in that case I don't know if the conditional branch could be 
negated (be vs bne / bg <> ble ...) without affecting the branch 
predictor, i.e. leaving the order but dropping the jmp.)
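
For the x86 listings in this thread, the equivalents of "be vs bne / bg <> ble" would be je/jne and jg/jle. Below is a minimal sketch of the negation only. The helper name invert_jcc is hypothetical, and the sketch says nothing about how the branch predictor reacts. The idea is that "je A; jmp B; A:" could become "jne B; A:", which keeps the block order but drops the jmp.

```python
# Condition-code suffixes paired with their logical opposites
# (a hand-written subset; aliases such as "nge" are left out).
_PAIRS = [("e", "ne"), ("z", "nz"), ("g", "le"), ("ge", "l"),
          ("a", "be"), ("ae", "b"), ("s", "ns"), ("o", "no"),
          ("p", "np"), ("c", "nc")]

_OPPOSITE = {}
for cc, inv in _PAIRS:
    _OPPOSITE[cc] = inv
    _OPPOSITE[inv] = cc


def invert_jcc(mnemonic):
    """Return the conditional jump with the opposite condition (je -> jne)."""
    cc = mnemonic[1:] if mnemonic.startswith("j") else ""
    if cc not in _OPPOSITE:
        # jmp, jecxz and friends have no single-instruction opposite here.
        raise ValueError("not an invertible conditional jump: " + mnemonic)
    return "j" + _OPPOSITE[cc]
```

For example, invert_jcc("je") gives "jne" and invert_jcc("jg") gives "jle".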


Of course there may also be the question of the distance, and how many 
bytes each jump / conditional jump needs to encode that distance. (Again, I 
don't know.)
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
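
On the jump-distance question at the end of the message above: in 32-bit and 64-bit mode, x86 has a 2-byte short form (8-bit signed displacement) for both jmp and the conditional jumps. The near form with a 32-bit displacement is 5 bytes for jmp and 6 bytes for a conditional jump. A minimal sketch of the size choice follows. The function name jump_size is hypothetical, and the addresses are assumed to be already known, whereas a real assembler has to iterate because the sizes change the addresses.

```python
def jump_size(src, dst, conditional):
    """Encoded size in bytes of a jump located at src that targets dst.

    The displacement is measured from the end of the jump instruction,
    so the short form (2 bytes) is tried first.
    """
    short_disp = dst - (src + 2)
    if -128 <= short_disp <= 127:
        return 2                      # EB cb / 7x cb
    return 6 if conditional else 5    # 0F 8x cd / E9 cd
```

So moving a block further away from its jump can grow a conditional jump from 2 to 6 bytes, which is one more thing a stitching decision might weigh.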


Re: [fpc-devel] Experimentation: "Branch stitching"

2022-11-28 Thread Martin Frb via fpc-devel

On 28/11/2022 16:19, J. Gareth Moreton via fpc-devel wrote:
I admit I can be disorganised sometimes and lose documents, so I 
apologise if you have sent them already and I mislaid them in my mess 
of a directory tree.  Believe me though, I want to swallow all of this 
up if it means squeezing out every cycle I can out of the generated 
machine code!


Curious to know... at which point did it become favourable to do a 
32-byte align rather than a 16-byte align on x86 processors? Should 
the compiler start favouring 32-byte aligns for loops, say?


https://www.agner.org/optimize/optimizing_assembly.pdf

"11.3μop cache"

I couldn't find the 32byte align in that doc though. I must have picked 
that up elsewhere. (I think).


Re: [fpc-devel] Experimentation: "Branch stitching"

2022-11-28 Thread J. Gareth Moreton via fpc-devel
I admit I can be disorganised sometimes and lose documents, so I 
apologise if you have sent them already and I mislaid them in my mess of 
a directory tree.  Believe me though, I want to swallow all of this up 
if it means squeezing out every cycle I can out of the generated machine 
code!


Curious to know... at which point did it become favourable to do a 
32-byte align rather than a 16-byte align on x86 processors? Should the 
compiler start favouring 32-byte aligns for loops, say?


Kit

On 28/11/2022 13:52, Martin Frb via fpc-devel wrote:

On 28/11/2022 14:32, J. Gareth Moreton via fpc-devel wrote:

On 28/11/2022 12:59, Martin Frb via fpc-devel wrote:

Well first of all, you didn't move the balign in front of .Lj732


I do move the alignment hints, but if the label becomes dead (due to 
the zero-distance jump being 'collapsed'), the alignment hint gets 
removed.  It's an experiment in progress.


Ah, yes right.
Anyway this may be more of a 32 byte thing, and the 16 byte align is 
at best a 50/50 game


I once had a better source on the topic (also it might be in the pdf I 
once sent) but for now:

https://superuser.com/questions/1368480/how-is-the-micro-op-cache-tagged

Each 32B window (from the instruction cache) is mapped into the uop 
cache.
(In the case of an outer loop:) Due to the size of that cache, and 
depending on what else is executed, uops may or may not be cached (this 
also only matters if the moved block is frequently entered, i.e. inside 
a loop).
But ultimately, the 16 byte aligns are not meant for that. Though if a 
user used a directive to set a 32 byte align => then that may matter.
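
A small sketch of why a 16 byte align is "at best a 50/50 game" with respect to 32B windows. It assumes the 32-byte window size from the linked answer (other microarchitectures may differ), and the function name windows_touched is hypothetical. It only counts how many aligned windows a block of code overlaps.

```python
def windows_touched(start, size, window=32):
    """Number of aligned `window`-byte regions that [start, start+size) overlaps."""
    if size <= 0:
        return 0
    first = start // window
    last = (start + size - 1) // window
    return last - first + 1
```

A 24-byte block that starts at 0x20 stays inside one window, while the same block at 0x10 (also 16-byte aligned) straddles two. With window=64 the same function can be used for cache lines.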





Re: [fpc-devel] Experimentation: "Branch stitching"

2022-11-28 Thread Martin Frb via fpc-devel

On 28/11/2022 14:32, J. Gareth Moreton via fpc-devel wrote:

On 28/11/2022 12:59, Martin Frb via fpc-devel wrote:

Well first of all, you didn't move the balign in front of .Lj732


I do move the alignment hints, but if the label becomes dead (due to 
the zero-distance jump being 'collapsed'), the alignment hint gets 
removed.  It's an experiment in progress.


Ah, yes right.
Anyway this may be more of a 32 byte thing, and the 16 byte align is at 
best a 50/50 game


I once had a better source on the topic (also it might be in the pdf I 
once sent) but for now:

https://superuser.com/questions/1368480/how-is-the-micro-op-cache-tagged


Each 32B window (from the instruction cache) is mapped into the uop cache.
(In the case of an outer loop:) Due to the size of that cache, and 
depending on what else is executed, uops may or may not be cached (this 
also only matters if the moved block is frequently entered, i.e. inside 
a loop).
But ultimately, the 16 byte aligns are not meant for that. Though if a 
user used a directive to set a 32 byte align => then that may matter.





Re: [fpc-devel] Experimentation: "Branch stitching"

2022-11-28 Thread J. Gareth Moreton via fpc-devel

On 28/11/2022 12:59, Martin Frb via fpc-devel wrote:

On 28/11/2022 07:22, J. Gareth Moreton via fpc-devel wrote:

...
    testb   %al,%al
    je      .Lj733
    subb    $1,%al
    je      .Lj734
    jmp     .Lj732
    .balign 16,0x90
.Lj733:
    ...
    jmp    .Lj718
    .balign 16,0x90
.Lj732:
    movl    $2019050530,%ecx
    call    VERBOSE_$$_INTERNALERROR$LONGINT
    jmp    .Lj718

The block with the internal error can be moved and 'stitched' to the 
"jmp .Lj732" instruction.


    ...
    testb    %al,%al
    je    .Lj733
    subb    $1,%al
    je    .Lj734
    movl    $2019050530,%ecx
    call    VERBOSE_$$_INTERNALERROR$LONGINT
    jmp    .Lj718
    .balign 16,0x90
.Lj733:
    ...

I'm still working a few things out, since it can move the function 
epilogue which makes things harder to read.  Currently I'm only 
moving blocks where the label only has a single reference, thereby 
causing a dead label when it's stitched alongside its corresponding 
jump.  This avoids problems where the label is referenced in a data 
block that's distinct from the assembly and where moving it may cause 
problems.
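
The single-reference rule described above can be sketched on a plain list of assembly lines. This is only an illustration under simplifying assumptions, not the compiler's actual code. The names stitch_once, _ref_count and so on are hypothetical, a block is assumed to run from its label to the next jmp or ret, and blocks containing further labels are skipped. Any line that mentions the label counts as a reference, so a label that is also used in a data block would not be moved.

```python
import re


def _mnemonic(line):
    parts = line.split()
    return parts[0] if parts else ""


def _ends_flow(line):
    # Unconditional transfer: execution cannot fall through past it.
    return _mnemonic(line) in ("jmp", "ret")


def _is_align(line):
    return _mnemonic(line) in (".balign", ".p2align")


def _ref_count(lines, label):
    # Count lines that mention the label, excluding its own definition.
    pat = re.compile(r"(?<![\w.$])" + re.escape(label) + r"(?![\w.$])")
    return sum(1 for l in lines
               if l.strip() != label + ":" and pat.search(l))


def stitch_once(lines):
    """Move the first stitchable block next to its jmp; else return lines."""
    for j, line in enumerate(lines):
        parts = line.split()
        if len(parts) != 2 or parts[0] != "jmp":
            continue
        label = parts[1]
        if _ref_count(lines, label) != 1:
            continue
        d = next((i for i, l in enumerate(lines)
                  if l.strip() == label + ":"), None)
        if d is None:
            continue
        # The alignment hint in front of the label goes away with the label.
        first = d - 1 if d > 0 and _is_align(lines[d - 1]) else d
        # Nothing may fall through into the block from above.
        if first == 0 or not _ends_flow(lines[first - 1]):
            continue
        end = next((i for i in range(d + 1, len(lines))
                    if _ends_flow(lines[i])), None)
        if end is None or first <= j <= end:
            continue
        body = lines[d + 1:end + 1]
        if any(l.strip().endswith(":") for l in body):
            continue  # simplification: do not move blocks with inner labels
        out = []
        for i, l in enumerate(lines):
            if i == j:
                out.extend(body)       # replace the jmp with the block
            elif not (first <= i <= end):
                out.append(l)          # drop align hint, label and old block
        return out
    return lines
```

Fed the first listing from this message (one instruction per list entry), it should produce the second listing, with the .balign in front of .Lj732 removed together with the dead label.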


Well first of all, you didn't move the balign in front of .Lj732


I do move the alignment hints, but if the label becomes dead (due to the 
zero-distance jump being 'collapsed'), the alignment hint gets removed.  
It's an experiment in progress.


In the above example, that may be an improvement (most likely) because 
if the label really is referred once only (and thereby is also not a 
loop) then it may not be beneficial to align it (except maybe if the 
user specified a non default align?).
If the label is referred only once, but the whole thing is inside a 
loop, it may still be relevant to have the align? (Not sure, 
depends on how the CPU caches stuff.)


Another thing is, that moving the block can make the other part of the 
loop longer (needing more cache). If this branch-to-be-moved is rarely 
entered, it may want to be after the final "jmp-to-loop-start" of the 
normal branch?
Of course, if the loop is bigger than the block with the branches, and 
we did know that the branch is some sort of exception only, then we 
would want to move it even further away, to get it out of the loop.


It's a good point.  I'll have to work out which situations will be fine 
and which will increase the cache usage.  How is a procedure loaded into 
the CPU cache?  Is there some good documentation on this?  I always 
wondered whether the whole thing, or at least as much as possible, was 
loaded sequentially, with the alignment hints mostly there to avoid 
partial reads.


--
Btw, .balign N, 0x90 => isn't there an align that uses multibyte nop 
(like) instructions? (I posted some pdf to you a while back, iirc it 
points that out)


There is - it's the .p2align directive.  I'm not sure why the compiler 
mixes and matches them though.
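
As far as the GNU assembler is concerned, the two directives differ in how the boundary is written: .balign takes the boundary in bytes, .p2align takes it as a power of two, so ".p2align 4" and ".balign 16" ask for the same alignment. With an explicit fill of 0x90, every padding byte is a single-byte NOP. According to the GNU as documentation, leaving the fill value out in a code section lets the assembler pad with no-op instructions of its own choosing. A minimal sketch of how much padding either directive inserts (the helper names are hypothetical):

```python
def balign_padding(addr, boundary):
    """Padding bytes needed so that addr becomes a multiple of boundary."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a power of two")
    return -addr % boundary


def p2align_padding(addr, power):
    """Same as balign_padding, with the boundary given as 2**power."""
    return balign_padding(addr, 1 << power)
```

At address 0x1005, ".balign 16,0x90" would therefore emit 11 single-byte NOPs, and a 32-byte alignment would need 27 bytes of padding.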


Kit


Re: [fpc-devel] Experimentation: "Branch stitching"

2022-11-28 Thread Martin Frb via fpc-devel

On 28/11/2022 07:22, J. Gareth Moreton via fpc-devel wrote:

...
    testb   %al,%al
    je      .Lj733
    subb    $1,%al
    je      .Lj734
    jmp     .Lj732
    .balign 16,0x90
.Lj733:
    ...
    jmp    .Lj718
    .balign 16,0x90
.Lj732:
    movl    $2019050530,%ecx
    call    VERBOSE_$$_INTERNALERROR$LONGINT
    jmp    .Lj718

The block with the internal error can be moved and 'stitched' to the 
"jmp .Lj732" instruction.


    ...
    testb    %al,%al
    je    .Lj733
    subb    $1,%al
    je    .Lj734
    movl    $2019050530,%ecx
    call    VERBOSE_$$_INTERNALERROR$LONGINT
    jmp    .Lj718
    .balign 16,0x90
.Lj733:
    ...

I'm still working a few things out, since it can move the function 
epilogue which makes things harder to read.  Currently I'm only moving 
blocks where the label only has a single reference, thereby causing a 
dead label when it's stitched alongside its corresponding jump.  This 
avoids problems where the label is referenced in a data block that's 
distinct from the assembly and where moving it may cause problems.


Well first of all, you didn't move the balign in front of .Lj732

In the above example, that may be an improvement (most likely) because 
if the label really is referred once only (and thereby is also not a 
loop) then it may not be beneficial to align it (except maybe if the 
user specified a non default align?).
If the label is referred only once, but the whole thing is inside a 
loop, it may still be relevant to have the align? (Not sure, depends on 
how the CPU caches stuff.)


Another thing is, that moving the block can make the other part of the 
loop longer (needing more cache). If this branch-to-be-moved is rarely 
entered, it may want to be after the final "jmp-to-loop-start" of the 
normal branch?
Of course, if the loop is bigger than the block with the branches, and 
we did know that the branch is some sort of exception only, then we 
would want to move it even further away, to get it out of the loop.


--
Btw, .balign N, 0x90 => isn't there an align that uses multibyte nop 
(like) instructions? (I posted some pdf to you a while back, iirc it 
points that out)
