Kudos! In the end, did it make much of a difference in the compilation time?
Em sex., 25 de fev. de 2022 às 02:08, J. Gareth Moreton via fpc-devel <fpc-devel@lists.freepascal.org> escreveu: > > I did it! > > After a good week of work and getting things wrong, I finally found a > solution that works nicely and is extensible, at least for x86. A bit > of refactoring and it can be ported to other platforms. I'm just > running the test suites to see if I can break things now. Honestly the > hard part was making sure all the registers were tracked properly. > > https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window > > Please feel free to test it and break it! > > It will likely undergo adjustments over time to refactor bits and > pieces, add extra instructions (especially as it doesn't support SHRX > and SHLX yet, for example) and see if I can make the sliding window more > efficient - I increased it in size from 8 to 16 and then 32 entries, > since even at 16, some optimisations were missed in the RTL, but this > depends on a number of factors. > > It's gotten some pretty good optimisations. On x86_64-win64 under -O4, > the three lazarus binaries... before: > > lazarus.exe: > > 313,990,206 > 313,873,982 > > lazbuild.exe > > 59,704,391 > 59,725,895 > > startlazarus.exe > > 27,471,147 > 27,461,419 > > And the compiler itself, ppcx64.exe: > > 3,777,536 > 3,766,272 > > Remember that though I call the component a "sliding window", it's only > because it's very similar to how LZ77 finds start points for run-length > encoding, and the comments and variable names mention run-length > encoding as it scans sequences of identical instructions. However, at > no point is it actually performing data compression, and the end-result > is common subexpression elimination. All the space savings are from the > removal of redundant sequences of instructions, giving both a space and > a speed boost. > > Almost every source file in the compiler and the RTL shows some kind of > improvement. A lot of them are just redundant pointer deallocations, so > this will help with cache misses and the like - that aside though, here > are a couple of my favourites... one from dbgdwarf - > TDebugInfoDwarf.appendsym_fieldvar_with_name_offset: > > Before: > > ... > .Lj682: > leaq (,%r13,8),%rcx > movq 120(%rdi),%rax > cqto > idivq %rcx > imulq %r13,%rax > movq %rax,%r12 > addq 56(%rbp),%r12 > leaq (,%r13,8),%rcx > movq 120(%rdi),%rax > cqto > idivq %rcx > movq %rdx,%rsi > cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip) > > After: > > ... > .Lj682: > leaq (,%r13,8),%rcx > movq 120(%rdi),%rax > cqto > idivq %rcx > imulq %r13,%rax > movq %rax,%r12 > addq 56(%rbp),%r12 > movq %rdx,%rsi > cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip) > > This one has successfully removed an IDIV instruction because the > algorithm was able to detect that the subroutine wanted the remainder in > %edx, and it was still available from the first IDIV call because it > hadn't been overwritten, and neither %r13 nor %rdi had changed values so > the references are the same. > > This one is from SysUtils' IsLeapYear function and is one I have > personally wanted to optimise further ever since I first saw its > disassembly after I implemented the fast "x mod const = 0" algorithm. > > Before: > > ... > imulw $23593,%cx,%ax > rorw $2,%ax > cmpw $655,%ax > ja .Lj6979 > imulw $23593,%cx,%ax > rorw $4,%ax > cmpw $163,%ax > setbeb %al > ret > .Lj6979: > ... > > After: > > ... > imulw $23593,%cx,%ax > rorw $2,%ax > cmpw $655,%ax > ja .Lj6979 > rorw $2,%ax > cmpw $163,%ax > setbeb %al > ret > .Lj6979: > ... > > In this case, the RLE doesn't produce an exact match since the end > result of %ax is different, but the important point to note is that as > long as %cx doesn't change value (which it doesn't). the first sequence > can be transformed into the second sequence via a "rorw $2,%ax" > instruction, so the end result is that "imulw $23593,%cx,%ax; rorw > $4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of > %ax, thus removing a multiplication. > > Because this has been a mammoth undertaking and is quite a major > addition to the compiler, I'm going to hold off making a merge request > until I write a document on it, like I've done in the past with some of > the other larger optimisations I've made. The branch is available at > the link at the top of this e-mail though if anyone wants to take a look > at it. > > One final note is that this optimisation is rather slow and specialist, > so much so that I've added a new optimizer switch named 'cs_opt_asmcse' > ("Assembly CSE", to differentiate it from "Node CSE"); this is enabled > by default with -O3, and requires the peephole optimizer to also be > enabled in order to function. > > Gareth aka. Kit > > > -- > This email has been checked for viruses by Avast antivirus software. > https://www.avast.com/antivirus > > _______________________________________________ > fpc-devel maillist - fpc-devel@lists.freepascal.org > https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel _______________________________________________ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel