Hi everyone,

I'm still exploring optimisations in generated x86 code, something which has become my speciality, and I found one new potential optimisation sequence that aims to reduce unnecessary calls to CMP and TEST when the result is already known.  However there are some situations where I'm not sure if the end result is better or not.  For example (taken from the cgiprotocol unit):

.Lj5:
    subl    $1,%esi
.Lj6:
    testl    %esi,%esi
    jng    .Lj9
    ...
    testl    %eax,%eax
    jne    .Lj5
.Lj9:
    movl    $-1,%eax
    testl    %esi,%esi
    cmovlel    %eax,%esi
    movl    %esi,%eax

In this instance, if the first TEST instruction results in "jng .Lj9" branching, it soon calls another TEST instruction with the same register, which will not have changed value, so CMOVLE will definitely set %esi to %eax (equal to -1) because "le" is equal to "ng" ("less than or equal" versus "not greater"), and then %esi is written back to %eax.  All in all, this is a dependency chain of length 3.  When my new optimisation takes place, the following is generated instead:

.Lj5:
    subl    $1,%esi
.Lj6:
    testl    %esi,%esi
    jng    .Lj9
    ...
    testl    %eax,%eax
    jne    .Lj5
    testl    %esi,%esi
    jnle    .Lj12
.Lj9:
    movl    $-1,%esi
.Lj12:
    movl    %esi,%eax

Here, a number of other optimisations don't take place (hence why the CMOV instruction no longer exists), but now if "jng .Lj9" branches, it sets %esi to -1 directly before writing the result to %eax... a dependency chain of just 2.  However, there is also an extra conditional jump, which can no longer be optimised into CMOV because of the position of the .Lj9 label.

The question is... is the new code better, about the same or worse?  On modern processors most, TEST/Jcc can be macro-fused, but jumps always cause some penalty.

There are cases where the optimisation gives clear benefits - for example, in the cclasses unit - before:

    movq    %rcx,%rdx
    testq    %rcx,%rcx
    je    .Lj382
    movq    -8(%rcx),%rdx
.Lj382:
    testq    %rcx,%rcx
    jne    .Lj383
    leaq    FPC_EMPTYCHAR(%rip),%rcx
.Lj383:
    call    CCLASSES_$$_FPHASH$PCHAR$LONGINT$$LONGWORD

After:

    movq    %rcx,%rdx
    testq    %rcx,%rcx
    je    .Lj382
    movq    -8(%rcx),%rdx
    jmp    .Lj383
.Lj382:
    leaq    FPC_EMPTYCHAR(%rip),%rcx
.Lj383:
    call    CCLASSES_$$_FPHASH$PCHAR$LONGINT$$LONGWORD

In this case, if "je .Lj382" branches, then %rcx is definitely zero and so control flow can safely skip straight to the LEA instruction, since the "jne .Lj383" instruction will definitely not branch.  On the other hand, if "je .Lj382" does not branch, then %rcx is definitely non-zero and so the second "testq %rcx,%rcx" is once again deterministic, meaning "jne .Lj383" will definitely branch, therefore the second TEST instruction can be removed completely and the conditional jump turned into an unconditional jump.  The end result is that the number of jumps hasn't changed (exactly one jump will be taken regardless of the value of %rcx), but the instruction count has been reduced (note that %rdx can't be optimised out because its value is used as an actual parameter for the CALL).

Kit

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Reply via email to