that's brilliant

2013/9/26 Jason Hood <[email protected]>

> Greetings.
>
> It's rather funny timing that a couple of topics have come up about
> optimization and exe size, as I've just spent the past couple of weeks
> improving the generated i386 code (most of which would also apply to
> x86-64, but I've not done that).  Not sure what the protocol regarding
> patches is, so for now you'll find it on pastebin, based on the 0.9.26
> release (as one big diff, I'm afraid).
>
> http://pastebin.com/vdQuhziY
>
> BTW, it looks like the original source was tab-free, but some tabs have
> snuck in, so you may want to (de)tabify the whole lot.  I've also made a
> couple of spelling corrections.
>
> First off, here are the results, building my tcc.exe (I'm on Windows, so
> I'll also be using Intel syntax) with:
>
> original tcc:                  225792 bytes
> my tcc, without optimizations: 218624 bytes (3% reduction)
> my tcc, with optimizations:    169472 bytes (25% reduction)
>
> Build times are basically the same (using gcc, it was about 0.01s slower
> to build with optimizations; using tcc, the optimized version actually
> built the optimized version about 0.01s quicker than the original did).
>
> The non-optimized version is smaller, as I've made some changes
> independent of the optimizations:
>
> * 4- & 8-byte structs copy as int/long long (all targets);
> * passing structs <= 8 bytes will be treated as int/long long;
> * returning structs <= 8 bytes is done via (edx)eax (PE only);
> * added ebx to the register list (increasing prolog by one, to save it);
> * use xor r,r instead of mov r,0 (see the toy C sketch after this list);
> * use the eax-specific form of instructions;
> * use movzx after setxx instead of mov r,0 before;
> * use movsx for char & short casts, instead of shl+sar;
> * use the byte form of sub esp (via enhanced gadd_sp() function);
> * gcall_or_jmp() uses symbols and locals directly (like call [ebp-4]);
> * use test r,r instead of cmp r,0;
> * use inc/dec r instead of add/sub r,-1;
> * use movzx r,br/bw instead of and r,0xff/0xffff;
> * or r,-1 (should it occur) replaces its mov r,whatever;
> * multiply by 0 (should it occur) becomes xor r,r (replacing its mov);
> * multiply by -1 becomes neg r;
> * make use of imul r,const;
> * simplify the float (not) equal test (remove cmp/xor, use jpo/jpe);
> * fix add in the assembler, to use the byte form when appropriate.
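>
> (To give a flavour of a couple of those - the xor and test ones - here
> is a throwaway C sketch, not code from the patch; load_imm/cmp_imm are
> made-up names and the byte counts are just the plain encodings:)
>
>     #include <stdio.h>
>
>     /* toy stand-ins for the real emitter: print the instruction
>        chosen and its encoded length */
>     static void load_imm(const char *r, int v)
>     {
>         if (v == 0)
>             printf("xor %s,%s   ; 2 bytes\n", r, r);
>         else
>             printf("mov %s,%d   ; 5 bytes\n", r, v);
>     }
>
>     static void cmp_imm(const char *r, int v)
>     {
>         if (v == 0)
>             printf("test %s,%s  ; 2 bytes\n", r, r);
>         else
>             printf("cmp %s,%d   ; 3 bytes (imm8 form)\n", r, v);
>     }
>
>     int main(void)
>     {
>         load_imm("eax", 0);   /* xor eax,eax instead of mov eax,0 */
>         cmp_imm("ecx", 0);    /* test ecx,ecx instead of cmp ecx,0 */
>         load_imm("eax", 42);  /* non-zero still gets the plain mov */
>         return 0;
>     }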
>
> To support the optimizations, o() must only be used to start an
> instruction.  I've added O<N> macros to combine <N> bytes into a single
> int and function og() to combine o() and g().
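>
> For reference, they look roughly like this, with a simplified o()/g()
> pair standing in for the real ones (which write into the text section):
>
>     #include <stdio.h>
>
>     static unsigned char code[64];
>     static int ind;                  /* next free byte, as in tcc */
>
>     static void g(int c) { code[ind++] = c; }   /* emit one byte */
>
>     static void o(unsigned int c)    /* emit the bytes of c, low first */
>     {
>         do { g(c); c >>= 8; } while (c);
>     }
>
>     #define O2(a, b)    ((a) | ((b) << 8))
>     #define O3(a, b, c) ((a) | ((b) << 8) | ((c) << 16))
>
>     static void og(unsigned int opc, int modrm) { o(opc); g(modrm); }
>
>     int main(void)
>     {
>         og(0x31, 0xc0);            /* xor eax,eax   */
>         o(O2(0x85, 0xc0));         /* test eax,eax  */
>         o(O3(0x0f, 0xb6, 0xc0));   /* movzx eax,al  */
>         for (int i = 0; i < ind; i++)
>             printf("%02x ", code[i]);
>         printf("\n");
>         return 0;
>     }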
>
> Optimizations are enabled by using -O, but I neglected to add them to
> the help:
>
>     -Of - functions
>     -Oj - jumps
>     -Om - multiplications and pointer division
>     -Or - registers
>     -O -O2 -Ox - all optimizations
>     -O1 - all but -Oj (i.e. -Ofmr)
>     -Os - all but -Om (i.e. -Ofjr; also removes PE function alignment)
>     -O0 - no optimizations (default)
>
> -Of will minimize the prolog and epilog.  The full prolog is jumped over
> as usual, then when the function is finished, write only what is needed,
> move everything back (adjusting relocations to suit) and write the
> needed epilog.  As suggested above, I've also aligned PE functions to
> 16 bytes - this always happens, unless -Os is used (maybe it's not needed,
> but I'm so used to seeing it in disassembly listings, it just looks wrong
> without it :)).
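>
> In bare-bones C the shuffle is something like this (buffer and
> relocation handling boiled down to just the idea; the names are made
> up, not the patch's):
>
>     #include <stdio.h>
>     #include <string.h>
>
>     typedef struct { int offset; } Reloc;  /* offset into the code */
>
>     #define MAX_PROLOG 9   /* push ebp; mov ebp,esp; sub esp,imm32 */
>
>     /* write the prolog actually needed, slide the body back over
>        the unused gap and shift relocations that pointed into it */
>     static int shrink_prolog(unsigned char *buf, int func_start,
>                              int body_start, int body_len,
>                              const unsigned char *prolog, int plen,
>                              Reloc *relocs, int nb_relocs)
>     {
>         int new_body = func_start + plen;
>         int shift = body_start - new_body;   /* bytes saved */
>         int i;
>
>         memcpy(buf + func_start, prolog, plen);
>         memmove(buf + new_body, buf + body_start, body_len);
>         for (i = 0; i < nb_relocs; i++)
>             if (relocs[i].offset >= body_start)
>                 relocs[i].offset -= shift;
>         return shift;
>     }
>
>     int main(void)
>     {
>         /* worst-case gap, then a 3-byte body: xor eax,eax; ret */
>         unsigned char code[32] = { [MAX_PROLOG] = 0x31, 0xc0, 0xc3 };
>         /* a leaf needs only push ebp; mov ebp,esp */
>         static const unsigned char pro[] = { 0x55, 0x89, 0xe5 };
>         Reloc r[] = { { MAX_PROLOG + 2 } };  /* pretend the body
>                                                 has a relocation */
>
>         int saved = shrink_prolog(code, 0, MAX_PROLOG, 3,
>                                   pro, sizeof pro, r, 1);
>         printf("saved %d bytes, reloc now at %d\n", saved, r[0].offset);
>         return 0;
>     }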
>
> -Oj will optimize various usages of jump.  Jumps to jmp will be replaced
> with the destination of the jmp; resulting skipped jmps will be removed.
> Common code before a jmp and its destination (up to eight instructions,
> the reason for the o() restriction) will result in removal of the code
> before the jmp, changing the jmp destination.  Casting to boolean will
> use setxx/movzx or stc/sbb/inc when appropriate.  Conditional jumps
> over a jmp will invert the condition and change the destination,
> removing the jmp.  Jumps to the epilog will be replaced with the epilog
> itself (if it's only one or two bytes with -Os).  Appropriate near jumps
> will be converted to short.
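>
> The jmp-to-jmp part boils down to chasing the chain until it stops; a
> toy version, with a plain table standing in for whatever the real code
> walks:
>
>     #include <stdio.h>
>
>     #define NOT_A_JMP (-1)
>
>     /* target[i] is the destination if instruction i is a jmp,
>        or NOT_A_JMP if it isn't */
>     static int thread_jump(const int *target, int i)
>     {
>         int hops = 0;
>         while (target[i] != NOT_A_JMP && hops++ < 64)  /* loop guard */
>             i = target[i];
>         return i;
>     }
>
>     int main(void)
>     {
>         /* 0: jmp 2,  1: not a jmp,  2: jmp 3,  3: not a jmp */
>         int target[] = { 2, NOT_A_JMP, 3, NOT_A_JMP };
>         printf("jmp at 0 finally lands on %d\n",
>                thread_jump(target, 0));   /* prints 3 */
>         return 0;
>     }
>
> (And the jump-over-a-jmp case is the usual invert-and-retarget:
> "jxx skip / jmp target / skip:" becomes "jnxx target".)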
>
> -Om will use lea (possibly followed by add, shl or another lea) to do
> appropriate constant multiplication.  Pointer division is done by
> reciprocal multiplication (which should probably also be used for normal
> division, don't know why I didn't).
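>
> As a concrete worked example (12 picked only for illustration, say a
> 12-byte struct for the pointer case): multiplying by 12 can be done as
> lea r,[x+x*2] followed by shl r,2, and dividing by 12 as a multiply by
> 0xAAAAAAAB (which is ceil(2^35/12)), keeping bits 35 and up.  A quick C
> check of both:
>
>     #include <stdio.h>
>     #include <stdint.h>
>
>     /* x * 12 the lea way: lea r,[x+x*2] ; shl r,2 */
>     static uint32_t mul12(uint32_t x) { return (x + x * 2) << 2; }
>
>     /* x / 12 without a div: multiply by ceil(2^35 / 12) and keep
>        the high bits; exact for every 32-bit unsigned x */
>     static uint32_t div12(uint32_t x)
>     {
>         return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 35);
>     }
>
>     int main(void)
>     {
>         uint32_t x;
>         for (x = 0; x < 1000000; x++) {      /* spot check a range */
>             if (mul12(x) != x * 12 || div12(x) != x / 12) {
>                 printf("mismatch at %u\n", x);
>                 return 1;
>             }
>         }
>         printf("mul12/div12 match *12 and /12 on the range\n");
>         return 0;
>     }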
>
> -Or improves register usage.  Previous values are remembered (this would
> ideally be done as part of tccgen).  Appropriate function arguments are
> pushed directly.  A load const/store pair stores the const directly.
> Suitable adds are turned into a displacement (greatly improving struct
> and long long access).
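>
> The value-remembering part, boiled down to a toy (constants only and
> made-up names - just the shape of the idea, not the patch):
>
>     #include <stdio.h>
>
>     #define NB_REGS 4
>
>     static struct {
>         int known;   /* do we know the register's contents? */
>         int value;   /* the constant it holds, if so */
>     } regs[NB_REGS];
>
>     static int emitted;
>
>     static void load_const(int r, int v)
>     {
>         if (regs[r].known && regs[r].value == v)
>             return;                      /* already there: no mov */
>         printf("  mov r%d, %d\n", r, v); /* stand-in for the emitter */
>         emitted++;
>         regs[r].known = 1;
>         regs[r].value = v;
>     }
>
>     /* forget a register, e.g. across a call */
>     static void clobber(int r) { regs[r].known = 0; }
>
>     int main(void)
>     {
>         load_const(0, 5);   /* emits the mov               */
>         load_const(0, 5);   /* skipped: r0 already holds 5 */
>         clobber(0);
>         load_const(0, 5);   /* emits again after the call  */
>         printf("emitted %d of 3 loads\n", emitted);
>         return 0;
>     }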
>
> A couple of things I didn't do were combining arithmetic operators (even
> though register displacement combines adds) and removing unused locals
> (remembering register values means writing to a temporary probably won't
> read from it).  And doing it all for x86-64 (in particular, returning
> small structs should be done, as that's expected by Windows).
>
> In addition, I've tweaked the Win32 build.  Build-tcc.bat will determine
> the target based on gcc itself (although it will need modification if
> you still want to support command.com).  I've separated lib/chkstk.S into
> lib/seh.S (assuming only 32-bit) and lib/sjlj.S (assuming only 64-bit);
> however, I didn't update the configure process, only build-tcc.bat.
>
> --
> Jason.
>
_______________________________________________
Tinycc-devel mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/tinycc-devel
