that's brilliant
2013/9/26 Jason Hood <[email protected]>

> Greetings.
>
> It's rather funny timing that a couple of topics have come up about
> optimization and exe size, as I've just spent the past couple of weeks
> improving the generated i386 code (most of which would also apply to
> x86-64, but I've not done that). Not sure what the protocol regarding
> patches is, so for now you'll find it on pastebin, based on the 0.9.26
> release (as one big diff, I'm afraid).
>
> http://pastebin.com/vdQuhziY
>
> BTW, it looks like the original source was tab-free, but some tabs have
> snuck in, so you may want to (de)tabify the whole lot. I've also made a
> couple of spelling corrections.
>
> First off, here are the results, building my tcc.exe (I'm on Windows,
> so I'll also be using Intel syntax), with:
>
>   original tcc:                  225792 bytes
>   my tcc, without optimizations: 218624 bytes (3% reduction)
>   my tcc, with optimizations:    169472 bytes (25% reduction)
>
> Build times are basically the same (using gcc, it was about 0.01s
> slower to build with optimizations; using tcc, the optimized version
> actually built the optimized version about 0.01s quicker than the
> original).
> The non-optimized version is smaller, as I've made some changes
> independent of the optimizations:
>
> * 4- & 8-byte structs copy as int/long long (all targets);
> * passing structs <= 8 bytes will be treated as int/long long;
> * returning structs <= 8 bytes is done via (edx)eax (PE only);
> * added ebx to the register list (increasing the prolog by one byte,
>   to save it);
> * use xor r,r instead of mov r,0;
> * use the eax-specific form of instructions;
> * use movzx after setxx instead of mov r,0 before;
> * use movsx for char & short casts, instead of shl+sar;
> * use the byte form of sub esp (via an enhanced gadd_sp() function);
> * gcall_or_jmp() uses symbols and locals directly (like call [ebp-4]);
> * use test r,r instead of cmp r,0;
> * use inc/dec r instead of add/sub r,-1;
> * use movzx r,br/bw instead of and r,0xff/0xffff;
> * or r,-1 (should it occur) replaces its mov r,whatever;
> * multiply by 0 (should it occur) becomes xor r,r (replacing its mov);
> * multiply by -1 becomes neg r;
> * make use of imul r,const;
> * simplify the float (not) equal test (remove cmp/xor, use jpo/jpe);
> * fix add in the assembler, to use the byte form when appropriate.
>
> To support the optimizations, o() must only be used to start an
> instruction. I've added O<N> macros to combine <N> bytes into a single
> int, and a function og() to combine o() and g().
>
> Optimizations are enabled by using -O, but I neglected to add them to
> the help:
>
>   -Of        - functions
>   -Oj        - jumps
>   -Om        - multiplications and pointer division
>   -Or        - registers
>   -O -O2 -Ox - all optimizations
>   -O1        - all but -Oj (i.e. -Ofmr)
>   -Os        - all but -Om (i.e. -Ofjr; also removes PE function
>                alignment)
>   -O0        - no optimizations (default)
>
> -Of will minimize the prolog and epilog. The full prolog is jumped
> over as usual; then, when the function is finished, only what is
> needed is written, everything is moved back (adjusting relocations to
> suit), and the needed epilog is written.
> As suggested above, I've also aligned PE functions to 16 bytes - this
> always happens unless -Os is used (maybe it's not needed, but I'm so
> used to seeing it in disassembly listings that it just looks wrong
> without it :)).
>
> -Oj will optimize various uses of jumps. Jumps to a jmp will be
> replaced with the destination of that jmp; any jmps skipped as a
> result will be removed. Common code before a jmp and at its
> destination (up to eight instructions - the reason for the o()
> restriction) will result in the code before the jmp being removed and
> the jmp's destination changed. Casting to boolean will use setxx/movzx
> or stc/sbb/inc when appropriate. Conditional jumps over a jmp will
> invert the condition and change the destination, removing the jmp.
> Jumps to the epilog will be replaced with the epilog itself (if it's
> only one or two bytes with -Os). Appropriate near jumps will be
> converted to short.
>
> -Om will use lea (possibly followed by add, shl or another lea) to do
> appropriate constant multiplication. Pointer division is done by
> reciprocal multiplication (which should probably also be used for
> normal division; I don't know why I didn't).
>
> -Or improves register usage. Previous values are remembered (this
> would ideally be done as part of tccgen). Appropriate function
> arguments are pushed directly. A load-const/store pair stores the
> const directly. Suitable adds are turned into a displacement (greatly
> improving struct and long long access).
>
> A couple of things I didn't do were combining arithmetic operators
> (even though register displacement combines adds) and removing unused
> locals (remembering register values means a write to a temporary
> probably won't be read back). And I didn't do any of it for x86-64
> (in particular, returning small structs should be done, as that's
> expected by Windows).
>
> In addition, I've tweaked the Win32 build.
> Build-tcc.bat will determine the target based on gcc itself (although
> it will need modification if you still want to support command.com).
> Separated lib/chkstk.S into lib/seh.S (assuming only 32-bit) and
> lib/sjlj.S (assuming only 64-bit); however, I didn't update the
> configure process, only build-tcc.bat.
>
> --
> Jason.
>
> _______________________________________________
> Tinycc-devel mailing list
> [email protected]
> https://lists.nongnu.org/mailman/listinfo/tinycc-devel
