Re: [Tinycc-devel] Generating better i386 code
On 24/10/2013 0:36, grischka wrote:
> I'm slightly skeptical about the register caching, i.e. how

The register cache caused a lot of problems...

> $ .\tcc.exe -O ..\tcc.c -DONE_SOURCE -DTCC_TARGET_PE -o tcc1.exe -bench
> 26168 idents, 65111 lines, 2198309 bytes, 0.188 s, 346335 lines/s, 11.7 MB/s
> $ .\tcc1.exe -O ..\tcc.c -DONE_SOURCE -DTCC_TARGET_PE -o tcc2.exe -bench
> 26168 idents, 65111 lines, 2198309 bytes, 0.001 s, 65111000 lines/s, 2198.3 MB/s
>
> Note the 0.001 s part, something must be wrong there.

...which apparently I still haven't solved. It seems to work okay with my 0.9.26 release, so maybe it's something that's been added since.

> There is other stuff that is probably not worth it because the gain
> is minimal, such as with the split of chkstk.S.

Sure, but library files should be separate, and unused code is still unused code...

> I'd like to encourage you to push this on a fork, as with the fork

Actually, I was hoping someone else would run with it, as I wasn't really planning on looking at it again (indeed, I was just about to unsubscribe, but I'll wait another month now).

--
Jason.

___
Tinycc-devel mailing list
Tinycc-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/tinycc-devel
Re: [Tinycc-devel] Generating better i386 code
grischka wrote:
> Anyway I think this experiment is definitely worth to be kept around.
> I'd like to encourage you to push this on a fork, as with the fork
> link top of
> -- http://repo.or.cz/w/tinycc.git

I just noticed that repo.or.cz now supports "Personal mob branches":

http://repo.or.cz/h/mob.html

Looks like a good thing for stuff of various kinds such as not or not yet meant to go into mainline.

--- gr
Re: [Tinycc-devel] Generating better i386 code
Jason Hood wrote:
> Greetings.
>
> It's rather funny timing that a couple of topics have come up about optimization and exe size, as I've just spent the past couple of weeks improving the generated i386 code (most of which would also apply to x86-64, but I've not done that). Not sure what the protocol regarding patches is, so for now you'll find it on pastebin, based on the 0.9.26 release (as one big diff, I'm afraid).
>
> http://pastebin.com/vdQuhziY

I found some time to try this and I'm actually quite impressed how this produces much better code wrt. both size and speed with quite moderate effort. It's almost gcc -O0 level I guess.

I really like the jump optimization part. If TCC had a say generic infrastructure to move around compiled code it could be beneficial also for the other targets I guess.

I'm slightly skeptical about the register caching, i.e. how correct can it be under all circumstances, given the hackishness in the register handling that TCC already has. The RESET_CACHE_IND macro in various places is to me like a warning not to immediately trust this. ;)

One symptom I happened to notice (on win32):

$ .\tcc.exe -O ..\tcc.c -DONE_SOURCE -DTCC_TARGET_PE -o tcc1.exe -bench
26168 idents, 65111 lines, 2198309 bytes, 0.188 s, 346335 lines/s, 11.7 MB/s
$ .\tcc1.exe -O ..\tcc.c -DONE_SOURCE -DTCC_TARGET_PE -o tcc2.exe -bench
26168 idents, 65111 lines, 2198309 bytes, 0.001 s, 65111000 lines/s, 2198.3 MB/s

Note the 0.001 s part, something must be wrong there.

There is other stuff that is probably not worth it because the gain is minimal, such as with the split of chkstk.S.

Anyway I think this experiment is definitely worth to be kept around. I'd like to encourage you to push this on a fork, as with the fork link top of
-- http://repo.or.cz/w/tinycc.git

Ideally of course as a series of single patches for each feature. :)

Thanks,

--- grischka

> BTW, it looks like the original source was tab-free, but some tabs have snuck in, so you may want to (de)tabify the whole lot.
> I've also made a couple of spelling corrections.
>
> First off, here's the results, building my tcc.exe (I'm on Windows, so I'll also be using Intel syntax) with:
>
> original tcc:                  225792 bytes
> my tcc, without optimizations: 218624 bytes (3% reduction)
> my tcc, with optimizations:    169472 bytes (25% reduction)
>
> Build times are basically the same (using gcc, it was about 0.01s slower to build with optimizations; using tcc, the optimized version actually built the optimized version about 0.01s quicker than the original).
>
> The non-optimized version is smaller, as I've made some changes independent of the optimizations:
>
> * 4- & 8-byte structs copy as int/long long (all targets);
> * passing structs <= 8 bytes will be treated as int/long long;
> * returning structs <= 8 bytes is done via (edx)eax (PE only);
> * added ebx to the register list (increasing prolog by one, to save it);
> * use xor r,r instead of mov r,0;
> * use the eax-specific form of instructions;
> * use movzx after setxx instead of mov r,0 before;
> * use movsx for char & short casts, instead of shl+sar;
> * use the byte form of sub esp (via enhanced gadd_sp() function);
> * gcall_or_jmp() uses symbols and locals directly (like call [ebp-4]);
> * use test r,r instead of cmp r,0;
> * use inc/dec r instead of add/sub r,-1;
> * use movzx r,br/bw instead of and r,0xff/0xffff;
> * or r,-1 (should it occur) replaces its mov r,whatever;
> * multiply by 0 (should it occur) becomes xor r,r (replacing its mov);
> * multiply by -1 becomes neg r;
> * make use of imul r,const;
> * simplify the float (not) equal test (remove cmp/xor, use jpo/jpe);
> * fix add in the assembler, to use the byte form when appropriate.
>
> To support the optimizations, o() must only be used to start an instruction. I've added ON macros to combine N bytes into a single int and function og() to combine o() and g().
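To illustrate the o()/ON remark: in TCC's i386 generator, o() emits the bytes of its int argument low byte first and stops once the remaining value is zero, which is what lets several opcode bytes travel in one call. The following is only a stand-alone sketch of that idea. The buffer names emit_buf and emit_len and the macros O2 and O3 are hypothetical; the actual ON macros from the patch are not shown in this thread and may differ.

```c
#include <stddef.h>

/* Stand-alone model of byte emission; not the actual tcc sources.
   emit_buf and emit_len are hypothetical stand-ins for the output section. */
static unsigned char emit_buf[64];
static size_t emit_len;

/* g(): append a single byte to the output. */
static void g(int c)
{
    if (emit_len < sizeof emit_buf)
        emit_buf[emit_len++] = (unsigned char)c;
}

/* o(): emit the bytes of c, low byte first, until the rest is zero. */
static void o(unsigned int c)
{
    while (c) {
        g((int)(c & 0xff));
        c >>= 8;
    }
}

/* Hypothetical packing macros in the spirit of the "ON" macros. */
#define O2(a, b)    ((unsigned)(a) | ((unsigned)(b) << 8))
#define O3(a, b, c) (O2(a, b) | ((unsigned)(c) << 16))

/* Emit two instructions, each started by exactly one o() call. */
static void emit_examples(void)
{
    emit_len = 0;
    o(O2(0x31, 0xC0));        /* xor eax,eax   = 31 C0    */
    o(O3(0x0F, 0xB6, 0xC0));  /* movzx eax,al  = 0F B6 C0 */
}
```

Because o() stops at a zero remainder, a packed value whose last byte is 0x00 would lose that byte; in this model such a byte has to be sent through g() instead.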
> Optimizations are enabled by using -O, but I neglected to add them to the help:
>
> -Of         - functions
> -Oj         - jumps
> -Om         - multiplications and pointer division
> -Or         - registers
> -O -O2 -Ox  - all optimizations
> -O1         - all but -Oj (i.e. -Ofmr)
> -Os         - all but -Om (i.e. -Ofjr; also removes PE function alignment)
> -O0         - no optimizations (default)
>
> -Of will minimize the prolog and epilog. The full prolog is jumped over as usual, then when the function is finished, write only what is needed, move everything back (adjusting relocations to suit) and write the needed epilog. As suggested above, I've also aligned PE functions to 16 bytes - this always happens, unless -Os is used (maybe it's not needed, but I'm so used to seeing it in disassembly listings, it just looks wrong without it :)).
>
> -Oj will optimize various usages of jump. Jumps to jmp will be replaced with the destination of the jmp; resulting skipped jmps will be removed. Common code before a jmp and its destination (up to eight instructions, the reason for the o() restriction) will result in removal of the code before the jmp, changing the jmp
Re: [Tinycc-devel] Generating better i386 code
On Fri, Sep 27, 2013 at 11:21:19AM +1000, Jason Hood wrote:
> On 26/09/2013 16:30, Daniel Glöckner wrote:
>> On Thu, Sep 26, 2013 at 03:39:45PM +1000, Jason Hood wrote:
>>> * 4- & 8-byte structs copy as int/long long (all targets);
>>
>> did you check if the structure is aligned to a multiple of 4 bytes?
>> Otherwise it will crash on ARM.
>
> No, as I thought structures of these sizes would already be aligned (as
> if they were int or long long). Is that not necessarily the case?

No, struct { char x[4]; } has an alignment of 1 byte.

  Daniel
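A small sketch of Daniel's point, assuming a C11 compiler and one of the common ABIs where a struct of chars has alignment 1. The type and function names here are made up for illustration. A 4-byte struct of chars can sit at an odd offset, so loading it as an int is a misaligned access on targets that care, while memcpy is valid for any address.

```c
#include <stddef.h>
#include <string.h>

struct bytes4 { char x[4]; };                      /* only char members */
struct holder { char tag; struct bytes4 b; };      /* b may follow tag directly */

/* Alignment of the 4-byte struct and of a plain int, for comparison. */
static size_t bytes4_align(void) { return _Alignof(struct bytes4); }
static size_t int_align(void)    { return _Alignof(int); }

/* Offset at which the 4-byte struct ends up inside holder. */
static size_t bytes4_offset(void) { return offsetof(struct holder, b); }

/* Copy that makes no alignment assumption about either pointer. */
static void copy_bytes4(struct bytes4 *dst, const struct bytes4 *src)
{
    memcpy(dst, src, sizeof *dst);
}
```

On such an ABI bytes4_offset() is 1, which is exactly the case where copying the struct with a single word load and store would be unsafe on ARM.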
Re: [Tinycc-devel] Generating better i386 code
On 27/09/2013 16:54, Daniel Glöckner wrote:
> No, struct { char x[4]; } has an alignment of 1 byte.

Right, well I think that's simple enough:

--- tccgen~.c   2013-09-25 19:24:46 +1000
+++ tccgen.c    2013-09-27 19:33:08 +1000
@@ -2405,12 +2405,18 @@
         if (!nocode_wanted) {
             size = type_size(&vtop->type, &align);
 
+#ifdef TCC_TARGET_ARM
+            if (!(align & 3)) {
+#endif
             if (size == 4)
                 goto small_struct;
             if (size == 8) {
                 ft = VT_LLONG;
                 goto small_struct;
             }
+#ifdef TCC_TARGET_ARM
+            }
+#endif
 
             /* destination */
             vswap();

--
Jason.
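The condition in the patch can be read as a small predicate. The sketch below restates it outside of tccgen.c. The function name and the strict_alignment parameter are hypothetical, with the parameter standing in for the TCC_TARGET_ARM #ifdef.

```c
#include <stdbool.h>

/* Hypothetical restatement of the patch's test: copy a struct as an
   int or long long only if its size is 4 or 8 and, on a target that
   needs it, its alignment is a multiple of 4. */
static bool copy_as_scalar(int size, int align, bool strict_alignment)
{
    if (strict_alignment && (align & 3) != 0)
        return false;               /* fall back to the generic struct copy */
    return size == 4 || size == 8;
}
```

With this reading, struct { char x[4]; } (size 4, alignment 1) takes the generic copy path on ARM and the scalar path elsewhere.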
Re: [Tinycc-devel] Generating better i386 code
Looks like a nice patch. I have some issues though.

> * 4- & 8-byte structs copy as int/long long (all targets);
> * passing structs <= 8 bytes will be treated as int/long long;
> * returning structs <= 8 bytes is done via (edx)eax (PE only);
> * added ebx to the register list (increasing prolog by one, to save it)

These changes look like they might change the ABI of tcc. Did you check them for ABI compatibility? It would be great if it would still be possible to use code generated by tcc together with code generated by other, ABI conforming compilers.

> * simplify the float (not) equal test (remove cmp/xor, use jpo/jpe);

Does this change correctly handle NaNs? The following code must print 0 when executed:

    #include <stdio.h>
    #include <math.h>

    int main() {
        double x = NAN;
        printf("%d\n", x == x);
    }
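For context on the NaN question, here is a sketch of how a parity-based equality test can still treat NaN as unequal. It models the x87 condition bits as they appear in AH after fnstsw ax, masks them with 0x44 (C3 and C2) and checks for odd parity, as "test ah,0x44" followed by jpo would. Whether the patch uses exactly this mask is an assumption; only the use of jpo/jpe is stated in the thread.

```c
#include <math.h>
#include <stdbool.h>

/* x87 condition bits as they appear in AH after fnstsw ax. */
enum { C0 = 0x01, C2 = 0x04, C3 = 0x40 };

/* Model of the AH condition bits after an x87 compare of a with b. */
static unsigned x87_compare_ah(double a, double b)
{
    if (isnan(a) || isnan(b))
        return C3 | C2 | C0;   /* unordered */
    if (a == b)
        return C3;             /* equal */
    if (a < b)
        return C0;             /* less */
    return 0;                  /* greater */
}

/* True when the low 8 bits hold an odd number of set bits (x86 PF = 0). */
static bool parity_odd(unsigned v)
{
    v &= 0xff;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (v & 1) != 0;
}

/* Equality as "test ah,0x44 ; jpo equal" would decide it (assumed mask). */
static bool equal_by_parity(double a, double b)
{
    return parity_odd(x87_compare_ah(a, b) & (C3 | C2));
}
```

In this model only the "equal" result leaves a single bit set, so NaN compared with itself has even parity and is reported as not equal.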
Re: [Tinycc-devel] Generating better i386 code
Hi Jason,

On Thu, Sep 26, 2013 at 03:39:45PM +1000, Jason Hood wrote:
> * 4- & 8-byte structs copy as int/long long (all targets);

did you check if the structure is aligned to a multiple of 4 bytes? Otherwise it will crash on ARM.

  Daniel
Re: [Tinycc-devel] Generating better i386 code
On 26/09/2013 16:30, Daniel Glöckner wrote:
> On Thu, Sep 26, 2013 at 03:39:45PM +1000, Jason Hood wrote:
>> * 4- & 8-byte structs copy as int/long long (all targets);
>
> did you check if the structure is aligned to a multiple of 4 bytes?
> Otherwise it will crash on ARM.

No, as I thought structures of these sizes would already be aligned (as if they were int or long long). Is that not necessarily the case?

--
Jason.
Re: [Tinycc-devel] Generating better i386 code
that's brilliant

2013/9/26 Jason Hood jad...@yahoo.com.au
> Greetings.
>
> It's rather funny timing that a couple of topics have come up about optimization and exe size, as I've just spent the past couple of weeks improving the generated i386 code (most of which would also apply to x86-64, but I've not done that). Not sure what the protocol regarding patches is, so for now you'll find it on pastebin, based on the 0.9.26 release (as one big diff, I'm afraid).
>
> http://pastebin.com/vdQuhziY
>
> BTW, it looks like the original source was tab-free, but some tabs have snuck in, so you may want to (de)tabify the whole lot. I've also made a couple of spelling corrections.
>
> First off, here's the results, building my tcc.exe (I'm on Windows, so I'll also be using Intel syntax) with:
>
> original tcc:                  225792 bytes
> my tcc, without optimizations: 218624 bytes (3% reduction)
> my tcc, with optimizations:    169472 bytes (25% reduction)
>
> Build times are basically the same (using gcc, it was about 0.01s slower to build with optimizations; using tcc, the optimized version actually built the optimized version about 0.01s quicker than the original).
> The non-optimized version is smaller, as I've made some changes independent of the optimizations:
>
> * 4- & 8-byte structs copy as int/long long (all targets);
> * passing structs <= 8 bytes will be treated as int/long long;
> * returning structs <= 8 bytes is done via (edx)eax (PE only);
> * added ebx to the register list (increasing prolog by one, to save it);
> * use xor r,r instead of mov r,0;
> * use the eax-specific form of instructions;
> * use movzx after setxx instead of mov r,0 before;
> * use movsx for char & short casts, instead of shl+sar;
> * use the byte form of sub esp (via enhanced gadd_sp() function);
> * gcall_or_jmp() uses symbols and locals directly (like call [ebp-4]);
> * use test r,r instead of cmp r,0;
> * use inc/dec r instead of add/sub r,-1;
> * use movzx r,br/bw instead of and r,0xff/0xffff;
> * or r,-1 (should it occur) replaces its mov r,whatever;
> * multiply by 0 (should it occur) becomes xor r,r (replacing its mov);
> * multiply by -1 becomes neg r;
> * make use of imul r,const;
> * simplify the float (not) equal test (remove cmp/xor, use jpo/jpe);
> * fix add in the assembler, to use the byte form when appropriate.
>
> To support the optimizations, o() must only be used to start an instruction. I've added ON macros to combine N bytes into a single int and function og() to combine o() and g().
>
> Optimizations are enabled by using -O, but I neglected to add them to the help:
>
> -Of         - functions
> -Oj         - jumps
> -Om         - multiplications and pointer division
> -Or         - registers
> -O -O2 -Ox  - all optimizations
> -O1         - all but -Oj (i.e. -Ofmr)
> -Os         - all but -Om (i.e. -Ofjr; also removes PE function alignment)
> -O0         - no optimizations (default)
>
> -Of will minimize the prolog and epilog. The full prolog is jumped over as usual, then when the function is finished, write only what is needed, move everything back (adjusting relocations to suit) and write the needed epilog.
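As a rough illustration of what the -Om entry in the option list covers (constant multiplication and pointer division), the sketch below writes the arithmetic out in C. The multiply helpers mirror what lea r,[r+r*scale] computes. The division helper uses exact division by a multiplicative inverse modulo 2^32, which works because a pointer difference is always a multiple of the element size. That is one form of reciprocal multiplication and not necessarily the one the patch uses, and all function names are made up for the example.

```c
#include <stdint.h>

/* x*5 as lea r,[r+r*4] computes it: base + index*scale. */
static uint32_t mul5(uint32_t x)  { return x + (x << 2); }

/* x*10 as that lea followed by add r,r. */
static uint32_t mul10(uint32_t x) { uint32_t t = x + (x << 2); return t + t; }

/* x*45 as two leas: (x*9)*5. */
static uint32_t mul45(uint32_t x) { uint32_t t = x + (x << 3); return t + (t << 2); }

/* Multiplicative inverse of an odd d modulo 2^32, by Newton iteration. */
static uint32_t inverse32(uint32_t d)
{
    uint32_t inv = d;              /* d*d == 1 (mod 8), so 3 bits are right */
    for (int i = 0; i < 4; i++)
        inv *= 2u - d * inv;       /* each step doubles the correct bits */
    return inv;
}

/* Byte difference -> element count for 12-byte elements; diff must be
   an exact multiple of 12, as a pointer difference always is. */
static int32_t ptrdiff_div12(int32_t diff)
{
    /* 12 = 4 * 3: remove the power of two first (a compiler would use
       sar), then multiply by the inverse of the odd part. */
    uint32_t q = (uint32_t)(diff / 4) * inverse32(3);
    return (int32_t)q;
}
```

The inverse trick is only valid for exact divisions; a general x / 12 needs the multiply-high form of reciprocal multiplication instead.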
> As suggested above, I've also aligned PE functions to 16 bytes - this always happens, unless -Os is used (maybe it's not needed, but I'm so used to seeing it in disassembly listings, it just looks wrong without it :)).
>
> -Oj will optimize various usages of jump. Jumps to jmp will be replaced with the destination of the jmp; resulting skipped jmps will be removed. Common code before a jmp and its destination (up to eight instructions, the reason for the o() restriction) will result in removal of the code before the jmp, changing the jmp destination. Casting to boolean will use setxx/movzx or stc/sbb/inc when appropriate. Conditional jumps over a jmp will invert the condition and change the destination, removing the jmp. Jumps to the epilog will be replaced with the epilog itself (if it's only one or two bytes with -Os). Appropriate near jumps will be converted to short.
>
> -Om will use lea (possibly followed by add, shl or another lea) to do appropriate constant multiplication. Pointer division is done by reciprocal multiplication (which should probably also be used for normal division, don't know why I didn't).
>
> -Or improves register usage. Previous values are remembered (this would ideally be done as part of tccgen). Appropriate function arguments are pushed directly. A load const/store pair stores the const directly. Suitable adds are turned into a displacement (greatly improving struct and long long access).
>
> A couple of things I didn't do was combine arithmetic operators (even though register displacement combines adds) or remove unused locals (remembering register values means writing to a temporary probably won't read from it). And doing it all for x86-64 (in particular, returning small structs should be done, as that's expected by Windows).
>
> In addition, I've tweaked the Win32 build. Build-tcc.bat will