Re: [Tinycc-devel] Generating better i386 code

2013-11-11 Thread Jason Hood
On 24/10/2013 0:36, grischka wrote:
 I'm slightly skeptical about the register caching, i.e. how

The register cache caused a lot of problems...

 $ .\tcc.exe -O ..\tcc.c -DONE_SOURCE -DTCC_TARGET_PE -o tcc1.exe -bench
 26168 idents, 65111 lines, 2198309 bytes, 0.188 s, 346335 lines/s, 11.7 MB/s
 $ .\tcc1.exe -O ..\tcc.c -DONE_SOURCE -DTCC_TARGET_PE -o tcc2.exe -bench
 26168 idents, 65111 lines, 2198309 bytes, 0.001 s, 65111000 lines/s, 2198.3 MB/s
 
 Note the 0.001 s part, something must be wrong there.

...which apparently I still haven't solved.  It seems to work okay with
my 0.9.26 release, so maybe it's something that's been added since.

 There is other stuff that is probably not worth it because the gain
 is minimal, such as with the split of chkstk.S.

Sure, but library files should be separate, and unused code is
still unused code...

 I'd like to encourage you to push this on a fork, as with the fork

Actually, I was hoping someone else would run with it, as I wasn't
really planning on looking at it again (indeed, I was just about to
unsubscribe, but I'll wait another month now).

-- 
Jason.

___
Tinycc-devel mailing list
Tinycc-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/tinycc-devel


Re: [Tinycc-devel] Generating better i386 code

2013-10-24 Thread grischka

grischka wrote:

Anyway I think this experiment is definitely worth keeping around.
I'd like to encourage you to push this on a fork, as with the "fork"
link at the top of http://repo.or.cz/w/tinycc.git


I just noticed that repo.or.cz now supports Personal mob branches:
http://repo.or.cz/h/mob.html

Looks like a good thing for stuff of various kinds, such as work that
is not, or not yet, meant to go into mainline.

--- gr



Re: [Tinycc-devel] Generating better i386 code

2013-10-23 Thread grischka

Jason Hood wrote:

Greetings.

It's rather funny timing that a couple of topics have come up about
optimization and exe size, as I've just spent the past couple of weeks
improving the generated i386 code (most of which would also apply to
x86-64, but I've not done that).  Not sure what the protocol regarding
patches is, so for now you'll find it on pastebin, based on the 0.9.26
release (as one big diff, I'm afraid).

http://pastebin.com/vdQuhziY


I found some time to try this and I'm actually quite impressed how
this produces much better code wrt. both size and speed with quite
moderate effort.  It's almost gcc -O0 level I guess.

I really like the jump optimization part.  If TCC had, say, a generic
infrastructure to move compiled code around, it could also be
beneficial for the other targets, I guess.

I'm slightly skeptical about the register caching, i.e. how
correct can it be under all circumstances, given the hackishness
in the register handling that TCC already has.  The RESET_CACHE_IND
macro in various places is to me like a warning not to immediately
trust this. ;)  One symptom I happened to notice (on win32):

$ .\tcc.exe -O ..\tcc.c -DONE_SOURCE -DTCC_TARGET_PE -o tcc1.exe -bench
26168 idents, 65111 lines, 2198309 bytes, 0.188 s, 346335 lines/s, 11.7 MB/s
$ .\tcc1.exe -O ..\tcc.c -DONE_SOURCE -DTCC_TARGET_PE -o tcc2.exe -bench
26168 idents, 65111 lines, 2198309 bytes, 0.001 s, 65111000 lines/s, 2198.3 MB/s

Note the 0.001 s part, something must be wrong there.

There is other stuff that is probably not worth it because the gain
is minimal, such as with the split of chkstk.S.

Anyway I think this experiment is definitely worth keeping around.
I'd like to encourage you to push this on a fork, as with the "fork"
link at the top of http://repo.or.cz/w/tinycc.git

Ideally of course as a series of single patches for each feature. :)

Thanks,

--- grischka



BTW, it looks like the original source was tab-free, but some tabs have
snuck in, so you may want to (de)tabify the whole lot.  I've also made a
couple of spelling corrections.

First off, here's the results, building my tcc.exe (I'm on Windows, so
I'll also be using Intel syntax) with:

original tcc:                  225792 bytes
my tcc, without optimizations: 218624 bytes (3% reduction)
my tcc, with optimizations:    169472 bytes (25% reduction)

Build times are basically the same (using gcc, it was about 0.01s slower
to build with optimizations; using tcc, the optimized version actually
built the optimized version about 0.01s quicker than the original).

The non-optimized version is smaller, as I've made some changes
independent of the optimizations:

* 4- & 8-byte structs copy as int/long long (all targets);
* passing structs <= 8 bytes will be treated as int/long long;
* returning structs <= 8 bytes is done via (edx)eax (PE only);
* added ebx to the register list (increasing prolog by one, to save it);
* use xor r,r instead of mov r,0;
* use the eax-specific form of instructions;
* use movzx after setxx instead of mov r,0 before;
* use movsx for char & short casts, instead of shl+sar;
* use the byte form of sub esp (via enhanced gadd_sp() function);
* gcall_or_jmp() uses symbols and locals directly (like call [ebp-4]);
* use test r,r instead of cmp r,0;
* use inc/dec r instead of add/sub r,-1;
* use movzx r,br/bw instead of and r,0xff/0xffff;
* or r,-1 (should it occur) replaces its mov r,whatever;
* multiply by 0 (should it occur) becomes xor r,r (replacing its mov);
* multiply by -1 becomes neg r;
* make use of imul r,const;
* simplify the float (not) equal test (remove cmp/xor, use jpo/jpe);
* fix add in the assembler, to use the byte form when appropriate.
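
To put rough numbers on a few of the substitutions in the list above, here is a small sketch comparing encodings. The byte sequences are the standard IA-32 encodings for eax in 32-bit mode; the table layout and the names (subs, bytes_saved) are made up for illustration and are not taken from the patch.

```c
#include <stddef.h>

/* One instruction: its text, its machine code and the code length. */
struct encoding {
    const char *text;
    unsigned char bytes[5];
    size_t len;
};

/* One row per substitution: the longer form and its shorter equivalent. */
struct substitution {
    struct encoding before, after;
};

static const struct substitution subs[] = {
    { { "mov eax,0", { 0xB8, 0x00, 0x00, 0x00, 0x00 }, 5 },
      { "xor eax,eax",  { 0x31, 0xC0 }, 2 } },
    { { "cmp eax,0", { 0x83, 0xF8, 0x00 }, 3 },
      { "test eax,eax", { 0x85, 0xC0 }, 2 } },
    { { "add eax,1", { 0x83, 0xC0, 0x01 }, 3 },
      { "inc eax",      { 0x40 }, 1 } },
};

/* Total bytes saved if each substitution in the table is applied once. */
size_t bytes_saved(void)
{
    size_t i, saved = 0;
    for (i = 0; i < sizeof subs / sizeof subs[0]; i++)
        saved += subs[i].before.len - subs[i].after.len;
    return saved;
}
```

Note that the forms are not interchangeable everywhere: xor r,r modifies the flags where mov r,0 does not, and inc/dec leave the carry flag alone where add/sub set it, so a code generator can only substitute them where the flags are not live.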

To support the optimizations, o() must only be used to start an
instruction.  I've added ON macros to combine N bytes into a single
int and function og() to combine o() and g().
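
A rough sketch of how such helpers might look. g() and o() below are modelled on TCC's byte emitters (o() writes the low byte first and stops when the remaining value is zero); the O2/O3 macros and the og() signature are my guesses at the shape of the "ON" macros and og() mentioned above, not the patch's actual definitions.

```c
#include <stddef.h>

static unsigned char code[64];  /* toy output buffer */
static size_t ind;              /* next free offset, like TCC's `ind` */

/* Append one byte to the output. */
void g(int c)
{
    code[ind++] = (unsigned char)c;
}

/* Emit opcode bytes, low byte first, until nothing is left. */
void o(unsigned int c)
{
    while (c) {
        g((int)(c & 0xFF));
        c >>= 8;
    }
}

/* Assumed shape of the "ON" macros: pack N bytes into one int for o(). */
#define O2(a, b)    ((unsigned)(a) | ((unsigned)(b) << 8))
#define O3(a, b, c) (O2(a, b) | ((unsigned)(c) << 16))

/* Assumed shape of og(): opcode bytes via o(), one trailing byte via g(). */
void og(unsigned int c, int b)
{
    o(c);
    g(b);
}
```

With this, xor eax,eax would be o(O2(0x31, 0xC0)): one o() call per instruction, which is what lets an optimizer find instruction boundaries. Because o() stops at a zero value, a trailing zero byte (such as the immediate in cmp eax,0) has to go through g(), e.g. og(O2(0x83, 0xF8), 0x00).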

Optimizations are enabled by using -O, but I neglected to add them to
the help:

-Of - functions
-Oj - jumps
-Om - multiplications and pointer division
-Or - registers
-O -O2 -Ox - all optimizations
-O1 - all but -Oj (i.e. -Ofmr)
-Os - all but -Om (i.e. -Ofjr; also removes PE function alignment)
-O0 - no optimizations (default)

-Of will minimize the prolog and epilog.  The full prolog is jumped over
as usual, then when the function is finished, write only what is needed,
move everything back (adjusting relocations to suit) and write the
needed epilog.  As suggested above, I've also aligned PE functions to
16 bytes - this always happens, unless -Os is used (maybe it's not needed,
but I'm so used to seeing it in disassembly listings, it just looks wrong
without it :)).

-Oj will optimize various usages of jump.  Jumps to jmp will be replaced
with the destination of the jmp; resulting skipped jmps will be removed.
Common code before a jmp and its destination (up to eight instructions,
the reason for the o() restriction) will result in removal of the code
before the jmp, changing the jmp 

Re: [Tinycc-devel] Generating better i386 code

2013-09-27 Thread Daniel Glöckner
On Fri, Sep 27, 2013 at 11:21:19AM +1000, Jason Hood wrote:
 On 26/09/2013 16:30, Daniel Glöckner wrote:
  On Thu, Sep 26, 2013 at 03:39:45PM +1000, Jason Hood wrote:
  * 4- & 8-byte structs copy as int/long long (all targets);
  
  did you check if the structure is aligned to a multiple of 4 bytes?
  Otherwise it will crash on ARM.
 
 No, as I thought structures of these sizes would already be
 aligned (as if they were int or long long).  Is that not
 necessarily the case?

No,
struct {
char x[4];
}
has an alignment of 1 byte.
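
A small sketch of the point, assuming a C11 compiler and a common ABI where a struct takes the alignment of its most-aligned member. The names bytes4, word4 and can_copy_as_int are hypothetical and only serve the illustration.

```c
#include <stddef.h>
#include <string.h>

struct bytes4 { char x[4]; };   /* size 4, but typically alignment 1 */
struct word4  { int x; };       /* size 4, alignment of int */

/* A word-sized copy is only safe on strict-alignment targets when the
   type's alignment is a multiple of 4; size alone is not enough. */
int can_copy_as_int(size_t size, size_t align)
{
    return size == 4 && (align & 3) == 0;
}

/* Alignment-agnostic fallback: let memcpy move the four bytes. */
void copy_bytes4(struct bytes4 *dst, const struct bytes4 *src)
{
    memcpy(dst, src, sizeof *dst);
}
```

Here can_copy_as_int(sizeof(struct bytes4), _Alignof(struct bytes4)) is false on such an ABI, while the same call for struct word4 is true, even though both types are four bytes long.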

  Daniel


Re: [Tinycc-devel] Generating better i386 code

2013-09-27 Thread Jason Hood
On 27/09/2013 16:54, Daniel Glöckner wrote:
 No,
 struct {
   char x[4];
 }
 has an alignment of 1 byte.

Right, well I think that's simple enough:

--- tccgen~.c   2013-09-25 19:24:46 +1000
+++ tccgen.c    2013-09-27 19:33:08 +1000
@@ -2405,12 +2405,18 @@
 if (!nocode_wanted) {
 size = type_size(&vtop->type, &align);
 
+#ifdef TCC_TARGET_ARM
+if (!(align & 3)) {
+#endif
 if (size == 4)
 goto small_struct;
 if (size == 8) {
 ft = VT_LLONG;
 goto small_struct;
 }
+#ifdef TCC_TARGET_ARM
+}
+#endif
 
 /* destination */
 vswap();

-- 
Jason.


Re: [Tinycc-devel] Generating better i386 code

2013-09-26 Thread Robert Clausecker
Looks like a nice patch. I have some issues though.

 * 4- & 8-byte structs copy as int/long long (all targets);
 * passing structs <= 8 bytes will be treated as int/long long;
 * returning structs <= 8 bytes is done via (edx)eax (PE only);
 * added ebx to the register list (increasing prolog by one, to save
   it)

These changes look like they might change the ABI of tcc. Did you check
them for ABI compatibility? It would be great if it would still be
possible to use code generated by tcc together with code generated by
other, ABI conforming compilers.

 * simplify the float (not) equal test (remove cmp/xor, use jpo/jpe);

Does this change correctly handle NaNs? The following code must print 0
when executed:

#include <stdio.h>
#include <math.h>

int main() {
    double x = NAN;
    printf("%d\n", x == x);
}
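
For reference, a C model of why a parity-based test can be NaN-safe. I don't know which instruction sequence the patch actually emits; the sketch below models one well-known x87 idiom (fnstsw ax, then test ah,0x44 and a parity jump), and the function names are hypothetical.

```c
#include <math.h>

/* x87 condition bits as they appear in AH after "fnstsw ax". */
#define C0 0x01
#define C2 0x04
#define C3 0x40

/* Model of the condition bits fucom sets when comparing a with b. */
unsigned fucom_bits(double a, double b)
{
    if (a != a || b != b)
        return C3 | C2 | C0;    /* unordered: at least one NaN */
    if (a == b)
        return C3;
    if (a < b)
        return C0;
    return 0;                   /* a > b */
}

/* The CPU's parity flag: 1 when the low byte has an even number of set bits. */
int parity_even(unsigned v)
{
    int bits = 0;
    for (v &= 0xFF; v; v >>= 1)
        bits += (int)(v & 1);
    return (bits & 1) == 0;
}

/* "test ah,0x44 / jpo equal": only C3 alone gives odd parity. */
int x87_equal(double a, double b)
{
    return !parity_even(fucom_bits(a, b) & (C3 | C2));
}

/* Mirrors the test program from the message above. */
int nan_self_equal(void)
{
    double x = NAN;
    return x87_equal(x, x);
}
```

In this model, equal operands leave only C3 set (odd parity), an unordered result leaves C3 and C2 set (even parity), and less/greater leave neither, so the parity jump separates "equal" from everything else including NaN. Whether the patch masks the same bits would need checking against the diff.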



Re: [Tinycc-devel] Generating better i386 code

2013-09-26 Thread Daniel Glöckner
Hi Jason,

On Thu, Sep 26, 2013 at 03:39:45PM +1000, Jason Hood wrote:
 * 4- & 8-byte structs copy as int/long long (all targets);

did you check if the structure is aligned to a multiple of 4 bytes?
Otherwise it will crash on ARM.

  Daniel


Re: [Tinycc-devel] Generating better i386 code

2013-09-26 Thread Jason Hood
On 26/09/2013 16:30, Daniel Glöckner wrote:
 On Thu, Sep 26, 2013 at 03:39:45PM +1000, Jason Hood wrote:
 * 4- & 8-byte structs copy as int/long long (all targets);
 
 did you check if the structure is aligned to a multiple of 4 bytes?
 Otherwise it will crash on ARM.

No, as I thought structures of these sizes would already be
aligned (as if they were int or long long).  Is that not
necessarily the case?

-- 
Jason.


Re: [Tinycc-devel] Generating better i386 code

2013-09-25 Thread Yann Collet
that's brilliant


2013/9/26 Jason Hood jad...@yahoo.com.au

 Greetings.

 It's rather funny timing that a couple of topics have come up about
 optimization and exe size, as I've just spent the past couple of weeks
 improving the generated i386 code (most of which would also apply to
 x86-64, but I've not done that).  Not sure what the protocol regarding
 patches is, so for now you'll find it on pastebin, based on the 0.9.26
 release (as one big diff, I'm afraid).

 http://pastebin.com/vdQuhziY

 BTW, it looks like the original source was tab-free, but some tabs have
 snuck in, so you may want to (de)tabify the whole lot.  I've also made a
 couple of spelling corrections.

 First off, here's the results, building my tcc.exe (I'm on Windows, so
 I'll also be using Intel syntax) with:

 original tcc:                  225792 bytes
 my tcc, without optimizations: 218624 bytes (3% reduction)
 my tcc, with optimizations:    169472 bytes (25% reduction)

 Build times are basically the same (using gcc, it was about 0.01s slower
 to build with optimizations; using tcc, the optimized version actually
 built the optimized version about 0.01s quicker than the original).

 The non-optimized version is smaller, as I've made some changes
 independent of the optimizations:

 * 4- & 8-byte structs copy as int/long long (all targets);
 * passing structs <= 8 bytes will be treated as int/long long;
 * returning structs <= 8 bytes is done via (edx)eax (PE only);
 * added ebx to the register list (increasing prolog by one, to save it);
 * use xor r,r instead of mov r,0;
 * use the eax-specific form of instructions;
 * use movzx after setxx instead of mov r,0 before;
 * use movsx for char & short casts, instead of shl+sar;
 * use the byte form of sub esp (via enhanced gadd_sp() function);
 * gcall_or_jmp() uses symbols and locals directly (like call [ebp-4]);
 * use test r,r instead of cmp r,0;
 * use inc/dec r instead of add/sub r,-1;
 * use movzx r,br/bw instead of and r,0xff/0xffff;
 * or r,-1 (should it occur) replaces its mov r,whatever;
 * multiply by 0 (should it occur) becomes xor r,r (replacing its mov);
 * multiply by -1 becomes neg r;
 * make use of imul r,const;
 * simplify the float (not) equal test (remove cmp/xor, use jpo/jpe);
 * fix add in the assembler, to use the byte form when appropriate.

 To support the optimizations, o() must only be used to start an
 instruction.  I've added ON macros to combine N bytes into a single
 int and function og() to combine o() and g().

 Optimizations are enabled by using -O, but I neglected to add them to
 the help:

 -Of - functions
 -Oj - jumps
 -Om - multiplications and pointer division
 -Or - registers
 -O -O2 -Ox - all optimizations
 -O1 - all but -Oj (i.e. -Ofmr)
 -Os - all but -Om (i.e. -Ofjr; also removes PE function alignment)
 -O0 - no optimizations (default)

 -Of will minimize the prolog and epilog.  The full prolog is jumped over
 as usual, then when the function is finished, write only what is needed,
 move everything back (adjusting relocations to suit) and write the
 needed epilog.  As suggested above, I've also aligned PE functions to
 16 bytes - this always happens, unless -Os is used (maybe it's not needed,
 but I'm so used to seeing it in disassembly listings, it just looks wrong
 without it :)).
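
The "move everything back" step of -Of could be sketched roughly as follows. This is a simplified model, not the patch's code: the function name, the flat array of relocation offsets and the parameter names are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* A function occupies code[func_start .. end).  `reserved` bytes were
   set aside for the full prolog, but only `needed` bytes turn out to be
   required.  Slide the body back over the unused gap, shift the offsets
   of relocations that point into the body, and return the new end. */
size_t shrink_prolog(unsigned char *code, size_t func_start,
                     size_t reserved, size_t needed, size_t end,
                     size_t *reloc_offsets, size_t nrelocs)
{
    size_t gap = reserved - needed;
    size_t body = func_start + reserved;
    size_t i;

    /* Regions overlap, so memmove rather than memcpy. */
    memmove(code + body - gap, code + body, end - body);

    for (i = 0; i < nrelocs; i++)
        if (reloc_offsets[i] >= body)
            reloc_offsets[i] -= gap;

    return end - gap;
}
```

For instance, with 9 bytes reserved, 3 needed and a function ending at offset 12, the body moves back by 6 bytes and the function ends at offset 6. Relative jumps inside the body keep working because source and target move together.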

 -Oj will optimize various usages of jump.  Jumps to jmp will be replaced
 with the destination of the jmp; resulting skipped jmps will be removed.
 Common code before a jmp and its destination (up to eight instructions,
 the reason for the o() restriction) will result in removal of the code
 before the jmp, changing the jmp destination.  Casting to boolean will
 use setxx/movzx or stc/sbb/inc when appropriate.  Conditional jumps
 over a jmp will invert the condition and change the destination,
 removing the jmp.  Jumps to the epilog will be replaced with the epilog
 itself (if it's only one or two bytes with -Os).  Appropriate near jumps
 will be converted to short.
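
The first of these, replacing a jump to a jmp with the jmp's own destination, can be illustrated on a toy instruction list. This is only a sketch of the general technique (often called jump threading); the struct insn representation and function names are invented here, and the patch works on emitted machine code rather than on a list like this.

```c
#include <stddef.h>

enum kind { K_OTHER, K_JMP, K_JCC };

struct insn {
    enum kind kind;
    size_t target;      /* destination index, for K_JMP and K_JCC */
};

/* Follow a chain of unconditional jmps starting at index t.
   The hop limit guards against a cycle of jmps. */
static size_t final_target(const struct insn *code, size_t n, size_t t)
{
    size_t hops = 0;
    while (t < n && code[t].kind == K_JMP && hops++ < n)
        t = code[t].target;
    return t;
}

/* Retarget every jump whose destination is itself a jmp.
   Returns the number of jumps that were changed. */
size_t thread_jumps(struct insn *code, size_t n)
{
    size_t i, changed = 0;
    for (i = 0; i < n; i++) {
        size_t t;
        if (code[i].kind == K_OTHER)
            continue;
        t = final_target(code, n, code[i].target);
        if (t != code[i].target) {
            code[i].target = t;
            changed++;
        }
    }
    return changed;
}
```

After this pass, a jmp that is no longer the target of anything and cannot be reached by falling through is dead, which is where the "resulting skipped jmps will be removed" step would come in.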

 -Om will use lea (possibly followed by add, shl or another lea) to do
 appropriate constant multiplication.  Pointer division is done by
 reciprocal multiplication (which should probably also be used for normal
 division, don't know why I didn't).
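
Both ideas can be written out in C to show the arithmetic. The lea forms in the comments are the standard x86 addressing-mode tricks; the division constant is the usual one for unsigned division by 3. Which constants and sequences the patch picks is not stated above, so treat this purely as an illustration with made-up function names.

```c
#include <stdint.h>

/* x*5 as a single "lea r,[r+r*4]". */
uint32_t mul5(uint32_t x)
{
    return x + (x << 2);
}

/* x*10 as "lea r,[r+r*4]" followed by "add r,r". */
uint32_t mul10(uint32_t x)
{
    uint32_t t = x + (x << 2);
    return t + t;
}

/* x*45 as two leas: [r+r*8] gives x*9, then [r+r*4] gives x*45. */
uint32_t mul45(uint32_t x)
{
    uint32_t t = x + (x << 3);
    return t + (t << 2);
}

/* Unsigned x/3 by reciprocal multiplication:
   0xAAAAAAAB is (2^33 + 1) / 3, so the 64-bit product shifted right
   by 33 equals floor(x / 3) for every 32-bit x. */
uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```

A pointer difference over 12-byte elements could then be computed as a shift right by 2 followed by div3, avoiding a div instruction entirely.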

 -Or improves register usage.  Previous values are remembered (this would
 ideally be done as part of tccgen).  Appropriate function arguments are
 pushed directly.  A load const/store pair stores the const directly.
 Suitable adds are turned into a displacement (greatly improving struct
 and long long access).
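
"Previous values are remembered" can be pictured as a small cache of what each register is known to hold. The sketch below is a toy model with invented names, limited to constants; the patch's own bookkeeping is not described in enough detail above to reproduce it.

```c
enum { NB_CACHED_REGS = 4 };        /* eax, ecx, edx, ebx in this toy model */

static int  known[NB_CACHED_REGS];  /* 1 if the register's value is known */
static long value[NB_CACHED_REGS];  /* the remembered constant */

/* Forget everything, e.g. at a label where control flow merges. */
void cache_reset(void)
{
    int r;
    for (r = 0; r < NB_CACHED_REGS; r++)
        known[r] = 0;
}

/* Forget one register after an instruction overwrites it. */
void cache_clobber(int r)
{
    known[r] = 0;
}

/* Request "mov r,v".  Returns 1 if the mov has to be emitted,
   0 if r is already known to hold v and the mov can be skipped. */
int load_const(int r, long v)
{
    if (known[r] && value[r] == v)
        return 0;
    known[r] = 1;
    value[r] = v;
    return 1;
}
```

The correctness of such a scheme depends on resetting or clobbering at every point where a register may change behind the cache's back (calls, jump targets, instructions with implicit operands); a missed reset silently produces wrong code.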

 A couple of things I didn't do were combining arithmetic operators (even
 though register displacement combines adds) or removing unused locals
 (remembering register values means writing to a temporary probably won't
 read from it).  And doing it all for x86-64 (in particular, returning
 small structs should be done, as that's expected by Windows).

 In addition, I've tweaked the Win32 build.  Build-tcc.bat will