--- Comment #21 from ubizjak at gmail dot com 2007-04-06 07:37 ---
Strange things happen.
I have completely removed the gcc build directory and bootstrapped gcc from
scratch. To my surprise, the difference with -msse and without -msse is now
gone and the optimized dumps are now the same. For
--- Comment #13 from bonzini at gnu dot org 2007-04-05 11:01 ---
So this is an unstable sort. Adding dnovillo.
--
--- Comment #12 from ubizjak at gmail dot com 2007-04-05 11:00 ---
(In reply to comment #11)
with -msse compile flag. Note the different variable suffixes that create a
different sort order. This is (IMO) due to the fact that -msse enables lots of
additional __builtin functions (these can
--- Comment #11 from ubizjak at gmail dot com 2007-04-05 10:58 ---
(In reply to comment #10)
> I would look at the lreg output, which contains the results of regclass.
No, the difference is due to the SSA pass, which generates:
# v1z_10 = PHI <v1z_13(2), v1z_32(3)>
# v1y_9 = PHI <v1y_12(2),
--- Comment #14 from dnovillo at gcc dot gnu dot org 2007-04-05 12:49 ---
(In reply to comment #11)
> So, why does the SSA pass have to interfere with computation dataflow? This
> interference makes things worse and effectively takes away the user's control
> over the flow of data.
Huh? How
--- Comment #15 from bonzini at gnu dot org 2007-04-05 13:03 ---
Transformations do not, but out-of-SSA could. Is there a way to ensure
ordering of PHI functions unlike what Uros's dumps suggest?
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780
--- Comment #16 from dnovillo at redhat dot com 2007-04-05 13:15 ---
Subject: Re: Floating point computation far slower for -mfpmath=sse
bonzini at gnu dot org wrote on 04/05/07 08:03:
> Is there a way to ensure ordering of PHI functions unlike what Uros's
> dumps suggest?
No.
I
--- Comment #17 from amacleod at redhat dot com 2007-04-05 14:23 ---
Is the output from .optimized different (once the SSA version numbers have
been stripped)? Those PHIs should be irrelevant; the question is whether the
different versioning has any effect.
The only way I can
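
The comparison asked for in comment #17 could be scripted roughly as follows.
This is only a sketch, not something taken from the thread: it assumes SSA
names appear in the dumps as name_N (as in v1z_10), and the helper names
strip_ssa_versions and dumps_match are made up for illustration. Note that the
pattern would also strip a numeric suffix from an ordinary identifier such as
foo_1, so a reported match should be read with that in mind.

```python
import re

# Assumption: an SSA version is a trailing "_<digits>" on an identifier,
# e.g. "v1z_10" or "v1z_13(2)" inside a PHI node.
_SSA_VERSION = re.compile(r"(?<=\w)_\d+\b")


def strip_ssa_versions(text):
    """Return the dump text with every trailing _<digits> suffix removed."""
    return _SSA_VERSION.sub("", text)


def dumps_match(dump_a, dump_b):
    """Compare two dumps line by line after stripping SSA version numbers."""
    lines_a = strip_ssa_versions(dump_a).splitlines()
    lines_b = strip_ssa_versions(dump_b).splitlines()
    return lines_a == lines_b
```

To use it, read the two .optimized dump files into strings and pass them to
dumps_match; a False result means the dumps differ in more than the version
numbering.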
--- Comment #18 from ubizjak at gmail dot com 2007-04-05 16:39 ---
(In reply to comment #17)
> Is the output from .optimized different (once the SSA version numbers have
> been stripped)? Those PHIs should be irrelevant; the question is whether the
> different versioning has any
--- Comment #19 from amacleod at redhat dot com 2007-04-05 17:24 ---
What are you using for a compiler? I'm using a mainline from mid-March, and
with it my .optimized files are identical under diff, yet I still get the
aforementioned time differences in the executables.
(sse.c and sse-bad.c are
--- Comment #20 from ubizjak at gmail dot com 2007-04-05 19:39 ---
(In reply to comment #19)
> what are you using for a compiler? I'm using a mainline from mid-March, and
gcc version 4.3.0 20070404 (experimental) on i686-pc-linux-gnu
> with it, my .optimized files diff exactly the same,
--- Comment #8 from bonzini at gnu dot org 2007-04-03 12:43 ---
What's the generated code for -ffast-math? In principle I don't see a reason
why it should make any difference...
--- Comment #9 from ubizjak at gmail dot com 2007-04-03 13:32 ---
(In reply to comment #8)
> What's the generated code for -ffast-math? In principle I don't see a reason
> why it should make any difference...
Trying to answer your question, I have played a bit with compile flags and
--- Comment #10 from bonzini at gnu dot org 2007-04-03 13:36 ---
I would look at the lreg output, which contains the results of regclass.
--- Comment #6 from uros at kss-loka dot si 2006-10-25 12:04 ---
(In reply to comment #5)
> With more registers (x86_64) the stack moves are gone, but: (!)
(testing done on AMD Athlon fam 15 model 35 stepping 2)
On Xeon 3.6, SSE is now faster:
gcc -O2 -march=pentium4 -mfpmath=387
--- Comment #7 from uros at kss-loka dot si 2006-10-25 12:18 ---
(In reply to comment #6)
> On Xeon 3.6, SSE is now faster:
... but for -ffast-math:
SSE: user 0m0.756s
x87: user 0m0.612s
Yes, x87 is faster for -ffast-math by some 20%.
--
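
For reference, the "some 20%" in comment #7 can be checked from the two user
times quoted there. The sketch below is only illustrative arithmetic; the
function name relative_difference is made up, and it reports the gap both ways,
since the figure depends on which time is taken as the baseline.

```python
def relative_difference(slower, faster):
    """Return (percent time saved by the faster run,
               percent extra time needed by the slower run)."""
    gap = slower - faster
    time_saved = gap / slower * 100.0   # baseline: the slower run
    extra_time = gap / faster * 100.0   # baseline: the faster run
    return time_saved, extra_time


# User times for -ffast-math as quoted in comment #7.
sse_user = 0.756
x87_user = 0.612

saved, extra = relative_difference(sse_user, x87_user)
print(f"x87 needs {saved:.1f}% less user time; SSE needs {extra:.1f}% more")
```

With these inputs x87 needs about 19% less user time than SSE, or equivalently
SSE needs about 23.5% more, which agrees with the rough 20% figure.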
--- Comment #5 from rguenth at gcc dot gnu dot org 2006-10-24 13:28 ---
With more registers (x86_64) the stack moves are gone, but: (!)
[EMAIL PROTECTED]:/abuild/rguenther/trunk-g/gcc ./xgcc -B. -O2 -o t t.c -mfpmath=387
[EMAIL PROTECTED]:/abuild/rguenther/trunk-g/gcc /usr/bin/time ./t
--- Comment #4 from bonzini at gnu dot org 2006-08-11 10:22 ---
Except that PPC uses 12 registers f0 f6 f7 f8 f9 f10 f11 f12 f13 f29 f30 f31.
Not that we can blame GCC for using 12, but it is not a fair comparison. :-)
In fact, 8 registers are enough, but it is quite tricky to obtain
--- Additional Comments From pinskia at gcc dot gnu dot org 2005-09-29 04:06 ---
Oh, and this looks closely related to the two-operand instruction issue.
PPC gives optimal code:
L2:
fmul f0,f6,f9
fmul f13,f7,f10
fmul f12,f8,f11
fmsub f29,f8,f10,f0
--- Additional Comments From pinskia at gcc dot gnu dot org 2005-09-29 04:05 ---
Confirmed. This is weird, and it is an RA (register allocator) issue. I don't
understand why the RA is spilling to the stack, as there are enough SSE
registers to hold the 6 values.
--