[Bug rtl-optimization/21395] Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0

2008-03-23 Thread michaelni at gmx dot at


--- Comment #11 from michaelni at gmx dot at  2008-03-24 00:08 ---
Subject: Re:  Performance degradation when
building code that uses MMX intrinsics with gcc-4.0.0

On Sun, Mar 23, 2008 at 10:46:41AM -, ubizjak at gmail dot com wrote:
 
 
 --- Comment #10 from ubizjak at gmail dot com  2008-03-23 10:46 ---
 (In reply to comment #9)
 
  So on my duron 4.3 seems to beat 4.4 as i expected from the generated asm.
 
 Can you tell from code dumps of 4.4 vs 4.3, where you think that 4.4 code is
 worse than 4.3 for Duron? For Core2, 4.4 avoids store forwarding stall, but
 I'm not sure why Duron prefers moves via memory instead of keeping values in
 %mm registers.

--- freaky_mmx_code-4.3.s   2008-03-24 00:48:11.0 +0100
+++ freaky_mmx_code-4.4.s   2008-03-24 00:48:03.0 +0100
...
 .L24:
        movl    -36(%ebp), %eax
        testl   %ebx, %ebx
        movl    (%edi,%esi,4), %edx
        movl    (%eax,%esi,4), %ecx
@@ -182,113 +183,102 @@
        xorl    %eax, %eax
        movq    -24(%ebp), %mm2
        .p2align 4,,7
        .p2align 3
 .L23:
        movq    (%ecx,%eax,2), %mm0
        psubw   (%edx,%eax,2), %mm0
        addl    $4, %eax
        cmpl    %eax, %ebx
        movq    %mm0, %mm1
-       psraw   $15, %mm0
-       pxor    %mm0, %mm1
-       psubw   %mm0, %mm1
-       movq    %mm1, %mm0
-       punpcklwd  %mm1, %mm1
-       punpckhwd  %mm3, %mm0
-       psrad   $16, %mm1
-       paddd   %mm0, %mm1
-       paddd   %mm1, %mm2
+       psraw   $15, %mm1
+       pxor    %mm1, %mm0
+       psubw   %mm1, %mm0
+       movq    %mm0, %mm1
+       punpcklwd  %mm0, %mm0
+       punpckhwd  %mm3, %mm1
+       psrad   $16, %mm0
+       paddd   %mm1, %mm0
+       paddd   %mm0, %mm2
        movq    %mm2, -24(%ebp)
        jg      .L23
 .L22:
        addl    $1, %esi
-       cmpl    %esi, -40(%ebp)
-       jg      .L24
+       cmpl    -40(%ebp), %esi
+       jl      .L24

...
-       .ident  "GCC: (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)"
+       .ident  "GCC: (GNU) 4.4.0 20080321 (experimental)"
--
What I _think_ makes 4.4 slower on the Duron is that
        psraw   $15, %mm1
reads a register which was written by the immediately preceding instruction
(movq %mm0, %mm1), while 4.3 chose the other register, which contains the same
value. 4.4 simply has a longer dependency chain than 4.3.

PS: both compiled with -mmmx -O2 -S
PS2: 4.3 is from debian, 4.4 is from gcc svn

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395



[Bug rtl-optimization/21395] Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0

2008-03-22 Thread michaelni at gmx dot at


--- Comment #9 from michaelni at gmx dot at  2008-03-23 02:49 ---
Subject: Re:  Performance degradation when
building code that uses MMX intrinsics with gcc-4.0.0

On Sat, Mar 22, 2008 at 11:01:55AM -, ubizjak at gmail dot com wrote:
 
 
 --- Comment #8 from ubizjak at gmail dot com  2008-03-22 11:01 ---
 (In reply to comment #6)
  As Uros has challenged me to beat performance of gcc-4.4 generated code by
  hand-crafted assembly using the example of PR 21395 heres my entry, sadly i
  only have gcc-4.3 compiled ATM for comparison but 4.3 generates better
  code than 4.4 so i guess thats ok its inner loop is:
 
 Not!
 
 This is the comparison of runtimes for the original test, comparing 4.3.0 vs
 4.4.0 compiled code on core2D EE:
 
 $ g++ -V 4.3.0 -m32 -march=core2 -O2 mmx.cpp
 $ time ./a.out
 144
 
 real    0m0.619s
 user    0m0.620s
 sys     0m0.000s
 
 $ g++ -V 4.4.0 -m32 -march=core2 -O2 mmx.cpp
 $ time ./a.out
 144
 
 real    0m0.398s
 user    0m0.400s
 sys     0m0.000s

On my Duron with -O2 -mmmx I get:

g++-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
144

real    0m2.077s
user    0m1.912s
sys     0m0.019s


g++-4.4 (GCC) 4.4.0 20080321 (experimental)
144

real    0m2.172s
user    0m2.004s
sys     0m0.021s


With -m32 -march=core2 (incorrect, as it doesn't match the CPU!):

g++-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
144

real    0m3.644s
user    0m3.389s
sys     0m0.022s


g++-4.4 (GCC) 4.4.0 20080321 (experimental)
Illegal instruction (yes yes, I know I asked for it)

real    0m0.011s
user    0m0.003s
sys     0m0.007s


So on my Duron 4.3 seems to beat 4.4, as I expected from the generated asm.



 
 gcc 4.4.0 with your modified computation kernel:
 
 $ g++ -m32 -march=core2 -O2 mmx-1.cpp
 $ time ./a.out
 144
 
 real    0m0.309s
 user    0m0.308s
 sys     0m0.000s
 
 To be honest, I didn't expect you to completely rewrite the computation
 kernel, so we are comparing apples to oranges.

Well nothing stops gcc from rewriting the intrinsics either :)


 However, you can rewrite your ASM code
 using intrinsic functions from __mmintrin.h, and you will get all
 optimizations (scheduling, unrolling, etc) for free, while you are still in
 control of code generation on a fairly low level. Using intrinsics, you leave
 to the compiler things that the compiler is good at (loop handling, register
 allocation, scheduling).
 
 Are you interested in this experiment? 

I am certainly interested, but I am a little busy with Google Summer of Code
students at the moment. We have to choose wisely which applications and
students we select for ffmpeg this summer, which means a lot of reviewing of
the code the students submit as qualification tasks.
So I won't rewrite this in intrinsics, at least not anytime soon.


 The results of this experiment would
 perhaps be interesting to ffmpeg people to consider rewriting their asm blocks
 into intrinsics.

Well ...
I am not a friend of intrinsics, but I think you guessed that already :)
The thing I like about asm() is that it produces the same performance and code
with every compiler. It's largely a write-once-and-forget thing. A problem
with asm() is almost always of the compile-time-error sort, like
"can't find a register in class blah"; these things are visible and can be
dealt with.
With intrinsics it's all a gamble; just look at this PR and how hugely
performance differs between gcc versions. If ffmpeg were using intrinsics
instead of asm, we would have to spend considerable time dealing with such
variations somehow.


 
 And really thanks for your detailed benchmark results! And since your
 computation kernel is already 30% faster than current implementation, I'm sure
 that Dirac people (in CC of this PR) will be very interested in your
 computational kernel.

Yes, I am also fine with them using it under whichever FOSS license they want.

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395



[Bug rtl-optimization/21395] Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0

2008-03-21 Thread michaelni at gmx dot at


--- Comment #6 from michaelni at gmx dot at  2008-03-22 02:15 ---
As Uros has challenged me to beat the performance of gcc-4.4 generated code
with hand-crafted assembly, using the example of PR 21395, here is my entry.
Sadly I only have gcc-4.3 compiled ATM for comparison, but 4.3 generates better
code than 4.4, so I guess that's OK. Its inner loop is:
.L23:
        movq    (%ecx,%eax,2), %mm0
        psubw   (%edx,%eax,2), %mm0
        addl    $4, %eax
        cmpl    %eax, %ebx
        movq    %mm0, %mm1
        psraw   $15, %mm0
        pxor    %mm0, %mm1
        psubw   %mm0, %mm1
        movq    %mm1, %mm0
        punpcklwd  %mm1, %mm1
        punpckhwd  %mm3, %mm0
        psrad   $16, %mm1
        paddd   %mm0, %mm1
        paddd   %mm1, %mm2
        movq    %mm2, -24(%ebp)
        jg      .L23

It's better because the psraw doesn't depend on the result of the preceding movq.
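
For readers who don't read MMX fluently, what this loop computes per 16-bit lane can be modelled in plain C. This is only a sketch under assumptions: the names abs16_via_mask and sum_abs_diff are made up for illustration, the code relies on gcc's behaviour that >> on a negative int is an arithmetic shift, and it ignores the four-lanes-at-a-time packing of the real loop.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless absolute value of one 16-bit lane, mirroring the
 * psraw $15 / pxor / psubw sequence. */
static int16_t abs16_via_mask(int16_t d)
{
    int16_t mask = (int16_t)(d >> 15);       /* 0 for d >= 0, -1 for d < 0 */
    return (int16_t)((d ^ mask) - mask);     /* flip the bits, then add 1, only if negative */
}

/* Scalar model of the loop: accumulate |a[i] - b[i]| into a 32-bit sum. */
static int32_t sum_abs_diff(const int16_t *a, const int16_t *b, int n)
{
    assert(n >= 0);
    int32_t sum = 0;
    for (int i = 0; i < n; i++) {
        int16_t d = (int16_t)(a[i] - b[i]);  /* psubw: difference kept in 16 bits */
        sum += abs16_via_mask(d);
    }
    return sum;
}
```

The mask and the copy of the difference hold related values, which is why the compiler is free to pick either register for the shift; the two gcc versions simply made different picks.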

Now here's my code (this is naively written and not unrolled or hand scheduled;
it also uses hardcoded registers, so I suspect it can be improved further ...)
int SimpleBlockDiff::Diff ()
{
#ifdef __MMX__
    int sum;
    int x1b=-2*xl;
    int ylb= yl;

    asm volatile(
        "xorl %%edx, %%edx          \n\t"
        "pcmpeqw %%mm6, %%mm6       \n\t"
        "pxor %%mm7, %%mm7          \n\t"
        "psrlw $15, %%mm6           \n\t"
        "1:                         \n\t"
        "movl (%1, %%edx, 4), %%eax \n\t"
        "movl (%2, %%edx, 4), %%esi \n\t"
        "movl %3, %%ecx             \n\t"
        "subl %%ecx, %%eax          \n\t"
        "subl %%ecx, %%esi          \n\t"
        "2:                         \n\t"
        "pxor %%mm1, %%mm1          \n\t"
        "movq  (%%eax, %%ecx), %%mm0\n\t"
        "psubw (%%esi, %%ecx), %%mm0\n\t"
#if 0
        "psubw %%mm0, %%mm1         \n\t"
        "pmaxsw %%mm1, %%mm0        \n\t"
#else
        "pcmpgtw %%mm0, %%mm1       \n\t"
        "pxor %%mm1, %%mm0          \n\t"
        "psubw %%mm1, %%mm0         \n\t"
#endif
        "pmaddwd %%mm6, %%mm0       \n\t"
        "paddd %%mm0, %%mm7         \n\t"
        "addl $8, %%ecx             \n\t"
        " jnz 2b                    \n\t"
        "incl %%edx                 \n\t"
        "cmpl %%edx, %4             \n\t"
        " jnz 1b                    \n\t"
        "movq %%mm7, %%mm0          \n\t"
        "psrlq $32, %%mm7           \n\t"
        "paddd %%mm7, %%mm0         \n\t"
        "movd %%mm0, %0             \n\t"
        :"=g" (sum)
        :"r" (pic_data), "r" (ref_data), "m"(x1b), "m"(ylb)
        : "%eax", "%esi", "%ecx", "%edx"
    );
    return sum;
--
and benchmarks:

on a duron:
gcc-4.3:
real    0m2.034s
user    0m1.882s
sys     0m0.017s

asm:
real    0m1.312s
user    0m1.208s
sys     0m0.016s

on a 500mhz pentium3:
gcc-4.3:
real    0m4.021s
user    0m3.767s
sys     0m0.009s

asm:
real    0m2.827s
user    0m2.565s
sys     0m0.055s


-- 

michaelni at gmx dot at changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |michaelni at gmx dot at


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-21 Thread michaelni at gmx dot at


--- Comment #37 from michaelni at gmx dot at  2008-03-22 02:39 ---
Subject: Re:  compiled trivial vector intrinsic code is
inefficient

On Fri, Mar 21, 2008 at 10:34:00AM -, ubizjak at gmail dot com wrote:
 
 
 --- Comment #36 from ubizjak at gmail dot com  2008-03-21 10:33 ---
 (In reply to comment #35)
 
  Also, ffmpeg uses almost entirely asm() instead of intrinsics, so this alone
  is not so much a problem for ffmpeg as it is for others who followed the
  recommendation that intrinsics are better than asm.
  
  About trolling: well, I made no attempt to reply politely and
  diplomatically, no. But solving a problem in some use case by dropping
  support for that use case is kind of extreme.
  
  The way I see it is that
  * It's non-trivial to place emms optimally and automatically
  * there needs to be an emms between MMX code and FPU code
  
  The solutions to this would be any one of
  A. let the programmer place emms, as has been done in the past
  B. don't support MMX at all
  C. don't support the x87 FPU at all
  D. place emms after every bunch of MMX instructions
  E. solve a quite non-trivial problem and place emms optimally
  
  The solution which has apparently been selected is B. Why was that chosen,
  instead of, let's say, A?
  
  If I write SIMD code then I know that I need an emms on x86. It's
  trivial for the programmer to place it optimally.
 
 I don't know where you get the idea that MMX support was dropped in any way. I

Maybe because the SIMD code in this PR, compiled with -mmmx, does not use MMX
instructions but significantly less efficient integer instructions. And you
added a test to gcc which ensures that this case does not use MMX instructions.

This is pretty much the definition of dropping MMX support (for this specific
case).


 won't engage in a discussion about autovectorisation, intrinsics, builtins,
 generic vectorisation, etc, etc with you,

And somehow I am glad about that.


 but please look at PR 21395 how
 performance PR should be filed. 

 The MMX code in that PR is _far_ from trivial,

Well, that is something I would disagree with.


 but since it is well written using intrinsic instructions, it enables
 jaw-dropping performance increase that is simply not possible when ASM blocks
 are used.
 
 Now, I'm sure that you have your numbers ready to back up your claims from
 Comment #33 about performance of generated code, and I challenge you to beat
 performance of gcc-4.4 generated code by hand-crafted assembly using the
 example of PR 21395.

done, 
jaw-dropping intrinsics need 
2.034s 

stinky hand written asm needs 
1.312s

But you can read the details in PR 21395.

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug rtl-optimization/21395] Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0

2008-03-21 Thread michaelni at gmx dot at


--- Comment #7 from michaelni at gmx dot at  2008-03-22 02:51 ---
You can also replace the inner loop by:

"2:                         \n\t"
"pxor %%mm1, %%mm1          \n\t"
"movq  (%%eax, %%ecx), %%mm0\n\t"
"psubw (%%esi, %%ecx), %%mm0\n\t"
"pcmpgtw %%mm0, %%mm1       \n\t"
"por %%mm6, %%mm1           \n\t"
"pmaddwd %%mm1, %%mm0       \n\t"
"paddd %%mm0, %%mm7         \n\t"
"addl $8, %%ecx             \n\t"
" jnz 2b                    \n\t"

This has one instruction less; it's a hair faster on my P3 but a little slower
on my Duron.
And of course the most obvious optimization is to unroll this and do a bunch of
them at once.
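
The trick in this shorter loop is that pcmpgtw yields 0 or -1 per lane, OR-ing in the all-ones-shifted constant from %mm6 turns that into +1 or -1, and pmaddwd then multiplies and sums adjacent lanes in one go. A scalar C sketch of one pmaddwd output lane follows; the function names are invented for illustration and this is a model of the arithmetic only, not of the register usage.

```c
#include <assert.h>
#include <stdint.h>

/* pcmpgtw against zero gives 0 or -1 (all ones); por with 1 turns
 * that into a multiplier of +1 or -1. */
static int16_t sign_multiplier(int16_t d)
{
    int16_t mask = (int16_t)((0 > d) ? -1 : 0);
    int16_t m = (int16_t)(mask | 1);
    assert(m == 1 || m == -1);
    return m;
}

/* One pmaddwd output lane: two signed 16x16 products added as 32-bit. */
static int32_t madd_pair(int16_t x0, int16_t m0, int16_t x1, int16_t m1)
{
    return (int32_t)x0 * m0 + (int32_t)x1 * m1;
}

/* |d0| + |d1| computed the way the shortened inner loop does it. */
static int32_t abs_sum_pair(int16_t d0, int16_t d1)
{
    return madd_pair(d0, sign_multiplier(d0), d1, sign_multiplier(d1));
}
```

Because the products are widened to 32 bits before the addition, a difference of -32768 contributes 32768 here, whereas a 16-bit negate would wrap.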


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-20 Thread michaelni at gmx dot at


--- Comment #35 from michaelni at gmx dot at  2008-03-20 17:18 ---
Subject: Re:  compiled trivial vector intrinsic code is
inefficient

On Thu, Mar 20, 2008 at 09:49:22AM -, ubizjak at gmail dot com wrote:
 
 
 --- Comment #34 from ubizjak at gmail dot com  2008-03-20 09:49 ---
 (In reply to comment #33)
 
  Anyway iam glad ffmpeg compiles fine under icc.
 
 Me too. Now you will troll in their support lists.

No, truth be told, I don't plan to switch to icc yet. Somehow I do prefer to
use free tools. Of course, if the gap becomes too big, I as well as most others
will switch to icc ...
Also, ffmpeg uses almost entirely asm() instead of intrinsics, so this alone is
not so much a problem for ffmpeg as it is for others who followed the
recommendation that intrinsics are better than asm.

About trolling: well, I made no attempt to reply politely and diplomatically,
no. But solving a problem in some use case by dropping support for that use
case is kind of extreme.

The way I see it is that
* It's non-trivial to place emms optimally and automatically
* there needs to be an emms between MMX code and FPU code

The solutions to this would be any one of
A. let the programmer place emms, as has been done in the past
B. don't support MMX at all
C. don't support the x87 FPU at all
D. place emms after every bunch of MMX instructions
E. solve a quite non-trivial problem and place emms optimally

The solution which has apparently been selected is B. Why was that chosen,
instead of, let's say, A?

If I write SIMD code then I know that I need an emms on x86. It's
trivial for the programmer to place it optimally.
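
To make option A concrete, here is a minimal sketch of a programmer placing emms by hand between MMX code and floating-point code, written with the intrinsics from mmintrin.h, where _mm_empty() is the intrinsic that emits emms. The function name scaled_lane_sum is hypothetical, and the non-MMX fallback branch is only there so that the sketch builds on other targets. On x86-64 the double arithmetic normally uses SSE, so the emms only really matters when the x87 FPU is in use, as on 32-bit x86.

```c
#include <assert.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Adds two 16-bit values in an MMX register, then uses the result in
 * floating-point arithmetic. The emms sits exactly once, after the
 * last MMX instruction and before the first FPU instruction. */
static double scaled_lane_sum(short a, short b, double scale)
{
    short lane0;
#if defined(__MMX__)
    __m64 va = _mm_set1_pi16(a);
    __m64 vb = _mm_set1_pi16(b);
    __m64 vs = _mm_add_pi16(va, vb);     /* paddw */
    int low = _mm_cvtsi64_si32(vs);      /* movd: the two low 16-bit lanes */
    _mm_empty();                         /* emms, placed by the programmer */
    lane0 = (short)(low & 0xFFFF);
#else
    lane0 = (short)(a + b);              /* assumed fallback without MMX */
#endif
    return lane0 * scale;                /* floating point only after emms */
}
```

For example, assert(scaled_lane_sum(3, 4, 0.5) == 3.5) holds on either path.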

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread michaelni at gmx dot at


--- Comment #33 from michaelni at gmx dot at  2008-03-20 01:37 ---
Subject: Re:  compiled trivial vector intrinsic code is
inefficient

On Wed, Mar 19, 2008 at 11:39:18PM -, uros at gcc dot gnu dot org wrote:
 
 
 --- Comment #26 from uros at gcc dot gnu dot org  2008-03-19 23:39 ---
 Subject: Bug 14552
[...]
 * gcc.target/i386/pr14552.c: New test.
 
 
 Added:
 trunk/gcc/testsuite/gcc.target/i386/pr14552.c

Thanks, I was already scared that the inversely proportional relation between
version number and performance, which has been followed so nicely since 2.95,
would stop.
Adding a test to the testsuite to ensure that MMX intrinsics don't use
MMX registers is, well, just brilliant.
I am already eagerly awaiting the testcase which will check that floating
point code doesn't use the FPU; I assume that will happen in gcc 5.0?

Anyway, I am glad ffmpeg compiles fine under icc.

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug c/35058] New: -Werror= works only with some warnings

2008-02-02 Thread michaelni at gmx dot at
-Werror=declaration-after-statement and -Werror=pointer-arith
only generate warnings, not errors.

Example
-------
void *a;

void *test(){
    if(a=a) a++;
    int x=5;
    return a+x;
}

gcc-4.3 -Werror=declaration-after-statement -Werror=pointer-arith testX.c -c -o testX


Adding -Werror=parentheses generates an error as expected, though.
Also, interestingly, -fdiagnostics-show-option shows only [-Wparentheses] and
nothing for the other 2 warnings.

Note, this issue has been seen on x86-32 and ppc


-- 
           Summary: -Werror= works only with some warnings
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: michaelni at gmx dot at


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35058



[Bug inline-asm/11203] source doesn't compile with -O0 but they compile with -O3

2007-02-27 Thread michaelni at gmx dot at


--- Comment #39 from michaelni at gmx dot at  2007-02-27 22:50 ---
(In reply to comment #38)
 (In reply to comment #37)
  Now, if there is an unwritten rule that m operands and variations of them
  cannot be copied anywhere, then it would be very desirable to have an asm
  constraint like m without this restriction; this would resolve this and
  several other bugs.
  It would also be very nice if such a don't-copy restriction on m, if it does
  exist, could be documented.
 
 Copying m operands onto the stack might not be such a great thing to wish
 for.  Imagine if you used asm("movaps %%xmm0, %0": "=m"(x[i]));  If x[i] is
 only 32-bits, and gcc copied it onto the stack, then writing 16 bytes with
 movaps wouldn't also write to x[i+1] to x[i+3] as intended.  I know there is
 plenty of asm code in ffmpeg that overwrites or overreads memory operands and
 will fail if gcc tried to move them onto the stack.  There is also alignment.
 movaps requires an aligned address, and maybe you have chosen x and i in such
 a way that it will be aligned.  But when gcc copies the value onto the stack,
 how is it supposed to know what alignment it needs?

Well, the data type used in m() must of course be correct, which here means a
128-bit type. Alignment can be handled as with all other types: double also
gets aligned if the architecture needs it, so a uint128_t or sse128 or whatever
can be as well. The example you show is a fairly obscure special case with
respect to moving m operands to the stack. In the end there is a need for an
m-like constraint which must not be movable and an m-like constraint which may
be moved (to the stack, for example); the exact letters used are irrelevant.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11203



[Bug inline-asm/11203] source doesn't compile with -O0 but they compile with -O3

2006-11-08 Thread michaelni at gmx dot at


--- Comment #37 from michaelni at gmx dot at  2006-11-08 20:45 ---
(In reply to comment #36)
 (In reply to comment #21)
  asm volatile(
      ""
      : "=m" (*(unsigned int*)(src + 0*stride)),
        "=m" (*(unsigned int*)(src + 1*stride)),
        "=m" (*(unsigned int*)(src + 2*stride)),
        "=m" (*(unsigned int*)(src + 3*stride)),
        "=m" (*(unsigned int*)(src + 4*stride)),
        "=m" (*(unsigned int*)(src + 5*stride)),
        "=m" (*(unsigned int*)(src + 6*stride)),
        "=m" (*(unsigned int*)(src + 7*stride))
  );
 
 (In reply to comment #26)
  it might also happen that in some intentionally overconstrained cases it
  ends up searching the whole 5040 possible assignments of 7 registers onto 7
  non memory operands but still it wont fail
 
 The example Martin gave has *8* operands.  You can try every possible direct
 mapping of those 8 addresses to just 7 registers, but they will obviously all
 fail.  Except with ia32 addressing modes it _can_ be done, and with only 4
 registers.
 
 reg1 = src, reg2 = stride, reg3 = src+stride*3, reg4 = src+stride*6
 Then the 8 memory operands are:
 (reg1), (reg1,reg2,1), (reg1,reg2,2), (reg3),
 (reg1,reg2,4), (reg3,reg2,2), (reg4), (reg3,reg2,4)
 
 When one considers all the addressing modes, there are not just 7 possible
 registers, but (I think) 261 possible addresses.  There are not just 5040
 possibilities as Michael said, but over 76 x 10^15 possible ways of assigning
 these addresses to 7 operands!  Then each register can be loaded not just with
 an address but with some sub-expression too, like how I loaded reg2 with
 stride.
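
The four-register scheme quoted above can be checked mechanically. The following C sketch models an ia32 (base,index,scale) operand as base + index*scale and confirms that the eight listed operands hit src + k*stride for k = 0..7. The helper names ea and four_regs_cover_eight_rows are made up for this illustration, and the check says nothing about the 261 or 76 x 10^15 figures.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of an ia32 operand written as (base,index,scale). */
static uintptr_t ea(uintptr_t base, uintptr_t index, unsigned scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale;
}

/* Returns 1 if the quoted register assignment reaches all 8 rows. */
static int four_regs_cover_eight_rows(uintptr_t src, uintptr_t stride)
{
    uintptr_t reg1 = src, reg2 = stride;
    uintptr_t reg3 = src + stride * 3, reg4 = src + stride * 6;
    uintptr_t got[8] = {
        reg1,               /* (reg1)        -> src + 0*stride */
        ea(reg1, reg2, 1),  /* (reg1,reg2,1) -> src + 1*stride */
        ea(reg1, reg2, 2),  /* (reg1,reg2,2) -> src + 2*stride */
        reg3,               /* (reg3)        -> src + 3*stride */
        ea(reg1, reg2, 4),  /* (reg1,reg2,4) -> src + 4*stride */
        ea(reg3, reg2, 2),  /* (reg3,reg2,2) -> src + 5*stride */
        reg4,               /* (reg4)        -> src + 6*stride */
        ea(reg3, reg2, 4),  /* (reg3,reg2,4) -> src + 7*stride */
    };
    for (unsigned k = 0; k < 8; k++)
        if (got[k] != src + stride * k)
            return 0;
    return 1;
}
```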

m operands and their variations can be copied onto the stack and accessed from
there, so no matter how many memory operands there are, they can always be
accessed via esp on ia32. So whatever you calculated, it is meaningless.

Now, if there is an unwritten rule that m operands and variations of them
cannot be copied anywhere, then it would be very desirable to have an asm
constraint like m without this restriction; this would resolve this and
several other bugs.
It would also be very nice if such a don't-copy restriction on m, if it does
exist, could be documented.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11203



[Bug target/12395] Suboptimal code with global variables

2006-02-11 Thread michaelni at gmx dot at


--- Comment #8 from michaelni at gmx dot at  2006-02-11 11:40 ---
I really think this should be fixed; otherwise gcc won't be able to follow its
exponentially decaying performance, which it has followed so accurately since
2.95 at least. To show more clearly how much speed we could lose by fixing
this, I was nice and benchmarked the code (a simple for loop running 100 times
with the code inside, and rdtsc-based timing outside, with a loop executed 1000
times surrounding it).
Benchmarking was done on an 800MHz Duron and a 500MHz Pentium3; the first
number is the number of CPU cycles for the Duron, the second one for the P3.

First let me show you the optimal code by steven boscher?
"addl   $1,a\n"
"je     .L1\n"
"addl   $1,a\n"
".L1:\n"
//11.557 / 12.514

Now what gcc 3.4/3.2 generated:
"movl   a, %%eax\n"
"incl   %%eax\n"
"testl  %%eax, %%eax\n"
"movl   %%eax, a\n"
"je     .L1\n"
"incl   %%eax\n"
"movl   %%eax, a\n"
".L1:\n"
//6.220 / 6.159

The code generated by mainline had 2 ret, so it didn't fit in my benchmark loop.

The even better code by segher AT d12relay01 DOT megacenter.de.ibm.com
"addl   $1,a\n"
"sbbl   $-1,a\n"
//11.755 / 15.111


One case which you must be careful not to generate, as it's almost twice as
fast as the one above while still being just 2 instructions, is:
"cmpl   $-1,a\n"
"adcl   $1,a\n"
//7.827 / 7.422

Another 2 slightly faster variants are:
"movl   a, %%eax\n"
"cmpl   $-1,%%eax\n"
"adcl   $1,%%eax\n"
"movl   %%eax,a\n"
//6.567 / 8.811

"movl   a, %%eax\n"
"addl   $1,%%eax\n"
"sbbl   $-1,%%eax\n"
"movl   %%eax,a\n"
//6.564 / 8.813


What a 14-year-old script kid would write, and what gcc would generate if they
were local variables:
"movl   a, %%eax\n"
"incl   %%eax\n"
"je     .L1\n"
"incl   %%eax\n"
".L1:\n"
"movl   %%eax, a\n"
//6.162 / 5.426

What I would write (as the variable isn't used in my testcase):
"\n"
//2.155 / 2.410
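
All the non-empty variants above implement the same update of a, which is easier to see in C. The sketch below is a reading of the listed assembly, not the original test case of this PR, and the function names are hypothetical. It models the carry flag explicitly to show why the add/sbb and cmp/adc pairs agree with the branching form.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form: addl $1 / je .L1 / addl $1. */
static uint32_t step_branch(uint32_t a)
{
    a += 1;
    if (a != 0)
        a += 1;
    return a;
}

/* addl $1 / sbbl $-1: the carry out of the first add cancels the second +1. */
static uint32_t step_add_sbb(uint32_t a)
{
    uint32_t carry = (a == UINT32_MAX);  /* CF after addl $1 */
    a += 1;
    return a + 1 - carry;                /* sbbl $-1 computes a - (-1) - CF */
}

/* cmpl $-1 / adcl $1: the borrow from the compare supplies the second +1. */
static uint32_t step_cmp_adc(uint32_t a)
{
    uint32_t carry = (a != UINT32_MAX);  /* CF after cmpl $-1, set when a < 0xFFFFFFFF */
    return a + 1 + carry;                /* adcl $1 computes a + 1 + CF */
}
```

The only input where the second increment is skipped is a == 0xFFFFFFFF, which all three forms map to 0.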


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395



[Bug target/12395] Suboptimal code with global variables

2006-02-11 Thread michaelni at gmx dot at


--- Comment #11 from michaelni at gmx dot at  2006-02-11 13:54 ---
(In reply to comment #9)
 Re. comment #8:
 exponential decaying performance which it has so accurately followed since
 2.95
 
 Can you back this up with numbers, or are you just trolling?  If the latter,
 please don't do that, you are insulting the work of a dedicated few.  Maybe
 you should help out instead of trolling, if you think you're so good.  If you
 continue to make this kind of unhelpful comments, I will ask to have you
 blocked from our bugzilla.

Was the benchmark unhelpful?
Anyway, compiling dsputil.c from libavcodec takes
gcc 2.95  0m26.530s
gcc 3.4   0m46.839s
gcc 4.0   1m 1.515s

(time /usr/bin/gcc-4.0 -O3 -g -DHAVE_AV_CONFIG_H -I..
-I'/home/michael/ffmpeg-write2/ffmpeg'/libavutil -D_FILE_OFFSET_BITS=64
-D_LARGEFILE_SOURCE -D_GNU_SOURCE   -c -o dsputil.o dsputil.c)


And as for runtime performance, just try the recommended way of writing asm/MMX
code for gcc 2.95 vs gcc 3/4.*; handwritten asm code is sometimes quite a bit
faster than what gcc creates from these intrinsics.

Sure, saying gcc gets exponentially slower in general isn't true, but in some
specific and common cases there is a big speed loss ...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395



[Bug inline-asm/23313] New: gcc ignores ebp on the clobber list

2005-08-10 Thread michaelni at gmx dot at
* the code segfaults
* there is no error message, not even a warning
* the docs don't say that ebp on the clobber list has undefined behaviour,
though you could argue it's common knowledge that gcc asm has undefined
behaviour in general

testcase:
int main(){
    int i;

    asm (
        "xorl %%ebp, %%ebp\n\t"
        "movl %0, %%ebp\n\t"
        :: "m" (i)
        : "%ebp"
    );
    return 0;
}

-- 
           Summary: gcc ignores ebp on the clobber list
           Product: gcc
           Version: 4.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: inline-asm
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: michaelni at gmx dot at
                CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: x86-linux
  GCC host triplet: x86-linux
GCC target triplet: x86-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23313


[Bug inline-asm/11203] source doesn't compile with -O0 but they compile with -O3

2005-01-22 Thread michaelni at gmx dot at

--- Additional Comments From michaelni at gmx dot at  2005-01-22 17:10 ---
(In reply to comment #14)
 In any case, just because code is syntactically valid 
 GNU C doesn't mean gcc can always compile it.  With this kind of inline asm, 
 you're bound to confuse the register allocator.  The fact that it works at O3 
 is pure luck and not a bug.  

Well, you are the gcc developers, so there's not much arguing about what you
consider valid, but last time I checked, the docs did not mention that asm
statements may fail to compile at random. IMO, as long as this is not clearly
stated in the docs, this bug report really shouldn't be marked as invalid. Say
you don't want to fix it, say it would be too complicated to fix, or whatever,
but it's not invalid.


 Note that you're hitting an *error*, not an ICE. 

No, at least one of the bug reports marked as a duplicate of this one ends in
an ICE.



(In reply to comment #24)
 Martin, you should realize that this problem *cannot* be solved. Yes, 
 there will perhaps be a time when this particular test case compiles, 
 though I think that is unlikely.  But anyway, then there will be other 
 cases that fail. 

Hmm, so the problem cannot be solved, but then maybe it will be solved, but
this doesn't count because there will be other, unrelated bugs? I can't follow
this reasoning. Or do you mean that you can never solve all bugs, and so
there's no need to fix any single one?


  
 The reason is dead simple: register allocation is NP-complete, so it 
 is even *theoretically* not possible to write register allocators that 
 always find a coloring. 

Register allocation in general is NP-complete, yes, but you seem to forget that
this is about finding the optimal solution, while gcc fails to find any
solution. In practice that is a matter of assigning the registers beginning
with the most constrained operands and ending with the least constrained, and
copying a few things onto the stack if gcc can't figure out how to access them.
Sure, this method might fail in 0.001% of the practical cases and need a 2nd or
3rd pass where it tries different registers. It might also happen that in some
intentionally overconstrained cases it ends up searching all 5040 (7!) possible
assignments of 7 registers onto 7 non-memory operands, but still it won't fail.
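
The 5040 figure is simply 7!, and an exhaustive search over it is easy to sketch. The C code below is a deliberately naive illustration of that worst case, not a description of how gcc's register allocator or reload works: the function search, the allowed[] bitmask encoding, and the choice to check constraints only on complete assignments are all assumptions made for this example.

```c
#include <assert.h>

#define NREGS 7

/* allowed[i] is a bitmask of the registers operand i may live in
 * (bit r set means register r is acceptable). Every register goes to
 * at most one operand. *tried counts the complete assignments that
 * were inspected. Returns 1 and fills out[] on success, 0 otherwise. */
static int search(const unsigned allowed[NREGS], int op, unsigned used,
                  int out[NREGS], long *tried)
{
    assert(op >= 0 && op <= NREGS);
    if (op == NREGS) {
        ++*tried;
        for (int i = 0; i < NREGS; i++)
            if (!(allowed[i] & (1u << out[i])))
                return 0;               /* some operand dislikes its register */
        return 1;
    }
    for (int r = 0; r < NREGS; r++) {
        if (used & (1u << r))
            continue;                   /* register already handed out */
        out[op] = r;
        if (search(allowed, op + 1, used | (1u << r), out, tried))
            return 1;
    }
    return 0;
}
```

Called as search(allowed, 0, 0, out, &tried), it always terminates after at most 5040 complete assignments. A practical allocator would prune much earlier and fall back to stack slots, as described above.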

 That means any register allocator will always 
 fail on some very constrained asm input.

Now that statement is just false, not to mention irrelevant, as none of these
asm statements are unreasonably constrained.


  And you cannot allow it to 
 run indefinitely until a coloring is found, because then you've turned 
 the graph coloring problem into the halting problem because you can't 
 prove that a coloring exists and that the register allocator algorithm 
 will terminate. 

This is ridiculous: the number of possible colorings is finite, so you can
always try them all in finite time.

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11203


[Bug inline-asm/11203] source doesn't compile with -O0 but they compile with -O3

2005-01-01 Thread michaelni at gmx dot at

--- Additional Comments From michaelni at gmx dot at  2005-01-01 18:57 ---
(In reply to comment #12)
 Why do people write inline-asm like this?

Why not? It's valid code, and a compiler should compile valid code ...


 It is crazy to do so.  Split up the inline-asm correctly.

Fix gcc first so it doesn't load/store more than needed between the split-up
parts.


 Anyone who writes like inline-asm should get what they get.
 For mmx inline-asm, you should be using the intrinsics instead as suggested 
 before

Let's see why it's not using intrinsics:
* it was written before intrinsics support was common
* intrinsics commonly fail or get miscompiled; it's so bad that some of the
altivec intrinsic code has been disabled in ffmpeg if standard gcc is detected.
There have also been very serious and similar problems in mplayer with
altivec intrinsics; sadly I can't provide more details as I don't have a ppc
* many if not most of the mplayer developers still use gcc 2.95 because gcc 3.*
is slower and needs more memory, and AFAIK 2.95 doesn't support intrinsics
* it is a lot of work to rewrite and debug it just to make it compilable with
gcc -O0


 or just write real asm file.

That's not a good idea either, as:
* it's slower due to the additional call/ret/parameter passing
* there are some symbol name mangling issues on some obscure systems (see the
mplayer-dev or cvslog mailing list; it was discussed there a long time ago)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11203