[Bug rtl-optimization/21395] Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0
--- Comment #11 from michaelni at gmx dot at 2008-03-24 00:08 ---
Subject: Re: Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0

On Sun, Mar 23, 2008 at 10:46:41AM -, ubizjak at gmail dot com wrote:
> --- Comment #10 from ubizjak at gmail dot com 2008-03-23 10:46 ---
> (In reply to comment #9)
> > So on my duron 4.3 seems to beat 4.4, as I expected from the generated asm.
> Can you tell from the code dumps of 4.4 vs 4.3 where you think that the 4.4
> code is worse than 4.3 for the Duron? For Core2, 4.4 avoids a store
> forwarding stall, but I'm not sure why the Duron prefers moves via memory
> instead of keeping values in %mm registers.

--- freaky_mmx_code-4.3.s	2008-03-24 00:48:11.0 +0100
+++ freaky_mmx_code-4.4.s	2008-03-24 00:48:03.0 +0100
...
 .L24:
 	movl	-36(%ebp), %eax
 	testl	%ebx, %ebx
 	movl	(%edi,%esi,4), %edx
 	movl	(%eax,%esi,4), %ecx
@@ -182,113 +183,102 @@
 	xorl	%eax, %eax
 	movq	-24(%ebp), %mm2
 	.p2align 4,,7
 	.p2align 3
 .L23:
 	movq	(%ecx,%eax,2), %mm0
 	psubw	(%edx,%eax,2), %mm0
 	addl	$4, %eax
 	cmpl	%eax, %ebx
 	movq	%mm0, %mm1
-	psraw	$15, %mm0
-	pxor	%mm0, %mm1
-	psubw	%mm0, %mm1
-	movq	%mm1, %mm0
-	punpcklwd	%mm1, %mm1
-	punpckhwd	%mm3, %mm0
-	psrad	$16, %mm1
-	paddd	%mm0, %mm1
-	paddd	%mm1, %mm2
+	psraw	$15, %mm1
+	pxor	%mm1, %mm0
+	psubw	%mm1, %mm0
+	movq	%mm0, %mm1
+	punpcklwd	%mm0, %mm0
+	punpckhwd	%mm3, %mm1
+	psrad	$16, %mm0
+	paddd	%mm1, %mm0
+	paddd	%mm0, %mm2
 	movq	%mm2, -24(%ebp)
 	jg	.L23
 .L22:
 	addl	$1, %esi
-	cmpl	%esi, -40(%ebp)
-	jg	.L24
+	cmpl	-40(%ebp), %esi
+	jl	.L24
...
-	.ident	"GCC: (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)"
+	.ident	"GCC: (GNU) 4.4.0 20080321 (experimental)"

What I _think_ makes 4.4 slower on the duron is that "psraw $15, %mm1" reads
a register which has been written by the previous instruction, while 4.3
chose the other register, which contains the same value. 4.4 simply has a
longer dependency chain than 4.3.

PS: both compiled with -mmmx -O2 -S
PS2: 4.3 is from debian, 4.4 is from gcc svn
[...]

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
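The three-instruction sequence both compilers emit for the absolute value (psraw $15 / pxor / psubw) is the classic branchless sign-mask trick. A minimal scalar sketch of what those three instructions compute per 16-bit lane (the helper name is made up for illustration):

```c
#include <stdint.h>

/* Branchless absolute value of one 16-bit lane, mirroring the
   psraw $15 / pxor / psubw sequence in the diff above.
   Like the MMX version, INT16_MIN maps to itself (no positive counterpart). */
int16_t abs16_branchless(int16_t x)
{
    int16_t mask = (int16_t)(x >> 15); /* psraw $15: 0 for >= 0, all-ones for < 0 */
    return (int16_t)((x ^ mask) - mask); /* pxor + psubw: two's-complement negate
                                            only when mask is all-ones */
}
```

The dependency-chain point above is about which copy of the value feeds the shift, not about what is computed; both 4.3 and 4.4 compute exactly this function.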
[Bug rtl-optimization/21395] Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0
--- Comment #9 from michaelni at gmx dot at 2008-03-23 02:49 ---
Subject: Re: Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0

On Sat, Mar 22, 2008 at 11:01:55AM -, ubizjak at gmail dot com wrote:
> --- Comment #8 from ubizjak at gmail dot com 2008-03-22 11:01 ---
> (In reply to comment #6)
> > As Uros has challenged me to beat the performance of gcc-4.4 generated
> > code by hand-crafted assembly using the example of PR 21395, here's my
> > entry. Sadly I only have gcc-4.3 compiled ATM for comparison, but 4.3
> > generates better code than 4.4, so I guess that's OK. Its inner loop is:
>
> Not! This is the comparison of runtimes for the original test, comparing
> 4.3.0 vs 4.4.0 compiled code on core2D EE:
>
> $ g++ -V 4.3.0 -m32 -march=core2 -O2 mmx.cpp
> $ time ./a.out
> 144
> real	0m0.619s
> user	0m0.620s
> sys	0m0.000s
>
> $ g++ -V 4.4.0 -m32 -march=core2 -O2 mmx.cpp
> $ time ./a.out
> 144
> real	0m0.398s
> user	0m0.400s
> sys	0m0.000s

On my duron with -O2 -mmmx I get:

g++-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
144
real	0m2.077s
user	0m1.912s
sys	0m0.019s

g++-4.4 (GCC) 4.4.0 20080321 (experimental)
144
real	0m2.172s
user	0m2.004s
sys	0m0.021s

with -m32 -march=core2 (incorrect, as it doesn't match the cpu!):

g++-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
144
real	0m3.644s
user	0m3.389s
sys	0m0.022s

g++-4.4 (GCC) 4.4.0 20080321 (experimental)
Illegal instruction (yes yes, I know I asked for it)
real	0m0.011s
user	0m0.003s
sys	0m0.007s

So on my duron 4.3 seems to beat 4.4, as I expected from the generated asm.

> gcc 4.4.0 with your modified computation kernel:
>
> $ g++ -m32 -march=core2 -O2 mmx-1.cpp
> $ time ./a.out
> 144
> real	0m0.309s
> user	0m0.308s
> sys	0m0.000s
>
> To be honest, I didn't expect you to completely rewrite the computation
> kernel, so we are comparing apples to oranges.
Well, nothing stops gcc from rewriting the intrinsics either :)

> However, you can rewrite your ASM code using intrinsic functions from
> mmintrin.h, and you will get all optimizations (scheduling, unrolling, etc.)
> for free, while you are still in control of code generation on a fairly low
> level. Using intrinsics, you leave to the compiler the things that the
> compiler is good at (loop handling, register allocation, scheduling). Are
> you interested in this experiment?

I am surely interested, but I am a little busy with Google Summer of Code
students currently. We have to choose wisely which applications and students
we select for ffmpeg this summer ... that means a lot of code reviewing of
what the students submit as qualification tasks ... So I won't rewrite this
in intrinsics, at least not anytime soon.

> The results of this experiment would perhaps be interesting to the ffmpeg
> people, to consider rewriting their asm blocks into intrinsics.

Well ... I am not a friend of intrinsics, but I think you guessed that
already :) The thing I like about asm() is that it produces the same
performance and code with every compiler. It's largely a write-once-and-forget
thing. A problem with asm() is almost always of the compile-time-error sort,
like "can't find register in class blah"; these things are visible and can be
dealt with ... With intrinsics it's all a gamble; just look at this PR, at how
hugely performance differs between gcc versions. If ffmpeg were using
intrinsics instead of asm, we would have to spend considerable time dealing
with such variations somehow.

And really, thanks for your detailed benchmark results!

> And since your computation kernel is already 30% faster than the current
> implementation, I'm sure that the Dirac people (in CC of this PR) will be
> very interested in your computational kernel.

Yes, I am also fine with them using it under whichever FOSS license they
want.
[...]

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
[Bug rtl-optimization/21395] Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0
--- Comment #6 from michaelni at gmx dot at 2008-03-22 02:15 ---
As Uros has challenged me to beat the performance of gcc-4.4 generated code
by hand-crafted assembly using the example of PR 21395, here's my entry.
Sadly I only have gcc-4.3 compiled ATM for comparison, but 4.3 generates
better code than 4.4, so I guess that's OK. Its inner loop is:

.L23:
	movq	(%ecx,%eax,2), %mm0
	psubw	(%edx,%eax,2), %mm0
	addl	$4, %eax
	cmpl	%eax, %ebx
	movq	%mm0, %mm1
	psraw	$15, %mm0
	pxor	%mm0, %mm1
	psubw	%mm0, %mm1
	movq	%mm1, %mm0
	punpcklwd	%mm1, %mm1
	punpckhwd	%mm3, %mm0
	psrad	$16, %mm1
	paddd	%mm0, %mm1
	paddd	%mm1, %mm2
	movq	%mm2, -24(%ebp)
	jg	.L23

It's better because the psraw doesn't depend on the previous movq result.

Now here's my code (this is naively written and not unrolled or hand
scheduled; it also uses hardcoded registers, so I suspect it can be improved
further ...):

int SimpleBlockDiff::Diff () {
#ifdef __MMX__
    int sum;
    int x1b= -2*xl;
    int ylb= yl;
    asm volatile(
        "xorl %%edx, %%edx              \n\t"
        "pcmpeqw %%mm6, %%mm6           \n\t"
        "pxor %%mm7, %%mm7              \n\t"
        "psrlw $15, %%mm6               \n\t"
        "1:                             \n\t"
        "movl (%1, %%edx, 4), %%eax     \n\t"
        "movl (%2, %%edx, 4), %%esi     \n\t"
        "movl %3, %%ecx                 \n\t"
        "subl %%ecx, %%eax              \n\t"
        "subl %%ecx, %%esi              \n\t"
        "2:                             \n\t"
        "pxor %%mm1, %%mm1              \n\t"
        "movq (%%eax, %%ecx), %%mm0     \n\t"
        "psubw (%%esi, %%ecx), %%mm0    \n\t"
#if 0
        "psubw %%mm0, %%mm1             \n\t"
        "pmaxsw %%mm1, %%mm0            \n\t"
#else
        "pcmpgtw %%mm0, %%mm1           \n\t"
        "pxor %%mm1, %%mm0              \n\t"
        "psubw %%mm1, %%mm0             \n\t"
#endif
        "pmaddwd %%mm6, %%mm0           \n\t"
        "paddd %%mm0, %%mm7             \n\t"
        "addl $8, %%ecx                 \n\t"
        "jnz 2b                         \n\t"
        "incl %%edx                     \n\t"
        "cmpl %%edx, %4                 \n\t"
        "jnz 1b                         \n\t"
        "movq %%mm7, %%mm0              \n\t"
        "psrlq $32, %%mm7               \n\t"
        "paddd %%mm7, %%mm0             \n\t"
        "movd %%mm0, %0                 \n\t"
        : "=g" (sum)
        : "r" (pic_data), "r" (ref_data), "m" (x1b), "m" (ylb)
        : "%eax", "%esi", "%ecx", "%edx"
    );
    return sum;

and benchmarks:

on a duron:
gcc-4.3:
real	0m2.034s
user	0m1.882s
sys	0m0.017s
asm:
real	0m1.312s
user	0m1.208s
sys	0m0.016s

on a 500mhz pentium3:
gcc-4.3:
real	0m4.021s
user	0m3.767s
sys	0m0.009s
asm:
real	0m2.827s
user	0m2.565s
sys	0m0.055s

--
michaelni at gmx dot at changed:
           What    |Removed    |Added
 CC                |           |michaelni at gmx dot at

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
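For comparison, the kernel above can also be expressed with the MMX intrinsics from mmintrin.h, which is what the rest of this thread argues about. This is only a sketch under the assumption that the loop sums absolute differences of 16-bit elements; the function name sad16 and the flat-array interface are made up, and this is not the code attached to the PR:

```c
#include <mmintrin.h>
#include <string.h>

/* Sum of absolute differences over n int16 elements (n a multiple of 4),
   using the same abs trick (psraw/pxor/psubw) and pmaddwd-by-ones widening
   as the hand-written loop above. x86-only: requires MMX support. */
int sad16(const short *a, const short *b, int n)
{
    __m64 ones = _mm_set1_pi16(1);   /* the pcmpeqw + psrlw $15 constant */
    __m64 acc  = _mm_setzero_si64(); /* running pair of 32-bit sums      */
    for (int i = 0; i < n; i += 4) {
        __m64 va, vb;
        memcpy(&va, a + i, 8);       /* movq loads (memcpy avoids any    */
        memcpy(&vb, b + i, 8);       /* alignment assumptions)           */
        __m64 d    = _mm_sub_pi16(va, vb);             /* psubw          */
        __m64 sign = _mm_srai_pi16(d, 15);             /* psraw $15      */
        d = _mm_sub_pi16(_mm_xor_si64(d, sign), sign); /* pxor, psubw    */
        acc = _mm_add_pi32(acc, _mm_madd_pi16(d, ones)); /* pmaddwd, paddd */
    }
    acc = _mm_add_pi32(acc, _mm_srli_si64(acc, 32)); /* fold high into low */
    int sum = _mm_cvtsi64_si32(acc);
    _mm_empty();                     /* emms: required before x87 FPU code */
    return sum;
}
```

Whether a compiler turns this into code that beats the asm block is exactly what this PR is about; the generated loop will differ from one gcc version to the next.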
[Bug target/14552] compiled trivial vector intrinsic code is inefficient
--- Comment #37 from michaelni at gmx dot at 2008-03-22 02:39 ---
Subject: Re: compiled trivial vector intrinsic code is inefficient

On Fri, Mar 21, 2008 at 10:34:00AM -, ubizjak at gmail dot com wrote:
> --- Comment #36 from ubizjak at gmail dot com 2008-03-21 10:33 ---
> (In reply to comment #35)
> > Also, ffmpeg uses almost entirely asm() instead of intrinsics, so this
> > alone is not so much a problem for ffmpeg as it is for others who
> > followed the recommendation that intrinsics are better than asm. About
> > trolling, well, I made no attempt to reply politely and diplomatically,
> > no. But solving a problem in some use case by dropping support for that
> > use case is kind of extreme. The way I see it:
> > * It's non-trivial to place emms optimally and automatically.
> > * There needs to be an emms between MMX code and FPU code.
> > The solutions to this would be any one of:
> > A. let the programmer place emms, like it has been in the past
> > B. don't support MMX at all
> > C. don't support the x87 FPU at all
> > D. place emms after every bunch of MMX instructions
> > E. solve a quite non-trivial problem and place emms optimally
> > The solution which has been selected apparently is B. Why was that
> > chosen, instead of, say, A.? If I write SIMD code then I know that I
> > need an emms on x86. It's trivial for the programmer to place it
> > optimally.
>
> I don't know where you get the idea that MMX support was dropped in any
> way. I

Maybe because the SIMD code in this PR compiled with -mmmx does not use MMX
but very significantly less efficient integer instructions. And you added a
test to gcc which ensures that this case does not use MMX instructions. This
is pretty much the definition of dropping MMX support (for this specific
case).

> won't engage in a discussion about autovectorisation, intrinsics, builtins,
> generic vectorisation, etc, etc with you,

And somehow I am glad about that.

> but please look at PR 21395 for how performance PRs should be filed.
> The MMX code in that PR is _far_ from trivial,

Well, that is something I would disagree about.

> but since it is well written using intrinsic instructions, it enables a
> jaw-dropping performance increase that is simply not possible when ASM
> blocks are used. Now, I'm sure that you have your numbers ready to back up
> your claims from Comment #33 about performance of generated code, and I
> challenge you to beat the performance of gcc-4.4 generated code by
> hand-crafted assembly using the example of PR 21395.

Done:
jaw-dropping intrinsics need 2.034s
stinky hand-written asm needs 1.312s
But you can read the details in PR 21395.
[...]

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552
[Bug rtl-optimization/21395] Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0
--- Comment #7 from michaelni at gmx dot at 2008-03-22 02:51 ---
You can also replace the inner loop by:

        "2:                             \n\t"
        "pxor %%mm1, %%mm1              \n\t"
        "movq (%%eax, %%ecx), %%mm0     \n\t"
        "psubw (%%esi, %%ecx), %%mm0    \n\t"
        "pcmpgtw %%mm0, %%mm1           \n\t"
        "por %%mm6, %%mm1               \n\t"
        "pmaddwd %%mm1, %%mm0           \n\t"
        "paddd %%mm0, %%mm7             \n\t"
        "addl $8, %%ecx                 \n\t"
        "jnz 2b                         \n\t"

which has one instruction less. It's a hair faster on my p3 but a little
slower on my duron. And of course the most obvious optimization is to unroll
this and do a bunch of them at once.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
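The por variant works because pmaddwd by +1 or -1 both negates-where-needed and widens to 32 bits in one instruction: pcmpgtw against zero yields 0xFFFF (i.e. -1) for negative words, and or-ing that with 1 turns the mask into exactly -1 or +1. A scalar sketch of the per-word logic (the function name is made up for illustration):

```c
#include <stdint.h>

/* What the pcmpgtw / por / pmaddwd combination computes per 16-bit word:
   multiply each difference by its own sign (-1 or +1), i.e. take |d|,
   already widened to 32 bits. */
int32_t mul_by_sign(int16_t d)
{
    int16_t gt   = (0 > d) ? -1 : 0; /* pcmpgtw: all-ones where d is negative */
    int16_t coef = (int16_t)(gt | 1); /* por mm6 (ones): -1 or +1 */
    return (int32_t)d * coef;        /* pmaddwd: widening multiply (+ pair add
                                        in the real instruction) */
}
```

The pairwise horizontal add that pmaddwd also performs is what lets the loop accumulate straight into two 32-bit lanes of %mm7.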
[Bug target/14552] compiled trivial vector intrinsic code is inefficient
--- Comment #35 from michaelni at gmx dot at 2008-03-20 17:18 ---
Subject: Re: compiled trivial vector intrinsic code is inefficient

On Thu, Mar 20, 2008 at 09:49:22AM -, ubizjak at gmail dot com wrote:
> --- Comment #34 from ubizjak at gmail dot com 2008-03-20 09:49 ---
> (In reply to comment #33)
> > Anyway, I am glad ffmpeg compiles fine under icc.
> Me too. Now you will troll in their support lists.

No, truth be told, I don't plan to switch to icc yet. Somehow I do prefer to
use free tools. Of course, if the gap becomes too big, I, as well as most
others, will switch to icc ...

Also, ffmpeg uses almost entirely asm() instead of intrinsics, so this alone
is not so much a problem for ffmpeg as it is for others who followed the
recommendation that intrinsics are better than asm.

About trolling, well, I made no attempt to reply politely and diplomatically,
no. But solving a problem in some use case by dropping support for that use
case is kind of extreme. The way I see it:
* It's non-trivial to place emms optimally and automatically.
* There needs to be an emms between MMX code and FPU code.
The solutions to this would be any one of:
A. let the programmer place emms, like it has been in the past
B. don't support MMX at all
C. don't support the x87 FPU at all
D. place emms after every bunch of MMX instructions
E. solve a quite non-trivial problem and place emms optimally
The solution which has been selected apparently is B. Why was that chosen,
instead of, say, A.? If I write SIMD code then I know that I need an emms on
x86. It's trivial for the programmer to place it optimally.
[...]

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552
[Bug target/14552] compiled trivial vector intrinsic code is inefficient
--- Comment #33 from michaelni at gmx dot at 2008-03-20 01:37 ---
Subject: Re: compiled trivial vector intrinsic code is inefficient

On Wed, Mar 19, 2008 at 11:39:18PM -, uros at gcc dot gnu dot org wrote:
> --- Comment #26 from uros at gcc dot gnu dot org 2008-03-19 23:39 ---
> Subject: Bug 14552
> [...]
> * gcc.target/i386/pr14552.c: New test.
> Added: trunk/gcc/testsuite/gcc.target/i386/pr14552.c

Thanks, I was already scared that the inversely proportional relation between
version number and performance, which has been so nicely followed since 2.95,
would stop. Adding a test to the testsuite to ensure that MMX intrinsics
don't use MMX registers is, well, just brilliant. I am already eagerly
awaiting the testcase which will check that floating point code doesn't use
the FPU; I assume that will happen in gcc 5.0? Anyway, I am glad ffmpeg
compiles fine under icc.
[...]

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552
[Bug c/35058] New: -Werror= works only with some warnings
-Werror=declaration-after-statement and -Werror=pointer-arith only generate
warnings, not errors.

Example:

void *a;
void *test(){
    if(a=a)
        a++;
    int x=5;
    return a+x;
}

gcc-4.3 -Werror=declaration-after-statement -Werror=pointer-arith testX.c -c -o testX

Adding -Werror=parentheses generates an error as expected, though. Also,
interestingly, -fdiagnostics-show-option shows only [-Wparentheses] and
nothing for the other 2 warnings.

Note: this issue has been seen on x86-32 and ppc.

--
Summary: -Werror= works only with some warnings
Product: gcc
Version: 4.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: michaelni at gmx dot at

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35058
[Bug inline-asm/11203] source doesn't compile with -O0 but they compile with -O3
--- Comment #39 from michaelni at gmx dot at 2007-02-27 22:50 ---
(In reply to comment #38)
> (In reply to comment #37)
> > now if there is an unwritten rule that "m" operands and variations of
> > them cannot be copied anywhere, then it would be very desirable to have
> > an asm constraint like "m" without this restriction; this would resolve
> > this and several other bugs. also it would be very nice if such a "don't
> > copy" restriction on "m", if it does exist, could be documented
>
> Copying "m" operands onto the stack might not be such a great thing to
> wish for. Imagine if you used
>   asm("movaps %%xmm0, %0" : "=m"(x[i]));
> If x[i] is only 32 bits, and gcc copied it onto the stack, then writing 16
> bytes with movaps wouldn't also write to x[i+1] to x[i+3] as intended. I
> know there is plenty of asm code in ffmpeg that overwrites or overreads
> memory operands and will fail if gcc tries to move them onto the stack.
> There is also alignment. movaps requires an aligned address, and maybe you
> have chosen x and i in such a way that it will be aligned. But when gcc
> copies the value onto the stack, how is it supposed to know what alignment
> it needs?

Well, the data type used in "m"() must of course be correct, which here is a
128-bit type. Alignment can be handled like with all other types: double
also gets aligned if the architecture needs it, so a uint128_t or sse128 or
whatever can be as well. The example you show is a fairly obscure special
case with respect to moving "m" to the stack. In the end there is a need for
an "m"-like constraint which must not be movable and an "m"-like constraint
which may be moved (to the stack, for example); the exact letters used are
irrelevant.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11203
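The overwrite hazard in the quoted movaps example can be made visible with a small x86-only sketch (the function name is made up): a 16-byte store issued through an output constraint that only names 4 bytes. The "memory" clobber is there purely so the demo can legitimately read the neighbouring elements back afterwards; without it, the compiler is entitled to assume x[1]..x[3] are untouched, which is precisely the hazard being argued about.

```c
/* Four floats beyond the named ones, 16-byte aligned so movaps is legal. */
float x[8] __attribute__((aligned(16))) = {9, 1, 2, 3};

/* Store 16 zero bytes through an output operand that only names x[0].
   The constraint claims a 4-byte write; movaps actually writes 16. */
void clobber_neighbours(void)
{
    __asm__ volatile(
        "xorps %%xmm0, %%xmm0\n\t"
        "movaps %%xmm0, %0"
        : "=m" (x[0])
        :
        : "xmm0", "memory");
}
```

After the call, x[1], x[2] and x[3] are zero even though only x[0] was named as an output, so copying such an operand to a 4-byte stack slot would silently change behaviour.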
[Bug inline-asm/11203] source doesn't compile with -O0 but they compile with -O3
--- Comment #37 from michaelni at gmx dot at 2006-11-08 20:45 ---
(In reply to comment #36)
> (In reply to comment #21)
> > asm volatile(
> >     ""
> >     : "=m" (*(unsigned int*)(src + 0*stride)),
> >       "=m" (*(unsigned int*)(src + 1*stride)),
> >       "=m" (*(unsigned int*)(src + 2*stride)),
> >       "=m" (*(unsigned int*)(src + 3*stride)),
> >       "=m" (*(unsigned int*)(src + 4*stride)),
> >       "=m" (*(unsigned int*)(src + 5*stride)),
> >       "=m" (*(unsigned int*)(src + 6*stride)),
> >       "=m" (*(unsigned int*)(src + 7*stride))
> > );
>
> (In reply to comment #26)
> > it might also happen that in some intentionally overconstrained cases it
> > ends up searching the whole 5040 possible assignments of 7 registers
> > onto 7 non-memory operands, but still it won't fail
>
> The example Martin gave has *8* operands. You can try every possible
> direct mapping of those 8 addresses to just 7 registers, but they will
> obviously all fail. Except with ia32 addressing modes it _can_ be done,
> and with only 4 registers:
>   reg1 = src, reg2 = stride, reg3 = src+stride*3, reg4 = src+stride*6
> Then the 8 memory operands are:
>   (reg1), (reg1,reg2,1), (reg1,reg2,2), (reg3),
>   (reg1,reg2,4), (reg3,reg2,2), (reg4), (reg3,reg2,4)
> When one considers all the addressing modes, there are not just 7 possible
> registers, but (I think) 261 possible addresses. There are not just 5040
> possibilities as Michael said, but over 76 x 10^15 possible ways of
> assigning these addresses to 7 operands! Then each register can be loaded
> not just with an address but with some sub-expression too, like how I
> loaded reg2 with stride.
"m" operands and variations can be copied onto the stack and accessed from
there, so no matter how many memory operands there are, they can always be
accessed over esp on ia32. So whatever you calculated, it is meaningless.

Now, if there is an unwritten rule that "m" operands and variations of them
cannot be copied anywhere, then it would be very desirable to have an asm
constraint like "m" without this restriction; this would resolve this and
several other bugs. Also, it would be very nice if such a "don't copy"
restriction on "m", if it does exist, could be documented.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11203
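Whether or not copying to the stack is allowed, the four-register assignment quoted above does check out arithmetically. A small sketch verifying the claimed base+index*scale combinations (the function name is made up for illustration):

```c
#include <stdint.h>

/* The mapping from the quoted comment: four registers cover all eight
   addresses src + i*stride (i = 0..7) via ia32 (base,index,scale) modes. */
int covers_all_eight(uintptr_t src, uintptr_t stride)
{
    uintptr_t r1 = src;              /* reg1 = src            */
    uintptr_t r2 = stride;           /* reg2 = stride         */
    uintptr_t r3 = src + 3 * stride; /* reg3 = src + stride*3 */
    uintptr_t r4 = src + 6 * stride; /* reg4 = src + stride*6 */
    uintptr_t addr[8] = {
        r1,          /* (reg1)        -> src + 0*stride */
        r1 + r2 * 1, /* (reg1,reg2,1) -> src + 1*stride */
        r1 + r2 * 2, /* (reg1,reg2,2) -> src + 2*stride */
        r3,          /* (reg3)        -> src + 3*stride */
        r1 + r2 * 4, /* (reg1,reg2,4) -> src + 4*stride */
        r3 + r2 * 2, /* (reg3,reg2,2) -> src + 5*stride */
        r4,          /* (reg4)        -> src + 6*stride */
        r3 + r2 * 4, /* (reg3,reg2,4) -> src + 7*stride */
    };
    for (int i = 0; i < 8; i++)
        if (addr[i] != src + (uintptr_t)i * stride)
            return 0;
    return 1;
}
```

This only shows the addresses are reachable; it says nothing about whether a register allocator could be expected to discover such an assignment, which is the point under dispute.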
[Bug target/12395] Suboptimal code with global variables
--- Comment #8 from michaelni at gmx dot at 2006-02-11 11:40 ---
I really think this should be fixed, otherwise gcc won't be able to follow
the exponentially decaying performance which it has so accurately followed
since 2.95 at least. To show more clearly how much speed we could lose by
fixing this, I was nice and benchmarked the code (a simple for loop running
100 times with the code inside, rdtsc-based timing outside, with a
1000-times-executed loop surrounding it). Benchmarking was done on an 800mhz
duron and a 500mhz pentium3; the first number is the number of cpu cycles on
the duron, the second one on the p3.

First, let me show you the optimal code by Steven Bosscher(?):

	addl	$1, a
	je	.L1
	addl	$1, a
.L1:
// 11.557 / 12.514

Now, what gcc 3.4/3.2 generated:

	movl	a, %eax
	incl	%eax
	testl	%eax, %eax
	movl	%eax, a
	je	.L1
	incl	%eax
	movl	%eax, a
.L1:
// 6.220 / 6.159

The code generated by mainline had 2 rets, so it didn't fit in my benchmark
loop.

The even better code by segher AT d12relay01 DOT megacenter.de.ibm.com:

	addl	$1, a
	sbbl	$-1, a
// 11.755 / 15.111

One case which you must be careful not to generate, as it's almost twice as
fast as the one above while still being just 2 instructions, is:

	cmpl	$-1, a
	adcl	$1, a
// 7.827 / 7.422

Another 2 slightly faster variants are:

	movl	a, %eax
	cmpl	$-1, %eax
	adcl	$1, %eax
	movl	%eax, a
// 6.567 / 8.811

	movl	a, %eax
	addl	$1, %eax
	sbbl	$-1, %eax
	movl	%eax, a
// 6.564 / 8.813

What a 14-year-old script kid would write, and what gcc would generate if
these were local variables:

	movl	a, %eax
	incl	%eax
	je	.L1
	incl	%eax
.L1:
	movl	%eax, a
// 6.162 / 5.426

What I would write (as the variable isn't used in my testcase): no code at
all.
// 2.155 / 2.410

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395
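All of the two-instruction variants above implement the same wrap-skipping increment; judging from the generated code, the source is something like `a++; if (a) a++;` (this is a reconstruction, not quoted from the PR). A sketch checking that the cmp/adc carry trick matches the branchy form (function names are made up):

```c
#include <stdint.h>

/* Branchy form, reconstructed from the "addl $1,a; je .L1; addl $1,a"
   sequence above: increment, and increment again unless we wrapped to 0. */
uint32_t inc_branchy(uint32_t a) { a++; if (a) a++; return a; }

/* The "cmpl $-1,a; adcl $1,a" trick: cmp sets the carry flag unless
   a == 0xFFFFFFFF, and adc then adds 1 plus that carry. */
uint32_t inc_carry(uint32_t a) { return a + 1 + (a != UINT32_MAX); }

/* The two forms agree for every input. */
int agrees(uint32_t a) { return inc_branchy(a) == inc_carry(a); }
```

The "addl $1,a; sbbl $-1,a" variant computes the same function via the carry out of the first add: subtract-with-borrow of -1 adds 1 minus the borrow.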
[Bug target/12395] Suboptimal code with global variables
--- Comment #11 from michaelni at gmx dot at 2006-02-11 13:54 ---
(In reply to comment #9)
> Re. comment #8:
> > exponential decaying performance which it has so accurately followed
> > since 2.95
> Can you back this up with numbers, or are you just trolling? If the latter,
> please don't do that, you are insulting the work of a dedicated few. Maybe
> you should help out instead of trolling, if you think you're so good. If
> you continue to make this kind of unhelpful comments, I will ask to have
> you blocked from our bugzilla.

The benchmark was unhelpful? Anyway, compiling dsputil.c from libavcodec
takes:

gcc 2.95	0m26.530s
gcc 3.4 	0m46.839s
gcc 4.0 	1m 1.515s

(time /usr/bin/gcc-4.0 -O3 -g -DHAVE_AV_CONFIG_H -I.. -I'/home/michael/ffmpeg-write2/ffmpeg'/libavutil -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_GNU_SOURCE -c -o dsputil.o dsputil.c)

And runtime performance: just try the recommended way of writing asm/MMX code
for gcc 2.95 vs gcc 3/4.*; hand-written asm code is sometimes quite a bit
faster than what gcc creates from these intrinsics.

Sure, saying gcc gets exponentially slower in general isn't true, but in some
specific and common cases there is a big speed loss ...

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395
[Bug inline-asm/23313] New: gcc ignores ebp on the clobber list
* the code segfaults
* there is no error message, not even a warning
* the docs don't say that ebp on the clobber list has undefined behaviour,
  though you could argue it's common knowledge that gcc asm has undefined
  behavior in general

testcase:

int main(){
    int i;
    asm (
        "xorl %%ebp, %%ebp\n\t"
        "movl %0, %%ebp\n\t"
        :: "m" (i)
        : "%ebp"
    );
    return 0;
}

--
Summary: gcc ignores ebp on the clobber list
Product: gcc
Version: 4.0.1
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: inline-asm
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: michaelni at gmx dot at
CC: gcc-bugs at gcc dot gnu dot org
GCC build triplet: x86-linux
GCC host triplet: x86-linux
GCC target triplet: x86-linux

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23313
[Bug inline-asm/11203] source doesn't compile with -O0 but they compile with -O3
--- Additional Comments From michaelni at gmx dot at 2005-01-22 17:10 ---
(In reply to comment #14)
> In any case, just because code is syntactically valid GNU C doesn't mean
> gcc can always compile it. With this kind of inline asm, you're bound to
> confuse the register allocator. The fact that it works at O3 is pure luck
> and not a bug.

Well, you are the gcc developers, so there's not much arguing about what you
consider valid. But last time I checked, the docs did not mention that asm
statements may fail to compile at random, and IMO, as long as this is not
clearly stated in the docs, this bug report really shouldn't be marked as
invalid. Say you don't want to fix it, say it would be too complicated to
fix, or whatever, but it's not invalid.

> Note that you're hitting an *error*, not an ICE.

No; at least one of the bug reports marked as a duplicate of this ends in an
ICE.

(In reply to comment #24)
> Martin, you should realize that this problem *cannot* be solved. Yes,
> there will perhaps be a time when this particular test case compiles,
> though I think that is unlikely. But anyway, then there will be other
> cases that fail.

Hmm, so the problem cannot be solved, but then maybe it will be solved, but
that doesn't count because there will be other unrelated bugs? I can't
follow this reasoning. Or do you mean that you can never solve all bugs, and
so there's no need to fix any single one?

> The reason is dead simple: register allocation is NP-complete, so it is
> even *theoretically* not possible to write register allocators that always
> find a coloring.
Register allocation in general is NP-complete, yes, but it seems you forget
that that is about finding the optimal solution, while gcc here fails to
find any solution. In practice this is a matter of assigning the registers
beginning with the most constrained operands and ending with the least, and
copying a few things onto the stack if gcc can't figure out how to access
them. Sure, this method might fail in 0.001% of the practical cases and need
a 2nd or 3rd pass where it tries different registers. It might also happen
that in some intentionally overconstrained cases it ends up searching the
whole 5040 possible assignments of 7 registers onto 7 non-memory operands,
but still it won't fail.

> That means any register allocator will always fail on some very
> constrained asm input.

Now that statement is just false, not to mention irrelevant, as none of
these asm statements are unreasonably constrained.

> And you cannot allow it to run indefinitely until a coloring is found,
> because then you've turned the graph coloring problem into the halting
> problem, because you can't prove that a coloring exists and that the
> register allocator algorithm will terminate.

This is ridiculous; the number of possible colorings is finite, so you can
always try them all in finite time.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11203
[Bug inline-asm/11203] source doesn't compile with -O0 but they compile with -O3
--- Additional Comments From michaelni at gmx dot at 2005-01-01 18:57 ---
(In reply to comment #12)
> Why do people write inline-asm like this?

Why not? It's valid code, and a compiler should compile valid code ...

> It is crazy to do so. Split up the inline-asm correctly.

Fix gcc first so it doesn't load/store more than needed between the split-up
parts.

> Anyone who writes inline-asm like this should get what they get. For MMX
> inline-asm, you should be using the intrinsics instead, as suggested
> before.

Let's see why it's not using intrinsics:
* it was written before intrinsics support was common
* intrinsics fail / get miscompiled commonly; it's so bad that some of the
  altivec intrinsic code in ffmpeg has been disabled if standard gcc is
  detected. there have also been very serious and similar problems in
  mplayer with altivec intrinsics; sadly I can't provide more details as I
  don't have a ppc
* many if not most of the mplayer developers still use gcc 2.95, because gcc
  3.* is slower and needs more memory, and AFAIK 2.95 doesn't support
  intrinsics
* it is a lot of work to rewrite and debug it just to make it compilable
  with gcc -O0

Or just write a real asm file? That's not a good idea either, as:
* it's slower due to the additional call/ret/parameter passing
* there are some symbol name mangling issues on some obscure systems (see
  the mplayer-dev or cvslog mailing list; it was discussed there a long time
  ago)

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11203