On Thursday 13 January 2011 12:48 AM, Albert ARIBAUD wrote:
> (I realize I did not answer the other ones)
>
> On 08/01/2011 11:06, Aneesh V wrote:
>
>>> Out of curiosity, can you elaborate on why the compiler would optimize
>>> better in these cases?
>>
>> While counting down the termination condition check is against 0. So
>> you can just decrement the loop count using a 'subs' and do a 'bne'.
>> When you count up you have to do a comparison with a non-zero value. So
>> you will need one 'cmp' instruction extra:-)
>
> I would not try to be too smart about what instructions are generated
> and how by a compiler such as gcc which has rather complex code
> generation optimizations.

IMHO, on ARM comparing with 0 is always going to be more efficient than
comparing with a non-zero number for a termination condition, assuming a
decent compiler.

>
>> bigger loop inside because that reduces the frequency at which your
>> outer parameter changes and hence the overall number of instructions
>> executed. Consider this:
>> 1. We encode both the loop counts along with other data into a register
>> that is finally written to CP15 register.
>> 2. outer loop has the code for shifting and ORing the outer variable to
>> this register.
>> 3. Inner loop has the code for shifting and ORing the inner variable.
>> Step (3) has to be executed 'way x set' number of times anyways.
>> But having bigger loop inside makes sure that 2 is executed fewer times!
>
> Here too it seems like you're underestimating the compiler's optimizing
> capabilities -- your explanation seems to amount to extracting a
> constant calculation from a loop, something that is rather usual in code
> optimizing.

It's not a constant calculation. It's based on the loop index. And this
optimization does not rely on compiler specifics. It is a logic-level
optimization, so it should generally give good results with all compilers.
Perhaps I was wrong in stating that it helps in getting better assembly.
It just helps run-time efficiency.

Actually, in my experience (in this same context) GCC does a terrible job
at this kind of extraction! For instance:

+	for (set = num_sets - 1; set >= 0; set--) {
+		setway = (level << 1) | (set << log2_line_len) |
+			 (way << way_shift);

Here, way_shift = 32 - log2_num_ways

But if you substitute way_shift with the latter, GCC will put the
subtraction instruction inside the loop, whereas it is clearly loop
invariant. So I had to move it explicitly out of the loop! In fact, I was
thinking of giving this feedback to GCC.

>
>> With these tweaks the assembly code generated by this C code is as good
>> as the original hand-written assembly code with my compiler.
>
> How about other compilers?

I haven't tested other compilers. However, as I mentioned above, the
latter one is a logic optimization. The former should hopefully help all
ARM compilers.

As you probably know, the existing code for cache maintenance was in
assembly. When I moved it to C I wanted to make sure that the generated
code is as good as the original assembly for this critical piece of code
(I expected some criticism about moving it to C :-)). That's why I checked
the generated code and made these, hopefully minor, tweaks to make it
better. I hope they don't have any serious drawbacks.

Best regards,
Aneesh
_______________________________________________
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot
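
For illustration, the tweaks discussed in this mail (counting down to 0,
putting the bigger loop inside, and computing way_shift once outside the
loops) can be sketched roughly as below. This is only a host-side sketch
and not the code from the patch: write_setway(), maintain_level() and the
two counters are hypothetical names, and the CP15 write is replaced by a
stub so the loops can be run and checked on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the CP15 set/way write (an mcr on real
 * hardware). It only records what it was given so the loops can be
 * checked on a host machine. */
static uint32_t writes;      /* number of set/way operations issued */
static uint32_t last_setway; /* most recent operand */

static void write_setway(uint32_t setway)
{
	last_setway = setway;
	writes++;
}

/*
 * Walk every set/way of one cache level, counting down in both loops.
 * Returns how many times the outer-loop shift/OR was executed.
 */
static uint32_t maintain_level(uint32_t level, int num_sets, int num_ways,
			       uint32_t log2_line_len, uint32_t log2_num_ways)
{
	uint32_t outer_steps = 0;
	uint32_t way_shift;
	int way, set;

	/* A shift by 32 would be undefined, so require at least 2 ways. */
	assert(log2_num_ways >= 1 && log2_num_ways <= 31);

	/* Loop invariant, computed once instead of inside the loop. */
	way_shift = 32 - log2_num_ways;

	/* Smaller loop (ways) outside, bigger loop (sets) inside. */
	for (way = num_ways - 1; way >= 0; way--) {
		/* Outer-variable part: runs num_ways times only. */
		uint32_t base = (level << 1) | ((uint32_t)way << way_shift);

		outer_steps++;
		/* Counting down lets the exit test compare against 0. */
		for (set = num_sets - 1; set >= 0; set--)
			write_setway(base | ((uint32_t)set << log2_line_len));
	}
	return outer_steps;
}
```

With 256 sets and 4 ways, for example, the inner shift/OR still runs
1024 times, but the outer one runs 4 times instead of the 256 times it
would take with the loops swapped.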