[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Yann Poupet changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yann at poupet dot eu

--- Comment #8 from Yann Poupet ---
Hi,

I had the same issue and modified GCC so that the prologue/epilogue do not
save/restore R4-R11 when it is not required, treating these registers as
caller-saved (same effect as -fcall-used-[r4...r11]). The behavior is enabled
with a new function attribute. I'm not sure whether this has a chance of being
accepted upstream, though.

I'm using it in 2 cases:
- ISRs, as described above
- the tasks launched by my own home-made microkernel. Indeed, when the kernel
  starts a task at its entry function, there's no need to save any register;
  doing so is just a waste of stack space.

Is anyone still interested in a solution? The patch is very small, maybe 10
lines.

Cheers
Yann
[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Timmy Brolin <web at brolinembedded dot se> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |web at brolinembedded dot se

--- Comment #7 from Timmy Brolin <web at brolinembedded dot se> 2013-03-02 15:23:11 UTC ---
(In reply to comment #2)
> If you really know that you don't need stack-alignment on an M3, then just
> remove the interrupt attribute. It really doesn't serve any other purpose on
> M-profile cores other than to cause the stack realignment.

What you suggest requires a change in the C-code depending on the processor.
That is, one piece of C-code will not compile optimally for different
Cortex-M3 revisions without modifications to the C-code itself. This is not
good for code which is intended to be used on multiple platforms.

Cortex-M3 r0p0 needs the prologue/epilogue.
Cortex-M3 r1p0 has a new configuration bit called STKALIGN which, when enabled,
makes the prologue/epilogue unnecessary. (But the default setting is that it
still needs the prologue/epilogue.)
Cortex-M3 r2p0 changed the default setting of STKALIGN so that the
prologue/epilogue are unnecessary by default.

I would suggest that the prologue/epilogue should be removed by default when
compiling for r2p0 or higher, but be kept by default for older revisions.
There should also be a compilation switch to manually enable/disable the
prologue/epilogue according to the chosen setting of STKALIGN.

Interrupts can often be time critical, so ISR entry is probably the worst
possible place for extra instructions.
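
As an illustration of the revision and STKALIGN checks discussed in this comment, here is a minimal C sketch. It assumes the ARMv7-M register layout: STKALIGN is bit 9 of the Configuration and Control Register (SCB->CCR, at 0xE000ED14), and CPUID (at 0xE000ED00) holds the "r" number in bits [23:20] and the "p" number in bits [3:0]. The helper names are hypothetical and are not part of GCC or CMSIS. The functions take the register value or a pointer to it as a parameter so that they can also be exercised off-target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ARMv7-M layout: STKALIGN is bit 9 of SCB->CCR (0xE000ED14). */
#define CCR_STKALIGN (UINT32_C(1) << 9)

/* CPUID (0xE000ED00): variant [23:20] is the "r" number of rNpM. */
static inline unsigned cpuid_variant(uint32_t cpuid)
{
    return (unsigned)((cpuid >> 20) & 0xFu);
}

/* CPUID (0xE000ED00): revision [3:0] is the "p" number of rNpM. */
static inline unsigned cpuid_revision(uint32_t cpuid)
{
    return (unsigned)(cpuid & 0xFu);
}

/* Hypothetical helper: true if hardware 8-byte stack alignment on
   exception entry is enabled in the given CCR value. */
static inline bool stkalign_enabled(uint32_t ccr)
{
    return (ccr & CCR_STKALIGN) != 0;
}

/* Hypothetical helper: set STKALIGN in the register that ccr points to.
   On target this would be (volatile uint32_t *)0xE000ED14. */
static inline void stkalign_enable(volatile uint32_t *ccr)
{
    *ccr |= CCR_STKALIGN;
}
```

Start-up code could call stkalign_enable() once before any interrupt is enabled. Whether the compiler can then omit the software realignment is exactly the question this bug report raises.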
[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #1 from Freddie Chopin <freddie_chopin at op dot pl> 2012-12-20 15:23:25 UTC ---
BTW - it seems that optimization settings don't make any difference here - the
code below was compiled with -Os, on all other levels (1,2,3) the assembly
looks like this:

2e90 DMA_IRQHandler:
void DMA_IRQHandler(void) __attribute((interrupt));
void DMA_IRQHandler(void)
{
    2e90:  4668        mov    r0, sp
    2e92:  f020 0107   bic.w  r1, r0, #7
    2e96:  468d        mov    sp, r1
    2e98:  b401        push   {r0}
}
    2e9a:  bc01        pop    {r0}
    2e9c:  4685        mov    sp, r0
    2e9e:  4770        bx     lr

So it just saves r0 only, without saving lr. It's actually 2 bytes smaller than
the assembly done for size optimizations (;

Without optimization (-O0) I get:

473c DMA_IRQHandler:
void DMA_IRQHandler(void) __attribute((interrupt));
void DMA_IRQHandler(void)
{
    473c:  4668        mov    r0, sp
    473e:  f020 0107   bic.w  r1, r0, #7
    4742:  468d        mov    sp, r1
    4744:  b481        push   {r0, r7}
    4746:  af00        add    r7, sp, #0
}
    4748:  46bd        mov    sp, r7
    474a:  bc81        pop    {r0, r7}
    474c:  4685        mov    sp, r0
    474e:  4770        bx     lr

The commandline options used to compile:

arm-none-eabi-gcc -c -mcpu=cortex-m3 -mthumb -O0 -ffunction-sections
-fdata-sections -Wall -Wstrict-prototypes -Wextra -std=gnu99 -g -ggdb3
-fverbose-asm -Wa,-ahlms=out/uart.lst -MD -MP -MF out/uart.d
<some include dirs> <input file> <output file>
[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P5
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-12-20
     Ever Confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> 2012-12-20 16:52:05 UTC ---
The code is there to re-align the stack to 64-bit alignment as required by the
ABI (early versions of the M3 did not have the ability to do this in HW).

The reason two registers are pushed, rather than one, is that this is also
needed to keep the stack aligned, and pushing two registers uses less code than
adjusting the stack in a separate insn.

Of course, in this trivial case, the stack realignment isn't necessary, as the
compiler should be able to tell that nothing requires re-alignment of the
stack. But it's a corner case and it's much more common for this to be needed.

If you really know that you don't need stack-alignment on an M3, then just
remove the interrupt attribute. It really doesn't serve any other purpose on
M-profile cores other than to cause the stack realignment.

Marking as an enhancement. The code generated today is correct, but
sub-optimal.
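
The realignment described in this comment is plain address arithmetic. The sketch below restates in C what the sequence "mov r0, sp; bic.w r1, r0, #7; mov sp, r1" from comment #1 computes. The function names are hypothetical and chosen only for illustration; the stack pointer is modelled as a 32-bit integer.

```c
#include <stdint.h>

/* Clear the low three bits, as "bic.w r1, r0, #7" does: the result is the
   nearest 8-byte (64-bit) aligned address at or below sp. */
static inline uint32_t align_down_8(uint32_t sp)
{
    return sp & ~UINT32_C(7);
}

/* Number of bytes the realignment moves sp down by. For a word-aligned sp
   this is either 0 (already aligned) or 4. */
static inline uint32_t realign_adjust(uint32_t sp)
{
    return sp - align_down_8(sp);
}
```

Because the adjustment depends on the incoming value of sp, the handler has to keep the original value (held in r0 in the listings) and restore it in the epilogue, which is why r0 is pushed and popped around the body.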
[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #3 from Freddie Chopin <freddie_chopin at op dot pl> 2012-12-20 17:07:47 UTC ---
Indeed that's a trivial case, but other - useful - cases also show strange
behavior which I cannot clearly explain, so while we're at it I'd be grateful
for some explanation...

An interrupt handler function (void something(void)), but without attribute,
doing something inside (posts a FreeRTOS semaphore, calls vPortYieldFromISR()
if it's needed) actually saves a lot of registers on entry:

    23b4:  b507        push   {r0, r1, r2, lr}

From what I know, r0-r3 as scratch registers don't need to be saved on entry,
as it's the caller's duty. There are also no parameters to be saved, as it's a
void function...

I observed the same behavior with some non-trivial functions from the lwIP
TCP/IP stack - they also save scratch registers on entry, even when they are
void ...(void):

5d00 dns_init:
void dns_init()
{
    5d00:  b537        push   {r0, r1, r2, r4, r5, lr}

Is that a bug or maybe I don't understand the calling conventions?

BTW:
> The reason two registers are pushed, rather than one is that this is also
> needed to keep the stack aligned and pushing two registers uses less code
> than adjusting the stack in a separate insn.

But for optimization level 1, 2 and 3 only one reg is pushed...

Thx in advance!
[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Joey Ye <joey.ye at arm dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |joey.ye at arm dot com

--- Comment #4 from Joey Ye <joey.ye at arm dot com> 2012-12-21 03:23:07 UTC ---
> An interrupt handler function (void something(void)), but without attribute,
> doing something inside (posts a FreeRTOS semaphore, calls vPortYieldFromISR()
> if it's needed) actually saves a lot of registers on entry:
>
> 23b4:  b507        push   {r0, r1, r2, lr}

Pushing of scratch registers can be used to
1. align the stack, which Richard has explained
2. allocate the stack frame, as a code size optimization of sub sp, #x

Consider the following example:

extern void bar(int *, int *);
void foo()
{
  int a, b;
  bar(&a, &b);
}

Built with -Os -mcpu=cortex-m3:
    push    {r0, r1, r2, lr}

Here, pushing r0 and r1 allocates an 8-byte frame for the local variables.
Pushing r2, together with pushing lr, keeps sp aligned to 8 bytes. The values
of r0-r2 pushed to the stack don't really matter.

But built with -O2:
    push    {lr}
    sub     sp, sp, #12

The former is better on code size, the latter wins on performance. Hopefully
this explains everything.
[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #5 from Joey Ye <joey.ye at arm dot com> 2012-12-21 03:32:21 UTC ---
However, there is room to improve both performance and stack consumption in
the -Os case:

extern void bar(int *);
void foo()
{
  int a;
  bar(&a);
}

Built with -mcpu=cortex-m3 -Os:
    push    {r0, r1, r2, lr}
    add     r0, sp, #4
    bl      bar
    pop     {r1, r2, r3, pc}

Apparently it should be optimized as follows, saving 8 bytes of stack
consumption and two stores:
    push    {r0, lr}
    mov     r0, sp
    bl      bar
    pop     {r1, pc}
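
The register counts in comments #4 and #5 can be checked with a little arithmetic. The following C sketch is not how GCC computes frame layouts; it only models the rule stated in the thread, under the assumptions that sp is 8-byte aligned on function entry, each pushed register occupies 4 bytes, and the whole frame is allocated by a single push. All names are hypothetical.

```c
#define WORD_BYTES   4u  /* one core register on the stack */
#define STACK_ALIGN  8u  /* alignment required at the call to bar() */

/* Round n up to the next multiple of STACK_ALIGN. */
static inline unsigned round_up_8(unsigned n)
{
    return (n + STACK_ALIGN - 1u) & ~(STACK_ALIGN - 1u);
}

/* Minimal bytes the prologue must reserve: the registers that really need
   saving plus the local variables, padded to keep sp 8-byte aligned. */
static inline unsigned frame_bytes(unsigned saved_regs, unsigned local_bytes)
{
    return round_up_8(saved_regs * WORD_BYTES + local_bytes);
}

/* Registers a single push needs to list in order to allocate that frame,
   counting the dummy scratch registers used as filler. */
static inline unsigned regs_to_push(unsigned saved_regs, unsigned local_bytes)
{
    return frame_bytes(saved_regs, local_bytes) / WORD_BYTES;
}
```

For comment #4 (lr saved, two int locals) this gives a 16-byte frame and four registers, matching push {r0, r1, r2, lr}. For comment #5 (lr saved, one int local) it gives 8 bytes and two registers, matching the proposed push {r0, lr} rather than the four registers actually emitted.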
[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #6 from Freddie Chopin <freddie_chopin at op dot pl> 2012-12-21 07:08:59 UTC ---
(In reply to comment #4)
> Former is better on code size, latter wins on performance. Hopefully this
> explains everything.

Indeed, it's clear now. Thank you for your time!