[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)

2023-11-28 Thread yann at poupet dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Yann Poupet  changed:

   What|Removed |Added

 CC||yann at poupet dot eu

--- Comment #8 from Yann Poupet  ---
Hi,

I had the same issue and modified GCC so that the prologue/epilogue do not
save/restore R4-R11 if it's not required - assuming these are caller-saved
(same effect as -fcall-used-[r4...r11]), with a new function attribute. I'm not
sure if this has a chance to be accepted upstream though.

I'm using it in 2 cases:

- ISR as described above
- for the tasks launched by my own home-made microkernel. Indeed, when the
kernel starts a task at its entry function, there's no need to save any
register; it's just a waste of stack space.

Is anyone still interested in a solution?
The patch is very small, maybe 10 lines.

Cheers
Yann

[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)

2013-03-02 Thread web at brolinembedded dot se

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Timmy Brolin <web at brolinembedded dot se> changed:

   What|Removed |Added

 CC||web at brolinembedded dot se

--- Comment #7 from Timmy Brolin <web at brolinembedded dot se> 2013-03-02 15:23:11 UTC ---

(In reply to comment #2)
> If you really know that you don't need stack-alignment on an M3, then just
> remove the interrupt attribute.  It really doesn't serve any other purpose on
> M-profile cores other than to cause the stack realignment.

What you suggest requires a change in the C-code depending on the processor.
That is, one piece of C-code will not compile optimally for different Cortex-M3
revisions without modifications to the C-code itself. This is not good for code
which is intended to be used on multiple platforms.

Cortex-M3 r0p0 needs the prologue/epilogue.

Cortex-M3 r1p0 has a new configuration bit called STKALIGN which when enabled
makes the prologue/epilogue unnecessary. (But the default setting is that it
still needs the prologue/epilogue)

Cortex-M3 r2p0 changed the default setting of STKALIGN so that the
prologue/epilogue are unnecessary by default.

I would suggest that the prologue/epilogue should be removed by default when
compiling for r2p0 or higher, but be kept by default for older revisions.
There should also be a compilation switch to manually enable/disable the
prologue/epilogue according to the chosen setting of STKALIGN.
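Whether the software realignment is redundant on a given part comes down to the STKALIGN setting described above. A minimal C sketch of that check follows, under assumptions that are not stated in this thread: STKALIGN is taken to be bit 9 of the Configuration and Control Register at 0xE000ED14 (per the ARMv7-M architecture documentation), and the macro and function names are hypothetical. The functions take the register value as a parameter so they can be exercised off-target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed address of the ARMv7-M Configuration and Control Register (CCR). */
#define CCR_ADDR          UINT32_C(0xE000ED14)
/* Assumed position of the STKALIGN bit within CCR (bit 9). */
#define CCR_STKALIGN_MASK (UINT32_C(1) << 9)

/* True when the hardware aligns the stack to 8 bytes on exception entry. */
static bool stkalign_enabled(uint32_t ccr_value)
{
    return (ccr_value & CCR_STKALIGN_MASK) != 0;
}

/* True when a handler still needs the software realigning prologue. */
static bool needs_sw_realign(uint32_t ccr_value)
{
    return !stkalign_enabled(ccr_value);
}

/* On target, the value would be read from the register itself, e.g.:
 *   uint32_t ccr = *(volatile uint32_t *)CCR_ADDR;
 */
```

A startup check along these lines could only confirm at run time that the setting matches what the code was built for; it does not replace the compile-time switch suggested above.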



Interrupts can often be time critical, so ISR entry is probably the worst
possible place for extra instructions.


[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)

2012-12-20 Thread freddie_chopin at op dot pl

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #1 from Freddie Chopin <freddie_chopin at op dot pl> 2012-12-20 15:23:25 UTC ---

BTW - it seems that optimization settings don't make any difference here - the
code below was compiled with -Os, on all other levels (1,2,3) the assembly
looks like this:

2e90 <DMA_IRQHandler>:
void DMA_IRQHandler(void) __attribute((interrupt));
void DMA_IRQHandler(void)
{
    2e90:   4668        mov     r0, sp
    2e92:   f020 0107   bic.w   r1, r0, #7
    2e96:   468d        mov     sp, r1
    2e98:   b401        push    {r0}
}
    2e9a:   bc01        pop     {r0}
    2e9c:   4685        mov     sp, r0
    2e9e:   4770        bx      lr

So it just saves r0 only, without saving lr. It's actually 2 bytes smaller than
the assembly done for size optimizations (;
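The register lists can be cross-checked against the halfword opcodes in the listing. A minimal C sketch follows, assuming the 16-bit Thumb PUSH encoding T1 (bit pattern 1011 010 M followed by an 8-bit list for r0-r7, where the M bit adds lr). The function names are hypothetical and not part of any toolchain.

```c
#include <stdint.h>

/* True if the halfword matches the assumed T1 PUSH pattern 1011 010x xxxx xxxx. */
static int thumb_is_push_t1(uint16_t op)
{
    return (op & 0xFE00u) == 0xB400u;
}

/* The M bit (bit 8) adds lr to the pushed register list. */
static int thumb_push_saves_lr(uint16_t op)
{
    return (op >> 8) & 1;
}

/* Number of registers pushed, or -1 if the halfword is not a T1 PUSH. */
static int thumb_push_count(uint16_t op)
{
    if (!thumb_is_push_t1(op))
        return -1;
    int count = thumb_push_saves_lr(op);
    /* Count the set bits in the r0-r7 list (low byte). */
    for (unsigned list = op & 0x00FFu; list != 0; list >>= 1)
        count += (int)(list & 1u);
    return count;
}
```

For 0xB401 this gives one register and no lr, matching push {r0}; for 0xB481 it gives two registers, matching push {r0, r7}.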



Without optimization (-O0) I get:

473c <DMA_IRQHandler>:
void DMA_IRQHandler(void) __attribute((interrupt));
void DMA_IRQHandler(void)
{
    473c:   4668        mov     r0, sp
    473e:   f020 0107   bic.w   r1, r0, #7
    4742:   468d        mov     sp, r1
    4744:   b481        push    {r0, r7}
    4746:   af00        add     r7, sp, #0
}
    4748:   46bd        mov     sp, r7
    474a:   bc81        pop     {r0, r7}
    474c:   4685        mov     sp, r0
    474e:   4770        bx      lr

The commandline options used to compile:

arm-none-eabi-gcc -c -mcpu=cortex-m3 -mthumb -O0 -ffunction-sections
-fdata-sections -Wall -Wstrict-prototypes -Wextra -std=gnu99 -g -ggdb3
-fverbose-asm -Wa,-ahlms=out/uart.lst   -MD -MP -MF out/uart.d <some include
dirs> <input file> <output file>


[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)

2012-12-20 Thread rearnsha at gcc dot gnu.org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:

   What|Removed |Added

   Priority|P3  |P5
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2012-12-20
 Ever Confirmed|0   |1
   Severity|normal  |enhancement

--- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> 2012-12-20 16:52:05 UTC ---

The code is there to re-align the stack to 64-bit alignment as required by the
ABI (early versions of the M3 did not have the ability to do this in HW).  The
reason two registers are pushed, rather than one is that this is also needed to
keep the stack aligned and pushing two registers uses less code than adjusting
the stack in a separate insn.
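The realignment itself is plain bit arithmetic. As a minimal sketch, the C below models what the "bic.w r1, r0, #7" in the generated prologue computes; the function names and the example addresses are hypothetical, and the code only imitates the arithmetic on a host rather than touching a real stack pointer.

```c
#include <stdint.h>

/* Model of "bic.w r1, r0, #7": clear the low three bits so the stack
 * pointer is rounded down to a multiple of 8 (64-bit alignment). */
static uint32_t align_sp_down_8(uint32_t sp)
{
    return sp & ~UINT32_C(7);
}

/* Bytes skipped by the realignment: 0 when sp was already 8-byte
 * aligned, 4 when it was only word-aligned. */
static uint32_t realign_padding(uint32_t sp)
{
    return sp - align_sp_down_8(sp);
}
```

For a word-aligned value such as 0x20001FFC the result is 0x20001FF8 with 4 bytes of padding, while 0x20002000 is returned unchanged. The original sp is what the prologue then pushes so the epilogue can restore it.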



Of course, in this trivial case, the stack realignment isn't necessary as the
compiler should be able to tell that nothing requires re-alignment of the
stack.  But it's a corner case and it's much more common for this to be needed.

If you really know that you don't need stack-alignment on an M3, then just
remove the interrupt attribute.  It really doesn't serve any other purpose on
M-profile cores other than to cause the stack realignment.

Marking as an enhancement.  The code generated today is correct, but
sub-optimal.


[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)

2012-12-20 Thread freddie_chopin at op dot pl

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #3 from Freddie Chopin <freddie_chopin at op dot pl> 2012-12-20 17:07:47 UTC ---

Indeed that's a trivial case, but other - useful - cases also show strange
behavior which I cannot clearly explain, so while we're at it I'd be grateful
for some explanation...

An interrupt handler function (void something(void)), but without attribute,
doing something inside (posts a FreeRTOS semaphore, calls vPortYieldFromISR()
if it's needed) actually saves a lot of registers on entry:

    23b4:   b507        push    {r0, r1, r2, lr}

From what I know r0-r3 as scratch registers don't need to be saved on entry, as
it's the caller's duty. There are also no parameters to be saved, as it's a void
function...

I observed the same behavior with some non-trivial functions from the lwIP
TCP/IP stack - they also save scratch registers on entry, even when they
are void ...(void):

5d00 <dns_init>:
void
dns_init()
{
    5d00:   b537        push    {r0, r1, r2, r4, r5, lr}

Is that a bug or maybe I don't understand the calling conventions? ;)

BTW:
> The reason two registers are pushed, rather than one is that this is also
> needed to keep the stack aligned and pushing two registers uses less code than
> adjusting the stack in a separate insn.

But for optimization level 1, 2 and 3 only one reg is pushed...

Thx in advance!


[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)

2012-12-20 Thread joey.ye at arm dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Joey Ye <joey.ye at arm dot com> changed:

   What|Removed |Added

 CC||joey.ye at arm dot com

--- Comment #4 from Joey Ye <joey.ye at arm dot com> 2012-12-21 03:23:07 UTC ---

> An interrupt handler function (void something(void)), but without attribute,
> doing something inside (posts a FreeRTOS semaphore, calls vPortYieldFromISR()
> if it's needed) actually saves a lot of registers on entry:
> 23b4:   b507        push    {r0, r1, r2, lr}

Pushing of scratch registers can be used to
1. align the stack, which Richard has explained
2. allocate the stack frame, as a code size optimization of sub sp, #x

This can be explained with the following example:

extern void bar(int *, int *);
void foo()
{
    int a, b;
    bar(&a, &b);
}

Built with -Os -mcpu=cortex-m3:
    push    {r0, r1, r2, lr}

Here, pushing of r0 and r1 allocates an 8-byte frame for local variables.
Pushing of r2 is to make sp aligned to 8 bytes together with pushing lr. Values
of r0-r2 pushed to stack don't really matter.
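The accounting above can be written down as a small calculation. This is only a sketch of the arithmetic, not of how GCC actually chooses its register list: it assumes 4-byte core registers, a stack pointer that is 8-byte aligned on entry, and locals allocated purely by pushing. The function names are hypothetical.

```c
/* Bytes placed on the stack by pushing n core registers (4 bytes each). */
static unsigned push_bytes(unsigned n_regs)
{
    return 4u * n_regs;
}

/* Registers a single push must cover so that the saved registers plus
 * the locals keep sp 8-byte aligned, assuming sp was aligned on entry. */
static unsigned frame_regs(unsigned saved_regs, unsigned local_bytes)
{
    unsigned words = saved_regs + (local_bytes + 3u) / 4u;
    return words + (words & 1u);  /* round up to an even word count */
}
```

With one saved register (lr) and 8 bytes of locals, frame_regs(1, 8) is 4, which matches push {r0, r1, r2, lr} and 16 bytes of stack.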



But built with -O2:
    push    {lr}
    sub     sp, sp, #12

The former is better on code size, the latter wins on performance. Hopefully
this explains everything.


[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)

2012-12-20 Thread joey.ye at arm dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #5 from Joey Ye <joey.ye at arm dot com> 2012-12-21 03:32:21 UTC ---

However, there is room to improve both performance and stack consumption in
the case of -Os:

extern void bar(int *);

void foo()
{
    int a;
    bar(&a);
}

Built with -mcpu=cortex-m3 -Os:
    push    {r0, r1, r2, lr}
    add     r0, sp, #4
    bl      bar
    pop     {r1, r2, r3, pc}

Apparently it should be optimized to save 8 bytes of stack consumption and two
stores:
    push    {r0, lr}
    mov     r0, sp
    bl      bar
    pop     {r1, pc}


[Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)

2012-12-20 Thread freddie_chopin at op dot pl

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #6 from Freddie Chopin <freddie_chopin at op dot pl> 2012-12-21 07:08:59 UTC ---

(In reply to comment #4)
> Former is better on code size, latter wins on performance. Hopefully this
> explains everything.

Indeed, it's clear now. Thank you for your time!