Re: [PATCH] Frame pointer for arm with THUMB2 mode
Richard, thanks for the answer, got it. On 09/05/2018 07:51 PM, Richard Earnshaw (lists) wrote: On 05/09/18 17:43, Denis Khalikov wrote: Thanks for the answers. I understood that, this hack makes more mess in codegen for arm, but can you please clarify what did you mean by Only a single register can be used as the frame chain. As far as I understood, GCC for arm with THUMB2 mode uses r7 register as frame pointer register by default (with -fno-omit-frame-pointer flag), GCC for arm with ARM mode uses r11, so I didn't really propose to change the frame pointer register. On entry to a function the code has to save the existing frame register. It doesn't know (can't trivially know) whether the caller is code compiled in Arm state or Thumb state. So how can it save the caller's frame register if they are not the same? Furthermore, the 'other' frame register (ie r7 in Arm state, r11 in Thumb) is available as a call-saved register, so can contain any random value. If you try to use that random value during a frame chain walk your program will most like take an access violation. It will certainly give you a garbage frame chain. R. Thanks. On 09/05/2018 06:32 PM, Wilco Dijkstra wrote: Hi Denis, Adding support for a frame chain would require an ABI change. It would have to > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of > effort. Clang already works that way. No, that's incorrect like Richard pointed out. Only a single register can be used as the frame chain. If we change the size of the traces to 2, it could be something like this: ... At the first example we lost the full context, from where the control/data flow comes from. If 2 is not sufficient, then try 3 or 4. It may also be feasible to only enable deeper unwinding for particular libraries so you only pay an extra cost for leaks you are interested in. The stack layout like this enables only with compile time flag (-mthumb-fp and works only together with -mthumb and -fno-omit-frame-pointer). It does not affect other codegen. But any code built like that will *always* run slower even if you don't use the sanitizer. Wilco
Re: [PATCH] Frame pointer for arm with THUMB2 mode
On 05/09/18 17:43, Denis Khalikov wrote: > Thanks for the answers. > > I understood that, this hack makes more mess in codegen for arm, > but can you please clarify what did you mean by > >>Only a single register can be used >> as the frame chain. > > As far as I understood, GCC for arm with THUMB2 mode uses r7 register as > frame pointer register by default (with -fno-omit-frame-pointer flag), > GCC for arm with ARM mode uses r11, so I didn't really propose > to change the frame pointer register. > On entry to a function the code has to save the existing frame register. It doesn't know (can't trivially know) whether the caller is code compiled in Arm state or Thumb state. So how can it save the caller's frame register if they are not the same? Furthermore, the 'other' frame register (ie r7 in Arm state, r11 in Thumb) is available as a call-saved register, so can contain any random value. If you try to use that random value during a frame chain walk your program will most like take an access violation. It will certainly give you a garbage frame chain. R. > Thanks. > > On 09/05/2018 06:32 PM, Wilco Dijkstra wrote: >> Hi Denis, >> Adding support for a frame chain would require an ABI change. It >>> would have to >>> > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial >>> amount of >>> > effort. >> >>> Clang already works that way. >> >> No, that's incorrect like Richard pointed out. Only a single register >> can be used >> as the frame chain. >> >>> If we change the size of the traces to 2, it could be something like >>> this: >> ... >>> At the first example we lost the full context, from where the >>> control/data flow comes from. >> >> If 2 is not sufficient, then try 3 or 4. It may also be feasible to >> only enable >> deeper unwinding for particular libraries so you only pay an extra >> cost for >> leaks you are interested in. >> >>> The stack layout like this enables only with compile time flag >>> (-mthumb-fp and works only together with -mthumb and >>> -fno-omit-frame-pointer). It does not affect other codegen. >> >> But any code built like that will *always* run slower even if you >> don't use >> the sanitizer. >> >> Wilco >>
Re: [PATCH] Frame pointer for arm with THUMB2 mode
Thanks for the answers. I understood that, this hack makes more mess in codegen for arm, but can you please clarify what did you mean by >Only a single register can be used > as the frame chain. As far as I understood, GCC for arm with THUMB2 mode uses r7 register as frame pointer register by default (with -fno-omit-frame-pointer flag), GCC for arm with ARM mode uses r11, so I didn't really propose to change the frame pointer register. Thanks. On 09/05/2018 06:32 PM, Wilco Dijkstra wrote: Hi Denis, Adding support for a frame chain would require an ABI change. It would have to > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of > effort. Clang already works that way. No, that's incorrect like Richard pointed out. Only a single register can be used as the frame chain. If we change the size of the traces to 2, it could be something like this: ... At the first example we lost the full context, from where the control/data flow comes from. If 2 is not sufficient, then try 3 or 4. It may also be feasible to only enable deeper unwinding for particular libraries so you only pay an extra cost for leaks you are interested in. The stack layout like this enables only with compile time flag (-mthumb-fp and works only together with -mthumb and -fno-omit-frame-pointer). It does not affect other codegen. But any code built like that will *always* run slower even if you don't use the sanitizer. Wilco
Re: [PATCH] Frame pointer for arm with THUMB2 mode
Hi Denis, >> Adding support for a frame chain would require an ABI change. It > would have to > > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of > > effort. > Clang already works that way. No, that's incorrect like Richard pointed out. Only a single register can be used as the frame chain. > If we change the size of the traces to 2, it could be something like this: ... > At the first example we lost the full context, from where the > control/data flow comes from. If 2 is not sufficient, then try 3 or 4. It may also be feasible to only enable deeper unwinding for particular libraries so you only pay an extra cost for leaks you are interested in. > The stack layout like this enables only with compile time flag > (-mthumb-fp and works only together with -mthumb and > -fno-omit-frame-pointer). It does not affect other codegen. But any code built like that will *always* run slower even if you don't use the sanitizer. Wilco
Re: [PATCH] Frame pointer for arm with THUMB2 mode
On 05/09/18 14:55, Denis Khalikov wrote: > Hi Wilco, > thanks for the answer. > >> Adding support for a frame chain would require an ABI change. It would > have to >> work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of >> effort. > > Clang already works that way. > Please look at this commit: > http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/ARM/ARMCallingConv.td?r1=269459&r2=269458&pathrev=269459 > > > This is an example of code which clang generates > with -mthumb -fno-omit-frame-pointer -O2. > > @ %bb.0: > push {r4, r5, r6, r7, lr} > add r7, sp, #12 > push.w {r8, r9, r10} > sub sp, #56 > mov r8, r0 > movw r0, :lower16:_ZTVN6sensor14sensor_handlerE > movt r0, :upper16:_ZTVN6sensor14sensor_handlerE > mov r9, r8 > adds r0, #8 > str r0, [r9], #4 > > The only difference is clang sets frame pointer to the r7 on the stack > instead gcc sets to lr, > but it already handles by the Sanitizers. Then Clang is broken. You can't have different frame pointers in Arm and Thumb code. I object to another hack going in for another ill-specified frame pointer variant until such time as the ABI is updated to sort this out properly. The frame handling code is just too critical to overall performance and the prologue and epilogue code itself is also quite fragile in this respect. Adding more mess on top of that is going to make sorting this out even more tricky: and once a patch like this goes on it's not easy to remove it again. So until the ABI sanctions a proper inter-function frame chain record, GCC will only support local use of the frame pointer and no chaining. R. > >> >> So this sounds like the first thing to do is reducing the size of the > stack traces. >> The default is 30 which is far larger than useful. Using 1 for example > should >> always be fast (since you can use __builtin_return_address(0)) and > still get >> the function that called malloc. Also if the unwinder happens to be > too slow, >> it should be optimized and caching added etc. >> > > If we change the size of the traces to 2, it could be something like this: > > 0xb42a50a0 is located 0 bytes inside of 88-byte region > [0xb42a50a0,0xb42a50f8) > freed by thread T0 here: > #0 0xb6a35cc7 in __interceptor_free (/usr/lib/libasan.so+0x127cc7) > #1 0xb5fa64e3 in ipc::message::unref() > (/lib/libsensord-shared.so+0xe4e3) > > previously allocated by thread T0 here: > #0 0xb6a36157 in malloc (/usr/lib/libasan.so+0x128157) > #1 0xb2f8852d in accel_device::get_data(unsigned int, > sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d) > > Instead this: > > 0xb42250a0 is located 0 bytes inside of 88-byte region > [0xb42250a0,0xb42250f8) > freed by thread T0 here: > #0 0xb6989cff in __interceptor_free (/usr/lib/libasan.so+0x127cff) > #1 0xb5fa64e3 in ipc::message::unref() > (/lib/libsensord-shared.so+0xe4e3) > #2 0xb6efbf47 in sensor::sensor_handler::notify(char const*, > sensor_data_t*, int) (/usr/bin/sensord+0x1ef47) > #3 0xb6efad43 in sensor::sensor_event_handler::handle(int, unsigned > int) (/usr/bin/sensord+0x1dd43) > #4 0xb5fa3bbb (/lib/libsensord-shared.so+0x) > #5 0xb62d9a15 in g_main_context_dispatch > (/lib/libglib-2.0.so.0+0x91a15) > #6 0xb62da2d9 (/lib/libglib-2.0.so.0+0x922d9) > #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9) > #8 0xb5fa4e1b in ipc::event_loop::run(int) > (/lib/libsensord-shared.so+0xce1b) > #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5) > #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b) > > previously allocated by thread T0 here: > #0 0xb698a18f in malloc (/usr/lib/libasan.so+0x12818f) > #1 0xb2f8852d in accel_device::get_data(unsigned int, > sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d) > #2 0xb6ef848f in > sensor::physical_sensor_handler::get_data(sensor_data_t**, int*) > (/usr/bin/sensord+0x1b48f) > #3 0xb6efaa51 in sensor::sensor_event_handler::handle(int, unsigned > int) (/usr/bin/sensord+0x1da51) > #4 0xb5fa3bbb (/lib/libsensord-shared.so+0x) > #5 0xb62d9a15 in g_main_context_dispatch > (/lib/libglib-2.0.so.0+0x91a15) > #6 0xb62da2d9 (/lib/libglib-2.0.so.0+0x922d9) > #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9) > #8 0xb5fa4e1b in ipc::event_loop::run(int) > (/lib/libsensord-shared.so+0xce1b) > #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5) > #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b) > > > At the first example we lost the full context, from where the > control/data flow comes from. > >> >> The issue is that the frame pointer and frame chain always add a large >> overhead even when you do not use any sanitizers. This is especially bad >> for the proposed patch - you lose much of the benefit of using Thumb-2... >> > > The stack la
Re: [PATCH] Frame pointer for arm with THUMB2 mode
Hi Wilco, thanks for the answer. > Adding support for a frame chain would require an ABI change. It would have to > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of > effort. Clang already works that way. Please look at this commit: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/ARM/ARMCallingConv.td?r1=269459&r2=269458&pathrev=269459 This is an example of code which clang generates with -mthumb -fno-omit-frame-pointer -O2. @ %bb.0: push{r4, r5, r6, r7, lr} add r7, sp, #12 push.w {r8, r9, r10} sub sp, #56 mov r8, r0 movwr0, :lower16:_ZTVN6sensor14sensor_handlerE movtr0, :upper16:_ZTVN6sensor14sensor_handlerE mov r9, r8 addsr0, #8 str r0, [r9], #4 The only difference is clang sets frame pointer to the r7 on the stack instead gcc sets to lr, but it already handles by the Sanitizers. > > So this sounds like the first thing to do is reducing the size of the stack traces. > The default is 30 which is far larger than useful. Using 1 for example should > always be fast (since you can use __builtin_return_address(0)) and still get > the function that called malloc. Also if the unwinder happens to be too slow, > it should be optimized and caching added etc. > If we change the size of the traces to 2, it could be something like this: 0xb42a50a0 is located 0 bytes inside of 88-byte region [0xb42a50a0,0xb42a50f8) freed by thread T0 here: #0 0xb6a35cc7 in __interceptor_free (/usr/lib/libasan.so+0x127cc7) #1 0xb5fa64e3 in ipc::message::unref() (/lib/libsensord-shared.so+0xe4e3) previously allocated by thread T0 here: #0 0xb6a36157 in malloc (/usr/lib/libasan.so+0x128157) #1 0xb2f8852d in accel_device::get_data(unsigned int, sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d) Instead this: 0xb42250a0 is located 0 bytes inside of 88-byte region [0xb42250a0,0xb42250f8) freed by thread T0 here: #0 0xb6989cff in __interceptor_free (/usr/lib/libasan.so+0x127cff) #1 0xb5fa64e3 in ipc::message::unref() (/lib/libsensord-shared.so+0xe4e3) #2 0xb6efbf47 in sensor::sensor_handler::notify(char const*, sensor_data_t*, int) (/usr/bin/sensord+0x1ef47) #3 0xb6efad43 in sensor::sensor_event_handler::handle(int, unsigned int) (/usr/bin/sensord+0x1dd43) #4 0xb5fa3bbb (/lib/libsensord-shared.so+0x) #5 0xb62d9a15 in g_main_context_dispatch (/lib/libglib-2.0.so.0+0x91a15) #6 0xb62da2d9 (/lib/libglib-2.0.so.0+0x922d9) #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9) #8 0xb5fa4e1b in ipc::event_loop::run(int) (/lib/libsensord-shared.so+0xce1b) #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5) #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b) previously allocated by thread T0 here: #0 0xb698a18f in malloc (/usr/lib/libasan.so+0x12818f) #1 0xb2f8852d in accel_device::get_data(unsigned int, sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d) #2 0xb6ef848f in sensor::physical_sensor_handler::get_data(sensor_data_t**, int*) (/usr/bin/sensord+0x1b48f) #3 0xb6efaa51 in sensor::sensor_event_handler::handle(int, unsigned int) (/usr/bin/sensord+0x1da51) #4 0xb5fa3bbb (/lib/libsensord-shared.so+0x) #5 0xb62d9a15 in g_main_context_dispatch (/lib/libglib-2.0.so.0+0x91a15) #6 0xb62da2d9 (/lib/libglib-2.0.so.0+0x922d9) #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9) #8 0xb5fa4e1b in ipc::event_loop::run(int) (/lib/libsensord-shared.so+0xce1b) #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5) #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b) At the first example we lost the full context, from where the control/data flow comes from. > > The issue is that the frame pointer and frame chain always add a large > overhead even when you do not use any sanitizers. This is especially bad > for the proposed patch - you lose much of the benefit of using Thumb-2... > The stack layout like this enables only with compile time flag (-mthumb-fp and works only together with -mthumb and -fno-omit-frame-pointer). It does not affect other codegen. Thanks. On 09/05/2018 03:11 PM, Wilco Dijkstra wrote: Hi Denis, We are working on applying Address/LeakSanitizer for the full Tizen OS distribution. It's about ~1000 packages, ASan/LSan runtime is installed to ld.so.preload. As we know ASan/LSan has interceptors for allocators/deallocators such as (malloc/realloc/calloc/free) and so on. On every allocation from user space program, ASan calls GET_STACK_TRACE_MALLOC; which unwinds the stack frame, and by default uses frame based stack unwinder. So, it requires to build with "-fno-omit-frame-pointer", switching it to default unwinder really hits the performance in our case. So this sounds like the first thing to do is reducing the size of the stack traces. The default
Re: [PATCH] Frame pointer for arm with THUMB2 mode
Hi Wilco, thanks for the answer. > Adding support for a frame chain would require an ABI change. It would have to > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of > effort. Clang already works that way. Please look at this commit: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/ARM/ARMCallingConv.td?r1=269459&r2=269458&pathrev=269459 This is an example of code which clang generates with -mthumb -fno-omit-frame-pointer -O2. @ %bb.0: push{r4, r5, r6, r7, lr} add r7, sp, #12 push.w {r8, r9, r10} sub sp, #56 mov r8, r0 movwr0, :lower16:_ZTVN6sensor14sensor_handlerE movtr0, :upper16:_ZTVN6sensor14sensor_handlerE mov r9, r8 addsr0, #8 str r0, [r9], #4 The only difference is clang sets frame pointer to the r7 on the stack instead gcc sets to lr, but it already handles by the Sanitizers. > > So this sounds like the first thing to do is reducing the size of the stack traces. > The default is 30 which is far larger than useful. Using 1 for example should > always be fast (since you can use __builtin_return_address(0)) and still get > the function that called malloc. Also if the unwinder happens to be too slow, > it should be optimized and caching added etc. > If we change the size of the traces to 2, it could be something like this: 0xb42a50a0 is located 0 bytes inside of 88-byte region [0xb42a50a0,0xb42a50f8) freed by thread T0 here: #0 0xb6a35cc7 in __interceptor_free (/usr/lib/libasan.so+0x127cc7) #1 0xb5fa64e3 in ipc::message::unref() (/lib/libsensord-shared.so+0xe4e3) previously allocated by thread T0 here: #0 0xb6a36157 in malloc (/usr/lib/libasan.so+0x128157) #1 0xb2f8852d in accel_device::get_data(unsigned int, sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d) Instead this: 0xb42250a0 is located 0 bytes inside of 88-byte region [0xb42250a0,0xb42250f8) freed by thread T0 here: #0 0xb6989cff in __interceptor_free (/usr/lib/libasan.so+0x127cff) #1 0xb5fa64e3 in ipc::message::unref() (/lib/libsensord-shared.so+0xe4e3) #2 0xb6efbf47 in sensor::sensor_handler::notify(char const*, sensor_data_t*, int) (/usr/bin/sensord+0x1ef47) #3 0xb6efad43 in sensor::sensor_event_handler::handle(int, unsigned int) (/usr/bin/sensord+0x1dd43) #4 0xb5fa3bbb (/lib/libsensord-shared.so+0x) #5 0xb62d9a15 in g_main_context_dispatch (/lib/libglib-2.0.so.0+0x91a15) #6 0xb62da2d9 (/lib/libglib-2.0.so.0+0x922d9) #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9) #8 0xb5fa4e1b in ipc::event_loop::run(int) (/lib/libsensord-shared.so+0xce1b) #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5) #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b) previously allocated by thread T0 here: #0 0xb698a18f in malloc (/usr/lib/libasan.so+0x12818f) #1 0xb2f8852d in accel_device::get_data(unsigned int, sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d) #2 0xb6ef848f in sensor::physical_sensor_handler::get_data(sensor_data_t**, int*) (/usr/bin/sensord+0x1b48f) #3 0xb6efaa51 in sensor::sensor_event_handler::handle(int, unsigned int) (/usr/bin/sensord+0x1da51) #4 0xb5fa3bbb (/lib/libsensord-shared.so+0x) #5 0xb62d9a15 in g_main_context_dispatch (/lib/libglib-2.0.so.0+0x91a15) #6 0xb62da2d9 (/lib/libglib-2.0.so.0+0x922d9) #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9) #8 0xb5fa4e1b in ipc::event_loop::run(int) (/lib/libsensord-shared.so+0xce1b) #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5) #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b) At the first example we lost the full context, from where the control/data flow comes from. > > The issue is that the frame pointer and frame chain always add a large > overhead even when you do not use any sanitizers. This is especially bad > for the proposed patch - you lose much of the benefit of using Thumb-2... > The stack layout like this enables only with compile time flag (-mthumb-fp and works only together with -mthumb and -fno-omit-frame-pointer). It does not affect other codegen. Thanks. On 09/05/2018 03:11 PM, Wilco Dijkstra wrote: Hi Denis, We are working on applying Address/LeakSanitizer for the full Tizen OS distribution. It's about ~1000 packages, ASan/LSan runtime is installed to ld.so.preload. As we know ASan/LSan has interceptors for allocators/deallocators such as (malloc/realloc/calloc/free) and so on. On every allocation from user space program, ASan calls GET_STACK_TRACE_MALLOC; which unwinds the stack frame, and by default uses frame based stack unwinder. So, it requires to build with "-fno-omit-frame-pointer", switching it to default unwinder really hits the performance in our case. So this sounds like the first thing to do is reducing the size of the stack traces. The defau
Re: [PATCH] Frame pointer for arm with THUMB2 mode
Hi Denis, > We are working on applying Address/LeakSanitizer for the full Tizen OS > distribution. It's about ~1000 packages, ASan/LSan runtime is installed > to ld.so.preload. As we know ASan/LSan has interceptors for > allocators/deallocators such as (malloc/realloc/calloc/free) and so on. > On every allocation from user space program, ASan calls > GET_STACK_TRACE_MALLOC; > which unwinds the stack frame, and by default uses frame based stack > unwinder. So, it requires to build with "-fno-omit-frame-pointer", > switching it to default unwinder really hits the performance in our case. So this sounds like the first thing to do is reducing the size of the stack traces. The default is 30 which is far larger than useful. Using 1 for example should always be fast (since you can use __builtin_return_address(0)) and still get the function that called malloc. Also if the unwinder happens to be too slow, it should be optimized and caching added etc. >> Doing real unwinding is also far more accurate than frame pointer based > > unwinding (the latter doesn't handle leaf functions correctly, > entry/exit in > > non-leaf functions and shrinkwrapped functions - and this breaks > callgraph > > profiling). > > I agree, but in our case, all interceptors for allocators are > leaf functions, so the frame based stack unwinder works well for us. Yes a frame chain would work for this case. But it's not currently supported. > By default we build packages with ("-marm" "-fno-omit-frame-pointer"), > because need frame based stack unwinder for every allocation, as I said > before. As we know GCC sets fp to lr on the stack with > ("-fno-omit-frame-pointer" and "-marm") and I don't really know why. > But the binary size is bigger than for thumb, so, we cannot use default > thumb frame pointer and want to reduce binary size for the full > sanitized image. The issue is that the frame pointer and frame chain always add a large overhead even when you do not use any sanitizers. This is especially bad for the proposed patch - you lose much of the benefit of using Thumb-2... Using normal unwinding means your code runs at full speed and still can be used by the sanitizer. > In other case clang works the same way, as I offered at the patch. > It has the same issue, but it was fixed at the end of 2017 > https://bugs.llvm.org/show_bug.cgi?id=18505 (The topics starts from > discussion about APCS, but it is not the main point.) > > Also, unresolved issue related to this > https://github.com/google/sanitizers/issues/640 Adding support for a frame chain would require an ABI change. It would have to work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of effort. Wilco
[PING][PATCH] Frame pointer for arm with THUMB2 mode
Hello everyone, can someone, please, take a look on this patch https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01656.html Thanks.
Re: [PATCH] Frame pointer for arm with THUMB2 mode
Hi, thanks for the answer. > Switching on the frame pointer typically costs 1-2% performance, so it's a bad > idea to use it. However changing the frame pointer like in the proposed patch > will have a much larger cost - both in performance and codesize. You'd be > lucky if it is less than 10%. This is due to placing the frame pointer at the top > rather than the bottom of the frame, and that is very inefficient in Thumb-2. > > You would need to unwind ~100k times a second before you might see a > performance gain. However you pay the performance cost all the time, even > when no unwinding is required. So I don't see this as being a good idea. > > If unwind performance is an issue, it would make far more sense to solve that. > Profiling typically hits the same functions again and again. Callgraph profiling to > a fixed depth hits the same functions all the time. So caching is the obvious > solution. We are working on applying Address/LeakSanitizer for the full Tizen OS distribution. It's about ~1000 packages, ASan/LSan runtime is installed to ld.so.preload. As we know ASan/LSan has interceptors for allocators/deallocators such as (malloc/realloc/calloc/free) and so on. On every allocation from user space program, ASan calls GET_STACK_TRACE_MALLOC; which unwinds the stack frame, and by default uses frame based stack unwinder. So, it requires to build with "-fno-omit-frame-pointer", switching it to default unwinder really hits the performance in our case. > Doing real unwinding is also far more accurate than frame pointer based > unwinding (the latter doesn't handle leaf functions correctly, entry/exit in > non-leaf functions and shrinkwrapped functions - and this breaks callgraph > profiling). I agree, but in our case, all interceptors for allocators are leaf functions, so the frame based stack unwinder works well for us. > So my question is, is there any point in making code run significantly slower > all the time and in return get inaccurate unwinding? By default we build packages with ("-marm" "-fno-omit-frame-pointer"), because need frame based stack unwinder for every allocation, as I said before. As we know GCC sets fp to lr on the stack with ("-fno-omit-frame-pointer" and "-marm") and I don't really know why. But the binary size is bigger than for thumb, so, we cannot use default thumb frame pointer and want to reduce binary size for the full sanitized image. In other case clang works the same way, as I offered at the patch. It has the same issue, but it was fixed at the end of 2017 https://bugs.llvm.org/show_bug.cgi?id=18505 (The topics starts from discussion about APCS, but it is not the main point.) Also, unresolved issue related to this https://github.com/google/sanitizers/issues/640 Thanks. On 08/28/2018 03:48 AM, Wilco Dijkstra wrote: Hi, But we still have an issue with performance, when we are using default unwinder, which uses unwind tables. It could be up to 10 times faster to use frame based stack unwinder instead "default unwinder". Switching on the frame pointer typically costs 1-2% performance, so it's a bad idea to use it. However changing the frame pointer like in the proposed patch will have a much larger cost - both in performance and codesize. You'd be lucky if it is less than 10%. This is due to placing the frame pointer at the top rather than the bottom of the frame, and that is very inefficient in Thumb-2. You would need to unwind ~100k times a second before you might see a performance gain. However you pay the performance cost all the time, even when no unwinding is required. So I don't see this as being a good idea. If unwind performance is an issue, it would make far more sense to solve that. Profiling typically hits the same functions again and again. Callgraph profiling to a fixed depth hits the same functions all the time. So caching is the obvious solution. Doing real unwinding is also far more accurate than frame pointer based unwinding (the latter doesn't handle leaf functions correctly, entry/exit in non-leaf functions and shrinkwrapped functions - and this breaks callgraph profiling). So my question is, is there any point in making code run significantly slower all the time and in return get inaccurate unwinding? Cheers, Wilco
Re: [PATCH] Frame pointer for arm with THUMB2 mode
Hi, > But we still have an issue with performance, when we are using default > unwinder, which uses unwind tables. It could be up to 10 times faster to > use frame based stack unwinder instead "default unwinder". Switching on the frame pointer typically costs 1-2% performance, so it's a bad idea to use it. However changing the frame pointer like in the proposed patch will have a much larger cost - both in performance and codesize. You'd be lucky if it is less than 10%. This is due to placing the frame pointer at the top rather than the bottom of the frame, and that is very inefficient in Thumb-2. You would need to unwind ~100k times a second before you might see a performance gain. However you pay the performance cost all the time, even when no unwinding is required. So I don't see this as being a good idea. If unwind performance is an issue, it would make far more sense to solve that. Profiling typically hits the same functions again and again. Callgraph profiling to a fixed depth hits the same functions all the time. So caching is the obvious solution. Doing real unwinding is also far more accurate than frame pointer based unwinding (the latter doesn't handle leaf functions correctly, entry/exit in non-leaf functions and shrinkwrapped functions - and this breaks callgraph profiling). So my question is, is there any point in making code run significantly slower all the time and in return get inaccurate unwinding? Cheers, Wilco
[PATCH] Frame pointer for arm with THUMB2 mode
Hello everyone, I've created the patch which sets the frame pointer to the predictable location in the stack frame for arm with THUMB2 and non-leaf functions. Issue: At this moment GCC emits frame layout for arm with THUMB2 mode, "-fno-omit-frame-pointer" and AAPCS, and does not set frame pointer to predictable location at the function frame for non-leaf functions. And this is right thing to do regarding to answer on my question https://gcc.gnu.org/ml/gcc-help/2018-07/msg00093.html. But we still have an issue with performance, when we are using default unwinder, which uses unwind tables. It could be up to 10 times faster to use frame based stack unwinder instead "default unwinder". As we know, the frame based stack unwinder works good for arm with ARM32 mode ("-marm" compile-time option), because fp points to the lr on the stack, but the binary size could be about 30%-40% larger than with THUMB2 mode ("-mthumb" compile time option). So, I think it could be good to have an ability to use frame based stack unwinder for thumb mode too. In this case AddressSanitizer, which uses frame based stack unwinder, by default for every allocation/deallocation, could benefit too. Unresolved issue related to AddressSanitizer/LeakSanitzer: https://github.com/google/sanitizers/issues/640 Problems: By default arm with THUMB2 mode uses r7 register as frame pointer register and it could be hard to set frame pointer to the predictable location in the frame for some functions and some level of optimization. For example interceptor for malloc in libasan built with ("-mthumb -fno-omit-frame-pointer -O2"): 000aa118 <__interceptor_malloc>: stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr} subwsp, sp, #1076 add r7, sp, #16 ... function body ... addwr7, r7, #1060 mov sp, r7 ldmia.w sp!, {r4, r5, r6, r7, r8, r9, r10, r11, pc} r7 points to sp + 16 after allocating 1076 bytes on the stack and it's impossible to find the lr at the runtime. In other way registers should be pushed in ascending order and we cannot change function prologue to this: push {r4, r5, r6, r8, r9, sl, r11, r7, lr } And seems to me, the right solution is to make multiple push and multiple pop in the same order (clang works that way): push {r4, r5, r6, r7, lr} push {r8, r9, sl, r11 } add r7, sp, #32 subw sp, sp #1076 ... function body ... subw r7, #32 mov sp, r7 pop {r8, r9, sl, r11 } pop {r4, r5, r6, r7, pc } So the r7 points to lr on the stack after function prologue and we can find previous frame pointer by subw r7, r7, #4 ldr r7, [r7] at the runtime. Verification steps: 1. I've build full Tizen OS distribution which consists of about ~1000 packages with that patch "-mthumb -fno-omit-frame-pointer -mthumb-fp -O2" and AddressSanitizer. 2. I've built the image from those packages and verified it on real device with armv7 ISA, and it seems works. Also stack based frame unwinder works. 3. It's hard to write correct test case, because gcc pushes {r8... r11} registers only on some level of optimization and the source file should have some amount of function. So, it's kinda hard to find the small test case. I have many preprocessed files with size about ~1kk bytes. 4. The frame layout for malloc interceptor in libasan which was built with that patch and ("-fno-omit-frame-poiniter -O2 -mthumb -mthumb-fp") looks like this: 000bb1f4 <__interceptor_malloc>: push{r4, r5, r6, r7, lr} ldr r6, [pc, #340] ; (bb34c <__interceptor_malloc+0x158>) ldr r3, [pc, #340] ; (bb350 <__interceptor_malloc+0x15c>) add r6, pc stmdb sp!, {r8, r9, sl, fp} add r7, sp, #32 subwsp, sp, #1076 ; 0x434 ... function body ... subsr7, #32 mov sp, r7 ldmia.w sp!, {r8, r9, sl, fp} pop {r4, r5, r6, r7, pc} 5. I've faced only one build fail related to this patch - glibc package, but this is related to option name, some build machinery does not recognize the "-mthumb-fp" option, and I'm not sure that glibc could be build with frame layout like this. It would be nice to get review for this patch by some experts. I could miss a lot of corner cases. Thanks. From: Denis Khalikov Date: Thu, 23 Aug 2018 16:33:58 +0300 Subject: [PATCH] Frame pointer for arm with THUMB2 mode. Set frame pointer to the predictable location in the stack frame for arm with THUMB2 mode. 2018-08-23 Denis Khalikov * config/arm/arm.c (arm_emit_multi_reg_pop_no_return): New function. (arm_compute_initial_elimination_offset): Add support for TARGET_THUMB_STACK_UNWIND. (arm_expand_prologue): Emit function prologue related to TARGET_THUMB_STACK_UNWIND. (thumb2_expand_return): Emit function epilogue related to TARGET_THUMB_STACK_UNWIND. (arm_expand_epilogue): Emit function epilogue relat