Re: [PATCH] Frame pointer for arm with THUMB2 mode

2018-09-05 Thread Denis Khalikov

Richard,
thanks for the answer, got it.

On 09/05/2018 07:51 PM, Richard Earnshaw (lists) wrote:

On 05/09/18 17:43, Denis Khalikov wrote:

Thanks for the answers.

I understood that, this hack makes more mess in codegen for arm,
but can you please clarify what did you mean by


Only a single register can be used
as the frame chain.


As far as I understood, GCC for arm with THUMB2 mode uses r7 register as
frame pointer register by default (with -fno-omit-frame-pointer flag),
GCC for arm with ARM mode uses r11, so I didn't really propose
to change the frame pointer register.



On entry to a function the code has to save the existing frame register.
 It doesn't know (can't trivially know) whether the caller is code
compiled in Arm state or Thumb state.  So how can it save the caller's
frame register if they are not the same?

Furthermore, the 'other' frame register (ie r7 in Arm state, r11 in
Thumb) is available as a call-saved register, so can contain any random
value.  If you try to use that random value during a frame chain walk
your program will most like take an access violation.  It will certainly
give you a garbage frame chain.

R.


Thanks.

On 09/05/2018 06:32 PM, Wilco Dijkstra wrote:

Hi Denis,


Adding support for a frame chain would require an ABI change. It

would have to
 > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial
amount of
 > effort.



Clang already works that way.


No, that's incorrect like Richard pointed out. Only a single register
can be used
as the frame chain.


If we change the size of the traces to 2, it could be something like
this:

...

At the first example we lost the full context, from where the
control/data flow comes from.


If 2 is not sufficient, then try 3 or 4. It may also be feasible to
only enable
deeper unwinding for particular libraries so you only pay an extra
cost for
leaks you are interested in.


The stack layout like this enables only with compile time flag
(-mthumb-fp and works only together with -mthumb and
-fno-omit-frame-pointer). It does not affect other codegen.


But any code built like that will *always* run slower even if you
don't use
the sanitizer.

Wilco







Re: [PATCH] Frame pointer for arm with THUMB2 mode

2018-09-05 Thread Richard Earnshaw (lists)
On 05/09/18 17:43, Denis Khalikov wrote:
> Thanks for the answers.
> 
> I understood that, this hack makes more mess in codegen for arm,
> but can you please clarify what did you mean by
> 
>>Only a single register can be used
>> as the frame chain.
> 
> As far as I understood, GCC for arm with THUMB2 mode uses r7 register as
> frame pointer register by default (with -fno-omit-frame-pointer flag),
> GCC for arm with ARM mode uses r11, so I didn't really propose
> to change the frame pointer register.
> 

On entry to a function the code has to save the existing frame register.
 It doesn't know (can't trivially know) whether the caller is code
compiled in Arm state or Thumb state.  So how can it save the caller's
frame register if they are not the same?

Furthermore, the 'other' frame register (ie r7 in Arm state, r11 in
Thumb) is available as a call-saved register, so can contain any random
value.  If you try to use that random value during a frame chain walk
your program will most like take an access violation.  It will certainly
give you a garbage frame chain.

R.

> Thanks.
> 
> On 09/05/2018 06:32 PM, Wilco Dijkstra wrote:
>> Hi Denis,
>>
 Adding support for a frame chain would require an ABI change. It
>>> would have to
>>>  > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial
>>> amount of
>>>  > effort.
>>
>>> Clang already works that way.
>>
>> No, that's incorrect like Richard pointed out. Only a single register
>> can be used
>> as the frame chain.
>>
>>> If we change the size of the traces to 2, it could be something like
>>> this:
>> ...
>>> At the first example we lost the full context, from where the
>>> control/data flow comes from.
>>
>> If 2 is not sufficient, then try 3 or 4. It may also be feasible to
>> only enable
>> deeper unwinding for particular libraries so you only pay an extra
>> cost for
>> leaks you are interested in.
>>
>>> The stack layout like this enables only with compile time flag
>>> (-mthumb-fp and works only together with -mthumb and
>>> -fno-omit-frame-pointer). It does not affect other codegen.
>>
>> But any code built like that will *always* run slower even if you
>> don't use
>> the sanitizer.
>>
>> Wilco
>>



Re: [PATCH] Frame pointer for arm with THUMB2 mode

2018-09-05 Thread Denis Khalikov

Thanks for the answers.

I understood that, this hack makes more mess in codegen for arm,
but can you please clarify what did you mean by

>Only a single register can be used
> as the frame chain.

As far as I understood, GCC for arm with THUMB2 mode uses r7 register as
frame pointer register by default (with -fno-omit-frame-pointer flag),
GCC for arm with ARM mode uses r11, so I didn't really propose
to change the frame pointer register.

Thanks.

On 09/05/2018 06:32 PM, Wilco Dijkstra wrote:

Hi Denis,


Adding support for a frame chain would require an ABI change. It

would have to
 > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
 > effort.



Clang already works that way.


No, that's incorrect like Richard pointed out. Only a single register can be 
used
as the frame chain.


If we change the size of the traces to 2, it could be something like this:

...

At the first example we lost the full context, from where the
control/data flow comes from.


If 2 is not sufficient, then try 3 or 4. It may also be feasible to only enable
deeper unwinding for particular libraries so you only pay an extra cost for
leaks you are interested in.


The stack layout like this enables only with compile time flag
(-mthumb-fp and works only together with -mthumb and
-fno-omit-frame-pointer). It does not affect other codegen.


But any code built like that will *always* run slower even if you don't use
the sanitizer.

Wilco



Re: [PATCH] Frame pointer for arm with THUMB2 mode

2018-09-05 Thread Wilco Dijkstra
Hi Denis,

>> Adding support for a frame chain would require an ABI change. It 
> would have to
> > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
> > effort.

> Clang already works that way.

No, that's incorrect like Richard pointed out. Only a single register can be 
used
as the frame chain.

> If we change the size of the traces to 2, it could be something like this:
...
> At the first example we lost the full context, from where the 
> control/data flow comes from.

If 2 is not sufficient, then try 3 or 4. It may also be feasible to only enable
deeper unwinding for particular libraries so you only pay an extra cost for 
leaks you are interested in.

> The stack layout like this enables only with compile time flag 
> (-mthumb-fp and works only together with -mthumb and
> -fno-omit-frame-pointer). It does not affect other codegen.

But any code built like that will *always* run slower even if you don't use
the sanitizer.

Wilco

Re: [PATCH] Frame pointer for arm with THUMB2 mode

2018-09-05 Thread Richard Earnshaw (lists)
On 05/09/18 14:55, Denis Khalikov wrote:
> Hi Wilco,
> thanks for the answer.
> 
>> Adding support for a frame chain would require an ABI change. It would
> have to
>> work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
>> effort.
> 
> Clang already works that way.


> Please look at this commit:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/ARM/ARMCallingConv.td?r1=269459&r2=269458&pathrev=269459
> 
> 
> This is an example of code which clang generates
> with -mthumb -fno-omit-frame-pointer -O2.
> 
>  @ %bb.0:
>  push    {r4, r5, r6, r7, lr}
>  add r7, sp, #12
>  push.w  {r8, r9, r10}
>  sub sp, #56
>  mov r8, r0
>  movw    r0, :lower16:_ZTVN6sensor14sensor_handlerE
>  movt    r0, :upper16:_ZTVN6sensor14sensor_handlerE
>  mov r9, r8
>  adds    r0, #8
>  str r0, [r9], #4
> 
> The only difference is clang sets frame pointer to the r7 on the stack
> instead gcc sets to lr,
> but it already handles by the Sanitizers.

Then Clang is broken.  You can't have different frame pointers in Arm
and Thumb code.

I object to another hack going in for another ill-specified frame
pointer variant until such time as the ABI is updated to sort this out
properly.  The frame handling code is just too critical to overall
performance and the prologue and epilogue code itself is also quite
fragile in this respect.  Adding more mess on top of that is going to
make sorting this out even more tricky: and once a patch like this goes
on it's not easy to remove it again.

So until the ABI sanctions a proper inter-function frame chain record,
GCC will only support local use of the frame pointer and no chaining.

R.

> 
>>
>> So this sounds like the first thing to do is reducing the size of the
> stack traces.
>> The default is 30 which is far larger than useful. Using 1 for example
> should
>> always be fast (since you can use __builtin_return_address(0)) and
> still get
>> the function that called malloc. Also if the unwinder happens to be
> too slow,
>> it should be optimized and caching added etc.
>>
> 
> If we change the size of the traces to 2, it could be something like this:
> 
> 0xb42a50a0 is located 0 bytes inside of 88-byte region
> [0xb42a50a0,0xb42a50f8)
> freed by thread T0 here:
>     #0 0xb6a35cc7 in __interceptor_free (/usr/lib/libasan.so+0x127cc7)
>     #1 0xb5fa64e3 in ipc::message::unref()
> (/lib/libsensord-shared.so+0xe4e3)
> 
> previously allocated by thread T0 here:
>     #0 0xb6a36157 in malloc (/usr/lib/libasan.so+0x128157)
>     #1 0xb2f8852d in accel_device::get_data(unsigned int,
> sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)
> 
> Instead this:
> 
> 0xb42250a0 is located 0 bytes inside of 88-byte region
> [0xb42250a0,0xb42250f8)
> freed by thread T0 here:
>     #0 0xb6989cff in __interceptor_free (/usr/lib/libasan.so+0x127cff)
>     #1 0xb5fa64e3 in ipc::message::unref()
> (/lib/libsensord-shared.so+0xe4e3)
>     #2 0xb6efbf47 in sensor::sensor_handler::notify(char const*,
> sensor_data_t*, int) (/usr/bin/sensord+0x1ef47)
>     #3 0xb6efad43 in sensor::sensor_event_handler::handle(int, unsigned
> int) (/usr/bin/sensord+0x1dd43)
>     #4 0xb5fa3bbb  (/lib/libsensord-shared.so+0x)
>     #5 0xb62d9a15 in g_main_context_dispatch
> (/lib/libglib-2.0.so.0+0x91a15)
>     #6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
>     #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
>     #8 0xb5fa4e1b in ipc::event_loop::run(int)
> (/lib/libsensord-shared.so+0xce1b)
>     #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
>     #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)
> 
> previously allocated by thread T0 here:
>     #0 0xb698a18f in malloc (/usr/lib/libasan.so+0x12818f)
>     #1 0xb2f8852d in accel_device::get_data(unsigned int,
> sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)
>     #2 0xb6ef848f in
> sensor::physical_sensor_handler::get_data(sensor_data_t**, int*)
> (/usr/bin/sensord+0x1b48f)
>     #3 0xb6efaa51 in sensor::sensor_event_handler::handle(int, unsigned
> int) (/usr/bin/sensord+0x1da51)
>     #4 0xb5fa3bbb  (/lib/libsensord-shared.so+0x)
>     #5 0xb62d9a15 in g_main_context_dispatch
> (/lib/libglib-2.0.so.0+0x91a15)
>     #6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
>     #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
>     #8 0xb5fa4e1b in ipc::event_loop::run(int)
> (/lib/libsensord-shared.so+0xce1b)
>     #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
>     #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)
> 
> 
> At the first example we lost the full context, from where the
> control/data flow comes from.
> 
>>
>> The issue is that the frame pointer and frame chain always add a large
>> overhead even when you do not use any sanitizers. This is especially bad
>> for the proposed patch - you lose much of the benefit of using Thumb-2...
>>
> 
> The stack la

Re: [PATCH] Frame pointer for arm with THUMB2 mode

2018-09-05 Thread Denis Khalikov

Hi Wilco,
thanks for the answer.

> Adding support for a frame chain would require an ABI change. It 
would have to

> work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
> effort.

Clang already works that way.
Please look at this commit:
http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/ARM/ARMCallingConv.td?r1=269459&r2=269458&pathrev=269459

This is an example of code which clang generates
with -mthumb -fno-omit-frame-pointer -O2.

 @ %bb.0:
 push{r4, r5, r6, r7, lr}
 add r7, sp, #12
 push.w  {r8, r9, r10}
 sub sp, #56
 mov r8, r0
 movwr0, :lower16:_ZTVN6sensor14sensor_handlerE
 movtr0, :upper16:_ZTVN6sensor14sensor_handlerE
 mov r9, r8
 addsr0, #8
 str r0, [r9], #4

The only difference is clang sets frame pointer to the r7 on the stack 
instead gcc sets to lr,

but it already handles by the Sanitizers.

>
> So this sounds like the first thing to do is reducing the size of the 
stack traces.
> The default is 30 which is far larger than useful. Using 1 for 
example should
> always be fast (since you can use __builtin_return_address(0)) and 
still get
> the function that called malloc. Also if the unwinder happens to be 
too slow,

> it should be optimized and caching added etc.
>

If we change the size of the traces to 2, it could be something like this:

0xb42a50a0 is located 0 bytes inside of 88-byte region 
[0xb42a50a0,0xb42a50f8)

freed by thread T0 here:
#0 0xb6a35cc7 in __interceptor_free (/usr/lib/libasan.so+0x127cc7)
#1 0xb5fa64e3 in ipc::message::unref() 
(/lib/libsensord-shared.so+0xe4e3)


previously allocated by thread T0 here:
#0 0xb6a36157 in malloc (/usr/lib/libasan.so+0x128157)
#1 0xb2f8852d in accel_device::get_data(unsigned int, 
sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)


Instead this:

0xb42250a0 is located 0 bytes inside of 88-byte region 
[0xb42250a0,0xb42250f8)

freed by thread T0 here:
#0 0xb6989cff in __interceptor_free (/usr/lib/libasan.so+0x127cff)
#1 0xb5fa64e3 in ipc::message::unref() 
(/lib/libsensord-shared.so+0xe4e3)
#2 0xb6efbf47 in sensor::sensor_handler::notify(char const*, 
sensor_data_t*, int) (/usr/bin/sensord+0x1ef47)
#3 0xb6efad43 in sensor::sensor_event_handler::handle(int, unsigned 
int) (/usr/bin/sensord+0x1dd43)

#4 0xb5fa3bbb  (/lib/libsensord-shared.so+0x)
#5 0xb62d9a15 in g_main_context_dispatch 
(/lib/libglib-2.0.so.0+0x91a15)

#6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
#7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
#8 0xb5fa4e1b in ipc::event_loop::run(int) 
(/lib/libsensord-shared.so+0xce1b)

#9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
#10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)

previously allocated by thread T0 here:
#0 0xb698a18f in malloc (/usr/lib/libasan.so+0x12818f)
#1 0xb2f8852d in accel_device::get_data(unsigned int, 
sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)
#2 0xb6ef848f in 
sensor::physical_sensor_handler::get_data(sensor_data_t**, int*) 
(/usr/bin/sensord+0x1b48f)
#3 0xb6efaa51 in sensor::sensor_event_handler::handle(int, unsigned 
int) (/usr/bin/sensord+0x1da51)

#4 0xb5fa3bbb  (/lib/libsensord-shared.so+0x)
#5 0xb62d9a15 in g_main_context_dispatch 
(/lib/libglib-2.0.so.0+0x91a15)

#6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
#7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
#8 0xb5fa4e1b in ipc::event_loop::run(int) 
(/lib/libsensord-shared.so+0xce1b)

#9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
#10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)


At the first example we lost the full context, from where the 
control/data flow comes from.


>
> The issue is that the frame pointer and frame chain always add a large
> overhead even when you do not use any sanitizers. This is especially bad
> for the proposed patch - you lose much of the benefit of using Thumb-2...
>

The stack layout like this enables only with compile time flag 
(-mthumb-fp and works only together with -mthumb and

-fno-omit-frame-pointer). It does not affect other codegen.

Thanks.

On 09/05/2018 03:11 PM, Wilco Dijkstra wrote:

Hi Denis,


We are working on applying Address/LeakSanitizer for the full Tizen OS
distribution. It's about ~1000 packages, ASan/LSan runtime is installed
to ld.so.preload. As we know ASan/LSan has interceptors for
allocators/deallocators such as (malloc/realloc/calloc/free) and so on.
On every allocation from user space program, ASan calls
GET_STACK_TRACE_MALLOC;
which unwinds the stack frame, and by default uses frame based stack
unwinder. So, it requires to build with "-fno-omit-frame-pointer",
switching it to default unwinder really hits the performance in our case.


So this sounds like the first thing to do is reducing the size of the stack 
traces.
The default

Re: [PATCH] Frame pointer for arm with THUMB2 mode

2018-09-05 Thread Denis Khalikov

Hi Wilco,
thanks for the answer.

> Adding support for a frame chain would require an ABI change. It 
would have to

> work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
> effort.

Clang already works that way.
Please look at this commit:
http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/ARM/ARMCallingConv.td?r1=269459&r2=269458&pathrev=269459


This is an example of code which clang generates
with -mthumb -fno-omit-frame-pointer -O2.

 @ %bb.0:
 push{r4, r5, r6, r7, lr}
 add r7, sp, #12
 push.w  {r8, r9, r10}
 sub sp, #56
 mov r8, r0
 movwr0, :lower16:_ZTVN6sensor14sensor_handlerE
 movtr0, :upper16:_ZTVN6sensor14sensor_handlerE
 mov r9, r8
 addsr0, #8
 str r0, [r9], #4

The only difference is clang sets frame pointer to the r7 on the stack 
instead gcc sets to lr, but it already handles by the Sanitizers.


>
> So this sounds like the first thing to do is reducing the size of the 
stack traces.
> The default is 30 which is far larger than useful. Using 1 for 
example should
> always be fast (since you can use __builtin_return_address(0)) and 
still get
> the function that called malloc. Also if the unwinder happens to be 
too slow,

> it should be optimized and caching added etc.
>
If we change the size of the traces to 2, it could be something like this:

0xb42a50a0 is located 0 bytes inside of 88-byte region 
[0xb42a50a0,0xb42a50f8)

freed by thread T0 here:
#0 0xb6a35cc7 in __interceptor_free (/usr/lib/libasan.so+0x127cc7)
#1 0xb5fa64e3 in ipc::message::unref() 
(/lib/libsensord-shared.so+0xe4e3)


previously allocated by thread T0 here:
#0 0xb6a36157 in malloc (/usr/lib/libasan.so+0x128157)
#1 0xb2f8852d in accel_device::get_data(unsigned int, 
sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)


Instead this:

0xb42250a0 is located 0 bytes inside of 88-byte region 
[0xb42250a0,0xb42250f8)

freed by thread T0 here:
#0 0xb6989cff in __interceptor_free (/usr/lib/libasan.so+0x127cff)
#1 0xb5fa64e3 in ipc::message::unref() 
(/lib/libsensord-shared.so+0xe4e3)
#2 0xb6efbf47 in sensor::sensor_handler::notify(char const*, 
sensor_data_t*, int) (/usr/bin/sensord+0x1ef47)
#3 0xb6efad43 in sensor::sensor_event_handler::handle(int, unsigned 
int) (/usr/bin/sensord+0x1dd43)

#4 0xb5fa3bbb  (/lib/libsensord-shared.so+0x)
#5 0xb62d9a15 in g_main_context_dispatch 
(/lib/libglib-2.0.so.0+0x91a15)

#6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
#7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
#8 0xb5fa4e1b in ipc::event_loop::run(int) 
(/lib/libsensord-shared.so+0xce1b)

#9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
#10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)

previously allocated by thread T0 here:
#0 0xb698a18f in malloc (/usr/lib/libasan.so+0x12818f)
#1 0xb2f8852d in accel_device::get_data(unsigned int, 
sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)
#2 0xb6ef848f in 
sensor::physical_sensor_handler::get_data(sensor_data_t**, int*) 
(/usr/bin/sensord+0x1b48f)
#3 0xb6efaa51 in sensor::sensor_event_handler::handle(int, unsigned 
int) (/usr/bin/sensord+0x1da51)

#4 0xb5fa3bbb  (/lib/libsensord-shared.so+0x)
#5 0xb62d9a15 in g_main_context_dispatch 
(/lib/libglib-2.0.so.0+0x91a15)

#6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
#7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
#8 0xb5fa4e1b in ipc::event_loop::run(int) 
(/lib/libsensord-shared.so+0xce1b)

#9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
#10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)


At the first example we lost the full context, from where the 
control/data flow comes from.


>
> The issue is that the frame pointer and frame chain always add a large
> overhead even when you do not use any sanitizers. This is especially bad
> for the proposed patch - you lose much of the benefit of using Thumb-2...
>

The stack layout like this enables only with compile time flag 
(-mthumb-fp and works only together with -mthumb and

-fno-omit-frame-pointer). It does not affect other codegen.


Thanks.


On 09/05/2018 03:11 PM, Wilco Dijkstra wrote:

Hi Denis,


We are working on applying Address/LeakSanitizer for the full Tizen OS
distribution. It's about ~1000 packages, ASan/LSan runtime is installed
to ld.so.preload. As we know ASan/LSan has interceptors for
allocators/deallocators such as (malloc/realloc/calloc/free) and so on.
On every allocation from user space program, ASan calls
GET_STACK_TRACE_MALLOC;
which unwinds the stack frame, and by default uses frame based stack
unwinder. So, it requires to build with "-fno-omit-frame-pointer",
switching it to default unwinder really hits the performance in our case.


So this sounds like the first thing to do is reducing the size of the stack 
traces.
The defau

Re: [PATCH] Frame pointer for arm with THUMB2 mode

2018-09-05 Thread Wilco Dijkstra
Hi Denis,

> We are working on applying Address/LeakSanitizer for the full Tizen OS
> distribution. It's about ~1000 packages, ASan/LSan runtime is installed 
> to ld.so.preload. As we know ASan/LSan has interceptors for 
> allocators/deallocators such as (malloc/realloc/calloc/free) and so on.
> On every allocation from user space program, ASan calls
> GET_STACK_TRACE_MALLOC;
> which unwinds the stack frame, and by default uses frame based stack
> unwinder. So, it requires to build with "-fno-omit-frame-pointer", 
> switching it to default unwinder really hits the performance in our case.

So this sounds like the first thing to do is reducing the size of the stack 
traces.
The default is 30 which is far larger than useful. Using 1 for example should
always be fast (since you can use __builtin_return_address(0)) and still get
the function that called malloc. Also if the unwinder happens to be too slow,
it should be optimized and caching added etc.

>> Doing real unwinding is also far more accurate than frame pointer based
> > unwinding (the latter doesn't handle leaf functions correctly, 
> entry/exit in
> > non-leaf functions and shrinkwrapped functions - and this breaks 
> callgraph
> > profiling).
>
> I agree, but in our case, all interceptors for allocators are
> leaf functions, so the frame based stack unwinder works well for us.

Yes a frame chain would work for this case. But it's not currently supported.

> By default we build packages with ("-marm" "-fno-omit-frame-pointer"),
> because need frame based stack unwinder for every allocation, as I said
> before. As we know GCC sets fp to lr on the stack with 
> ("-fno-omit-frame-pointer" and "-marm") and I don't really know why.
> But the binary size is bigger than for thumb, so, we cannot use default 
> thumb frame pointer and want to reduce binary size for the full 
> sanitized image.

The issue is that the frame pointer and frame chain always add a large
overhead even when you do not use any sanitizers. This is especially bad
for the proposed patch - you lose much of the benefit of using Thumb-2...

Using normal unwinding means your code runs at full speed and still can be
used by the sanitizer.

> In other case clang works the same way, as I offered at the patch.
> It has the same issue, but it was fixed at the end of 2017
> https://bugs.llvm.org/show_bug.cgi?id=18505 (The topics starts from
> discussion about APCS, but it is not the main point.)
>
> Also, unresolved issue related to this
> https://github.com/google/sanitizers/issues/640

Adding support for a frame chain would require an ABI change. It would have to 
work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
effort.

Wilco


[PING][PATCH] Frame pointer for arm with THUMB2 mode

2018-09-05 Thread Denis Khalikov
Hello everyone,
can someone, please, take a look on this patch
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01656.html

Thanks.


Re: [PATCH] Frame pointer for arm with THUMB2 mode

2018-08-28 Thread Denis Khalikov

Hi,
thanks for the answer.

> Switching on the frame pointer typically costs 1-2% performance, so 
it's a bad
> idea to use it. However changing the frame pointer like in the 
proposed patch
> will have a much larger cost - both in performance and codesize. 
You'd be
> lucky if it is less than 10%. This is due to placing the frame 
pointer at the top
> rather than the bottom of the frame, and that is very inefficient in 
Thumb-2.

>
> You would need to unwind ~100k times a second before you might see a
> performance gain. However you pay the performance cost all the time, even
> when no unwinding is required. So I don't see this as being a good idea.
>
> If unwind performance is an issue, it would make far more sense to 
solve that.
> Profiling typically hits the same functions again and again. 
Callgraph profiling to
> a fixed depth hits the same functions all the time. So caching is the 
obvious

> solution.

We are working on applying Address/LeakSanitizer for the full Tizen OS
distribution. It's about ~1000 packages, ASan/LSan runtime is installed 
to ld.so.preload. As we know ASan/LSan has interceptors for 
allocators/deallocators such as (malloc/realloc/calloc/free) and so on.

On every allocation from user space program, ASan calls
GET_STACK_TRACE_MALLOC;
which unwinds the stack frame, and by default uses frame based stack
unwinder. So, it requires to build with "-fno-omit-frame-pointer", 
switching it to default unwinder really hits the performance in our case.


> Doing real unwinding is also far more accurate than frame pointer based
> unwinding (the latter doesn't handle leaf functions correctly, 
entry/exit in
> non-leaf functions and shrinkwrapped functions - and this breaks 
callgraph

> profiling).

I agree, but in our case, all interceptors for allocators are
leaf functions, so the frame based stack unwinder works well for us.

> So my question is, is there any point in making code run 
significantly slower

> all the time and in return get inaccurate unwinding?

By default we build packages with ("-marm" "-fno-omit-frame-pointer"),
because need frame based stack unwinder for every allocation, as I said
before. As we know GCC sets fp to lr on the stack with 
("-fno-omit-frame-pointer" and "-marm") and I don't really know why.
But the binary size is bigger than for thumb, so, we cannot use default 
thumb frame pointer and want to reduce binary size for the full 
sanitized image.


In other case clang works the same way, as I offered at the patch.
It has the same issue, but it was fixed at the end of 2017
https://bugs.llvm.org/show_bug.cgi?id=18505 (The topics starts from
discussion about APCS, but it is not the main point.)

Also, unresolved issue related to this
https://github.com/google/sanitizers/issues/640

Thanks.


On 08/28/2018 03:48 AM, Wilco Dijkstra wrote:

Hi,


But we still have an issue with performance, when we are using default
unwinder, which uses unwind tables. It could be up to 10 times faster to
use frame based stack unwinder instead "default unwinder".


Switching on the frame pointer typically costs 1-2% performance, so it's a bad
idea to use it. However changing the frame pointer like in the proposed patch
will have a much larger cost - both in performance and codesize. You'd be
lucky if it is less than 10%. This is due to placing the frame pointer at the 
top
rather than the bottom of the frame, and that is very inefficient in Thumb-2.

You would need to unwind ~100k times a second before you might see a
performance gain. However you pay the performance cost all the time, even
when no unwinding is required. So I don't see this as being a good idea.

If unwind performance is an issue, it would make far more sense to solve that.
Profiling typically hits the same functions again and again. Callgraph 
profiling to
a fixed depth hits the same functions all the time. So caching is the obvious
solution.

Doing real unwinding is also far more accurate than frame pointer based
unwinding (the latter doesn't handle leaf functions correctly, entry/exit in
non-leaf functions and shrinkwrapped functions - and this breaks callgraph
profiling).

So my question is, is there any point in making code run significantly slower
all the time and in return get inaccurate unwinding?

Cheers,
Wilco



Re: [PATCH] Frame pointer for arm with THUMB2 mode

2018-08-27 Thread Wilco Dijkstra
Hi,

> But we still have an issue with performance, when we are using default
> unwinder, which uses unwind tables. It could be up to 10 times faster to 
> use frame based stack unwinder instead "default unwinder".

Switching on the frame pointer typically costs 1-2% performance, so it's a bad
idea to use it. However changing the frame pointer like in the proposed patch
will have a much larger cost - both in performance and codesize. You'd be 
lucky if it is less than 10%. This is due to placing the frame pointer at the 
top
rather than the bottom of the frame, and that is very inefficient in Thumb-2.

You would need to unwind ~100k times a second before you might see a 
performance gain. However you pay the performance cost all the time, even
when no unwinding is required. So I don't see this as being a good idea.

If unwind performance is an issue, it would make far more sense to solve that.
Profiling typically hits the same functions again and again. Callgraph 
profiling to
a fixed depth hits the same functions all the time. So caching is the obvious
solution.

Doing real unwinding is also far more accurate than frame pointer based
unwinding (the latter doesn't handle leaf functions correctly, entry/exit in
non-leaf functions and shrinkwrapped functions - and this breaks callgraph
profiling).

So my question is, is there any point in making code run significantly slower
all the time and in return get inaccurate unwinding?

Cheers,
Wilco

[PATCH] Frame pointer for arm with THUMB2 mode

2018-08-27 Thread Denis Khalikov
Hello everyone,
I've created the patch which sets the frame pointer to the predictable 
location in the stack frame for arm with THUMB2 and non-leaf functions.

Issue:
At this moment GCC emits frame layout for arm with THUMB2 mode,
"-fno-omit-frame-pointer" and AAPCS, and does not set frame pointer to
predictable location at the function frame for non-leaf functions. And
this is right thing to do regarding to answer on my
question https://gcc.gnu.org/ml/gcc-help/2018-07/msg00093.html.
But we still have an issue with performance, when we are using default
unwinder, which uses unwind tables. It could be up to 10 times faster to 
use frame based stack unwinder instead "default unwinder". As we know, 
the frame based stack unwinder works good for arm with ARM32 mode 
("-marm" compile-time option), because fp points to the lr on the stack, 
but the binary size could be about 30%-40% larger than with THUMB2 mode 
("-mthumb" compile time option). So, I think it
could be good to have an ability to use frame based stack unwinder for
thumb mode too. In this case AddressSanitizer, which uses frame based
stack unwinder, by default for every allocation/deallocation, could
benefit too.

Unresolved issue related to AddressSanitizer/LeakSanitzer:
https://github.com/google/sanitizers/issues/640

Problems:
By default arm with THUMB2 mode uses r7 register as frame pointer
register and it could be hard to set frame pointer to the predictable
location in the frame for some functions and some level of optimization.
For example interceptor for malloc in libasan built with
("-mthumb -fno-omit-frame-pointer -O2"):

000aa118 <__interceptor_malloc>:
stmdb   sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr}
subwsp, sp, #1076
add r7, sp, #16
...
function body
...
addwr7, r7, #1060
mov sp, r7
ldmia.w sp!, {r4, r5, r6, r7, r8, r9, r10, r11, pc}

r7 points to sp + 16 after allocating 1076 bytes on the stack and
it's impossible to find the lr at the runtime.

In other way registers should be pushed in ascending order and we cannot 
change function prologue to this:

   push {r4, r5, r6, r8, r9, sl, r11, r7, lr }

And seems to me, the right solution is to make multiple push and 
multiple pop in the same order (clang works that way):

   push {r4, r5, r6, r7, lr}
   push {r8, r9, sl, r11 }
   add r7, sp, #32
   subw sp, sp #1076
   ...
   function body
   ...
   subw r7, #32
   mov sp, r7
   pop {r8, r9, sl, r11 }
   pop {r4, r5, r6, r7, pc }

So the r7 points to lr on the stack after function prologue and we can
find previous frame pointer by

subw r7, r7, #4
ldr r7, [r7]

at the runtime.

Verification steps:
1. I've build full Tizen OS distribution which consists of about ~1000
packages with that patch "-mthumb -fno-omit-frame-pointer -mthumb-fp 
-O2" and AddressSanitizer.
2. I've built the image from those packages and verified it on real
device with armv7 ISA, and it seems works. Also stack based frame
unwinder works.
3. It's hard to write correct test case, because gcc pushes {r8... r11}
registers only on some level of optimization and the source file should
have some amount of function. So, it's kinda hard to find the small test 
case. I have many preprocessed files with size about ~1kk bytes.
4. The frame layout for malloc interceptor in libasan which was built
with that patch and ("-fno-omit-frame-poiniter -O2 -mthumb -mthumb-fp")
looks like this:

000bb1f4 <__interceptor_malloc>:
   push{r4, r5, r6, r7, lr}
   ldr r6, [pc, #340]  ; (bb34c <__interceptor_malloc+0x158>)
   ldr r3, [pc, #340]  ; (bb350 <__interceptor_malloc+0x15c>)
   add r6, pc
   stmdb   sp!, {r8, r9, sl, fp}
   add r7, sp, #32
   subwsp, sp, #1076   ; 0x434
   ...
   function body
   ...
   subsr7, #32
   mov sp, r7
   ldmia.w sp!, {r8, r9, sl, fp}
   pop {r4, r5, r6, r7, pc}

5. I've faced only one build fail related to this patch - glibc package,
but this is related to option name, some build machinery does not
recognize the "-mthumb-fp" option, and I'm not sure that glibc could be
build with frame layout like this.

It would be nice to get review for this patch by some experts. I could
miss a lot of corner cases.
Thanks.
From: Denis Khalikov 
Date: Thu, 23 Aug 2018 16:33:58 +0300
Subject: [PATCH] Frame pointer for arm with THUMB2 mode.

Set frame pointer to the predictable location in the stack frame
for arm with THUMB2 mode.

2018-08-23  Denis Khalikov  

  * config/arm/arm.c (arm_emit_multi_reg_pop_no_return): New function.
  (arm_compute_initial_elimination_offset): Add support for
  TARGET_THUMB_STACK_UNWIND.
  (arm_expand_prologue): Emit function prologue related to
  TARGET_THUMB_STACK_UNWIND.
  (thumb2_expand_return): Emit function epilogue related to
  TARGET_THUMB_STACK_UNWIND.
  (arm_expand_epilogue): Emit function epilogue relat