[PATCH 4.9 101/310] perf report: Fix off-by-one for non-activation frames
4.9-stable review patch. If anyone has any objections, please let me know. -- From: Milian Wolff[ Upstream commit 1982ad48fc82c284a5cc55697a012d3357e84d01 ] As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Similarly it works for signal frames: ~~~ __noinline void bar(void) { volatile long cnt = 0; for (cnt = 0; cnt < 1; cnt++); } __noinline void foo(void) { bar(); } void sig_handler(int sig) { foo(); } int main(void) { signal(SIGUSR1, sig_handler); raise(SIGUSR1); foo(); return 0; } Before, the report wrongly points to `signal.c:29` after raise(): $ perf report --stdio --no-children -g srcline -s srcline ... 100.00% signal.c:11 | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:29 __libc_start_main _start With this patch in, the issue is fixed and we instead get: 100.00% signal signal[.] bar | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:27 __libc_start_main _start Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc(). For libunwind, we replace the functionality via unw_is_signal_frame() for any but the very first frame. Signed-off-by:
[PATCH 4.9 101/310] perf report: Fix off-by-one for non-activation frames
4.9-stable review patch. If anyone has any objections, please let me know. -- From: Milian Wolff [ Upstream commit 1982ad48fc82c284a5cc55697a012d3357e84d01 ] As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Similarly it works for signal frames: ~~~ __noinline void bar(void) { volatile long cnt = 0; for (cnt = 0; cnt < 1; cnt++); } __noinline void foo(void) { bar(); } void sig_handler(int sig) { foo(); } int main(void) { signal(SIGUSR1, sig_handler); raise(SIGUSR1); foo(); return 0; } Before, the report wrongly points to `signal.c:29` after raise(): $ perf report --stdio --no-children -g srcline -s srcline ... 100.00% signal.c:11 | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:29 __libc_start_main _start With this patch in, the issue is fixed and we instead get: 100.00% signal signal[.] bar | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:27 __libc_start_main _start Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc(). For libunwind, we replace the functionality via unw_is_signal_frame() for any but the very first frame. Signed-off-by: Milian Wolff Signed-off-by: Namhyung Kim Cc: Arnaldo Carvalho de Melo Cc:
[PATCH AUTOSEL for 4.9 102/281] perf report: Fix off-by-one for non-activation frames
From: Milian Wolff[ Upstream commit 1982ad48fc82c284a5cc55697a012d3357e84d01 ] As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Similarly it works for signal frames: ~~~ __noinline void bar(void) { volatile long cnt = 0; for (cnt = 0; cnt < 1; cnt++); } __noinline void foo(void) { bar(); } void sig_handler(int sig) { foo(); } int main(void) { signal(SIGUSR1, sig_handler); raise(SIGUSR1); foo(); return 0; } Before, the report wrongly points to `signal.c:29` after raise(): $ perf report --stdio --no-children -g srcline -s srcline ... 100.00% signal.c:11 | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:29 __libc_start_main _start With this patch in, the issue is fixed and we instead get: 100.00% signal signal[.] bar | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:27 __libc_start_main _start Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc(). For libunwind, we replace the functionality via unw_is_signal_frame() for any but the very first frame. Signed-off-by: Milian Wolff Signed-off-by: Namhyung Kim Cc: Arnaldo
[PATCH AUTOSEL for 4.9 102/281] perf report: Fix off-by-one for non-activation frames
From: Milian Wolff [ Upstream commit 1982ad48fc82c284a5cc55697a012d3357e84d01 ] As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Similarly it works for signal frames: ~~~ __noinline void bar(void) { volatile long cnt = 0; for (cnt = 0; cnt < 1; cnt++); } __noinline void foo(void) { bar(); } void sig_handler(int sig) { foo(); } int main(void) { signal(SIGUSR1, sig_handler); raise(SIGUSR1); foo(); return 0; } Before, the report wrongly points to `signal.c:29` after raise(): $ perf report --stdio --no-children -g srcline -s srcline ... 100.00% signal.c:11 | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:29 __libc_start_main _start With this patch in, the issue is fixed and we instead get: 100.00% signal signal[.] bar | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:27 __libc_start_main _start Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc(). For libunwind, we replace the functionality via unw_is_signal_frame() for any but the very first frame. Signed-off-by: Milian Wolff Signed-off-by: Namhyung Kim Cc: Arnaldo Carvalho de Melo Cc: Arnaldo Carvalho de Melo Cc: David Ahern Cc: Jiri Olsa Cc: Jiri Olsa Cc: Linus Torvalds Cc:
Re: perf report: fix off-by-one for non-activation frames
On Monday, June 19, 2017 8:59:39 PM CEST Arnaldo Carvalho de Melo wrote: > Em Sat, Jun 17, 2017 at 01:13:11PM +0200, Milian Wolff escreveu: > > On Samstag, 17. Juni 2017 10:04:02 CEST Jan Kratochvil wrote: > > > On Sat, 17 Jun 2017 09:56:57 +0200, Namhyung Kim wrote: > > > > Not sure whether it needs be fixed or not. If we fix it, srcline and > > > > address would not match so it can give its own confusion to users. > > > > Ideally it should display an addressof the instruction before the > > > > address IMHO. > > > > > > One can figure million ways how it can behave and each one has its pros > > > and > > > cons. I was just describing the current behavior of GDB and LLDB which > > > people are used to already. > > > > Personally, I agree with Jan that we should mimick existing tool's > > behavior. I just fear that it's not trivial to do it with the current > > code base... > But we agree it is a worthwhile change (have backtraces in perf match > what gdb, etc show), right? > > If you can, please try to do this, your attempt will help us understand > more the extent of the changes needed and perhaps someonw can come up > with simplifications... Sorry for the (long) delay, but I'm sadly busy on other things right now. I have this on my radar and will try to find time to look into it. But it probably won't happen before end of July. If this is urgent, maybe someone else needs to look into it before me. Cheers -- Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
Re: perf report: fix off-by-one for non-activation frames
On Monday, June 19, 2017 8:59:39 PM CEST Arnaldo Carvalho de Melo wrote: > Em Sat, Jun 17, 2017 at 01:13:11PM +0200, Milian Wolff escreveu: > > On Samstag, 17. Juni 2017 10:04:02 CEST Jan Kratochvil wrote: > > > On Sat, 17 Jun 2017 09:56:57 +0200, Namhyung Kim wrote: > > > > Not sure whether it needs be fixed or not. If we fix it, srcline and > > > > address would not match so it can give its own confusion to users. > > > > Ideally it should display an addressof the instruction before the > > > > address IMHO. > > > > > > One can figure million ways how it can behave and each one has its pros > > > and > > > cons. I was just describing the current behavior of GDB and LLDB which > > > people are used to already. > > > > Personally, I agree with Jan that we should mimick existing tool's > > behavior. I just fear that it's not trivial to do it with the current > > code base... > But we agree it is a worthwhile change (have backtraces in perf match > what gdb, etc show), right? > > If you can, please try to do this, your attempt will help us understand > more the extent of the changes needed and perhaps someonw can come up > with simplifications... Sorry for the (long) delay, but I'm sadly busy on other things right now. I have this on my radar and will try to find time to look into it. But it probably won't happen before end of July. If this is urgent, maybe someone else needs to look into it before me. Cheers -- Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
Re: perf report: fix off-by-one for non-activation frames
Em Sat, Jun 17, 2017 at 01:13:11PM +0200, Milian Wolff escreveu: > On Samstag, 17. Juni 2017 10:04:02 CEST Jan Kratochvil wrote: > > On Sat, 17 Jun 2017 09:56:57 +0200, Namhyung Kim wrote: > > > Not sure whether it needs be fixed or not. If we fix it, srcline and > > > address would not match so it can give its own confusion to users. > > > Ideally it should display an addressof the instruction before the > > > address IMHO. > > > > One can figure million ways how it can behave and each one has its pros and > > cons. I was just describing the current behavior of GDB and LLDB which > > people are used to already. > > Personally, I agree with Jan that we should mimick existing tool's behavior. > I > just fear that it's not trivial to do it with the current code base... But we agree it is a worthwhile change (have backtraces in perf match what gdb, etc show), right? If you can, please try to do this, your attempt will help us understand more the extent of the changes needed and perhaps someonw can come up with simplifications... - Arnaldo
Re: perf report: fix off-by-one for non-activation frames
Em Sat, Jun 17, 2017 at 01:13:11PM +0200, Milian Wolff escreveu: > On Samstag, 17. Juni 2017 10:04:02 CEST Jan Kratochvil wrote: > > On Sat, 17 Jun 2017 09:56:57 +0200, Namhyung Kim wrote: > > > Not sure whether it needs be fixed or not. If we fix it, srcline and > > > address would not match so it can give its own confusion to users. > > > Ideally it should display an addressof the instruction before the > > > address IMHO. > > > > One can figure million ways how it can behave and each one has its pros and > > cons. I was just describing the current behavior of GDB and LLDB which > > people are used to already. > > Personally, I agree with Jan that we should mimick existing tool's behavior. > I > just fear that it's not trivial to do it with the current code base... But we agree it is a worthwhile change (have backtraces in perf match what gdb, etc show), right? If you can, please try to do this, your attempt will help us understand more the extent of the changes needed and perhaps someonw can come up with simplifications... - Arnaldo
Re: perf report: fix off-by-one for non-activation frames
On Samstag, 17. Juni 2017 10:04:02 CEST Jan Kratochvil wrote: > On Sat, 17 Jun 2017 09:56:57 +0200, Namhyung Kim wrote: > > Not sure whether it needs be fixed or not. If we fix it, srcline and > > address would not match so it can give its own confusion to users. > > Ideally it should display an addressof the instruction before the > > address IMHO. > > One can figure million ways how it can behave and each one has its pros and > cons. I was just describing the current behavior of GDB and LLDB which > people are used to already. Personally, I agree with Jan that we should mimick existing tool's behavior. I just fear that it's not trivial to do it with the current code base... -- Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts
Re: perf report: fix off-by-one for non-activation frames
On Samstag, 17. Juni 2017 10:04:02 CEST Jan Kratochvil wrote: > On Sat, 17 Jun 2017 09:56:57 +0200, Namhyung Kim wrote: > > Not sure whether it needs be fixed or not. If we fix it, srcline and > > address would not match so it can give its own confusion to users. > > Ideally it should display an addressof the instruction before the > > address IMHO. > > One can figure million ways how it can behave and each one has its pros and > cons. I was just describing the current behavior of GDB and LLDB which > people are used to already. Personally, I agree with Jan that we should mimick existing tool's behavior. I just fear that it's not trivial to do it with the current code base... -- Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts
Re: perf report: fix off-by-one for non-activation frames
On Sat, 17 Jun 2017 09:56:57 +0200, Namhyung Kim wrote: > Not sure whether it needs be fixed or not. If we fix it, srcline and > address would not match so it can give its own confusion to users. > Ideally it should display an addressof the instruction before the > address IMHO. One can figure million ways how it can behave and each one has its pros and cons. I was just describing the current behavior of GDB and LLDB which people are used to already. Jan
Re: perf report: fix off-by-one for non-activation frames
On Sat, 17 Jun 2017 09:56:57 +0200, Namhyung Kim wrote: > Not sure whether it needs be fixed or not. If we fix it, srcline and > address would not match so it can give its own confusion to users. > Ideally it should display an addressof the instruction before the > address IMHO. One can figure million ways how it can behave and each one has its pros and cons. I was just describing the current behavior of GDB and LLDB which people are used to already. Jan
Re: perf report: fix off-by-one for non-activation frames
On Sat, Jun 17, 2017 at 4:54 AM, Milian Wolffwrote: > On Freitag, 16. Juni 2017 13:57:44 CEST Jan Kratochvil wrote: >> On Fri, 16 Jun 2017 13:51:37 +0200, Milian Wolff wrote: >> > > perf-4.12.0-0.rc5.git0.1.fc27.x86_64 >> > > >> > > 39e32e gdb_main (/usr/libexec/gdb) >> > > 10b6fa main (/usr/libexec/gdb) >> > > >> > >0x5565f6f6 <+54>:callq 0x558f17a0 >> > > :mov >> > >0x18(%rsp),%rcx >> >> [...] >> >> > Excuse me, but I'm having trouble following you. The non-GDB backtraces >> > you >> > are pasting do not show srcline information. So what exactly is broken? >> >> There is broken that perf now reports address 10b6fa (corresponding to >> relocated address 0x5565f6fa) but there is no instruction on address >> 0x5565f6fa. If you 'objdump -d' it you cannot find any instruction >> on adress 0x5565f6fa (or on address 0x10b6fa). There is >> instruction on address 0x5565f6fb. >> >> > Maybe paste the perf output you get now and highlight what you'd expect >> > instead? >> >> Actual: >> 39e32e gdb_main (/usr/libexec/gdb) >> 10b6fa main (/usr/libexec/gdb) >> Expected: >> 39e32f gdb_main (/usr/libexec/gdb) >> 10b6fb main (/usr/libexec/gdb) >> >> I agree perf needs to calculate with 39e32e and 10b6fa. But it should >> display to user 39e32f and 10b6fb. > > Hmmm this will require some more changes throughout the stack then. I.e. we'll > have to remember the "isactivation" flag along with the original IP, and only > apply the offset then when we query for inliners or srcline information. Maybe > I can pull that off somehow in the patch series I'm working on currently, > which refactors the whole inline/srcline/callchain logic anyways. > > I don't see an easy way to fix the behavior. Does anyone else? So how do we > deal with this situation in the interim? I'd prefer we keep the current > "broken" state, as I consider it less broken than what we had before... I > guess some of the core perf developers should decide how to handle this. Not sure whether it needs be fixed or not. If we fix it, srcline and address would not match so it can give its own confusion to users. Ideally it should display an addressof the instruction before the address IMHO. Thanks, Namhyung
Re: perf report: fix off-by-one for non-activation frames
On Sat, Jun 17, 2017 at 4:54 AM, Milian Wolff wrote: > On Freitag, 16. Juni 2017 13:57:44 CEST Jan Kratochvil wrote: >> On Fri, 16 Jun 2017 13:51:37 +0200, Milian Wolff wrote: >> > > perf-4.12.0-0.rc5.git0.1.fc27.x86_64 >> > > >> > > 39e32e gdb_main (/usr/libexec/gdb) >> > > 10b6fa main (/usr/libexec/gdb) >> > > >> > >0x5565f6f6 <+54>:callq 0x558f17a0 >> > >:mov >> > >0x18(%rsp),%rcx >> >> [...] >> >> > Excuse me, but I'm having trouble following you. The non-GDB backtraces >> > you >> > are pasting do not show srcline information. So what exactly is broken? >> >> There is broken that perf now reports address 10b6fa (corresponding to >> relocated address 0x5565f6fa) but there is no instruction on address >> 0x5565f6fa. If you 'objdump -d' it you cannot find any instruction >> on adress 0x5565f6fa (or on address 0x10b6fa). There is >> instruction on address 0x5565f6fb. >> >> > Maybe paste the perf output you get now and highlight what you'd expect >> > instead? >> >> Actual: >> 39e32e gdb_main (/usr/libexec/gdb) >> 10b6fa main (/usr/libexec/gdb) >> Expected: >> 39e32f gdb_main (/usr/libexec/gdb) >> 10b6fb main (/usr/libexec/gdb) >> >> I agree perf needs to calculate with 39e32e and 10b6fa. But it should >> display to user 39e32f and 10b6fb. > > Hmmm this will require some more changes throughout the stack then. I.e. we'll > have to remember the "isactivation" flag along with the original IP, and only > apply the offset then when we query for inliners or srcline information. Maybe > I can pull that off somehow in the patch series I'm working on currently, > which refactors the whole inline/srcline/callchain logic anyways. > > I don't see an easy way to fix the behavior. Does anyone else? So how do we > deal with this situation in the interim? I'd prefer we keep the current > "broken" state, as I consider it less broken than what we had before... I > guess some of the core perf developers should decide how to handle this. Not sure whether it needs be fixed or not. If we fix it, srcline and address would not match so it can give its own confusion to users. Ideally it should display an addressof the instruction before the address IMHO. Thanks, Namhyung
Re: perf report: fix off-by-one for non-activation frames
On Freitag, 16. Juni 2017 13:57:44 CEST Jan Kratochvil wrote: > On Fri, 16 Jun 2017 13:51:37 +0200, Milian Wolff wrote: > > > perf-4.12.0-0.rc5.git0.1.fc27.x86_64 > > > > > > 39e32e gdb_main (/usr/libexec/gdb) > > > 10b6fa main (/usr/libexec/gdb) > > > > > >0x5565f6f6 <+54>:callq 0x558f17a0 > > >:mov > > >0x18(%rsp),%rcx > > [...] > > > Excuse me, but I'm having trouble following you. The non-GDB backtraces > > you > > are pasting do not show srcline information. So what exactly is broken? > > There is broken that perf now reports address 10b6fa (corresponding to > relocated address 0x5565f6fa) but there is no instruction on address > 0x5565f6fa. If you 'objdump -d' it you cannot find any instruction > on adress 0x5565f6fa (or on address 0x10b6fa). There is > instruction on address 0x5565f6fb. > > > Maybe paste the perf output you get now and highlight what you'd expect > > instead? > > Actual: > 39e32e gdb_main (/usr/libexec/gdb) > 10b6fa main (/usr/libexec/gdb) > Expected: > 39e32f gdb_main (/usr/libexec/gdb) > 10b6fb main (/usr/libexec/gdb) > > I agree perf needs to calculate with 39e32e and 10b6fa. But it should > display to user 39e32f and 10b6fb. Hmmm this will require some more changes throughout the stack then. I.e. we'll have to remember the "isactivation" flag along with the original IP, and only apply the offset then when we query for inliners or srcline information. Maybe I can pull that off somehow in the patch series I'm working on currently, which refactors the whole inline/srcline/callchain logic anyways. I don't see an easy way to fix the behavior. Does anyone else? So how do we deal with this situation in the interim? I'd prefer we keep the current "broken" state, as I consider it less broken than what we had before... I guess some of the core perf developers should decide how to handle this. Thanks -- Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts
Re: perf report: fix off-by-one for non-activation frames
On Freitag, 16. Juni 2017 13:57:44 CEST Jan Kratochvil wrote: > On Fri, 16 Jun 2017 13:51:37 +0200, Milian Wolff wrote: > > > perf-4.12.0-0.rc5.git0.1.fc27.x86_64 > > > > > > 39e32e gdb_main (/usr/libexec/gdb) > > > 10b6fa main (/usr/libexec/gdb) > > > > > >0x5565f6f6 <+54>:callq 0x558f17a0 > > >:mov > > >0x18(%rsp),%rcx > > [...] > > > Excuse me, but I'm having trouble following you. The non-GDB backtraces > > you > > are pasting do not show srcline information. So what exactly is broken? > > There is broken that perf now reports address 10b6fa (corresponding to > relocated address 0x5565f6fa) but there is no instruction on address > 0x5565f6fa. If you 'objdump -d' it you cannot find any instruction > on adress 0x5565f6fa (or on address 0x10b6fa). There is > instruction on address 0x5565f6fb. > > > Maybe paste the perf output you get now and highlight what you'd expect > > instead? > > Actual: > 39e32e gdb_main (/usr/libexec/gdb) > 10b6fa main (/usr/libexec/gdb) > Expected: > 39e32f gdb_main (/usr/libexec/gdb) > 10b6fb main (/usr/libexec/gdb) > > I agree perf needs to calculate with 39e32e and 10b6fa. But it should > display to user 39e32f and 10b6fb. Hmmm this will require some more changes throughout the stack then. I.e. we'll have to remember the "isactivation" flag along with the original IP, and only apply the offset then when we query for inliners or srcline information. Maybe I can pull that off somehow in the patch series I'm working on currently, which refactors the whole inline/srcline/callchain logic anyways. I don't see an easy way to fix the behavior. Does anyone else? So how do we deal with this situation in the interim? I'd prefer we keep the current "broken" state, as I consider it less broken than what we had before... I guess some of the core perf developers should decide how to handle this. Thanks -- Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts
Re: perf report: fix off-by-one for non-activation frames
On Fri, 16 Jun 2017 13:51:37 +0200, Milian Wolff wrote: > > perf-4.12.0-0.rc5.git0.1.fc27.x86_64 > > 39e32e gdb_main (/usr/libexec/gdb) > > 10b6fa main (/usr/libexec/gdb) > >0x5565f6f6 <+54>:callq 0x558f17a0 > >>0x5565f6fb <+59>:mov0x18(%rsp),%rcx [...] > Excuse me, but I'm having trouble following you. The non-GDB backtraces you > are pasting do not show srcline information. So what exactly is broken? There is broken that perf now reports address 10b6fa (corresponding to relocated address 0x5565f6fa) but there is no instruction on address 0x5565f6fa. If you 'objdump -d' it you cannot find any instruction on adress 0x5565f6fa (or on address 0x10b6fa). There is instruction on address 0x5565f6fb. > Maybe paste the perf output you get now and highlight what you'd expect > instead? Actual: 39e32e gdb_main (/usr/libexec/gdb) 10b6fa main (/usr/libexec/gdb) Expected: 39e32f gdb_main (/usr/libexec/gdb) 10b6fb main (/usr/libexec/gdb) I agree perf needs to calculate with 39e32e and 10b6fa. But it should display to user 39e32f and 10b6fb. Jan
Re: perf report: fix off-by-one for non-activation frames
On Fri, 16 Jun 2017 13:51:37 +0200, Milian Wolff wrote: > > perf-4.12.0-0.rc5.git0.1.fc27.x86_64 > > 39e32e gdb_main (/usr/libexec/gdb) > > 10b6fa main (/usr/libexec/gdb) > >0x5565f6f6 <+54>:callq 0x558f17a0 > > >0x5565f6fb <+59>:mov0x18(%rsp),%rcx [...] > Excuse me, but I'm having trouble following you. The non-GDB backtraces you > are pasting do not show srcline information. So what exactly is broken? There is broken that perf now reports address 10b6fa (corresponding to relocated address 0x5565f6fa) but there is no instruction on address 0x5565f6fa. If you 'objdump -d' it you cannot find any instruction on adress 0x5565f6fa (or on address 0x10b6fa). There is instruction on address 0x5565f6fb. > Maybe paste the perf output you get now and highlight what you'd expect > instead? Actual: 39e32e gdb_main (/usr/libexec/gdb) 10b6fa main (/usr/libexec/gdb) Expected: 39e32f gdb_main (/usr/libexec/gdb) 10b6fb main (/usr/libexec/gdb) I agree perf needs to calculate with 39e32e and 10b6fa. But it should display to user 39e32f and 10b6fb. Jan
Re: perf report: fix off-by-one for non-activation frames
On Freitag, 16. Juni 2017 08:14:56 CEST Jan Kratochvil wrote: > On Mon, 15 May 2017 17:04:44 +0200, Milian Wolff wrote: > > commit 1982ad48fc82c284a5cc55697a012d3357e84d01 > Author: Milian Wolff> Date: Wed May 24 15:21:25 2017 +0900 > > > --- a/tools/perf/util/unwind-libdw.c > > +++ b/tools/perf/util/unwind-libdw.c > > @@ -168,12 +168,16 @@ frame_callback(Dwfl_Frame *state, void *arg) > > ... > > > + if (!isactivation) > > + --pc; > > + > > FYI I find it as a regression a bit: > > perf-4.11.4-200.fc25.x86_64 > 30c563 gdb_main (/usr/libexec/gdb) >fae48 main (/usr/libexec/gdb) >0x5564ee43 <+51>:callq 0x5585f340 > 0x5564ee48 <+56>:mov > 0x18(%rsp),%rcx > > perf-4.12.0-0.rc5.git0.1.fc27.x86_64 > 39e32e gdb_main (/usr/libexec/gdb) > 10b6fa main (/usr/libexec/gdb) >0x5565f6f6 <+54>:callq 0x558f17a0 > 0x5565f6fb <+59>:mov > 0x18(%rsp),%rcx > > In backtraces it is correct to show the source line of the calling line - as > perf does now after your fix - but one still should report PC address of > the start of the next instruction. At least this is what debuggers are > used to do: > > #9 gdb_main (args=0x7fffe2e0) at ../../gdb/main.c:1257 > #10 0x5565f6fb in main (argc=, argv=) > at ../../gdb/gdb.c:40 0x5565f6f6 <+54>: callq 0x558f17a0 > => 0x5565f6fb <+59>: mov > 0x18(%rsp),%rcx > Line 40 of "../../gdb/gdb.c" starts at address 0x5565f6f6 char**)+54> and ends at 0x5565f6fb . Line 41 of > "../../gdb/gdb.c" starts at address 0x5565f6fb > and ends at 0x5565f715. > > You see "gdb.c:40" and 0x5565f6fb in the backtrace despite > 0x5565f6fb is already line 41. > > This is also why elfutils reports separately PC and 'isactivation' flag. > Instead of just reporting decreased PC. Excuse me, but I'm having trouble following you. The non-GDB backtraces you are pasting do not show srcline information. So what exactly is broken? Can you show me the differences a bit more clearly? Maybe paste the perf output you get now and highlight what you'd expect instead? Best would be an accompanying test case that I can use to improve the situation, if possible? Thanks -- Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts
Re: perf report: fix off-by-one for non-activation frames
On Freitag, 16. Juni 2017 08:14:56 CEST Jan Kratochvil wrote: > On Mon, 15 May 2017 17:04:44 +0200, Milian Wolff wrote: > > commit 1982ad48fc82c284a5cc55697a012d3357e84d01 > Author: Milian Wolff > Date: Wed May 24 15:21:25 2017 +0900 > > > --- a/tools/perf/util/unwind-libdw.c > > +++ b/tools/perf/util/unwind-libdw.c > > @@ -168,12 +168,16 @@ frame_callback(Dwfl_Frame *state, void *arg) > > ... > > > + if (!isactivation) > > + --pc; > > + > > FYI I find it as a regression a bit: > > perf-4.11.4-200.fc25.x86_64 > 30c563 gdb_main (/usr/libexec/gdb) >fae48 main (/usr/libexec/gdb) >0x5564ee43 <+51>:callq 0x5585f340 > 0x5564ee48 <+56>:mov > 0x18(%rsp),%rcx > > perf-4.12.0-0.rc5.git0.1.fc27.x86_64 > 39e32e gdb_main (/usr/libexec/gdb) > 10b6fa main (/usr/libexec/gdb) >0x5565f6f6 <+54>:callq 0x558f17a0 > 0x5565f6fb <+59>:mov > 0x18(%rsp),%rcx > > In backtraces it is correct to show the source line of the calling line - as > perf does now after your fix - but one still should report PC address of > the start of the next instruction. At least this is what debuggers are > used to do: > > #9 gdb_main (args=0x7fffe2e0) at ../../gdb/main.c:1257 > #10 0x5565f6fb in main (argc=, argv=) > at ../../gdb/gdb.c:40 0x5565f6f6 <+54>: callq 0x558f17a0 > => 0x5565f6fb <+59>: mov > 0x18(%rsp),%rcx > Line 40 of "../../gdb/gdb.c" starts at address 0x5565f6f6 char**)+54> and ends at 0x5565f6fb . Line 41 of > "../../gdb/gdb.c" starts at address 0x5565f6fb > and ends at 0x5565f715. > > You see "gdb.c:40" and 0x5565f6fb in the backtrace despite > 0x5565f6fb is already line 41. > > This is also why elfutils reports separately PC and 'isactivation' flag. > Instead of just reporting decreased PC. Excuse me, but I'm having trouble following you. The non-GDB backtraces you are pasting do not show srcline information. So what exactly is broken? Can you show me the differences a bit more clearly? Maybe paste the perf output you get now and highlight what you'd expect instead? Best would be an accompanying test case that I can use to improve the situation, if possible? Thanks -- Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts
Re: perf report: fix off-by-one for non-activation frames
On Mon, 15 May 2017 17:04:44 +0200, Milian Wolff wrote: commit 1982ad48fc82c284a5cc55697a012d3357e84d01 Author: Milian WolffDate: Wed May 24 15:21:25 2017 +0900 > --- a/tools/perf/util/unwind-libdw.c > +++ b/tools/perf/util/unwind-libdw.c > @@ -168,12 +168,16 @@ frame_callback(Dwfl_Frame *state, void *arg) ... > + if (!isactivation) > + --pc; > + FYI I find it as a regression a bit: perf-4.11.4-200.fc25.x86_64 30c563 gdb_main (/usr/libexec/gdb) fae48 main (/usr/libexec/gdb) 0x5564ee43 <+51>:callq 0x5585f340 0x5564ee48 <+56>:mov0x18(%rsp),%rcx perf-4.12.0-0.rc5.git0.1.fc27.x86_64 39e32e gdb_main (/usr/libexec/gdb) 10b6fa main (/usr/libexec/gdb) 0x5565f6f6 <+54>:callq 0x558f17a0 0x5565f6fb <+59>:mov0x18(%rsp),%rcx In backtraces it is correct to show the source line of the calling line - as perf does now after your fix - but one still should report PC address of the start of the next instruction. At least this is what debuggers are used to do: #9 gdb_main (args=0x7fffe2e0) at ../../gdb/main.c:1257 #10 0x5565f6fb in main (argc=, argv=) at ../../gdb/gdb.c:40 0x5565f6f6 <+54>:callq 0x558f17a0 => 0x5565f6fb <+59>:mov0x18(%rsp),%rcx Line 40 of "../../gdb/gdb.c" starts at address 0x5565f6f6 and ends at 0x5565f6fb . Line 41 of "../../gdb/gdb.c" starts at address 0x5565f6fb and ends at 0x5565f715. You see "gdb.c:40" and 0x5565f6fb in the backtrace despite 0x5565f6fb is already line 41. This is also why elfutils reports separately PC and 'isactivation' flag. Instead of just reporting decreased PC. Jan
Re: perf report: fix off-by-one for non-activation frames
On Mon, 15 May 2017 17:04:44 +0200, Milian Wolff wrote: commit 1982ad48fc82c284a5cc55697a012d3357e84d01 Author: Milian Wolff Date: Wed May 24 15:21:25 2017 +0900 > --- a/tools/perf/util/unwind-libdw.c > +++ b/tools/perf/util/unwind-libdw.c > @@ -168,12 +168,16 @@ frame_callback(Dwfl_Frame *state, void *arg) ... > + if (!isactivation) > + --pc; > + FYI I find it as a regression a bit: perf-4.11.4-200.fc25.x86_64 30c563 gdb_main (/usr/libexec/gdb) fae48 main (/usr/libexec/gdb) 0x5564ee43 <+51>:callq 0x5585f340 0x5564ee48 <+56>:mov0x18(%rsp),%rcx perf-4.12.0-0.rc5.git0.1.fc27.x86_64 39e32e gdb_main (/usr/libexec/gdb) 10b6fa main (/usr/libexec/gdb) 0x5565f6f6 <+54>:callq 0x558f17a0 0x5565f6fb <+59>:mov0x18(%rsp),%rcx In backtraces it is correct to show the source line of the calling line - as perf does now after your fix - but one still should report PC address of the start of the next instruction. At least this is what debuggers are used to do: #9 gdb_main (args=0x7fffe2e0) at ../../gdb/main.c:1257 #10 0x5565f6fb in main (argc=, argv=) at ../../gdb/gdb.c:40 0x5565f6f6 <+54>:callq 0x558f17a0 => 0x5565f6fb <+59>:mov0x18(%rsp),%rcx Line 40 of "../../gdb/gdb.c" starts at address 0x5565f6f6 and ends at 0x5565f6fb . Line 41 of "../../gdb/gdb.c" starts at address 0x5565f6fb and ends at 0x5565f715. You see "gdb.c:40" and 0x5565f6fb in the backtrace despite 0x5565f6fb is already line 41. This is also why elfutils reports separately PC and 'isactivation' flag. Instead of just reporting decreased PC. Jan
[tip:perf/urgent] perf report: Fix off-by-one for non-activation frames
Commit-ID: 1982ad48fc82c284a5cc55697a012d3357e84d01 Gitweb: http://git.kernel.org/tip/1982ad48fc82c284a5cc55697a012d3357e84d01 Author: Milian Wolff <milian.wo...@kdab.com> AuthorDate: Wed, 24 May 2017 15:21:25 +0900 Committer: Ingo Molnar <mi...@kernel.org> CommitDate: Wed, 24 May 2017 08:41:48 +0200 perf report: Fix off-by-one for non-activation frames As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for<long, std::ratio<1l, 1000l> > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for<long, std::ratio<1l, 1000l> > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Similarly it works for signal frames: ~~~ __noinline void bar(void) { volatile long cnt = 0; for (cnt = 0; cnt < 1; cnt++); } __noinline void foo(void) { bar(); } void sig_handler(int sig) { foo(); } int main(void) { signal(SIGUSR1, sig_handler); raise(SIGUSR1); foo(); return 0; } Before, the report wrongly points to `signal.c:29` after raise(): $ perf report --stdio --no-children -g srcline -s srcline ... 100.00% signal.c:11 | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:29 __libc_start_main _start With this patch in, the issue is fixed and we instead get: 100.00% signal signal[.] bar | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:27 __libc_start_main _start Note how this patch fixes this issue for both unwinding metho
[tip:perf/urgent] perf report: Fix off-by-one for non-activation frames
Commit-ID: 1982ad48fc82c284a5cc55697a012d3357e84d01 Gitweb: http://git.kernel.org/tip/1982ad48fc82c284a5cc55697a012d3357e84d01 Author: Milian Wolff AuthorDate: Wed, 24 May 2017 15:21:25 +0900 Committer: Ingo Molnar CommitDate: Wed, 24 May 2017 08:41:48 +0200 perf report: Fix off-by-one for non-activation frames As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Similarly it works for signal frames: ~~~ __noinline void bar(void) { volatile long cnt = 0; for (cnt = 0; cnt < 1; cnt++); } __noinline void foo(void) { bar(); } void sig_handler(int sig) { foo(); } int main(void) { signal(SIGUSR1, sig_handler); raise(SIGUSR1); foo(); return 0; } Before, the report wrongly points to `signal.c:29` after raise(): $ perf report --stdio --no-children -g srcline -s srcline ... 100.00% signal.c:11 | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:29 __libc_start_main _start With this patch in, the issue is fixed and we instead get: 100.00% signal signal[.] bar | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:27 __libc_start_main _start Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc(). For libunwind, we replace th
[PATCH 3/7] perf report: fix off-by-one for non-activation frames
From: Milian WolffAs the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Similarly it works for signal frames: ~~~ __noinline void bar(void) { volatile long cnt = 0; for (cnt = 0; cnt < 1; cnt++); } __noinline void foo(void) { bar(); } void sig_handler(int sig) { foo(); } int main(void) { signal(SIGUSR1, sig_handler); raise(SIGUSR1); foo(); return 0; } Before, the report wrongly points to `signal.c:29` after raise(): $ perf report --stdio --no-children -g srcline -s srcline ... 100.00% signal.c:11 | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:29 __libc_start_main _start With this patch in, the issue is fixed and we instead get: 100.00% signal signal[.] bar | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:27 __libc_start_main _start Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc. For libunwind, we replace the functionality via unw_is_signal_frame for any but the very first frame. Cc: Arnaldo Carvalho de Melo Cc: David Ahern Cc: Peter Zijlstra Cc: Yao Jin Signed-off-by: Milian Wolff Signed-off-by:
[PATCH 3/7] perf report: fix off-by-one for non-activation frames
From: Milian Wolff As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Similarly it works for signal frames: ~~~ __noinline void bar(void) { volatile long cnt = 0; for (cnt = 0; cnt < 1; cnt++); } __noinline void foo(void) { bar(); } void sig_handler(int sig) { foo(); } int main(void) { signal(SIGUSR1, sig_handler); raise(SIGUSR1); foo(); return 0; } Before, the report wrongly points to `signal.c:29` after raise(): $ perf report --stdio --no-children -g srcline -s srcline ... 100.00% signal.c:11 | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:29 __libc_start_main _start With this patch in, the issue is fixed and we instead get: 100.00% signal signal[.] bar | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:27 __libc_start_main _start Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc. For libunwind, we replace the functionality via unw_is_signal_frame for any but the very first frame. Cc: Arnaldo Carvalho de Melo Cc: David Ahern Cc: Peter Zijlstra Cc: Yao Jin Signed-off-by: Milian Wolff Signed-off-by: Namhyung Kim --- tools/perf/util/unwind-libdw.c | 6 +- tools/perf/util/unwind-libunwind-local.c | 11 +++ 2 files changed, 16 insertions(+), 1 deletion(-) diff
Re: [PATCH v3] perf report: fix off-by-one for non-activation frames
Hi Milian, On Thu, May 18, 2017 at 10:37:35AM +0200, Milian Wolff wrote: > As the documentation for dwfl_frame_pc says, frames that > are no activation frames need to have their program counter > decremented by one to properly find the function of the caller. > > This fixes many cases where perf report currently attributes > the cost to the next line. I.e. I have code like this: > > ~~~ > #include > #include > > using namespace std; > > int main() > { > this_thread::sleep_for(chrono::milliseconds(1000)); > this_thread::sleep_for(chrono::milliseconds(100)); > this_thread::sleep_for(chrono::milliseconds(10)); > > return 0; > } > ~~~ > > Now compile and record it: > > ~~~ > g++ -std=c++11 -g -O2 test.cpp > echo 1 | sudo tee /proc/sys/kernel/sched_schedstats > perf record \ > --event sched:sched_stat_sleep \ > --event sched:sched_process_exit \ > --event sched:sched_switch --call-graph=dwarf \ > --output perf.data.raw \ > ./a.out > echo 0 | sudo tee /proc/sys/kernel/sched_schedstats > perf inject --sched-stat --input perf.data.raw --output perf.data > ~~~ > > Before this patch, the report clearly shows the off-by-one issue. > Most notably, the last sleep invocation is incorrectly attributed > to the "return 0;" line: > > ~~~ > Overhead Source:Line > ... > >100.00% core.c:0 > | > ---__schedule core.c:0 >schedule >do_nanosleep hrtimer.c:0 >hrtimer_nanosleep >sys_nanosleep >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 >__nanosleep_nocancel .:0 >std::this_thread::sleep_for> > thread:323 >| >|--90.08%--main test.cpp:9 >| __libc_start_main >| _start >| >|--9.01%--main test.cpp:10 >| __libc_start_main >| _start >| > --0.91%--main test.cpp:13 > __libc_start_main > _start > ~~~ > > With this patch here applied, the issue is fixed. The report becomes > much more usable: > > ~~~ > Overhead Source:Line > ... > >100.00% core.c:0 > | > ---__schedule core.c:0 >schedule >do_nanosleep hrtimer.c:0 >hrtimer_nanosleep >sys_nanosleep >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 >__nanosleep_nocancel .:0 >std::this_thread::sleep_for > > thread:323 >| >|--90.08%--main test.cpp:8 >| __libc_start_main >| _start >| >|--9.01%--main test.cpp:9 >| __libc_start_main >| _start >| > --0.91%--main test.cpp:10 > __libc_start_main > _start > ~~~ > > Similarly it works for signal frames: > > ~~~ > #include > #include > #include > > #define __noinline __attribute__((noinline)) > > __noinline void bar(void) > { > volatile long cnt = 0; > > for (cnt = 0; cnt < 1; cnt++); > } > > __noinline void foo(void) > { > bar(); > } > > void sig_handler(int sig) > { > foo(); > } > > int main(void) > { > signal(SIGUSR1, sig_handler); > raise(SIGUSR1); > > foo(); > return 0; > } > > > Before, the report wrongly points to `signal.c:29` after raise(): > > > $ perf report --stdio --no-children -g srcline -s srcline > ... >100.00% signal.c:11 > | > ---bar signal.c:11 >| >|--50.49%--main signal.c:29 >| __libc_start_main >| _start >| > --49.51%--0x33a8f > raise .:0 > main signal.c:29 > __libc_start_main > _start > > > With this patch in, the issue is fixed and we instead get: > > >100.00% signal signal[.] bar > | > ---bar signal.c:11 >| >|--50.49%--main signal.c:29 >| __libc_start_main >| _start >| > --49.51%--0x33a8f > raise .:0 > main signal.c:27 > __libc_start_main > _start > > > Note how this patch fixes
Re: [PATCH v3] perf report: fix off-by-one for non-activation frames
Hi Milian, On Thu, May 18, 2017 at 10:37:35AM +0200, Milian Wolff wrote: > As the documentation for dwfl_frame_pc says, frames that > are no activation frames need to have their program counter > decremented by one to properly find the function of the caller. > > This fixes many cases where perf report currently attributes > the cost to the next line. I.e. I have code like this: > > ~~~ > #include > #include > > using namespace std; > > int main() > { > this_thread::sleep_for(chrono::milliseconds(1000)); > this_thread::sleep_for(chrono::milliseconds(100)); > this_thread::sleep_for(chrono::milliseconds(10)); > > return 0; > } > ~~~ > > Now compile and record it: > > ~~~ > g++ -std=c++11 -g -O2 test.cpp > echo 1 | sudo tee /proc/sys/kernel/sched_schedstats > perf record \ > --event sched:sched_stat_sleep \ > --event sched:sched_process_exit \ > --event sched:sched_switch --call-graph=dwarf \ > --output perf.data.raw \ > ./a.out > echo 0 | sudo tee /proc/sys/kernel/sched_schedstats > perf inject --sched-stat --input perf.data.raw --output perf.data > ~~~ > > Before this patch, the report clearly shows the off-by-one issue. > Most notably, the last sleep invocation is incorrectly attributed > to the "return 0;" line: > > ~~~ > Overhead Source:Line > ... > >100.00% core.c:0 > | > ---__schedule core.c:0 >schedule >do_nanosleep hrtimer.c:0 >hrtimer_nanosleep >sys_nanosleep >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 >__nanosleep_nocancel .:0 >std::this_thread::sleep_for > > thread:323 >| >|--90.08%--main test.cpp:9 >| __libc_start_main >| _start >| >|--9.01%--main test.cpp:10 >| __libc_start_main >| _start >| > --0.91%--main test.cpp:13 > __libc_start_main > _start > ~~~ > > With this patch here applied, the issue is fixed. The report becomes > much more usable: > > ~~~ > Overhead Source:Line > ... > >100.00% core.c:0 > | > ---__schedule core.c:0 >schedule >do_nanosleep hrtimer.c:0 >hrtimer_nanosleep >sys_nanosleep >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 >__nanosleep_nocancel .:0 >std::this_thread::sleep_for > > thread:323 >| >|--90.08%--main test.cpp:8 >| __libc_start_main >| _start >| >|--9.01%--main test.cpp:9 >| __libc_start_main >| _start >| > --0.91%--main test.cpp:10 > __libc_start_main > _start > ~~~ > > Similarly it works for signal frames: > > ~~~ > #include > #include > #include > > #define __noinline __attribute__((noinline)) > > __noinline void bar(void) > { > volatile long cnt = 0; > > for (cnt = 0; cnt < 1; cnt++); > } > > __noinline void foo(void) > { > bar(); > } > > void sig_handler(int sig) > { > foo(); > } > > int main(void) > { > signal(SIGUSR1, sig_handler); > raise(SIGUSR1); > > foo(); > return 0; > } > > > Before, the report wrongly points to `signal.c:29` after raise(): > > > $ perf report --stdio --no-children -g srcline -s srcline > ... >100.00% signal.c:11 > | > ---bar signal.c:11 >| >|--50.49%--main signal.c:29 >| __libc_start_main >| _start >| > --49.51%--0x33a8f > raise .:0 > main signal.c:29 > __libc_start_main > _start > > > With this patch in, the issue is fixed and we instead get: > > >100.00% signal signal[.] bar > | > ---bar signal.c:11 >| >|--50.49%--main signal.c:29 >| __libc_start_main >| _start >| > --49.51%--0x33a8f > raise .:0 > main signal.c:27 > __libc_start_main > _start > > > Note how this patch fixes this issue for both unwinding methods, i.e. > both dwfl
[PATCH v3] perf report: fix off-by-one for non-activation frames
As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for> thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Similarly it works for signal frames: ~~~ #include #include #include #define __noinline __attribute__((noinline)) __noinline void bar(void) { volatile long cnt = 0; for (cnt = 0; cnt < 1; cnt++); } __noinline void foo(void) { bar(); } void sig_handler(int sig) { foo(); } int main(void) { signal(SIGUSR1, sig_handler); raise(SIGUSR1); foo(); return 0; } Before, the report wrongly points to `signal.c:29` after raise(): $ perf report --stdio --no-children -g srcline -s srcline ... 100.00% signal.c:11 | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:29 __libc_start_main _start With this patch in, the issue is fixed and we instead get: 100.00% signal signal[.] bar | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:27 __libc_start_main _start Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc. For libunwind, we replace the functionality via unw_is_signal_frame for any but the very first frame. Cc: Arnaldo Carvalho de Melo Cc: David Ahern Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Yao Jin
[PATCH v3] perf report: fix off-by-one for non-activation frames
As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Similarly it works for signal frames: ~~~ #include #include #include #define __noinline __attribute__((noinline)) __noinline void bar(void) { volatile long cnt = 0; for (cnt = 0; cnt < 1; cnt++); } __noinline void foo(void) { bar(); } void sig_handler(int sig) { foo(); } int main(void) { signal(SIGUSR1, sig_handler); raise(SIGUSR1); foo(); return 0; } Before, the report wrongly points to `signal.c:29` after raise(): $ perf report --stdio --no-children -g srcline -s srcline ... 100.00% signal.c:11 | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:29 __libc_start_main _start With this patch in, the issue is fixed and we instead get: 100.00% signal signal[.] bar | ---bar signal.c:11 | |--50.49%--main signal.c:29 | __libc_start_main | _start | --49.51%--0x33a8f raise .:0 main signal.c:27 __libc_start_main _start Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc. For libunwind, we replace the functionality via unw_is_signal_frame for any but the very first frame. Cc: Arnaldo Carvalho de Melo Cc: David Ahern Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Yao Jin Signed-off-by: Milian Wolff --- tools/perf/util/unwind-libdw.c | 6 +- tools/perf/util/unwind-libunwind-local.c | 11 +++ 2 files
Re: [PATCH v2] perf report: fix off-by-one for non-activation frames
On Mittwoch, 17. Mai 2017 07:12:21 CEST Namhyung Kim wrote: > On Tue, May 16, 2017 at 06:26:47PM +0200, Milian Wolff wrote: > > On Dienstag, 16. Mai 2017 18:17:26 CEST Milian Wolff wrote: > > > On Dienstag, 16. Mai 2017 16:38:29 CEST Namhyung Kim wrote: > > > > On Tue, May 16, 2017 at 10:59:51AM +0200, Milian Wolff wrote: > > > > > As the documentation for dwfl_frame_pc says, frames that > > > > > are no activation frames need to have their program counter > > > > > decremented by one to properly find the function of the caller. > > > > > > > > > > This fixes many cases where perf report currently attributes > > > > > the cost to the next line. I.e. I have code like this: > > > > > > > > > > ~~~ > > > > > > > > > > #include > > > > > #include > > > > > > > > > > using namespace std; > > > > > > > > > > int main() > > > > > { > > > > > > > > > > this_thread::sleep_for(chrono::milliseconds(1000)); > > > > > this_thread::sleep_for(chrono::milliseconds(100)); > > > > > this_thread::sleep_for(chrono::milliseconds(10)); > > > > > > > > > > return 0; > > > > > > > > > > } > > > > > > > > > > ~~~ > > > > > > > > It'd be nice if the test program has a signal frame for verification. > > > > > > I have pretty much zero experience about signals. Would it be enough to > > > add > > > a signal handler for, say, SIGUSR1 to my test application and then > > > trigger > > > a sleep when that signal is delivered? If that should be enough, I'll > > > write > > > and test it out. > > > > ~~~ > > #include > > #include > > #include > > > > using namespace std; > > > > volatile bool run_loop = true; > > > > void my_handler(int signum) > > { > > > > this_thread::sleep_for(chrono::milliseconds(1000)); > > this_thread::sleep_for(chrono::milliseconds(100)); > > this_thread::sleep_for(chrono::milliseconds(10)); > > run_loop = false; > > > > } > > > > int main() > > { > > > > signal(SIGUSR1, my_handler); > > > > while (run_loop) {} > > > > return 0; > > > > } > > ~~~ > > > > This does not properly unwind neither before nor after this patch. I only > > ever> > > get: > >100.00% core.c:0 > > > > ---__schedule core.c:0 > > > >schedule > >do_nanosleep hrtimer.c:0 > >hrtimer_nanosleep > >sys_nanosleep > >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 > >__nanosleep_nocancel .:0 > >std::this_thread::sleep_for> > > > > thread:323 > > > > So... should this work? Please tell me how to test this properly. > > How did you send the SIGUSR1 to the process? > > Anyway it does nothing to do with inlining, right? I just wrote a > test code below to burn a cpu with or without a signal frame. > > $ nl -ba frame-addr.c > 1#include > 2#include > 3#include > 4 > 5#define __noinline __attribute__((noinline)) > 6 > 7__noinline void bar(void) > 8{ > 9 volatile long cnt = 0; > 10 > 11 for (cnt = 0; cnt < 1; cnt++); > 12} > 13 > 14__noinline void foo(void) > 15{ > 16 bar(); > 17} > 18 > 19void sig_handler(int sig) > 20{ > 21 foo(); > 22} > 23 > 24int main(void) > 25{ > 26 signal(SIGUSR1, sig_handler); > 27 raise(SIGUSR1); > 28 > 29 foo(); > 30 return 0; > 31} > > $ gcc -O2 -g -fno-optimize-sibling-calls -o frame-addr frame-addr.c > > $ perf record --call-graph dwarf ./frame-addr > > $ perf report -q -g srcline | head -15 > 99.88% frame-addr frame-addr[.] bar > > ---bar frame-addr.c:11 > foo frame-addr.c:16 > > |--51.12%--main frame-addr.c:29 > | > | __libc_start_main > | _start > > --48.76%--sig_handler frame-addr.c:21 > 0x33a8f > raise .:0 > main frame-addr.c:29 <--- bad > __libc_start_main > _start > > Note that 'raise' was called at line 27. It seems that simply > checking current frame fixes it. Got it - thanks for your test case. I fixed it and will resend v3 of this patch shortly. Thanks -- Milian Wolff | milian.wo...@kdab.com | Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH v2] perf report: fix off-by-one for non-activation frames
On Mittwoch, 17. Mai 2017 07:12:21 CEST Namhyung Kim wrote: > On Tue, May 16, 2017 at 06:26:47PM +0200, Milian Wolff wrote: > > On Dienstag, 16. Mai 2017 18:17:26 CEST Milian Wolff wrote: > > > On Dienstag, 16. Mai 2017 16:38:29 CEST Namhyung Kim wrote: > > > > On Tue, May 16, 2017 at 10:59:51AM +0200, Milian Wolff wrote: > > > > > As the documentation for dwfl_frame_pc says, frames that > > > > > are no activation frames need to have their program counter > > > > > decremented by one to properly find the function of the caller. > > > > > > > > > > This fixes many cases where perf report currently attributes > > > > > the cost to the next line. I.e. I have code like this: > > > > > > > > > > ~~~ > > > > > > > > > > #include > > > > > #include > > > > > > > > > > using namespace std; > > > > > > > > > > int main() > > > > > { > > > > > > > > > > this_thread::sleep_for(chrono::milliseconds(1000)); > > > > > this_thread::sleep_for(chrono::milliseconds(100)); > > > > > this_thread::sleep_for(chrono::milliseconds(10)); > > > > > > > > > > return 0; > > > > > > > > > > } > > > > > > > > > > ~~~ > > > > > > > > It'd be nice if the test program has a signal frame for verification. > > > > > > I have pretty much zero experience about signals. Would it be enough to > > > add > > > a signal handler for, say, SIGUSR1 to my test application and then > > > trigger > > > a sleep when that signal is delivered? If that should be enough, I'll > > > write > > > and test it out. > > > > ~~~ > > #include > > #include > > #include > > > > using namespace std; > > > > volatile bool run_loop = true; > > > > void my_handler(int signum) > > { > > > > this_thread::sleep_for(chrono::milliseconds(1000)); > > this_thread::sleep_for(chrono::milliseconds(100)); > > this_thread::sleep_for(chrono::milliseconds(10)); > > run_loop = false; > > > > } > > > > int main() > > { > > > > signal(SIGUSR1, my_handler); > > > > while (run_loop) {} > > > > return 0; > > > > } > > ~~~ > > > > This does not properly unwind neither before nor after this patch. I only > > ever> > > get: > >100.00% core.c:0 > > > > ---__schedule core.c:0 > > > >schedule > >do_nanosleep hrtimer.c:0 > >hrtimer_nanosleep > >sys_nanosleep > >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 > >__nanosleep_nocancel .:0 > >std::this_thread::sleep_for > > > > > thread:323 > > > > So... should this work? Please tell me how to test this properly. > > How did you send the SIGUSR1 to the process? > > Anyway it does nothing to do with inlining, right? I just wrote a > test code below to burn a cpu with or without a signal frame. > > $ nl -ba frame-addr.c > 1#include > 2#include > 3#include > 4 > 5#define __noinline __attribute__((noinline)) > 6 > 7__noinline void bar(void) > 8{ > 9 volatile long cnt = 0; > 10 > 11 for (cnt = 0; cnt < 1; cnt++); > 12} > 13 > 14__noinline void foo(void) > 15{ > 16 bar(); > 17} > 18 > 19void sig_handler(int sig) > 20{ > 21 foo(); > 22} > 23 > 24int main(void) > 25{ > 26 signal(SIGUSR1, sig_handler); > 27 raise(SIGUSR1); > 28 > 29 foo(); > 30 return 0; > 31} > > $ gcc -O2 -g -fno-optimize-sibling-calls -o frame-addr frame-addr.c > > $ perf record --call-graph dwarf ./frame-addr > > $ perf report -q -g srcline | head -15 > 99.88% frame-addr frame-addr[.] bar > > ---bar frame-addr.c:11 > foo frame-addr.c:16 > > |--51.12%--main frame-addr.c:29 > | > | __libc_start_main > | _start > > --48.76%--sig_handler frame-addr.c:21 > 0x33a8f > raise .:0 > main frame-addr.c:29 <--- bad > __libc_start_main > _start > > Note that 'raise' was called at line 27. It seems that simply > checking current frame fixes it. Got it - thanks for your test case. I fixed it and will resend v3 of this patch shortly. Thanks -- Milian Wolff | milian.wo...@kdab.com | Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH v2] perf report: fix off-by-one for non-activation frames
On Tue, May 16, 2017 at 06:26:47PM +0200, Milian Wolff wrote: > On Dienstag, 16. Mai 2017 18:17:26 CEST Milian Wolff wrote: > > On Dienstag, 16. Mai 2017 16:38:29 CEST Namhyung Kim wrote: > > > On Tue, May 16, 2017 at 10:59:51AM +0200, Milian Wolff wrote: > > > > As the documentation for dwfl_frame_pc says, frames that > > > > are no activation frames need to have their program counter > > > > decremented by one to properly find the function of the caller. > > > > > > > > This fixes many cases where perf report currently attributes > > > > the cost to the next line. I.e. I have code like this: > > > > > > > > ~~~ > > > > > > > > #include > > > > #include > > > > > > > > using namespace std; > > > > > > > > int main() > > > > { > > > > > > > > this_thread::sleep_for(chrono::milliseconds(1000)); > > > > this_thread::sleep_for(chrono::milliseconds(100)); > > > > this_thread::sleep_for(chrono::milliseconds(10)); > > > > > > > > return 0; > > > > > > > > } > > > > > > > > ~~~ > > > > > > It'd be nice if the test program has a signal frame for verification. > > > > I have pretty much zero experience about signals. Would it be enough to add > > a signal handler for, say, SIGUSR1 to my test application and then trigger > > a sleep when that signal is delivered? If that should be enough, I'll write > > and test it out. > > ~~~ > #include > #include > #include > > using namespace std; > > volatile bool run_loop = true; > > void my_handler(int signum) > { > this_thread::sleep_for(chrono::milliseconds(1000)); > this_thread::sleep_for(chrono::milliseconds(100)); > this_thread::sleep_for(chrono::milliseconds(10)); > run_loop = false; > } > > int main() > { > signal(SIGUSR1, my_handler); > > while (run_loop) {} > > return 0; > } > ~~~ > > This does not properly unwind neither before nor after this patch. I only > ever > get: > >100.00% core.c:0 > | > ---__schedule core.c:0 >schedule >do_nanosleep hrtimer.c:0 >hrtimer_nanosleep >sys_nanosleep >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 >__nanosleep_nocancel .:0 >std::this_thread::sleep_for> > thread:323 > > So... should this work? Please tell me how to test this properly. How did you send the SIGUSR1 to the process? Anyway it does nothing to do with inlining, right? I just wrote a test code below to burn a cpu with or without a signal frame. $ nl -ba frame-addr.c 1 #include 2 #include 3 #include 4 5 #define __noinline __attribute__((noinline)) 6 7 __noinline void bar(void) 8 { 9volatile long cnt = 0; 10 11for (cnt = 0; cnt < 1; cnt++); 12 } 13 14 __noinline void foo(void) 15 { 16bar(); 17 } 18 19 void sig_handler(int sig) 20 { 21foo(); 22 } 23 24 int main(void) 25 { 26signal(SIGUSR1, sig_handler); 27raise(SIGUSR1); 28 29foo(); 30return 0; 31 } $ gcc -O2 -g -fno-optimize-sibling-calls -o frame-addr frame-addr.c $ perf record --call-graph dwarf ./frame-addr $ perf report -q -g srcline | head -15 99.88% frame-addr frame-addr[.] bar | ---bar frame-addr.c:11 foo frame-addr.c:16 | |--51.12%--main frame-addr.c:29 | __libc_start_main | _start | --48.76%--sig_handler frame-addr.c:21 0x33a8f raise .:0 main frame-addr.c:29 <--- bad __libc_start_main _start Note that 'raise' was called at line 27. It seems that simply checking current frame fixes it. Thanks, Namhyung
Re: [PATCH v2] perf report: fix off-by-one for non-activation frames
On Tue, May 16, 2017 at 06:26:47PM +0200, Milian Wolff wrote: > On Dienstag, 16. Mai 2017 18:17:26 CEST Milian Wolff wrote: > > On Dienstag, 16. Mai 2017 16:38:29 CEST Namhyung Kim wrote: > > > On Tue, May 16, 2017 at 10:59:51AM +0200, Milian Wolff wrote: > > > > As the documentation for dwfl_frame_pc says, frames that > > > > are no activation frames need to have their program counter > > > > decremented by one to properly find the function of the caller. > > > > > > > > This fixes many cases where perf report currently attributes > > > > the cost to the next line. I.e. I have code like this: > > > > > > > > ~~~ > > > > > > > > #include > > > > #include > > > > > > > > using namespace std; > > > > > > > > int main() > > > > { > > > > > > > > this_thread::sleep_for(chrono::milliseconds(1000)); > > > > this_thread::sleep_for(chrono::milliseconds(100)); > > > > this_thread::sleep_for(chrono::milliseconds(10)); > > > > > > > > return 0; > > > > > > > > } > > > > > > > > ~~~ > > > > > > It'd be nice if the test program has a signal frame for verification. > > > > I have pretty much zero experience about signals. Would it be enough to add > > a signal handler for, say, SIGUSR1 to my test application and then trigger > > a sleep when that signal is delivered? If that should be enough, I'll write > > and test it out. > > ~~~ > #include > #include > #include > > using namespace std; > > volatile bool run_loop = true; > > void my_handler(int signum) > { > this_thread::sleep_for(chrono::milliseconds(1000)); > this_thread::sleep_for(chrono::milliseconds(100)); > this_thread::sleep_for(chrono::milliseconds(10)); > run_loop = false; > } > > int main() > { > signal(SIGUSR1, my_handler); > > while (run_loop) {} > > return 0; > } > ~~~ > > This does not properly unwind neither before nor after this patch. I only > ever > get: > >100.00% core.c:0 > | > ---__schedule core.c:0 >schedule >do_nanosleep hrtimer.c:0 >hrtimer_nanosleep >sys_nanosleep >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 >__nanosleep_nocancel .:0 >std::this_thread::sleep_for > > thread:323 > > So... should this work? Please tell me how to test this properly. How did you send the SIGUSR1 to the process? Anyway it does nothing to do with inlining, right? I just wrote a test code below to burn a cpu with or without a signal frame. $ nl -ba frame-addr.c 1 #include 2 #include 3 #include 4 5 #define __noinline __attribute__((noinline)) 6 7 __noinline void bar(void) 8 { 9volatile long cnt = 0; 10 11for (cnt = 0; cnt < 1; cnt++); 12 } 13 14 __noinline void foo(void) 15 { 16bar(); 17 } 18 19 void sig_handler(int sig) 20 { 21foo(); 22 } 23 24 int main(void) 25 { 26signal(SIGUSR1, sig_handler); 27raise(SIGUSR1); 28 29foo(); 30return 0; 31 } $ gcc -O2 -g -fno-optimize-sibling-calls -o frame-addr frame-addr.c $ perf record --call-graph dwarf ./frame-addr $ perf report -q -g srcline | head -15 99.88% frame-addr frame-addr[.] bar | ---bar frame-addr.c:11 foo frame-addr.c:16 | |--51.12%--main frame-addr.c:29 | __libc_start_main | _start | --48.76%--sig_handler frame-addr.c:21 0x33a8f raise .:0 main frame-addr.c:29 <--- bad __libc_start_main _start Note that 'raise' was called at line 27. It seems that simply checking current frame fixes it. Thanks, Namhyung
Re: [PATCH v2] perf report: fix off-by-one for non-activation frames
On Dienstag, 16. Mai 2017 18:17:26 CEST Milian Wolff wrote: > On Dienstag, 16. Mai 2017 16:38:29 CEST Namhyung Kim wrote: > > On Tue, May 16, 2017 at 10:59:51AM +0200, Milian Wolff wrote: > > > As the documentation for dwfl_frame_pc says, frames that > > > are no activation frames need to have their program counter > > > decremented by one to properly find the function of the caller. > > > > > > This fixes many cases where perf report currently attributes > > > the cost to the next line. I.e. I have code like this: > > > > > > ~~~ > > > > > > #include > > > #include > > > > > > using namespace std; > > > > > > int main() > > > { > > > > > > this_thread::sleep_for(chrono::milliseconds(1000)); > > > this_thread::sleep_for(chrono::milliseconds(100)); > > > this_thread::sleep_for(chrono::milliseconds(10)); > > > > > > return 0; > > > > > > } > > > > > > ~~~ > > > > It'd be nice if the test program has a signal frame for verification. > > I have pretty much zero experience about signals. Would it be enough to add > a signal handler for, say, SIGUSR1 to my test application and then trigger > a sleep when that signal is delivered? If that should be enough, I'll write > and test it out. ~~~ #include #include #include using namespace std; volatile bool run_loop = true; void my_handler(int signum) { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); run_loop = false; } int main() { signal(SIGUSR1, my_handler); while (run_loop) {} return 0; } ~~~ This does not properly unwind neither before nor after this patch. I only ever get: 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for> thread:323 So... should this work? Please tell me how to test this properly. Thanks -- Milian Wolff | milian.wo...@kdab.com | Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH v2] perf report: fix off-by-one for non-activation frames
On Dienstag, 16. Mai 2017 18:17:26 CEST Milian Wolff wrote: > On Dienstag, 16. Mai 2017 16:38:29 CEST Namhyung Kim wrote: > > On Tue, May 16, 2017 at 10:59:51AM +0200, Milian Wolff wrote: > > > As the documentation for dwfl_frame_pc says, frames that > > > are no activation frames need to have their program counter > > > decremented by one to properly find the function of the caller. > > > > > > This fixes many cases where perf report currently attributes > > > the cost to the next line. I.e. I have code like this: > > > > > > ~~~ > > > > > > #include > > > #include > > > > > > using namespace std; > > > > > > int main() > > > { > > > > > > this_thread::sleep_for(chrono::milliseconds(1000)); > > > this_thread::sleep_for(chrono::milliseconds(100)); > > > this_thread::sleep_for(chrono::milliseconds(10)); > > > > > > return 0; > > > > > > } > > > > > > ~~~ > > > > It'd be nice if the test program has a signal frame for verification. > > I have pretty much zero experience about signals. Would it be enough to add > a signal handler for, say, SIGUSR1 to my test application and then trigger > a sleep when that signal is delivered? If that should be enough, I'll write > and test it out. ~~~ #include #include #include using namespace std; volatile bool run_loop = true; void my_handler(int signum) { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); run_loop = false; } int main() { signal(SIGUSR1, my_handler); while (run_loop) {} return 0; } ~~~ This does not properly unwind neither before nor after this patch. I only ever get: 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 So... should this work? Please tell me how to test this properly. Thanks -- Milian Wolff | milian.wo...@kdab.com | Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH v2] perf report: fix off-by-one for non-activation frames
On Dienstag, 16. Mai 2017 16:38:29 CEST Namhyung Kim wrote: > On Tue, May 16, 2017 at 10:59:51AM +0200, Milian Wolff wrote: > > As the documentation for dwfl_frame_pc says, frames that > > are no activation frames need to have their program counter > > decremented by one to properly find the function of the caller. > > > > This fixes many cases where perf report currently attributes > > the cost to the next line. I.e. I have code like this: > > > > ~~~ > > > > #include > > #include > > > > using namespace std; > > > > int main() > > { > > > > this_thread::sleep_for(chrono::milliseconds(1000)); > > this_thread::sleep_for(chrono::milliseconds(100)); > > this_thread::sleep_for(chrono::milliseconds(10)); > > > > return 0; > > > > } > > > > ~~~ > > It'd be nice if the test program has a signal frame for verification. I have pretty much zero experience about signals. Would it be enough to add a signal handler for, say, SIGUSR1 to my test application and then trigger a sleep when that signal is delivered? If that should be enough, I'll write and test it out. > > diff --git a/tools/perf/util/unwind-libunwind-local.c > > b/tools/perf/util/unwind-libunwind-local.c index > > f8455bed6e65..30ab26375c80 100644 > > --- a/tools/perf/util/unwind-libunwind-local.c > > +++ b/tools/perf/util/unwind-libunwind-local.c > > @@ -690,8 +690,22 @@ static int get_entries(struct unwind_info *ui, > > unwind_entry_cb_t cb,> > > if (ret) > > > > display_error(ret); > > > > + bool previous_frame_was_signal = false; > > > > while (!ret && (unw_step() > 0) && i < max_stack) { > > > > unw_get_reg(, UNW_REG_IP, [i]); > > > > + > > + /* > > +* Decrement the IP for any non-activation frames. > > +* this is required to properly find the srcline > > +* for caller frames. > > +* See also the documentation for dwfl_frame_pc, > > +* which this code tries to replicate. > > +*/ > > + bool frame_is_signal = unw_is_signal_frame() > 0; > > + if (!previous_frame_was_signal && !frame_is_signal) > > + --ips[i]; > > + previous_frame_was_signal = frame_is_signal; > > Does it need to check previous frame too? That's what dwfl_frame_pc does, if I'm not misunderstanding it's source code? Bye -- Milian Wolff | milian.wo...@kdab.com | Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH v2] perf report: fix off-by-one for non-activation frames
On Dienstag, 16. Mai 2017 16:38:29 CEST Namhyung Kim wrote: > On Tue, May 16, 2017 at 10:59:51AM +0200, Milian Wolff wrote: > > As the documentation for dwfl_frame_pc says, frames that > > are no activation frames need to have their program counter > > decremented by one to properly find the function of the caller. > > > > This fixes many cases where perf report currently attributes > > the cost to the next line. I.e. I have code like this: > > > > ~~~ > > > > #include > > #include > > > > using namespace std; > > > > int main() > > { > > > > this_thread::sleep_for(chrono::milliseconds(1000)); > > this_thread::sleep_for(chrono::milliseconds(100)); > > this_thread::sleep_for(chrono::milliseconds(10)); > > > > return 0; > > > > } > > > > ~~~ > > It'd be nice if the test program has a signal frame for verification. I have pretty much zero experience about signals. Would it be enough to add a signal handler for, say, SIGUSR1 to my test application and then trigger a sleep when that signal is delivered? If that should be enough, I'll write and test it out. > > diff --git a/tools/perf/util/unwind-libunwind-local.c > > b/tools/perf/util/unwind-libunwind-local.c index > > f8455bed6e65..30ab26375c80 100644 > > --- a/tools/perf/util/unwind-libunwind-local.c > > +++ b/tools/perf/util/unwind-libunwind-local.c > > @@ -690,8 +690,22 @@ static int get_entries(struct unwind_info *ui, > > unwind_entry_cb_t cb,> > > if (ret) > > > > display_error(ret); > > > > + bool previous_frame_was_signal = false; > > > > while (!ret && (unw_step() > 0) && i < max_stack) { > > > > unw_get_reg(, UNW_REG_IP, [i]); > > > > + > > + /* > > +* Decrement the IP for any non-activation frames. > > +* this is required to properly find the srcline > > +* for caller frames. > > +* See also the documentation for dwfl_frame_pc, > > +* which this code tries to replicate. > > +*/ > > + bool frame_is_signal = unw_is_signal_frame() > 0; > > + if (!previous_frame_was_signal && !frame_is_signal) > > + --ips[i]; > > + previous_frame_was_signal = frame_is_signal; > > Does it need to check previous frame too? That's what dwfl_frame_pc does, if I'm not misunderstanding it's source code? Bye -- Milian Wolff | milian.wo...@kdab.com | Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH v2] perf report: fix off-by-one for non-activation frames
On Tue, May 16, 2017 at 10:59:51AM +0200, Milian Wolff wrote: > As the documentation for dwfl_frame_pc says, frames that > are no activation frames need to have their program counter > decremented by one to properly find the function of the caller. > > This fixes many cases where perf report currently attributes > the cost to the next line. I.e. I have code like this: > > ~~~ > #include > #include > > using namespace std; > > int main() > { > this_thread::sleep_for(chrono::milliseconds(1000)); > this_thread::sleep_for(chrono::milliseconds(100)); > this_thread::sleep_for(chrono::milliseconds(10)); > > return 0; > } > ~~~ It'd be nice if the test program has a signal frame for verification. > > Now compile and record it: > > ~~~ > g++ -std=c++11 -g -O2 test.cpp > echo 1 | sudo tee /proc/sys/kernel/sched_schedstats > perf record \ > --event sched:sched_stat_sleep \ > --event sched:sched_process_exit \ > --event sched:sched_switch --call-graph=dwarf \ > --output perf.data.raw \ > ./a.out > echo 0 | sudo tee /proc/sys/kernel/sched_schedstats > perf inject --sched-stat --input perf.data.raw --output perf.data > ~~~ > > Before this patch, the report clearly shows the off-by-one issue. > Most notably, the last sleep invocation is incorrectly attributed > to the "return 0;" line: > > ~~~ > Overhead Source:Line > ... > >100.00% core.c:0 > | > ---__schedule core.c:0 >schedule >do_nanosleep hrtimer.c:0 >hrtimer_nanosleep >sys_nanosleep >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 >__nanosleep_nocancel .:0 >std::this_thread::sleep_for> > thread:323 >| >|--90.08%--main test.cpp:9 >| __libc_start_main >| _start >| >|--9.01%--main test.cpp:10 >| __libc_start_main >| _start >| > --0.91%--main test.cpp:13 > __libc_start_main > _start > ~~~ > > With this patch here applied, the issue is fixed. The report becomes > much more usable: > > ~~~ > Overhead Source:Line > ... > >100.00% core.c:0 > | > ---__schedule core.c:0 >schedule >do_nanosleep hrtimer.c:0 >hrtimer_nanosleep >sys_nanosleep >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 >__nanosleep_nocancel .:0 >std::this_thread::sleep_for > > thread:323 >| >|--90.08%--main test.cpp:8 >| __libc_start_main >| _start >| >|--9.01%--main test.cpp:9 >| __libc_start_main >| _start >| > --0.91%--main test.cpp:10 > __libc_start_main > _start > ~~~ > > Note how this patch fixes this issue for both unwinding methods, i.e. > both dwfl and libunwind. The former case is straight-forward thanks > to dwfl_frame_pc. For libunwind, we replace the functionality via > unw_is_signal_frame for any but the very first frame. > > Cc: Arnaldo Carvalho de Melo > Cc: David Ahern > Cc: Namhyung Kim > Cc: Peter Zijlstra > Cc: Yao Jin > Signed-off-by: Milian Wolff > --- > tools/perf/util/unwind-libdw.c | 6 +- > tools/perf/util/unwind-libunwind-local.c | 14 ++ > 2 files changed, 19 insertions(+), 1 deletion(-) > > v2: > - use unw_is_signal_frame to also fix this issue for libunwind > > diff --git a/tools/perf/util/unwind-libdw.c b/tools/perf/util/unwind-libdw.c > index f90e11a555b2..943a06291587 100644 > --- a/tools/perf/util/unwind-libdw.c > +++ b/tools/perf/util/unwind-libdw.c > @@ -168,12 +168,16 @@ frame_callback(Dwfl_Frame *state, void *arg) > { > struct unwind_info *ui = arg; > Dwarf_Addr pc; > + bool isactivation; > > - if (!dwfl_frame_pc(state, , NULL)) { > + if (!dwfl_frame_pc(state, , )) { > pr_err("%s", dwfl_errmsg(-1)); > return DWARF_CB_ABORT; > } > > + if (!isactivation) > + --pc; > + > return entry(pc, ui) || !(--ui->max_stack) ? > DWARF_CB_ABORT : DWARF_CB_OK; > } > diff --git a/tools/perf/util/unwind-libunwind-local.c > b/tools/perf/util/unwind-libunwind-local.c > index f8455bed6e65..30ab26375c80 100644 >
Re: [PATCH v2] perf report: fix off-by-one for non-activation frames
On Tue, May 16, 2017 at 10:59:51AM +0200, Milian Wolff wrote: > As the documentation for dwfl_frame_pc says, frames that > are no activation frames need to have their program counter > decremented by one to properly find the function of the caller. > > This fixes many cases where perf report currently attributes > the cost to the next line. I.e. I have code like this: > > ~~~ > #include > #include > > using namespace std; > > int main() > { > this_thread::sleep_for(chrono::milliseconds(1000)); > this_thread::sleep_for(chrono::milliseconds(100)); > this_thread::sleep_for(chrono::milliseconds(10)); > > return 0; > } > ~~~ It'd be nice if the test program has a signal frame for verification. > > Now compile and record it: > > ~~~ > g++ -std=c++11 -g -O2 test.cpp > echo 1 | sudo tee /proc/sys/kernel/sched_schedstats > perf record \ > --event sched:sched_stat_sleep \ > --event sched:sched_process_exit \ > --event sched:sched_switch --call-graph=dwarf \ > --output perf.data.raw \ > ./a.out > echo 0 | sudo tee /proc/sys/kernel/sched_schedstats > perf inject --sched-stat --input perf.data.raw --output perf.data > ~~~ > > Before this patch, the report clearly shows the off-by-one issue. > Most notably, the last sleep invocation is incorrectly attributed > to the "return 0;" line: > > ~~~ > Overhead Source:Line > ... > >100.00% core.c:0 > | > ---__schedule core.c:0 >schedule >do_nanosleep hrtimer.c:0 >hrtimer_nanosleep >sys_nanosleep >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 >__nanosleep_nocancel .:0 >std::this_thread::sleep_for > > thread:323 >| >|--90.08%--main test.cpp:9 >| __libc_start_main >| _start >| >|--9.01%--main test.cpp:10 >| __libc_start_main >| _start >| > --0.91%--main test.cpp:13 > __libc_start_main > _start > ~~~ > > With this patch here applied, the issue is fixed. The report becomes > much more usable: > > ~~~ > Overhead Source:Line > ... > >100.00% core.c:0 > | > ---__schedule core.c:0 >schedule >do_nanosleep hrtimer.c:0 >hrtimer_nanosleep >sys_nanosleep >entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 >__nanosleep_nocancel .:0 >std::this_thread::sleep_for > > thread:323 >| >|--90.08%--main test.cpp:8 >| __libc_start_main >| _start >| >|--9.01%--main test.cpp:9 >| __libc_start_main >| _start >| > --0.91%--main test.cpp:10 > __libc_start_main > _start > ~~~ > > Note how this patch fixes this issue for both unwinding methods, i.e. > both dwfl and libunwind. The former case is straight-forward thanks > to dwfl_frame_pc. For libunwind, we replace the functionality via > unw_is_signal_frame for any but the very first frame. > > Cc: Arnaldo Carvalho de Melo > Cc: David Ahern > Cc: Namhyung Kim > Cc: Peter Zijlstra > Cc: Yao Jin > Signed-off-by: Milian Wolff > --- > tools/perf/util/unwind-libdw.c | 6 +- > tools/perf/util/unwind-libunwind-local.c | 14 ++ > 2 files changed, 19 insertions(+), 1 deletion(-) > > v2: > - use unw_is_signal_frame to also fix this issue for libunwind > > diff --git a/tools/perf/util/unwind-libdw.c b/tools/perf/util/unwind-libdw.c > index f90e11a555b2..943a06291587 100644 > --- a/tools/perf/util/unwind-libdw.c > +++ b/tools/perf/util/unwind-libdw.c > @@ -168,12 +168,16 @@ frame_callback(Dwfl_Frame *state, void *arg) > { > struct unwind_info *ui = arg; > Dwarf_Addr pc; > + bool isactivation; > > - if (!dwfl_frame_pc(state, , NULL)) { > + if (!dwfl_frame_pc(state, , )) { > pr_err("%s", dwfl_errmsg(-1)); > return DWARF_CB_ABORT; > } > > + if (!isactivation) > + --pc; > + > return entry(pc, ui) || !(--ui->max_stack) ? > DWARF_CB_ABORT : DWARF_CB_OK; > } > diff --git a/tools/perf/util/unwind-libunwind-local.c > b/tools/perf/util/unwind-libunwind-local.c > index f8455bed6e65..30ab26375c80 100644 > --- a/tools/perf/util/unwind-libunwind-local.c > +++ b/tools/perf/util/unwind-libunwind-local.c > @@ -690,8 +690,22 @@ static int get_entries(struct unwind_info *ui, >
Re: [PATCH] perf report: fix off-by-one for non-activation frames
On Dienstag, 16. Mai 2017 03:57:53 CEST Namhyung Kim wrote: > On Mon, May 15, 2017 at 05:13:06PM +0200, Milian Wolff wrote: > > On Monday, May 15, 2017 5:04:44 PM CEST Milian Wolff wrote: > > > As the documentation for dwfl_frame_pc says, frames that > > > are no activation frames need to have their program counter > > > decremented by one to properly find the function of the caller. > > > > Note that this leaves the perf build against libunwind in the current, > > broken state. I do not know how to detect the activation property there. > > Does anyone else? See elfutils source code for what it is doing: > > > > https://sourceware.org/git/?p=elfutils.git;a=blob;f=libdwfl/ > > dwfl_frame_pc.c;h=296c815b9c73f42d79ac1778d2a0c420b89ee4eb;hb=HEAD > > It seems that you can use unw_is_signal_frame(). Thank you! The v2 patch I just sent now uses that and it seems to work reliably in my quick test. Cheers -- Milian Wolff | milian.wo...@kdab.com | Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH] perf report: fix off-by-one for non-activation frames
On Dienstag, 16. Mai 2017 03:57:53 CEST Namhyung Kim wrote: > On Mon, May 15, 2017 at 05:13:06PM +0200, Milian Wolff wrote: > > On Monday, May 15, 2017 5:04:44 PM CEST Milian Wolff wrote: > > > As the documentation for dwfl_frame_pc says, frames that > > > are no activation frames need to have their program counter > > > decremented by one to properly find the function of the caller. > > > > Note that this leaves the perf build against libunwind in the current, > > broken state. I do not know how to detect the activation property there. > > Does anyone else? See elfutils source code for what it is doing: > > > > https://sourceware.org/git/?p=elfutils.git;a=blob;f=libdwfl/ > > dwfl_frame_pc.c;h=296c815b9c73f42d79ac1778d2a0c420b89ee4eb;hb=HEAD > > It seems that you can use unw_is_signal_frame(). Thank you! The v2 patch I just sent now uses that and it seems to work reliably in my quick test. Cheers -- Milian Wolff | milian.wo...@kdab.com | Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
[PATCH v2] perf report: fix off-by-one for non-activation frames
As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for> thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc. For libunwind, we replace the functionality via unw_is_signal_frame for any but the very first frame. Cc: Arnaldo Carvalho de Melo Cc: David Ahern Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Yao Jin Signed-off-by: Milian Wolff --- tools/perf/util/unwind-libdw.c | 6 +- tools/perf/util/unwind-libunwind-local.c | 14 ++ 2 files changed, 19 insertions(+), 1 deletion(-) v2: - use unw_is_signal_frame to also fix this issue for libunwind diff --git a/tools/perf/util/unwind-libdw.c b/tools/perf/util/unwind-libdw.c index f90e11a555b2..943a06291587 100644 --- a/tools/perf/util/unwind-libdw.c +++ b/tools/perf/util/unwind-libdw.c @@ -168,12 +168,16 @@ frame_callback(Dwfl_Frame *state, void *arg) { struct unwind_info *ui = arg; Dwarf_Addr pc; + bool isactivation; - if (!dwfl_frame_pc(state, , NULL)) { + if (!dwfl_frame_pc(state, , )) { pr_err("%s", dwfl_errmsg(-1)); return DWARF_CB_ABORT; } + if (!isactivation) + --pc; + return entry(pc, ui) || !(--ui->max_stack) ? DWARF_CB_ABORT : DWARF_CB_OK; } diff --git a/tools/perf/util/unwind-libunwind-local.c b/tools/perf/util/unwind-libunwind-local.c index f8455bed6e65..30ab26375c80 100644 --- a/tools/perf/util/unwind-libunwind-local.c +++ b/tools/perf/util/unwind-libunwind-local.c @@ -690,8 +690,22 @@ static int get_entries(struct unwind_info *ui, unwind_entry_cb_t cb, if (ret) display_error(ret); + bool previous_frame_was_signal = false; while (!ret && (unw_step() > 0) && i < max_stack) {
[PATCH v2] perf report: fix off-by-one for non-activation frames
As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ With this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Note how this patch fixes this issue for both unwinding methods, i.e. both dwfl and libunwind. The former case is straight-forward thanks to dwfl_frame_pc. For libunwind, we replace the functionality via unw_is_signal_frame for any but the very first frame. Cc: Arnaldo Carvalho de Melo Cc: David Ahern Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Yao Jin Signed-off-by: Milian Wolff --- tools/perf/util/unwind-libdw.c | 6 +- tools/perf/util/unwind-libunwind-local.c | 14 ++ 2 files changed, 19 insertions(+), 1 deletion(-) v2: - use unw_is_signal_frame to also fix this issue for libunwind diff --git a/tools/perf/util/unwind-libdw.c b/tools/perf/util/unwind-libdw.c index f90e11a555b2..943a06291587 100644 --- a/tools/perf/util/unwind-libdw.c +++ b/tools/perf/util/unwind-libdw.c @@ -168,12 +168,16 @@ frame_callback(Dwfl_Frame *state, void *arg) { struct unwind_info *ui = arg; Dwarf_Addr pc; + bool isactivation; - if (!dwfl_frame_pc(state, , NULL)) { + if (!dwfl_frame_pc(state, , )) { pr_err("%s", dwfl_errmsg(-1)); return DWARF_CB_ABORT; } + if (!isactivation) + --pc; + return entry(pc, ui) || !(--ui->max_stack) ? DWARF_CB_ABORT : DWARF_CB_OK; } diff --git a/tools/perf/util/unwind-libunwind-local.c b/tools/perf/util/unwind-libunwind-local.c index f8455bed6e65..30ab26375c80 100644 --- a/tools/perf/util/unwind-libunwind-local.c +++ b/tools/perf/util/unwind-libunwind-local.c @@ -690,8 +690,22 @@ static int get_entries(struct unwind_info *ui, unwind_entry_cb_t cb, if (ret) display_error(ret); + bool previous_frame_was_signal = false; while (!ret && (unw_step() > 0) && i < max_stack) { unw_get_reg(, UNW_REG_IP, [i]); + + /* +* Decrement the IP for any non-activation frames. +* this is required to
Re: [PATCH] perf report: fix off-by-one for non-activation frames
On Mon, May 15, 2017 at 05:13:06PM +0200, Milian Wolff wrote: > On Monday, May 15, 2017 5:04:44 PM CEST Milian Wolff wrote: > > As the documentation for dwfl_frame_pc says, frames that > > are no activation frames need to have their program counter > > decremented by one to properly find the function of the caller. > > Note that this leaves the perf build against libunwind in the current, broken > state. I do not know how to detect the activation property there. Does anyone > else? See elfutils source code for what it is doing: > > https://sourceware.org/git/?p=elfutils.git;a=blob;f=libdwfl/ > dwfl_frame_pc.c;h=296c815b9c73f42d79ac1778d2a0c420b89ee4eb;hb=HEAD It seems that you can use unw_is_signal_frame(). Thanks, Namhyung
Re: [PATCH] perf report: fix off-by-one for non-activation frames
On Mon, May 15, 2017 at 05:13:06PM +0200, Milian Wolff wrote: > On Monday, May 15, 2017 5:04:44 PM CEST Milian Wolff wrote: > > As the documentation for dwfl_frame_pc says, frames that > > are no activation frames need to have their program counter > > decremented by one to properly find the function of the caller. > > Note that this leaves the perf build against libunwind in the current, broken > state. I do not know how to detect the activation property there. Does anyone > else? See elfutils source code for what it is doing: > > https://sourceware.org/git/?p=elfutils.git;a=blob;f=libdwfl/ > dwfl_frame_pc.c;h=296c815b9c73f42d79ac1778d2a0c420b89ee4eb;hb=HEAD It seems that you can use unw_is_signal_frame(). Thanks, Namhyung
Re: [PATCH] perf report: fix off-by-one for non-activation frames
On Monday, May 15, 2017 5:04:44 PM CEST Milian Wolff wrote: > As the documentation for dwfl_frame_pc says, frames that > are no activation frames need to have their program counter > decremented by one to properly find the function of the caller. Note that this leaves the perf build against libunwind in the current, broken state. I do not know how to detect the activation property there. Does anyone else? See elfutils source code for what it is doing: https://sourceware.org/git/?p=elfutils.git;a=blob;f=libdwfl/ dwfl_frame_pc.c;h=296c815b9c73f42d79ac1778d2a0c420b89ee4eb;hb=HEAD Cheers -- Milian Wolff | milian.wo...@kdab.com | Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH] perf report: fix off-by-one for non-activation frames
On Monday, May 15, 2017 5:04:44 PM CEST Milian Wolff wrote: > As the documentation for dwfl_frame_pc says, frames that > are no activation frames need to have their program counter > decremented by one to properly find the function of the caller. Note that this leaves the perf build against libunwind in the current, broken state. I do not know how to detect the activation property there. Does anyone else? See elfutils source code for what it is doing: https://sourceware.org/git/?p=elfutils.git;a=blob;f=libdwfl/ dwfl_frame_pc.c;h=296c815b9c73f42d79ac1778d2a0c420b89ee4eb;hb=HEAD Cheers -- Milian Wolff | milian.wo...@kdab.com | Software Engineer KDAB (Deutschland) GmbH KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts smime.p7s Description: S/MIME cryptographic signature
[PATCH] perf report: fix off-by-one for non-activation frames
As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for> thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ When compiling perf using libdwfl for unwinding instead of libunwind and having this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Cc: Arnaldo Carvalho de Melo Cc: David Ahern Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Yao Jin Signed-off-by: Milian Wolff --- tools/perf/util/unwind-libdw.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/unwind-libdw.c b/tools/perf/util/unwind-libdw.c index f90e11a555b2..943a06291587 100644 --- a/tools/perf/util/unwind-libdw.c +++ b/tools/perf/util/unwind-libdw.c @@ -168,12 +168,16 @@ frame_callback(Dwfl_Frame *state, void *arg) { struct unwind_info *ui = arg; Dwarf_Addr pc; + bool isactivation; - if (!dwfl_frame_pc(state, , NULL)) { + if (!dwfl_frame_pc(state, , )) { pr_err("%s", dwfl_errmsg(-1)); return DWARF_CB_ABORT; } + if (!isactivation) + --pc; + return entry(pc, ui) || !(--ui->max_stack) ? DWARF_CB_ABORT : DWARF_CB_OK; } -- 2.13.0
[PATCH] perf report: fix off-by-one for non-activation frames
As the documentation for dwfl_frame_pc says, frames that are no activation frames need to have their program counter decremented by one to properly find the function of the caller. This fixes many cases where perf report currently attributes the cost to the next line. I.e. I have code like this: ~~~ #include #include using namespace std; int main() { this_thread::sleep_for(chrono::milliseconds(1000)); this_thread::sleep_for(chrono::milliseconds(100)); this_thread::sleep_for(chrono::milliseconds(10)); return 0; } ~~~ Now compile and record it: ~~~ g++ -std=c++11 -g -O2 test.cpp echo 1 | sudo tee /proc/sys/kernel/sched_schedstats perf record \ --event sched:sched_stat_sleep \ --event sched:sched_process_exit \ --event sched:sched_switch --call-graph=dwarf \ --output perf.data.raw \ ./a.out echo 0 | sudo tee /proc/sys/kernel/sched_schedstats perf inject --sched-stat --input perf.data.raw --output perf.data ~~~ Before this patch, the report clearly shows the off-by-one issue. Most notably, the last sleep invocation is incorrectly attributed to the "return 0;" line: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:9 | __libc_start_main | _start | |--9.01%--main test.cpp:10 | __libc_start_main | _start | --0.91%--main test.cpp:13 __libc_start_main _start ~~~ When compiling perf using libdwfl for unwinding instead of libunwind and having this patch here applied, the issue is fixed. The report becomes much more usable: ~~~ Overhead Source:Line ... 100.00% core.c:0 | ---__schedule core.c:0 schedule do_nanosleep hrtimer.c:0 hrtimer_nanosleep sys_nanosleep entry_SYSCALL_64_fastpath .tmp_entry_64.o:0 __nanosleep_nocancel .:0 std::this_thread::sleep_for > thread:323 | |--90.08%--main test.cpp:8 | __libc_start_main | _start | |--9.01%--main test.cpp:9 | __libc_start_main | _start | --0.91%--main test.cpp:10 __libc_start_main _start ~~~ Cc: Arnaldo Carvalho de Melo Cc: David Ahern Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Yao Jin Signed-off-by: Milian Wolff --- tools/perf/util/unwind-libdw.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/unwind-libdw.c b/tools/perf/util/unwind-libdw.c index f90e11a555b2..943a06291587 100644 --- a/tools/perf/util/unwind-libdw.c +++ b/tools/perf/util/unwind-libdw.c @@ -168,12 +168,16 @@ frame_callback(Dwfl_Frame *state, void *arg) { struct unwind_info *ui = arg; Dwarf_Addr pc; + bool isactivation; - if (!dwfl_frame_pc(state, , NULL)) { + if (!dwfl_frame_pc(state, , )) { pr_err("%s", dwfl_errmsg(-1)); return DWARF_CB_ABORT; } + if (!isactivation) + --pc; + return entry(pc, ui) || !(--ui->max_stack) ? DWARF_CB_ABORT : DWARF_CB_OK; } -- 2.13.0