Re: ftrace performance (sched events): cyclictest shows 25% more latency
On Fri, 6 Jul 2018 08:22:01 +0200 Claudio wrote:

> Hello all,
>
> I have been experimenting with the idea of leaving ftrace enabled, with
> sched events, on production systems.
>
> The main concern that I am having at the moment is about the impact on
> the system. Enabling the sched events that I currently need for the
> tracing application seems to slow down context switches considerably,
> and make the system less responsive.
>
> I have tested with cyclictest on the mainline kernel, and noticed an
> increase of min and avg latencies of around 25%.
>
> Is this expected?
>
> Some initial investigation into ftrace seems to point at the reservation
> and commit of the events into the ring buffer as the highest sources of
> overhead, while copying of event parameters, including COMM, does not
> seem to have any noticeable effect relative to those costs.
>
> I have run the following test 20 times, and thrown away the first
> results:
>
> $ sudo ./cyclictest --smp -p95 -m -s -N -l 10 -q

OK, I just noticed that you are using -N, which means all numbers are in
nanoseconds.

> $ uname -a
> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30
> CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> For brevity, this is a comparison of one test's results. All other test
> results show the same ~25% increase.
>
> On the left side, the run without ftrace sched events; on the right
> side, the run with ftrace sched events enabled.
>
> CPU  Count  Min   Act   Avg   Max     Count  Min-    Act-    Avg-    Max-
>                                              ftrace  ftrace  ftrace  ftrace
>   0     10  2339  2936  2841  139478     10    2900    3182    3566   93056
>   1  66742  2365  3386  2874   93639  66750    2959    3786    3646  154074
>   2  50080  2376  3058  2910  196221  50097    2997    4209    3655   18707
>   3  40076  2394  3461  2931   17914  40091    3006    4417    3750   17159
>   4  33404  2371  3612  2834   15336  33419    2997    3836    3594   23172
>   5  28635  2387  3313  2885   25863  28649    2995    3795    3647    9956
>   6  25058  2384  3428  2968   12162  25071    3051    4366    3719   18151
>   7  22275  2381  2859  2982   10706  22287    3046    5078    3825   10781
>
> I would be thankful for any advice or comments on this, especially with
> the goal in mind of lowering the runtime impact on the system as much
> as possible.

Thus the tracing is causing the wakeup time to be an average of 0.8 us
longer. Yes, that is expected.

-- Steve
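Steve's 0.8 us figure can be checked directly against the Avg columns of the table above. A quick sketch (the per-CPU averages are transcribed from the single run shown, so the exact numbers are illustrative only):

```python
# Per-CPU Avg wakeup latencies in nanoseconds, transcribed from the table
# above: first the run without sched events, then with them enabled.
avg_plain  = [2841, 2874, 2910, 2931, 2834, 2885, 2968, 2982]
avg_ftrace = [3566, 3646, 3655, 3750, 3594, 3647, 3719, 3825]

deltas = [f - p for p, f in zip(avg_plain, avg_ftrace)]
mean_delta_ns = sum(deltas) / len(deltas)
mean_plain_ns = sum(avg_plain) / len(avg_plain)

# Roughly 772 ns of added average latency, i.e. the ~0.8 us / ~25%
# overhead discussed in the thread.
print(f"mean overhead: {mean_delta_ns:.0f} ns, "
      f"{100 * mean_delta_ns / mean_plain_ns:.1f}% of baseline")
```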
Re: ftrace performance (sched events): cyclictest shows 25% more latency
On Mon, 9 Jul 2018 16:53:52 +0200 Claudio wrote:

> One additional data point, based on brute force again:
>
> I applied this change, in order to understand if it was the
> trace_event_raw_event_* functions (I suppose primarily
> trace_event_raw_event_switch) that contained the latency "offenders":
>
> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
> index 4ecdfe2..969467d 100644
> --- a/include/trace/trace_events.h
> +++ b/include/trace/trace_events.h
> @@ -704,6 +704,8 @@ trace_event_raw_event_##call(void *__data, proto)
>  	struct trace_event_raw_##call *entry;			\
>  	int __data_size;					\
>  								\
> +	return;							\
> +								\
>  	if (trace_trigger_soft_disabled(trace_file))		\
>  		return;						\
>  								\
>
> This reduces the latency overhead to 6%, down from 25%.
>
> Maybe obvious? Wanted to share in case it helps; I will dig further.

I noticed that just disabling tracing ("echo 0 > tracing_on") behaves
very similarly. I'm now recording timings of various parts of the code,
but the most I've seen is 12 us, which should not account for the
overhead. So it's triggering something else.

I'll be going on PTO next week, and there are things I must do this
week, thus I may not have much more time to look into this until I get
back from PTO (July 23rd).

-- Steve
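For anyone reproducing Steve's "echo 0 > tracing_on" comparison without patching the kernel, the tracefs knobs involved look like this. This is a sketch only, assuming tracefs is mounted at /sys/kernel/debug/tracing (the path varies by distro) and a root shell:

```shell
# Sketch of the comparison discussed above, not a turnkey script.
cd /sys/kernel/debug/tracing

# Arm the sched trace events and let them write to the ring buffer.
echo 1 > events/sched/enable
echo 1 > tracing_on
./cyclictest --smp -p95 -m -s -N -l 10 -q

# Keep the events armed but drop the ring-buffer writes: per the thread,
# most of the latency overhead remains even in this configuration.
echo 0 > tracing_on
./cyclictest --smp -p95 -m -s -N -l 10 -q

# Disarm the events entirely to get the baseline back.
echo 0 > events/sched/enable
```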
Re: ftrace performance (sched events): cyclictest shows 25% more latency
Hello Steve,

thank you for your response.

On 07/06/2018 11:24 PM, Steven Rostedt wrote:
> On Fri, 6 Jul 2018 08:22:01 +0200 Claudio wrote:
>
>> Hello all,
>>
>> I have been experimenting with the idea of leaving ftrace enabled,
>> with sched events, on production systems.
>>
>> The main concern that I am having at the moment is about the impact
>> on the system. Enabling the sched events that I currently need for
>> the tracing application seems to slow down context switches
>> considerably, and make the system less responsive.
>>
>> I have tested with cyclictest on the mainline kernel, and noticed an
>> increase of min and avg latencies of around 25%.
>>
>> Is this expected?
>
> No it is not. And this is weird. I'm seeing it too.
>
>> Some initial investigation into ftrace seems to point at the
>> reservation and commit of the events into the ring buffer as the
>> highest sources of overhead, while copying of event parameters,
>> including COMM, does not seem to have any noticeable effect relative
>> to those costs.
>
> Did you use perf to see where the issues were? If so, what perf
> command did you use?

I did not use perf; I will now do it and see what I get. You can discard
my comment regarding reservation and commit, it was based on a wrong
premise.

Regarding parameter copying, I just brutally modified the code to not
fill in the fields (TP_fast_assign), and noticed that it did not change
the result significantly.

> When running just:
>
>   perf record ./cyclictest --smp -p95 -m -s -N -l 100 -q
>
> I get this:
>
>    6.08%  cyclictest  [kernel.vmlinux]  [k] syscall_return_via_sysret
>    4.03%  cyclictest  cyclictest        [.] timerthread
>    3.97%  cyclictest  [kernel.vmlinux]  [k] __schedule
>    3.81%  cyclictest  [kernel.vmlinux]  [k] native_write_msr
>    3.13%  cyclictest  [kernel.vmlinux]  [k] trace_save_cmdline
>    2.71%  cyclictest  [kernel.vmlinux]  [k] do_nanosleep
>    2.48%  cyclictest  [kernel.vmlinux]  [k] do_syscall_64
>    2.11%  cyclictest  [kernel.vmlinux]  [k] pick_next_task_fair
>    2.03%  cyclictest  [kernel.vmlinux]  [k] native_sched_clock
>    1.99%  cyclictest  [kernel.vmlinux]  [k] cpuacct_charge
>    1.93%  cyclictest  [vdso]            [.] __vdso_clock_gettime
>    1.86%  cyclictest  [kernel.vmlinux]  [k] hrtimer_nanosleep
>    1.84%  cyclictest  [kernel.vmlinux]  [k] cpupri_set
>    1.66%  cyclictest  [kernel.vmlinux]  [k] update_curr_rt
>    1.65%  cyclictest  [kernel.vmlinux]  [k] __perf_event_task_sched_out
>    1.59%  cyclictest  [kernel.vmlinux]  [k] dequeue_rt_stack
>    1.55%  cyclictest  [kernel.vmlinux]  [k] __rb_reserve_next
>
> Which shows there's a lot being done before we even get to the ring
> buffer reserve.

I get (CONFIG_PREEMPT_VOLUNTARY):

SCHED EVENTS DISABLED:

   8.57%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
   7.04%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
   6.93%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
   4.42%  cyclictest  cyclictest          [.] timerthread
   4.14%  cyclictest  [kernel.vmlinux]    [k] __schedule
   3.52%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
   2.90%  cyclictest  [kernel.vmlinux]    [k] finish_task_switch
   2.42%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
   2.34%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock
   2.28%  cyclictest  [kernel.vmlinux]    [k] ktime_get
   2.14%  cyclictest  [kernel.vmlinux]    [k] timerqueue_add
   2.06%  cyclictest  [kernel.vmlinux]    [k] read_tsc
   2.06%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
   2.01%  cyclictest  [vdso]              [.] __vdso_clock_gettime
   1.95%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
   1.93%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
   1.89%  cyclictest  [vdso]              [.] 0x0977
   1.75%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_rt
   1.64%  cyclictest  [kernel.vmlinux]    [k] x86_pmu_disable
   1.57%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
   1.53%  cyclictest  [kernel.vmlinux]    [k] __x64_sys_nanosleep
   1.51%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
   1.41%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_rt
   1.34%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_dl
   1.28%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_rt
   1.25%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
   1.25%  cyclictest  [kernel.vmlinux]    [k] hrtimer_start_range_ns
   1.24%  cyclictest  [kernel.vmlinux]    [k] __x86_indirect_thunk_rax
   1.07%  cyclictest  libpthread-2.23.so  [.] 0x00010c1b
   0.94%  cyclictest  [kernel.vmlinux]    [k] pull_rt_task
   0.93%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_stage2
   0.89%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
   0.89%  cyclictest  libpthread-2.23.so  [.] 0x00010c1d
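The side-by-side perf comparison above can be collected with commands like the following. A sketch, not from the thread verbatim: the cyclictest and perf flags match those used above, but the output file names are arbitrary and the tracefs path may differ per distro:

```shell
# Profile cyclictest with sched events disabled (baseline)...
echo 0 | sudo tee /sys/kernel/debug/tracing/events/sched/enable
sudo perf record -o perf.nosched.data ./cyclictest --smp -p95 -m -s -N -l 100 -q

# ...and again with the sched events enabled.
echo 1 | sudo tee /sys/kernel/debug/tracing/events/sched/enable
sudo perf record -o perf.sched.data ./cyclictest --smp -p95 -m -s -N -l 100 -q

# Compare the two flat profiles.
sudo perf report -i perf.nosched.data --stdio
sudo perf report -i perf.sched.data --stdio
```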
Re: ftrace performance (sched events): cyclictest shows 25% more latency
On Fri, 6 Jul 2018 17:39:22 -0400 Steven Rostedt wrote:

> On Fri, 6 Jul 2018 17:24:28 -0400 Steven Rostedt wrote:
>
>> I'll investigate further.
>
> Note, I enabled function tracing to trace the scheduler function:
>
>   # trace-cmd start -p function -l schedule
>
> And then ran cyclictest. That does not cause any noticeable increase
> in latency (try it). So there is some kind of side effect with the
> trace event itself, and not the recording of the event.

It does add some noise but not nearly as much as a trace event. But,
then I did function graph tracing.

First I did:

  trace-cmd record -p function_graph -l schedule \
      /work/git-local/rt-tests.git/cyclictest --smp -p95 -m -s -N -l 10 -q

And got this:

     <idle>-0     [000]  3034.534021: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
  trace-cmd-1693  [000]  3034.534029: funcgraph_entry:               |  schedule() {
  trace-cmd-1693  [000]  3034.534043: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
     <idle>-0     [000]  3034.535084: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
  trace-cmd-1693  [000]  3034.535086: funcgraph_exit:  # 1056.055 us |  }
  trace-cmd-1693  [000]  3034.535090: funcgraph_entry:               |  schedule() {
  trace-cmd-1693  [000]  3034.535093: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
     <idle>-0     [000]  3034.536146: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
  trace-cmd-1693  [000]  3034.536147: funcgraph_exit:  # 1056.563 us |  }
  trace-cmd-1693  [000]  3034.536150: funcgraph_entry:               |  schedule() {
  trace-cmd-1693  [000]  3034.536153: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
     <idle>-0     [000]  3034.537206: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
  trace-cmd-1693  [000]  3034.537207: funcgraph_exit:  # 1056.107 us |  }
  trace-cmd-1693  [000]  3034.537210: funcgraph_entry:               |  schedule() {
  trace-cmd-1693  [000]  3034.537213: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
     <idle>-0     [000]  3034.538262: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
  trace-cmd-1693  [000]  3034.538263: funcgraph_exit:  # 1053.077 us |  }
  trace-cmd-1693  [000]  3034.538265: funcgraph_entry:               |  schedule() {
  trace-cmd-1693  [000]  3034.538266: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
     <idle>-0     [000]  3034.538554: sched_switch:    swapper/0:0 [120] S ==> cyclictest:1698 [120]
  cyclictest-1698 [000]  3034.538565: funcgraph_entry:               |  schedule() {
  cyclictest-1698 [000]  3034.538566: sched_switch:    cyclictest:1698 [4] D ==> swapper/0:0 [120]
     <idle>-0     [000]  3034.539318: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
  trace-cmd-1693  [000]  3034.539319: funcgraph_exit:  # 1053.723 us |  }
  trace-cmd-1693  [000]  3034.539322: funcgraph_entry:               |  schedule() {
  trace-cmd-1693  [000]  3034.539324: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
     <idle>-0     [000]  3034.539567: sched_switch:    swapper/0:0 [120] S ==> cyclictest:1698 [4]
  cyclictest-1698 [000]  3034.539567: funcgraph_exit:  # 1002.624 us |  }

And that continued on as normal. Then I ran it like this:

  trace-cmd record -p function_graph -l schedule \
      /work/git-local/rt-tests.git/cyclictest --smp -p95 -m -s -N -l 10 -q

And this was my result:

  trace-cmd-1749  [000]  3321.110772: funcgraph_entry:               |  schedule() {
  trace-cmd-1749  [000]  3321.111826: funcgraph_exit:  # 1053.004 us |  }
  trace-cmd-1749  [000]  3321.111830: funcgraph_entry:               |  schedule() {
  trace-cmd-1749  [000]  3321.112885: funcgraph_exit:  # 1054.734 us |  }
  trace-cmd-1749  [000]  3321.112888: funcgraph_entry:               |  schedule() {
  cyclictest-1756 [000]  3321.113930: funcgraph_entry:               |  schedule() {
  trace-cmd-1749  [000]  3321.113942: funcgraph_exit:  # 1054.240 us |  }
  trace-cmd-1749  [000]  3321.113947: funcgraph_entry:               |  schedule() {
  cyclictest-1756 [000]  3321.114933: funcgraph_exit:  # 1002.600 us |  }
  cyclictest-1756 [000]  3321.114935: funcgraph_entry:               |  schedule() {
  trace-cmd-1749  [000]  3321.115000: funcgraph_exit:  # 1053.451 us |  }
  trace-cmd-1749  [000]  3321.115004: funcgraph_entry:               |  schedule() {
  cyclictest-1756 [000]  3321.115939: funcgraph_exit:  # 1003.116 us |  }
  cyclictest-1756 [000]  3321.115941: funcgraph_entry:               |
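When eyeballing excerpts like the two above gets tedious, the funcgraph_exit durations can be extracted mechanically. A small sketch; the parsing helper is mine, not from the thread, and the sample lines are copied from the excerpt:

```python
import re

# Three funcgraph_exit lines in the format shown above.
trace = """\
 trace-cmd-1749  [000]  3321.111826: funcgraph_exit:  # 1053.004 us |  }
 trace-cmd-1749  [000]  3321.112885: funcgraph_exit:  # 1054.734 us |  }
 cyclictest-1756 [000]  3321.114933: funcgraph_exit:  # 1002.600 us |  }
"""

# funcgraph_exit reports long call durations as "# <usecs> us".
durations = [float(m.group(1))
             for m in re.finditer(r"funcgraph_exit:\s+#\s+([\d.]+)\s+us", trace)]

print(f"{len(durations)} schedule() calls, max {max(durations):.3f} us, "
      f"avg {sum(durations) / len(durations):.3f} us")
```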
Re: ftrace performance (sched events): cyclictest shows 25% more latency
On Fri, 6 Jul 2018 17:24:28 -0400 Steven Rostedt wrote:

> I'll investigate further.

Note, I enabled function tracing to trace the scheduler function:

  # trace-cmd start -p function -l schedule

And then ran cyclictest. That does not cause any noticeable increase in
latency (try it). So there is some kind of side effect with the trace
event itself, and not the recording of the event.

-- Steve
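[The isolation experiment described above can also be reproduced with plain
tracefs writes instead of trace-cmd. A rough sketch; it assumes a root shell,
tracefs mounted at /sys/kernel/tracing (older kernels use
/sys/kernel/debug/tracing), and cyclictest in the current directory:]

```shell
cd /sys/kernel/tracing

# Case 1: function tracer limited to schedule() -- reportedly no
# noticeable latency increase.
echo schedule > set_ftrace_filter
echo function > current_tracer
./cyclictest --smp -p95 -m -s -N -l 10 -q
echo nop > current_tracer

# Case 2: sched trace events enabled -- where the ~25% increase shows up.
echo 1 > events/sched/enable
./cyclictest --smp -p95 -m -s -N -l 10 -q
echo 0 > events/sched/enable
```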
Re: ftrace performance (sched events): cyclictest shows 25% more latency
On Fri, 6 Jul 2018 08:22:01 +0200 Claudio wrote:

> Hello all,
>
> I have been experimenting with the idea of leaving ftrace enabled, with
> sched events, on production systems.
>
> The main concern that I am having at the moment is about the impact on
> the system. Enabling the sched events that I currently need for the
> tracing application seems to slow down context switches considerably,
> and make the system less responsive.
>
> I have tested with cyclictest on the mainline kernel, and noticed an
> increase of min, avg latencies of around 25%.
>
> Is this expected?

No it is not. And this is weird. I'm seeing it too.

> Some initial investigation into ftrace seems to point at the
> reservation and commit of the events into the ring buffer as the
> highest sources of overhead, while event parameter copying, including
> COMM, does not seem to have any noticeable effect relative to those
> costs.

Did you use perf to see where the issues were? If so, what perf command
did you use? When running just:

  perf record ./cyclictest --smp -p95 -m -s -N -l 100 -q

I get this:

   6.08%  cyclictest  [kernel.vmlinux]  [k] syscall_return_via_sysret
   4.03%  cyclictest  cyclictest        [.] timerthread
   3.97%  cyclictest  [kernel.vmlinux]  [k] __schedule
   3.81%  cyclictest  [kernel.vmlinux]  [k] native_write_msr
   3.13%  cyclictest  [kernel.vmlinux]  [k] trace_save_cmdline
   2.71%  cyclictest  [kernel.vmlinux]  [k] do_nanosleep
   2.48%  cyclictest  [kernel.vmlinux]  [k] do_syscall_64
   2.11%  cyclictest  [kernel.vmlinux]  [k] pick_next_task_fair
   2.03%  cyclictest  [kernel.vmlinux]  [k] native_sched_clock
   1.99%  cyclictest  [kernel.vmlinux]  [k] cpuacct_charge
   1.93%  cyclictest  [vdso]            [.] __vdso_clock_gettime
   1.86%  cyclictest  [kernel.vmlinux]  [k] hrtimer_nanosleep
   1.84%  cyclictest  [kernel.vmlinux]  [k] cpupri_set
   1.66%  cyclictest  [kernel.vmlinux]  [k] update_curr_rt
   1.65%  cyclictest  [kernel.vmlinux]  [k] __perf_event_task_sched_out
   1.59%  cyclictest  [kernel.vmlinux]  [k] dequeue_rt_stack
   1.55%  cyclictest  [kernel.vmlinux]  [k] __rb_reserve_next

Which shows there's a lot being done before we even get to the ring
buffer reserve. One thing that I notice too is that trace_save_cmdline
also gets activated when enabling trace events. Perhaps I should try to
optimize that more.

> I have been running the following test 20 times, and thrown away the
> first results:
>
>   $ sudo ./cyclictest --smp -p95 -m -s -N -l 10 -q
>
>   $ uname -a
>   Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> For brevity, this is a comparison of one test's results. All other test
> results show the same ~25% increase.
>
> On the left side, the run without ftrace sched events; on the right
> side, with ftrace sched events enabled.
>
>  CPU  Count  Min   Act   Avg   Max     Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
>  0    10     2339  2936  2841  139478  10     2900        3182        3566        93056
>  1    66742  2365  3386  2874  93639   66750  2959        3786        3646        154074
>  2    50080  2376  3058  2910  196221  50097  2997        4209        3655        18707
>  3    40076  2394  3461  2931  17914   40091  3006        4417        3750        17159
>  4    33404  2371  3612  2834  15336   33419  2997        3836        3594        23172
>  5    28635  2387  3313  2885  25863   28649  2995        3795        3647        9956
>  6    25058  2384  3428  2968  12162   25071  3051        4366        3719        18151
>  7    22275  2381  2859  2982  10706   22287  3046        5078        3825        10781
>
> I would be thankful for any advice or comments on this, especially with
> the goal in mind of lowering as much as possible the runtime impact on
> the system.

But still, we use all sorts of trace events when dealing with real time,
and trace events hardly show anything in the PREEMPT_RT kernel, where
cyclictest shows 20 microseconds, and sched events hardly touch that.
But I see this weird side effect on a distro kernel too. Maybe there's
something weird going on when we have CONFIG_PREEMPT_VOLUNTARY set :-/

I'll investigate further. Thanks for the report.

-- Steve
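[The overhead reported in the table above can be checked directly from the
Avg columns (cyclictest's -N makes all numbers nanoseconds). A quick sketch
with awk, using the eight per-CPU Avg values from the table:]

```shell
# Per-CPU "Avg" latencies in ns, copied from the cyclictest table:
# column 1 = without sched events, column 2 = with sched events.
printf '%s\n' \
  '2841 3566' '2874 3646' '2910 3655' '2931 3750' \
  '2834 3594' '2885 3647' '2968 3719' '2982 3825' |
awk '{ base += $1; traced += $2 }
     END {
       delta = (traced - base) / NR
       printf "avg delta: %.0f ns (%.1f%% increase)\n",
              delta, 100 * (traced - base) / base
     }'
```

This prints `avg delta: 772 ns (26.6% increase)`, consistent with the
"~0.8 us longer wakeup" and "~25% more latency" figures discussed in the
thread.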