Re: ftrace performance (sched events): cyclictest shows 25% more latency

2018-07-09 Thread Steven Rostedt
On Fri, 6 Jul 2018 08:22:01 +0200
Claudio  wrote:

> Hello all,
> 
> I have been experimenting with the idea of leaving ftrace enabled, with sched 
> events,
> on production systems.
> 
> The main concern that I am having at the moment is about the impact on the 
> system.
> Enabling the sched events that I currently need for the tracing application
> seems to slow down context-switches considerably, and make the system less 
> responsive.
> 
> I have tested with cyclictest on the mainline kernel, and noticed an increase 
> of min, avg latencies of around 25%.
> 
> Is this expected?
> 
> Some initial investigation into ftrace seems to point at the reservation and 
> commit of the events into the ring buffer
> as the highest sources of overhead, while event parameters copying, including 
> COMM, does not seem to have any noticeable effect
> relative to those costs.
> 
> I have run the following test 20 times, and thrown away the first results:
> 
> $ sudo ./cyclictest --smp -p95 -m -s -N -l 10 -q

OK, I just noticed that you are using -N which means all numbers are in
nanoseconds.
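
For reference, the "ftrace sched events enabled" runs compared below are typically set up through tracefs; a minimal sketch, assuming it is mounted at /sys/kernel/debug/tracing (the exact set of sched events used in these runs is not shown in the thread):

  # echo 1 > /sys/kernel/debug/tracing/events/sched/enable
  # echo 1 > /sys/kernel/debug/tracing/tracing_on
  $ sudo ./cyclictest --smp -p95 -m -s -N -l 10 -q
  # echo 0 > /sys/kernel/debug/tracing/events/sched/enable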

> 
> $ uname -a
> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 
> 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> For brevity, this is a comparison of one test's results. All other test 
> results show the same ~25% increase.
> 
> On the left side, the run without ftrace sched events, on the right side with 
> ftrace sched events enabled.
> 
> CPU   Count   Min    Act    Avg    Max      Count   Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
> 0     10      2339   2936   2841   139478   10      2900        3182        3566        93056
> 1     66742   2365   3386   2874   93639    66750   2959        3786        3646        154074
> 2     50080   2376   3058   2910   196221   50097   2997        4209        3655        18707
> 3     40076   2394   3461   2931   17914    40091   3006        4417        3750        17159
> 4     33404   2371   3612   2834   15336    33419   2997        3836        3594        23172
> 5     28635   2387   3313   2885   25863    28649   2995        3795        3647        9956
> 6     25058   2384   3428   2968   12162    25071   3051        4366        3719        18151
> 7     22275   2381   2859   2982   10706    22287   3046        5078        3825        10781
> 
> I would be thankful for any advice or comments on this,
> especially with the goal in mind to lower as much as possible the runtime 
> impact on the system.

Thus, the tracing is causing the wakeup time to be an average of 0.8us
longer.
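
(From the Avg columns above: CPU 1, for example, goes from 2874 ns to 3646 ns, a 772 ns difference, and the other CPUs show roughly 720-840 ns more; the same numbers give the ~25% figure, e.g. 3646 / 2874 ≈ 1.27.)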

Yes that is expected.

-- Steve


Re: ftrace performance (sched events): cyclictest shows 25% more latency

2018-07-09 Thread Steven Rostedt
On Mon, 9 Jul 2018 16:53:52 +0200
Claudio  wrote:

> 
> One additional data point, based on brute force again:
> 
> I applied this change, in order to understand if it was the
> 
> trace_event_raw_event_* (I suppose primarily trace_event_raw_event_switch)
> 
> that contained the latency "offenders":
> 
> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
> index 4ecdfe2..969467d 100644
> --- a/include/trace/trace_events.h
> +++ b/include/trace/trace_events.h
> @@ -704,6 +704,8 @@ trace_event_raw_event_##call(void *__data, proto)        \
>       struct trace_event_raw_##call *entry;                           \
>       int __data_size;                                                \
>                                                                       \
> +     return;                                                         \
> +                                                                     \
>       if (trace_trigger_soft_disabled(trace_file))                    \
>               return;                                                 \
>                                                                       \
> 
> This reduces the latency overhead to 6% down from 25%.
> 
> Maybe obvious? Wanted to share in case it helps, and will dig further.

I noticed that just disabling tracing ("echo 0 > tracing_on") behaves very
similarly. I'm now recording timings of various parts of the code, but the
most I've seen is 12us, which should not add the overhead. So it's
triggering something else.
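
For reference, the tracing_on toggle referred to here, with the sched events left enabled, looks roughly like this through tracefs (assuming the default mount point); with tracing_on set to 0 the events stay armed but the ring buffer writes are skipped:

  # echo 1 > /sys/kernel/debug/tracing/events/sched/enable
  # echo 0 > /sys/kernel/debug/tracing/tracing_on
  $ sudo ./cyclictest --smp -p95 -m -s -N -l 10 -q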

I'll be going on PTO next week, and there are things I must do this week,
so I may not have much more time to look into this until I get back
from PTO (July 23rd).

-- Steve


Re: ftrace performance (sched events): cyclictest shows 25% more latency

2018-07-09 Thread Claudio



On 07/09/2018 12:06 PM, Claudio wrote:
> Hello Steve,
> 
> thank you for your response,
> 
> On 07/06/2018 11:24 PM, Steven Rostedt wrote:
>> On Fri, 6 Jul 2018 08:22:01 +0200
>> Claudio  wrote:
>>
>>> Hello all,
>>>
>>> I have been experimenting with the idea of leaving ftrace enabled, with 
>>> sched events,
>>> on production systems.
>>>
>>> The main concern that I am having at the moment is about the impact on the 
>>> system.
>>> Enabling the sched events that I currently need for the tracing application
>>> seems to slow down context-switches considerably, and make the system less 
>>> responsive.
>>>
>>> I have tested with cyclictest on the mainline kernel, and noticed an 
>>> increase of min, avg latencies of around 25%.
>>>
>>> Is this expected?
>>
>> No it is not. And this is weird. I'm seeing it too.
>>
>>>
>>> Some initial investigation into ftrace seems to point at the reservation 
>>> and commit of the events into the ring buffer
>>> as the highest sources of overhead, while event parameters copying, 
>>> including COMM, does not seem to have any noticeable effect
>>> relative to those costs.
>>
>> Did you use perf to see where the issues were? If so what perf command
>> did you use?
> 
> I did not use perf, I will now do it and see what I get.
> You can discard my comment regarding reservation and commit, it was based on 
> a wrong premise.
> 
> Regarding parameter copying, I just brutally modified the code to not fill 
> fields (TP_fast_assign),
> and I just noticed that it did not change the result significantly.
> 
>>
>> When running just:
>>   perf record ./cyclictest --smp -p95 -m -s -N -l 100 -q
>>
>> I get this:
>>
>>6.08%  cyclictest  [kernel.vmlinux][k] syscall_return_via_sysret
>>4.03%  cyclictest  cyclictest  [.] timerthread
>>3.97%  cyclictest  [kernel.vmlinux][k] __schedule
>>3.81%  cyclictest  [kernel.vmlinux][k] native_write_msr
>>3.13%  cyclictest  [kernel.vmlinux][k] trace_save_cmdline
>>2.71%  cyclictest  [kernel.vmlinux][k] do_nanosleep
>>2.48%  cyclictest  [kernel.vmlinux][k] do_syscall_64
>>2.11%  cyclictest  [kernel.vmlinux][k] pick_next_task_fair
>>2.03%  cyclictest  [kernel.vmlinux][k] native_sched_clock
>>1.99%  cyclictest  [kernel.vmlinux][k] cpuacct_charge
>>1.93%  cyclictest  [vdso]  [.] __vdso_clock_gettime
>>1.86%  cyclictest  [kernel.vmlinux][k] hrtimer_nanosleep
>>1.84%  cyclictest  [kernel.vmlinux][k] cpupri_set
>>1.66%  cyclictest  [kernel.vmlinux][k] update_curr_rt
>>1.65%  cyclictest  [kernel.vmlinux][k] __perf_event_task_sched_out
>>1.59%  cyclictest  [kernel.vmlinux][k] dequeue_rt_stack
>>1.55%  cyclictest  [kernel.vmlinux][k] __rb_reserve_next
>>
>> Which shows there's a lot being done before we even get to the ring
>> buffer reserve.
> 
> I get (CONFIG_PREEMPT_VOLUNTARY):
> 
> SCHED EVENTS DISABLED:
> 
>  8.57%  cyclictest  [kernel.vmlinux][k] syscall_return_via_sysret
>  7.04%  cyclictest  [kernel.vmlinux][k] native_write_msr
>  6.93%  cyclictest  [kernel.vmlinux][k] entry_SYSCALL_64_trampoline
>  4.42%  cyclictest  cyclictest  [.] timerthread
>  4.14%  cyclictest  [kernel.vmlinux][k] __schedule
>  3.52%  cyclictest  [kernel.vmlinux][k] cpupri_set
>  2.90%  cyclictest  [kernel.vmlinux][k] finish_task_switch
>  2.42%  cyclictest  [kernel.vmlinux][k] update_curr_rt
>  2.34%  cyclictest  [kernel.vmlinux][k] _raw_spin_lock
>  2.28%  cyclictest  [kernel.vmlinux][k] ktime_get
>  2.14%  cyclictest  [kernel.vmlinux][k] timerqueue_add
>  2.06%  cyclictest  [kernel.vmlinux][k] read_tsc
>  2.06%  cyclictest  [kernel.vmlinux][k] native_sched_clock
>  2.01%  cyclictest  [vdso]  [.] __vdso_clock_gettime
>  1.95%  cyclictest  [kernel.vmlinux][k] do_nanosleep
>  1.93%  cyclictest  [kernel.vmlinux][k] pick_next_task_fair
>  1.89%  cyclictest  [vdso]  [.] 0x0977
>  1.75%  cyclictest  [kernel.vmlinux][k] dequeue_task_rt
>  1.64%  cyclictest  [kernel.vmlinux][k] x86_pmu_disable
>  1.57%  cyclictest  [kernel.vmlinux][k] dequeue_rt_stack
>  1.53%  cyclictest  [kernel.vmlinux][k] __x64_sys_nanosleep
>  1.51%  cyclictest  [kernel.vmlinux][k] do_syscall_64
>  1.41%  cyclictest  [kernel.vmlinux][k] put_prev_task_rt
>  1.34%  cyclictest  [kernel.vmlinux][k] pick_next_task_dl
>  1.28%  cyclictest  [kernel.vmlinux][k] pick_next_task_rt
>  1.25%  cyclictest  [kernel.vmlinux][k] cpuacct_charge
>  1.25%  cyclictest  [kernel.vmlinux][k] hrtimer_start_range_ns
>  1.24%  cyclictest  [kernel.vmlinux][k] __x86_indirect_thunk_rax
>  1.07%  cyclictest  libpthread-2.23.so  [.] 0x00010c1b
>  0.94%  cyclictest  [kernel.vmlinux][k] pull_rt_task
>  0.93%  cyclictest  

Re: ftrace performance (sched events): cyclictest shows 25% more latency

2018-07-09 Thread Claudio
Hello Steve,

thank you for your response,

On 07/06/2018 11:24 PM, Steven Rostedt wrote:
> On Fri, 6 Jul 2018 08:22:01 +0200
> Claudio  wrote:
> 
>> Hello all,
>>
>> I have been experimenting with the idea of leaving ftrace enabled, with 
>> sched events,
>> on production systems.
>>
>> The main concern that I am having at the moment is about the impact on the 
>> system.
>> Enabling the sched events that I currently need for the tracing application
>> seems to slow down context-switches considerably, and make the system less 
>> responsive.
>>
>> I have tested with cyclictest on the mainline kernel, and noticed an 
>> increase of min, avg latencies of around 25%.
>>
>> Is this expected?
> 
> No it is not. And this is weird. I'm seeing it too.
> 
>>
>> Some initial investigation into ftrace seems to point at the reservation and 
>> commit of the events into the ring buffer
>> as the highest sources of overhead, while event parameters copying, 
>> including COMM, does not seem to have any noticeable effect
>> relative to those costs.
> 
> Did you use perf to see where the issues were? If so what perf command
> did you use?

I did not use perf, I will now do it and see what I get.
You can discard my comment regarding reservation and commit, it was based on a 
wrong premise.

Regarding parameter copying, I just brutally modified the code to not fill 
fields (TP_fast_assign),
and I just noticed that it did not change the result significantly.

> 
> When running just:
>   perf record ./cyclictest --smp -p95 -m -s -N -l 100 -q
> 
> I get this:
> 
>6.08%  cyclictest  [kernel.vmlinux][k] syscall_return_via_sysret
>4.03%  cyclictest  cyclictest  [.] timerthread
>3.97%  cyclictest  [kernel.vmlinux][k] __schedule
>3.81%  cyclictest  [kernel.vmlinux][k] native_write_msr
>3.13%  cyclictest  [kernel.vmlinux][k] trace_save_cmdline
>2.71%  cyclictest  [kernel.vmlinux][k] do_nanosleep
>2.48%  cyclictest  [kernel.vmlinux][k] do_syscall_64
>2.11%  cyclictest  [kernel.vmlinux][k] pick_next_task_fair
>2.03%  cyclictest  [kernel.vmlinux][k] native_sched_clock
>1.99%  cyclictest  [kernel.vmlinux][k] cpuacct_charge
>1.93%  cyclictest  [vdso]  [.] __vdso_clock_gettime
>1.86%  cyclictest  [kernel.vmlinux][k] hrtimer_nanosleep
>1.84%  cyclictest  [kernel.vmlinux][k] cpupri_set
>1.66%  cyclictest  [kernel.vmlinux][k] update_curr_rt
>1.65%  cyclictest  [kernel.vmlinux][k] __perf_event_task_sched_out
>1.59%  cyclictest  [kernel.vmlinux][k] dequeue_rt_stack
>1.55%  cyclictest  [kernel.vmlinux][k] __rb_reserve_next
> 
> Which shows there's a lot being done before we even get to the ring
> buffer reserve.

I get (CONFIG_PREEMPT_VOLUNTARY):

SCHED EVENTS DISABLED:

 8.57%  cyclictest  [kernel.vmlinux][k] syscall_return_via_sysret
 7.04%  cyclictest  [kernel.vmlinux][k] native_write_msr
 6.93%  cyclictest  [kernel.vmlinux][k] entry_SYSCALL_64_trampoline
 4.42%  cyclictest  cyclictest  [.] timerthread
 4.14%  cyclictest  [kernel.vmlinux][k] __schedule
 3.52%  cyclictest  [kernel.vmlinux][k] cpupri_set
 2.90%  cyclictest  [kernel.vmlinux][k] finish_task_switch
 2.42%  cyclictest  [kernel.vmlinux][k] update_curr_rt
 2.34%  cyclictest  [kernel.vmlinux][k] _raw_spin_lock
 2.28%  cyclictest  [kernel.vmlinux][k] ktime_get
 2.14%  cyclictest  [kernel.vmlinux][k] timerqueue_add
 2.06%  cyclictest  [kernel.vmlinux][k] read_tsc
 2.06%  cyclictest  [kernel.vmlinux][k] native_sched_clock
 2.01%  cyclictest  [vdso]  [.] __vdso_clock_gettime
 1.95%  cyclictest  [kernel.vmlinux][k] do_nanosleep
 1.93%  cyclictest  [kernel.vmlinux][k] pick_next_task_fair
 1.89%  cyclictest  [vdso]  [.] 0x0977
 1.75%  cyclictest  [kernel.vmlinux][k] dequeue_task_rt
 1.64%  cyclictest  [kernel.vmlinux][k] x86_pmu_disable
 1.57%  cyclictest  [kernel.vmlinux][k] dequeue_rt_stack
 1.53%  cyclictest  [kernel.vmlinux][k] __x64_sys_nanosleep
 1.51%  cyclictest  [kernel.vmlinux][k] do_syscall_64
 1.41%  cyclictest  [kernel.vmlinux][k] put_prev_task_rt
 1.34%  cyclictest  [kernel.vmlinux][k] pick_next_task_dl
 1.28%  cyclictest  [kernel.vmlinux][k] pick_next_task_rt
 1.25%  cyclictest  [kernel.vmlinux][k] cpuacct_charge
 1.25%  cyclictest  [kernel.vmlinux][k] hrtimer_start_range_ns
 1.24%  cyclictest  [kernel.vmlinux][k] __x86_indirect_thunk_rax
 1.07%  cyclictest  libpthread-2.23.so  [.] 0x00010c1b
 0.94%  cyclictest  [kernel.vmlinux][k] pull_rt_task
 0.93%  cyclictest  [kernel.vmlinux][k] entry_SYSCALL_64_stage2
 0.89%  cyclictest  [kernel.vmlinux][k] hrtimer_nanosleep
 0.89%  cyclictest  libpthread-2.23.so  [.] 0x00010c1d
 

Re: ftrace performance (sched events): cyclictest shows 25% more latency

2018-07-06 Thread Steven Rostedt
On Fri, 6 Jul 2018 17:39:22 -0400
Steven Rostedt  wrote:

> On Fri, 6 Jul 2018 17:24:28 -0400
> Steven Rostedt  wrote:
> 
> > I'll investigate further.  
> 
> Note, I enabled function tracing to trace the scheduler function:
> 
>  # trace-cmd start -p function -l schedule
> 
> And then ran cyclictest. That does not cause any noticeable increase
> in latency (try it). So there is some kind of side effect with the
> trace event itself, and not the recording of the event.

It does add some noise but not nearly as much as a trace event. But,
then I did function graph tracing:

First I did:

  trace-cmd record -p function_graph -l schedule /work/git-local/rt-tests.git/cyclictest --smp -p95 -m -s -N -l 10 -q

And got this:

      <idle>-0       [000]  3034.534021: sched_switch: swapper/0:0 [120] S ==> trace-cmd:1693 [120]
   trace-cmd-1693    [000]  3034.534029: funcgraph_entry:                   |  schedule() {
   trace-cmd-1693    [000]  3034.534043: sched_switch: trace-cmd:1693 [120] D ==> swapper/0:0 [120]
      <idle>-0       [000]  3034.535084: sched_switch: swapper/0:0 [120] S ==> trace-cmd:1693 [120]
   trace-cmd-1693    [000]  3034.535086: funcgraph_exit:   # 1056.055 us |  }
   trace-cmd-1693    [000]  3034.535090: funcgraph_entry:                   |  schedule() {
   trace-cmd-1693    [000]  3034.535093: sched_switch: trace-cmd:1693 [120] D ==> swapper/0:0 [120]
      <idle>-0       [000]  3034.536146: sched_switch: swapper/0:0 [120] S ==> trace-cmd:1693 [120]
   trace-cmd-1693    [000]  3034.536147: funcgraph_exit:   # 1056.563 us |  }
   trace-cmd-1693    [000]  3034.536150: funcgraph_entry:                   |  schedule() {
   trace-cmd-1693    [000]  3034.536153: sched_switch: trace-cmd:1693 [120] D ==> swapper/0:0 [120]
      <idle>-0       [000]  3034.537206: sched_switch: swapper/0:0 [120] S ==> trace-cmd:1693 [120]
   trace-cmd-1693    [000]  3034.537207: funcgraph_exit:   # 1056.107 us |  }
   trace-cmd-1693    [000]  3034.537210: funcgraph_entry:                   |  schedule() {
   trace-cmd-1693    [000]  3034.537213: sched_switch: trace-cmd:1693 [120] D ==> swapper/0:0 [120]
      <idle>-0       [000]  3034.538262: sched_switch: swapper/0:0 [120] S ==> trace-cmd:1693 [120]
   trace-cmd-1693    [000]  3034.538263: funcgraph_exit:   # 1053.077 us |  }
   trace-cmd-1693    [000]  3034.538265: funcgraph_entry:                   |  schedule() {
   trace-cmd-1693    [000]  3034.538266: sched_switch: trace-cmd:1693 [120] D ==> swapper/0:0 [120]
      <idle>-0       [000]  3034.538554: sched_switch: swapper/0:0 [120] S ==> cyclictest:1698 [120]
  cyclictest-1698    [000]  3034.538565: funcgraph_entry:                   |  schedule() {
  cyclictest-1698    [000]  3034.538566: sched_switch: cyclictest:1698 [4] D ==> swapper/0:0 [120]
      <idle>-0       [000]  3034.539318: sched_switch: swapper/0:0 [120] S ==> trace-cmd:1693 [120]
   trace-cmd-1693    [000]  3034.539319: funcgraph_exit:   # 1053.723 us |  }
   trace-cmd-1693    [000]  3034.539322: funcgraph_entry:                   |  schedule() {
   trace-cmd-1693    [000]  3034.539324: sched_switch: trace-cmd:1693 [120] D ==> swapper/0:0 [120]
      <idle>-0       [000]  3034.539567: sched_switch: swapper/0:0 [120] S ==> cyclictest:1698 [4]
  cyclictest-1698    [000]  3034.539567: funcgraph_exit:   # 1002.624 us |  }

And that continued on as normal.

Then I ran it like this:

  trace-cmd record -p function_graph -l schedule /work/git-local/rt-tests.git/cyclictest --smp -p95 -m -s -N -l 10 -q


And this was my result:

   trace-cmd-1749    [000]  3321.110772: funcgraph_entry:                   |  schedule() {
   trace-cmd-1749    [000]  3321.111826: funcgraph_exit:   # 1053.004 us |  }
   trace-cmd-1749    [000]  3321.111830: funcgraph_entry:                   |  schedule() {
   trace-cmd-1749    [000]  3321.112885: funcgraph_exit:   # 1054.734 us |  }
   trace-cmd-1749    [000]  3321.112888: funcgraph_entry:                   |  schedule() {
  cyclictest-1756    [000]  3321.113930: funcgraph_entry:                   |  schedule() {
   trace-cmd-1749    [000]  3321.113942: funcgraph_exit:   # 1054.240 us |  }
   trace-cmd-1749    [000]  3321.113947: funcgraph_entry:                   |  schedule() {
  cyclictest-1756    [000]  3321.114933: funcgraph_exit:   # 1002.600 us |  }
  cyclictest-1756    [000]  3321.114935: funcgraph_entry:                   |  schedule() {
   trace-cmd-1749    [000]  3321.115000: funcgraph_exit:   # 1053.451 us |  }
   trace-cmd-1749    [000]  3321.115004: funcgraph_entry:                   |  schedule() {
  cyclictest-1756    [000]  3321.115939: funcgraph_exit:   # 1003.116 us |  }
  cyclictest-1756    [000]  3321.115941: funcgraph_entry:                   |  schedule() {

Re: ftrace performance (sched events): cyclictest shows 25% more latency

2018-07-06 Thread Steven Rostedt
On Fri, 6 Jul 2018 17:24:28 -0400
Steven Rostedt  wrote:

> I'll investigate further.

Note, I enabled function tracing to trace the scheduler function:

 # trace-cmd start -p function -l schedule

And then ran cyclictest. That does not cause any noticeable increase
in latency (try it). So there is some kind of side effect with the
trace event itself, and not the recording of the event.
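
For reference, that trace-cmd invocation is roughly equivalent to driving the function tracer directly through tracefs (a sketch, assuming the default mount point):

  # echo schedule > /sys/kernel/debug/tracing/set_ftrace_filter
  # echo function > /sys/kernel/debug/tracing/current_tracer

"trace-cmd start" only arms the tracer in the kernel ring buffer (unlike "trace-cmd record" it writes no trace.dat file), and "trace-cmd reset" restores the nop tracer afterwards.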

-- Steve


Re: ftrace performance (sched events): cyclictest shows 25% more latency

2018-07-06 Thread Steven Rostedt
On Fri, 6 Jul 2018 08:22:01 +0200
Claudio  wrote:

> Hello all,
> 
> I have been experimenting with the idea of leaving ftrace enabled, with sched 
> events,
> on production systems.
> 
> The main concern that I am having at the moment is about the impact on the 
> system.
> Enabling the sched events that I currently need for the tracing application
> seems to slow down context-switches considerably, and make the system less 
> responsive.
> 
> I have tested with cyclictest on the mainline kernel, and noticed an increase 
> of min, avg latencies of around 25%.
> 
> Is this expected?

No it is not. And this is weird. I'm seeing it too.

> 
> Some initial investigation into ftrace seems to point at the reservation and 
> commit of the events into the ring buffer
> as the highest sources of overhead, while event parameters copying, including 
> COMM, does not seem to have any noticeable effect
> relative to those costs.

Did you use perf to see where the issues were? If so what perf command
did you use?

When running just:
  perf record ./cyclictest --smp -p95 -m -s -N -l 100 -q

I get this:

   6.08%  cyclictest  [kernel.vmlinux][k] syscall_return_via_sysret
   4.03%  cyclictest  cyclictest  [.] timerthread
   3.97%  cyclictest  [kernel.vmlinux][k] __schedule
   3.81%  cyclictest  [kernel.vmlinux][k] native_write_msr
   3.13%  cyclictest  [kernel.vmlinux][k] trace_save_cmdline
   2.71%  cyclictest  [kernel.vmlinux][k] do_nanosleep
   2.48%  cyclictest  [kernel.vmlinux][k] do_syscall_64
   2.11%  cyclictest  [kernel.vmlinux][k] pick_next_task_fair
   2.03%  cyclictest  [kernel.vmlinux][k] native_sched_clock
   1.99%  cyclictest  [kernel.vmlinux][k] cpuacct_charge
   1.93%  cyclictest  [vdso]  [.] __vdso_clock_gettime
   1.86%  cyclictest  [kernel.vmlinux][k] hrtimer_nanosleep
   1.84%  cyclictest  [kernel.vmlinux][k] cpupri_set
   1.66%  cyclictest  [kernel.vmlinux][k] update_curr_rt
   1.65%  cyclictest  [kernel.vmlinux][k] __perf_event_task_sched_out
   1.59%  cyclictest  [kernel.vmlinux][k] dequeue_rt_stack
   1.55%  cyclictest  [kernel.vmlinux][k] __rb_reserve_next

Which shows there's a lot being done before we even get to the ring
buffer reserve.
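
A call-graph profile makes that ordering easier to see; a generic sketch, not the exact command used in this thread:

  perf record -g ./cyclictest --smp -p95 -m -s -N -l 100 -q
  perf report --stdio

With -g the recorded call chains show which trace_event_raw_event_* callers reach __rb_reserve_next and the other ring-buffer helpers.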

One thing that I notice too is that we have trace_save_cmdline that
also gets activated when enabling trace events. Perhaps I should try to
optimize that more.

> 
> I have run the following test 20 times, and thrown away the first results:
> 
> $ sudo ./cyclictest --smp -p95 -m -s -N -l 10 -q
> 
> $ uname -a
> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 
> 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> For brevity, this is a comparison of one test's results. All other test 
> results show the same ~25% increase.
> 
> On the left side, the run without ftrace sched events, on the right side with 
> ftrace sched events enabled.
> 
> CPU   Count   Min    Act    Avg    Max      Count   Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
> 0     10      2339   2936   2841   139478   10      2900        3182        3566        93056
> 1     66742   2365   3386   2874   93639    66750   2959        3786        3646        154074
> 2     50080   2376   3058   2910   196221   50097   2997        4209        3655        18707
> 3     40076   2394   3461   2931   17914    40091   3006        4417        3750        17159
> 4     33404   2371   3612   2834   15336    33419   2997        3836        3594        23172
> 5     28635   2387   3313   2885   25863    28649   2995        3795        3647        9956
> 6     25058   2384   3428   2968   12162    25071   3051        4366        3719        18151
> 7     22275   2381   2859   2982   10706    22287   3046        5078        3825        10781
> 
> I would be thankful for any advice or comments on this,
> especially with the goal in mind to lower as much as possible the runtime 
> impact on the system.

But still, we use all sorts of trace events when dealing with real
time, and trace events hardly show anything in the PREEMPT_RT kernel:
where cyclictest shows 20 microseconds, sched events hardly touch
that. But I see this weird side effect on a distro kernel too. Maybe
there's something weird going on when we have CONFIG_PREEMPT_VOLUNTARY
set :-/
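
For reference, the preemption model of a running kernel can be checked along these lines (the config file location varies by distro):

  $ grep 'CONFIG_PREEMPT' /boot/config-$(uname -r)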

I'll investigate further.

Thanks for the report.

-- Steve

