Re: [PATCH v8 5/8] tracepoint: Make rcuidle tracepoint callers use SRCU
- On May 31, 2018, at 1:51 PM, Joel Fernandes, Google j...@joelfernandes.org wrote: >> I find it odd to have a "return" in a macro that consists of a >> do { } while (0). I'm tempted to replace "return" by "break" here, >> to break the macro do/while (0) loop. > > "return;" is also used from "if (!(cond))" above so I prefer be consistent > and just use return than break as done above, but please let me know if you > still object. It's fine by me, Thanks! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v8 5/8] tracepoint: Make rcuidle tracepoint callers use SRCU
On Thu, May 31, 2018 at 02:50:41AM -0400, Mathieu Desnoyers wrote: > - On May 30, 2018, at 2:04 AM, Joel Fernandes joe...@google.com wrote: > > > From: "Joel Fernandes (Google)" > > > > In recent tests with IRQ on/off tracepoints, a large performance > > overhead ~10% is noticed when running hackbench. This is root caused to > > calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the > > tracepoint code. Following a long discussion on the list [1] about this, > > we concluded that srcu is a better alternative for use during rcu idle. > > Although it does involve extra barriers, its lighter than the sched-rcu > > version which has to do additional RCU calls to notify RCU idle about > > entry into RCU sections. > > > > In this patch, we change the underlying implementation of the > > trace_*_rcuidle API to use SRCU. This has shown to improve performance > > alot for the high frequency irq enable/disable tracepoints. > > > > Test: Tested idle and preempt/irq tracepoints. > > > > Here are some performance numbers: > > > > With a run of the following 30 times on a single core x86 Qemu instance > > with 1GB memory: > > hackbench -g 4 -f 2 -l 3000 > > > > Completion times in seconds. CONFIG_PROVE_LOCKING=y. > > > > No patches (without this series) > > Mean: 3.048 > > Median: 3.025 > > Std Dev: 0.064 > > > > With Lockdep using irq tracepoints with RCU implementation: > > Mean: 3.451 (-11.66 %) > > Median: 3.447 (-12.22%) > > Std Dev: 0.049 > > > > With Lockdep using irq tracepoints with SRCU implementation (this series): > > Mean: 3.020 (I would consider the improvement against the "without > >this series" case as just noise). > > Median: 3.013 > > Std Dev: 0.033 > > > > [1] https://patchwork.kernel.org/patch/10344297/ > > > > Signed-off-by: Joel Fernandes (Google) > > --- > > include/linux/tracepoint.h | 48 +++--- > > kernel/tracepoint.c| 15 +++- > > 2 files changed, 54 insertions(+), 9 deletions(-) > > > > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h > > index c94f466d57ef..880794207921 100644 > > --- a/include/linux/tracepoint.h > > +++ b/include/linux/tracepoint.h > > @@ -15,6 +15,7 @@ > > */ > > > > #include > > +#include > > #include > > #include > > #include > > @@ -33,6 +34,8 @@ struct trace_eval_map { > > > > #define TRACEPOINT_DEFAULT_PRIO 10 > > > > +extern struct srcu_struct tracepoint_srcu; > > + > > extern int > > tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data); > > extern int > > @@ -75,10 +78,15 @@ int unregister_tracepoint_module_notifier(struct > > notifier_block *nb) > > * probe unregistration and the end of module exit to make sure there is no > > * caller executing a probe when it is freed. > > */ > > +#ifdef CONFIG_TRACEPOINTS > > static inline void tracepoint_synchronize_unregister(void) > > { > > + synchronize_srcu(&tracepoint_srcu); > > synchronize_sched(); > > } > > +#else > > +static inline void tracepoint_synchronize_unregister(void) { } > > In order to add some consistency to the style in this header file, > this empty function should look like: > > static inline void tracepoint_synchronize_unregister(void) > { } > > (with newline before "{") Ok, I changed it to the style you're proposing. > > +#endif > > > > #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS > > extern int syscall_regfunc(void); > > @@ -129,18 +137,38 @@ extern void syscall_unregfunc(void); > > * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just > > * "void *data", where as the DECLARE_TRACE() will pass in "void *data, > > proto". > > */ > > -#define __DO_TRACE(tp, proto, args, cond, rcucheck) > > \ > > +#define __DO_TRACE(tp, proto, args, cond, rcuidle) \ > > do {\ > > struct tracepoint_func *it_func_ptr;\ > > void *it_func; \ > > void *__data; \ > > + int __maybe_unused idx = 0; \ > > \ > > if (!(cond))\ > > return; \ > > - if (rcucheck) \ > > - rcu_irq_enter_irqson(); \ > > - rcu_read_lock_sched_notrace(); \ > > - it_func_ptr = rcu_dereference_sched((tp)->funcs); \ > > + \ > > + /* \ > > +* For rcuidle callers, use srcu since sched-rcu\ > > +* doesn't work from the idle path.
Re: [PATCH v8 5/8] tracepoint: Make rcuidle tracepoint callers use SRCU
- On May 30, 2018, at 2:04 AM, Joel Fernandes joe...@google.com wrote: > From: "Joel Fernandes (Google)" > > In recent tests with IRQ on/off tracepoints, a large performance > overhead ~10% is noticed when running hackbench. This is root caused to > calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the > tracepoint code. Following a long discussion on the list [1] about this, > we concluded that srcu is a better alternative for use during rcu idle. > Although it does involve extra barriers, its lighter than the sched-rcu > version which has to do additional RCU calls to notify RCU idle about > entry into RCU sections. > > In this patch, we change the underlying implementation of the > trace_*_rcuidle API to use SRCU. This has shown to improve performance > alot for the high frequency irq enable/disable tracepoints. > > Test: Tested idle and preempt/irq tracepoints. > > Here are some performance numbers: > > With a run of the following 30 times on a single core x86 Qemu instance > with 1GB memory: > hackbench -g 4 -f 2 -l 3000 > > Completion times in seconds. CONFIG_PROVE_LOCKING=y. > > No patches (without this series) > Mean: 3.048 > Median: 3.025 > Std Dev: 0.064 > > With Lockdep using irq tracepoints with RCU implementation: > Mean: 3.451 (-11.66 %) > Median: 3.447 (-12.22%) > Std Dev: 0.049 > > With Lockdep using irq tracepoints with SRCU implementation (this series): > Mean: 3.020 (I would consider the improvement against the "without > this series" case as just noise). > Median: 3.013 > Std Dev: 0.033 > > [1] https://patchwork.kernel.org/patch/10344297/ > > Signed-off-by: Joel Fernandes (Google) > --- > include/linux/tracepoint.h | 48 +++--- > kernel/tracepoint.c| 15 +++- > 2 files changed, 54 insertions(+), 9 deletions(-) > > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h > index c94f466d57ef..880794207921 100644 > --- a/include/linux/tracepoint.h > +++ b/include/linux/tracepoint.h > @@ -15,6 +15,7 @@ > */ > > #include > +#include > #include > #include > #include > @@ -33,6 +34,8 @@ struct trace_eval_map { > > #define TRACEPOINT_DEFAULT_PRIO 10 > > +extern struct srcu_struct tracepoint_srcu; > + > extern int > tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data); > extern int > @@ -75,10 +78,15 @@ int unregister_tracepoint_module_notifier(struct > notifier_block *nb) > * probe unregistration and the end of module exit to make sure there is no > * caller executing a probe when it is freed. > */ > +#ifdef CONFIG_TRACEPOINTS > static inline void tracepoint_synchronize_unregister(void) > { > + synchronize_srcu(&tracepoint_srcu); > synchronize_sched(); > } > +#else > +static inline void tracepoint_synchronize_unregister(void) { } In order to add some consistency to the style in this header file, this empty function should look like: static inline void tracepoint_synchronize_unregister(void) { } (with newline before "{") > +#endif > > #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS > extern int syscall_regfunc(void); > @@ -129,18 +137,38 @@ extern void syscall_unregfunc(void); > * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just > * "void *data", where as the DECLARE_TRACE() will pass in "void *data, > proto". > */ > -#define __DO_TRACE(tp, proto, args, cond, rcucheck) \ > +#define __DO_TRACE(tp, proto, args, cond, rcuidle) \ > do {\ > struct tracepoint_func *it_func_ptr;\ > void *it_func; \ > void *__data; \ > + int __maybe_unused idx = 0; \ > \ > if (!(cond))\ > return; \ > - if (rcucheck) \ > - rcu_irq_enter_irqson(); \ > - rcu_read_lock_sched_notrace(); \ > - it_func_ptr = rcu_dereference_sched((tp)->funcs); \ > + \ > + /* \ > + * For rcuidle callers, use srcu since sched-rcu\ > + * doesn't work from the idle path. \ > + */ \ > + if (rcuidle) { \ > + if (in_nmi()) { \ > + WARN_ON_ONCE(1);\ > +
[PATCH v8 5/8] tracepoint: Make rcuidle tracepoint callers use SRCU
From: "Joel Fernandes (Google)" In recent tests with IRQ on/off tracepoints, a large performance overhead ~10% is noticed when running hackbench. This is root caused to calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the tracepoint code. Following a long discussion on the list [1] about this, we concluded that srcu is a better alternative for use during rcu idle. Although it does involve extra barriers, its lighter than the sched-rcu version which has to do additional RCU calls to notify RCU idle about entry into RCU sections. In this patch, we change the underlying implementation of the trace_*_rcuidle API to use SRCU. This has shown to improve performance alot for the high frequency irq enable/disable tracepoints. Test: Tested idle and preempt/irq tracepoints. Here are some performance numbers: With a run of the following 30 times on a single core x86 Qemu instance with 1GB memory: hackbench -g 4 -f 2 -l 3000 Completion times in seconds. CONFIG_PROVE_LOCKING=y. No patches (without this series) Mean: 3.048 Median: 3.025 Std Dev: 0.064 With Lockdep using irq tracepoints with RCU implementation: Mean: 3.451 (-11.66 %) Median: 3.447 (-12.22%) Std Dev: 0.049 With Lockdep using irq tracepoints with SRCU implementation (this series): Mean: 3.020 (I would consider the improvement against the "without this series" case as just noise). Median: 3.013 Std Dev: 0.033 [1] https://patchwork.kernel.org/patch/10344297/ Signed-off-by: Joel Fernandes (Google) --- include/linux/tracepoint.h | 48 +++--- kernel/tracepoint.c| 15 +++- 2 files changed, 54 insertions(+), 9 deletions(-) diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index c94f466d57ef..880794207921 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -15,6 +15,7 @@ */ #include +#include #include #include #include @@ -33,6 +34,8 @@ struct trace_eval_map { #define TRACEPOINT_DEFAULT_PRIO10 +extern struct srcu_struct tracepoint_srcu; + extern int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data); extern int @@ -75,10 +78,15 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb) * probe unregistration and the end of module exit to make sure there is no * caller executing a probe when it is freed. */ +#ifdef CONFIG_TRACEPOINTS static inline void tracepoint_synchronize_unregister(void) { + synchronize_srcu(&tracepoint_srcu); synchronize_sched(); } +#else +static inline void tracepoint_synchronize_unregister(void) { } +#endif #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS extern int syscall_regfunc(void); @@ -129,18 +137,38 @@ extern void syscall_unregfunc(void); * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just * "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto". */ -#define __DO_TRACE(tp, proto, args, cond, rcucheck)\ +#define __DO_TRACE(tp, proto, args, cond, rcuidle) \ do {\ struct tracepoint_func *it_func_ptr;\ void *it_func; \ void *__data; \ + int __maybe_unused idx = 0; \ \ if (!(cond))\ return; \ - if (rcucheck) \ - rcu_irq_enter_irqson(); \ - rcu_read_lock_sched_notrace(); \ - it_func_ptr = rcu_dereference_sched((tp)->funcs); \ + \ + /* \ +* For rcuidle callers, use srcu since sched-rcu\ +* doesn't work from the idle path. \ +*/ \ + if (rcuidle) { \ + if (in_nmi()) { \ + WARN_ON_ONCE(1);\ + return; /* no srcu from nmi */ \ + } \ + \ + idx = srcu_read_lock_notrace(&tracepoint_srcu); \ + it_func_ptr = \ + srcu_dereference_notrace((tp)->funcs,